High CPU or memory usage issues

Dec 7, 2021

Logs investigation

  1. Identify which service is using the most CPU or memory. Go through the nginx access logs to identify the requests that take the longest. Sample log entry:
     status=200; req_time=346924; rdbms_time=267; rdbms_count=6; authz_time=91; authz_count=3; depsolver_time=146; depsolver_count=1
  2. Find the different response types in the access logs using the command below.
     awk '{print $9}' /var/log/opscode/nginx/access.log | sort | uniq -c | sort -rn
     Sample output:
     424886 200
     106221 404
          2 499
  3. Count the requests per second over the life of the log:
     cat access.log | awk '{print $4}' | uniq -c
     Sample output:
    280 rps
  4. For example, suppose the depsolver accounts for most of the response time.
  5. To find which function is taking the time, first use redbug in the erchef console to capture a call to the suspect function, so its arguments can be replayed under fprof:
     redbug:start("chef_wm_depsolver:make_json_list", [{print_file, "/tmp/redbug.out"}, {file_size, 150}, {msgs,1}]).
  6. This captures one execution of make_json_list and prints the function args to a file. Next, I had to edit the file (redbug.out) to make it a valid Erlang term: removing the function call and leaving behind just the argument I cared about, the single long list of cookbook versions. After that:
    {ok, [Content|_]} = file:consult("/tmp/redbug.out").
    % Run fprof to profile the function in question. We'll use the argument data we just captured in redbug as the input.
    fprof:apply(chef_wm_depsolver, make_json_list, [Content, "https://[2600:1f1c:f24:ad01:b300:cfe6:5f15:b905]", 1], 
     [{file, "/tmp/fprof.trace"}]).
  7. This handy little escript converts the trace to callgrind format:
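
Separately from the profiling steps, the log checks in steps 1 to 3 can also be scripted. The following is a minimal Python sketch, not part of the original procedure. It assumes log lines carry `key=value;` timing fields like the sample entry in step 1, and that access log lines are whitespace-separated with the timestamp in field 4 and the status in field 9, as the awk commands imply. All function names are hypothetical.

```python
import re
from collections import Counter

# Matches numeric key=value pairs such as "rdbms_time=267".
STAT_RE = re.compile(r"(\w+)=(\d+)")


def parse_stats(line):
    """Return the numeric key=value fields of one log line as a dict."""
    return {key: int(value) for key, value in STAT_RE.findall(line)}


def dominant_component(stats):
    """Name the *_time field (other than req_time) with the largest value."""
    parts = {k: v for k, v in stats.items()
             if k.endswith("_time") and k != "req_time"}
    return max(parts, key=parts.get) if parts else None


def status_counts(lines):
    """Count responses per status code (field 9, as in awk '{print $9}')."""
    return Counter(f[8] for f in map(str.split, lines) if len(f) > 8)


def busiest_second(lines):
    """Return (timestamp, count) for the second with the most requests."""
    per_second = Counter(f[3] for f in map(str.split, lines) if len(f) > 3)
    return per_second.most_common(1)[0] if per_second else None


# Usage with the sample entry from step 1.
sample = ("status=200; req_time=346924; rdbms_time=267; rdbms_count=6; "
          "authz_time=91; authz_count=3; depsolver_time=146; depsolver_count=1")
print(dominant_component(parse_stats(sample)))  # rdbms_time
```

Unlike `uniq -c` without a preceding `sort`, the counters here group all matching lines rather than only adjacent ones. For a chronologically ordered log the per-second counts should be the same.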

Load testing setup

Chef Server

  1. Set up chef-server and 4 load servers in the AWS console using the AMIs below.
    Load generator backup
  2. Upgrade chef-server to specific version by following
  3. Create users and an organization on the chef-server using the commands below.
chef-server-ctl org-create test1 test1 > test1_validator.pem
chef-server-ctl user-create testuser1 test test EMAIL password > /home/ubuntu/testuser1.pem
chef-server-ctl org-user-add -a test1 testuser1
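
When more than one test user is needed, the three commands above can be generated rather than typed by hand. This is a small sketch under assumptions: the helper name, the `example.com` email addresses and the `key_dir` default are hypothetical, and the first/last name `test test` and the password are copied from the commands above. It only builds command strings; it does not run them.

```python
def user_setup_commands(org, users, key_dir="/home/ubuntu"):
    """Build chef-server-ctl commands for one org and a list of users.

    Each user is a (username, email, password) tuple. The commands mirror
    the ones shown above; nothing is executed here.
    """
    commands = [f"chef-server-ctl org-create {org} {org} > {org}_validator.pem"]
    for username, email, password in users:
        commands.append(
            f"chef-server-ctl user-create {username} test test {email} {password}"
            f" > {key_dir}/{username}.pem"
        )
        commands.append(f"chef-server-ctl org-user-add -a {org} {username}")
    return commands


# Hypothetical users; replace the emails and passwords with real values.
users = [
    ("testuser1", "testuser1@example.com", "password"),
    ("testuser2", "testuser2@example.com", "password"),
]
for command in user_setup_commands("test1", users):
    print(command)
```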

Chef Load

  1. Use specific branch of chef-load (
  2. Copy all the user/client keys from chef-server to chef-load for generating load.
Copy the .pem files to your local machine and then to the load servers.
scp -i ~/.ssh/aws-shared-chef-infra-server.pem ubuntu@*.pem .

scp -i ~/.ssh/aws-shared-chef-infra-server.pem *.pem ubuntu@
scp -i ~/.ssh/aws-shared-chef-infra-server.pem *.pem ubuntu@
scp -i ~/.ssh/aws-shared-chef-infra-server.pem *.pem ubuntu@
scp -i ~/.ssh/aws-shared-chef-infra-server.pem *.pem ubuntu@
  3. Update chef-load.toml with the Chef Server and other details.
log_file = "chef-load.log"
chef_server_url = "https://[2600:1f1c:f24:ad01:b300:cfe6:5f15:b905]/organizations/test1/"
client_key = "./testuser1.pem"
client_name = "testuser1"
ohai_json_file = "node.json"
chef_environment = "_default"

# assume four chef-load instances for a total of 7000 nodes converging every 15 minutes.

# override on CLI with -n
num_nodes = 1750

# override on CLI with -i 
interval = 15

# override on CLI with -a
num_actions = 0 # For data collector, which is disabled. 

# override on CLI with -p
node_name_prefix = "load4"

# Fraction (0.0-1.0) of all CCR runs in which a node/client is replaced after the initial ramp-up of all nodes & clients,
# causing a new node/client to be created on the server.
# override on CLI with -R
node_replacement_rate = 0

# Each node's run list is chosen randomly from this list at the time of the simulated run.
# In this case, we repeated twshared_tier to weight for higher frequency of that run list.
# Future iterations might allow you to specify the weighting directly.

# Feel free to add to these to simulate different node types. I'll be updating
# ours shortly to incorporate the new roles/cookbooks recently provided.
run_lists = [
  [ "role[fb_base]", "role[fb_middleware]", "role[chef_tier]" ],
  [ "role[fb_base]", "role[fb_middleware]", "role[rsw_tier]" ],
  [ "role[fb_base]", "role[fb_middleware]", "role[rtsw_tier]" ],
  [ "role[fb_base]", "role[biz_tier]" ],
  [ "role[fb_base]", "role[fboss_tier]" ],
  [ "role[fb_base]", "role[sparefullweb_tier]"],
  [ "role[fb_base]", "role[perforce_tier]"  ],
  [ "role[fb_base]", "role[udb_tier]"  ],
  [ "role[fb_base]", "role[eb_tier]" ],
  [ "role[fb_base]", "role[dns_tier]" ],
  [ "role[fb_base]", "role[orderdb_tier]" ],
]

# never, first, always
download_cookbooks = "always"

# On average download about 1% of the cookbooks resolved from the runlist
# simulating the usual case where only some cookbooks are updated so a
# client seldom needs to download all of them. If download_cookbooks == "first"
# then this is ignored and all cookbooks are downloaded.
# Override on CLI with -C 
download_cookbooks_scale_factor = 0.01

# Sleep this long (seconds) during the client run after retrieving cookbooks and before
# saving the node, simulating the time client converge activity would take.
sleep_duration = 0

# Save node for 80% of runs, based on initial rough parsing of healthy logs 
node_save_frequency = 0.8

# api_get_requests is an optional list of API GET requests as URLs that are made during the chef-client run.
# eg "search/node?q=*%253A*&sort=X_CHEF_id_CHEF_X%20asc&start=0"
api_get_requests = [ ]

# chef_version sets the value of the X-Chef-Version HTTP header in API requests sent to the Chef Server.
# It has no effect on the behavior of the run.
chef_version = "13.2.20"

# use client-side key creation instead of server side, which is the default 
# since (I think) 12+
chef_server_creates_client_key = false

# Send data to the Chef server's Reporting service
enable_reporting = false

# Generate Random Data. Not used outside of data sent to data collector
random_data = true

# Generate Liveness Agent Data
liveness_agent = false

  4. Start the load using the command below. For more information, please read the chef-load readme file.
     ./chef-load -c chef-load.toml -i 1 -a 0 -n 10 -p load1a -R .01 start
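
The config comments assume four chef-load instances of 1750 nodes each, converging every 15 minutes. The sketch below works through that arithmetic and shows how repeating an entry in run_lists shifts its share of runs. It assumes that interval is in minutes and that each run picks a run list uniformly at random, which is what the config comments suggest. The function names are hypothetical, and the rates are simulated client runs per second, not API requests per second.

```python
from collections import Counter


def expected_load(instances, num_nodes, interval_minutes, node_save_frequency):
    """Estimate steady-state load for several identical chef-load instances."""
    total_nodes = instances * num_nodes
    runs_per_second = total_nodes / (interval_minutes * 60)
    return {
        "total_nodes": total_nodes,
        "runs_per_second": runs_per_second,
        "node_saves_per_second": runs_per_second * node_save_frequency,
    }


def run_list_shares(run_lists):
    """Share of runs per distinct run list, assuming uniform random choice."""
    counts = Counter(tuple(run_list) for run_list in run_lists)
    return {run_list: n / len(run_lists) for run_list, n in counts.items()}


# Values taken from chef-load.toml above.
load = expected_load(instances=4, num_nodes=1750, interval_minutes=15,
                     node_save_frequency=0.8)
print(load["total_nodes"])                # 7000
print(round(load["runs_per_second"], 2))  # 7.78

# Listing a run list twice doubles its share relative to a single entry.
shares = run_list_shares([
    ["role[fb_base]", "role[biz_tier]"],
    ["role[fb_base]", "role[biz_tier]"],
    ["role[fb_base]", "role[dns_tier]"],
])
```

The -n and -i flags on the command line override num_nodes and interval, so the same function can be used with the values actually passed, for example 10 nodes at a 1 minute interval.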

Chef Server benchmark details

Setup Datadog for realtime metrics