# Memory profiler

The memory profiler can be enabled and managed at runtime through HTTP endpoints to track down memory leaks in Jobs. Once the profiler is turned off, the report is saved and can be downloaded from one of the Job's endpoints. This removes the need to involve the Kubernetes operator in the process: all steps can be done by the developer of a Job. The Job can be instructed to turn the profiler on and off, and to serve the reports in various forms.

## Setup

First, to make the profiler available, you need to set this environment variable:

```sh
MEMRAY_PROFILER=true
```

If you're running it locally, you can call:

```sh
export MEMRAY_PROFILER=true
```

However, if you're deploying the job to Racetrack, here's how to set this env var in a manifest:

```yaml
name: python-job-profiled
...
runtime_env:
  MEMRAY_PROFILER: true
```

This makes the Job start the memray profiler at startup; make sure to turn it off afterwards. It also brings up new endpoints to manage the profiler.

If you're running the job in Docker, grant write permissions on the job's main directory by adding this to your Dockerfile:

```dockerfile
RUN chmod -R a+rw /src/job/
```
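For context, here is a minimal sketch of where that line could sit in a job image. The base image and the `COPY` source are illustrative assumptions; only the `/src/job/` path comes from the snippet above:

```dockerfile
# Illustrative sketch only -- the base image and COPY source are assumptions,
# not taken from the Racetrack documentation.
FROM python:3.11-slim
# Place the job sources in the job's main directory.
COPY . /src/job/
# memray needs to write its report file here, so make the directory writable.
RUN chmod -R a+rw /src/job/
```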

## Enabling Memory Leaks View

If you want to turn on the Memory Leaks View in the flamegraph, you can do it by setting this environment flag:

```sh
MEMRAY_LEAKS=true
PYTHONMALLOC=malloc
```

or

```yaml
runtime_env:
  MEMRAY_LEAKS: true
  PYTHONMALLOC: malloc
```
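Putting it together, a manifest that enables both the profiler and the Memory Leaks View could combine the flags from this section and the Setup section (the job name below is just the placeholder used earlier):

```yaml
name: python-job-profiled
...
runtime_env:
  MEMRAY_PROFILER: true
  MEMRAY_LEAKS: true
  PYTHONMALLOC: malloc
```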

## Endpoints

Check out the SwaggerUI at the main page of a Job for more details and for a convenient way to call these endpoints. (All endpoints may need to be prefixed with /pub/job/JOB_NAME/JOB_VERSION/.)

Once you have downloaded the report file, you can run any further analysis on it. See what else you can do with the memray report.
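For instance, the profiler endpoints used in the example below can be called like this (replace JOB_NAME and JOB_VERSION with your Job's name and version):

```sh
# Stop the profiler session and save the report file
curl -X 'POST' 'http://0.0.0.0:7000/pub/job/JOB_NAME/JOB_VERSION/api/v1/profiler/memray/stop'
# Fetch the Flame Graph report (also viewable directly in a browser)
curl 'http://0.0.0.0:7000/pub/job/JOB_NAME/JOB_VERSION/api/v1/profiler/memray/flamegraph'
# Download the binary report file
curl 'http://0.0.0.0:7000/pub/job/JOB_NAME/JOB_VERSION/api/v1/profiler/memray/report' --output memray-report.bin
```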

## Example

The sample job sample/memory-leak/job.py contains a memory leak on purpose. Let's track it down with the memory profiler.
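To give an idea of what such a leak looks like, here is a hypothetical sketch of a leaking job entrypoint. It is not the actual contents of sample/memory-leak/job.py, and the class name and perform() signature are assumptions:

```python
# Hypothetical sketch of a leaking job -- NOT the actual sample/memory-leak/job.py.
leaked_chunks = []  # grows on every call and is never released


class JobEntrypoint:
    def perform(self) -> int:
        # Allocate a few megabytes and keep a reference to them forever,
        # so each call to the perform endpoint increases memory usage.
        leaked_chunks.append(bytearray(5 * 1024 * 1024))
        return len(leaked_chunks)
```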

Start the job with the profiler enabled:

```sh
MEMRAY_PROFILER=true racetrack_job_runner run sample/memory-leak/job.py
```

Call the perform endpoint a couple of times:

```sh
curl -X 'POST' \
  'http://0.0.0.0:7000/pub/job/JOB_NAME/JOB_VERSION/api/v1/perform' \
  -H 'Content-Type: application/json' \
  -d '{}'
```
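To grow the leak enough to stand out in the report, you can repeat the call in a small loop, for example:

```sh
# Call the perform endpoint 10 times so the leak becomes visible in the report
for i in $(seq 1 10); do
  curl -X 'POST' \
    'http://0.0.0.0:7000/pub/job/JOB_NAME/JOB_VERSION/api/v1/perform' \
    -H 'Content-Type: application/json' \
    -d '{}'
done
```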

Now, stop the profiler session so the Job saves the report file.

```sh
curl -X 'POST' 'http://0.0.0.0:7000/pub/job/JOB_NAME/JOB_VERSION/api/v1/profiler/memray/stop'
```

You can view the Flame Graph report directly in a browser: http://0.0.0.0:7000/pub/job/JOB_NAME/JOB_VERSION/api/v1/profiler/memray/flamegraph and locate the memory leak there.

Or you can download the binary report file from the Job:

```sh
curl http://0.0.0.0:7000/pub/job/JOB_NAME/JOB_VERSION/api/v1/profiler/memray/report --output memray-report.bin
```

and analyze it with various memray reporters:

```sh
pip install memray
memray tree memray-report.bin
# or
memray stats memray-report.bin
```

which produces output like this:

```
🥇 Top 5 largest allocating locations (by size):
	- perform:.../sample/memory-leak/job.py:7 -> 22.888MB
	- get_data:<frozen importlib._bootstrap_external>:1187 -> 6.830MB
```

which correctly points us to the culprit at line 7 of ./sample/memory-leak/job.py.
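You can also turn the downloaded binary report into a standalone HTML flame graph with memray's flamegraph reporter (assuming memray is installed as above; the output file name is up to you):

```sh
# Generate a self-contained HTML flame graph from the downloaded report
memray flamegraph --output flamegraph.html memray-report.bin
```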