
set cpu affinity and membind for better oob performance #853

Merged: 19 commits, Aug 27, 2024

Conversation

@rbrugaro (Contributor)

Sets the number of threads and CPU affinity to the number of physical CPUs in a socket, and performs memory binding to a single NUMA node, for better out-of-box (OOB) performance.

@rbrugaro (Contributor, Author)

@sywangyi @jiqing-feng please review

@sywangyi (Collaborator) commented Jul 30, 2024

Hi, generally I think such logic should not be put in model code; it should be in framework code like transformers, accelerate, deepspeed... Model code should also be adapted to variant usage like autoTP for inference: if the IPEX model is running with TP=2, this logic is wrong in that case. @yao-matrix

@@ -153,6 +167,18 @@ def __init__(
else:
    self._device = torch.device("cpu")

import numa
numa.set_membind([0])
print("affinity", numa.get_affinity(0))
print("membind", numa.get_membind())

(Collaborator): Should add a check that the numa package is available.

@sywangyi (Collaborator) commented Jul 30, 2024

Also, if OMP_NUM_THREADS or membind is already set by the external user, we should not override the user's configuration.

The Tensor Parallel case should be considered as well.

@jiqing-feng (Collaborator) commented Aug 6, 2024

Quoting @sywangyi: "Hi, generally I think such logic should not be put in model code; it should be in framework code like transformers, accelerate, deepspeed... Model code should also be adapted to variant usage like autoTP for inference: if the IPEX model is running with TP=2, this logic is wrong in that case. @yao-matrix"

Hi @rbrugaro. The issue still exists. I think you should check the world size (rank number) first to know how many processes you have, and then you can choose the best binding strategy for each rank.

@jiqing-feng (Collaborator)

We plan to integrate the best core-binding strategy in the IPEX backend, so other users can call this function to get the best performance of IPEXModel on CPU. cc @echarlaix

@rbrugaro (Contributor, Author) commented Aug 7, 2024

@jiqing-feng @sywangyi I updated the code to read world_size and rank from environment variables and to set good defaults only if they are unset by the user (see the sketch below).
50% OOB improvement in generation time on 4th Gen Intel(R) Xeon(R) for 1 process (gpt2)
27% OOB improvement in generation time on 4th Gen Intel(R) Xeon(R) for 2 processes (gpt2)
Note: no OOB improvement was seen when the 2 processes were launched with the accelerate launcher.
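A rough sketch of that "only set defaults if unset" check (the function name and the way the local rank count is passed in are illustrative, not the PR's exact code):

import os

import psutil
import torch


def set_default_threads(local_size: int) -> None:
    # Respect an existing user setting; otherwise split the physical cores across the local ranks.
    if "OMP_NUM_THREADS" in os.environ:
        print("OMP_NUM_THREADS already set to ", os.environ["OMP_NUM_THREADS"])
        return
    threads = max(1, psutil.cpu_count(logical=False) // local_size)
    os.environ["OMP_NUM_THREADS"] = str(threads)
    torch.set_num_threads(threads)
    print("setting OMP_NUM_THREADS to", threads)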

1 process:

setting OMP_NUM_THREADS to 56
affinity={0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55}, membind = {0}


2 process:

setting OMP_NUM_THREADS to 56
setting OMP_NUM_THREADS to 56
affinity={56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111}, membind = {1}
affinity={0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55}, membind = {0}

Screenshot 2024-08-07 at 11 33 31 AM

Note: if multiple processes are launched with accelerate launch text_gen2.py, the accelerate launcher sets the default number of threads and affinity to the number of logical cores, and I didn't see any improvement. Maybe the accelerate OOB defaults need to be updated?

$ accelerate launch text-gen2.py 
2024-08-07 18:29:35,280 - launch.py - accelerate.commands.launch - WARNING - The following values were not passed to `accelerate launch` and had defaults used instead:
	`--num_cpu_threads_per_process` was set to `112` to improve out-of-box performance when training on CPUs
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
My guessed rank = 1
My guessed rank = 0
OMP_NUM_THREADS already set to  112
OMP_NUM_THREADS already set to  112
affinity={56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 215, 216, 217, 218, 219, 220, 221, 222, 223}, membind = {1}
affinity={0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167}, membind = {0}


@jiqing-feng (Collaborator)

Usually we only bind the physical cores, which are 0-55 and 56-111 in your case. The memory bind is also required. It's weird that accelerate launch has no speed-up compared to the default Python command. Do you have any clue about it? cc @sywangyi
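For context, the physical/logical split referred to here can be checked with psutil (the counts in the comments are the ones from the logs above):

import psutil

print("physical cores:", psutil.cpu_count(logical=False))  # 112 on the machine above (2 sockets x 56 cores)
print("logical cpus:", psutil.cpu_count(logical=True))     # 224 with hyper-threading enabled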


@rbrugaro (Contributor, Author)

@sywangyi @jiqing-feng It turned out to be a bug in accelerate that was setting the wrong num_cpu_threads_per_process when the script was launched with accelerate launch myscript.py, because it was reading an unset env variable. It was working OK when launched with mpirun -f hostfile -n 4 -ppn 2 accelerate launch myscript.py.
I have opened accelerate PR#3009 for it.
Now I can see a 15% speed-up from the numa bindings.

@rbrugaro marked this pull request as ready for review on August 13, 2024, 16:14
@rbrugaro (Contributor, Author)

@sywangyi @jiqing-feng @echarlaix I have addressed the comments and accelerate PR#3009 has been merged. Please review.

@jiqing-feng (Collaborator)

Hi @IlyasMoutawwakil. Could you merge this PR? Thx!

@rbrugaro (Contributor, Author) commented Aug 20, 2024

@sywangyi @jiqing-feng I pushed one commit that had remained local. The results above correspond to this; I double-checked.

Example: 2 nodes, 4 processes
rank_per_node=2
rank_id=0
node_id = int((rank_id+1) / rank_per_node) = 0

rank_id=1
node_id = int((rank_id+1) / rank_per_node) = 1

numa.set_membind([node_id])

@sywangyi (Collaborator) commented Aug 20, 2024

In the previous logic, with 2 nodes and 4 processes on one device:
rank_id 0, 1 -> node_id 0
rank_id 2, 3 -> node_id 1

In your modified logic, with 2 nodes and 4 processes on one device:
rank 0 -> node 0
rank 1 -> node 1
rank 2 -> node 1
rank 3 -> node 2?

@rbrugaro (Contributor, Author)

You are correct @sywangyi, I had missed it because I was testing with 2 machines and 2 processes per node.
I've changed it so the numa node_id is calculated from local_size, not the world size across machines. I have tested with 2 nodes and num_processes 2, 4, and 8, and the numa affinity and binding work as expected.
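A sketch of the corrected mapping, computing the NUMA node from the local rank and local size on one machine (the names and the even-distribution assumption are illustrative, not necessarily the merged code):

import math


def numa_node_for_rank(local_rank: int, local_size: int, num_numa_nodes: int) -> int:
    # Distribute the local ranks evenly over the machine's NUMA nodes.
    ranks_per_node = math.ceil(local_size / num_numa_nodes)
    return local_rank // ranks_per_node


# 2 NUMA nodes, 4 local ranks: ranks 0, 1 -> node 0 and ranks 2, 3 -> node 1
for r in range(4):
    print(r, numa_node_for_rank(r, local_size=4, num_numa_nodes=2))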

Comment on lines 149 to 150

import importlib.util
import platform

(Member): Why not at the top?

@rbrugaro (Contributor, Author): Moved.

Comment on lines 155 to 158

import math

import numa
import psutil

(Member): math could be at the top here; numa is checked but not psutil. I think we should continue to use methods like the ones in def is_ipex_available():

@rbrugaro (Contributor, Author): Added is_numa_available to import_utils.py.
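A minimal sketch of what such a helper could look like, mirroring is_ipex_available; the actual code added to import_utils.py may differ:

import importlib.util


def is_numa_available() -> bool:
    # The 'numa' package is considered available if the import system can find it.
    return importlib.util.find_spec("numa") is not None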

@IlyasMoutawwakil (Member) left a comment

Thanks for this utility! It's really cool!
One question: does this remove the need for numactl <numa-args> python script.py <args> or python -m torch.backends.xeon.run_cpu <xeon-args> script.py <args> entirely?

@rbrugaro (Contributor, Author)

@IlyasMoutawwakil thanks for the comments, please check if I have addressed them properly.

This utility assigns a default system configuration that works well for the most common scenarios. However, it may not be optimal for every situation or machine configuration, as performance can vary depending on the model, the task, and other concurrent processes on the host.
Users who need more granular control will continue to use numactl <numa-args> python script.py <args> or python -m torch.backends.xeon.run_cpu <xeon-args> script.py <args> (an example command is shown below).
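For instance, binding both the CPUs and the memory of NUMA node 0 (an illustrative command, not from this PR):

$ numactl --cpunodebind=0 --membind=0 python script.py args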

What we find is that some users do not use numactl or configure the environment, and they miss out on performance. The utility is designed so that any environment/system configuration set by the user takes precedence; if the user does not specify the settings, the utility applies a default configuration that improves OOB performance.

Specifically, the utility sets OMP_NUM_THREADS and the torch thread count to an equal split of the available physical cores, divided across the ranks running on the machine, and performs NUMA affinity and memory binding to avoid the overhead of accessing memory on remote NUMA nodes. A condensed sketch of that behavior follows.
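This is not the merged implementation: the function name, the contiguous physical-core numbering, and the node computation are assumptions, and it uses os.sched_setaffinity from the standard library for the CPU-affinity step while reusing the numa.set_membind/get_affinity/get_membind calls shown earlier in this thread.

import os

import numa
import psutil
import torch


def bind_rank_to_numa_node(local_rank: int, local_size: int, num_numa_nodes: int) -> None:
    # Split the machine's physical cores evenly across the local ranks
    # (assumes physical cores are numbered 0..N-1 contiguously, as on the Xeon above).
    physical_cores = psutil.cpu_count(logical=False)
    cores_per_rank = max(1, physical_cores // local_size)
    first_core = local_rank * cores_per_rank
    cpu_set = set(range(first_core, first_core + cores_per_rank))

    # User settings take precedence: only set the thread count if it is unset.
    if "OMP_NUM_THREADS" not in os.environ:
        os.environ["OMP_NUM_THREADS"] = str(cores_per_rank)
        torch.set_num_threads(cores_per_rank)

    # Pin this process to its core slice (standard library) and bind its memory
    # to the matching NUMA node (numa calls as shown earlier in this thread).
    os.sched_setaffinity(0, cpu_set)
    ranks_per_node = max(1, local_size // num_numa_nodes)
    node_id = min(local_rank // ranks_per_node, num_numa_nodes - 1)
    numa.set_membind([node_id])

    print("affinity", numa.get_affinity(0))
    print("membind", numa.get_membind())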

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

raise ImportError("'numa' module not found, install with 'pip install numa'")
import numa

local_size = get_int_from_env(["MPI_LOCALNRANKS", "OMPI_COMM_WORLD_LOCAL_SIZE", "MV2_COMM_WORLD_LOCAL_SIZE"], 1)

(Member): There's also LOCAL_WORLD_SIZE from torchrun.
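For illustration, the lookup could be extended along these lines; get_int_from_env is sketched here with plausible semantics and may differ from the real helper:

import os


def get_int_from_env(env_keys, default):
    # Return the first env var in env_keys that is set to a non-negative integer, else the default.
    for key in env_keys:
        value = int(os.environ.get(key, -1))
        if value >= 0:
            return value
    return default


local_size = get_int_from_env(
    ["LOCAL_WORLD_SIZE", "MPI_LOCALNRANKS", "OMPI_COMM_WORLD_LOCAL_SIZE", "MV2_COMM_WORLD_LOCAL_SIZE"], 1
)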

@rbrugaro (Contributor, Author)

@IlyasMoutawwakil I fixed the last style/quality issue and the checks are passing. If you think it is ready, please merge. Thanks for all the help!

@IlyasMoutawwakil merged commit 9a18ae0 into huggingface:main on Aug 27, 2024
18 checks passed