LocalPipelineExecutor does not use cpu cores #240

Open
elifssamplespace opened this issue Jul 6, 2024 · 2 comments

Comments

@elifssamplespace

I am trying to process a CC dump using the LocalPipelineExecutor. My setup includes 6 files in the dump and a VM with 48 CPU cores. I run the code with 6 tasks and 48 workers. I expect all 48 cores to be utilized, but only 6 cores are actively processing the tasks.

Code:

    executor = LocalPipelineExecutor(
        pipeline=[
            JsonlReader(data_folder=f"{cc_path}/{dump}", text_key="raw_content"),
            URLFilter(),
            GopherRepetitionFilter(language="tr"),
            GopherQualityFilter(language="tr"),
            C4QualityFilter(filter_no_terminal_punct=False, language="tr"),
            C4BadWordsFilter(default_language="tr"),
            PIIFormatter(),
            JsonlWriter(output_folder=f"{out_path}/out-text-process-4/{dump}"),
        ],
        logging_dir="logs",
        workers=48,
        tasks=6,
    )
    executor.run()

How can I use all cores to process data?

@justHungryMan
Contributor

Since the maximum number of tasks is 6, if you try to use 48 workers, only 6 of them get a task and run.
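To make the arithmetic concrete, here is a tiny sketch. `active_workers` is a hypothetical helper written only for illustration, not part of datatrove; it assumes each running worker holds exactly one task at a time, as described above.

```python
def active_workers(workers: int, tasks: int) -> int:
    """Hypothetical helper: how many workers can be busy at the same time.

    Assumes one task per worker at a time, so the task count is the ceiling.
    """
    return min(workers, tasks)


# The setup from this issue: 48 workers, 6 tasks.
print(active_workers(workers=48, tasks=6))  # 6
```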

@guipenedo
Collaborator

Hi, we only multiprocess at the individual file level. So if you have 1 task processing 1 file, giving it more CPUs will not speed up the processing. The way to go faster is to have more (smaller) input files, so that you can have more tasks in total.
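One possible way to get more (smaller) input files is to re-shard each JSONL file before running the pipeline. The sketch below uses only the Python standard library; `split_jsonl`, the shard naming scheme, and the assumption that inputs are plain `.jsonl` or gzip-compressed `.jsonl.gz` files are illustrative choices, not something datatrove provides or requires.

```python
import gzip
from pathlib import Path


def split_jsonl(src: Path, out_dir: Path, shards: int) -> list[Path]:
    """Hypothetical helper: spread the lines of one JSONL file over `shards` files.

    Lines are assigned round-robin, so the shards end up roughly equal in size.
    """
    # Assumption: a ".gz" suffix means gzip-compressed text, anything else is plain text.
    opener = gzip.open if src.suffix == ".gz" else open
    out_dir.mkdir(parents=True, exist_ok=True)
    stem = src.name.removesuffix(".gz").removesuffix(".jsonl")
    paths = [out_dir / f"{stem}_{i:03d}.jsonl" for i in range(shards)]
    outs = [p.open("w", encoding="utf-8") for p in paths]
    try:
        written = 0
        with opener(src, "rt", encoding="utf-8") as f:
            for line in f:
                if not line.strip():
                    continue  # skip blank lines
                outs[written % shards].write(line.rstrip("\n") + "\n")
                written += 1
    finally:
        for out in outs:
            out.close()
    return paths
```

With 6 input files split into 8 shards each, the reader's `data_folder` would point at the shard directory and `tasks` could be raised to 48, so that each of the 48 workers has a file to process.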
