
Spike: design example kedro projects that can be used to assess performance issues #3957

Closed
merelcht opened this issue Jun 17, 2024 · 19 comments

@merelcht (Member)

Description

Prework for #3866

Context

In order to create example Kedro projects that can be used to assess the performance of Kedro and Kedro-Viz, we need to gather requirements on what defines a complex pipeline. Some of the moving parts are the number of nodes, the number of pipelines and the number of datasets, but that might not be all that's required to create a proper "family" of test projects.

Possible Implementation

Good starting point: https://github.com/noklam/kedro-example/blob/master/stress-test-pipeline/src/stress_test_pipeline/pipeline.py
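As a complement to the linked stress-test pipeline, here is a minimal, Kedro-free sketch of the same idea: programmatically generating a long chain of dummy nodes and timing a run. All names (`make_node`, `build_linear_pipeline`, `run_pipeline`) are illustrative, not part of any Kedro API.

```python
import time

def make_node(i):
    """Create a dummy node function; node i adds i to its input."""
    def node(x):
        return x + i
    node.__name__ = f"node_{i}"
    return node

def build_linear_pipeline(n):
    """Generate a chain of n dummy nodes, mimicking a generated pipeline."""
    return [make_node(i) for i in range(n)]

def run_pipeline(nodes, x=0):
    """Execute the chain sequentially and report elapsed seconds."""
    start = time.perf_counter()
    for node in nodes:
        x = node(x)
    return x, time.perf_counter() - start

result, elapsed = run_pipeline(build_linear_pipeline(1000))
# result == sum(range(1000)) == 499500
```

Scaling `n` up (1 000, 10 000, 100 000 nodes) gives a first feel for where pure pipeline-size overhead starts to bite, before any real I/O is involved.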

@datajoely (Contributor)

Heavy dependency imports would be great here too

@merelcht (Member, Author)

> Heavy dependency imports would be great here too

Core dependencies of Kedro or just any?

@datajoely (Contributor)

Sorry I meant things like Pytorch / Tensorflow / Spark / Pandas
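Import cost like this can be measured directly. A small sketch, using the stdlib `json` module as a stand-in for heavy libraries such as pandas or torch (the helper name `time_import` is made up for illustration); `python -X importtime` gives a more detailed per-module breakdown:

```python
import importlib
import sys
import time

def time_import(module_name):
    """Measure the import time of a module by evicting it from the
    module cache and re-importing it."""
    sys.modules.pop(module_name, None)  # force a re-import
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start

print(f"json: {time_import('json') * 1000:.2f} ms")
```

Swapping `'json'` for `'pandas'` or `'torch'` (where installed) would show the multi-second start-up penalty the comment is pointing at.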

@merelcht (Member, Author)

merelcht commented Jul 9, 2024

As mentioned in #3732 large parameter files seem to slow things down.

@noklam (Contributor)

noklam commented Jul 23, 2024

https://linen-slack.kedro.org/t/22711373/is-there-any-people-want-to-use-the-kedro-vscode-extension-s#9a92d9f4-8083-4e74-acf3-ef811df08201

Gathered some topics from the Kedro Slack archive & GitHub.

Bottlenecks:

  • CPU bound
  • Memory bound
  • I/O bound

What is considered slow?

  • Slow start-up time for a light workload, e.g. taking 10 seconds to prepare for a pipeline that finishes in 2 seconds.
  • Benchmarking against a non-Kedro pipeline. Training an LLM is slow with or without Kedro; what we should compare here is whether Kedro introduces an extra penalty.

We could also approach this at the component level first, i.e. how slow is the DataCatalog when the number of datasets scales up, or how slow is pipeline summation when the number of pipelines scales up. The outcome of this issue is to create ideas/scripts that can be reused so we can benchmark performance on an ongoing basis (maybe included in CI, or triggered manually from time to time).
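The component-level idea can be sketched with a toy, Kedro-free registry standing in for the DataCatalog (the `ToyCatalog` and `bench` names are made up; a real benchmark would target `kedro.io.DataCatalog` itself):

```python
import time

class ToyCatalog:
    """Minimal stand-in for a DataCatalog: maps dataset names to configs."""
    def __init__(self):
        self._datasets = {}

    def add(self, name, config):
        self._datasets[name] = config

    def get(self, name):
        return self._datasets[name]

def bench(n):
    """Time registering n datasets; returns seconds elapsed."""
    catalog = ToyCatalog()
    start = time.perf_counter()
    for i in range(n):
        catalog.add(f"dataset_{i}", {"type": "MemoryDataset"})
    return time.perf_counter() - start

for n in (100, 1_000, 10_000):
    print(f"{n:>6} datasets: {bench(n):.4f}s")
```

Plotting elapsed time against `n` is exactly the "how does it scale with the number of entries" question, and the same harness shape applies to ConfigLoader and pipeline summation.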

@marrrcin (Contributor)

I upvote the tests for:

  • large parameter files
  • large data catalogs (although it can sometimes be mitigated by the dataset factories)
  • pipelines generated in loops, especially Dynamic Pipelines

Some less obvious:

  • modifying catalog on the fly in hooks

@noklam (Contributor)

noklam commented Aug 5, 2024

I suggest focusing on two things:

  1. Kedro vs. without Kedro
  2. Config/Pipeline/Catalog creation time as the number of entries scales.

@noklam (Contributor)

noklam commented Aug 20, 2024

Spoke to @rashidakanchwala today and we concluded that the size of the pipeline is usually not the bottleneck for Viz, so we will forgo creating a project with complex (nested) modular pipelines. There is some evidence (Improve resume pipeline suggestion for SequentialRunner by jmholzer · Pull Request #1795 · kedro-org/kedro · GitHub) that pipelines usually scale reasonably well with the number of nodes, up to 1000.

This is my initial idea, I would like to tackle this in two parts:

  1. Pipeline stress test
  2. Component stress test

Pipeline stress test

The goal of this is to reduce the overhead of setting up a realistic, complex project. This usually includes remote storage, PySpark connections, etc.

We can use this as an example:

Component stress test

  • The main goal of this is to benchmark the performance of individual components; this will tell us whether refactoring work has a positive or negative impact. Currently we only check whether tests pass, so we have no idea if a change slows down performance. We have done this in the past, but usually on an ad-hoc basis; we should run it regularly (or at least per release).

The direction of this is simple: we want to measure how run time changes with the number of entries. We would start with Datasets and the Catalog, as this fits in the DataCatalog 2.0 work and will be immediately useful.

  • DataCatalog (test # of datasets with catalog.yml & dataset factories)
  • ConfigLoader (# of parameters)
  • Optional: pipelines generated in loops (dynamic pipelines)

This can address:

While we are creating the pipeline, we should think about how to scale this in the future (if we have new things to test, where and how? This may need some flags to turn on/off, plus documentation).
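The "flags to turn on/off" idea could look like a small CLI over a benchmark registry. A hedged sketch with stdlib `argparse`; the `BENCHMARKS` entries are placeholders, not real benchmark implementations:

```python
import argparse

# Hypothetical registry of component benchmarks; names are illustrative.
# Real entries would call the actual DataCatalog / ConfigLoader harnesses.
BENCHMARKS = {
    "catalog": lambda n: f"benchmarking catalog with {n} datasets",
    "config": lambda n: f"benchmarking config loader with {n} parameters",
}

def main(argv=None):
    """Parse flags and dispatch to the selected component benchmark."""
    parser = argparse.ArgumentParser(description="Run component stress tests")
    parser.add_argument("--component", choices=sorted(BENCHMARKS), required=True)
    parser.add_argument("--entries", type=int, default=1000)
    args = parser.parse_args(argv)
    return BENCHMARKS[args.component](args.entries)

print(main(["--component", "catalog", "--entries", "500"]))
# → benchmarking catalog with 500 datasets
```

New components to test then become one more entry in the registry, which keeps the "where and how do we scale this" question cheap to answer later.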

@astrojuanlu (Member)

Thanks for the summary @noklam. Just one thought on the Pipeline stress test:

Not sure if astrojuanlu/workshop-from-zero-to-mlops is complex enough (@ravi-kumar-pilla and I played around with it a bit and added a pointless PySpark usage; it didn't make much of a difference), but in any case

> This usually includes remote storage, PySpark connections, etc.

This sounds OK. Maybe we need a bit more clarity on what this means for creating a synthetic project, e.g. test:

  • Datasets that are slow to instantiate
  • Slow hooks
  • Slow connections for data loading

Otherwise looking for a "realistic" project might be hard.

About component stress test, the plan sounds good 👍🏼
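The "datasets that are slow to instantiate / slow connections" bullet can be simulated without any real remote storage. A minimal sketch, assuming nothing from Kedro (a real implementation would subclass `kedro.io.AbstractDataset`; `SlowDataset` here is purely illustrative):

```python
import time

class SlowDataset:
    """Toy dataset that simulates a slow remote connection on load/save."""

    def __init__(self, delay=0.01, data=None):
        self.delay = delay  # seconds of simulated latency per call
        self._data = data

    def load(self):
        time.sleep(self.delay)  # stand-in for network round-trip
        return self._data

    def save(self, data):
        time.sleep(self.delay)
        self._data = data

ds = SlowDataset(delay=0.01)
ds.save([1, 2, 3])
assert ds.load() == [1, 2, 3]
```

Dialing `delay` up per dataset (and doing the same inside hooks) gives a controllable synthetic project, instead of hunting for a "realistic" one.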

@ElenaKhaustova (Contributor)

Thank you, @noklam!

  • I would suggest adding tests for different types of runners to the Component stress test;
  • As for the DataCatalog, the most important thing is to test it within the pipeline, via the CLI, and separately by simulating scenarios that call specific methods (such as add_feed_dict). The tests themselves should include different sets and combinations of parameters, datasets and patterns.

@noklam (Contributor)

noklam commented Aug 20, 2024

@ElenaKhaustova

> I would suggest adding tests for different types of runners to the Component stress test;

What do you have in mind for stress-testing runners? Generate some dummy nodes and execute them with different types of runners? Or do we need a different type of workload for each runner: I/O-bound for ThreadRunner, CPU-bound for ParallelRunner?

@ElenaKhaustova (Contributor)

ElenaKhaustova commented Aug 20, 2024

> @ElenaKhaustova
>
> > I would suggest adding tests for different types of runners to the Component stress test;
>
> What do you have in mind for stress-testing runners? Generate some dummy nodes and execute them with different types of runners? Or do we need a different type of workload for each runner: I/O-bound for ThreadRunner, CPU-bound for ParallelRunner?

I was thinking of having at least three different pipelines, one per runner, to stress them: one random pipeline for SequentialRunner, one with external I/O for ThreadRunner, and one that can be run in parallel (at least several processes) for ParallelRunner. That way we can check that their main functionality is not affected by changes and still makes sense. It's also useful for the upcoming DataCatalog changes, to make sure nothing slows down.
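The I/O-bound workload above can be simulated with stdlib `concurrent.futures`, which is also what Kedro's ThreadRunner builds on conceptually. A hedged sketch (`io_bound_node` is a made-up stand-in for a node waiting on a remote store; a CPU-bound variant for the ParallelRunner case would use `ProcessPoolExecutor` instead):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def io_bound_node(i):
    """Simulated I/O-bound node: sleeps as if waiting on a remote store."""
    time.sleep(0.05)
    return i

# ThreadRunner-style execution: threads overlap the sleeps, so 8 nodes
# finish in roughly 0.05s instead of the 0.4s a sequential run would take.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(io_bound_node, range(8)))
elapsed = time.perf_counter() - start
assert results == list(range(8))
assert elapsed < 0.4  # much faster than running the sleeps back to back
```

Comparing `elapsed` across runners for the same synthetic workload is precisely the regression signal these stress pipelines would provide.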

@noklam (Contributor)

noklam commented Aug 22, 2024

It may be interesting to have memory profiling too; it would be helpful to address issues like

@astrojuanlu (Member)

Yes let's include it.
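A memory-profiling hook could start from stdlib `tracemalloc`, which needs no extra dependencies. A minimal sketch (the `profile_memory` helper is illustrative, not an existing Kedro utility):

```python
import tracemalloc

def profile_memory(fn, *args):
    """Run fn and report the peak memory allocated during the call, in bytes."""
    tracemalloc.start()
    try:
        result = fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak

# Example: peak allocation while building a 100k-element list.
result, peak = profile_memory(lambda n: list(range(n)), 100_000)
print(f"peak allocation: {peak / 1024:.0f} KiB")
```

Wrapping catalog creation or a pipeline run in `profile_memory` would give a per-component memory curve alongside the timing one.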

@noklam (Contributor)

noklam commented Aug 27, 2024

I've moved this to review since the scope of the ticket is about defining the scope. A couple of people have reviewed this already; I would like to get some opinions from @merelcht.

There is some additional scope from review comments; I'd like to split it out into additional tickets to make sure the scope of this ticket doesn't grow too big. Implementation will be carried out in #3866; I believe @lrcouto has already started on the pipeline test.

@merelcht (Member, Author)

> I've moved this to review since the scope of the ticket is about defining the scope. A couple of people have reviewed this already; I would like to get some opinions from @merelcht.
>
> There is some additional scope from review comments; I'd like to split it out into additional tickets to make sure the scope of this ticket doesn't grow too big. Implementation will be carried out in #3866; I believe @lrcouto has already started on the pipeline test.

Happy to go forward with the approach of creating a project for pipeline stress testing and separately stress-testing components. Please go ahead and create follow-up tickets. One thing I don't see suggestions on yet is the maintenance model for these testing projects, and when and how they'll get run: automatically, before a release, on every PR, etc.?

@lrcouto (Contributor)

lrcouto commented Aug 28, 2024

I think it would be good to create some sort of automated process to run the projects before releases, for sure. Running them on every PR, as part of regular CI or similar, could be a bit slow or cumbersome.

@noklam (Contributor)

noklam commented Aug 29, 2024

#4128 (comment)

@merelcht I have opened a new ticket. My current idea is that the tests should be easy to run both locally and as a GitHub Action. We may use a tag/branch name to conditionally trigger performance runs; for example, release_xxx and performance_xxx branches would trigger the CI. We can continue the discussion there.
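The branch-name gating could be as simple as a prefix check on the Git ref. A sketch, assuming the `release_` / `performance_` naming suggested above (GitHub Actions exposes the ref in the `GITHUB_REF` environment variable; the helper name is made up):

```python
import os

def should_run_benchmarks(ref=None):
    """Decide from the Git ref whether to run the performance suite."""
    ref = ref if ref is not None else os.environ.get("GITHUB_REF", "")
    branch = ref.rsplit("/", 1)[-1]  # e.g. refs/heads/performance_catalog
    return branch.startswith(("release_", "performance_"))

assert should_run_benchmarks("refs/heads/performance_catalog")
assert not should_run_benchmarks("refs/heads/feature/foo")
```

Keeping the decision in a small script like this means the same gate works locally and in CI, rather than living only in workflow YAML.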

@noklam noklam closed this as completed Aug 29, 2024
7 participants