
Spike: design example kedro projects that can be used to assess performance issues #3957

Closed
merelcht opened this issue Jun 17, 2024 · 19 comments

@merelcht (Member)

Description

Prework for #3866

Context

In order to create example Kedro projects that can be used to assess the performance of Kedro and Kedro-Viz, we need to gather requirements on what defines a complex pipeline. Some of the moving parts are the number of nodes, the number of pipelines and the number of datasets, but that might not be all that's required to create a proper "family" of test projects.

Possible Implementation

Good starting point: https://github.com/noklam/kedro-example/blob/master/stress-test-pipeline/src/stress_test_pipeline/pipeline.py
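As a complement to the linked stress-test pipeline, here is a minimal, Kedro-free sketch of the same idea: programmatically generating a long chain of dummy nodes and timing a run. All names (`make_node`, `build_linear_pipeline`, `run_pipeline`) are illustrative, not part of any Kedro API.

```python
import time

def make_node(i):
    """Create a dummy node function; node i adds i to its input."""
    def node(x):
        return x + i
    node.__name__ = f"node_{i}"
    return node

def build_linear_pipeline(n):
    """Generate a chain of n dummy nodes, mimicking a generated pipeline."""
    return [make_node(i) for i in range(n)]

def run_pipeline(nodes, x=0):
    """Execute the chain sequentially and report elapsed seconds."""
    start = time.perf_counter()
    for node in nodes:
        x = node(x)
    return x, time.perf_counter() - start

result, elapsed = run_pipeline(build_linear_pipeline(1000))
# result == sum(range(1000)) == 499500
```

Scaling `n` up (1 000, 10 000, 100 000 nodes) gives a first feel for where pure pipeline-size overhead starts to bite, before any real I/O is involved.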

@datajoely (Contributor)

Heavy dependency imports would be great here too

@merelcht (Member, Author)

> Heavy dependency imports would be great here too

Core dependencies of Kedro or just any?

@datajoely (Contributor)

Sorry I meant things like Pytorch / Tensorflow / Spark / Pandas
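Import cost like this can be measured directly. A small sketch, using the stdlib `json` module as a stand-in for heavy libraries such as pandas or torch (the helper name `time_import` is made up for illustration); `python -X importtime` gives a more detailed per-module breakdown:

```python
import importlib
import sys
import time

def time_import(module_name):
    """Measure the import time of a module by evicting it from the
    module cache and re-importing it."""
    sys.modules.pop(module_name, None)  # force a re-import
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start

print(f"json: {time_import('json') * 1000:.2f} ms")
```

Swapping `'json'` for `'pandas'` or `'torch'` (where installed) would show the multi-second start-up penalty the comment is pointing at.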

@merelcht (Member, Author)

merelcht commented Jul 9, 2024

As mentioned in #3732 large parameter files seem to slow things down.

@noklam (Contributor)

noklam commented Jul 23, 2024

https://linen-slack.kedro.org/t/22711373/is-there-any-people-want-to-use-the-kedro-vscode-extension-s#9a92d9f4-8083-4e74-acf3-ef811df08201

Gathered some topics from the Kedro Slack archive & GitHub.

Bottlenecks:

  • CPU bound
  • Memory bound
  • I/O bound

What is considered slow?

  • Slow start-up time for a light workload, e.g. taking 10 seconds to prepare for a pipeline that finishes in 2 seconds.
  • Benchmarking against a non-Kedro pipeline. Training an LLM is slow with or without Kedro; what we should compare here is whether Kedro introduces an extra penalty.

We could also approach this at the component level first, i.e. how slow is the DataCatalog when the number of datasets scales up, or how slow is pipeline summation when the number of pipelines scales up. The outcome of this issue is to create ideas/scripts that can be reused so we can benchmark performance on an ongoing basis (maybe included in CI, or triggered manually from time to time).
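The component-level idea can be sketched with a toy, Kedro-free registry standing in for the DataCatalog (the `ToyCatalog` and `bench` names are made up; a real benchmark would target `kedro.io.DataCatalog` itself):

```python
import time

class ToyCatalog:
    """Minimal stand-in for a DataCatalog: maps dataset names to configs."""
    def __init__(self):
        self._datasets = {}

    def add(self, name, config):
        self._datasets[name] = config

    def get(self, name):
        return self._datasets[name]

def bench(n):
    """Time registering n datasets; returns seconds elapsed."""
    catalog = ToyCatalog()
    start = time.perf_counter()
    for i in range(n):
        catalog.add(f"dataset_{i}", {"type": "MemoryDataset"})
    return time.perf_counter() - start

for n in (100, 1_000, 10_000):
    print(f"{n:>6} datasets: {bench(n):.4f}s")
```

Plotting elapsed time against `n` is exactly the "how does it scale with the number of entries" question, and the same harness shape applies to ConfigLoader and pipeline summation.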

@marrrcin (Contributor)

I upvote the tests for:

  • large parameter files
  • large data catalogs (although it can sometimes be mitigated by the dataset factories)
  • pipelines generated in loops, especially Dynamic Pipelines

Some less obvious:

  • modifying catalog on the fly in hooks

@noklam (Contributor)

noklam commented Aug 5, 2024

I suggest focusing on two things:

  1. Kedro vs. without Kedro
  2. Config/Pipeline/Catalog creation time as the number of entries scales.

@noklam (Contributor)

noklam commented Aug 20, 2024

Spoke to @rashidakanchwala today and we concluded that the size of the pipeline is usually not the bottleneck for Viz, so we will forgo creating a project with complex (nested) modular pipelines. There is some evidence (Improve resume pipeline suggestion for SequentialRunner by jmholzer · Pull Request #1795 · kedro-org/kedro · GitHub) that pipelines usually scale reasonably well with the number of nodes, up to 1000.

This is my initial idea, I would like to tackle this in two parts:

  1. Pipeline stress test
  2. Component stress test

Pipeline stress test

The goal of this is to reduce the overhead of setting up a realistic, complex project. This usually includes remote storage, PySpark connections, etc.

We can use this as an example:

Component stress test

  • The main goal of this is to benchmark the performance of individual components; this will tell us whether refactoring work has a positive or negative impact. Currently we only check whether tests pass, so we have no idea if a change slows down performance. We have done this in the past, but usually on an ad-hoc basis; we should run it regularly (or at least per release).

The direction of this is simple: we want to measure how run time changes with the number of entries. We would start with Datasets and the Catalog, as this fits in the DataCatalog 2.0 work and will be immediately useful.

  • DataCatalog (test # of datasets with catalog.yml & dataset factories)
  • ConfigLoader (# of parameters)
  • Optional: pipelines generated in loops (dynamic pipelines)

This can address:

While we are creating the pipeline, we should think about how to scale this in the future (if we have new things to test, where and how? This may need some flags to turn on/off, plus documentation).
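The "flags to turn on/off" idea could look like a small CLI over a benchmark registry. A hedged sketch with stdlib `argparse`; the `BENCHMARKS` entries are placeholders, not real benchmark implementations:

```python
import argparse

# Hypothetical registry of component benchmarks; names are illustrative.
# Real entries would call the actual DataCatalog / ConfigLoader harnesses.
BENCHMARKS = {
    "catalog": lambda n: f"benchmarking catalog with {n} datasets",
    "config": lambda n: f"benchmarking config loader with {n} parameters",
}

def main(argv=None):
    """Parse flags and dispatch to the selected component benchmark."""
    parser = argparse.ArgumentParser(description="Run component stress tests")
    parser.add_argument("--component", choices=sorted(BENCHMARKS), required=True)
    parser.add_argument("--entries", type=int, default=1000)
    args = parser.parse_args(argv)
    return BENCHMARKS[args.component](args.entries)

print(main(["--component", "catalog", "--entries", "500"]))
# → benchmarking catalog with 500 datasets
```

New components to test then become one more entry in the registry, which keeps the "where and how do we scale this" question cheap to answer later.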

@astrojuanlu (Member)

Thanks for the summary @noklam. Just one thought on the Pipeline stress test:

Not sure if astrojuanlu/workshop-from-zero-to-mlops is complex enough (@ravi-kumar-pilla and I played around with it a bit and added a pointless PySpark usage; it didn't make much of a difference), but in any case

> This usually includes remote storage, PySpark connections, etc.

This sounds OK. Maybe we need a bit more clarity on what this means for creating a synthetic project, e.g. test:

  • Datasets that are slow to instantiate
  • Slow hooks
  • Slow connections for data loading

Otherwise looking for a "realistic" project might be hard.

About component stress test, the plan sounds good 👍🏼
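The "datasets that are slow to instantiate / slow connections" bullet can be simulated without any real remote storage. A minimal sketch, assuming nothing from Kedro (a real implementation would subclass `kedro.io.AbstractDataset`; `SlowDataset` here is purely illustrative):

```python
import time

class SlowDataset:
    """Toy dataset that simulates a slow remote connection on load/save."""

    def __init__(self, delay=0.01, data=None):
        self.delay = delay  # seconds of simulated latency per call
        self._data = data

    def load(self):
        time.sleep(self.delay)  # stand-in for network round-trip
        return self._data

    def save(self, data):
        time.sleep(self.delay)
        self._data = data

ds = SlowDataset(delay=0.01)
ds.save([1, 2, 3])
assert ds.load() == [1, 2, 3]
```

Dialing `delay` up per dataset (and doing the same inside hooks) gives a controllable synthetic project, instead of hunting for a "realistic" one.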

@ElenaKhaustova (Contributor)

Thank you, @noklam!

  • I would suggest adding tests for different types of runners to the Component stress test;
  • As for the DataCatalog, the most important thing is to test it within the pipeline, via the CLI, and separately by simulating scenarios that call specific methods (such as add_feed_dict). The tests themselves should include different sets and combinations of parameters, datasets and patterns.

@noklam (Contributor)

noklam commented Aug 20, 2024

@ElenaKhaustova

> I would suggest adding tests for different types of runners to the Component stress test;

What do you have in mind for stress-testing runners? Generate some dummy nodes and execute them with different types of runners? Or do we need a different type of workload for each runner: I/O-bound for ThreadRunner, CPU-bound for ParallelRunner?

@ElenaKhaustova (Contributor)

ElenaKhaustova commented Aug 20, 2024

> @ElenaKhaustova
>
> > I would suggest adding tests for different types of runners to the Component stress test;
>
> What do you have in mind for stress-testing runners? Generate some dummy nodes and execute them with different types of runners? Or do we need a different type of workload for each runner: I/O-bound for ThreadRunner, CPU-bound for ParallelRunner?

I was thinking of having at least three different pipelines, one per runner, to stress them: one random pipeline for SequentialRunner, one with external I/O for ThreadRunner, and one that can be run in parallel (at least several processes) for ParallelRunner. That way we can check that their main functionality is not affected by changes and still makes sense. It's also useful for the upcoming DataCatalog changes, to make sure nothing slows down.
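The I/O-bound workload above can be simulated with stdlib `concurrent.futures`, which is also what Kedro's ThreadRunner builds on conceptually. A hedged sketch (`io_bound_node` is a made-up stand-in for a node waiting on a remote store; a CPU-bound variant for the ParallelRunner case would use `ProcessPoolExecutor` instead):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def io_bound_node(i):
    """Simulated I/O-bound node: sleeps as if waiting on a remote store."""
    time.sleep(0.05)
    return i

# ThreadRunner-style execution: threads overlap the sleeps, so 8 nodes
# finish in roughly 0.05s instead of the 0.4s a sequential run would take.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(io_bound_node, range(8)))
elapsed = time.perf_counter() - start
assert results == list(range(8))
assert elapsed < 0.4  # much faster than running the sleeps back to back
```

Comparing `elapsed` across runners for the same synthetic workload is precisely the regression signal these stress pipelines would provide.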

@noklam (Contributor)

noklam commented Aug 22, 2024

It may be interesting to have memory profiling too; it would be helpful to address issues like

@astrojuanlu (Member)

Yes let's include it.
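A memory-profiling hook could start from stdlib `tracemalloc`, which needs no extra dependencies. A minimal sketch (the `profile_memory` helper is illustrative, not an existing Kedro utility):

```python
import tracemalloc

def profile_memory(fn, *args):
    """Run fn and report the peak memory allocated during the call, in bytes."""
    tracemalloc.start()
    try:
        result = fn(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak

# Example: peak allocation while building a 100k-element list.
result, peak = profile_memory(lambda n: list(range(n)), 100_000)
print(f"peak allocation: {peak / 1024:.0f} KiB")
```

Wrapping catalog creation or a pipeline run in `profile_memory` would give a per-component memory curve alongside the timing one.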

@noklam (Contributor)

noklam commented Aug 27, 2024

I've moved this to review since the scope of the ticket is about defining the scope. A couple of people have reviewed this already; I would like to get some opinions from @merelcht.

There is some additional scope from review comments; I'd like to split it out into additional tickets to make sure the scope of this ticket doesn't grow too big. Implementation will be carried out in #3866; I believe @lrcouto has already started on the pipeline test.

@merelcht (Member, Author)

> I've moved this to review since the scope of the ticket is about defining the scope. A couple of people have reviewed this already; I would like to get some opinions from @merelcht.
>
> There is some additional scope from review comments; I'd like to split it out into additional tickets to make sure the scope of this ticket doesn't grow too big. Implementation will be carried out in #3866; I believe @lrcouto has already started on the pipeline test.

Happy to go forward with the approach of creating a project for pipeline stress testing and separately stress-testing components. Please go ahead and create follow-up tickets. One thing I don't see suggestions on yet is the maintenance model for these testing projects, and when and how they'll get run: automatically, before a release, on every PR, etc.?

@lrcouto (Contributor)

lrcouto commented Aug 28, 2024

I think it would be good to create some sort of automated process to run the projects before releases, for sure. Running them on every PR, as part of regular CI or similar, could be a bit slow or cumbersome.

@noklam (Contributor)

noklam commented Aug 29, 2024

#4128 (comment)

@merelcht I have opened a new ticket. My current idea is that the tests should be easy to run both locally and as a GitHub Action. We may use a tag/branch name to conditionally trigger performance runs; for example, release_xxx and performance_xxx branches would trigger the CI. We can continue the discussion there.
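The branch-name gating could be as simple as a prefix check on the Git ref. A sketch, assuming the `release_` / `performance_` naming suggested above (GitHub Actions exposes the ref in the `GITHUB_REF` environment variable; the helper name is made up):

```python
import os

def should_run_benchmarks(ref=None):
    """Decide from the Git ref whether to run the performance suite."""
    ref = ref if ref is not None else os.environ.get("GITHUB_REF", "")
    branch = ref.rsplit("/", 1)[-1]  # e.g. refs/heads/performance_catalog
    return branch.startswith(("release_", "performance_"))

assert should_run_benchmarks("refs/heads/performance_catalog")
assert not should_run_benchmarks("refs/heads/feature/foo")
```

Keeping the decision in a small script like this means the same gate works locally and in CI, rather than living only in workflow YAML.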

@noklam noklam closed this as completed Aug 29, 2024
7 participants