DO NOT MERGE - Pipeline performance test project #4154

Closed
wants to merge 24 commits into from

@lrcouto lrcouto commented Sep 10, 2024

Description

A Kedro project built to simulate delays and latency at specific points of a Kedro pipeline. Pass the desired delays, in seconds, using the --params flag. For example:

kedro run --params=hook_delay=5,dataset_load_delay=5,file_save_delay=5
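
As a rough illustration of the idea (not the code in this PR), a delay parameter could be consumed along these lines. The helper names `parse_params` and `simulate_delay` are hypothetical, and the parsing is a simplified stand-in for what `kedro run --params` does with a `key=value,key=value` string.

```python
import time
from typing import Dict


def parse_params(raw: str) -> Dict[str, float]:
    """Parse a 'key=value,key=value' string into delays in seconds (simplified)."""
    params: Dict[str, float] = {}
    for item in raw.split(","):
        key, _, value = item.partition("=")
        params[key.strip()] = float(value)
    return params


def simulate_delay(params: Dict[str, float], name: str) -> float:
    """Sleep for the configured delay (0 if unset) and return the seconds slept."""
    delay = params.get(name, 0.0)
    time.sleep(delay)
    return delay


if __name__ == "__main__":
    # Short delays here so the sketch finishes quickly.
    params = parse_params("hook_delay=0.1,dataset_load_delay=0,file_save_delay=0")
    simulate_delay(params, "hook_delay")
```

A hook or dataset wrapper would call something like `simulate_delay` at the point where latency should be injected; which points those are is up to the project.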

Development notes

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

Signed-off-by: Laura Couto <laurarccouto@gmail.com>
@lrcouto lrcouto marked this pull request as ready for review September 12, 2024 00:10
@noklam noklam left a comment

Thanks! I wasn't able to run the pipeline because the data is missing, so I only reviewed it quickly at a high level.

Can you add a description to the PR explaining how to use this pipeline test? I see that most of the pipelines here mock the work with sleep; why did you end up going with this implementation?

For example, if I want to answer the question "does Kedro run too slowly when it needs to connect to a database?", which command should I run?

@@ -0,0 +1,20 @@
# What is this for?

This folder should be used to store configuration files used by Kedro or by separate tools.

Is there any specific configuration that needs to be documented? Otherwise I think we can remove this from our project.

@@ -0,0 +1,98 @@
# performance-test

Maybe it would be more helpful to document how this project should be used; otherwise I suggest removing it, as this template doesn't add much information for us.

Comment on lines 32 to 40
def register_pipelines(self) -> Dict[str, Pipeline]:
    from performance_test.pipelines.expense_analysis import (
        pipeline as expense_analysis_pipeline,
    )

    return {
        "__default__": expense_analysis_pipeline.create_pipeline(),
        "expense_analysis": expense_analysis_pipeline.create_pipeline(),
    }

Does this belong in pipeline_registry.py?

notebook
ruff~=0.1.8
scikit-learn~=1.5.1; python_version >= "3.9"
scikit-learn<=1.4.0,>=1.0; python_version < "3.9"

pyspark should probably be here

noklam commented Sep 24, 2024

I am still unable to run the pipeline. Am I supposed to get the data from somewhere? Can we merge this folder with Ankita's benchmark? (No hurry for now; we can do this at the end.)

Signed-off-by: Laura Couto <laurarccouto@gmail.com>

@ankatiyar ankatiyar left a comment

Thanks @lrcouto, I was able to get the pipeline to run! (Thanks for helping with the setup.) It looks good to me, just some minor comments.
I think it'd be nice to have this project be its own separate repository that we could use to run performance tests, instead of being part of the Kedro codebase, but I'm keen to hear what others think.

@@ -0,0 +1 @@
{}

We can get rid of this folder entirely; it's generated by Viz.

@@ -0,0 +1,19 @@
# performance-test

We could also add setup instructions here, just so it's recorded somewhere!

@noklam noklam self-requested a review September 25, 2024 15:39

@noklam noklam left a comment

I can run the pipeline successfully too, with an extra step to install Java on GitPod.

kedro run --params=hook_delay=5,dataset_load_delay=5,file_save_delay=5

Apart from @ankatiyar's comments, I have some minor comments about making the parameter names consistent. As we discussed, a few preset configurations would be helpful so people know how to use the configuration to test (we'll likely need these presets anyway to run the benchmark automatically).

If I understand correctly, I don't expect any difference between:
kedro run --file-save-delay=5 and kedro run --file-load-delay=5
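
One way the preset idea could look, as a sketch only: a small table of named presets that expands into the `--params` string used above. The preset names and values here are made up for illustration; only the three parameter names come from this PR.

```python
from typing import Dict

# Hypothetical presets; the names and delay values are illustrative only.
PRESETS: Dict[str, Dict[str, int]] = {
    "baseline": {"hook_delay": 0, "dataset_load_delay": 0, "file_save_delay": 0},
    "slow_hooks": {"hook_delay": 5, "dataset_load_delay": 0, "file_save_delay": 0},
    "slow_io": {"hook_delay": 0, "dataset_load_delay": 5, "file_save_delay": 5},
}


def build_command(preset: str) -> str:
    """Build the `kedro run` command line for a named preset."""
    params = ",".join(f"{key}={value}" for key, value in PRESETS[preset].items())
    return f"kedro run --params={params}"


if __name__ == "__main__":
    for name in PRESETS:
        print(f"{name}: {build_command(name)}")
```

Keeping the presets in one place would also make it easier to rename the parameters later without touching every benchmark invocation.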

@@ -0,0 +1,3 @@
hook_delay: 0
dataset_load_delay: 0
file_save_delay: 0

Can we choose one naming convention? Either data_ (e.g. data_save_delay) or file_ (e.g. file_load_delay).

Contributor Author

We could just call them "save_delay" and "load_delay" maybe?


sounds good

lrcouto and others added 4 commits September 26, 2024 10:46
Signed-off-by: Laura Couto <laurarccouto@gmail.com>

lrcouto commented Oct 16, 2024

Project is currently located at https://github.com/kedro-org/pipeline-performance-test

Closing this PR since it's not necessary anymore.
