
DO NOT MERGE - Pipeline performance test project #4154

Closed · wants to merge 24 commits
b31c0d6
Add test project
lrcouto Sep 9, 2024
43b7571
Add delays
lrcouto Sep 9, 2024
f505a7f
Use env vars to determine delay
lrcouto Sep 9, 2024
eafe4c5
Use kedro run --params to determine delays
lrcouto Sep 11, 2024
c5a1ac3
Add extra nodes
lrcouto Sep 11, 2024
d1a492a
Merge branch 'main' into pipeline-performance-test
lrcouto Sep 12, 2024
bfd5844
Merge branch 'main' into pipeline-performance-test
lrcouto Sep 16, 2024
97bc3d4
Merge branch 'main' into pipeline-performance-test
lrcouto Sep 17, 2024
bd16556
Merge branch 'main' into pipeline-performance-test
lrcouto Sep 17, 2024
6c5ac73
Remove redundant function from hooks
lrcouto Sep 19, 2024
e6ec50f
Merge branch 'main' into pipeline-performance-test
lrcouto Sep 20, 2024
60f06ad
Add usage instructions to readme
lrcouto Sep 23, 2024
3399c37
Merge branch 'pipeline-performance-test' of github.com:kedro-org/kedr…
lrcouto Sep 23, 2024
5acf23c
Merge branch 'main' into pipeline-performance-test
lrcouto Sep 23, 2024
86e53fe
Merge branch 'main' into pipeline-performance-test
lrcouto Sep 23, 2024
6f24fe0
Add pyspark to project requirements
lrcouto Sep 23, 2024
3d2e5b8
Merge branch 'pipeline-performance-test' of github.com:kedro-org/kedr…
lrcouto Sep 23, 2024
6f3b67d
Add example dataset to repo
lrcouto Sep 24, 2024
28b938a
Add spark dataset requirements to project requirements file
lrcouto Sep 24, 2024
f4fa341
Merge branch 'main' into pipeline-performance-test
lrcouto Sep 25, 2024
d476736
Merge branch 'main' into pipeline-performance-test
lrcouto Sep 26, 2024
3414c78
Change param names
lrcouto Sep 26, 2024
f1ba080
Rerun docs build
lrcouto Sep 26, 2024
b24048c
Merge branch 'main' into pipeline-performance-test
lrcouto Sep 26, 2024
151 changes: 151 additions & 0 deletions performance-test/.gitignore
@@ -0,0 +1,151 @@
##########################
# KEDRO PROJECT

# ignore all local configuration
conf/local/**
!conf/local/.gitkeep

# ignore potentially sensitive credentials files
conf/**/*credentials*

# ignore everything in the following folders
data/**

# except their sub-folders
!data/**/

# also keep all .gitkeep files
!.gitkeep

# keep also the example dataset
!data/01_raw/*


##########################
# Common files

# IntelliJ
.idea/
*.iml
out/
.idea_modules/

### macOS
*.DS_Store
.AppleDouble
.LSOverride
.Trashes

# Vim
*~
.*.swo
.*.swp

# emacs
*~
\#*\#
/.emacs.desktop
/.emacs.desktop.lock
*.elc

# JIRA plugin
atlassian-ide-plugin.xml

# C extensions
*.so

### Python template
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
.static_storage/
.media/
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# mkdocs documentation
/site

# mypy
.mypy_cache/
1 change: 1 addition & 0 deletions performance-test/.viz/stats.json
@@ -0,0 +1 @@
{}
Contributor: Can get rid of this folder entirely, it's generated by viz.

19 changes: 19 additions & 0 deletions performance-test/README.md
@@ -0,0 +1,19 @@
# performance-test

[vale warning, README.md line 1] Kedro.headings: 'performance-test' should use sentence-style capitalization.
Contributor: Maybe it's more helpful to document how this project should be used; otherwise I suggest removing it, as this template doesn't add much information for us.

Contributor: We could also add setup instructions here, just so it's recorded somewhere!

## Overview

This is a test project that simulates delays in specific parts of a Kedro pipeline. It is intended as a tool to gauge pipeline performance and to compare in-development changes to Kedro against a stable release version.

## Usage

There are three delay parameters that can be set in this project:

**hook_delay** - Simulates slow-loading hooks, for example because they perform complex operations or access external services that suffer from latency.

[vale warning, README.md line 11] Kedro.Spellings: Did you really mean 'hook_delay'?

**load_delay** - Simulates a delay in loading a dataset, for example because of a large file size or connection latency.

[vale warning, README.md line 13] Kedro.Spellings: Did you really mean 'load_delay'?

**save_delay** - Simulates a delay in saving an output file, for example because of connection latency when accessing remote storage.

[vale warning, README.md line 15] Kedro.Spellings: Did you really mean 'save_delay'?

When invoking the `kedro run` command, you can pass the desired value in seconds for each delay as a parameter using the `--params` flag. For example:

`kedro run --params=hook_delay=5,load_delay=5,save_delay=5`
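The delays described in the README could be implemented with ordinary Kedro hooks that simply sleep for the configured number of seconds. A minimal sketch follows; `DelayHooks` and its method signatures are hypothetical stand-ins (the real project's hooks would be decorated with `@hook_impl` and registered in `settings.py`), shown here only to illustrate the mechanism:

```python
import time


class DelayHooks:
    """Hypothetical sketch of delay-simulation hooks.

    In a real Kedro project, these methods would be hook implementations
    (decorated with @hook_impl) matching Kedro's hook specs; here they are
    plain methods so the idea stays self-contained.
    """

    def __init__(self, params):
        # params would normally come from parameters.yml, possibly
        # overridden at runtime via `kedro run --params=...`
        self.params = params

    def after_context_created(self):
        # hook_delay: pretend hook setup does slow work (e.g. remote calls)
        time.sleep(self.params.get("hook_delay", 0))

    def before_dataset_loaded(self, dataset_name):
        # load_delay: pretend the dataset is large or the connection is slow
        time.sleep(self.params.get("load_delay", 0))

    def after_dataset_saved(self, dataset_name):
        # save_delay: pretend remote storage is slow to accept the write
        time.sleep(self.params.get("save_delay", 0))


# Tiny demonstration: a 0.01 s hook delay is observable via a timer.
hooks = DelayHooks({"hook_delay": 0.01, "load_delay": 0, "save_delay": 0})
start = time.perf_counter()
hooks.after_context_created()
elapsed = time.perf_counter() - start
```

Because the delays are plain `time.sleep` calls driven by parameters, the same project can be run unchanged against different Kedro versions and the wall-clock difference attributed to the framework rather than the pipeline logic.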
58 changes: 58 additions & 0 deletions performance-test/conf/base/catalog.yml
@@ -0,0 +1,58 @@
congress_expenses:
type: spark.SparkDataset
filepath: data/gastos-deputados.csv
file_format: csv
load_args:
header: True
inferSchema: True

expenses_per_party:
type: spark.SparkDataset
filepath: data/output/expenses_per_party.csv
file_format: csv
save_args:
sep: ','
header: True
mode: overwrite
load_args:
header: True
inferSchema: True

largest_expense_source:
type: spark.SparkDataset
filepath: data/output/largest_expense_source.parquet
file_format: parquet
save_args:
sep: ','
header: True
mode: overwrite

top_spender_per_party:
type: spark.SparkDataset
filepath: data/output/top_spender_per_party.csv
file_format: csv
save_args:
sep: ','
header: True
mode: overwrite
load_args:
header: True
inferSchema: True

top_overall_spender:
type: spark.SparkDataset
filepath: data/output/top_overall_spender.parquet
file_format: parquet
save_args:
sep: ','
header: True
mode: overwrite

top_spending_party:
type: spark.SparkDataset
filepath: data/output/top_spending_party.parquet
file_format: parquet
save_args:
sep: ','
header: True
mode: overwrite
3 changes: 3 additions & 0 deletions performance-test/conf/base/parameters.yml
@@ -0,0 +1,3 @@
hook_delay: 0
load_delay: 0
save_delay: 0
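These base values of zero mean the pipeline runs at full speed unless overridden. Conceptually, the `--params` flag merges runtime values over this file, with the runtime values winning. The helper below is a hypothetical illustration of that precedence (it is not Kedro's actual parser, which handles richer value types):

```python
# Base values, mirroring performance-test/conf/base/parameters.yml
base_params = {"hook_delay": 0, "load_delay": 0, "save_delay": 0}


def parse_cli_params(raw):
    """Parse a `key=value,key=value` string like the one passed to
    `kedro run --params=...`. Delays are treated as integer seconds here."""
    out = {}
    for pair in raw.split(","):
        key, _, value = pair.partition("=")
        out[key.strip()] = int(value)
    return out


def merge_params(base, overrides):
    """Runtime parameters take precedence over file-based ones."""
    return {**base, **overrides}


# e.g. kedro run --params=hook_delay=5,load_delay=5,save_delay=5
params = merge_params(base_params, parse_cli_params("hook_delay=5,load_delay=5,save_delay=5"))
```

With no `--params` flag, `merge_params(base_params, {})` leaves all delays at zero, which is why the defaults in this file are safe.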
5 changes: 5 additions & 0 deletions performance-test/conf/base/parameters_expense_analysis.yml
@@ -0,0 +1,5 @@
# This is a boilerplate parameters config generated for pipeline 'expense_analysis'
# using Kedro 0.19.8.
#
# Documentation for this file format can be found in "Parameters"
# Link: https://docs.kedro.org/en/0.19.8/configuration/parameters.html
8 changes: 8 additions & 0 deletions performance-test/conf/base/spark.yml
@@ -0,0 +1,8 @@
# You can define Spark-specific configuration here.

spark.driver.maxResultSize: 3g
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.pyspark.enabled: true

# https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
Empty file.