# DO NOT MERGE - Pipeline performance test project #4154

Changes from 14 commits.
## Project `.gitignore` (new file, +151 lines)

```
##########################
# KEDRO PROJECT

# ignore all local configuration
conf/local/**
!conf/local/.gitkeep

# ignore potentially sensitive credentials files
conf/**/*credentials*

# ignore everything in the following folders
data/**

# except their sub-folders
!data/**/

# also keep all .gitkeep files
!.gitkeep

# keep also the example dataset
!data/01_raw/*


##########################
# Common files

# IntelliJ
.idea/
*.iml
out/
.idea_modules/

### macOS
*.DS_Store
.AppleDouble
.LSOverride
.Trashes

# Vim
*~
.*.swo
.*.swp

# emacs
*~
\#*\#
/.emacs.desktop
/.emacs.desktop.lock
*.elc

# JIRA plugin
atlassian-ide-plugin.xml

# C extensions
*.so

### Python template
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/

# Translations
*.mo
*.pot

# Django stuff:
*.log
.static_storage/
.media/
local_settings.py

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# pyenv
.python-version

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# mkdocs documentation
/site

# mypy
.mypy_cache/
```
## Empty JSON file (new file, +1 line)

```json
{}
```
## `performance-test/README.md` (new file, +19 lines)

```markdown
# performance-test

## Overview

This is a test project meant to simulate delays in specific parts of a Kedro pipeline. It is intended as a tool to gauge pipeline performance and to compare in-development changes to Kedro against a stable release version.

## Usage

There are three delay parameters that can be set in this project:

**hook_delay** - Simulates slow-loading hooks, for example because a hook performs complex operations or accesses external services that can suffer from latency.

**dataset_load_delay** - Simulates a delay in loading a dataset, for example because of a large file size or connection latency.

**file_save_delay** - Simulates a delay in saving an output file, for example because of connection latency when accessing remote storage.

When invoking the `kedro run` command, you can pass the desired value in seconds for each delay as a parameter using the `--params` flag. For example:

`kedro run --params=hook_delay=5,dataset_load_delay=5,file_save_delay=5`
```

> **Review comment:** Maybe it's more helpful to document how this project should be used; otherwise I suggest removing it, as this template doesn't add much information for us.
>
> **Review comment:** We could also add setup instructions here, just so it's recorded somewhere!
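The comparison workflow the README describes (running the pipeline under two Kedro versions and comparing durations) can be sketched in plain Python. The helper name `timed_run` is hypothetical, not part of the project; the `kedro run` command in the comment is the one the README documents.

```python
import subprocess
import sys
import time


def timed_run(cmd: list[str]) -> float:
    """Run a command to completion and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


if __name__ == "__main__":
    # In the real workflow this would be run once per Kedro version under test:
    #   timed_run(["kedro", "run", "--params=hook_delay=5,dataset_load_delay=5,file_save_delay=5"])
    # Here we time a trivial command so the sketch is self-contained.
    elapsed = timed_run([sys.executable, "-c", "import time; time.sleep(0.1)"])
    print(f"elapsed: {elapsed:.2f}s")
```

Because the delays are injected in fixed, known amounts, the difference between the two timings isolates the framework overhead being compared.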
## Dataset catalog (new file, +58 lines)

```yaml
congress_expenses:
  type: spark.SparkDataset
  filepath: data/gastos-deputados.csv
  file_format: csv
  load_args:
    header: True
    inferSchema: True

expenses_per_party:
  type: spark.SparkDataset
  filepath: data/output/expenses_per_party.csv
  file_format: csv
  save_args:
    sep: ','
    header: True
    mode: overwrite
  load_args:
    header: True
    inferSchema: True

largest_expense_source:
  type: spark.SparkDataset
  filepath: data/output/largest_expense_source.parquet
  file_format: parquet
  save_args:
    sep: ','
    header: True
    mode: overwrite

top_spender_per_party:
  type: spark.SparkDataset
  filepath: data/output/top_spender_per_party.csv
  file_format: csv
  save_args:
    sep: ','
    header: True
    mode: overwrite
  load_args:
    header: True
    inferSchema: True

top_overall_spender:
  type: spark.SparkDataset
  filepath: data/output/top_overall_spender.parquet
  file_format: parquet
  save_args:
    sep: ','
    header: True
    mode: overwrite

top_spending_party:
  type: spark.SparkDataset
  filepath: data/output/top_spending_party.parquet
  file_format: parquet
  save_args:
    sep: ','
    header: True
    mode: overwrite
```
## Delay parameters (new file, +3 lines)

```yaml
hook_delay: 0
dataset_load_delay: 0
file_save_delay: 0
```

> **Review comment:** Can we choose one name? Either `data_save_delay` or `file_load_delay`.
>
> **Reply:** We could just call them `save_delay` and `load_delay`, maybe?
>
> **Reply:** Sounds good.
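For reference, the `--params` override string from the README maps onto exactly these keys. A simplified, hypothetical parser shows the shape of that mapping; Kedro's real `--params` handling is richer (it uses OmegaConf and supports nested keys), so this is only an illustration.

```python
def parse_params(spec: str) -> dict:
    """Parse a 'key=value,key=value' override string into a dict,
    coercing integer values. Simplified sketch, not Kedro's parser."""
    params = {}
    for pair in spec.split(","):
        key, _, value = pair.partition("=")
        params[key.strip()] = int(value) if value.lstrip("-").isdigit() else value
    return params


print(parse_params("hook_delay=5,dataset_load_delay=5,file_save_delay=5"))
# -> {'hook_delay': 5, 'dataset_load_delay': 5, 'file_save_delay': 5}
```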
## Pipeline parameters boilerplate (new file, +5 lines)

```yaml
# This is a boilerplate parameters config generated for pipeline 'expense_analysis'
# using Kedro 0.19.8.
#
# Documentation for this file format can be found in "Parameters"
# Link: https://docs.kedro.org/en/0.19.8/configuration/parameters.html
```
## Spark configuration (`spark` config, new file, +8 lines)

```yaml
# You can define spark specific configuration here.

spark.driver.maxResultSize: 3g
spark.hadoop.fs.s3a.impl: org.apache.hadoop.fs.s3a.S3AFileSystem
spark.sql.execution.arrow.pyspark.enabled: true

# https://docs.kedro.org/en/stable/integrations/pyspark_integration.html#tips-for-maximising-concurrency-using-threadrunner
spark.scheduler.mode: FAIR
```
## `pyproject.toml` (new file, +43 lines)

```toml
[build-system]
requires = [ "setuptools",]
build-backend = "setuptools.build_meta"

[project]
name = "performance_test"
readme = "README.md"
dynamic = [ "dependencies", "version",]

[project.scripts]
performance-test = "performance_test.__main__:main"

[tool.kedro]
package_name = "performance_test"
project_name = "performance-test"
kedro_init_version = "0.19.8"
tools = [ "PySpark", "Linting",]
example_pipeline = "False"
source_dir = "src"

[tool.ruff]
line-length = 88
show-fixes = true
select = [ "F", "W", "E", "I", "UP", "PL", "T201",]
ignore = [ "E501",]

[project.entry-points."kedro.hooks"]

[tool.ruff.format]
docstring-code-format = true

[tool.setuptools.dynamic.dependencies]
file = "requirements.txt"

[tool.setuptools.dynamic.version]
attr = "performance_test.__version__"

[tool.setuptools.packages.find]
where = [ "src",]
namespaces = false

[tool.kedro_telemetry]
project_id = ""
```
## `requirements.txt` (new file, +11 lines)

```
ipython>=8.10
jupyterlab>=3.0
kedro~=0.19.8
kedro-datasets>=3.0; python_version >= "3.9"
kedro-datasets>=1.0; python_version < "3.9"
kedro-viz>=6.7.0
kedro[jupyter]
notebook
ruff~=0.1.8
scikit-learn~=1.5.1; python_version >= "3.9"
scikit-learn<=1.4.0,>=1.0; python_version < "3.9"
```
## Package `__init__.py` (new file, +4 lines)

```python
"""performance-test
"""

__version__ = "0.1"
```
## Package `__main__.py` (new file, +24 lines)

```python
"""performance-test file for ensuring the package is executable
as `performance-test` and `python -m performance_test`
"""
import sys
from pathlib import Path
from typing import Any

from kedro.framework.cli.utils import find_run_command
from kedro.framework.project import configure_project


def main(*args, **kwargs) -> Any:
    package_name = Path(__file__).parent.name
    configure_project(package_name)

    interactive = hasattr(sys, "ps1")
    kwargs["standalone_mode"] = not interactive

    run = find_run_command(package_name)
    return run(*args, **kwargs)


if __name__ == "__main__":
    main()
```
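The `hasattr(sys, "ps1")` check in `__main__.py` detects an interactive interpreter: CPython only defines `sys.ps1` in interactive sessions. Disabling click-style `standalone_mode` in that case lets the run command return a value instead of calling `sys.exit()`. A minimal, self-contained illustration of the check:

```python
import sys


def in_interactive_shell() -> bool:
    """True only in an interactive interpreter, where CPython sets sys.ps1."""
    return hasattr(sys, "ps1")


if __name__ == "__main__":
    # Run as a script (non-interactive), sys.ps1 is not set.
    print(in_interactive_shell())  # -> False
```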
## Spark hooks (new file, +27 lines)

```python
from time import sleep

from kedro.framework.hooks import hook_impl
from pyspark import SparkConf
from pyspark.sql import SparkSession


class SparkHooks:
    @hook_impl
    def after_context_created(self, context) -> None:
        """Initialises a SparkSession using the config
        defined in project's conf folder.
        """

        # Load the spark configuration in spark.yaml using the config loader
        parameters = context.config_loader["spark"]
        spark_conf = SparkConf().setAll(parameters.items())

        # Initialise the spark session
        spark_session_conf = (
            SparkSession.builder.appName(context.project_path.name)
            .enableHiveSupport()
            .config(conf=spark_conf)
        )
        sleep(context.params["hook_delay"])
        _spark_session = spark_session_conf.getOrCreate()
        _spark_session.sparkContext.setLogLevel("WARN")
```
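The hook above only implements `hook_delay`. The dataset-side delays (`dataset_load_delay`, `file_save_delay`) would presumably hang off Kedro's dataset hooks; as a self-contained stand-in that does not depend on Kedro, the same delay-injection pattern looks like this (the `DelayedDataset` class is purely illustrative, not Kedro API):

```python
import time


class DelayedDataset:
    """Illustrative stand-in, not part of Kedro: wraps an in-memory value and
    injects artificial latency before load and save, mirroring what the
    project's dataset_load_delay and file_save_delay parameters simulate."""

    def __init__(self, data, load_delay: float = 0.0, save_delay: float = 0.0):
        self._data = data
        self.load_delay = load_delay
        self.save_delay = save_delay

    def load(self):
        time.sleep(self.load_delay)  # simulated slow read (large file, network latency)
        return self._data

    def save(self, data) -> None:
        time.sleep(self.save_delay)  # simulated slow write (remote storage latency)
        self._data = data
```

Keeping the sleeps at well-defined points means a timed run degrades by a known, fixed amount, so any remaining variance can be attributed to the framework under test.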
## Pipeline registry (new file, +16 lines)

```python
"""Project pipelines."""
from typing import Dict

from kedro.framework.project import find_pipelines
from kedro.pipeline import Pipeline


def register_pipelines() -> Dict[str, Pipeline]:
    """Register the project's pipelines.

    Returns:
        A mapping from pipeline names to ``Pipeline`` objects.
    """
    pipelines = find_pipelines()
    pipelines["__default__"] = sum(pipelines.values())
    return pipelines
```
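The `sum(pipelines.values())` idiom works because Kedro's `Pipeline` supports `+`, and since `sum()` starts from the integer `0`, the class must also handle `0 + pipeline`. A toy model of that mechanism (`TinyPipeline` is not Kedro's class, just a minimal sketch of the operator protocol):

```python
class TinyPipeline:
    """Toy model (not Kedro's Pipeline) showing why sum() can merge pipelines."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def __add__(self, other):
        return TinyPipeline(self.nodes + other.nodes)

    def __radd__(self, other):
        # sum() begins with the int 0, so 0 + pipeline must also work.
        if other == 0:
            return TinyPipeline(self.nodes)
        return NotImplemented


combined = sum([TinyPipeline(["clean"]), TinyPipeline(["aggregate"])])
print(combined.nodes)  # -> ['clean', 'aggregate']
```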
## Pipeline `expense_analysis` `__init__.py` (new file, +10 lines)

```python
"""
This is a boilerplate pipeline 'expense_analysis'
generated using Kedro 0.19.8
"""

from .pipeline import create_pipeline

__all__ = ["create_pipeline"]

__version__ = "0.1"
```

> **Review comment:** Can get rid of this folder entirely; it's generated by viz.