# Flake8-pyspark-with-column

[![Upload Python Package](https://github.com/SemyonSinchenko/flake8-pyspark-with-column/actions/workflows/python-publish.yml/badge.svg)](https://github.com/SemyonSinchenko/flake8-pyspark-with-column/actions/workflows/python-publish.yml) ![PyPI - Downloads](https://img.shields.io/pypi/dm/flake8-pyspark-with-column)

## Getting started

```sh
pip install flake8-pyspark-with-column
```

When you run a PySpark application, the following happens:
1. Spark builds an `Unresolved Logical Plan` from your transformations
2. Spark analyzes this plan to create an `Analyzed Logical Plan`
3. Spark applies optimization rules to create an `Optimized Logical Plan`

<p align="center">
<img src="https://www.databricks.com/wp-content/uploads/2018/05/Catalyst-Optimizer-diagram.png" alt="spark-flow" width="800" align="middle"/>
</p>
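
You can inspect each of these stages for any `DataFrame` via `explain`. A minimal sketch (the query itself is just a placeholder):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(3).withColumn("doubled", F.col("id") * 2)

# extended=True prints the parsed (unresolved), analyzed, and optimized
# logical plans, followed by the physical plan.
df.explain(extended=True)
```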

What is the problem with `withColumn`? Each call creates a single node in the unresolved plan, so calling `withColumn` 500 times produces an unresolved plan with 500 nodes. During analysis, Spark must visit each node to check that the column exists and has the right data type. After that, Spark starts applying optimization rules, and because each rule is applied to the plan recursively, a chain of 500 `withColumn` calls requires 500 applications of the corresponding rule. All of this can significantly increase the time it takes to get from the `Unresolved Logical Plan` to the `Optimized Logical Plan`:

<p align="center">
<img src="https://raw.githubusercontent.com/SemyonSinchenko/flake8-pyspark-with-column/refs/heads/main/static/with_column_performance.png" alt="benchmark" width="600" align="middle"/>
</p>

On the other hand, both `withColumns` and `select(*cols)` create only one node in the plan, no matter how many columns we add.
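
A minimal sketch of the difference, assuming illustrative column names and Spark >= 3.3 (required for `withColumns`):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# One plan node per call: 500 withColumn calls produce 500 nodes to analyze.
df_slow = spark.range(10)
for i in range(500):
    df_slow = df_slow.withColumn(f"col_{i}", F.lit(i))

# A single plan node, no matter how many columns are added.
df_fast = spark.range(10).withColumns({f"col_{i}": F.lit(i) for i in range(500)})
```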

## Rules
This plugin contains the following rules:
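
Each rule targets `withColumn`-style calls that grow the plan one node at a time. As an illustration, here is a hedged sketch of code the plugin would flag, built around the `cast_to_double` helper from the original example (the function bodies shown are assumptions):

```python
from pyspark.sql import DataFrame
from pyspark.sql.functions import col
from pyspark.sql.types import DecimalType, DoubleType


# Flagged pattern: withColumn inside a loop adds one plan node per call.
def cast_to_double(df: DataFrame) -> DataFrame:
    for field in df.schema.fields:
        if isinstance(field.dataType, DecimalType):
            df = df.withColumn(field.name, col(field.name).cast(DoubleType()))
    return df


# Preferred rewrite: a single withColumns call, a single plan node.
def cast_to_double_fixed(df: DataFrame) -> DataFrame:
    return df.withColumns(
        {
            field.name: col(field.name).cast(DoubleType())
            for field in df.schema.fields
            if isinstance(field.dataType, DecimalType)
        }
    )
```

To check your own code, run the plugin through flake8: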

`flake8 %your-code-here%`

<p align="center">
<img src="https://raw.githubusercontent.com/SemyonSinchenko/flake8-pyspark-with-column/refs/heads/main/static/usage.png" alt="screenshot of how it works" width="800" align="middle"/>
</p>
