Discord | Documentation | User Guide | Want to Contribute?
pip install polars-ds
PDS is a modern data science package that
- is fast and furious
- is small and lean, with minimal dependencies
- has an intuitive and concise API (if you already know Polars)
- has a dataframe-friendly design
- and covers a wide variety of data science topics, such as simple statistics, linear regression, string edit distances, tabular data transforms, feature extraction, traditional modelling pipelines, model evaluation metrics, etc.
It stands on the shoulders of the great Polars dataframe library. See the examples for more. Here are some highlights!
import polars as pl
import polars_ds as pds
# Parallel evaluation of multiple ML metrics on different segments of data
df.lazy().group_by("segments").agg(
    pds.query_roc_auc("actual", "predicted").alias("roc_auc"),
    pds.query_log_loss("actual", "predicted").alias("log_loss"),
).collect()
shape: (2, 3)
┌──────────┬──────────┬──────────┐
│ segments ┆ roc_auc  ┆ log_loss │
│ ---      ┆ ---      ┆ ---      │
│ str      ┆ f64      ┆ f64      │
╞══════════╪══════════╪══════════╡
│ a        ┆ 0.497745 ┆ 1.006438 │
│ b        ┆ 0.498801 ┆ 0.997226 │
└──────────┴──────────┴──────────┘
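For checking results like these on a small sample, here is a plain-Python reference sketch of what the two metrics compute per segment. It is not PDS's implementation: the toy rows are made up, and the probability clipping and the 0.5 credit for tied scores are assumptions of this sketch.

```python
import math

def log_loss(actual, predicted, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0) (assumed eps)."""
    total = 0.0
    for y, p in zip(actual, predicted):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(actual)

def roc_auc(actual, predicted):
    """Share of (positive, negative) pairs where the positive scores higher; ties count 0.5."""
    pos = [p for y, p in zip(actual, predicted) if y == 1]
    neg = [p for y, p in zip(actual, predicted) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy rows: (segment, actual, predicted)
rows = [
    ("a", 1, 0.9), ("a", 0, 0.2), ("a", 1, 0.4), ("a", 0, 0.6),
    ("b", 1, 0.7), ("b", 0, 0.3), ("b", 0, 0.8), ("b", 1, 0.6),
]

# Group the rows by segment, then evaluate both metrics on each group
by_segment = {}
for seg, y, p in rows:
    ys, ps = by_segment.setdefault(seg, ([], []))
    ys.append(y)
    ps.append(p)

metrics = {seg: (roc_auc(ys, ps), log_loss(ys, ps)) for seg, (ys, ps) in by_segment.items()}
```

The pairwise AUC loop is quadratic in the group size, so it is only meant for sanity checks on small inputs.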
Tabular Machine Learning Data Transformation Pipeline (see SKLEARN_COMPATIBILITY for more details):
import polars as pl
import polars.selectors as cs
from polars_ds.pipeline import Pipeline, Blueprint
bp = (
    Blueprint(df, name = "example", target = "approved", lowercase=True) # You can optionally
    .filter(
        "city_category is not null" # or equivalently, you can do: pl.col("city_category").is_not_null()
    )
    .linear_impute(features = ["var1", "existing_emi"], target = "loan_period")
    .impute(["existing_emi"], method = "median")
    .append_expr( # generate some features
        pl.col("existing_emi").log1p().alias("existing_emi_log1p"),
        pl.col("loan_amount").log1p().alias("loan_amount_log1p"),
        pl.col("loan_amount").clip(lower_bound = 0, upper_bound = 1000).alias("loan_amount_log1p_clipped"),
        pl.col("loan_amount").sqrt().alias("loan_amount_sqrt"),
        pl.col("loan_amount").shift(-1).alias("loan_amount_lag_1") # any kind of lag transform
    )
    .scale( # target is numerical, but will be excluded automatically because bp is initialized with a target
        cs.numeric().exclude(["var1", "existing_emi_log1p"]), method = "standard"
    ) # Scale the columns up to this point. The columns below won't be scaled
    .append_expr(
        # Add missing flags
        pl.col("employer_category1").is_null().cast(pl.UInt8).alias("employer_category1_is_missing")
    )
    .one_hot_encode("gender", drop_first=True)
    .woe_encode("city_category") # No need to specify target because we initialized bp with a target
    .target_encode("employer_category1", min_samples_leaf = 20, smoothing = 10.0) # same as above
)
print(bp)
pipe: Pipeline = bp.materialize()
# Check out the result in our example notebooks! (examples/pipeline.ipynb)
df_transformed = pipe.transform(df)
df_transformed.head()
Get all neighbors within radius r, call them best friends, and count them:
df.select(
    pl.col("id"),
    pds.query_radius_ptwise(
        pl.col("var1"), pl.col("var2"), pl.col("var3"), # Columns used as the coordinates in 3d space
        index = pl.col("id"),
        r = 0.1,
        dist = "sql2", # squared l2
        parallel = True
    ).alias("best friends"),
).with_columns( # -1 to remove the point itself
    (pl.col("best friends").list.len() - 1).alias("best friends count")
).head()
shape: (5, 3)
┌─────┬───────────────────┬────────────────────┐
│ id  ┆ best friends      ┆ best friends count │
│ --- ┆ ---               ┆ ---                │
│ u32 ┆ list[u32]         ┆ u32                │
╞═════╪═══════════════════╪════════════════════╡
│ 0   ┆ [0, 811, … 1435]  ┆ 152                │
│ 1   ┆ [1, 953, … 1723]  ┆ 159                │
│ 2   ┆ [2, 355, … 835]   ┆ 243                │
│ 3   ┆ [3, 102, … 1129]  ┆ 110                │
│ 4   ┆ [4, 1280, … 1543] ┆ 226                │
└─────┴───────────────────┴────────────────────┘
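To see why the count subtracts 1, here is a brute-force plain-Python sketch of a radius query. It is only a reference for small inputs, not how PDS does it. Two assumptions of this sketch: the radius is compared directly against the squared L2 distance, and the boundary is inclusive. The points are made up, and neighbors come back in input order.

```python
def radius_neighbors(points, ids, r):
    """For each point, collect the ids of all points whose squared L2 distance is <= r."""
    result = {}
    for i, p in zip(ids, points):
        friends = []
        for j, q in zip(ids, points):
            sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))
            if sq_dist <= r:  # every point matches itself, since its distance to itself is 0
                friends.append(j)
        result[i] = friends
    return result

# Hypothetical 3d points
points = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.2, 0.0), (1.0, 1.0, 1.0)]
ids = [0, 1, 2, 3]

friends = radius_neighbors(points, ids, r=0.045)
# Subtract 1 so that a point is not counted as its own neighbor
counts = {i: len(f) - 1 for i, f in friends.items()}
```

Point 3 is far from everything, so its list holds only itself and its count is 0.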
Run a linear regression on each category:
df = pds.random_data(size=5_000, n_cols=0).select(
    pds.random(0.0, 1.0).alias("x1"),
    pds.random(0.0, 1.0).alias("x2"),
    pds.random(0.0, 1.0).alias("x3"),
    pds.random_int(0, 3).alias("categories")
).with_columns(
    y = pl.col("x1") * 0.5 + pl.col("x2") * 0.25 - pl.col("x3") * 0.15 + pds.random() * 0.0001
)
df.group_by("categories").agg(
    pds.lin_reg(
        "x1", "x2", "x3",
        target = "y",
        method = "l2",
        l2_reg = 0.05,
        add_bias = False
    ).alias("coeffs")
)
shape: (3, 2)
┌────────────┬─────────────────────────────────┐
│ categories ┆ coeffs                          │
│ ---        ┆ ---                             │
│ i32        ┆ list[f64]                       │
╞════════════╪═════════════════════════════════╡
│ 0          ┆ [0.499912, 0.250005, -0.149846… │
│ 1          ┆ [0.499922, 0.250004, -0.149856… │
│ 2          ┆ [0.499923, 0.250004, -0.149855… │
└────────────┴─────────────────────────────────┘
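The coefficients above land close to the generating values 0.5, 0.25 and -0.15. For intuition, here is a small plain-Python sketch of L2-regularized least squares without a bias term, solved through the normal equations. It is a textbook formulation, not PDS's solver, and it is an assumption that `l2_reg` corresponds to the penalty used here with the same scaling.

```python
import random

def ridge_fit(X, y, l2_reg=0.0):
    """Solve (X^T X + l2_reg * I) beta = X^T y by Gauss-Jordan elimination; no bias term."""
    n_features = len(X[0])
    # Augmented normal-equation matrix [X^T X + l2_reg * I | X^T y]
    A = [
        [sum(row[i] * row[j] for row in X) + (l2_reg if i == j else 0.0) for j in range(n_features)]
        + [sum(row[i] * t for row, t in zip(X, y))]
        for i in range(n_features)
    ]
    for col in range(n_features):
        # Partial pivoting: move the row with the largest entry in this column into place
        pivot = max(range(col, n_features), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        # Eliminate this column from every other row
        for r in range(n_features):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n_features] / A[i][i] for i in range(n_features)]

# Hypothetical noise-free data with the same generating coefficients as the example above
rng = random.Random(0)
X = [[rng.random(), rng.random(), rng.random()] for _ in range(200)]
y = [0.5 * x1 + 0.25 * x2 - 0.15 * x3 for x1, x2, x3 in X]

coeffs = ridge_fit(X, y, l2_reg=0.0)  # no penalty: recovers the generating coefficients
shrunk = ridge_fit(X, y, l2_reg=5.0)  # a positive penalty shrinks the coefficient vector
```

Running this once per category is the plain-Python analogue of the `group_by(...).agg(...)` call above.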
Various String Edit distances
df.select( # Column "word", compared to string in pl.lit(). It also supports column vs column comparison
    pds.str_leven("word", pl.lit("asasasa"), return_sim=True).alias("Levenshtein"),
    pds.str_osa("word", pl.lit("apples"), return_sim=True).alias("Optimal String Alignment"),
    pds.str_jw("word", pl.lit("apples")).alias("Jaro-Winkler"),
)
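As a reference for what the Levenshtein column measures, here is the classic dynamic-programming distance in plain Python. PDS relies on RapidFuzz for the fast version. The similarity helper here is a sketch that assumes `return_sim=True` normalizes by the longer string's length, and that detail is not confirmed by this README.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free when the characters match
            ))
        prev = curr
    return prev[-1]

def levenshtein_sim(a, b):
    """Similarity in [0, 1]; dividing by the longer length is an assumption of this sketch."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

distance = levenshtein("kitten", "sitting")
similarity = levenshtein_sim("apples", "apple")
```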
In-dataframe statistical tests
df.group_by("market_id").agg(
    pds.ttest_ind("var1", "var2", equal_var=False).alias("t-test"),
    pds.chi2("category_1", "category_2").alias("chi2-test"),
    pds.f_test("var1", group = "category_1").alias("f-test")
)
shape: (3, 4)
┌───────────┬──────────────────────┬──────────────────────┬─────────────────────┐
│ market_id ┆ t-test               ┆ chi2-test            ┆ f-test              │
│ ---       ┆ ---                  ┆ ---                  ┆ ---                 │
│ i64       ┆ struct[2]            ┆ struct[2]            ┆ struct[2]           │
╞═══════════╪══════════════════════╪══════════════════════╪═════════════════════╡
│ 0         ┆ {2.072749,0.038272}  ┆ {33.487634,0.588673} ┆ {0.312367,0.869842} │
│ 1         ┆ {0.469946,0.638424}  ┆ {42.672477,0.206119} ┆ {2.148937,0.072536} │
│ 2         ┆ {-1.175325,0.239949} ┆ {28.55723,0.806758}  ┆ {0.506678,0.730849} │
└───────────┴──────────────────────┴──────────────────────┴─────────────────────┘
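For readers unsure what `equal_var=False` changes, here is a plain-Python sketch of the Welch (unequal-variance) t statistic and its approximate degrees of freedom. This is the textbook definition, assumed to match what that option selects. The sample values are made up, and the p-value is left out because it needs the t-distribution CDF.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom for two samples."""
    va = variance(a) / len(a)  # squared standard error of each sample mean
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof

# Hypothetical samples with clearly different variances
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

With equal variances assumed, the two sample variances would be pooled instead and the degrees of freedom would simply be `len(a) + len(b) - 2`.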
Multiple Convolutions at once!
# Multiple Convolutions at once
# Modes: `same`, `left` (left-aligned same), `right` (right-aligned same), `valid` or `full`
# Method: `fft`, `direct`
# Currently slower than SciPy but provides parallelism because of Polars
df.select(
    pds.convolve("f", [-1, 0, 0, 0, 1], mode = "full", method = "fft"), # column f with the kernel given here
    pds.convolve("a", [-1, 0, 0, 0, 1], mode = "full", method = "direct"),
    pds.convolve("b", [-1, 0, 0, 0, 1], mode = "full", method = "direct"),
).head()
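As a reference for the `full` and `valid` modes, here is a direct (non-FFT) convolution in plain Python, using the standard definition of discrete convolution. It is a sketch for small inputs, not PDS's implementation, and it does not try to reproduce how `same`, `left` and `right` are aligned.

```python
def convolve_full(signal, kernel):
    """Direct discrete convolution; the output has len(signal) + len(kernel) - 1 entries."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out

def convolve_valid(signal, kernel):
    """Keep only positions where the kernel fully overlaps the signal (signal at least as long)."""
    full = convolve_full(signal, kernel)
    return full[len(kernel) - 1 : len(signal)]

kernel = [-1, 0, 0, 0, 1]  # same kernel as in the example above
full = convolve_full([1, 2, 3], kernel)
valid = convolve_valid([1, 2, 3, 4, 5, 6], kernel)
```

With this kernel, each fully overlapping output is `x[m - 4] - x[m]`, so a steadily increasing signal gives a constant negative value in `valid`.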
And more!
import polars_ds as pds
To make full use of the Diagnosis module, do
pip install "polars_ds[plot]"
Feel free to take a look at our benchmark notebook!
Generally speaking, the more expressions you want to evaluate simultaneously, the faster Polars + PDS will be than Pandas + (SciPy / Sklearn / NumPy). The more CPU cores you have on your machine, the bigger the time difference will be in favor of Polars + PDS.
Why does speed matter?
If your code already executes in under 1s and you only use it in non-production, ad-hoc environments, then maybe it doesn't. Even so, as your data grows, a 5s run vs. a 1s run makes a big difference to how quickly you can iterate on a project. Speed of execution becomes a bigger issue if you are building reports on demand, if you need to pay extra for additional compute, or if you have a production pipeline that has to deliver data under a time constraint.
- Documentation writing, Doc Review, and Benchmark preparation
- K-means, K-medoids clustering as expressions and also standalone modules.
- Other improvement items. See issues.
Currently in Beta. Feel free to submit feature requests in the issues section of the repo. This library will only depend on Python Polars (for most of its core) and will try to be as stable as possible for polars>=1. Exceptions will be made when a Polars update forces changes in the plugins.
This package is not tested with Polars streaming mode and is not designed to work with data so big that it has to be streamed.
- Rust Snowball Stemmer is taken from Tsoding's Seroost project (MIT). See here
- Some statistics functions are taken from Statrs (MIT) and internalized. See here
- Linear algebra routines are powered partly by faer
- String similarity metrics are soooo fast because of RapidFuzz
- Take a look at our friendly neighbor functime