v3.1.0
What's Changed
Warning
Version 3.1.0 makes a small, backwards-incompatible API change to the SparkLinker, which is a minor violation of semver.
The changes affect the SparkLinker only:
- The default `break_lineage_method` changes to `parquet`
- The `break_lineage_after_blocking` parameter is renamed to `repartition_after_blocking` for clarity
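For code that builds SparkLinker keyword arguments programmatically, the rename can be handled with a small shim. The following is a minimal sketch, not part of Splink: `migrate_spark_linker_kwargs` and its `pin_break_lineage_method` argument are hypothetical names introduced here for illustration, and the value `"persist"` in the usage lines is only an example, so check which `break_lineage_method` values your installed version accepts.

```python
def migrate_spark_linker_kwargs(kwargs, pin_break_lineage_method=None):
    """Hypothetical helper (not part of Splink): translate keyword
    arguments written for SparkLinker before 3.1.0 to the 3.1.0 names."""
    old_name = "break_lineage_after_blocking"
    new_name = "repartition_after_blocking"

    migrated = dict(kwargs)  # copy so the caller's dict is not mutated

    # Rename the parameter, refusing ambiguous input that has both names.
    if old_name in migrated:
        if new_name in migrated:
            raise ValueError(f"Pass only one of {old_name!r} and {new_name!r}")
        migrated[new_name] = migrated.pop(old_name)

    # The default break_lineage_method changes to "parquet" in 3.1.0.
    # Optionally pin a method explicitly so behaviour does not change
    # silently on upgrade; a value already set by the caller is kept.
    if pin_break_lineage_method is not None:
        migrated.setdefault("break_lineage_method", pin_break_lineage_method)

    return migrated


# Usage: keyword arguments as they might have been written before 3.1.0.
old_kwargs = {"break_lineage_after_blocking": True}
new_kwargs = migrate_spark_linker_kwargs(
    old_kwargs, pin_break_lineage_method="persist"  # illustrative value
)
print(new_kwargs)
```

The resulting dictionary could then be unpacked into the SparkLinker constructor alongside the input tables and settings. Pinning the method explicitly is a design choice: it keeps existing jobs behaving as before, at the cost of not picking up the new `parquet` default.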
Features
- Add the ability to use pyarrow + on-disk parquet/csv in duckdb by @ThomasHepworth in #684
- Add completeness (by dataset) chart by @samnlindsay in #669
- Add cumulative blocking rule comparison chart by @ThomasHepworth in #660
- Allow `find_matches_to_new_records` to take a table name as input, in addition to rows, by @RobinL in #659
Bugfixes
- remove duplicate column selections by @ThomasHepworth in #681
- fix em training tooltip by @ThomasHepworth in #665
Maintenance
- [MAINT] Clarify sql execution function names by @RobinL in #690
- [MAINT] Clarify Spark Linker caching logic by @RobinL in #691
- [MAINT] Bump version to 3.1.0 by @RobinL in #693
- Fix code formatting on `count_num_comparisons_from_blocking_rules_for_prediction` by @RobinL in #661
- Add salting to spark full test by @RobinL in #655
Docs
- Improve customising comparisons topic guide by @RobinL in #667
- [DOCS] Performance topic guide, covering blocking by @RobinL in #675
- [docs] Add issue template for bug report by @RobinL in #676
- [DOCS] Add topic guide for optimising spark jobs by @RobinL in #679
- [DOCS] Fix problem with spark docs copy by @RobinL in #685
- [Docs] Developers' guide to caching and pipelining by @RobinL in #686
- [Docs] Developer guide: Understanding and debugging Splink's computations by @RobinL in #688
- [DOCS] Developers' guide to spark caching and pipelining by @RobinL in #689
Full Changelog: v3.0.1...v3.1.0