fix: Fix performance regression for DataFrame serialization/pickling #20641

nameexhaustion · 2025-01-09T09:55:56Z

Fixes #20605

PR notes

Moved serialization logic out of the impl serde::ser::Serialize for DataFrame to DataFrame::serialize_into_reader etc.
- This allows us to serialize/deserialize directly without going through serde - we now use this when users serialize a DataFrame from Python.
- In general, serializing through serde causes an extra memcopy that we can't avoid (there's no way to ask a Serializer for provide a raw buffer that we can write into)
Introduces a deserialize_map_bytes utility function to reduce unnecessary memcopy during deserialization. This replaces some existing code that unnecessarily copied to an owned Vec<u8> during deserialization.

Performance testing

(build-debug-release)

shape = (1000000, 19) (yellow_tripdata_2015-01_head1m.csv)
┌──────────────┬─────────────┬───────────────────┬─────────────────────────────┬─────────────────────────────────┐
│ Object       ┆ Operation   ┆ Method            ┆ Runtime (s)                 ┆ Size (bytes)                    │
╞══════════════╪═════════════╪═══════════════════╪═════════════════════════════╪═════════════════════════════════╡
│              ┆             ┆                   ┆         (%1.17.1) (%1.19.0) ┆             (%1.17.1) (%1.19.0) │
│ DataFrame    ┆ serialize   ┆ pickle            ┆ 0.065   (97.0%  ) (14.0%  ) ┆ 214_010_205 (100.0% ) (860.0% ) │
│ DataFrame    ┆ deserialize ┆ pickle            ┆ 0.063   (92.0%  ) (28.0%  ) ┆                                 │
│ DataFrame    ┆ serialize   ┆ cls.serialize()   ┆ 0.064   (41.0%  ) (13.0%  ) ┆ 214_010_144 (200.0% ) (860.0% ) │
│ DataFrame    ┆ deserialize ┆ cls.deserialize() ┆ 0.049   (7.8%   ) (23.0%  ) ┆                                 │
│ LazyFrame    ┆ serialize   ┆ pickle            ┆ 0.08    (49.0%  ) (17.0%  ) ┆ 214_010_712 (200.0% ) (860.0% ) │
│ LazyFrame    ┆ deserialize ┆ pickle            ┆ 0.084   (15.0%  ) (38.0%  ) ┆                                 │
│ LazyFrame    ┆ serialize   ┆ cls.serialize()   ┆ 0.078   (50.0%  ) (16.0%  ) ┆ 214_010_651 (200.0% ) (860.0% ) │
│ LazyFrame    ┆ deserialize ┆ cls.deserialize() ┆ 0.069   (11.0%  ) (31.0%  ) ┆                                 │
│ list[Series] ┆ serialize   ┆ pickle            ┆ 0.069   (86.0%  ) (15.0%  ) ┆ 214_019_863 (100.0% ) (860.0% ) │
│ list[Series] ┆ deserialize ┆ pickle            ┆ 0.049   (73.0%  ) (23.0%  ) ┆                                 │
└──────────────┴─────────────┴───────────────────┴─────────────────────────────┴─────────────────────────────────┘
shape = (4000000, 16) (randints Int64)
┌──────────────┬─────────────┬───────────────────┬─────────────────────────────┬─────────────────────────────────┐
│ Object       ┆ Operation   ┆ Method            ┆ Runtime (s)                 ┆ Size (bytes)                    │
╞══════════════╪═════════════╪═══════════════════╪═════════════════════════════╪═════════════════════════════════╡
│              ┆             ┆                   ┆         (%1.17.1) (%1.19.0) ┆             (%1.17.1) (%1.19.0) │
│ DataFrame    ┆ serialize   ┆ pickle            ┆ 0.2     (140.0% ) (4.9%   ) ┆ 512_025_661 (100.0% ) (160.0% ) │
│ DataFrame    ┆ deserialize ┆ pickle            ┆ 0.1     (89.0%  ) (6.7%   ) ┆                                 │
│ DataFrame    ┆ serialize   ┆ cls.serialize()   ┆ 0.07    (10.0%  ) (1.7%   ) ┆ 512_025_600 (160.0% ) (160.0% ) │
│ DataFrame    ┆ deserialize ┆ cls.deserialize() ┆ 0.087   (4.9%   ) (5.7%   ) ┆                                 │
│ LazyFrame    ┆ serialize   ┆ pickle            ┆ 0.26    (36.0%  ) (6.3%   ) ┆ 512_025_959 (160.0% ) (160.0% ) │
│ LazyFrame    ┆ deserialize ┆ pickle            ┆ 0.16    (11.0%  ) (11.0%  ) ┆                                 │
│ LazyFrame    ┆ serialize   ┆ cls.serialize()   ┆ 0.15    (21.0%  ) (3.6%   ) ┆ 512_025_898 (160.0% ) (160.0% ) │
│ LazyFrame    ┆ deserialize ┆ cls.deserialize() ┆ 0.14    (7.7%   ) (9.0%   ) ┆                                 │
│ list[Series] ┆ serialize   ┆ pickle            ┆ 0.16    (120.0% ) (3.8%   ) ┆ 512_049_408 (100.0% ) (160.0% ) │
│ list[Series] ┆ deserialize ┆ pickle            ┆ 0.086   (75.0%  ) (5.6%   ) ┆                                 │
└──────────────┴─────────────┴───────────────────┴─────────────────────────────┴─────────────────────────────────┘

*Note: For % comparisons, lower is better

Observations

A lot of the slowdown in runtime is due to enabling compression - this step has now been removed:
- The serialized size is much larger compared to v1.19, but the runtime is drastically reduced. Both metrics are now more in line with what we had in v1.17.1
The extra overhead of serde can be observed by comparing the runtime of the DataFrame serialization with the the runtime of the corresponding LazyFrame serialization (LazyFrames must use serde as they serialize through a DslPlan)

nameexhaustion · 2025-01-09T09:56:28Z

crates/polars-core/src/frame/chunks.rs

+/// maximum number of rows per chunk to ensure reasonable memory efficiency when
+/// reading the resulting file, and a minimum size per chunk to ensure
+/// reasonable performance when writing.
+pub fn chunk_df_for_writing(


Code is moved from polars-io

nameexhaustion · 2025-01-09T09:57:30Z

crates/polars-core/src/frame/mod.rs

-
-    #[cfg(feature = "serde")]
-    #[test]
-    fn test_deserialize_height_validation_8751() {


Test no longer works as serialization directly errors now on mismatching height

nameexhaustion · 2025-01-09T10:15:27Z

crates/polars-core/src/serde/df.rs

-            D::Error::custom::<PolarsError>(e)
-        })
+impl DataFrame {
+    pub fn serialize_into_writer(&mut self, writer: &mut dyn std::io::Write) -> PolarsResult<()> {


Serialization logic moved to impl DataFrame rather than on impl Series.

nameexhaustion · 2025-01-09T10:16:46Z

crates/polars-core/src/serde/df.rs

+            .serialize_into_writer(&mut bytes)
+            .map_err(S::Error::custom)?;
+
+        serializer.serialize_bytes(bytes.as_slice())


This is where the extra memcopy happens when going through serde - we are calling Serializer::serialize_bytes(bytes: &[u8])

nameexhaustion · 2025-01-09T10:22:34Z

crates/polars-python/src/dataframe/serde.rs

-        }
+        py.allow_threads(|| {
+            self.df
+                .serialize_into_writer(&mut writer)


Bypass serde when serializing DataFrame - this is essentially what we did in older versions

codecov · 2025-01-09T11:11:53Z

Codecov Report

Attention: Patch coverage is 81.74387% with 67 lines in your changes missing coverage. Please review.

Project coverage is 79.01%. Comparing base (247f0b1) to head (090c99b).
Report is 5 commits behind head on main.

Files with missing lines	Patch %	Lines
crates/polars-core/src/serde/df.rs	80.00%	23 Missing ⚠️
crates/polars-plan/src/dsl/expr_dyn_fn.rs	38.09%	13 Missing ⚠️
crates/polars-utils/src/pl_serialize.rs	73.68%	10 Missing ⚠️
crates/polars-utils/src/python_function.rs	75.00%	9 Missing ⚠️
crates/polars-core/src/serde/series.rs	84.37%	5 Missing ⚠️
...quet/src/parquet/metadata/column_chunk_metadata.rs	0.00%	5 Missing ⚠️
crates/polars-core/src/frame/chunks.rs	98.03%	1 Missing ⚠️
crates/polars-core/src/frame/column/mod.rs	87.50%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #20641      +/-   ##
==========================================
- Coverage   79.03%   79.01%   -0.03%     
==========================================
  Files        1557     1557              
  Lines      220528   220547      +19     
  Branches     2510     2513       +3     
==========================================
- Hits       174303   174272      -31     
- Misses      45651    45702      +51     
+ Partials      574      573       -1

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

datenzauberai · 2025-01-09T18:10:31Z

@nameexhaustion Timings look great, thank you for all the effort!

c

a57d59e

github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Jan 9, 2025

nameexhaustion commented Jan 9, 2025

View reviewed changes

nameexhaustion added 2 commits January 9, 2025 21:03

c

3f49ba7

c

442cf4c

nameexhaustion commented Jan 9, 2025

View reviewed changes

nameexhaustion added 3 commits January 9, 2025 21:24

c

c0905ed

c

c3df806

c

090c99b

nameexhaustion added the performance Performance issues or improvements label Jan 9, 2025

nameexhaustion marked this pull request as ready for review January 9, 2025 11:41

nameexhaustion requested review from ritchie46, c-peters, alexander-beedie, MarcoGorelli, reswqa, orlp and coastalwhite as code owners January 9, 2025 11:41

ritchie46 merged commit 6cd9988 into pola-rs:main Jan 9, 2025
42 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Fix performance regression for DataFrame serialization/pickling #20641

fix: Fix performance regression for DataFrame serialization/pickling #20641

nameexhaustion commented Jan 9, 2025 •

edited

Loading

nameexhaustion Jan 9, 2025

nameexhaustion Jan 9, 2025

nameexhaustion Jan 9, 2025 •

edited

Loading

nameexhaustion Jan 9, 2025 •

edited

Loading

nameexhaustion Jan 9, 2025

codecov bot commented Jan 9, 2025 •

edited

Loading

datenzauberai commented Jan 9, 2025

fix: Fix performance regression for DataFrame serialization/pickling #20641

fix: Fix performance regression for DataFrame serialization/pickling #20641

Conversation

nameexhaustion commented Jan 9, 2025 • edited Loading

PR notes

Performance testing

Observations

nameexhaustion Jan 9, 2025

Choose a reason for hiding this comment

nameexhaustion Jan 9, 2025

Choose a reason for hiding this comment

nameexhaustion Jan 9, 2025 • edited Loading

Choose a reason for hiding this comment

nameexhaustion Jan 9, 2025 • edited Loading

Choose a reason for hiding this comment

nameexhaustion Jan 9, 2025

Choose a reason for hiding this comment

codecov bot commented Jan 9, 2025 • edited Loading

Codecov Report

datenzauberai commented Jan 9, 2025

nameexhaustion commented Jan 9, 2025 •

edited

Loading

nameexhaustion Jan 9, 2025 •

edited

Loading

nameexhaustion Jan 9, 2025 •

edited

Loading

codecov bot commented Jan 9, 2025 •

edited

Loading