Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Fix performance regression for DataFrame serialization/pickling #20641

Merged
merged 6 commits into from
Jan 9, 2025

Conversation

nameexhaustion
Copy link
Collaborator

@nameexhaustion nameexhaustion commented Jan 9, 2025

Fixes #20605

PR notes

  • Moved serialization logic out of the impl serde::ser::Serialize for DataFrame to DataFrame::serialize_into_reader etc.
    • This allows us to serialize/deserialize directly without going through serde - we now use this when users serialize a DataFrame from Python.
    • In general, serializing through serde causes an extra memcopy that we can't avoid (there's no way to ask a Serializer for provide a raw buffer that we can write into)
  • Introduces a deserialize_map_bytes utility function to reduce unnecessary memcopy during deserialization. This replaces some existing code that unnecessarily copied to an owned Vec<u8> during deserialization.

Performance testing

(build-debug-release)

shape = (1000000, 19) (yellow_tripdata_2015-01_head1m.csv)
┌──────────────┬─────────────┬───────────────────┬─────────────────────────────┬─────────────────────────────────┐
│ Object       ┆ Operation   ┆ Method            ┆ Runtime (s)                 ┆ Size (bytes)                    │
╞══════════════╪═════════════╪═══════════════════╪═════════════════════════════╪═════════════════════════════════╡
│              ┆             ┆                   ┆         (%1.17.1) (%1.19.0) ┆             (%1.17.1) (%1.19.0) │
│ DataFrame    ┆ serialize   ┆ pickle            ┆ 0.065   (97.0%  ) (14.0%  ) ┆ 214_010_205 (100.0% ) (860.0% ) │
│ DataFrame    ┆ deserialize ┆ pickle            ┆ 0.063   (92.0%  ) (28.0%  ) ┆                                 │
│ DataFrame    ┆ serialize   ┆ cls.serialize()   ┆ 0.064   (41.0%  ) (13.0%  ) ┆ 214_010_144 (200.0% ) (860.0% ) │
│ DataFrame    ┆ deserialize ┆ cls.deserialize() ┆ 0.049   (7.8%   ) (23.0%  ) ┆                                 │
│ LazyFrame    ┆ serialize   ┆ pickle            ┆ 0.08    (49.0%  ) (17.0%  ) ┆ 214_010_712 (200.0% ) (860.0% ) │
│ LazyFrame    ┆ deserialize ┆ pickle            ┆ 0.084   (15.0%  ) (38.0%  ) ┆                                 │
│ LazyFrame    ┆ serialize   ┆ cls.serialize()   ┆ 0.078   (50.0%  ) (16.0%  ) ┆ 214_010_651 (200.0% ) (860.0% ) │
│ LazyFrame    ┆ deserialize ┆ cls.deserialize() ┆ 0.069   (11.0%  ) (31.0%  ) ┆                                 │
│ list[Series] ┆ serialize   ┆ pickle            ┆ 0.069   (86.0%  ) (15.0%  ) ┆ 214_019_863 (100.0% ) (860.0% ) │
│ list[Series] ┆ deserialize ┆ pickle            ┆ 0.049   (73.0%  ) (23.0%  ) ┆                                 │
└──────────────┴─────────────┴───────────────────┴─────────────────────────────┴─────────────────────────────────┘
shape = (4000000, 16) (randints Int64)
┌──────────────┬─────────────┬───────────────────┬─────────────────────────────┬─────────────────────────────────┐
│ Object       ┆ Operation   ┆ Method            ┆ Runtime (s)                 ┆ Size (bytes)                    │
╞══════════════╪═════════════╪═══════════════════╪═════════════════════════════╪═════════════════════════════════╡
│              ┆             ┆                   ┆         (%1.17.1) (%1.19.0) ┆             (%1.17.1) (%1.19.0) │
│ DataFrame    ┆ serialize   ┆ pickle            ┆ 0.2     (140.0% ) (4.9%   ) ┆ 512_025_661 (100.0% ) (160.0% ) │
│ DataFrame    ┆ deserialize ┆ pickle            ┆ 0.1     (89.0%  ) (6.7%   ) ┆                                 │
│ DataFrame    ┆ serialize   ┆ cls.serialize()   ┆ 0.07    (10.0%  ) (1.7%   ) ┆ 512_025_600 (160.0% ) (160.0% ) │
│ DataFrame    ┆ deserialize ┆ cls.deserialize() ┆ 0.087   (4.9%   ) (5.7%   ) ┆                                 │
│ LazyFrame    ┆ serialize   ┆ pickle            ┆ 0.26    (36.0%  ) (6.3%   ) ┆ 512_025_959 (160.0% ) (160.0% ) │
│ LazyFrame    ┆ deserialize ┆ pickle            ┆ 0.16    (11.0%  ) (11.0%  ) ┆                                 │
│ LazyFrame    ┆ serialize   ┆ cls.serialize()   ┆ 0.15    (21.0%  ) (3.6%   ) ┆ 512_025_898 (160.0% ) (160.0% ) │
│ LazyFrame    ┆ deserialize ┆ cls.deserialize() ┆ 0.14    (7.7%   ) (9.0%   ) ┆                                 │
│ list[Series] ┆ serialize   ┆ pickle            ┆ 0.16    (120.0% ) (3.8%   ) ┆ 512_049_408 (100.0% ) (160.0% ) │
│ list[Series] ┆ deserialize ┆ pickle            ┆ 0.086   (75.0%  ) (5.6%   ) ┆                                 │
└──────────────┴─────────────┴───────────────────┴─────────────────────────────┴─────────────────────────────────┘

*Note: For % comparisons, lower is better

Observations
  • A lot of the slowdown in runtime is due to enabling compression - this step has now been removed:
    • The serialized size is much larger compared to v1.19, but the runtime is drastically reduced. Both metrics are now more in line with what we had in v1.17.1
  • The extra overhead of serde can be observed by comparing the runtime of the DataFrame serialization with the the runtime of the corresponding LazyFrame serialization (LazyFrames must use serde as they serialize through a DslPlan)

@github-actions github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Jan 9, 2025
/// maximum number of rows per chunk to ensure reasonable memory efficiency when
/// reading the resulting file, and a minimum size per chunk to ensure
/// reasonable performance when writing.
pub fn chunk_df_for_writing(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code is moved from polars-io


#[cfg(feature = "serde")]
#[test]
fn test_deserialize_height_validation_8751() {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test no longer works as serialization directly errors now on mismatching height

D::Error::custom::<PolarsError>(e)
})
impl DataFrame {
pub fn serialize_into_writer(&mut self, writer: &mut dyn std::io::Write) -> PolarsResult<()> {
Copy link
Collaborator Author

@nameexhaustion nameexhaustion Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Serialization logic moved to impl DataFrame rather than on impl Series.

.serialize_into_writer(&mut bytes)
.map_err(S::Error::custom)?;

serializer.serialize_bytes(bytes.as_slice())
Copy link
Collaborator Author

@nameexhaustion nameexhaustion Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is where the extra memcopy happens when going through serde - we are calling Serializer::serialize_bytes(bytes: &[u8])

}
py.allow_threads(|| {
self.df
.serialize_into_writer(&mut writer)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bypass serde when serializing DataFrame - this is essentially what we did in older versions

@nameexhaustion nameexhaustion added the performance Performance issues or improvements label Jan 9, 2025
Copy link

codecov bot commented Jan 9, 2025

Codecov Report

Attention: Patch coverage is 81.74387% with 67 lines in your changes missing coverage. Please review.

Project coverage is 79.01%. Comparing base (247f0b1) to head (090c99b).
Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
crates/polars-core/src/serde/df.rs 80.00% 23 Missing ⚠️
crates/polars-plan/src/dsl/expr_dyn_fn.rs 38.09% 13 Missing ⚠️
crates/polars-utils/src/pl_serialize.rs 73.68% 10 Missing ⚠️
crates/polars-utils/src/python_function.rs 75.00% 9 Missing ⚠️
crates/polars-core/src/serde/series.rs 84.37% 5 Missing ⚠️
...quet/src/parquet/metadata/column_chunk_metadata.rs 0.00% 5 Missing ⚠️
crates/polars-core/src/frame/chunks.rs 98.03% 1 Missing ⚠️
crates/polars-core/src/frame/column/mod.rs 87.50% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #20641      +/-   ##
==========================================
- Coverage   79.03%   79.01%   -0.03%     
==========================================
  Files        1557     1557              
  Lines      220528   220547      +19     
  Branches     2510     2513       +3     
==========================================
- Hits       174303   174272      -31     
- Misses      45651    45702      +51     
+ Partials      574      573       -1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@nameexhaustion nameexhaustion marked this pull request as ready for review January 9, 2025 11:41
@ritchie46 ritchie46 merged commit 6cd9988 into pola-rs:main Jan 9, 2025
42 checks passed
@datenzauberai
Copy link
Contributor

@nameexhaustion Timings look great, thank you for all the effort!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fix Bug fix performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Performance regression for dumping/loading DataFrames
3 participants