-
-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Fix performance regression for DataFrame serialization/pickling #20641
Conversation
/// maximum number of rows per chunk to ensure reasonable memory efficiency when | ||
/// reading the resulting file, and a minimum size per chunk to ensure | ||
/// reasonable performance when writing. | ||
pub fn chunk_df_for_writing( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code is moved from polars-io
|
||
#[cfg(feature = "serde")] | ||
#[test] | ||
fn test_deserialize_height_validation_8751() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test no longer works as serialization directly errors now on mismatching height
D::Error::custom::<PolarsError>(e) | ||
}) | ||
impl DataFrame { | ||
pub fn serialize_into_writer(&mut self, writer: &mut dyn std::io::Write) -> PolarsResult<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Serialization logic moved to impl DataFrame
rather than on impl Series
.
.serialize_into_writer(&mut bytes) | ||
.map_err(S::Error::custom)?; | ||
|
||
serializer.serialize_bytes(bytes.as_slice()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is where the extra memcopy happens when going through serde - we are calling Serializer::serialize_bytes(bytes: &[u8])
} | ||
py.allow_threads(|| { | ||
self.df | ||
.serialize_into_writer(&mut writer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bypass serde when serializing DataFrame
- this is essentially what we did in older versions
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #20641 +/- ##
==========================================
- Coverage 79.03% 79.01% -0.03%
==========================================
Files 1557 1557
Lines 220528 220547 +19
Branches 2510 2513 +3
==========================================
- Hits 174303 174272 -31
- Misses 45651 45702 +51
+ Partials 574 573 -1 ☔ View full report in Codecov by Sentry. |
@nameexhaustion Timings look great, thank you for all the effort! |
Fixes #20605
PR notes
impl serde::ser::Serialize for DataFrame
toDataFrame::serialize_into_reader
etc.DataFrame
from Python.Serializer
for provide a raw buffer that we can write into)deserialize_map_bytes
utility function to reduce unnecessary memcopy during deserialization. This replaces some existing code that unnecessarily copied to an ownedVec<u8>
during deserialization.Performance testing
(build-debug-release)
*Note: For % comparisons, lower is better
Observations
DataFrame
serialization with the the runtime of the correspondingLazyFrame
serialization (LazyFrames must use serde as they serialize through aDslPlan
)