Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replace UUIDs, implement batch insert #7

Merged
merged 6 commits into from
Jul 29, 2024

Conversation

matt-codecov
Copy link
Collaborator

@matt-codecov matt-codecov commented Jul 24, 2024

you can look at the individual commits to make reviewing easier

there are four main changes in this stack:

  • move insert logic to a new Insertable trait
  • the Upload context type / UploadDetails table is replaced with a dedicated RawUpload model
  • replace UUIDs with a joint PK
  • implement batch insert

as a result, my sample report went from taking 30s to process to taking 8s

bulk inserting UUIDs is painful because you can't benefit from any locality when updating indices for each new record. measurement models now have a joint PK instead: raw_upload_id, which is effectively constant, and local_*_id, which is auto-incremented. it's not a UUID anymore, but the new PK is still unique enough that we can merge reports by blindly concatenating tables, and it's much faster to insert

there's more to do for perf; journaling/synchronization, WITHOUT ROWID, blah blah blah

sorry this is a mess; i started this stack before anyone else would be looking at this project. with @Swatinem coming aboard, i'll organize changes to be more reviewable in the future

Copy link
Contributor

@Swatinem Swatinem left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking at individual commits, and left a bunch of suggestions and questions scattered around.
Some might be irrelevant given more context in followup commits though.

src/report/sqlite/models.rs Outdated Show resolved Hide resolved
Comment on lines +58 to +57
pub trait Insertable<const FIELD_COUNT: usize> {
const INSERT_QUERY_PRELUDE: &'static str;

const INSERT_PLACEHOLDER: &'static str;

fn param_bindings(&self) -> [&dyn rusqlite::ToSql; FIELD_COUNT];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The FIELD_COUNT is an input generic, which makes it a bit weird when you want to use it, as you have to explicitly specify <Ty as Insertable<XXX>>::insert(&value, …).

I believe if you turn the FIELD_COUNT into an output associated const, similar to the other strs, you should be able to just refer to it inside the fn like so:

fn param_bindings(&self) -> [&dyn rusqlite::ToSql; Self::FIELD_COUNT];

Although I’m a bit unsure if that actually works that way, maybe its not possible to use the associated const in that position.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like #![feature(generic_const_exprs)] would allow this, but idk if we wanna commit harder to requiring nightly

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh yeah, I believe generic_const_exprs still has a long and bumpy road ahead of itself, so might be fair to avoid it.
Otherwise, we don’t have to optimize for perf that hard vs developer ergonomics, so its perfectly fine to allocate and use Vec<&dyn T>.

src/report/sqlite/models.rs Show resolved Hide resolved
Comment on lines +37 to +39
-- This should be set to the hash of the `name` column so that we can
-- distribute processing across multiple different hosts and they will
-- all come up with the same ID.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mentioned concerns around id collisions. Both hashes and random ints have potential for collision (although negligibly small).
Here in particular, context_type should ideally be part of the hash as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think you're right, but as a separate change (since this PR is pretty big) we probably want to get rid of ContextType and rename context to label or test_case or something which you also observed

migrations/01-init/up.sql Outdated Show resolved Hide resolved
src/parsers/pyreport/utils.rs Outdated Show resolved Hide resolved
src/parsers/pyreport/utils.rs Outdated Show resolved Hide resolved
src/parsers/pyreport/utils.rs Outdated Show resolved Hide resolved
src/parsers/pyreport/utils.rs Outdated Show resolved Hide resolved
src/parsers/pyreport/utils.rs Outdated Show resolved Hide resolved
@matt-codecov matt-codecov force-pushed the matt/replace-uuid-and-batch-insert branch from c26f8f3 to c434444 Compare July 27, 2024 02:32
Copy link

codecov bot commented Jul 27, 2024

Codecov Report

Attention: Patch coverage is 99.95717% with 1 line in your changes missing coverage. Please review.

Project coverage is 99.21%. Comparing base (553307f) to head (c434444).

✅ All tests successful. No failed tests found.

Files Patch % Lines
src/parsers/pyreport/utils.rs 99.88% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main       #7      +/-   ##
==========================================
+ Coverage   98.10%   99.21%   +1.10%     
==========================================
  Files          17       17              
  Lines        6171     7249    +1078     
==========================================
+ Hits         6054     7192    +1138     
+ Misses        117       57      -60     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@matt-codecov
Copy link
Collaborator Author

i think eliminating ContextType and renaming Context to TestCase (or something) should be a separate PR since this one is already very big.

the suggestions were all great, the ones i didn't take were because i couldn't get them building after fussing with them for a minute or some specific circumstance means i think the code should stay as it is.

@matt-codecov matt-codecov marked this pull request as ready for review July 27, 2024 02:39
Comment on lines +224 to +227
let mut models: Vec<LineSessionModels> = report_lines
.iter()
.flat_map(|line| create_model_sets_for_report_line(line, ctx))
.collect::<Vec<LineSessionModels>>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You already have a type annotation on models, so no need to repeat that in the call to collect. It would also be enough to leave a placeholder like so: models: Vec<_>, as the collect call only needs to know which collection to collect into, the inner T can be inferred.

@matt-codecov matt-codecov merged commit c0a4b65 into main Jul 29, 2024
6 checks passed
@matt-codecov matt-codecov deleted the matt/replace-uuid-and-batch-insert branch July 29, 2024 17:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants