-
-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[jdbc.read] Don't track realized keys in TransientRow #185
base: master
Are you sure you want to change the base?
[jdbc.read] Don't track realized keys in TransientRow #185
Conversation
84ff99e
to
7319b51
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #185 +/- ##
==========================================
- Coverage 83.20% 83.17% -0.03%
==========================================
Files 37 37
Lines 2524 2526 +2
Branches 214 215 +1
==========================================
+ Hits 2100 2101 +1
Misses 210 210
- Partials 214 215 +1 ☔ View full report in Codecov by Sentry. |
7330740
to
76a77f9
Compare
src/toucan2/jdbc/result_set.clj
Outdated
@@ -142,6 +145,7 @@ | |||
col-names (get builder :cols (next.jdbc.rs/get-modified-column-names | |||
(.getMetaData rset) | |||
combined-opts)) | |||
col-names-kw (vec (keep #(column-name->keyword % label-fn) col-names)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're really worried about performance, (into [] (keep ...) ...)
would be faster, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code only runs once per ResultSet, so it's not as critical to make optimizations here. But sure, there's no harm in switching to a transducer.
test/toucan2/jdbc/row_test.clj
Outdated
(defn- realized-keys [^TransientRow row] | ||
@(.realized_keys row)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't this still be calculated by looking at .column_names
and removing that ones that are ::not-found
? That way we don't need to delete a ton of test code below
@@ -142,6 +145,7 @@ | |||
col-names (get builder :cols (next.jdbc.rs/get-modified-column-names | |||
(.getMetaData rset) | |||
combined-opts)) | |||
col-names-kw (vec (keep #(column-name->keyword % label-fn) col-names)) | |||
col-name->index (make-column-name->index col-names label-fn)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like it will be bad to performance to calculate all the column name keywords ahead of time in the line above and then create a memorized function that has to recalculate all those names a second time here... surely if you want to go down this road it would make sense to have make-column-name->index
take col-names-kw
so it doesn't need to call column-name->keyword
again. Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an added bonus I guess we'd avoid the overhead of having to do column-name->keyword
calculations on the key passed to the without
method since presumably it should already be a keyword that is formatted correctly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like I've said above, column names here are calculated only once per ResultSet. The current realized-keys update is triggered for every individual cell access (N columns * M rows). That's why optimizing the once per ResultSet part is not as important. But if there are easy wins there, it makes sense to do them as well.
The reason why I left the current memoized make-column-name->index
conversion is because, if I read the code correctly, the user may pass a string, a keyword, or a symbol as a column name, and they all should work. We can perhaps calculate a map for all possible valid column names. I am not against it but it feels a bit dirty. We can go that route if you wish. We can't precalculate all valid column names because of arbitrary label-fn
. I see that it transforms names to kebab case for example, so :my_good_column
, "my-good_column"
, 'my_good-column
are all valid keys that should be transformed to :my-good-coumn
in the end. We can't precalculate that.
It would be nice to see something like Criterium benchmarks of before and after these changes to get a sense of how significant of a performance impact this really makes. I would be interested in seeing whether doing away with We lose a bit of info by taking this stuff out (as evidenced by the number of lines you had to delete from the tests) so if we're talking like a performance improvement of a few microseconds then I'm not sure it's really worth it or not. If |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Is there a way we can stop pre-calculating all of the keyword column names and only calculate them on demand if and when we print a transient row?
- If we actually do need to pre-calculate all of the keyword column names,
make-column-name->index
can probably be updated to use those so we're not calculating them twice for every row. - If we're getting really crazy with performance optimizations,
(into [] ...)
is faster than(vec (keep ...))
- If we remove
realized-keys
but still havecolumn-names
, you should still be able to calculate the realized keys by looking at column names and seeing what is not::not-found
. If we write a function for that inrow-test
then you don't need to remove all the tests you've removed
463e943
to
3e6e748
Compare
UPD regarding tests: the current tests are written using |
Here's the benchmark: ;; preparation - insert 10k rows into people table
(toucan2.insert/insert!
::test/people
(for [id (range 10 10000)]
(instance/instance ::test/people
{:id id
:name "Somebody"
:created-at (LocalDateTime/parse "2017-01-01T00:00")})))
;; the benchmark - I use time+ to get allocations too, but criterium would work just as well
(time+ (select/select ::test/people {:select [:*], :from [[:people]]})) Results:
|
And another follow-up on having to calculate all columns ahead of time: I'm mostly looking at Metabase benchmarks (not microbenchmarks of Toucan2 itself) and I see that in Metabase |
3e6e748
to
262b2c0
Compare
Tracking the hashset of realized keys for each row in the resultset has noticeable overhead. Since this is only needed to print the transient row during debugging (as far as I can tell), it makes sense to shave the overhead here. All column names can be computed upfront and then used to print the values available in transient row. Please tell me if I'm wrong or missing something.