Introduce more efficient implementations of str and join #545

alexander-yakushev · 2024-09-26T13:56:26Z

Given that most of the purpose of HoneySQL is to mash strings together so that the user doesn't have to, it's natural that string mashing is responsible for most of its running time and generated allocations. Clojure's base library functions are not the most efficient for this kind of work:

clojure.core/str performs badly when given more than one argument. It falls back to varargs and inefficient sequence iteration.
clojure.string/join also iterates over the provided collection through the inefficient seq abstraction.

This PR proposes adding a new namespace that contains optimized implementations of those two functions. As a bonus, the new join also has a transducer-accepting arity. Since it is a common pattern to string/join something after some action is performed on every element, the transducer arity becomes useful in many situations.

The new functions only have optimized bodies under the :clj reader conditional and delegate to the base functions on every other platform.
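To illustrate the idea for str, here is a minimal JVM-only sketch. It is not the code from this PR: the names fast-str and append! are hypothetical, and in a .cljc file the body would sit in the :clj branch of a reader conditional with clojure.core/str as the fallback. The point is only that fixed arities avoid allocating a varargs seq for the common small argument counts.

```clojure
;; Hypothetical sketch, NOT the implementation from this PR.
;; JVM Clojure only: relies on java.lang.StringBuilder.

(defn append!
  "Appends x to sb, treating nil as the empty string (as clojure.core/str does)."
  ^StringBuilder [^StringBuilder sb x]
  (if (nil? x)
    sb
    (.append sb (.toString ^Object x))))

(defn fast-str
  "Like clojure.core/str, but with extra fixed arities so that calls with
  up to three arguments never build a varargs seq."
  ([] "")
  ([a] (if (nil? a) "" (.toString ^Object a)))
  ([a b]
   (.toString (-> (StringBuilder.) (append! a) (append! b))))
  ([a b c]
   (.toString (-> (StringBuilder.) (append! a) (append! b) (append! c))))
  ([a b c & more]
   ;; Only this arity pays for the varargs seq.
   (let [sb (-> (StringBuilder.) (append! a) (append! b) (append! c))]
     (.toString ^StringBuilder (reduce append! sb more)))))

;; Usage: behaves like clojure.core/str for these inputs.
(fast-str "SELECT " :a nil ", " 42)
```

How many fixed arities are worth having is a judgment call; three is an arbitrary choice for this sketch.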

In the honey.sql namespace, clojure.core/str is replaced with honey.sql.util/str in the namespace form. join is referred directly, so everywhere in the code str/join is replaced with join and, where necessary, modified to use the transducer arity.

Obviously, I'm open to restructuring the solution or dialing back some changes if this is too much in one go.

igrishaev · 2024-09-26T14:17:38Z

Could you please share benchmarks (if you still have your REPL open, by any chance)? I just wonder how much of a performance gain it gives.

alexander-yakushev · 2024-09-26T14:31:43Z

I was mostly measuring allocation improvements across multiple Metabase endpoints. This PR focuses mostly on allocations; execution speed is not affected as much. Here's one concrete query:

(crit/quick-bench
   (format-dsl {:select [[[:min :recent_views.user_id] :user_id]
                         :model
                         :model_id
                         [[:max [:coalesce :d.view_count :t.view_count]] :cnt]
                         [:%max.timestamp :max_ts]]
                :group-by  [:model :model_id]
                :where     [:and
                            [:= :context "view"]
                            [:in :model #{"dashboard" "table"}]]
                :order-by  [[:max_ts :desc] [:model :desc]]
                :limit     10
                :left-join [[:report_dashboard :d]
                            [:and
                             [:= :model "dashboard"]
                             [:= :d.id :model_id]]
                            [:metabase_table :t]
                            [:and
                             [:= :model "table"]
                             [:= :t.id :model_id]]]}))

;; Before:

Evaluation count : 13122 in 6 samples of 2187 calls.
             Execution time mean : 50.880818 µs
    Execution time std-deviation : 11.514117 µs
   Execution time lower quantile : 44.925380 µs ( 2.5%)
   Execution time upper quantile : 70.466794 µs (97.5%)
                   Overhead used : 1.879403 ns

Allocations per call: 118,560b

---

;; After:

Evaluation count : 13188 in 6 samples of 2198 calls.
             Execution time mean : 45.462514 µs
    Execution time std-deviation : 53.006990 ns
   Execution time lower quantile : 45.418067 µs ( 2.5%)
   Execution time upper quantile : 45.542755 µs (97.5%)
                   Overhead used : 1.879403 ns

Allocations per call: 105,072b

I've made further improvements locally, but I don't want to dump everything into one PR, to keep the review easier.

p-himik · 2024-09-26T14:41:19Z

A couple of counterpoints:

When the query is not dynamic or has few possible values, HoneySQL can be set to cache the formatter's result
Perhaps a better approach would be to use a string builder throughout

alexander-yakushev · 2024-09-26T14:47:14Z

I'm not sure whether such a cache should be implemented on the HoneySQL side. I think it's up to the consumer to know which queries are cacheable/AOT-computable and act accordingly.
I agree that carrying around a StringBuilder would be even more efficient; however, I aimed for minimal impact with this PR. A single-StringBuilder approach would require much more rewriting. It can be done, perhaps as another change in the future (if the improvement justifies the rewrite). A StringBuilder approach would also be harder to make CLJS-compatible in a clean way.

seancorfield · 2024-09-26T14:54:08Z

I'll note that the caching is built into HoneySQL -- but it is opt-in and you have to add the clojure.core.cache dependency and choose, on a per-call basis, whether to use caching. You need to use named parameters to really benefit from that (and there are a couple of SQL constructs you currently have to avoid, e.g., :in due to how they are currently expanded).

I like the idea of speeding up str and join but have two comments:

1. It almost seems like this should be a separate library (that HoneySQL could then depend on)
2. I feel strongly that the transducer arity of join should be (join sep xform coll) -- the parallel is (into to xform from) IMO and it makes the transformation from (join sep (map f coll)) to (join sep (map f) coll) more natural
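As a rough illustration of that argument order, here is a JVM-only sketch of a join with a (join sep xform coll) arity. It is not the code from this PR: the name sketch-join is hypothetical, and it only shows that the transducer arity can feed a StringBuilder directly instead of first realizing an intermediate lazy seq.

```clojure
;; Hypothetical sketch, NOT the implementation from this PR.
;; JVM Clojure only: relies on java.lang.StringBuilder.

(defn sketch-join
  "Joins the elements of coll with sep. With xform, each element is passed
  through the transducer before being appended."
  ([sep coll] (sketch-join sep identity coll))
  ([sep xform coll]
   (let [sep    (str sep)
         first? (volatile! true) ; no separator before the first element
         rf     (fn
                  ([] (StringBuilder.))                 ; init
                  ([^StringBuilder sb] (.toString sb))  ; completion
                  ([^StringBuilder sb x]                ; step
                   (if @first?
                     (vreset! first? false)
                     (.append sb sep))
                   (.append sb (str x))))]
     (transduce xform rf coll))))

;; Before: (clojure.string/join ", " (map name [:a :b :c]))
;; After, with the same argument positions as (into to xform from):
(sketch-join ", " (map name) [:a :b :c])
```

With sep first and coll last in both arities, moving from the two-argument form to the transducer form only means pulling coll out of the (map f coll) call.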

alexander-yakushev · 2024-09-26T20:05:04Z

I understand what you mean, but I really, really dislike "util" deps in other libraries, and I don't think it's a good idea to multiply those. Maybe, at some point, when the list of "improved core functions" is big enough, it would make sense to move them into a separate library and depend on it. Until then, there is big value in HoneySQL having no dependencies, and adding two utility functions is a small price to pay for keeping it that way.

I agree. I didn't know which to pick between the two; glad you have a strong stance on it.

alexander-yakushev · 2024-09-26T20:13:46Z

Updated WRT comment 2 by @seancorfield. Also added more str arities.

seancorfield · 2024-09-26T21:17:07Z

Thanks. That makes the diff a lot more readable as well as having a more intuitive arg order.

seancorfield · 2024-09-26T21:20:24Z

Can you explain why this line is present? clj-kondo flags it as an error (unresolved symbol):

  {:tag String}

Signed-off-by: Sean Corfield <sean@corfield.org>

alexander-yakushev · 2024-09-27T06:45:13Z

I copied it from clojure.core/str and forgot to remove it. I don't think it does anything useful given that the individual arities' return types are already type-tagged. I will remove it in the next PR.

alexander-yakushev force-pushed the perf-opt branch from fa1632b to 9b7a6cc Compare September 26, 2024 14:06

alexander-yakushev force-pushed the perf-opt branch 2 times, most recently from b3338f1 to 98a7fd2 Compare September 26, 2024 20:13

Introduce more efficient implementations of str and join

846123c

alexander-yakushev force-pushed the perf-opt branch from 98a7fd2 to 846123c Compare September 26, 2024 20:15

seancorfield merged commit 8c93e28 into seancorfield:develop Sep 26, 2024
4 checks passed

seancorfield added a commit that referenced this pull request Sep 26, 2024

note #545 in changelog

f31533d

Signed-off-by: Sean Corfield <sean@corfield.org>

alexander-yakushev deleted the perf-opt branch September 27, 2024 06:43

alexander-yakushev mentioned this pull request Sep 27, 2024

Hodgepodge of optimizations #546

Merged

alexander-yakushev mentioned this pull request Oct 11, 2024

Bump HoneySQL version metabase/metabase#48602

Merged
