Breaking: Rename old InsertMany, add new version with return values #589

Merged Sep 13, 2024 (4 commits)

Conversation

@bgentry (Contributor) commented Sep 10, 2024

Responding to some customer feedback, this PR reimplements InsertMany / InsertManyTx using a new query with RETURNING so the inserted rows are returned. The old COPY FROM methods are retained for extremely high-volume use cases which do not care about return values, but are renamed to InsertManyFast and InsertManyFastTx. While these new methods are likely to be slower, for most use cases it won't matter and the added usability-by-default of this method is worthwhile IMO.

While this was initially implemented using a multirow values syntax and associated query builder, the current approach uses unnest along with array parameters to enable using RETURNING with an insert query nearly the same as what we already had for the database/sql driver.

This is a breaking change. However, my expectation is that most users were not checking the count value and as such will not have to change anything when they upgrade.

Pinging @krhubert in case he wants to give some feedback on this

TODO:

  • changelog entry
  • doc updates

riverdriver/riverpgxv5/river_pgx_v5_driver.go (outdated)
// Job uniqueness is not respected when using InsertMany due to unique inserts
// using an internal transaction and advisory lock that might lead to
// significant lock contention. Insert unique jobs using Insert instead.
func (c *Client[TTx]) InsertMany(ctx context.Context, params []InsertManyParams) ([]*rivertype.JobInsertResult, error) {
@bgentry (Contributor, Author):
Had a thought that maybe instead of JobInsertResult being a plain struct we could consider using an interface. That would allow for expandability, particularly if we override these on the Pro side to return additional stuff at some point.

Need to think more about concrete use cases for that though.
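A hypothetical sketch of what that interface shape could look like (all names here are invented for illustration, not the actual River API):

```go
package main

import "fmt"

// JobInsertResult as an interface rather than a plain struct (hypothetical).
// A Pro implementation could satisfy the same contract while carrying
// additional fields internally.
type JobInsertResult interface {
	JobID() int64
	UniqueSkipped() bool // whether the insert was skipped due to uniqueness
}

// basicInsertResult is the baseline implementation.
type basicInsertResult struct {
	id      int64
	skipped bool
}

func (r basicInsertResult) JobID() int64        { return r.id }
func (r basicInsertResult) UniqueSkipped() bool { return r.skipped }

func main() {
	var res JobInsertResult = basicInsertResult{id: 42}
	fmt.Println(res.JobID(), res.UniqueSkipped())
}
```

The tradeoff is the usual one: an interface allows expansion without breaking callers, but makes the zero value and construction story slightly more awkward than a plain struct.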

client.go (outdated)
}

if param.InsertOpts != nil {
// UniqueOpts aren't supported for batch inserts because they use PG
@bgentry (Contributor, Author):

I have some ideas on making this work for multi insert but it'd probably be better for a video call brainstorming session.

@brandur (Contributor) commented Sep 12, 2024

Just a couple points on this one:

  • It's not really in the description, but I'm guessing the main driver here is that the inserted job rows are needed here? OOC, what was the use case for that?

  • Did you see the JobInsertFastMany query that's already in there for database/sql (because that doesn't support copy/from). A RETURNING * could easily be added to it, and it accomplishes this same thing in a tiny fraction of the LOCs, without all the string gluing (which is unsightly, but is also a security liability).

-- name: JobInsertFastMany :execrows
INSERT INTO river_job(
args,
kind,
max_attempts,
metadata,
priority,
queue,
scheduled_at,
state,
tags
) SELECT
unnest(@args::jsonb[]),
unnest(@kind::text[]),
unnest(@max_attempts::smallint[]),
unnest(@metadata::jsonb[]),
unnest(@priority::smallint[]),
unnest(@queue::text[]),
unnest(@scheduled_at::timestamptz[]),
unnest(@state::river_job_state[]),
-- lib/pq really, REALLY does not play nicely with multi-dimensional arrays,
-- so instead we pack each set of tags into a string, send them through,
-- then unpack them here into an array to put in each row. This isn't
-- necessary in the Pgx driver where copyfrom is used instead.
string_to_array(unnest(@tags::text[]), ',');

  • Did you consider inverting the functions such that returning function gets a new name instead of the current InsertMany getting switched to a new one? If I was designing from first principles, I'm not sure which of the two should get the main InsertMany name, but I could make an argument that it should be the non-returning variant because that's faster, and a lot of people won't need return job values anyway. And when not thinking about it from first principles, it'd certainly be an advantage to not make a breaking change, so we should weigh that fairly heavily in favor of InsertMany staying the non-returning variant.

@bgentry (Contributor, Author) commented Sep 12, 2024

  • It's not really in the description, but I'm guessing the main driver here is that the inserted job rows are needed here? OOC, what was the use case for that?

Yes, it's to make it so that jobs added with InsertMany (i.e. workflows) have the same UX affordances as individual inserts. Doing this in a single bulk insert is dramatically better than asking users to insert the rows one-at-a-time, because that approach involves a round trip per insert.

There are lots of use cases where people want to log or record the job ID after it's been inserted so that they can track it, monitor it, or otherwise interact with the job programmatically after insertion.

  • Did you see the JobInsertFastMany query that's already in there for database/sql (because that doesn't support copy/from). A RETURNING * could easily be added to it, and it accomplishes this same thing in a tiny fraction of the LOCs, without all the string gluing

I gave this unnest approach a try, and pretty quickly ran into the fact that when using an array of a custom enum type (river_job_state[]), pgx requires that the OID be looked up and registered as part of each conn upon creation. That would require it to be added to any pgxpools' Configs prior to them being created. That's trivial to do in an individual app, but asking everybody using a library like River to do that is quite a bit more complicated and will degrade UX. We'd have to make people set up an AfterConnect() on the pool config.

The workaround is to pass an array of strings, and then cast it to river_job_state in the query. Not ideal, but probably no worse in efficiency than building up and sending a large query string.

I also needed to keep the workaround for tags that you have for dbsql, which is to join them with commas, pass a single-level text array, and then split them in the query. The reason is that unnest fully flattens multi-dimensional arrays and I didn't find a clean way to single-level unnest that didn't rely on custom plpgsql.
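As a rough illustration of that packing (helper names are made up here; in the real query the unpacking half is done in SQL by string_to_array(..., ',')):

```go
package main

import (
	"fmt"
	"strings"
)

// packTags flattens each job's tag slice into one comma-joined string so the
// driver can send a single-level text[] instead of a multi-dimensional array,
// which unnest would otherwise fully flatten.
func packTags(tagsPerJob [][]string) []string {
	packed := make([]string, len(tagsPerJob))
	for i, tags := range tagsPerJob {
		packed[i] = strings.Join(tags, ",")
	}
	return packed
}

// unpackTags mirrors what string_to_array(..., ',') does on the Postgres side.
func unpackTags(packed string) []string {
	if packed == "" {
		return nil
	}
	return strings.Split(packed, ",")
}

func main() {
	packed := packTags([][]string{{"billing", "urgent"}, {"report"}})
	fmt.Println(packed)                // [billing,urgent report]
	fmt.Println(unpackTags(packed[0])) // [billing urgent]
}
```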

(which is unsightly, but is also a security liability).

Maybe I'm missing something, but I'm not seeing the security liability aspect of this. This code was, in its previous form, very deliberately not string-interpolating arguments—it's just building up the positional parameters in the query string. This is the same thing any ORM (ActiveRecord, Ecto, etc.) would do in order to perform a multirow values insert. Of course sqlc doesn't make that possible, but given this is a useful feature to have access to, it's unfortunately another major drawback of sqlc as it exists today.

The code maintenance thing is a much bigger concern, so I'm more than willing to try the unnest approach for now to see if we run into problems. We can always introduce more complex code as an optimization later on if needed. ~~Here is that branch if you want to compare.~~ The PR has been updated to use this approach.
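The Go side of the unnest approach amounts to pivoting the params slice into one flat array per column before binding. A simplified sketch (field and type names are illustrative, not the actual driver code):

```go
package main

import "fmt"

// jobParams is a simplified stand-in for the driver's insert params.
type jobParams struct {
	Kind     string
	Priority int16
	Queue    string
	State    string
}

// columns holds one flat array per column, matching the unnest() query's
// bind parameters (@kind::text[], @priority::smallint[], and so on).
type columns struct {
	Kind     []string
	Priority []int16
	Queue    []string
	// State is sent as text[] and cast to river_job_state inside the query,
	// avoiding per-connection enum OID registration in pgx.
	State []string
}

// pivot converts row-oriented params into column-oriented arrays.
func pivot(params []jobParams) columns {
	c := columns{
		Kind:     make([]string, len(params)),
		Priority: make([]int16, len(params)),
		Queue:    make([]string, len(params)),
		State:    make([]string, len(params)),
	}
	for i, p := range params {
		c.Kind[i] = p.Kind
		c.Priority[i] = p.Priority
		c.Queue[i] = p.Queue
		c.State[i] = p.State
	}
	return c
}

func main() {
	c := pivot([]jobParams{
		{Kind: "email", Priority: 1, Queue: "default", State: "available"},
		{Kind: "report", Priority: 2, Queue: "batch", State: "scheduled"},
	})
	fmt.Println(c.Kind, c.Queue)
}
```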

  • Did you consider inverting the functions such that returning function gets a new name instead of the current InsertMany getting switched to a new one? If I was designing from first principles, I'm not sure which of the two should get the main InsertMany name, but I could make an argument that it should be the non-returning variant because that's faster, and a lot of people won't need return job values anyway. And when not thinking about it from first principles, it'd certainly be an advantage to not make a breaking change, so we should weigh that fairly heavily in favor of InsertMany staying the non-returning variant.

I did consider this, and landed on this approach for these reasons:

  1. I would like the UX between the default paths for Insert and InsertMany to be as aligned as possible. In particular I have some ideas for bringing unique job support to multi-insert, and if possible I want the API to allow me to say whether or not an insert was skipped due to uniqueness. I would also like to get to a point where the code paths for these two cases are the same, i.e. we don't need to maintain separate queries for the single-insert case and can just internally wrap everything as a multi insert that happens to only have one row.
  2. The cases where somebody is going to care about the perf difference between COPY vs another approach are likely very minimal—probably only where you're going to insert (many?) thousands of rows at once. Just guessing but I'd say that's probably <.1% of usage. Given that, I believe the far-less-common path should be the one with the weird name.
  3. The actual impact to customers from this breaking change is probably quite small, because there is no practical need to ever check that the integer count returned by InsertMany actually matches the size of the array you gave it. Maybe some people were using it instead of len(rows) but I'd wager it's super unlikely.
  4. We are still pre v1 for a reason, to tease out mistakes like this and correct them while we can.

I'm still open to changing the approach, but felt that the benefits pretty strongly outweighed the downsides here.

This new query allows many rows to be inserted (up to ~7280 as of now)
while also allowing the new records to be returned. While this is not
quite as fast as the `COPY FROM` option when loading many rows, it
provides a better UX for the vast majority of use cases.

It _does_ require that we ditch sqlc for this one query, because sqlc
does not support the multirow values insert syntax due to the dynamic
nature of the param placeholders.
@bgentry force-pushed the bg-job-insert-many-returning branch 2 times, most recently from b79871e to b5764d0 (September 12, 2024 19:17)
@bgentry marked this pull request as ready for review September 12, 2024 19:18
This adds new implementations of `InsertMany` / `InsertManyTx` that use
the multirow `VALUES` syntax to allow the new rows to be returned upon
insert. The alternative `COPY FROM` implementation has been renamed to
`InsertManyFast` / `InsertManyFastTx`. The expectation is that these
forms will only be needed in cases where an extremely large number of
records is being inserted simultaneously, whereas the new form is more
user-friendly for the vast majority of other cases.
@brandur (Contributor) left a comment:

Nice. Thanks for those changes!


defaultObject := "{}"

insertJobsParams.Args[i] = valutil.ValOrDefault(string(params.EncodedArgs), defaultObject)
@brandur (Contributor):

Should maybe pull this trick into the other JobInsertMany in here too.
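For context, the trick is a default-if-zero helper. A generic sketch of the pattern (not River's actual valutil implementation):

```go
package main

import "fmt"

// valOrDefault returns val unless it's the zero value for its type, in which
// case it returns defaultVal. Here it substitutes "{}" for empty JSON args so
// the NOT NULL jsonb column always receives a valid object.
func valOrDefault[T comparable](val, defaultVal T) T {
	var zero T
	if val == zero {
		return defaultVal
	}
	return val
}

func main() {
	fmt.Println(valOrDefault("", "{}")) // {}
	fmt.Println(valOrDefault(`{"a":1}`, "{}"))
}
```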

@@ -118,6 +118,7 @@ type Executor interface {
JobInsertFast(ctx context.Context, params *JobInsertFastParams) (*rivertype.JobRow, error)
JobInsertFastMany(ctx context.Context, params []*JobInsertFastParams) (int, error)
JobInsertFull(ctx context.Context, params *JobInsertFullParams) (*rivertype.JobRow, error)
JobInsertManyReturning(ctx context.Context, params []*JobInsertFastParams) ([]*rivertype.JobRow, error)
@brandur (Contributor):

This one's still kind of a "job insert fast" variant, so that should probably still be in there.

I wonder too if we should try to match up which one gets the suffix between the client and driver a little bit more. In the client, the "fast" version gets a suffix, but in the driver here the "returning" variant gets the suffix (i.e. so they're flipped).

Maybe:

  • JobInsertFastMany (called from InsertManyFast) -> JobInsertFastManyNoReturning
  • JobInsertManyReturning (called from InsertMany) -> JobInsertFastMany

@bgentry (Contributor, Author):

Good call, renamed them as suggested ✌️

@bgentry enabled auto-merge (squash) September 13, 2024 03:00
@bgentry merged commit 67c8a8a into master Sep 13, 2024 — 14 checks passed
@bgentry deleted the bg-job-insert-many-returning branch September 13, 2024 03:00