-
Notifications
You must be signed in to change notification settings - Fork 534
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add locking mechanism to prevent parallel execution as a safety measure #335
Comments
I can't speak for @VojtechVitek, but I've come around on this. After chatting with other folks and gathering feedback the consensus is "it'd be a nice feature to have". I'm not sure what this would actually look like to the end-user, but, I suggest we keep this issue open and continue the discussion. |
Not an exclusive list, but some things to think through..
I'm sure more questions and implementation details will come up. |
To solve this we had to write a wrapper executable for goose that obtains an advisory lock in postgres. It's pretty common in Kubernetes deployments to use a init container to perform database migrations. The issue arises due to the fact that init containers are part of deployments which run multiple copies of the container. With no locking in place, goose migrations interleave with each other. It's a race condition so we don't see the issue all the time, but with no locking in place we run into issues randomly. We use advisory locks, but the implementation is pretty specific to each DB backend. I'd expect the end user experience for either of these methods to not look any different that it does now. Goose runs all the same, the only difference is that another copy of goose cannot run until the first session is complete. |
Oddly enough, I was just chatting with folks about this. @jessedearing, was your approach similar to the one described by @dahu33 in this #191 (comment)? I've seen this come up a lot recently, probably because more and more environments are punting this problem to the tool itself. |
It was. We moved it to it's own executable and launched Goose so we can still pull in upstream Goose updates easily. We just use The implementation is very specific to Postgres, but MySQL has GET_LOCK and RELEASE_LOCK functions too. |
To add another scenario: I'm embedding Goose into the test suite to ensure that code and database (Postgres) remain in sync. However, I find that the golang test runner introduces concurrency such that migrations can inconsistently happen simultaneously. |
Thanks all for the feedback. Going to bump this up and add a label as something we should add to this library. It'll likely be opt-in behaviour to avoid unexpected issues for existing users. |
After some thought, I think an advisory lock (at least for postgres) is the right approach, granted this won't always work (see cockroachdb/cockroach#13546). We can document which databases are safe for concurrent use and which aren't. Although this may be a reason to use a table lock, similar to liquibase, which not only helps in the cockroachdb example but also where PgBouncer is being used with transaction pooling. Maybe the current goose behavior should be the default and users are responsible for serializing their deployments (as an example) .. but if the database supports it then they can opt-in for advisory locks. Tbd whether this should be session level or transaction level advisory lock. We could implement both and leave it up to the user to configure based on their needs, something along the lines of:
I really like @dahu33 comment from above, which is why I'd like to see goose add support for this feature and with some wood behind this arrow hopefully we can get an implementation we all like.
If folks have further thoughts, suggestions or ideas on an implementation please drop your comments! |
Dropping some thoughts here. I think it'd be ideal if we implemented a transaction-level advisory lock, instead of a session-level advisory lock. But there are some notable gotchas with transaction-level locks, such as These statements wait for other transactions to complete before executing (reference):
This means if we took a transaction-level lock and then attempted to run a migration it'd hang indefinitely. Not good. Now, there is another option. For each migration, we know whether it is safe to run within a transaction or not (with the The downside with session-level advisory locks is they MUST be released by the same session (read connection), and if anything happens to that session you're in for a bad time trying to figure out why subsequent operations are hanging.
Another thing I've been thinking about is whatever the implementation is, we should keep the door open to supporting this request: #222 tl;dr it'd be nice if multiple migrations could be applied within a single atomic transaction. (assuming all those files don't have the I'd like to explore what a separate |
For the postgres session-level advisory lock, I changed the implementation to use The retry has sane default but can be configured by the user. This implementation avoids an edge case that can lead to a deadlock. It's quite easy to reproduce with any tool that uses
migration 1CREATE TABLE owners (
owner_id BIGSERIAL PRIMARY KEY,
owner_name text NOT NULL
); migration 2CREATE UNIQUE INDEX CONCURRENTLY ON owners(owner_name); This will lead to the following error: If I have some cycles, I'll add more details here. |
@mfridman if/when you have the cycles, I'd be very interested to learn how taking a lock with |
Oh, I was able to reproduce this with just two Connection 1:
Connection 2:
This hangs, waiting for Connection 1 to release the lock. Then, back in Connection 1:
That's... pretty crazy, and not what I would have expected! The documentation on
Or, like you quoted earlier, maybe it is a virtual transaction that "could use the index"
Looks like the Flyway project also switched their implementation to spinning on Appreciate you mentioning it here, I wouldn't have been aware of the issue otherwise. |
We never saw this in our tooling because our tables are small enough that we haven't added |
@peterldowns Coincidentally, I recently discovered a series of Blog posts from Sourcegraph (huge kudos to @efritz) that outlines their migration woes with https://about.sourcegraph.com/blog/introducing-migrator-service .. scroll down to A deeper contributing factor The good news is the |
@mfridman thanks for the links to those blogposts. Where I last worked, we would build large indexes concurrently "out of band", and then check in a migration that creates-if-not-exists. For example, let's say we needed to create some index on a really large table where it would probably take a few hours. We would normally run any new migrations as part of our deployment process, as a prelude to replacing the existing app containers with the new version. We couldn't create that index during deployment, even with CREATE INDEX -- CONCURRENTLY
IF NOT EXISTS my_example_idx
ON some_really_large_table (primary_key, created_at, some_other_key); Once the PR with this migration was approved, we'd use a script to manually execute the concurrent version of that query: CREATE INDEX CONCURRENTLY
IF NOT EXISTS my_example_idx
ON some_really_large_table (primary_key, created_at, some_other_key); Only once that concurrent index creation had succeeded would we merge the PR. After merging the PR, redeploying would involve running migrations before starting up the app, but it would take no time at all because the index was already created! In a testing scenario where we would need to migrate from an empty database to the current state, this would be fast becuase there wouldn't be any data in the tables, and we didn't do concurrent index creation. I'm actually planning on releasing a migration library/framework soon based on this example and a few other things I learned. Because of this dance around concurrent migrations I had never run into this problem that you mentioned. Thank you again for helping me understand the problem and for teaching me something new about Postgres! |
Saw my blog post mentioned above, thought I'd drop in and say 👋! The way we do migrations at Sourcegraph assumes:
This affords us certain properties such as "a incomplete attempt while no migrator holds a lock" means a previous migration died before completion, but it didn't affect anything (since it was done atomically in a transaction). Concurrent transactions throw a wrench in the works here because you can't hold an advisory lock and complete the index creation at the same time. The way we get around this is special-casing migrations with concurrent index creation here:
Note that the reason we could do this (and golang-migrate couldn't easily) was because we only support Postgres. |
This functionality has been added in v3.16.0 and higher. Requires a A bit more detail in this blog post: https://pressly.github.io/goose/blog/2023/goose-provider/#database-locking Thank you to everyone for the feedback, fruitful discussions, and contributions! Going to close this issue, but feel free to continue the discussion and looking forward to your feedback. |
I know this topic has already been discussed in #191 and #258 and while I understand the argument that migrations should not run in parallel in the first place, I still think that goose should have a locking mechanism to prevent parallel execution as a safety measure. This would not only allow new use cases but also make goose a much safer tool to use!
The text was updated successfully, but these errors were encountered: