Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Push rescuer down into internal/maintenance + JobRow to rivertype #36

Merged
merged 1 commit into from
Nov 20, 2023

Conversation

brandur
Copy link
Contributor

@brandur brandur commented Nov 18, 2023

Backstory: We've made some attempt to modularize River so that it's
entire implementation doesn't sit in one top-level package. We had some
success with this approach (see internal/), but after making some
progress we eventually hit a brick wall with most of the core types, and
even regressed as one of the queue maintainers (rescuer) had to live in
the top-level package so it could access worker-related information.

Modularity aside: I realized the other day that we also have another
problem on the horizon. We have a new driver system, but current drivers
like riverpgxv5 cheat by implementing a trivial wrapper around pgx.
Long term, these drives are going to have to look more like how
DBAdapter looks today. They'll have functions like JobInsert and
JobComplete specific to the particular driver implementation, and
those functions will need to take and return types representing things
like insert parameters and job return values. It might be possible to
use types from river for this, but we'd have to invert the dependency
requirement from river -> riverpgxv5 and change a lot of tests.
Being able to depend on a shared submodule (which rivertype could
become) would be a lot easier.

Approach

I wanted to try and fix that and pursue further modularization, but it
wasn't easy. I tried every trick in the book to cast types from
something internal to public-facing etc., but as implemented, it's
either not possible, or only possible with considerable downsides (see
bottom).

However, there is a change we can make that unlocks everything: JobRow
is the core type that needs to be shared amongst everything. If it could
live in a shared package, internal packages get the ability to reference
it and whole puzzle unjumbles into crystal clarity.

Previously, I'd convinced you that we could put core types into a
separate rivertype package, which we did, and then later undid after
finding a few techniques to share some internally referenced types.

Here, we bring that idea back to some degree, but only for one type:
JobRow (and okay, AttemptError too). This doesn't produce anywhere
near the same degree of required API adjustments because JobRow is
generally something that's returned from functions, and therefore
rivertype never needs to be referenced. It's also embedded on Job,
so its properties are available there again without importing
rivertype. Job stays in the top-level package so worker Work
signatures are not affected.

Signatures of the error handler and a few other types are affected, but
it's certainly the less common case.

With that move in place we go on to demonstrate that further
encapsulation becomes possible. We move the new work unit primitives
into a shared internal package, and then make use of them in the rescuer
so that it can decouple itself from the concept of a top-level worker.
With that done, we can push it into internal/maintenance where it was
supposed to be originally.

Another benefit: when introducing rivertest, we had to duplicate the
jobRowFromInternal function because of the same type problems
mentioned above. With JobRow not in a shared package, we can push this
helper into dbsqlc and have it shared amongst river and rivertest,
eliminating the necessity for this extra function.

And from here all the other pieces can fall too: job executor, producer,
etc. could all be encapsulated separately if we'd like them to be.

Alternatives considered

Struct conversion by removing AttemptError

Go allows struct to easily be converted from one to another so you could
potentially have river.JobRow(internalJobRow), but the limitation is
that you can't do it as soon as substructs (e.g. AttempError) are
introduced, even if the substructs are also identical to each other.

One thing that we could potentially do here is change AttemptError
back to a plain set of []bytes and add an unmarshaling helper like
UnmarshalAttemptErrors() []AttemptError. I think it'd work, but also
has downsides of its own:

  • Prevents us from having adding other substructs to the job row in the
    future.

  • Caching unmarshaled attempt errors is difficult because if either
    struct picked up an unexported member, it can no longer be converted
    to the other.

  • The conversions and transform wrappers that'd need to happen to pass
    back and forth with interfaces like ErrorHandler would be tricky and
    add probably cruft.

I took a stab at implementing this in #37.

Use of unsafe

Another trick you can do similar to the above: if you're sure two
structs really are identical, you can convert one to the other with
unsafe.Pointer, even if they include substructs like AttemptError.

I made a prototype of this and it definitely works, so I really
considered this as a good possible way forward.

However, use of the unsafe package is frowned upon, and it's also
banned from some hosted providers like Google App Engine, which seems
bad for our use case.

@brandur brandur force-pushed the brandur-push-down-job-row branch 2 times, most recently from 92a6bf2 to 190dba3 Compare November 18, 2023 19:38
// there's no way to put a helper in a shared package that can produce one,
// which is why we have this copy/pasta. There are some potential alternatives
// to this, but none of them are great.
func jobRowFromInternal(internal *dbsqlc.RiverJob) *river.JobRow {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note we get to kill this duplicated convert function from rivertest which is nice.

)

type CustomErrorHandler struct{}

func (*CustomErrorHandler) HandleError(ctx context.Context, job *river.JobRow, err error) *river.ErrorHandlerResult {
func (*CustomErrorHandler) HandleError(ctx context.Context, job *rivertype.JobRow, err error) *river.ErrorHandlerResult {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Despite the move, this is the only example that changes thanks to JobRow being embedded on Job.

brandur added a commit that referenced this pull request Nov 18, 2023
I wanted to throw this one up as a possible alternative to a public
facing `rivertype` module in #36, in which we demonstrate an alternate
strategy using struct conversions. The problems it aims to solve are the
same:

* A way to access core job types from within subpackages, allowing more
  modularity (e.g. `Rescuer` could live with other maintenance services).

* Long term, a way of giving drivers access to job primitives without
  having to reference the top-level `river`.

The advantage of this approach is there's fewer user-facing changes, and
the user never has to interact with an alternate package.

It works by doing a struct conversion as necessary when moving from an
internal package to one that bubbles up to `river`. Go allows pure
struct conversions as long as their fields are the same, but there are
some limitations:

* Substructs like `AttemptError` can't be converted, even if identical.

* Custom types like `JobState` can't be converted, even if identical.

We work around these issues by:

* Keeping errors as a `[][]byte` instead of it being unmarshaled by pgx.
  A helper called `ErrorsUnmarshaled()` is added to the top-level
  `JobRow` to give easy access to unmarshaled errors.

* `State` is changed to a basic string akin to functions in `http`. This
  is technically a little less type-safe, but practically speaking would
  make very little difference because all these string-based types can
  be converted to and from each other with no checks anyway. Also, users
  aren't expected to access job state all that often. When they do, they
  can still use `job.State == river.JobStateAvailable` because the
  constants have been made available as strings.

I think this approach does also work reasonably well, although I worry a
little bit that there may be some sharp edge in the future that ends up
limiting us in some unexpected way. Also, having to use helpers like
`ErrorsUnmarshaled()` is definitely a little bit worse. That said, ~no
user-facing API changes, and has some marginal performance benefits:

* Converting structs between each other should be ~free.

* Errors probably aren't used that much, so by not unmarshaling them
  most of the time, we save on a lot of unnecessary JSON parsing.
brandur added a commit that referenced this pull request Nov 18, 2023
I wanted to throw this one up as a possible alternative to a public
facing `rivertype` module in #36, in which we demonstrate an alternate
strategy using struct conversions. The problems it aims to solve are the
same:

* A way to access core job types from within subpackages, allowing more
  modularity (e.g. `Rescuer` could live with other maintenance services).

* Long term, a way of giving drivers access to job primitives without
  having to reference the top-level `river`.

The advantage of this approach is there's fewer user-facing changes, and
the user never has to interact with an alternate package.

It works by doing a struct conversion as necessary when moving from an
internal package to one that bubbles up to `river`. Go allows pure
struct conversions as long as their fields are the same, but there are
some limitations:

* Substructs like `AttemptError` can't be converted, even if identical.

* Custom types like `JobState` can't be converted, even if identical.

We work around these issues by:

* Keeping errors as a `[][]byte` instead of it being unmarshaled by pgx.
  A helper called `ErrorsUnmarshaled()` is added to the top-level
  `JobRow` to give easy access to unmarshaled errors.

* `State` is changed to a basic string akin to functions in `http`. This
  is technically a little less type-safe, but practically speaking would
  make very little difference because all these string-based types can
  be converted to and from each other with no checks anyway. Also, users
  aren't expected to access job state all that often. When they do, they
  can still use `job.State == river.JobStateAvailable` because the
  constants have been made available as strings.

I think this approach does also work reasonably well, although I worry a
little bit that there may be some sharp edge in the future that ends up
limiting us in some unexpected way. Also, having to use helpers like
`ErrorsUnmarshaled()` is definitely a little bit worse. That said, ~no
user-facing API changes, and has some marginal performance benefits:

* Converting structs between each other should be ~free.

* Errors probably aren't used that much, so by not unmarshaling them
  most of the time, we save on a lot of unnecessary JSON parsing.

Unlike #36, I didn't push the rescuer down (just in case this is a total
dead end), but this approach should make that possible.
@brandur brandur requested a review from bgentry November 18, 2023 20:40
Copy link
Contributor

@bgentry bgentry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably the right approach once we started down the road of micro packages, though it does make me a bit sad to have to interface with multiple river packages for some of the built-in features 😕 Anyway I think we should ship this.

// within any given one. (i.e. Different Go processes have different IDs,
// but IDs are shared within any given process.) A process generates a new
// ULID (an ordered UUID) worker ID when it starts up.
AttemptedBy []string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Post launch we are probably going to want to think more about the worker ID thing and maybe offer a way for users to set this per client to match an internal identifier they already have for the node.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah +1. I was also thinking the other day that we might want to try and default to something a little more meaningful — even a hostname or something would be more human-friendly, but yeah, it should be configurable.

Backstory: We've made some attempt to modularize River so that it's
entire implementation doesn't sit in one top-level package. We had some
success with this approach (see `internal/`), but after making some
progress we eventually hit a brick wall with most of the core types, and
even regressed as one of the queue maintainers (rescuer) had to live in
the top-level package so it could access worker-related information.

Modularity aside: I realized the other day that we also have another
problem on the horizon. We have a new driver system, but current drivers
like `riverpgxv5` cheat by implementing a trivial wrapper around pgx.
Long term, these drives are going to have to look more like how
`DBAdapter` looks today. They'll have functions like `JobInsert` and
`JobComplete` specific to the particular driver implementation, and
those functions will need to take and return types representing things
like insert parameters and job return values. It might be possible to
use types from `river` for this, but we'd have to invert the dependency
requirement from `river` -> `riverpgxv5` and change a lot of tests.
Being able to depend on a shared submodule (which `rivertype` could
become) would be a lot easier.

 ## Approach

I wanted to try and fix that and pursue further modularization, but it
wasn't easy. I tried every trick in the book to cast types from
something internal to public-facing etc., but as implemented, it's
either not possible, or only possible with considerable downsides (see
bottom).

However, there is a change we can make that unlocks everything: `JobRow`
is the core type that needs to be shared amongst everything. If it could
live in a shared package, internal packages get the ability to reference
it and whole puzzle unjumbles into crystal clarity.

Previously, I'd convinced you that we could put core types into a
separate `rivertype` package, which we did, and then later undid after
finding a few techniques to share some internally referenced types.

Here, we bring that idea back to some degree, but only for one type:
`JobRow` (and okay, `AttemptError` too). This doesn't produce anywhere
near the same degree of required API adjustments because `JobRow` is
generally something that's returned from functions, and therefore
`rivertype` never needs to be referenced. It's also embedded on `Job`,
so its properties are available there again without importing
`rivertype`. `Job` stays in the top-level package so worker `Work`
signatures are not affected.

Signatures of the error handler and a few other types are affected, but
it's certainly the less common case.

With that move in place we go on to demonstrate that further
encapsulation becomes possible. We move the new work unit primitives
into a shared internal package, and then make use of them in the rescuer
so that it can decouple itself from the concept of a top-level worker.
With that done, we can push it into `internal/maintenance` where it was
supposed to be originally.

Another benefit: when introducing `rivertest`, we had to duplicate the
`jobRowFromInternal` function because of the same type problems
mentioned above. With `JobRow` not in a shared package, we can push this
helper into `dbsqlc` and have it shared amongst `river` and `rivertest`,
eliminating the necessity for this extra function.

And from here all the other pieces can fall too: job executor, producer,
etc. could all be encapsulated separately if we'd like them to be.

 ## Alternatives considered

 ### Struct conversion by removing `AttemptError`

Go allows struct to easily be converted from one to another so you could
potentially have `river.JobRow(internalJobRow)`, but the limitation is
that you can't do it as soon as substructs (e.g. `AttempError`) are
introduced, even if the substructs are also identical to each other.

One thing that we could potentially do here is change `AttemptError`
back to a plain set of `[]byte`s and add an unmarshaling helper like
`UnmarshalAttemptErrors() []AttemptError`. I think it'd work, but also
has downsides of its own:

* Prevents us from having adding other substructs to the job row in the
  future.

* Caching unmarshaled attempt errors is difficult because if either
  struct picked up an unexported member, it can no longer be converted
  to the other.

* The conversions and transform wrappers that'd need to happen to pass
  back and forth with interfaces like `ErrorHandler` would be tricky and
  add probably cruft.

 ### Use of unsafe

Another trick you can do similar to the above: if you're sure two
structs really are identical, you can convert one to the other with
`unsafe.Pointer`, even if they include substructs like `AttemptError`.

I made a prototype of this and it definitely works, so I really
considered this as a good possible way forward.

However, use of the `unsafe` package is frowned upon, and it's also
banned from some hosted providers like Google App Engine, which seems
bad for our use case.
@brandur
Copy link
Contributor Author

brandur commented Nov 20, 2023

Thanks! Going to get this shipped for tomorrow.

@brandur
Copy link
Contributor Author

brandur commented Nov 20, 2023

(And added changelog.)

@brandur brandur merged commit 62b5ae3 into master Nov 20, 2023
5 checks passed
@brandur brandur deleted the brandur-push-down-job-row branch November 20, 2023 06:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants