
Paginating/ranging over non-unique fields #81

Open
bjeanes opened this issue Jul 22, 2015 · 15 comments

bjeanes commented Jul 22, 2015

The current advice, implicit in this doc and explicit in the Heroku API reference, breaks large result sets into ranges and uses field values as constraints to be applied to the query for the rest of the result set. It isn't hard to stumble upon scenarios where a result set is sorted by a non-unique field (ratings, popularity, counts, times, etc.), in which the ranges requested via those constraints would yield overlapping result sets.

Unfortunately, the Content-Range scheme doesn't account for these scenarios (likely because we haven't yet had such a scenario in the Heroku API). As this document aims to serve as general API design advice, and non-unique sort + pagination is not an uncommon situation, some thought on this problem from the document authors may be merited. Additionally, our team (at Heroku) has come across some scenarios we'd like to introduce which would benefit from an answer.

@geemus @brandur have either of you thought about this much before?

@mathias and I have discussed the idea of using "composite keys" which include at least one unique field as a component in search to achieve something similar in spirit to the current advice, but nothing clearly simple emerges.


mathias commented Jul 22, 2015

The simple solution I have now is that the query to sort the results is always something like WHERE conditions ORDER BY deploys DESC, id ASC, where deploys is some non-unique count. The ORDER BY would be part of a default scope in my models, and my equivalent of API's Ranger class would handle the pagination inside that sorted list.

For the small-ish number of records we might need to serve in the service I'm proposing (likely 1-10,000, and no more than 10K for a while), sorting like this in each query without a supporting index should be OK, though that can probably be improved in the future.
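The composite ORDER BY idea can be sketched with SQLite (a minimal sketch; the `apps` table and column names here are illustrative, not from the actual service):

```python
import sqlite3

# Hypothetical table standing in for the records being paginated:
# "deploys" is the non-unique sort column, "id" the unique tie-breaker.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE apps (id INTEGER PRIMARY KEY, deploys INTEGER)")
conn.executemany("INSERT INTO apps VALUES (?, ?)",
                 [(1, 123), (2, 4), (3, 123)])

# The default scope: sort by the non-unique column first, then by the
# unique id so the overall ordering is total and stable across queries.
rows = conn.execute(
    "SELECT id, deploys FROM apps ORDER BY deploys DESC, id ASC"
).fetchall()
print(rows)  # [(1, 123), (3, 123), (2, 4)]
```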


geemus commented Jul 22, 2015

@mathias I had originally started pondering down a very similar path. Unfortunately, although a convention-based approach helps in terms of consistent order, it doesn't help in terms of boundaries. I.e. if several records share the main key and are sorted by id, then when you ask for the next page you could either end up skipping some of those records (or duplicating them). In the pathological case, an entire range (or more) of records could share the same primary sorting key.

No particular suggestion just yet, need to ponder/research a bit more. Will try to work on something soon.


mathias commented Jul 22, 2015

@geemus Doesn't using the id of the record (which should be unique) as the second ORDER BY argument prevent those cases you mentioned?


mathias commented Jul 22, 2015

I imagine with some data (as pseudo-CSV)

id,deploys,another_attr
1,123,foo
2,4,bar
3,123,baz

When we want to sort by deploys descending, it'd really sort like this:

deploys/id
123/1
123/3
4/2

And if we sorted by another_attr ascending:

another_attr/id
bar/2
baz/3
foo/1

Essentially the unique ID is being used to break ties for the sort order.
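The tie-breaking described above can be reproduced with Python's tuple sort keys (using the pseudo-CSV data from the comment):

```python
# Records as (id, deploys, another_attr) tuples, per the pseudo-CSV above.
records = [(1, 123, "foo"), (2, 4, "bar"), (3, 123, "baz")]

# Sort by deploys descending, breaking ties with the unique id ascending.
by_deploys = sorted(records, key=lambda r: (-r[1], r[0]))
print([(r[1], r[0]) for r in by_deploys])  # [(123, 1), (123, 3), (4, 2)]

# Sort by another_attr ascending; no values collide here, but the id
# still acts as a deterministic tie-breaker if they ever did.
by_attr = sorted(records, key=lambda r: (r[2], r[0]))
print([(r[2], r[0]) for r in by_attr])  # [('bar', 2), ('baz', 3), ('foo', 1)]
```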


geemus commented Jul 22, 2015

Correct. The issue is, how do you ask for the second range if the last value of the first range and first value of the second range have the same primary sort column value? So, for instance:

1/1
2/1
---
2/2
3/1

Let's say this is the data, and we are using ranges with 2 items each. The first range would be 1..2, and presumably the next-range would then be 2..3, but this would include the 2/1 again, since it matches with that specification also. Does that help clarify my meaning?
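The overlap is easy to reproduce: when the next page is requested using only the last-seen primary sort value, the boundary row comes back twice (a small sketch using the value/id pairs from the example above):

```python
# Rows as (primary_sort_value, id); two items per page, as in the example.
rows = [(1, 1), (2, 1), (2, 2), (3, 1)]

page1 = rows[:2]           # [(1, 1), (2, 1)]
last_value = page1[-1][0]  # the next range starts at primary value 2

# Naive next page: everything whose primary value is >= the last one seen.
page2 = [r for r in rows if r[0] >= last_value]
print(page2)  # [(2, 1), (2, 2), (3, 1)] -- (2, 1) is served twice
```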


mathias commented Jul 22, 2015

@geemus The right side would have to be a unique value for each of those. You couldn't have a list like

1/1
2/1
---
2/2
3/1

Because the 1 on the right is reused. "Inside" any value on the left side, the right side would also be sorted.

A better version of the above data would hopefully look like:

1/1
2/4
---
2/2
3/3

where the values on the right side are the database ID (or UUID) -- basically, where they're unique.

However, this leads to something cumbersome on the client side: we'd have to know that internally that's what pagination needs, and send both values to paginate:

Range: deploys 1..2; id 1..4; max=2

Passing order=desc doesn't make sense unless it is guaranteed to apply to the first attribute sorted on, with the second attribute always assumed to be ASC (for simplicity).
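Tuple comparison captures the "send both values" semantics being described: a page boundary carries both (deploys, id), and the next page is everything strictly greater in lexicographic order. A minimal sketch, assuming both columns sort ascending for simplicity (function and data are illustrative):

```python
# (deploys, id) pairs in ascending lexicographic order; ids are unique.
rows = sorted([(1, 1), (2, 4), (2, 2), (3, 3)])

def page_after(last_deploys, last_id, max_items=2):
    """Return rows sorting strictly after the (deploys, id) boundary,
    the way a composite range like 'deploys 2..; id 2..' would."""
    return [r for r in rows if r > (last_deploys, last_id)][:max_items]

print(page_after(0, 0))  # [(1, 1), (2, 2)] -- first page
print(page_after(2, 2))  # [(2, 4), (3, 3)] -- no overlap at deploys=2
```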


geemus commented Jul 22, 2015

Well, my example was bad, but yeah, you would have to pass both keys (at least in some cases) to get the next range. And there are already semantics around asking for multiple ranges in the RFCs that are NOT at all like this, which seems troublesome. In the RFCs it is more like bytes 1-10, 20-30: the same range unit, but more than one range at once (which might or might not be contiguous). That doesn't mean we can't do something like your suggestion, but it might mean that doing it as a doubled range could be problematic.

I was pondering using something like marker from S3 in cases like this, i.e. you would pass an extra param that would always contain the last-seen uuid or something. Similar really to your suggestion, but putting the arg elsewhere, e.g. Range: deploys 1..2; marker=4, max=2, or something like that.
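A minimal sketch of the marker idea, assuming the server resumes strictly after the row whose unique id the marker names (function and variable names here are hypothetical, not from any real API):

```python
def serve_page(rows, marker=None, max_items=2):
    """rows are (deploys, id) pairs already in (deploys, id) ascending
    order; a marker names the last-seen id, and the page resumes just
    after that row, so the boundary row is never duplicated."""
    if marker is None:
        start = 0
    else:
        ids = [i for _, i in rows]
        start = ids.index(marker) + 1
    page = rows[start:start + max_items]
    # Only advertise a next marker when more rows remain past this page.
    next_marker = page[-1][1] if start + max_items < len(rows) else None
    return page, next_marker

rows = [(1, 1), (2, 2), (2, 4), (3, 3)]
first, marker = serve_page(rows)
print(first, marker)  # [(1, 1), (2, 2)] 2
second, last = serve_page(rows, marker=marker)
print(second, last)   # [(2, 4), (3, 3)] None
```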


mathias commented Jul 22, 2015

You're right, and I suspect that, because the Range header is intended for ordered and unique values (bytes in a file), it won't work here without expanding what we allow to be passed in / expanding the definition to some degree.

Adding in the marker is an interesting twist, and it means the client doesn't have to be quite so clever about the values it sends; it can simply use the Next-Range header from the last response, as is intended with this design.

Sounds like S3 uses Marker for something slightly different:

Specifies the key to start with when listing objects in a bucket. Amazon S3 returns object keys in alphabetical order, starting with key after the marker in order.

Type: String

Default: None

They also have a NextMarker value which corresponds to our Next-Range to some degree.

Would the response headers look like this?

Content-Range: id 5..10, marker=3
Next-Range: ]10..; marker=4, max=2


mathias commented Jul 22, 2015

Follow-up to that: I guess the Next-Range shouldn't use ]10..; there unless we're sure that there's no 10 on the next page, so rather than check, we'd want it to look more like this:

Content-Range: id 5..10, marker=3
Next-Range: 10..; marker=]3, max=2

?

Edit: Gah, can't seem to get this right. We wouldn't include a range in marker in Next-Range either, right? It'd be assumed that the range started from 3 in the example above?


geemus commented Jul 22, 2015

Yeah, I think whatever we end up with, it should be such that the client can just pass Next-Range. Since the markers in S3 already give you back things AFTER the marker, but not including the marker (if I'm reading that right), I think the marker values in the examples could just be 3 without needing the exclusivity marker, though perhaps it would be good to include it to be explicit; I think it would probably always be exclusive. I don't know, this seems like a pretty weird conglomeration of barely associated things, which worries me. I can't say I've figured out a good alternative just yet, though.


bjeanes commented Jul 22, 2015

Marker seems like a close precedent, though S3 uses it to set a boundary on the primary sort order. Nonetheless, it's worth thinking about applying it here! I appreciate you dwelling on this for us @geemus!


geemus commented Jul 23, 2015

Still working on this, but ran out of time as I leave for vacation. A couple thoughts (though I'm midstream in research/pondering).

Many of the other implementations of pagination via ranges do it with natural numbers, more like byte ranges. I imagine this often points to incrementing primary keys, but it could also just be used as a means of expressing offsets/limits. E.g. something like releases 1-100 could indicate the first hundred in some ordering, which might actually be encoded later on (or default to primary key/uuid).

To make it more generic it could be something like resources 1-100 instead of being resource-specific. If we did it this way (with dashes instead of double dots), it could probably be done in a backwards compatible way. This undermines the level of meaning of a range and makes them somewhat less stable, but maybe that doesn't really matter.
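Interpreted as offsets, the natural-number ranges might behave like this sketch (purely illustrative; the range names positions in a server-defined total ordering, not field values):

```python
# Records already in the server's default ordering (deploys DESC, id ASC).
rows = [(123, 1), (123, 3), (4, 2)]

def resources_range(first, last):
    """Offset-style reading of 'resources 1-100': 1-indexed and
    inclusive, like a byte range, regardless of the sort column."""
    return rows[first - 1:last]

print(resources_range(1, 2))  # [(123, 1), (123, 3)]
print(resources_range(3, 3))  # [(4, 2)]
```

The instability mentioned above shows up here directly: an insert between requests shifts every position after it, so the same numeric range can return different records.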

Anyway, I have a ton of tabs open that I'm still working through to try to get something solidified, but I'm also just out of time (it's already a couple hours past my EOD). I wanted to share where I was for reference, and so that I can hopefully pick the thread back up more easily upon my return.


geemus commented Jan 7, 2016

See also a detailed discussion around a possible way to work around this (and the many tradeoffs/issues): interagent/pliny#234


geemus commented May 23, 2016

Here is a proposal I wrote a little while back that would be much closer to the original Range stuff: https://gist.github.com/geemus/8dfc01bd99367346c25891c915327220

gudmundur commented

I know there were some performance concerns around using the object offset for pagination. But I do wonder how large the dataset has to be for that to become an actual issue.

On interagent/pliny#234 I proposed a design using the unique identifiers with row value expressions. The only question remaining was what separator to use for the ranges.

It seems to me that the performance of pagination is highly dependent on the underlying data store, so I'm not sure we can have a one-size-fits-all solution.
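The row-value-expression approach from the pliny proposal can be sketched in SQLite, which supports row values as of version 3.15 (table and column names are illustrative; the portable expansion is `deploys > ? OR (deploys = ? AND id > ?)`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE apps (id INTEGER PRIMARY KEY, deploys INTEGER)")
conn.executemany("INSERT INTO apps VALUES (?, ?)",
                 [(1, 1), (2, 2), (4, 2), (3, 3)])

# Row-value comparison: resume strictly after the boundary pair
# (deploys=2, id=2), matching the composite-key range semantics.
rows = conn.execute(
    "SELECT deploys, id FROM apps WHERE (deploys, id) > (?, ?) "
    "ORDER BY deploys, id LIMIT 2", (2, 2)
).fetchall()
print(rows)  # [(2, 4), (3, 3)]
```

With an index on (deploys, id), the planner can satisfy this as a range scan rather than an offset scan, which is where the keyset approach pays off on large datasets.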
