New paging api for Cdx Server #309

Open
johnerikhalse opened this issue Feb 24, 2016 · 0 comments

Current paging API is described here: https://github.com/iipc/openwayback/tree/master/wayback-cdx-server-webapp#pagination-api.

The paging api has some shortcomings:

  • Only available with Zipnum indexes
  • Can only use one index at a time
  • Some queries give different results compared to not using paging
  • Client must do url manipulation

This proposal tries to overcome those shortcomings, but may as a consequence impose some restrictions on the flexibility of the current api.

The assumption is that this functionality isn't needed for replay engines like OpenWayback, but rather for processing by tools like map-reduce. For such tools, high throughput and the ability to request chunks in parallel are essential.

The proposal is to remove the query parameters page, showNumPages, pageSize and showPagedIndex, and instead create a separate path (for example /bulk or /batch) for getting large result sets. From now on I will use paging api to refer to the current api and bulk api to refer to the new proposal. In the bulk api I will refer to batches as the rough equivalent of pages.

Using the bulk api is a two-step process. First, submit a query for the data set. This query won't return any data, but a list of links to batches of data. Then submit each of those urls to actually get the data you want. These urls are absolute, fully qualified urls, so the client does not need to know anything about their semantics.
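
The two-step flow could look something like this Python sketch. The response shape (batch urls under a `batches` key) and the host/path are assumptions for illustration, not a finalized format; the `fetch` callable stands in for a real HTTP request:

```python
def run_bulk_query(initial_response, fetch):
    """Two-step bulk flow: the initial query returns only batch urls
    (assumed here to be a list under a 'batches' key); each batch url
    is then fetched to obtain the actual captures. 'fetch' is any
    callable mapping a url to its parsed response body."""
    captures = []
    for batch_url in initial_response["batches"]:
        # Batch urls are fully qualified; the client treats them as opaque.
        captures.extend(fetch(batch_url))
    return captures

# Usage with a stubbed fetch standing in for real HTTP requests:
fake_responses = {
    "http://cdx.example.org/bulk/b0": ["capture-a", "capture-b"],
    "http://cdx.example.org/bulk/b1": ["capture-c"],
}
initial = {"batches": list(fake_responses)}
result = run_bulk_query(initial, fake_responses.__getitem__)
```

Because the batch urls are opaque to the client, the second step is just a loop over whatever the server handed back.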

To enable the bulk api on indexes without secondary indexes, and on multiple indexes without the need to scan through the whole index, it is not possible to control how many batches the request will be split into. That decision is entirely up to the server. It is, however, possible to limit the number of captures returned for each http request. This is done with the limit parameter, which doesn't alter the batch size, but merely lets a request return part of a batch together with a resumption url that is used to get the next part. The parts of a batch can only be fetched in sequence, but separate batches can be fetched in parallel. Using the limit parameter and resumption url serves the same purpose as HTTP 1.1 chunked transfer encoding. Maybe that should be used instead.
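
Following a batch's resumption urls might then look like this sketch; the `captures` and `resumption-url` response keys are assumed names, and `fetch` again stands in for a real HTTP request:

```python
def fetch_whole_batch(first_url, fetch):
    """Follow one batch's resumption urls sequentially. Each response
    is assumed to carry its captures plus an optional 'resumption-url'
    pointing at the next part; parts of one batch must be fetched in
    order, while separate batches may be fetched in parallel."""
    captures, url = [], first_url
    while url is not None:
        part = fetch(url)
        captures.extend(part["captures"])
        url = part.get("resumption-url")  # absent on the final part
    return captures

# Stubbed parts of a single batch, as a server might return with limit=2:
parts = {
    "http://cdx.example.org/bulk/b0": {
        "captures": ["c1", "c2"],
        "resumption-url": "http://cdx.example.org/bulk/b0?part=2",
    },
    "http://cdx.example.org/bulk/b0?part=2": {"captures": ["c3"]},
}
batch = fetch_whole_batch("http://cdx.example.org/bulk/b0", parts.__getitem__)
```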

Each batch result is required to be sorted internally, but results from different batches may overlap. So if the order of the total result is important, sorting must be applied by the client, for example as a reduce operation in a map-reduce tool. This is done to avoid having to merge indexes on the server.
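
Since each batch is internally sorted but key ranges may overlap across batches, a client that needs a globally ordered result can do a streaming merge rather than a full re-sort. A minimal sketch, with made-up CDX-like keys:

```python
import heapq

# Two internally sorted batches whose key ranges interleave:
batch_a = ["20160101/a", "20160301/c"]
batch_b = ["20160201/b", "20160401/d"]

# heapq.merge lazily merges already-sorted iterables, so the client
# never has to materialize and re-sort the combined result.
merged = list(heapq.merge(batch_a, batch_b))
```

In a map-reduce setting this merge is exactly what the shuffle/reduce phase provides for free.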

The query parameters allowed to refine the request should be carefully evaluated so that they do not impose a heavy load on the server, regardless of index type.

So far the following query parameters are considered ok:

  • url - the url to query, allowing wildcards like *.example.com and example.com/somepath/*
  • from, to - restricting the time range
  • filter - filter out unwanted captures
  • fields - change the number and order of fields to return
  • limit - limit the maximum number of captures for each response
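
An initial bulk request built from the parameters above could look like this; the /bulk path, host, and field names are illustrative values from this proposal, not a fixed syntax:

```python
from urllib.parse import urlencode

# The parameters considered ok for the bulk api; values are examples.
params = {
    "url": "example.com/somepath/*",
    "from": "2015",
    "to": "2016",
    "filter": "statuscode:200",
    "fields": "urlkey,timestamp,original",
    "limit": "1000",
}
query_url = "http://cdx.example.org/bulk?" + urlencode(params)
```

The response to this url would contain only the list of batch urls, never captures.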

All of the parameters allowed for batch requests are added to the initial request, but they only influence the results of the secondary requests, which fetch the actual data. The format of the urls for secondary requests is implementation-specific, and altering the secondary requests is not allowed.

Functions that need to compare captures are not possible without processing the whole request on the server, and are therefore not part of the api. From the paging api, this applies only to collapsing and sorting.
