New paging api for Cdx Server #309

Open
johnerikhalse opened this issue Feb 24, 2016 · 0 comments

Current paging API is described here: https://github.com/iipc/openwayback/tree/master/wayback-cdx-server-webapp#pagination-api.

The paging api has some shortcomings:

  • Only available with Zipnum indexes
  • Can only use one index at a time
  • Some queries give different results compared to not using paging
  • Client must do url manipulation

This proposal tries to overcome those shortcomings, but may as a consequence impose some restrictions on the flexibility of the current api.

The assumption is that this functionality isn't needed for replay engines like OpenWayback, but rather for processing by tools like map-reduce. For such tools, high throughput and the ability to request chunks in parallel are essential.

The proposal is to remove the query parameters page, showNumPages, pageSize and showPagedIndex, and instead create a separate path (for example /bulk or /batch) for getting large result sets. From now on I will use paging api to refer to the current api and bulk api to refer to the new proposal. In the bulk api I will refer to batches as the rough equivalent of pages.

Using the bulk api is a two-step process. First, submit a query for the data set. This query won't return any data, but a list of links to batches of data. Then submit each of those urls to actually get the data you want. These urls are absolute, fully qualified urls, so the client does not need to know anything about their semantics.
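
The two-step flow could look something like this Python sketch. The response shape (batch urls under a `batches` key) and the host/path are assumptions for illustration, not a finalized format; the `fetch` callable stands in for a real HTTP request:

```python
def run_bulk_query(initial_response, fetch):
    """Two-step bulk flow: the initial query returns only batch urls
    (assumed here to be a list under a 'batches' key); each batch url
    is then fetched to obtain the actual captures. 'fetch' is any
    callable mapping a url to its parsed response body."""
    captures = []
    for batch_url in initial_response["batches"]:
        # Batch urls are fully qualified; the client treats them as opaque.
        captures.extend(fetch(batch_url))
    return captures

# Usage with a stubbed fetch standing in for real HTTP requests:
fake_responses = {
    "http://cdx.example.org/bulk/b0": ["capture-a", "capture-b"],
    "http://cdx.example.org/bulk/b1": ["capture-c"],
}
initial = {"batches": list(fake_responses)}
result = run_bulk_query(initial, fake_responses.__getitem__)
```

Because the batch urls are opaque to the client, the second step is just a loop over whatever the server handed back.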

To enable the bulk api on indexes without secondary indexes, and on multiple indexes without the need to scan through the whole index, it is not possible to control how many batches the request will be split into. That decision is entirely up to the server. It is, however, possible to limit the number of captures returned for each http request. This is done with the limit parameter, which doesn't alter the batch size, but merely lets a request return part of a batch together with a resumption url that is used to get the next part. The parts of a batch can only be fetched in sequence, but separate batches can be fetched in parallel. Using the limit parameter and resumption url serves the same purpose as HTTP 1.1 chunked transfer encoding. Maybe that should be used instead.
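
Following a batch's resumption urls might then look like this sketch; the `captures` and `resumption-url` response keys are assumed names, and `fetch` again stands in for a real HTTP request:

```python
def fetch_whole_batch(first_url, fetch):
    """Follow one batch's resumption urls sequentially. Each response
    is assumed to carry its captures plus an optional 'resumption-url'
    pointing at the next part; parts of one batch must be fetched in
    order, while separate batches may be fetched in parallel."""
    captures, url = [], first_url
    while url is not None:
        part = fetch(url)
        captures.extend(part["captures"])
        url = part.get("resumption-url")  # absent on the final part
    return captures

# Stubbed parts of a single batch, as a server might return with limit=2:
parts = {
    "http://cdx.example.org/bulk/b0": {
        "captures": ["c1", "c2"],
        "resumption-url": "http://cdx.example.org/bulk/b0?part=2",
    },
    "http://cdx.example.org/bulk/b0?part=2": {"captures": ["c3"]},
}
batch = fetch_whole_batch("http://cdx.example.org/bulk/b0", parts.__getitem__)
```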

Each batch result is required to be sorted internally, but results from different batches may overlap. So if the order of the total result is important, sorting must be applied by the client, for example as a reduce operation in a map-reduce tool. This is done to avoid having to merge indexes on the server.
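
Since each batch is internally sorted but key ranges may overlap across batches, a client that needs a globally ordered result can do a streaming merge rather than a full re-sort. A minimal sketch, with made-up CDX-like keys:

```python
import heapq

# Two internally sorted batches whose key ranges interleave:
batch_a = ["20160101/a", "20160301/c"]
batch_b = ["20160201/b", "20160401/d"]

# heapq.merge lazily merges already-sorted iterables, so the client
# never has to materialize and re-sort the combined result.
merged = list(heapq.merge(batch_a, batch_b))
```

In a map-reduce setting this merge is exactly what the shuffle/reduce phase provides for free.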

The query parameters allowed to refine the request should be carefully evaluated so that they do not impose a heavy load on the server, regardless of index type.

So far the following query parameters are considered ok:

  • url - the url to query, allowing wildcards like *.example.com and example.com/somepath/*
  • from, to - restricting the time range
  • filter - filter out unwanted captures
  • fields - change the number and order of fields to return
  • limit - limit the maximum number of captures for each response
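
An initial bulk request built from the parameters above could look like this; the /bulk path, host, and field names are illustrative values from this proposal, not a fixed syntax:

```python
from urllib.parse import urlencode

# The parameters considered ok for the bulk api; values are examples.
params = {
    "url": "example.com/somepath/*",
    "from": "2015",
    "to": "2016",
    "filter": "statuscode:200",
    "fields": "urlkey,timestamp,original",
    "limit": "1000",
}
query_url = "http://cdx.example.org/bulk?" + urlencode(params)
```

The response to this url would contain only the list of batch urls, never captures.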

All of the parameters allowed for batch requests are added to the initial request, but they only influence the results of the secondary requests, which fetch the actual data. The format of the urls for secondary requests is implementation-specific, and altering the secondary requests is not allowed.

Functions that need to compare captures are not possible without processing the whole request on the server, and are therefore not part of the api. From the paging api, this applies only to collapsing and sorting.
