Hi, I wanted to give feedback on the CDX Server requirements wiki page.
There's not really a good way to comment on the page though, so rather than just editing the wiki page, I thought it'd be easier to start a conversation as an issue. Feedback follows as comments.
As part of making the CDX-Server the default index engine for OpenWayback we need to clean up and formally define the API for the CDX-Server. This document is meant as a workplace for defining those APIs.
I think that's a great idea, especially since this API can be shared across multiple implementations, not just OpenWayback.
The CDX-Server API, as it is today, is characterized by a relatively close link to how the underlying CDX format is implemented. Functionality varies depending on whether you are using traditional flat CDX files or compressed zipnum clusters. One of the nice things about having a CDX Server is that it separates the API from the underlying implementation. This way it would be relatively easy to implement indexes based on other technologies in the future. As a consequence we should avoid implementing features just because they are easy to do with a certain format if there is no real need for them. The same feature might be hard to implement on other technologies.
The intent was to keep it separate (and there is support for different output formats, e.g. JSON lines). The zipnum cluster does provide extra APIs, such as pagination, but that is mostly because pagination is otherwise technically difficult without a secondary index; nothing ties it to the zipnum cluster implementation in particular. The 'secondary index' is presented as a separate concept and perhaps could be abstracted out further.
The API should also try to avoid giving the user conflicting options. For example, in the current API it is possible to indicate match type both with a parameter and with a wildcard. It is then possible to set matchType=prefix and at the same time use a wildcard indicating matchType=domain.
Sure, the wildcard query was added as a 'shortcut' in place of the matchType query, 'syntactic sugar', but if people feel strongly about removing one or the other, I don't think it's a big deal.
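If both forms are kept, one way to avoid the conflict is to resolve the wildcard to a match type up front and reject contradictory requests. This is only a rough sketch: the function name is hypothetical, and the wildcard conventions (a leading `*.` meaning domain, a trailing `*` meaning prefix) are assumptions that may not match every implementation.

```python
def resolve_match_type(url, match_type=None):
    """Return (url_without_wildcard, match_type).

    Assumed conventions (not taken from a spec):
      "*.example.com"   implies matchType=domain
      "example.com/a/*" implies matchType=prefix
    """
    implied = None
    if url.startswith("*."):
        url, implied = url[2:], "domain"
    elif url.endswith("*"):
        url, implied = url[:-1], "prefix"

    # Reject a request whose wildcard and explicit parameter disagree.
    if implied and match_type and implied != match_type:
        raise ValueError(
            f"wildcard implies matchType={implied}, "
            f"but matchType={match_type} was given"
        )
    return url, match_type or implied or "exact"
```

With this approach the wildcard stays as syntactic sugar, and a request such as `*.example.com` with matchType=prefix fails early instead of being silently interpreted one way or the other.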
The following is a list of use-cases seen from the perspective of a user. Many of the use-cases are described as expectations of the GUI of OpenWayback, but are meant to help the understanding of the CDX-Server's role. For each use-case we need to understand what functionality the CDX-Server is required to support. CDX-Server functionality with no supporting use-case should not be implemented in OpenWayback 3.0.0.
This is a work in progress. Edits and comments are highly appreciated.
The CDX Server API was not just designed for GUI access in OpenWayback, but a more general API for querying web archives. The interactions from a GUI in OpenWayback should be thought of as a subset of the functionality that the API provides. Everything that was in the API had a specific use case at one point or another.
As a starting point, the CDX Server API provides two APIs that are defined by Memento:
- TimeGate Query = closest timestamp match
- TimeMap Query = list of all captures for a given url
The closest match functionality is designed to make fallback easy: if replay of the closest memento fails, the next closest can be tried, and so forth.
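To illustrate the fallback ordering, here is a minimal sketch that sorts captures by their distance in time from the requested timestamp. The function names are hypothetical, and it assumes captures are identified by 14-digit `YYYYMMDDhhmmss` timestamps.

```python
from datetime import datetime


def parse_ts(ts):
    """Parse a 14-digit timestamp such as 20110501000000."""
    return datetime.strptime(ts, "%Y%m%d%H%M%S")


def closest_order(timestamps, target):
    """Return timestamps ordered from closest to farthest from target."""
    t = parse_ts(target)
    return sorted(
        timestamps,
        key=lambda ts: abs((parse_ts(ts) - t).total_seconds()),
    )
```

A replay system could walk this list in order, trying the first entry and falling back to the next one whenever a capture fails to load.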
Another use case was better support for the prefix query, where the result is a list of unique urls per prefix, each followed by the start date, end date and count. The query can then be continued to get more results from where the previous query ended.
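The per-url summary could be computed roughly like this. It is a sketch only: the function name is hypothetical, and it assumes the index rows arrive as `(urlkey, timestamp)` pairs already sorted by url key, as they would be in a sorted CDX file.

```python
from itertools import groupby


def summarize_by_url(rows):
    """Yield (urlkey, first_timestamp, last_timestamp, count) per unique url.

    rows must be sorted by urlkey so that groupby sees each url once.
    """
    for urlkey, group in groupby(rows, key=lambda row: row[0]):
        stamps = [ts for _, ts in group]
        # 14-digit timestamps compare correctly as strings.
        yield urlkey, min(stamps), max(stamps), len(stamps)
```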
Another important use case is parallel bulk querying, which can be used for data extraction.
For example, a user may wish to extract all captures by host, prefix, or domain across a very large archive. The user can create a MapReduce job to query the CDX server in parallel, where each map task sets the page value. (Implementations of this use case already exist in several forms.)
The difference between the bulk query and the regular prefix query is that the pagination API allows you to query a large dataset in parallel, instead of continuing from where the previous query left off.
This requires pagination support, which currently requires the zipnum cluster. In theory it could be supported without one; it would just take a lot more work to sample the cdx to determine the page distribution.
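As a sketch of how the parallel tasks might be set up, the snippet below builds one query per page so that each map task can fetch its own page independently. The endpoint is a placeholder, the function name is hypothetical, the parameter names `url`, `matchType` and `page` are assumed, and the number of pages is assumed to be known in advance.

```python
from urllib.parse import urlencode


def page_queries(endpoint, url, num_pages, match_type="prefix"):
    """Return one query URL per page; pages are numbered from 0 here."""
    queries = []
    for page in range(num_pages):
        params = {"url": url, "matchType": match_type, "page": page}
        queries.append(f"{endpoint}?{urlencode(params)}")
    return queries
```

Because each query carries its own page number, no task depends on the result of another, which is what distinguishes this from resuming a prefix query.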
Another use case was resolving revisit records, if the original was the same url, in a single pass, to avoid having to do a second lookup. This is done by appending the original record as extra fields.
This may not be as useful if most deduplication is 'url agnostic'.
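For the same-url case, the single-pass resolution of revisit records could look roughly like the following. The field names (`mimetype`, `digest`, `filename`, `orig_timestamp`, `orig_filename`) and the `warc/revisit` marker are assumptions for illustration, and records are assumed to arrive in timestamp order.

```python
def resolve_revisits(records):
    """Append the original capture's fields to same-url revisit records."""
    originals = {}  # (url, digest) -> first non-revisit record seen
    resolved = []
    for rec in records:
        rec = dict(rec)  # copy so the caller's records are not modified
        key = (rec["url"], rec["digest"])
        if rec["mimetype"] == "warc/revisit":
            orig = originals.get(key)
            if orig is not None:
                rec["orig_timestamp"] = orig["timestamp"]
                rec["orig_filename"] = orig["filename"]
        else:
            originals.setdefault(key, rec)
        resolved.append(rec)
    return resolved
```

A revisit whose original was captured under a different url finds no match here and is passed through unchanged, which is the 'url agnostic' limitation noted above.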
Use-cases
1. The user has a link to a particular version of a document
This case could be a user referencing a document from a thesis. It is important that the capture referenced is exactly the one the user used when writing the thesis. In this case the user should get the capture that exactly matches both the url and timestamp.
This is more of a replay system option, rather than cdx query.
What happens if the exact url doesn't exist? There is no way to guarantee an exact match by url and timestamp alone; you would also need the digest. You can filter by url, timestamp and digest with the cdx server, but not with a replay (archival url) format.
2. The user selects one particular capture in the calendar
Pretty much the same as above, but it might be allowed to return a capture close in time if the requested capture is missing.
I don't think this is the same as above at all; it is the closest capture/timegate behavior. An option could be added to disable closest match and only do exact match, but again, this is a replay system option, not a cdx server option.
3. Get the best matching page when following a link
User is looking at a page and wants to follow a link by clicking it. User then expects to be brought to the closest-in-time capture of the new page.
4. Get the best match for embedded resources
Similar to above, but user is not involved. This is for loading embedded images and so on.
It seems that these all fall under the 'closest match' / Memento TimeGate use case
5. User requests/searches for an exact url without any timestamp, expecting to get a summary of captures for the url over time
The summary of captures might be presented in different ways, for example a list or a calendar.
Yep, this is the Memento TimeMap use case.
6. User looks up a domain expecting a summary of captures over time
7. User searches with a truncated path expecting the results to show up as matching paths regardless of time
8. User searches with a truncated path expecting the results to show up as matching paths regardless of time and subdomain
These are all different examples of the prefix query use case.
9. User navigates back and forth in the calendar
This is already possible with the timemap query, right?
But, could also add an "only after" or "only before" query, to support navigating in one direction explicitly.
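A one-directional query of that kind might behave like this sketch. The function name and the `direction` values are hypothetical, and it again assumes 14-digit timestamps, which sort correctly as strings.

```python
def captures_in_direction(timestamps, current, direction):
    """Return captures strictly before or after current, nearest first."""
    if direction == "after":
        return sorted(ts for ts in timestamps if ts > current)
    if direction == "before":
        return sorted((ts for ts in timestamps if ts < current), reverse=True)
    raise ValueError("direction must be 'before' or 'after'")
```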
10. User wants to see when content of a page has changed
This seems more like a replay api, as cdx server is not aware of embeds or relationships between different urls.