[OAI-implementers] resumptionToken Implementation

Tue Sep 28 18:03:23 EDT 2004

> Sample case:
> 
> Harvester issues a query. DP sends back 100 out of 10,000 results.
> Harvester then begins to request the consecutive chunks. Given that the
> total data set is 10,000, this will probably take a while. Before the
> entire result set is transferred, the DP updates its repository, which
> shuffles the order in which the results are returned. Objects that were
> transferred previously are now kicked back to a later position, so they
> are included in a chunk later requested by the harvester.
> 
> Does the DP now invalidate the resumptionToken or does it assume the 
> Harvester will de-dupe objects on its side?
> 
> What about the new objects that have been added and are in chunks of 
> the resultset already transferred? Is it assumed that they will be 
> caught the next time around given that the modifydate SHOULD be later 
> than the last harvest date? Or is it the harvester's responsibility to
> straighten this all out?

DSpace uses the 'stateless' approach - see http://dspace.org/technology/system-docs/application.html#oai and scroll down a bit.  The sorting is done by (internal database) ID, so de-duping shouldn't be an issue for the harvester.  However, your corner case may well cause a problem or a weird side-effect.

Say you're harvesting date range X-Y.  When you first issue the request, a certain set of items have a 'last modified' date within that range, so DSpace returns a load, and a resumption token.  If some items are then modified so that their 'last modified' date is outside the date range X-Y, they'll drop off that list, so suddenly item Z that was result number 101 of those items is now result number 99, and the next harvest request will miss it, since DSpace will think that Z was already served up in the first request.
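To make that concrete, here is a rough Python simulation of the failure. It is only a sketch under assumptions: the `harvest_page` function, the integer 'dates' and the offset-style resumption token are made up for illustration and are not DSpace's actual code.

```python
def harvest_page(items, date_from, date_to, offset, page_size):
    """Stateless paging: re-run the query, sort by ID, skip `offset` results."""
    matching = sorted(
        item_id for item_id, modified in items.items()
        if date_from <= modified <= date_to
    )
    page = matching[offset:offset + page_size]
    more = offset + page_size < len(matching)
    next_offset = offset + page_size if more else None
    return page, next_offset


# Hypothetical repository: item ID -> last-modified date (plain integers).
items = {item_id: 5 for item_id in range(1, 7)}

# First request for date range 1-10, three items per chunk.
first, token = harvest_page(items, 1, 10, offset=0, page_size=3)

# Between requests, items 1 and 2 are modified and leave the date range.
items[1] = 20
items[2] = 20

# The resumed request re-runs the query, so every later item has shifted up.
second, token = harvest_page(items, 1, 10, offset=token, page_size=3)

harvested = first + second
missed = [item_id for item_id in (4, 5) if item_id not in harvested]
```

Items 4 and 5 were never touched and are still inside the range, but they end up in `missed`, because the offset in the token now points past them.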

DSpace would probably work OK in the scenario you've mentioned if the date range specified is X-(present) or no date range; results are sorted by ID so the order wouldn't change, new items would appear at the end of the list and updated items wouldn't have 'moved'.
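A small Python sketch of why the open-ended case holds together, again with invented names (`harvest_page_from`, integer 'dates', an offset-style token) and on the assumption that new items always get a higher ID than existing ones:

```python
def harvest_page_from(items, date_from, offset, page_size):
    """Stateless paging with no upper date bound, results sorted by ID."""
    matching = sorted(
        item_id for item_id, modified in items.items()
        if modified >= date_from
    )
    page = matching[offset:offset + page_size]
    more = offset + page_size < len(matching)
    next_offset = offset + page_size if more else None
    return page, next_offset


# Hypothetical repository: item ID -> last-modified date (plain integers).
items = {item_id: 5 for item_id in range(1, 7)}

first, token = harvest_page_from(items, 1, offset=0, page_size=3)

# Mid-harvest: item 2 is updated (still in range) and item 7 is added.
items[2] = 20
items[7] = 21

second, token = harvest_page_from(items, 1, offset=token, page_size=3)
third, token = harvest_page_from(items, 1, offset=token, page_size=3)

harvested = first + second + third
```

Nothing is skipped or duplicated here, and the new item simply shows up in the last chunk. The copy of item 2 served in the first chunk is the pre-update version, though, so the harvester would only see the change on a later harvest.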

Deleted items might be a bit yucky though...

Maybe you could 'freeze' a result set when a harvest comes in, but that may not scale up when your repository is huge and the number of harvests is large (caching dozens of 100,000-long result sets?)
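For what it's worth, a 'frozen' result set could look roughly like the Python sketch below. Everything in it (the `snapshots` dict, `start_harvest`, `next_page`, the random token) is hypothetical and only meant to show where the memory goes: one full ID list is held per harvest in progress.

```python
import uuid

# token -> frozen, ordered list of item IDs (one full list per open harvest)
snapshots = {}


def next_page(token, offset, page_size):
    """Serve one chunk from the frozen list; drop the list after the last chunk."""
    ids = snapshots[token]
    page = ids[offset:offset + page_size]
    new_offset = offset + page_size
    if new_offset >= len(ids):
        del snapshots[token]
        return page, None, None
    return page, token, new_offset


def start_harvest(items, date_from, date_to, page_size):
    """Run the query once, freeze the matching IDs, and serve the first chunk."""
    ids = sorted(
        item_id for item_id, modified in items.items()
        if date_from <= modified <= date_to
    )
    token = uuid.uuid4().hex
    snapshots[token] = ids
    return next_page(token, 0, page_size)


# Hypothetical repository: item ID -> last-modified date (plain integers).
items = {item_id: 5 for item_id in range(1, 7)}

first, token, offset = start_harvest(items, 1, 10, page_size=3)

# Items 1 and 2 leave the date range mid-harvest; the frozen list is unaffected.
items[1] = 20
items[2] = 20

second, token, offset = next_page(token, offset, page_size=3)
```

The second chunk is still items 4 to 6, so nothing is missed, but an abandoned harvest would leave its list sitting in `snapshots` unless something expires it.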

Solutions on a postcard to...

 Robert Tansley / Digital Media Systems Programme / HP Labs
  http://www.hpl.hp.com/personal/Robert_Tansley/