[OAI-implementers] resumptionToken Implementation

Wed Sep 29 05:22:59 EDT 2004

BioMed Central similarly uses a stateless approach for resumption tokens,
and I too have been concerned about long-term scalability using either:

(a) the stateless approach:
Retrieving items 999,900 to 1,000,000 of an ordered set from a database
can be a very expensive operation, and using 10,000 such 100-item
requests to retrieve a full listing from an OAI-enabled database
containing a million records is clearly vastly more expensive (in terms of
resources) than, say, downloading a compressed file containing the data for
all 1 million records in one go.

(b) a stateful approach:
Caching lots of resultsets in the middle tier doesn't really seem easily
scalable to very large sets, since cached resultsets tend to be inherently
memory-resident. A database temporary table for each new request could be
used, but would create its own resource issues.

I guess that the best that can be done is to sort items by a unique ordered
accession number/ID (which doesn't change if an item is updated) and to use
this value as the resumption token, rather than using "offset within the
ordered set" as the resumption token. This should help both reliability and
performance, since appropriate relational database indexes allow the
performance of

set=xxxx and accessionnumber>yyyy

to be tuned pretty effectively, in a way that

set=xxxx and offset>yyyyyy

cannot be.
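
To make the accession-number idea concrete, here is a minimal sketch in
Python using the standard sqlite3 module. It is only an illustration under
assumptions: the table name "records", its columns (accession, set_spec,
identifier), the index name and the function names are all hypothetical and
are not taken from BioMed Central's or DSpace's actual code. The resumption
token is simply the last accession number served, so each request is an
indexed range scan rather than an offset into the ordered set.

```python
import sqlite3


def list_identifiers(conn, set_spec, token=None, page_size=100):
    """Return (rows, next_token) for one page of a harvest.

    `token` is the last accession number already served (hypothetical
    token format); None means this is the first request.
    """
    last_seen = int(token) if token else 0
    # Fetch one extra row so we know whether another page exists.
    rows = conn.execute(
        "SELECT accession, identifier FROM records "
        "WHERE set_spec = ? AND accession > ? "
        "ORDER BY accession LIMIT ?",
        (set_spec, last_seen, page_size + 1),
    ).fetchall()
    has_more = len(rows) > page_size
    rows = rows[:page_size]
    next_token = str(rows[-1][0]) if has_more else None
    return rows, next_token


def demo():
    """Harvest a toy 250-record set in pages of 100; return the accessions seen."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records ("
        "accession INTEGER PRIMARY KEY, set_spec TEXT, identifier TEXT)"
    )
    # Composite index so that 'set = x AND accession > y' is a range scan.
    conn.execute("CREATE INDEX idx_set_accession ON records (set_spec, accession)")
    conn.executemany(
        "INSERT INTO records VALUES (?, ?, ?)",
        [(i, "bmc", f"oai:example.org:{i}") for i in range(1, 251)],
    )
    harvested, token = [], None
    while True:
        rows, token = list_identifiers(conn, "bmc", token)
        harvested.extend(row[0] for row in rows)
        if token is None:
            break
    return harvested
```

Calling demo() walks the whole set in three requests. The server keeps no
state between them, and the cost of each request does not grow with how far
into the set the harvester has got.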

Matt 
 == 
Matthew Cockerill Ph.D. 
Technical Director
BioMed Central ( http://www.biomedcentral.com/ ) 
34-42, Cleveland Street 
London 
W1T 4LB 
UK 

Tel 020 7631 9127 
Fax: 020 7580 1938 
Email: matt at biomedcentral.com 

> DSpace uses the 'stateless' approach - see 
> http://dspace.org/technology/system-docs/application.html#oai 
> and scroll down a bit.  The sorting is done by (internal 
> database) ID so de-duping shouldn't be an issue for the 
> harvester.  However your corner case may just cause a 
> problem, or weird side-effect.
> 
> Say you're harvesting date range X-Y.  When you first issue 
> the request, a certain set of items have a 'last modified' 
> date within that range, so DSpace returns a load, and a 
> resumption token.  If some items are then modified so that 
> their 'last modified' date is outside the date range X-Y, 
> they'll drop off that list, so suddenly item Z that was 
> result number 101 of those items is now result number 99, and 
> the next harvest request will miss it, since DSpace will 
> think that Z was already served up in the first request.
> 
> DSpace would probably work OK in the scenario you've 
> mentioned if the date range specified is X-(present) or no 
> date range; results are sorted by ID so the order wouldn't 
> change, new items would appear at the end of the list and 
> updated items wouldn't have 'moved'.
> 
> Deleted items might be a bit yucky though...
> 
> Maybe you could 'freeze' a result set when a harvest comes 
> in, but that may not scale up when your repository is huge 
> and the number of harvests is large (caching dozens of 
> 100,000-long result sets?)
> 
> Solutions on a postcard to...
> 
>  Robert Tansley / Digital Media Systems Programme / HP Labs
>   http://www.hpl.hp.com/personal/Robert_Tansley/
>
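
The date-range corner case described in the quoted message can be
reproduced with a toy simulation. The sketch below is an assumption-laden
illustration, not DSpace code: the IDs, the page size of 100 and the
function names are made up. Two items drop out of the date range between
the first and second request; an offset-based token then skips items that
still match, while an ID-based token does not.

```python
PAGE_SIZE = 100


def page_by_offset(ids, offset, size=PAGE_SIZE):
    """Next page when the token is 'offset within the ordered set'."""
    return ids[offset:offset + size]


def page_by_id(ids, last_id, size=PAGE_SIZE):
    """Next page when the token is the last ID already served."""
    return [i for i in ids if i > last_id][:size]


def missed_items(by_id):
    """Return the still-matching IDs that a two-request harvest never sees."""
    # At the first request, items 1..150 fall inside date range X-Y.
    before = list(range(1, 151))
    first_page = before[:PAGE_SIZE]

    # Items 5 and 6 are then modified and drop out of the range, so the
    # item with ID 101 moves from result number 101 to result number 99.
    after = [i for i in before if i not in (5, 6)]

    if by_id:
        second_page = page_by_id(after, first_page[-1])
    else:
        second_page = page_by_offset(after, PAGE_SIZE)

    harvested = set(first_page) | set(second_page)
    return sorted(set(after) - harvested)
```

Here missed_items(by_id=False) reports the items the offset token skipped,
and missed_items(by_id=True) reports none. Items 5 and 6 were already
delivered in the first page with their old datestamp, which is a separate
question from the skipped items.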