[OAI-implementers] Better resumption mechanism - more important than ever!
Alan Kent
ajk@mds.rmit.edu.au
Wed, 6 Mar 2002 18:11:38 +1100
On Tue, Mar 05, 2002 at 10:42:26PM -0500, Michael L. Nelson wrote:
> >Does OAI 2.0 say that resumptionToken's must be unique within
> >a download? And that reusing an old resumptionToken must be
> >supported (or rejected with an error)? If not guaranteed by
> >the spec, then I would not want to write a harvester relying
> >on it. I would rather spend the effort and get the spec right
> >rather than having to come to agreements with individual data
> >providers.
>
> I don't think the spec currently requires that a repository reject expired
> resumptionTokens, but a harvester would be wise not to use them if they
> are expired. its like drinking milk a day or two after the expiration
> date: its *probably* ok, but you gotta be pretty thirsty to do it.
This is not quite what I meant. It's more that if my resumption token
is just a result set name and the DP remembers where the transfer is
up to, then reusing the same token would get the next N records.
The current spec allows this.
I think OAI 2.0 should allow a DP to advertise (via Identify?) that
its resumptionTokens can be reused (are idempotent) to retry.
That would satisfy me.
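A minimal sketch of the two behaviours, to make the distinction
concrete. The class names, the in-memory record list and the
"name:offset" token format are all assumptions made for illustration;
nothing here comes from the OAI spec or from a real DP.

```python
class StatefulProvider:
    """Token is only a result set name; the DP remembers the position."""

    def __init__(self, records, batch_size):
        self.records = records
        self.batch_size = batch_size
        self.position = {}  # result set name -> next offset (server-side state)

    def fetch(self, token):
        start = self.position.get(token, 0)
        batch = self.records[start:start + self.batch_size]
        # The stored position advances on every call, so replaying the
        # same token returns the NEXT batch, not the one that was lost.
        end = start + len(batch)
        self.position[token] = end
        next_token = token if end < len(self.records) else None
        return batch, next_token


class IdempotentProvider:
    """Token carries the offset itself, so replaying it is safe."""

    def __init__(self, records, batch_size):
        self.records = records
        self.batch_size = batch_size

    def fetch(self, token):
        # Hypothetical token format: "<result set name>:<offset>".
        name, offset = token.rsplit(":", 1)
        start = int(offset)
        batch = self.records[start:start + self.batch_size]
        end = start + len(batch)
        next_token = f"{name}:{end}" if end < len(self.records) else None
        return batch, next_token
```

With the first class a harvester that retries after a dropped
connection silently skips a batch; with the second it gets the same
batch again.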
> > There is a difference, but is the difference worth the complexity
> > to the protocol? That is a different question.
>
> I'll rephrase my answer: the repository can implement it so there is no
> difference.
I agree that a DP can implement idempotent resumption tokens, but how
does a harvester know that the DP has implemented them? Either OAI 2.0
must mandate it (possibly overly restrictive for smaller repositories),
or DPs must be able to advertise it in a standard way, such as in the
Identify response.
So not much needs to be done, but something does need to be done.
The 1.1 spec at present is not enough.
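As a rough sketch of what the harvester side could look like: the
element name resumptionTokenPolicy, its idempotent attribute and the
example.org namespace below are invented for illustration. No such
element is defined by the protocol; the sample response exists only to
show where an advertisement might sit.

```python
import xml.etree.ElementTree as ET

# Hypothetical Identify response. The resumptionTokenPolicy element is
# NOT part of any OAI spec; it stands in for whatever a DP would advertise.
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <Identify>
    <repositoryName>Example Repository</repositoryName>
    <description>
      <resumptionTokenPolicy xmlns="http://example.org/hypothetical"
                             idempotent="true"/>
    </description>
  </Identify>
</OAI-PMH>"""


def tokens_are_idempotent(identify_xml):
    """Return True only if the DP explicitly advertises idempotent tokens."""
    root = ET.fromstring(identify_xml)
    for elem in root.iter():
        # Strip the "{namespace}" prefix so the check is namespace-agnostic.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "resumptionTokenPolicy":
            return elem.get("idempotent", "false").lower() == "true"
    # Nothing advertised: assume that replaying a token is not safe.
    return False
```

Defaulting to False when nothing is advertised keeps the harvester
conservative against DPs that say nothing either way.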
> this is an artifact of your implementation... write the result set out to
> disk and set the expirationDate to a few days. add a reasonable response
> caching algorithm, and you could end up with a huge performance
> win. Depending on the DP accession rate, harvesting patterns, etc., your
> mileage could vary, but I suspect it would be very good.
I would never write the result set out to disk. For a very large
result set (e.g. 10,000,000 records), I would have to fetch all the
records (lots of disk accesses), get their OAI identifiers, and only
then start transferring. Then how long should the temporary file be
kept around? How many people might be doing transfers at the same time?
(A Z39.50 result set is not a client side data structure, but
a server side data structure by the way.)
But Liu had a good solution: just store both what I called
resumptionToken and restartToken in the resumptionToken, i.e. the
result set name and the query. If the result set has timed out, use
the query part and build it up again. It's up to me to get it correct.
I personally would have problems with cursors and list sizes (I would
not support them, because if I redid the query the result set size may
change, and so both the size and the cursor would be invalidated). But
I can munge my own DP implementation details in there to do something
pretty similar (my own internal concept of a 'cursor').
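A minimal sketch of that idea, under stated assumptions: the JSON plus
base64 encoding, the field names rs/q/pos, and the Provider class are
all hypothetical. The integer position stands in for whatever internal
cursor a DP uses; a real DP would want something that survives a change
in result set size, such as the last identifier sent.

```python
import base64
import json


def encode_token(result_set, query, position):
    """Pack result set name, original query and internal cursor into one token."""
    payload = json.dumps({"rs": result_set, "q": query, "pos": position})
    return base64.urlsafe_b64encode(payload.encode("utf-8")).decode("ascii")


def decode_token(token):
    """Inverse of encode_token; returns a dict with keys rs, q and pos."""
    return json.loads(base64.urlsafe_b64decode(token.encode("ascii")))


class Provider:
    def __init__(self, run_query):
        self.run_query = run_query  # callable: query -> list of records
        self.result_sets = {}       # name -> records; entries may time out

    def resume(self, token, batch_size):
        state = decode_token(token)
        records = self.result_sets.get(state["rs"])
        if records is None:
            # The result set has timed out: rebuild it from the query part.
            records = self.run_query(state["q"])
            self.result_sets[state["rs"]] = records
        start = state["pos"]
        batch = records[start:start + batch_size]
        end = start + len(batch)
        if end < len(records):
            next_token = encode_token(state["rs"], state["q"], end)
        else:
            next_token = None  # transfer complete
        return batch, next_token
```

Because the token is self-describing, replaying it after the result set
has been discarded still works, at the cost of re-running the query.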
> 2.0 will already have more machine processable information in the Identify
> response. I'm not sure there is a good way around it, and since that
> door is already open, if you want to provide hints about how your
> resumptionTokens are used/implemented, that's surely ok.
OK, then I think advertising a little more about the idempotency of
resumptionTokens is enough, and everything is fine. Implementors for
large repositories should try to give resumption tokens a long
time-to-live, but no protocol change is required.
> but if their resumptionTokens had a long life, and were idempotent within
> that lifetime you would not have to start from scratch. 2.0 will allow
> the specification of the former, and we should probably discuss the latter
> some more.
Agreed. The simplest solution is (as above) to allow a server to
advertise that its resumptionTokens are idempotent.
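For the harvester, the payoff might look like the sketch below. The
function name harvest, the fetch callable and the idempotent flag are
hypothetical; the flag is assumed to come from whatever the DP
advertises.

```python
def harvest(fetch, first_token, idempotent, max_retries=3):
    """Collect all records, retrying a failed request only when it is safe.

    fetch(token) must return (records, next_token) and raise OSError on
    a transport failure; next_token is None when the list is complete.
    """
    harvested = []
    token = first_token
    while token is not None:
        attempts = 0
        while True:
            try:
                records, next_token = fetch(token)
                break
            except OSError:
                attempts += 1
                # Without an idempotency guarantee, replaying the token
                # could skip records, so give up and let the caller
                # restart the harvest from scratch.
                if not idempotent or attempts > max_retries:
                    raise
        harvested.extend(records)
        token = next_token
    return harvested
```

A long token lifetime matters for the same loop: the retry only helps
if the token is still valid when the harvester comes back.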
> you better build your system after all this! ;-)
*-)
One problem is I don't have any data to export, only data that other
people have made available. The other problem relates to the number of
hours in the day. I still want to put my harvested collection up
for public access too, if I can scrounge up the disk space.
> seriously, you bring up a lot of good points. a lot of this exchange
> should probably be reflected in the implementation guide that will
> accompany the protocol doc.
I think the conclusions, such as 'advertise idempotency, and make
resumption tokens long-lived to handle the case where a harvester hits
a problem and waits for a human to try to keep going', are worth
documenting, not the rest.
There are always the mail archives.
Alan