[OAI-implementers] Better resumption mechanism - more important
than ever!
Michael L. Nelson
mln@ils.unc.edu
Tue, 5 Mar 2002 10:02:43 -0500 (EST)
actually, the way I see it is the protocol should not be complicated with
additional tokens and such to enforce what ETDCat (and similarly
large-sized DPs) should do:
1. partition their collection into sets
2. use stateless (or very long lived) resumptionTokens
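as a rough illustration of point 2, a stateless resumptionToken can carry the whole list request plus a cursor, so the repository keeps no session and has nothing to time out. this is only a minimal sketch in Python: the function names, the JSON/base64 layout and the field names are assumptions made up for the example, not anything the protocol mandates.

```python
import base64
import json


def encode_token(state):
    """Pack the request state into an opaque, URL-safe token."""
    raw = json.dumps(state, sort_keys=True).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_token(token):
    """Recover the request state from a token issued earlier."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    return json.loads(raw.decode("utf-8"))


def next_token(token, batch_size):
    """Advance the cursor by one batch; no server-side session is needed."""
    state = decode_token(token)
    state["cursor"] += batch_size
    return encode_token(state)


# Hypothetical request state: set name, date range and starting offset.
first = encode_token({"set": "theses", "from": "2002-01-01",
                      "until": "2002-03-05", "cursor": 0})
second = next_token(first, 500)
print(decode_token(second)["cursor"])  # prints 500
```

a token like this can be replayed the next day, provided the repository returns records in a stable order.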
in 2.0, resumptionTokens will have optional attributes, including
"expirationDate", so this will take the guesswork out of knowing how long
a resumptionToken will be valid.
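a harvester could use that attribute roughly as follows. this is a sketch in Python under stated assumptions: that the 2.0 response places resumptionToken in the http://www.openarchives.org/OAI/2.0/ namespace and writes expirationDate as a UTC timestamp such as 2002-03-06T10:02:43Z. the sample XML and the function names are invented for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"  # assumed namespace

# Made-up response fragment, for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <resumptionToken expirationDate="2002-03-06T10:02:43Z">abc123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def read_token(xml_text):
    """Return (token, expiry) from a list response; either may be None."""
    elem = ET.fromstring(xml_text).find(".//" + OAI_NS + "resumptionToken")
    if elem is None:
        return None, None
    token = (elem.text or "").strip() or None
    stamp = elem.get("expirationDate")  # optional attribute
    expiry = None
    if stamp:
        expiry = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
        expiry = expiry.replace(tzinfo=timezone.utc)
    return token, expiry


def still_valid(expiry, now):
    """Without an expirationDate the harvester still has to guess."""
    return expiry is None or now < expiry


token, expiry = read_token(SAMPLE)
now = datetime(2002, 3, 5, 15, 0, tzinfo=timezone.utc)
print(token, still_valid(expiry, now))  # prints: abc123 True
```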
IMO, introducing an optional restartToken is no different (from an
implementer's point of view) than making the resumptionToken last a long
time.
at some point, you (as a harvester) are simply at the mercy of the
repository. new features in the protocol won't change that.
regards,
Michael
On Tue, 5 Mar 2002, 'Alan Kent' wrote:
> I just got some mail from Jeff at OCLC talking about ETDCat (hope
> you don't mind me quoting some of your mail Jeff). In particular,
> he just told me
>
> ETDCat contains a lot of records (over 4 million), all of
> which currently have the exact same datestamp from the initial load.
>
> He also told me that there were no sets. So basically, it's all
> or nothing for this site because OAI has no standard way to resume
> if a transfer fails.
>
> If this has happened already, I think it's likely to occur again.
> (That is, one very large database all with the same time stamp.)
> So any comments about having a single large collection like this
> are beside the point. The point is OAI does not handle it well.
>
> So I would like to resurrect the discussion again if people don't
> mind on how to support restarts. My understanding of the general
> feeling so far is
>
> (1) Mandating support is not going to be acceptable
>
> (2) Mandating format of resumption tokens is not going to be acceptable
>
> (3) Mandating resumption tokens be long lived (e.g. can try again the
> following day) is not acceptable
>
> (4) In fact, mandating that resumption tokens be unique (allowing
> a token to be reused twice in quick succession to get the same
> data) is not acceptable
>
> So any proposal needs to be optionally supported.
>
> Question time:
>
> Does anyone else think that this is a major hole in OAI? I personally
> do. After trying to crawl sites, things go wrong. The larger the site,
> the greater the probability that something will go wrong. The larger
> the site, the greater the pain of starting all over again. I do not
> think it is practical for anyone to harvest ETDCat if it has really got
> 4,000,000 records. Any fault, and start downloading that 4 GB again!
> So I feel strongly on this one. In fact, I think this is the most
> major problem OAI has.
>
> Do people think it's better to reuse resumption tokens for this purpose,
> or introduce a different sort of token? ETDCat for example I think
> allocates a session id in resumption tokens, meaning they cannot
> be reused when the session times out in the server (similar semantics
> anyway). This is a reasonable implementation decision to make.
> So maybe it's better for servers to return an additional token,
> which is a <restartToken> which means a client can instead of
> specifying from= and until= again, specify restartToken= instead where
> the server then automatically works out whatever other parameters
> it needs, creates a new session etc internally. The new 'session'
> (ListXXX verb) then can use resumptionTokens to manage that new
> transfer.
>
> The idea is for a <restartToken> to be long lived. It may be less
> efficient to use than a resumptionToken, but its only purpose is
> if the client fails the download. If a server does not support
> restartToken, it simply never returns one. Large collections *should*
> support restartTokens.
>
> For my harvester, I can then remember (to disk) the restartToken for
> every packet I get back, allowing me to recover much more easily
> if anything crashes. If restartTokens are too hard for someone
> to implement, then they don't. If you have a large data collection
> on the other hand, to reduce network load, I think it's probably worth
> the extra effort of supporting restartTokens.
>
> Any comments? Better suggestions?
>
> Alan
> _______________________________________________
> OAI-implementers mailing list
> OAI-implementers@oaisrv.nsdl.cornell.edu
> http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers
>
---
Michael L. Nelson
NASA Langley Research Center m.l.nelson@larc.nasa.gov
MS 158, Hampton, VA 23681 http://www.ils.unc.edu/~mln/
+1 757 864 8511 +1 757 864 8342 (f)