[OAI-implementers] issues with OAI-PMH specifications for OAI-Provider implementations using a cache

Fridman, Rozita Rozita.Fridman at FIZ-Karlsruhe.DE
Tue Jun 2 10:50:07 EDT 2009


Hello Simeon,

thanks a lot for your quick response.

> The notion of including an explicit start-next-incremental-harvest-from
> date in the response is something I have thought about too. It would
> solve the cache problem you describe. Not sure how much support there
> would be for such a change, what do others think?

Hopefully other OAI developers will support extending the schema of the OAI-PMH response.
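
To illustrate how a harvester might use such a date, here is a minimal Python sketch. The value `last_cache_update` is purely hypothetical: OAI-PMH 2.0 has no such response element, and the function name `next_from` is only chosen for this example.

```python
from datetime import datetime, timezone

def next_from(requested_until, last_cache_update=None):
    """Pick the 'from' date for the next incremental harvest.

    last_cache_update is a hypothetical value reported by the provider;
    it is not part of the current OAI-PMH response schema.
    """
    if last_cache_update is None:
        # Current protocol: the harvester can only assume it saw
        # everything up to the 'until' date it asked for.
        return requested_until
    # With the hint, never resume later than the last cache update.
    return min(requested_until, last_cache_update)

T1 = datetime(2009, 6, 1, 12, 0, tzinfo=timezone.utc)  # last cache update
T2 = datetime(2009, 6, 2, 10, 0, tzinfo=timezone.utc)  # harvester's 'until'

print(next_from(T2))      # resumes from T2, so changes between T1 and T2 are skipped
print(next_from(T2, T1))  # resumes from T1, so nothing is skipped
```

A harvester resuming from T1 may receive some records twice, but that is harmless compared to missing them.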
 

> One way to solve this using the current protocol without modification
> is to use days granularity and to make sure that the cache is updated
> at least once within each day (and that the update does not span a day
> boundary in UTC). That way T1=T2 always using your example.

That is a good solution until we get a protocol enhancement. The problem is that when a cache update does not run for a day (for example, because the underlying repository was not available), a harvester will miss the records for that day.
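
A provider could at least detect both failure cases of the day-granularity approach. The following Python sketch is only an illustration; the function names and the idea of keeping a set of days with a completed update are assumptions of this example, not part of any existing provider.

```python
from datetime import date, datetime, timedelta, timezone

def spans_utc_day_boundary(start, end):
    """True if a cache update started and finished on different UTC days.

    Both arguments must be timezone-aware datetimes.
    """
    return (start.astimezone(timezone.utc).date()
            != end.astimezone(timezone.utc).date())

def days_without_update(update_days, first, last):
    """Return the UTC days in [first, last] on which no cache update completed."""
    missing = []
    day = first
    while day <= last:
        if day not in update_days:
            missing.append(day)
        day += timedelta(days=1)
    return missing

# An update running from 23:50 to 00:10 UTC crosses the day boundary.
start = datetime(2009, 6, 1, 23, 50, tzinfo=timezone.utc)
end = datetime(2009, 6, 2, 0, 10, tzinfo=timezone.utc)
print(spans_utc_day_boundary(start, end))

# No update completed on 2 June, so harvesters would miss that day.
done = {date(2009, 6, 1), date(2009, 6, 3)}
print(days_without_update(done, date(2009, 6, 1), date(2009, 6, 3)))
```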

For now we use the same workaround as the Fedora-OAI-Provider: we deliver records based on their update time in the cache, not on their original update time in the underlying repository. This approach requires us to change the earliestDatestamp entry in the OAI-PMH Identify response. It has to be set to the time of the first cache update rather than to the original earliest timestamp in the underlying repository. Otherwise a harvester may miss changes in the time range between the earliest timestamp in the underlying repository and the first cache update.
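
In simplified form the workaround looks roughly like the Python sketch below. It is not the actual Fedora or Escidoc code; the class `RecordCache` and its methods are hypothetical names used only to show the idea.

```python
from datetime import datetime, timezone

class RecordCache:
    """Toy cache that stamps records with the cache update time."""

    def __init__(self):
        self._stamps = {}          # identifier -> datestamp shown to harvesters
        self.first_update = None   # time of the very first cache update

    def update(self, changed_identifiers, now):
        """Record a cache update; the repository's own timestamps are ignored."""
        if self.first_update is None:
            self.first_update = now
        for identifier in changed_identifiers:
            self._stamps[identifier] = now

    def earliest_datestamp(self):
        """Value to report as earliestDatestamp in the Identify response."""
        return self.first_update

    def list_identifiers(self, from_=None, until=None):
        """Selective harvesting against cache datestamps (inclusive bounds)."""
        return sorted(
            identifier for identifier, stamp in self._stamps.items()
            if (from_ is None or stamp >= from_)
            and (until is None or stamp <= until)
        )

t1 = datetime(2009, 6, 1, 2, 0, tzinfo=timezone.utc)
t2 = datetime(2009, 6, 2, 2, 0, tzinfo=timezone.utc)
cache = RecordCache()
cache.update(["oai:a", "oai:b"], t1)
cache.update(["oai:b", "oai:c"], t2)
print(cache.earliest_datestamp())
print(cache.list_identifiers(from_=t2))
```

Because every datestamp is a cache update time, a harvest with "until" later than the last cache update cannot silently skip records: anything cached afterwards gets a later datestamp.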

> If you opted to follow the 503 route then you could issue a
> second/multiple 503's if the harvester comes back before the update is
> complete. This is really the only good approach if the cache is in an
> inconsistent state such that the idempotency requirements of the
> protocol are not met.
> 

Yes, it is an option. 
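
As a rough sketch in Python (the function name `respond` and the default of 300 seconds are assumptions of this example, not values from the specification), the provider would answer with 503 for as long as the update is running:

```python
def respond(cache_updating, retry_after_seconds=300):
    """Return (status, headers) for a harvester request.

    retry_after_seconds is only a guess, since the duration of a cache
    update is not known in advance. If the harvester comes back too
    early, it simply receives another 503.
    """
    if cache_updating:
        return 503, {"Retry-After": str(retry_after_seconds)}
    return 200, {}

print(respond(True))   # update still running
print(respond(False))  # cache consistent again
```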

Best regards,
Rozita

> Cheers.
> Simeon
> 
> 
> 
> Fridman, Rozita wrote:
> > Hello all,
> >
> > we developed an OAI-Provider for Escidoc repositories.
> > Escidoc-OAI-Provider is based on the Fedora-OAI-Provider, which uses a
> > cache to reduce response time. Escidoc repositories are intended to
> > contain multiple millions of objects. The Escidoc-Core framework only
> > requires that object metadata stored in an Escidoc repository are
> > well-formed XML structures. Therefore using a cache in the
> > Escidoc-OAI-Provider is essential to ensure the validity of metadata in
> > OAI-PMH responses and an acceptable response time.
> >
> > But the current OAI-PMH protocol specification doesn't account for
> > some issues caused by the use of a cache.
> >
> > The main problem is the time lag between a harvester request and the
> > last cache update:
> > A harvester asks the OAI-Provider for all records that have changed
> > between T0 and T2 in the underlying repository. The last cache update
> > was at T1. The harvester gets the records that have changed between T0
> > and T1, but assumes that it got all changes between T0 and T2.
> > Therefore in the next request it asks for records that have changed
> > between T2 and T3 and misses all changes between T1 and T2. If the
> > cache update interval is long and the next cache update takes place
> > after T3, the harvester also misses all changes between T2 and T3, and
> > so on.
> >
> > One proposal would be to put the datestamp of the last cache update
> > into the OAI-PMH response, in order to inform a harvester about
> > possibly missed records.
> >
> > Does anybody face the same problem? What do you think about it? Maybe
> > there are better solutions for this problem?
> >
> > The other issue is that, depending on the OAI-Provider implementation,
> > a cache may be in an inconsistent state while a cache update process
> > is running. Are there means in the OAI-PMH protocol to respond to
> > harvester requests during a cache update? A possible solution would be
> > to respond with the HTTP status code 503 Service Unavailable (section
> > 3.1.2.2 of the specification), but the problem is how to specify the
> > Retry-After period. The duration of a cache update is not constant; it
> > depends on the changes in the repository.
> >
> > Thanks a lot,
> > Rozita
> >
> > ------------------------------------------------------------------------
> >
> > -------------------------------------------------------
> >
> > Fachinformationszentrum Karlsruhe, Gesellschaft für wissenschaftlich-technische Information mbH.
> > Registered office: Eggenstein-Leopoldshafen; commercial register: Amtsgericht Mannheim HRB 101892.
> > Managing Director: Sabine Brünger-Weilandt.
> > Chairman of the Supervisory Board: MinR Hermann Riehl.
> >
> > ------------------------------------------------------------------------
> >
> > _______________________________________________
> > OAI-implementers mailing list
> > List information, archives, preferences and to unsubscribe:
> > http://www.openarchives.org/mailman/listinfo/oai-implementers
> >
