[OAI-implementers] harvester guidelines
Hussein Suleman
hussein at cs.uct.ac.za
Fri May 27 14:04:56 EDT 2011
hi Jasper
regarding the sets issue,
this was acknowledged as a gap a long time ago. addressing the gap
creates significant additional complexity in the protocol so this was
not done in the interest of simplicity. the OAI-PMH was developed to be
a simple protocol and often that need for simplicity has overridden the
goal of completeness.
regarding the harvesting dates,
your idea seems reasonable for the specific case you mention (sequential
out-of-sync dates). this does of course give control of the harvesting
process to the data providers - a service provider that keeps track of
dates itself is more likely to act predictably. the general problem you
are referring to is where data providers add records that are
essentially in the past. there are numerous ways to do this and there is
simply no general solution - either a service provider trusts a data
provider not to do this or periodically reharvests from scratch.
ttfn,
----hussein
=====================================================================
hussein suleman ~ hussein at cs.uct.ac.za ~ http://www.husseinsspace.com
=====================================================================
On 26/05/2011 12:43, Jasper Op de Coul wrote:
> Hi list,
>
>
> I've been doing some work with OAIPMH harvesters lately, and would like
> to share some of my experiences on the subject.
>
> When harvesting specific sets with the `set` param, there is an issue
> that a harvester is not notified when a record is removed from that set.
>
> I think most implementers are aware of this, and it is the biggest hole
> in the specification.
>
> For example: A specific set is harvested, but at a later time one of the
> records is no longer part of that set. The record then disappears from
> the feed, but the harvester is never notified because there is no delete
> event.
>
> There are several ways to deal with this:
>
> 1. Do incremental harvests with the ?set param, then do a full harvest
> periodically or when someone calls or mails that records are missing.
> This is a common approach but it is no solution to the problem. I think
> we can and should do better then that.
>
> 2. Always do a full harvest with the ?set param. This will put a lot of
> load on the servers, take lots of time, and is not a very social thing
> to do. So, not a good option.
>
> 3. Use incremental harvests, but never use the ?set param. The client
> will receive all records and can inspect the SetSpec header manually to
> see if this record is part of the wanted set. Records that are not part
> of the wanted set but are in the client database can be removed.
>
> The last option means a lot more housekeeping for the client, but it is
> the only way for a client to be sure that the data is correct.
>
> Although sets are a very useful feature, the set parameter is basically
> broken. This should be noted somewhere in the documentation, probably in
> the harvester guidelines.
> Personally I would be in favour of deprecating the set parameter so we
> can put a big fat warning explaining this problem.
>
> Another issue that came up recently has to do with incremental
> harvesting. The harvester guidelines mention that for the value of the
> from parameter, the `responseDate` should be used, and that it is
> advisable to overlap by a small additional amount.
> I think it would be better if a harvester does not use the responseDate,
> but instead uses the latest datestamp of all harvested records.
>
> Consider the following situation:
>
> Someone modifies a document in a database at 4 o'clock.
> An external OAI service gets updated once an hour, so it will have the
> changes at 5 o'clock. The OAI software will use the modification dates
> from the database, so at 5 o'clock the modified record is added with a
> datestamp of 4 o'clock.
> If a harvester comes by at 4:30, that modifed document is not in the OAI
> feed yet. An hour later at 5:30, the harvester harvests again with a
> `from` parameter value of 4:30. The harvester will never get the
> modified document because it was modified earlier then 4:30.
>
> Off course this whole situation is far from ideal, but it could be that
> there is a gap between the modification date of a resource, and when it
> gets added to the oai server. This gap can be anything from a few
> seconds to a few weeks.
>
> If a harvester always uses the latest datestamp of any of the harvested
> records, you are sure that no records are missed, and you never harvest
> too much.
>
> I hope this helps implementers build better harvesters. If there is
> concensus about adding this to the harvester guidelines, I would be
> willing to write some text for it.
>
> Kind regards,
> Jasper
>
>
>
More information about the OAI-implementers
mailing list