FW: [OAI-implementers] Open Archives Initiative Protocol for Meta data Harvesting Version 2 news
Tim Cole
"Tim Cole" <t-cole3@uiuc.edu>
Fri, 8 Feb 2002 17:02:50 -0600
Not to curtail the very interesting technical back and forth, but...
The flexible and naive nature of the resumptionToken parameter and the fact
that the OAI-PMH doesn't allow Service Providers to request a fixed number
of records are very much by design. The minimum granularity and inherent
limitations of the datestamp argument were also decided after some
thought. Given the intended mission of the OAI PMH, I believe the decisions
were correct. (Whether there's really a niche for what OAI-PMH is intended
to be is of course open to debate.)
OAI PMH was created initially to facilitate interchange of metadata between
E-Print archives. These archives shared several
characteristics -- among them that data contained in the archive changed
relatively slowly (i.e., on average relatively few new records added,
changed or deleted day to day) and that the repositories were built on
limited resources and with limited capabilities (some didn't even support
keyword search of full-text of documents held in the repository).
Accordingly OAI PMH built in a lot of flexibility (and a certain amount of
wiggle room) for implementers, particularly metadata providers. Timestamps
with granularity of only 1 day were allowed. Flow control was implemented
in the least prescriptive, most stateless way possible.
Some metadata provider services have been built to take advantage of this
flexibility. For instance I have an experimental OAI provider service that
has no database management software behind it at all. Instead it relies on
the implementation platform's file system. Metadata is stored in XML files
and dynamically transformed when requested to the requested metadata schema
using XSLT. The chunk size (number of records) returned for a ListRecords
request varies according to the number of records in each file system
directory at the time the request is received. The order in which records
are returned is determined by the implementation platform's file system and
typically is not chronological, meaning it will change between requests as
records are added, deleted, and updated. This implementation would not be
able to return a fixed number of records specified by the Service Provider
without substantial changes to its basic design.
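
As an illustration only (this is not the code of the service described
above), directory-driven chunking could be sketched roughly as follows.
The function name list_chunk, the ".xml" suffix, and the use of the file
modification time as the datestamp are all assumptions made for the
example.

```python
import os
import tempfile
from datetime import date


def list_chunk(root, directory, from_day, until_day):
    """Return identifiers of records in one directory within a date range.

    Hypothetical sketch: the chunk size is simply however many matching
    files the directory holds at the moment the request arrives.
    """
    path = os.path.join(root, directory)
    selected = []
    for name in os.listdir(path):  # order is decided by the file system
        if not name.endswith(".xml"):
            continue
        full = os.path.join(path, name)
        # Assumption: datestamp = file modification day (day granularity).
        stamp = date.fromtimestamp(os.path.getmtime(full))
        if from_day <= stamp <= until_day:
            selected.append(name[:-len(".xml")])
    return selected
```

Because nothing is cached, two identical requests can return different
chunk sizes if files were added or removed in between.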
The resumption token as used in this implementation includes the requested
metadata prefix, the date range values of the original request, and a list
of remaining directories to be exported. No state information is ever
maintained on the server side, and the number of records returned in
response to a request with a resumption token isn't determined until the
request is received and processed. (Thus a later request with the same
resumption token may get more or fewer records.) Datestamps are maintained to
the day only (no hours, minutes, or seconds). Implementing locking or
creating some sort of state maintenance mechanism would require substantial
and fundamental changes to the design of this implementation.
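
A minimal sketch of a self-describing token of this kind, assuming a
JSON payload wrapped in URL-safe base64 (the encoding, the field names,
and the helper names are illustrative choices, not what the service
above actually uses):

```python
import base64
import json


def encode_token(metadata_prefix, from_day, until_day, remaining_dirs):
    """Pack the whole request state into the token itself."""
    payload = {
        "prefix": metadata_prefix,
        "from": from_day,
        "until": until_day,
        "dirs": remaining_dirs,
    }
    raw = json.dumps(payload, separators=(",", ":")).encode("utf-8")
    return base64.urlsafe_b64encode(raw).decode("ascii")


def decode_token(token):
    """Recover the request state; the server keeps nothing between calls."""
    raw = base64.urlsafe_b64decode(token.encode("ascii"))
    return json.loads(raw.decode("utf-8"))


def next_response(token):
    """Return the directory to export now and the token for the remainder.

    In this sketch an empty string marks the final chunk.
    """
    state = decode_token(token)
    current, rest = state["dirs"][0], state["dirs"][1:]
    if not rest:
        return current, ""
    return current, encode_token(
        state["prefix"], state["from"], state["until"], rest
    )
```

The token names directories, not records, so the number of records in
each response is only known when that directory is actually read.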
I believe the implementation conforms to the current protocol document, and
I'm reasonably sure that with only minor changes it will conform to the 2.0
spec. I've been surprised at how hard it is to break, though I certainly
don't expect it to be as reliable and robust as some other implementations
I've seen. It does what it was designed to do.
However this implementation clearly does not support precise harvesting
along the lines that have been discussed on this list over the course of the
last week or two. The resumptionToken is not deterministic, but only a
somewhat imprecise method used to chunk a long response. I would contend
that given that the provider implementation described is intended only to
handle a repository of at most a few tens of thousands of metadata records
and in which additions, updates, and deletions occur at most weekly, and
more often monthly, the imprecise harvesting does not lead to poor
representation of the metadata stored in my repository, and therefore should
not be of concern to Service Providers. Of course that's debatable.
Which is the question before the OAI Community at this point in time. Is
there really a niche for a relatively simple protocol that allows in certain
instances for less precise harvesting? (For instance we've known from the
start that some re-harvesting occurs because datestamps only have
granularity of one day.) Can services built on such a protocol be useful --
at least for certain purposes? Obviously not for a bank trying to do
financial transactions, but perhaps in the DL world. A number of us are
trying to answer these kinds of questions by empirical means rather than
speculation.
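
To make the re-harvesting point concrete, here is a small sketch under
stated assumptions: records are plain dicts, the provider's "from"
filter is inclusive at day granularity, and the harvester keys its store
by identifier. The names select and harvest are hypothetical.

```python
from datetime import date


def select(records, from_day):
    """What a day-granularity provider returns for from=from_day."""
    return [r for r in records if r["datestamp"] >= from_day]


def harvest(store, records, from_day):
    """Store received records by identifier; return how many arrived."""
    received = select(records, from_day)
    for r in received:
        store[r["identifier"]] = r  # a re-harvested record just overwrites
    return len(received)


repository = [
    {"identifier": "a", "datestamp": date(2002, 2, 7)},
    {"identifier": "b", "datestamp": date(2002, 2, 8)},
]
store = {}
first = harvest(store, repository, date.min)

# Later on the same day a new record appears in the repository.
repository.append({"identifier": "c", "datestamp": date(2002, 2, 8)})

# Resuming from the last harvest day fetches "b" again along with "c".
second = harvest(store, repository, date(2002, 2, 8))
```

The second pass transfers one record twice but loses nothing, which is
the kind of imprecision being weighed here.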
Given that there can be circumstances when a metadata provider might want to
avoid the overhead of a transactional database system, I would very much oppose
moving OAI-PMH in the direction of SQL style transactions and cursors. I
would also oppose, especially as a required functionality, upgrading flow
control to allow SPs to specify numbers of records wanted, or to specify
resuming from a particular record (which implicitly assumes an ordered,
persistent response object). These changes would require providers to
maintain state and would effectively require them to provide transactional
functionalities -- things many of the current providers aren't in a good
position to do. The benefits of such changes for the target audience don't
seem worth it. (Which comes back to the question raised earlier about whether a
niche protocol aimed at a particular target audience can survive. I think
it can, but we'll have to see.)
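
The dependence of "resume from record N" on an ordered, persistent
result set can be shown with a toy example. This is only a sketch: the
repository is a dict of identifier to datestamp, datestamps are small
integers for brevity, and page is a hypothetical helper that re-sorts
on every request because no state is kept.

```python
def page(repo, offset, size):
    """Return one page of identifiers, ordered by current datestamp."""
    ordered = sorted(repo, key=lambda ident: repo[ident])
    return ordered[offset:offset + size]


repo = {"a": 1, "b": 2, "c": 3, "d": 4}

first = page(repo, 0, 2)    # harvester reads positions 0 and 1

repo["a"] = 5               # "a" is updated between the two requests

second = page(repo, 2, 2)   # harvester resumes from position 2

# Everything the harvester never saw in either response.
missed = set(repo) - set(first) - set(second)
```

Here the update shifts every later record back one position, so the
record that used to sit at position 2 is skipped. Avoiding that without
a snapshot or a lock is exactly what would force providers to keep
state.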
Tim Cole
University of Illinois at Urbana-Champaign
----- Original Message -----
From: "Xiaoming Liu" <liu_x@cs.odu.edu>
To: "Alan Kent" <ajk@mds.rmit.edu.au>
Cc: <oai-implementers@openarchives.org>
Sent: Friday, February 08, 2002 6:57 AM
Subject: Re: FW: [OAI-implementers] Open Archives Initiative Protocol for
Meta data Harvesting Version 2 news
> Sorry for replying to my own email ;-)
>
> The more I think about this problem, the more I believe it's not a
> stateful/stateless problem. If we all agree that HTTP is a stateless
> protocol, what's the fundamental difference between URL rewriting and
> resumptionToken?
>
> I believe the real problem is a read/write lock problem. If a data
> provider wants to implement a perfect service, namely return a consistent
> cursor between DP (data provider) and SP (service provider), it has to
> work either the way Jeff has suggested (keep a snapshot of all
> identifiers at that instant, a huge job for 1M records) or totally
> read-lock the whole database.
>
> Because the datestamp is always increasing in OAI, I think Alan's
> method (high-resolution date stamps and results ordered by
> time) will also work, but not necessarily monotonically, if the DP could
> return all records of a specific datestamp in one reply. But it does
> pose some danger to the harvester, as Walter suggested: if the DP suddenly
> creates 10K records with the same datestamp, it has to return them in one
> response, which quite possibly will break the harvester.
>
> liu
>
>
>
> On Thu, 7 Feb 2002, Xiaoming Liu wrote:
>
> > Alan,
> >
> > I guess there are two aspects to my argument: the data provider (DP) and
> > the service provider (SP).
> >
> > From the side of the SP, it could not presume "a request for the past will
> > always get the same answer". So the method suggested by Walter won't work.
> > Instead, the SP has to use the resumptionToken to get the right answer.
> >
> > From the side of the DP, they could implement the resumptionToken in their
> > own way. If a DP can promise "a request for the past will never change", or
> > they don't care about missing something, they can use the method I suggest.
> > That's the case for a CVS-like system (keeping each version with a different
> > release number), or maybe some historical documents.
> >
> > So my opinion is: SP has to use resumptionToken, DP has its own options
> > about how to implement it.
> >
> >
> > About "whether new records are created with monotonic dates": see the
> > definition of datestamp in OAMHP:
> > "A datestamp is the date of creation, deletion, or latest date of
> > modification of an item, the effect of which is a change in the metadata
> > of a record disseminated from that item."
> >
> > So in a correctly implemented OAI repository, new records should be
> > created with monotonic dates. In your case of a webpage/crawler, the date
> > of the metadata is the date the webpage is harvested.
> >
> > > Or is the idea with OAI that if a record is updated, then the
> > > old slot is marked as 'deleted' and a new record added as 'inserted'
> > > to keep the same number of slots around?
> >
> > If one record is changed (but the identifier stays the same), the correct
> > way is to change the datestamp. However, if you have a version control
> > system and change the identifier each time, the "deleted"/"inserted"
> > approach is also right.
> >
> > > The only invariant that I can think of is the date stamp.
> > > If date/time stamps (to a high resolution) were used, and the
> > > results of ListRecords was in monotonically increasing order
> > > of time, then you actually no longer need resumptionToken at all.
> >
> > By my understanding, OAI2.0 (from Carl&Herbert's email) will support high
> > resolution date/time stamps as an option. However, there is no promise
> > that results of ListRecords will be in monotonically increasing order of
> > time. (It may be an unnecessary limitation for some data providers.)
> >
> > But I agree it will support a pure stateless protocol if all assumptions
> > are satisfied (high-resolution date stamps and results ordered by
> > time).
> >
> > Regards,
> > liu
> >
> > On Fri, 8 Feb 2002, Alan Kent wrote:
> >
> > > Sorry if this is all old hat to other people, but I find getting involved
> > > is the best way to learn and understand. People can always ignore me! :-)
> > >
> > > On Thu, Feb 07, 2002 at 09:50:27PM -0500, Xiaoming Liu wrote:
> > > > --- Walter Underwood wrote:
> > > > > A request for all changes between two dates in the past should
> > > > > always get the same answer, so stateless harvesting should work.
> > > >
> > > > This is a neat way, but I am not sure how well the past is kept in a
> > > > digital library ;-) Especially in the OAI protocol, whenever a record is
> > > > changed, its datestamp is changed too. So even a request for the past
> > > > may not get the same answer.
> > >
> > > and
> > >
> > > > Maybe there is one way to implement a stateless protocol in current OAI:
> > > > encode query parameters in ResumptionToken:
> > > ...
> > > > one example is:
> > > > resumptionToken= 1999:2000:math:oai_dc:100
> > >
> > > I assume the 100 means start from record 100.
> > >
> > > So by your own argument, the contents of previous queries may change
> > > between requests. So the server *must* keep a copy of the state of the
> > > system when the original query was issued and continue to provide
> > > that consistently to the client. If the results are not consistent,
> > > data could be lost (overlooked) during a long transfer.
> > >
> > > Let me expand and ask a few questions (partly from my ignorance).
> > > Is it expected with OAI that new records will come into existence
> > > at a previous point in time? Or are all new records always
> > > created with monotonically increasing date/time values? For example,
> > > if metadata is harvested from a web site, would the dates of the
> > > web pages be used? Or the date the data was harvested be used?
> > > If the date of the web page, then when a new site is crawled,
> > > new pages can come into existence dated in the past. If the date
> > > the metadata was collected from the web page, then dates increase
> > > monotonically.
> > >
> > > If new records are *not* created with monotonic dates, then OAI falls
> > > down, doesn't it? Anyone who has done a previous crawl may never crawl
> > > for that old date range again and so not get the data. So to be safe,
> > > dates must be monotonically increasing for metadata modified in the
> > > repository.
> > >
> > > If changes to the repository are then always given monotonically
> > > increasing dates, then history will never be added to. However,
> > > history can be lost if an old entry is updated (as it will be given
> > > a newer date). So if a cursor scheme which says 'give me
> > > records starting from 100' is used, then if a record that was in
> > > the range 1-99 is updated between requests, then what was record
> > > number 100 would slip back to become record number 99. The request
> > > starting from 100 would then miss that record.
> > >
> > > Or is the idea with OAI that if a record is updated, then the
> > > old slot is marked as 'deleted' and a new record added as 'inserted'
> > > to keep the same number of slots around?
> > >
> > > The normal way this problem is addressed in database systems of
> > > course is to use transactions. When the query is used, the full
> > > answer is effectively worked out and kept around. Any updates,
> > > inserts, or deletes do not affect the query results. The current
> > > OAI protocol then uses the resumptionToken to identify the query
> > > set. But at some stage, the query may be discarded. If the client
> > > has not got all the data yet, then it has to start again from
> > > scratch (unless the data is guaranteed to be returned in monotonically
> > > increasing date order - which it's not at present I think).
> > >
> > > Using the identifier of a record to remember the position in a
> > > result set is no good either. If that record is updated, it will
> > > move in the result set, messing things up again.
> > >
> > > The only invariant that I can think of is the date stamp.
> > > If date/time stamps (to a high resolution) were used, and the
> > > results of ListRecords was in monotonically increasing order
> > > of time, then you actually no longer need resumptionToken at all.
> > > Instead, a new request can be specified with a precise 'from'
> > > value. That would make requests completely stateless. Deletions
> > > in history (due to an update) would not be a problem.
> > >
> > > Ok, I will be quiet now and let someone with more history behind
> > > OAI and all its goals etc speak instead.
> > >
> > > Alan
> > > _______________________________________________
> > > OAI-implementers mailing list
> > > OAI-implementers@oaisrv.nsdl.cornell.edu
> > > http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers
> > >