[OAI-implementers] Sets Proposal (from DLF)
Dr Robert Sanderson
azaroth at liverpool.ac.uk
Fri Apr 22 13:15:51 EDT 2005
At DLF last week in one of the Birds of a Feather sessions some of the issues
that people had with sets were discussed. The following is a proposal for how
these issues might be addressed without changing the OAI protocol.
Your comments are very welcome.
Thanks to Ralph and Jeff for their additional material incorporated below and
to Tom Habing for his comments.
Rob Sanderson
[resent for crazy UoL email address mapping]
-------
*** OAI Set Proposal: Devolve Sets to Subsidiary URL Access Points ***
Author: Rob Sanderson (azaroth at liv.ac.uk)
Contributors: Ralph Levan (levan at oclc.org)
Jeff Young (jyoung at oclc.org)
Problem:
There have been many issues identified by the OAI community with the
interoperability of sets, set descriptions and the out-of-band communications
between data provider and service provider required to either create new sets
or determine the nature of sets if the description is insufficient or not
present.
Furthermore, given the present set specification, there has yet to be a
consistent approach described to having a hierarchical structure of sets, nor a
way to specify that a record has been deleted from a set, but not from the
repository as a whole.
When populating set descriptions, it is possible to have some subsets make
their records available in a different set of schemas to other subsets and it
is not clear that this should be permitted at all, and if it should, how to
unambiguously specify the set's capabilities.
Proposal:
Instead of furthering the misconception that sets are a weak form of search,
due primarily to the lack of explicit semantics as to how sets should be
defined, sub-collections could instead be treated as separate OAI repositories
without any change to the existing protocol. These subsidiary repositories
typically would be an extended path from the base provider's URL.
For example, an OAI interface might be available at:
http://oai.cheshire3.org/liverpoolArchives/
The server might then support sub-collections of the full archives at:
http://oai.cheshire3.org/liverpoolArchives/gypsy/
http://oai.cheshire3.org/liverpoolArchives/cunard/ and
http://oai.cheshire3.org/liverpoolArchives/cunard/titanic/
Commentary:
Additional URL paths are free for practically all intents and purposes. Once a
service can be made to listen at one URL, additional listening points below are
much easier. Thus the infrastructure of the web is made to assume part of the
technical requirements of the protocol.
Not only does this shift a part of the burden on to the infrastructure, it
enables various implementation strategies not currently available. Instead of
one repository maintaining the complete collection and all of the sets,
sub-collections could be handled either by one repository implementation or by
running a new instance for each subsidiary URL/set. This second method lowers
the barrier-to-entry for supporting sub-collections as a data provider.
This solves the tree-of-sets issue with no additional requirements. Because
there is no (practical) limit on the length of a URL path, the depth of the
sub-collections is likewise not technically limited. The context is also more
apparent, as it is present in the URL.
It also solves the problem of not being able to signal the deletion of records
out of sets when they have not been deleted from the repository. For example,
if a record were moved from one set into another, the record could be flagged
as deleted in the repository maintaining the first set, and recently added to
its new location. This is not currently possible and requires periodic
complete reharvesting.
The misperception regarding sets as a filter or search, rather than access to a
well defined sub-collection is greatly lessened by the proposal. Instead of
having a parameter carried in the request, which can easily be abused, the new
URLs imply much more strongly that this is a well defined OAI service,
especially as it is one. This lessens the likelihood that service providers
will contact the data provider asking for new sets [whether or not that is an
advantage is easily debatable], and also the likelihood that searches will be
crushed into the set parameter.
Benefits also accrue for the service providers as they do not have to implement
set handling for the approximately half of the total data providers that
support them [as per Thomas Habing's registry]. The move from the set
specification to linked services also means that service providers will
actually process the friends information, which otherwise may be ignored.
If there is one record which appears in multiple sets, then it will appear in
each OAI repository instance. Also, the records should all appear at the base
OAI repository. Even though these are different instances, the sets should
have the same unique identifier such that they can be deduplicated if and when
necessary. The protocol specifically allows for common identifier schemes to
be defined, so there seems to be no technical issue with this.
Technical Requirements:
The technical requirement for discovering these sets would be done via a
slightly enhanced version of the Friends schema. Each repository instance
would maintain the next lowest level of sets. In the examples given in the
description, liverpoolArchives would link to gypsy and cunard. Cunard would
then link to its sub-set about the Titanic. Each set would also link upwards
to its super-set.
Each friend entry needs to identify the relationship between the current
repository instance and the friend. As an example, this could be accomplished
by a type attribute, with a default value of 'relatedCollection' for backwards
compatibility.
<f:friends type=subCollection>
http://oai.cheshire3.org/liverpoolArchives/cunard/titanic
</f:friends>
<f:friends type=superCollection>
http://oai.cheshire3.org/liverpoolArchives/
</f:friends>
<f:friends type=relatedCollection>
http://some.other.repository.edu/path/to/oai/
</f:friends>
The need for the link upwards is that the services may be provided by systems
that do not work by url path elements, and instead use some other system. For
example a less sophisticated, but equally valid approach would be to name CGI
scripts with increasingly long names within the same directory. For example:
http://www.cheshire3.org/cgi/oai-liverpool.cgi
http://www.cheshire3.org/cgi/oai-liverpool-gypsy.cgi
http://www.cheshire3.org/cgi/oai-liverpool-cunard-titanic.cgi
The sub-collection description would then be moved into the Identify response
from the listSets response. This has the advantage of only requiring service
providers to process one sort of collection/service description rather than
two. Secondly it means that the best practices regarding the Identify response
would also apply to any sub-collections, increasing the likelihood of content
providers describing their sets appropriately.
---------
,'/:. Dr Robert Sanderson (azaroth at liverpool.ac.uk)
,'-/::::. http://www.csc.liv.ac.uk/~azaroth/
,'--/::(@)::. Dept. of Computer Science, Room 805
,'---/::::::::::. University of Liverpool
____/:::::::::::::. I L L U M I N A T I Cheshire3 IR System:
http://www.cheshire3.org/
More information about the OAI-implementers
mailing list