[OAI-implementers] Sets Proposal (from DLF)
Dr Robert Sanderson
azaroth at liverpool.ac.uk
Fri Apr 22 13:11:36 EDT 2005
At DLF last week in one of the Birds of a Feather sessions some of the
issues that people had with sets were discussed. The following is a
proposal for how these issues might be addressed without changing the
OAI protocol.
Your comments are very welcome.
Thanks to Ralph and Jeff for their additional material incorporated
below and to Tom Habing for his comments.
Rob Sanderson
-------
*** OAI Set Proposal: Devolve Sets to Subsidiary URL Access Points ***
Author: Rob Sanderson (azaroth at liv.ac.uk)
Contributors: Ralph Levan (levan at oclc.org)
Jeff Young (jyoung at oclc.org)
Problem:
There have been many issues identified by the OAI community with the
interoperability of sets, set descriptions and the out-of-band
communications between data provider and service provider required to
either create new sets or determine the nature of sets if the description
is insufficient or not present.
Furthermore, given the present set specification, there has yet to be a
consistent approach described to having a hierarchical structure of sets,
nor a way to specify that a record has been deleted from a set, but not
from the repository as a whole.
Set descriptions raise a further question: it is possible for some subsets
to make their records available in a different set of schemas from other
subsets. It is not clear that this should be permitted at all and, if it
should, how a set's capabilities can be specified unambiguously.
Proposal:
Instead of furthering the misconception that sets are a weak form of
search, due primarily to the lack of explicit semantics as to how sets
should be defined, sub-collections could instead be treated as separate
OAI repositories without any change to the existing protocol. These
subsidiary repositories typically would be an extended path from the base
provider's URL.
For example, an OAI interface might be available at:
http://oai.cheshire3.org/liverpoolArchives/
The server might then support sub-collections of the full archives at:
http://oai.cheshire3.org/liverpoolArchives/gypsy/
http://oai.cheshire3.org/liverpoolArchives/cunard/ and
http://oai.cheshire3.org/liverpoolArchives/cunard/titanic/
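As a rough sketch of what this means for a harvester: the request for a
sub-collection stops carrying a set argument and simply uses a different
base URL. The helper name request_url, the oai_dc prefix choice and the
setSpec value cunard:titanic are illustrative assumptions and not part of
the proposal; the verb and argument names are those of OAI-PMH 2.0.

```python
from urllib.parse import urlencode


def request_url(base_url, verb, **arguments):
    """Build an OAI-PMH GET request URL (hypothetical helper)."""
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


BASE = "http://oai.cheshire3.org/liverpoolArchives/"

# Current practice: one base URL, sub-collection selected by the set argument.
# The setSpec "cunard:titanic" is an assumed value for illustration.
with_set = request_url(BASE, "ListRecords",
                       metadataPrefix="oai_dc", set="cunard:titanic")

# Proposed: the sub-collection is its own repository, so no set argument.
as_repository = request_url(BASE + "cunard/titanic/", "ListRecords",
                            metadataPrefix="oai_dc")

print(with_set)
print(as_repository)
```

A harvester that already works against a base URL with no set argument
would need no new code for the second form.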
Commentary:
Additional URL paths are free for practically all intents and purposes.
Once a service can be made to listen at one URL, adding further listening
points below it is much easier. Thus the infrastructure of the web is made
to assume part of the technical requirements of the protocol.
Not only does this shift a part of the burden on to the infrastructure, it
enables various implementation strategies not currently available.
Instead of one repository maintaining the complete collection and all of
the sets, sub-collections could be handled either by one repository
implementation or by running a new instance for each subsidiary URL/set.
This second method lowers the barrier-to-entry for supporting
sub-collections as a data provider.
This solves the tree-of-sets issue with no additional requirements.
Because there is no (practical) limit on the length of a URL path, the
depth of the sub-collections is likewise not technically limited. The
context is also more apparent, as it is present in the URL.
It also solves the problem of not being able to signal the deletion of
records out of sets when they have not been deleted from the repository.
For example, if a record were moved from one set into another, the record
could be flagged as deleted in the repository maintaining the first set,
and would appear as newly added in the repository for its new location.
With sets this is not currently possible, so service providers must
periodically reharvest everything.
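A minimal sketch of the move case, assuming each sub-collection keeps its
own record list with deletion tombstones. The SubRepository class, the
move function and the identifier oai:cheshire3.org:rec1 are made up for
illustration; a real implementation would sit behind the normal OAI verbs.

```python
from dataclasses import dataclass, field


@dataclass
class SubRepository:
    """Toy in-memory stand-in for one repository instance (hypothetical)."""
    base_url: str
    # identifier -> (datestamp, status); status is "active" or "deleted"
    records: dict = field(default_factory=dict)

    def add(self, identifier, datestamp):
        self.records[identifier] = (datestamp, "active")

    def delete(self, identifier, datestamp):
        # Keep a tombstone so incremental harvesters see the deletion.
        self.records[identifier] = (datestamp, "deleted")

    def changed_since(self, from_date):
        """Roughly what a selective harvest with from=from_date would show."""
        return {ident: status
                for ident, (stamp, status) in self.records.items()
                if stamp >= from_date}


def move(identifier, source, target, datestamp):
    """Moving a record is a delete in one instance and an add in the other."""
    source.delete(identifier, datestamp)
    target.add(identifier, datestamp)


gypsy = SubRepository("http://oai.cheshire3.org/liverpoolArchives/gypsy/")
cunard = SubRepository("http://oai.cheshire3.org/liverpoolArchives/cunard/")
gypsy.add("oai:cheshire3.org:rec1", "2005-04-01")
move("oai:cheshire3.org:rec1", gypsy, cunard, "2005-04-20")

print(gypsy.changed_since("2005-04-10"))
print(cunard.changed_since("2005-04-10"))
```

An incremental harvest of each instance then sees the change without a
full reharvest: a deletion at the old location and a new record at the
new one.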
The misperception of sets as a filter or search, rather than as access to
a well defined sub-collection, is greatly lessened by the proposal.
Instead of having a parameter carried in the request, which can easily be
abused, the new URLs imply much more strongly that this is a well defined
OAI service, especially as it is one. This lessens the likelihood that
service providers will contact the data provider asking for new sets
[whether or not that is an advantage is easily debatable], and also the
likelihood that searches will be crushed into the set parameter.
Benefits also accrue for the service providers as they do not have to
implement set handling for the approximately half of the total data
providers that support them [as per Thomas Habing's registry]. The move
from the set specification to linked services also means that service
providers will actually process the friends information, which otherwise
may be ignored.
If there is one record which appears in multiple sets, then it will appear
in each OAI repository instance. Also, the records should all appear at
the base OAI repository. Even though these are different instances, the
records should have the same unique identifier in each, so that they can be
deduplicated if and when necessary. The protocol specifically allows for
common identifier schemes to be defined, so there seems to be no technical
issue with this.
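On the service provider side, deduplication could look something like the
following sketch. It assumes the harvester has already collected
(identifier, datestamp, metadata) tuples per base URL; the function name,
the identifiers and the placeholder metadata strings are all hypothetical.

```python
def deduplicate(harvests):
    """Merge records harvested from several repository instances.

    harvests maps a base URL to a list of (identifier, datestamp, metadata)
    tuples. Returns (merged, seen_at): merged maps each identifier to the
    (datestamp, metadata) pair with the newest datestamp, and seen_at maps
    each identifier to the sorted list of base URLs where it was found.
    """
    merged, seen_at = {}, {}
    for base_url, records in harvests.items():
        for identifier, datestamp, metadata in records:
            seen_at.setdefault(identifier, set()).add(base_url)
            current = merged.get(identifier)
            if current is None or datestamp > current[0]:
                merged[identifier] = (datestamp, metadata)
    return merged, {ident: sorted(urls) for ident, urls in seen_at.items()}


# Made-up harvest results: rec1 appears at the base and in a sub-collection.
harvests = {
    "http://oai.cheshire3.org/liverpoolArchives/": [
        ("oai:cheshire3.org:rec1", "2005-04-20", "<metadata for rec1>"),
        ("oai:cheshire3.org:rec2", "2005-03-01", "<metadata for rec2>"),
    ],
    "http://oai.cheshire3.org/liverpoolArchives/cunard/titanic/": [
        ("oai:cheshire3.org:rec1", "2005-04-20", "<metadata for rec1>"),
    ],
}

merged, seen_at = deduplicate(harvests)
print(len(merged), seen_at["oai:cheshire3.org:rec1"])
```

Keeping the list of base URLs per identifier also records which
sub-collections a record belongs to, which is the information the set
membership would otherwise have carried.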
Technical Requirements:
Discovery of these sets would be handled via a slightly enhanced version
of the Friends schema. Each repository instance would maintain links to the
next lowest level of sets. In the examples given in
the description, liverpoolArchives would link to gypsy and cunard. Cunard
would then link to its sub-set about the Titanic. Each set would also
link upwards to its super-set.
Each friend entry needs to identify the relationship between the current
repository instance and the friend. As an example, this could be
accomplished by a type attribute, with a default value of
'relatedCollection' for backwards compatibility.
<f:friends type="subCollection">
http://oai.cheshire3.org/liverpoolArchives/cunard/titanic
</f:friends>
<f:friends type="superCollection">
http://oai.cheshire3.org/liverpoolArchives/
</f:friends>
<f:friends type="relatedCollection">
http://some.other.repository.edu/path/to/oai/
</f:friends>
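A service provider could read such entries along the following lines.
This is only a sketch: the element layout is the one suggested in this
proposal rather than the current Friends schema, the wrapping description
element is simplified and left without a namespace, and parse_friends is
a made-up name. The namespace URI is assumed to be the one used by the
existing Friends schema. An entry with no type attribute falls back to
'relatedCollection'.

```python
import xml.etree.ElementTree as ET

# Assumed to match the namespace of the existing Friends schema.
FRIENDS_NS = "http://www.openarchives.org/OAI/2.0/friends/"

# Simplified wrapper around the entries shown in the proposal; the third
# entry omits the type attribute to exercise the default.
SAMPLE = f"""
<description xmlns:f="{FRIENDS_NS}">
  <f:friends type="subCollection">
    http://oai.cheshire3.org/liverpoolArchives/cunard/titanic
  </f:friends>
  <f:friends type="superCollection">
    http://oai.cheshire3.org/liverpoolArchives/
  </f:friends>
  <f:friends>
    http://some.other.repository.edu/path/to/oai/
  </f:friends>
</description>
""".strip()


def parse_friends(xml_text):
    """Return (relationship, base URL) pairs in document order."""
    root = ET.fromstring(xml_text)
    links = []
    for element in root.iter(f"{{{FRIENDS_NS}}}friends"):
        url = (element.text or "").strip()
        # Untyped entries are treated as plain related collections.
        links.append((element.get("type", "relatedCollection"), url))
    return links


links = parse_friends(SAMPLE)
for relationship, url in links:
    print(relationship, url)
```

A harvester that does not understand the type attribute can ignore it and
still treat every entry as an ordinary friend.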
The link upwards is needed because the services may be provided by
systems that do not use URL path elements and instead rely on some other
naming scheme. For example, a less sophisticated but equally valid
approach would be to name CGI scripts with increasingly long names within
the same directory:
http://www.cheshire3.org/cgi/oai-liverpool.cgi
http://www.cheshire3.org/cgi/oai-liverpool-gypsy.cgi
http://www.cheshire3.org/cgi/oai-liverpool-cunard-titanic.cgi
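Since the hierarchy cannot be read off URLs like these, a harvester would
have to follow the typed links instead. A small sketch, assuming the
links have already been collected into a table: the LINKS data, the
intermediate oai-liverpool-cunard.cgi URL and the function names are all
assumptions made for illustration.

```python
CGI = "http://www.cheshire3.org/cgi/"
ROOT = CGI + "oai-liverpool.cgi"
GYPSY = CGI + "oai-liverpool-gypsy.cgi"
CUNARD = CGI + "oai-liverpool-cunard.cgi"  # assumed; not listed above
TITANIC = CGI + "oai-liverpool-cunard-titanic.cgi"

# Hypothetical (relationship, target) links gathered from each instance.
LINKS = {
    ROOT: [("subCollection", GYPSY), ("subCollection", CUNARD)],
    GYPSY: [("superCollection", ROOT)],
    CUNARD: [("superCollection", ROOT), ("subCollection", TITANIC)],
    TITANIC: [("superCollection", CUNARD)],
}


def linked(url, relationship, links):
    """Targets of the links of one relationship type from one instance."""
    return [target for kind, target in links.get(url, [])
            if kind == relationship]


def find_root(url, links):
    """Follow superCollection links upwards until there are none left."""
    seen = {url}
    while True:
        parents = linked(url, "superCollection", links)
        if not parents or parents[0] in seen:  # top reached, or a loop
            return url
        url = parents[0]
        seen.add(url)


def descendants(url, links):
    """Collect every sub-collection reachable through subCollection links."""
    found, stack = [], [url]
    while stack:
        for child in linked(stack.pop(), "subCollection", links):
            if child not in found:
                found.append(child)
                stack.append(child)
    return found


print(find_root(TITANIC, LINKS))
print(descendants(ROOT, LINKS))
```

Nothing in the traversal depends on the shape of the URLs, which is the
point of carrying the relationship in the link itself.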
The sub-collection description would then be moved from the ListSets
response into the Identify response. This has the advantage of only
requiring service providers to process one sort of collection/service
description rather than two. Secondly it means that the best practices
regarding the Identify response would also apply to any sub-collections,
increasing the likelihood of content providers describing their sets
appropriately.
---------
,'/:. Dr Robert Sanderson (azaroth at liverpool.ac.uk)
,'-/::::. http://www.csc.liv.ac.uk/~azaroth/
,'--/::(@)::. Dept. of Computer Science, Room 805
,'---/::::::::::. University of Liverpool
____/:::::::::::::.
I L L U M I N A T I Cheshire3 IR System: http://www.cheshire3.org/