[OAI-implementers] Sets Proposal (from DLF)

Fri Apr 22 13:15:51 EDT 2005

At DLF last week in one of the Birds of a Feather sessions some of the issues 
that people had with sets were discussed.  The following is a proposal for how 
these issues might be addressed without changing the OAI protocol.

Your comments are very welcome.

Thanks to Ralph and Jeff for their additional material incorporated below and 
to Tom Habing for his comments.

Rob Sanderson

[resent for crazy UoL email address mapping]

-------

*** OAI Set Proposal: Devolve Sets to Subsidiary URL Access Points ***

Author:  	Rob Sanderson (azaroth at liv.ac.uk)
Contributors:  	Ralph Levan   (levan at oclc.org)
 		Jeff Young    (jyoung at oclc.org)

Problem:

There have been many issues identified by the OAI community with the 
interoperability of sets, set descriptions and the out-of-band communications 
between data provider and service provider required to either create new sets 
or determine the nature of sets if the description is insufficient or not 
present.
Furthermore, given the present set specification, there has yet to be a 
consistent approach described to having a hierarchical structure of sets, nor a 
way to specify that a record has been deleted from a set, but not from the 
repository as a whole.
When populating set descriptions, it is possible to have some subsets make 
their records available in a different set of schemas to other subsets and it 
is not clear that this should be permitted at all, and if it should, how to 
unambiguously specify the set's capabilities.

Proposal:

Instead of furthering the misconception that sets are a weak form of search, 
due primarily to the lack of explicit semantics as to how sets should be 
defined, sub-collections could instead be treated as separate OAI repositories 
without any change to the existing protocol.  These subsidiary repositories 
typically would be an extended path from the base provider's URL.

For example, an OAI interface might be available at:
 	http://oai.cheshire3.org/liverpoolArchives/
The server might then support sub-collections of the full archives at:
 	http://oai.cheshire3.org/liverpoolArchives/gypsy/
 	http://oai.cheshire3.org/liverpoolArchives/cunard/ and
 	http://oai.cheshire3.org/liverpoolArchives/cunard/titanic/

Commentary:

Additional URL paths are free for practically all intents and purposes. Once a 
service can be made to listen at one URL, additional listening points below are 
much easier.  Thus the infrastructure of the web is made to assume part of the 
technical requirements of the protocol.

Not only does this shift a part of the burden on to the infrastructure, it 
enables various implementation strategies not currently available. Instead of 
one repository maintaining the complete collection and all of the sets, 
sub-collections could be handled either by one repository implementation or by 
running a new instance for each subsidiary URL/set. This second method lowers 
the barrier-to-entry for supporting sub-collections as a data provider.

This solves the tree-of-sets issue with no additional requirements. Because 
there is no (practical) limit on the length of a URL path, the depth of the 
sub-collections is likewise not technically limited.  The context is also more 
apparent, as it is present in the URL.

It also solves the problem of not being able to signal the deletion of records 
out of sets when they have not been deleted from the repository. For example, 
if a record were moved from one set into another, the record could be flagged 
as deleted in the repository maintaining the first set, and recently added to 
its new location.  This is not currently possible and requires periodic 
complete reharvesting.

The misperception regarding sets as a filter or search, rather than access to a 
well defined sub-collection is greatly lessened by the proposal. Instead of 
having a parameter carried in the request, which can easily be abused, the new 
URLs imply much more strongly that this is a well defined OAI service, 
especially as it is one.  This lessens the likelihood that service providers 
will contact the data provider asking for new sets [whether or not that is an 
advantage is easily debatable], and also the likelihood that searches will be 
crushed into the set parameter.

Benefits also accrue for the service providers as they do not have to implement 
set handling for the approximately half of the total data providers that 
support them [as per Thomas Habing's registry].  The move from the set 
specification to linked services also means that service providers will 
actually process the friends information, which otherwise may be ignored.

If there is one record which appears in multiple sets, then it will appear in 
each OAI repository instance.  Also, the records should all appear at the base 
OAI repository.  Even though these are different instances, the sets should 
have the same unique identifier such that they can be deduplicated if and when 
necessary.  The protocol specifically allows for common identifier schemes to 
be defined, so there seems to be no technical issue with this.

Technical Requirements:

The technical requirement for discovering these sets would be done via a 
slightly enhanced version of the Friends schema.  Each repository instance 
would maintain the next lowest level of sets.  In the examples given in the 
description, liverpoolArchives would link to gypsy and cunard. Cunard would 
then link to its sub-set about the Titanic.  Each set would also link upwards 
to its super-set.

Each friend entry needs to identify the relationship between the current 
repository instance and the friend.  As an example, this could be accomplished 
by a type attribute, with a default value of 'relatedCollection' for backwards 
compatibility.

<f:friends type=subCollection>
 	http://oai.cheshire3.org/liverpoolArchives/cunard/titanic
</f:friends>
<f:friends type=superCollection>
 	http://oai.cheshire3.org/liverpoolArchives/
</f:friends>
<f:friends type=relatedCollection>
 	http://some.other.repository.edu/path/to/oai/
</f:friends>

The need for the link upwards is that the services may be provided by systems 
that do not work by url path elements, and instead use some other system. For 
example a less sophisticated, but equally valid approach would be to name CGI 
scripts with increasingly long names within the same directory.  For example:
 	http://www.cheshire3.org/cgi/oai-liverpool.cgi
 	http://www.cheshire3.org/cgi/oai-liverpool-gypsy.cgi
 	http://www.cheshire3.org/cgi/oai-liverpool-cunard-titanic.cgi

The sub-collection description would then be moved into the Identify response 
from the listSets response.  This has the advantage of only requiring service 
providers to process one sort of collection/service description rather than 
two.  Secondly it means that the best practices regarding the Identify response 
would also apply to any sub-collections, increasing the likelihood of content 
providers describing their sets appropriately.

---------

       ,'/:.          Dr Robert Sanderson (azaroth at liverpool.ac.uk)
     ,'-/::::.        http://www.csc.liv.ac.uk/~azaroth/
   ,'--/::(@)::.      Dept. of Computer Science, Room 805
,'---/::::::::::.    University of Liverpool
____/:::::::::::::. I L L U M I N A T I  Cheshire3 IR System: 
http://www.cheshire3.org/