[OAI-implementers] Some OAI-PMH protocol issues

Wed Dec 5 04:40:28 EST 2007

Hello,

Further to my previous email about the document 
<http://www.ifremer.fr/docelec/doc/2007/acte-3238.pdf> we wrote to 
assess the main difficulties met during the first year of running 
our Avano <http://www.ifremer.fr/avano/> harvester, I would like to 
focus, in this email, on just 3 problems linked to the OAI-PMH protocol, 
Dublin Core or the implementation of repositories. I single out these 
3 problems because I believe they should not be too difficult to fix.

*Managing duplicates*

Too many duplicates in a harvester's result list can spoil the user's 
experience. This is not the main problem harvesters are facing today, 
but it is likely to grow in the coming years. Today, at least two 
phenomena can generate duplicates in harvesters' databases: 

    * Several research organisations or universities can record the same
      electronic resource in their own institutional repository. If
      Avano harvests those repositories, it will get several records
      describing the same resource stored in several places. This can
      happen if, for example, a publication is written in collaboration
      by several institutions. If so, this publication may be archived
      on the server of each institution. Considering the current low
      self-archiving rate, especially in life sciences, this phenomenon
      is not the main cause of duplicates.
    * National or thematic aggregator projects can pose a problem. In
      some countries, projects merging institutional repositories
      aggregate records from a selection of repositories in a
      centralised database before exposing them again in OAI-PMH on
      their own server. As a consequence, the records held on those
      servers are exposed twice in OAI-PMH: via the institutional
      repository and via the centralised database. If the manager of a
      harvester does not know about the architecture of those national
      or thematic projects, he may register both servers and generate
      duplicates in his harvester's result lists. 
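
As a rough illustration of how a harvester might spot the duplicates 
described in the two cases above, here is a minimal sketch in Python. It 
is not how Avano works. The record layout (dicts with 'repository' and 
'title' keys) and the idea of comparing normalised titles are 
assumptions made for the example; a real harvester would probably 
combine several fields.

```python
import re
import unicodedata
from collections import defaultdict


def title_key(title):
    """Reduce a title to a crude comparison key: no accents, case or punctuation."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def group_duplicates(records):
    """Group records that share a normalised title across repositories.

    `records` is an iterable of dicts with the hypothetical keys
    'repository' and 'title'.
    """
    groups = defaultdict(list)
    for record in records:
        groups[title_key(record["title"])].append(record)
    # Keep only the titles seen in more than one repository.
    return {
        key: recs
        for key, recs in groups.items()
        if len({r["repository"] for r in recs}) > 1
    }
```

Title matching of this kind is only a heuristic: it misses duplicates 
whose titles differ slightly and can wrongly merge distinct resources 
with the same title.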

/To help harvester administrators avoid registering repositories that 
generate duplicates, could we imagine adding to the description of the 
repository information about its involvement in a national or thematic 
aggregation system that re-exposes its records in OAI-PMH from a 
different server?
/

*Managing the Type and Date fields*

As far as I understand, in order to comply with the OAI-PMH protocol, 
repositories have to expose their data in the unqualified Dublin Core 
DTD. In this DTD all fields are optional. The fields are also 
unqualified, meaning, for example, that their values do not have to 
come from a closed value list. The optional and unformalised nature of 
this information raises several issues, especially for the Type field.

Indeed, even though the Dublin Core DTD recommends storing the Type 
information as standardised text strings, few repositories take this 
into consideration, and most still present the information as free text 
(e.g. publication, artjournal, text and article are all used to describe 
an article). Some harvesters, including Avano, let their users limit 
their search to one or several types of resources. To set up this 
filter, harvesters try to standardise the Type field using a system 
based on keyword recognition in this character string. This 
standardisation is therefore imperfect, and the filter may exclude 
relevant resources from the result list when a user narrows his search 
to one or several specific types. Some of the information contained in 
the Type field cannot be standardised at all.
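
To make the keyword recognition described above more concrete, here is a 
minimal sketch in Python. The keyword table is invented for the example 
and is not Avano's actual table; the target values are terms from the 
DCMI Type Vocabulary. The record layout (a dict with a 'type' list) is 
also an assumption.

```python
# Hypothetical keyword table: substring of the raw Type value -> DCMI Type term.
# Order matters: more specific keywords come first.
KEYWORD_TO_TYPE = [
    ("moving", "MovingImage"),
    ("video", "MovingImage"),
    ("image", "Image"),
    ("photo", "Image"),
    ("artjournal", "Text"),
    ("article", "Text"),
    ("publication", "Text"),
    ("text", "Text"),
]


def normalise_type(raw_values):
    """Return the first recognised type, or None if nothing matches."""
    for raw in raw_values:
        lowered = raw.lower()
        for keyword, dcmi_type in KEYWORD_TO_TYPE:
            if keyword in lowered:
                return dcmi_type
    return None


def filter_by_type(records, wanted):
    """Keep the records whose normalised type is in the set `wanted`."""
    return [r for r in records if normalise_type(r.get("type", [])) in wanted]
```

Any record whose Type value is missing or unrecognised normalises to 
None, so it silently drops out of every type filter.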

Even more problematic is the fact that some repositories do not fill in 
this field. As an example, in September 2007, out of the 107,000 records 
available in Avano, more than 26,000 did not have a Type field. All of 
those records are automatically excluded from the search space if a user 
limits his search to one or several selected types.

/Would it be possible to introduce a new, normalised and mandatory 
piece of information about the type of the digital object (text, image, 
video...) so that harvesters could offer a reliable option to filter on 
one or several types of objects in the end-user search?
/
The publication date is also problematic for harvesters. For example, in 
September 2007, out of the 107,000 records available in Avano, about 
15,000 did not have a publication date. When a record does not have a 
publication date, or when the date cannot be standardised, the record is 
automatically placed at the end of the list if the user wants the 
results sorted by date. In the same way, when a user limits his search 
to a specific period of time (see fig. 9), those records are excluded 
from the search even if they match the query. 
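
The following Python sketch illustrates both effects. It is only an 
illustration under stated assumptions: records are dicts with an 
optional free-text 'date' value, only a four-digit year is extracted, 
and results are sorted newest first (the sort direction is my choice for 
the example).

```python
import re


def parse_year(raw):
    """Extract a plausible four-digit year from a free-text date, else None."""
    match = re.search(r"\b(1[5-9]\d\d|20\d\d)\b", raw or "")
    return int(match.group(1)) if match else None


def sort_by_year(records):
    """Sort newest first; records without a usable date go to the end."""
    def key(record):
        year = parse_year(record.get("date"))
        return (year is None, -(year or 0))
    return sorted(records, key=key)


def in_period(records, start, end):
    """Keep records dated within [start, end]; undated records are dropped."""
    kept = []
    for record in records:
        year = parse_year(record.get("date"))
        if year is not None and start <= year <= end:
            kept.append(record)
    return kept
```

With this logic, a record dated "n.d." or with no date at all can never 
appear in a period-limited search, whatever its real publication date.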

But I guess this problem with the publication date will be more 
difficult to fix because it is difficult to define it as mandatory.

*Records without free access to the digital object*

As far as I understand, the OAI-PMH protocol only defines how the 
bibliographic records held in a group of repositories are shared. 
As a consequence, some repositories mix records without links to the 
digital object together with records providing free access to the 
resource. Others provide records with paid access (e.g. BePress) or 
records with restricted access, for example for university staff only. 

In my opinion, this is the major problem harvesters have to face today. 
Nothing in the Dublin Core DTD tells harvesters how accessible the 
objects described in the records are. As a consequence, harvesters 
cannot pass this information on to their users or let them filter out 
empty records or records offering only paid access to the resource.

It is my opinion that hiding records with free full text among records 
with inaccessible full text is not helpful. For lack of time and/or 
interest, scientists are reluctant to join the Open Access movement, and 
the archiving rate of free access publications stays very low, 
especially in life sciences. Free and immediate access to documents 
is, without doubt, the best way to convince scientists of the value of 
the Open Access movement. Drowning a minority of records providing free 
access publications in an ocean of records without a link to the full 
text and/or records offering only paid access to the documents may not 
be the best way to promote the Open Access movement.

Again, those records without free access to the full text would not be a 
problem for harvesters if the Dublin Core DTD made it possible to tell 
them how accessible the objects described in the records are. 
Harvesters could then let their users filter out the records without 
free access to the digital object. But this is still not the case. 

/Could we then imagine that, in a possible future version of the 
OAI-PMH, each record will have to provide a normalised and mandatory 
piece of information about the degree of accessibility of the digital 
object (free, paying, impossible, restricted,...)? This would greatly 
help harvesters provide a better service to their end-users.
/
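
To show what such a field would make possible on the harvester side, 
here is a minimal Python sketch. Everything in it is hypothetical: 
neither OAI-PMH nor unqualified Dublin Core defines an 'access' value 
today, and the four values are simply the ones listed in the question 
above.

```python
from enum import Enum


class Access(Enum):
    """Hypothetical controlled vocabulary for the accessibility of an object."""
    FREE = "free"
    PAYING = "paying"
    RESTRICTED = "restricted"
    IMPOSSIBLE = "impossible"


def parse_access(value):
    """Map a raw string to an Access member, or None if missing or unknown."""
    try:
        return Access(value.strip().lower())
    except (AttributeError, ValueError):
        return None


def free_only(records):
    """Keep only the records that declare free access to the digital object."""
    return [r for r in records if parse_access(r.get("access")) is Access.FREE]
```

Because the field would be mandatory and drawn from a closed list, the 
filter needs no guesswork, unlike the keyword matching currently 
required for the Type field.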

What do you think?

Kind regards,
Fred

-- 
Fred Merceur
Ifremer / Bibliothèque La Pérouse
frederic.merceur at ifremer.fr
Tél : 02-98-49-88-69
Fax : 02-98-49-88-84
Bibliothèque La Pérouse <http://www.ifremer.fr/blp/>
Archimer, Ifremer's Institutional Repository 
<http://www.ifremer.fr/docelec/>
Avano, a marine and aquatic OAI harvester <http://www.ifremer.fr/avano/>