[OAI-implementers] Re: Proposed corrections/fixes to OAI-PMH protocol document and schema

Simeon Warner simeon at cs.cornell.edu
Wed Sep 15 16:58:16 EDT 2004


Back in June I detailed several proposed corrections and fixes to the
OAI-PMH protocol document and schema (copied below, alternatively see:
http://openarchives.org/pipermail/oai-implementers/2004-June/001216.html).
Hussein Suleman sent comments about issues 1 and 2 which I'll reply to in
a separate email. I have implemented the other changes:

3) Added bullet to protocol document, sec2.6, saying sets may be empty.

4) Corrected typo in protocol document sec2.6 to say ListIdentifiers,
   ListRecords and GetRecord instead of just GetRecord.

5) Corrected typo in the second example in protocol document sec5.

6) Changed schema to tighten checks on dateTime values so that they
   must use Z notation (in addition to being a valid XML Schema dateTime).

7) Added section "2.5.1 Deleted records", folded in some text from 2.7.1
   following Hussein's suggestion.

8) Already implemented for static repository (2004-03-29)

For reference, the old version of the protocol document is at:
http://www.openarchives.org/OAI/2.0/openarchivesprotocol.2004-06-10.htm
(also includes lots of HTML fixes), and the old version of schema is at:
http://www.openarchives.org/OAI/2.0/OAI-PMH.2004-03-29.xsd

None of these changes should affect conforming implementations.

Cheers,
Simeon


On Tue, 15 Jun 2004, Simeon Warner wrote:
> I've been collecting problems/corrections to the OAI-PMH protocol document
> and schema for some time. I detail proposed corrections and fixes below.
> Please reply either directly to me or to the list with comments or
> additional issues/corrections/errors.
>
> Cheers,
> Simeon
>
>
>
> PROPOSED FIXES TO OAI PROTOCOL DOCUMENT AND SCHEMA
> --------------------------------------------------
>
> 1) Correct protocol document and schema definition of setSpec to be
> consistent, and also to permit the use of URL encoding.
>
> 1.1) Motivation
>
> First, the protocol document and the schema simply do not agree. The use
> of the wording "any characters that are safe in a query component of a
> URI" is unclear and cannot be construed to agree with the schema. Second,
> many repositories are using URL-like encoding to create setSpecs so it
> seems better to permit the recognized URL encoding. The practical change
> to meet both of these criteria is very small: the schema regular
> expression should be changed to remove $ and +, and to add ~ and %xx (URL
> encoding). This will bring the protocol document in line with the terms
> "escaped" and "unreserved" as used in the URI RFC.
>
> 1.2) Impact
>
> The only conforming repository that we know of using setSpecs affected by
> this change is Jeff Young's OpenURL repository
> (http://alcme.oclc.org/openurl/servlet/OAIHandler) where he uses '+' as
> an encoding for space. Jeff agrees that a change would be sensible and
> that he could replace '+' with '%20'. Repositories using URL-like
> encodings will not be affected although they may choose to change to use
> real URL encoding. All OAI software maintainers should, however, review
> the change and update their parsing code accordingly.
>
> 1.3) Changes
>
> 1.3.1) Change wording in protocol document
> http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#Set
> from:
>
> a setSpec -- a colon [:] separated list indicating the path from the root
> of the set hierarchy to the respective node.  Each element in the list is
> a string consisting of any characters that are safe in a query component
> of a URI, which must not contain any colons [:].  Since a setSpec forms
> a unique identifier for the set within the repository, it must be unique
> for each set.  Flat set organizations have only sets with setSpec that do
> not contain any colons [:].
>
> to:
>
> a setSpec -- a colon [:] separated list indicating the path from the root
> of the set hierarchy to the respective node. Each element in the list is a
> string consisting of any valid URI "unreserved" and "escaped" characters.
> A setTag must not contain URI "reserved" characters, for example the colon
> [:] which is used to delimit setTags. Since a setSpec forms a unique
> identifier for the set within the repository, it must be unique for each
> set. Flat set organizations have only sets with setSpec that do not
> contain any colons [:].
>
> The corresponding parts of the specification of allowed characters in URIs
> are:
>
> unreserved    = alphanum | mark
> mark          = "-" | "_" | "." | "!" | "~" | "*" | "'" |
>                 "(" | ")"
> escaped       = "%" hex hex
> hex           = digit | "A" | "B" | "C" | "D" | "E" | "F" |
>                 "a" | "b" | "c" | "d" | "e" | "f"
>
>
> 1.3.2) Change definition of setSpecType in the schema to match the definition
> from:
>
>   <simpleType name="setSpecType">
>     <restriction base="string">
>       <pattern value=
>        "([A-Za-z0-9_!'$\(\)\+\-\.\*])+(:[A-Za-z0-9_!'$\(\)\+\-\.\*]+)*"/>
>     </restriction>
>   </simpleType>
>
> to:
>
>   <simpleType name="setSpecType">
>     <restriction base="string">
>       <pattern value="([A-Za-z0-9\-_\.!~\*'\(\)]|(%[A-Fa-f0-9]{2}))+(:([A-Za-z0-9\-_\.!~\*'\(\)]|(%[A-Fa-f0-9]{2}))+)*"/>
>     </restriction>
>   </simpleType>
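
For implementers who want to check setSpec values outside a schema
validator, here is a rough Python sketch of the proposed pattern. The
function names are my own and not part of any OAI software. XML Schema
patterns are implicitly anchored, so the sketch uses fullmatch to get
the same behaviour. The single-element pattern is the same one proposed
for metadataPrefixType in item 2.

```python
import re

# One path element: URI "unreserved" characters or "%xx" escapes,
# transcribed from the proposed schema pattern.
_ELEMENT = r"(?:[A-Za-z0-9\-_.!~*'()]|%[A-Fa-f0-9]{2})+"

# A setSpec is one or more elements separated by colons.
SET_SPEC = re.compile(rf"{_ELEMENT}(?::{_ELEMENT})*")
METADATA_PREFIX = re.compile(_ELEMENT)


def is_valid_set_spec(value: str) -> bool:
    """Return True if value matches the proposed setSpecType pattern."""
    return SET_SPEC.fullmatch(value) is not None


def is_valid_metadata_prefix(value: str) -> bool:
    """Return True if value matches the proposed metadataPrefixType pattern."""
    return METADATA_PREFIX.fullmatch(value) is not None
```

With this sketch 'institution:florida' and 'a%20b' are accepted, while
'a+b' and 'a$b' (allowed by the old pattern) are rejected.
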
>
>
> 2) Correct protocol document and schema definition for metadataPrefix to
> be consistent, and also to match the revised setSpec definition.
>
> 2.1) Motivation
>
> The protocol document uses the same imprecise wording for metadataPrefix
> as it does for setSpec ("any characters that are safe in a query
> component of a URI") and the schema does not even follow a reasonable
> interpretation of this wording. It seems sensible to use the same
> character restrictions in a consistent fashion. This will bring the
> protocol document in line with the terms "escaped" and "unreserved" as
> used in the URI RFC.
>
> 2.2) Impact
>
> This change is not expected to impact any known repository.  All OAI
> software maintainers should, however, review the change and update their
> parsing code accordingly.
>
> 2.3) Changes
>
> 2.3.1) Change wording in protocol document
> http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#metadataPrefix
> from:
>
> The metadataPrefix - a string to specify the metadata format in OAI-PMH
> requests issued to the repository. metadataPrefix consists of any
> characters that are safe in a query component of a URI. metadataPrefix
> arguments are used in ListRecords, ListIdentifiers, and GetRecord
> requests to retrieve records, or the headers of records that include
> metadata in the format specified by the metadataPrefix;
>
> to:
>
> The metadataPrefix - a string to specify the metadata format in OAI-PMH
> requests issued to the repository. metadataPrefix consists of any valid
> URI "unreserved" and "escaped" characters. A metadataPrefix must not
> contain URI "reserved" characters. metadataPrefix arguments are used in
> ListRecords, ListIdentifiers, and GetRecord requests to retrieve records,
> or the headers of records that include metadata in the format specified
> by the metadataPrefix;
>
> 2.3.2) Change definition of metadataPrefixType in schema to match the
> definition from:
>
>   <simpleType name="metadataPrefixType">
>     <restriction base="string">
>       <pattern value="[A-Za-z0-9_!'$\(\)\+\-\.\*]+"/>
>     </restriction>
>   </simpleType>
>
> to:
>
>   <simpleType name="metadataPrefixType">
>     <restriction base="string">
>       <pattern value="([A-Za-z0-9\-_\.!~\*'\(\)]|(%[A-Fa-f0-9]{2}))+"/>
>     </restriction>
>   </simpleType>
>
>
> 3) Change protocol document to explicitly state that sets may be empty
>
> 3.1) Motivation
>
> This issue has been raised a number of times in discussion and is not
> made explicit in the protocol document.
>
> 3.2) Impact
>
> None (except clarity)
>
> 3.3) Change: add additional bullet to the final set of bullets in protocol
> document sec2.6
> http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#Set
>
> Introduction of bullets should say "Five issues should be noted here"
> instead of "Four issues should be noted here".
>
> Additional bullet: The set hierarchy of a repository may include sets
> that are empty.
>
>
> 4) Correct typo in sec2.6
> (http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#Set)
>
> Section 2.6 says:
>
> If a repository supports sets then it must include set membership
> information in the GetRecord request. The list of setSpec should include
> only the minimum number of setSpec required to specify the set
> membership. Using the previous example of a set hierarchy, the header for
> an item organized in set institution:florida should not include setSpec
> institution since that is implied by the setSpec institution:florida.
>
> when it should say 'in response to ListIdentifiers, ListRecords and
> GetRecord requests' instead of 'in the GetRecord request'. The corrected
> paragraph reads:
>
> If a repository supports sets then it must include set membership
> information in the headers returned in response to ListIdentifiers,
> ListRecords and GetRecord requests. The list of setSpec elements should
> include only the minimum number of setSpec elements required to specify
> the set membership. Using the previous example of a set hierarchy, the
> header for an item organized in set institution:florida should not
> include setSpec institution since that is implied by the setSpec
> institution:florida.
>
> [problem pointed out by Mike Taylor <mike at indexdata.com>]
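
The "minimum number of setSpec elements" rule can be sketched in Python
as follows. This is only an illustration under my own naming
(minimal_set_specs is hypothetical): it drops every setSpec that is an
ancestor of another setSpec in the same list.

```python
def minimal_set_specs(set_specs):
    """Drop any setSpec that is implied by a descendant in the list.

    'institution' is implied by 'institution:florida' because the
    latter starts with 'institution' followed by a colon.
    """
    unique = sorted(set(set_specs))
    return [
        spec
        for spec in unique
        if not any(other.startswith(spec + ":") for other in unique)
    ]
```

Comparing against spec + ":" rather than spec alone matters here,
otherwise a set such as 'inst' would wrongly be treated as an ancestor
of 'institution:florida'.
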
>
>
> 5) Correct typo in the second example in protocol document sec5
> http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#dublincore
>
> The namespace prefix is incorrectly given in the declaration: it is
> declared as 'oai' when it should be 'oai_dc', so the example is not
> schema valid. The following correction is required:
>
> 3c3
> <     xmlns:oai="http://www.openarchives.org/OAI/2.0/oai_dc/"
> ---
> >     xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
>
> [problem pointed out by Terry Harrison <1maniac at cox.net>]
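
A quick way to see the effect of the wrong prefix is to feed a reduced
version of the example to a namespace-aware parser. The Python sketch
below uses a cut-down document of my own, not the text from sec5, and
assumes the root element is named oai_dc:dc. With only 'oai' declared,
the 'oai_dc' prefix is unbound and the parser rejects the document.

```python
import xml.etree.ElementTree as ET

# Reduced, hypothetical version of the example: the root element uses
# the oai_dc prefix, but only the prefix 'oai' is declared.
BROKEN = (
    '<oai_dc:dc xmlns:oai="http://www.openarchives.org/OAI/2.0/oai_dc/" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    "<dc:title>Example</dc:title></oai_dc:dc>"
)

# Same document with the declaration corrected as in the diff above.
FIXED = BROKEN.replace("xmlns:oai=", "xmlns:oai_dc=")


def parses(xml_text: str) -> bool:
    """Return True if the namespace-aware parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True
```
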
>
>
> 6) Schema should more tightly validate UTCdatetime
>
> 6.1) Motivation
>
> While the specification is quite clear about what datetime formats must
> be used, the fact that the schema does not enforce the restriction to Z
> notation means that there are errors not found during schema validation.
> The additional check is easy to add to the schema and would likely
> improve interoperability.
>
> 6.2) Impact
>
> No conforming repository should be affected. The responses from some
> non-conforming implementations will no longer schema validate. It is
> hoped that this will encourage maintainers to correct them.
>
> 6.3) Change schema definition from:
>
>   <simpleType name="UTCdatetimeType">
>     <union memberTypes="date dateTime"/>
>   </simpleType>
>
> to:
>
>   <simpleType name="UTCdatetimeType">
>     <union memberTypes="date oai:UTCdateTimeZType"/>
>   </simpleType>
>
>   <simpleType name="UTCdateTimeZType">
>     <restriction base="dateTime">
>       <pattern value=".*Z"/>
>     </restriction>
>   </simpleType>
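
For code that cannot rely on schema validation, a similar check might
look like the Python sketch below. Note the assumptions: it accepts
only the two forms the protocol document describes (YYYY-MM-DD and
YYYY-MM-DDThh:mm:ssZ), so it is somewhat stricter than the schema
types above, and the function name is my own.

```python
import re
from datetime import datetime

# Fixed-width shapes of the two accepted forms (ASCII digits only).
_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")
_DATETIME_Z = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[0-9]{2}:[0-9]{2}Z")


def is_utc_datetime(value: str) -> bool:
    """Accept a day-granularity date or a seconds-granularity Z dateTime."""
    if _DATE.fullmatch(value):
        fmt = "%Y-%m-%d"
    elif _DATETIME_Z.fullmatch(value):
        fmt = "%Y-%m-%dT%H:%M:%SZ"
    else:
        return False
    try:
        # strptime rejects impossible values such as month 13.
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True
```

A value with a numeric offset such as '2004-09-15T16:58:16-04:00' is a
valid XML Schema dateTime but is rejected here, which is the point of
the tightened schema type.
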
>
>
> 7) Add section on deleted records to the protocol document
>
> 7.1) Motivation
>
> Deleted records are an issue that causes confusion. This is not helped by
> information about them being distributed over the protocol document.
>
> 7.2) Impact
>
> none (except greater comprehension!)
>
> 7.3) Change protocol document to include:
>
> 2.5.1 Deleted records
>
> If a record is no longer available then it is said to be 'deleted'.
> Repositories may or may not keep track of deletions. If a repository does
> not keep track of deletions then such records will simply vanish from
> responses and there will be no way for a harvester to discover deletions
> through continued incremental harvesting. If a repository does keep track
> of deletions then the datestamp of the deleted record _must_ be the date
> and time that it was deleted. Responses to a GetRecord request for a
> deleted record _must_ then include a header with the attribute
> status="deleted" and no metadata or about parts. Similarly, responses to
> selective harvesting requests with set membership and date range criteria
> that include deleted records _must_ include the headers of these records.
> Incremental harvesting will thus discover deletions from repositories
> that keep track of them. Repositories must indicate their level of
> support for deletions in the deletedRecord element of the Identify
> response.
>
> Note that deleted status is a property of individual records. Like a
> normal record, a deleted record is identified by a unique identifier, a
> metadataPrefix and a datestamp. Other records, with different
> metadataPrefix but the same unique identifier, may remain available for
> the item.
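
From the harvester side, picking up deletions amounts to looking for
headers with status="deleted". A minimal Python sketch follows. The
sample response is abbreviated and its identifiers are made up, and
deleted_identifiers is my own name. Only the OAI-PMH namespace and the
header, identifier and datestamp element names come from the protocol.

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Abbreviated, hypothetical ListIdentifiers response with one deletion.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:example.org:1</identifier>
      <datestamp>2004-06-10</datestamp>
    </header>
    <header status="deleted">
      <identifier>oai:example.org:2</identifier>
      <datestamp>2004-06-15</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""


def deleted_identifiers(response_xml: str):
    """Return (identifier, datestamp) pairs for headers marked deleted."""
    root = ET.fromstring(response_xml)
    return [
        (
            header.findtext(OAI_NS + "identifier"),
            header.findtext(OAI_NS + "datestamp"),
        )
        for header in root.iter(OAI_NS + "header")
        if header.get("status") == "deleted"
    ]
```

For a repository that tracks deletions, the datestamp returned for a
deleted header is the date and time of the deletion.
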
>
>
> 8) Change schema to define a type for protocolVersion instead of using
> an anonymous definition (ALREADY DONE, 2004-03-29)
>
> 8.1) Motivation
>
> To allow reuse as part of the Static Repository schema
> http://www.openarchives.org/OAI/2.0/OAI-PMH-static-repository.xsd
>
> 8.2) Impact
>
> None except as noted in motivation. Schema semantics unchanged; all
> validating instances will still validate.
>
> 8.3) Change to OAI-PMH.xsd
>
> 98,104c100
> <       <element name="protocolVersion">
> <         <simpleType>
> <           <restriction base="string">
> <             <enumeration value="2.0"/>
> <           </restriction>
> <         </simpleType>
> <       </element>
> ---
> >       <element name="protocolVersion" type="oai:protocolVersionType"/>
> 192a189
> >
> 253a251,256
> >
> >   <simpleType name="protocolVersionType">
> >     <restriction base="string">
> >       <enumeration value="2.0"/>
> >     </restriction>
> >   </simpleType>
>
> (old version available as
> http://www.openarchives.org/OAI/2.0/OAI-PMH.2002-06-13.xsd)
>


