[OAI-implementers] Re: Part II: Proposed corrections/fixes to
OAI-PMH protocol document and schema
Simeon Warner
simeon at cs.cornell.edu
Tue Oct 12 11:48:11 EDT 2004
Since there has been no objection to part A below, I have gone ahead and
made those changes. Updated versions of the OAI-PMH schema and protocol
document are now live on the OAI website:
http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm
http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd
The previous versions are available for reference:
http://www.openarchives.org/OAI/2.0/openarchivesprotocol.2004-09-15.htm
http://www.openarchives.org/OAI/2.0/OAI-PMH.2004-09-14.xsd
I have not yet tackled part B below because of lack of agreement.
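For anyone still weighing part B, here is a rough Python sketch of what the proposed %xx escaping would look like in practice. It is not part of the revised specification. The function names are made up for illustration, and the choice of UTF-8 for non-ASCII characters is an assumption that the proposal does not address.

```python
from typing import Iterable, List
from urllib.parse import quote, unquote

# URI "mark" characters; quote() already leaves letters, digits and "-_." alone.
MARKS = "!~*'()"


def escape_element(native: str) -> str:
    """Percent-escape everything outside the URI 'unreserved' set (UTF-8 assumed)."""
    return quote(native, safe=MARKS, encoding="utf-8")


def make_set_spec(path: Iterable[str]) -> str:
    """Join escaped elements with the colon that OAI-PMH reserves as separator."""
    return ":".join(escape_element(element) for element in path)


def split_set_spec(set_spec: str) -> List[str]:
    """Recover native element names: split on colons first, then unescape each."""
    return [unquote(element, encoding="utf-8") for element in set_spec.split(":")]


def as_query_value(set_spec: str) -> str:
    """Separate second layer: URL-encode the setSpec for an HTTP GET request."""
    return quote(set_spec, safe="")
```

Under these assumptions a native set path of "Arts & Crafts" followed by "a:b" would become the setSpec Arts%20%26%20Crafts:a%3Ab, and the % signs and the colon separator would then be escaped once more when that value is placed in a request URL.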
Implementers should check their implementations to make sure that nothing
needs to be changed to agree with these revised (and now consistent!)
specifications for allowed characters in setSpec and metadataPrefix
values.
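As a starting point for that check, here is a minimal Python sketch that tests values against the revised part A patterns. The helper names are hypothetical. XML Schema patterns always match the whole string, so the sketch uses fullmatch() to get the same behaviour.

```python
import re

# Character class after the change: URI "unreserved" characters only
# (alphanumerics plus the marks - _ . ! ~ * ' ( ) ).
UNRESERVED = r"[A-Za-z0-9\-_.!~*'()]"

# Same structure as the revised schema patterns: one or more elements,
# separated by colons in the case of setSpec.
SET_SPEC_RE = re.compile(rf"{UNRESERVED}+(:{UNRESERVED}+)*")
METADATA_PREFIX_RE = re.compile(rf"{UNRESERVED}+")


def is_valid_set_spec(value: str) -> bool:
    """True if value is allowed as a setSpec under the revised schema."""
    return SET_SPEC_RE.fullmatch(value) is not None


def is_valid_metadata_prefix(value: str) -> bool:
    """True if value is allowed as a metadataPrefix under the revised schema."""
    return METADATA_PREFIX_RE.fullmatch(value) is not None
```

Values containing $ or + were accepted by the old schema and are now rejected, while values containing ~ are newly accepted. Running existing setSpec and metadataPrefix values through checks like these should show whether a repository is affected.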
One other minor change to the protocol document: the list of technical
committee members and alpha testers now gives just names and affiliations,
with the email addresses removed. This may be closing the door somewhat
after the horse has bolted, but it is one less trivially harvestable set
of email addresses.
Cheers,
Simeon
On Thu, 16 Sep 2004, Simeon Warner wrote:
> I'd like to solicit further comment regarding issues 1 and 2 of the set of
> proposed corrections and fixes to the OAI-PMH protocol document and schema
> that I sent back in June (copied below, alternatively see:
> http://openarchives.org/pipermail/oai-implementers/2004-June/001216.html).
> These are really the same issue repeated for both setSpec and
> metadataPrefix. Both cases involve the same two parts which I describe
> below: part A I assume is not controversial; part B Hussein commented on.
> A lack of other comments presumably indicates lack of other objections but
> I'd like to confirm that since the proposal will involve minor changes in
> some implementations.
>
>
> A) The values of setSpec and metadataPrefix permitted by the protocol
> document and by the schema simply do not agree. This should be corrected.
>
> The meaning of the current wording "any characters that are safe in a
> query component of a URI" is unclear and cannot be construed to agree with
> the schema. I suggest the simplest way to clarify and fix this is to
> rephrase as "a string consisting of any valid URI 'unreserved' characters"
> which would give the following changes in allowed values (both of which
> add ~ and disallow $ and + ):
>
> setSpec from:
> <pattern value="([A-Za-z0-9_!'$\(\)\+\-\.\*])+(:[A-Za-z0-9_!'$\(\)\+\-\.\*]+)*"/>
> to:
> <pattern value="([A-Za-z0-9\-_\.!~\*'\(\)])+(:[A-Za-z0-9\-_\.!~\*'\(\)]+)*"/>
>
> metadataPrefix from:
> <pattern value="[A-Za-z0-9_!'$\(\)\+\-\.\*]+"/>
> to:
> <pattern value="[A-Za-z0-9\-_\.!~\*'\(\)]+"/>
>
> The setSpec pattern is more complicated because elements are separated by
> colons [:].
>
>
> B) There should be some standard way to permit straightforward use,
> perhaps via escaping, of setSpec and metadataPrefix values native to
> repositories.
>
> The suggestion is to permit URI "escaped" characters (%xx where xx are two
> hex digits). I note that a number of repositories have already adopted
> encoding using hex but that in most cases the escape character is simply
> omitted; in a few cases another escape character has been chosen (e.g. *)
> because % is not permitted. The fact that implementers are already doing
> this demonstrates a desire to encode values native to other systems.
> Permitting URI "escaped" characters is a simple way to standardize this
> using a well-known escaping mechanism without significantly increasing
> complexity.
>
> Alternatives include:
>
> 1) Use another escaping mechanism. Another obvious choice would be to use
> XML numeric entities (e.g. '&#58;' (decimal) or '&#x3A;' (hex) for a
> colon). These entities would themselves have to be escaped in
> XML responses (otherwise you have alternative 2) so responses might
> include XML of the form <setSpec>&amp;#x3A;</setSpec> to encode a setSpec
> which is internally a colon [:]. One might also want to restrict to
> just-decimal or just-hex to reduce complexity. It seems to me that one
> ends up with a complex set of restrictions on XML entity encoding which
> largely negate any benefit of adopting that standard. Perhaps there is
> another good option?
>
> 2) Permit a much larger character set in the first place (the limit being
> "anything" - the XML schema "string" type). I see three issues with this.
> First, when OAI-PMH was first designed we decided on a limited character
> set to make implementation easier; I think this still has some merit.
> Second, in the setSpec there will always be a potential need to escape a
> colon [:], since that has special meaning in OAI-PMH (which may not
> correspond to use in values native to a repository). Third, this would be
> a significant change requiring updates to most harvesting software.
> Significant extension of the character set is beyond the scope of the
> present proposal.
>
> 3) Do not include a standard way to permit the use of setSpec and
> metadataPrefix values native to repositories (simply make the protocol
> document and schema agree as described in A).
>
> Note that this issue is quite separate from URL-encoding of OAI requests
> made over HTTP. Characters used in any escaping mechanism for setSpec and
> metadataPrefix may need to be further escaped when used in URLs.
>
> On Mon, 21 Jun 2004, Hussein Suleman wrote:
> ...
> > 1/2: i have some reservations about us requiring URL-encoding within
> > XML. this mixes syntax with intended semantics of use and further
> > entrenches the implicit support for URL-encoding, which is irrelevant
> > if, for example, OAI-PMH makes the jump to using a SOAP request/response
> > model. the model and abstractions must be clean and separable, they
> > arent quite so already and i would prefer they didnt get more complicated.
>
> In response, I don't think the proposal was to _require_ URL-encoding. It
> was to allow it at a data-provider's choice; service providers should (in
> the absence of other information, e.g. oai_dc is special) treat both
> setSpec and metadataPrefix values as opaque tokens. OAI-PMH's special use
> of the colon means that this issue would not entirely go away even if
> OAI-PMH used an XML-clean transport such as SOAP, and we were no longer
> concerned about the burden on harvesters of permitting any string to be
> used.
>
>
> Ug, that got longer than I hoped...
>
> Cheers,
> Simeon
>
>
> > Simeon Warner wrote:
> > > ...
> > > PROPOSED FIXES TO OAI PROTOCOL DOCUMENT AND SCHEMA
> > > --------------------------------------------------
> > >
> > > 1) Correct protocol document and schema definition of setSpec to be
> > > consistent, and also to permit the use of URL encoding.
> > >
> > > 1.1) Motivation
> > >
> > > First, the protocol document and the schema simply do not agree. The use
> > > of the wording "any characters that are safe in a query component of a
> > > URI" is unclear and cannot be construed to agree with the schema. Second,
> > > many repositories are using URL-like encoding to create setSpecs so it
> > > seems better to permit the recognized URL encoding. The practical change
> > > to meet both of these criteria is very small: the schema regular
> > > expression should be changed to remove $ and +, and to add ~ and %xx (URL
> > > encoding). This will bring the protocol document in line with the terms
> > > "escaped" and "unreserved" as used in the URI RFC.
> > >
> > > 1.2) Impact
> > >
> > > The only conforming repository that we know of using setSpecs affected by
> > > this change is Jeff Young's OpenURL repository
> > > (http://alcme.oclc.org/openurl/servlet/OAIHandler) where he uses '+' as
> > > an encoding for space. Jeff agrees that a change would be sensible and
> > > that he could replace '+' with '%20'. Repositories using URL-like
> > > encodings will not be affected although they may choose to change to use
> > > real URL encoding. All OAI software maintainers should, however, review
> > > the change and update their parsing code accordingly.
> > >
> > > 1.3) Changes
> > >
> > > 1.3.1) Change wording in protocol document
> > > http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#Set
> > > from:
> > >
> > > a setSpec -- a colon [:] separated list indicating the path from the root
> > > of the set hierarchy to the respective node. Each element in the list is
> > > a string consisting of any characters that are safe in a query component
> > > of a URI , which must not contain any colons [ :]. Since a setSpec forms
> > > a unique identifier for the set within the repository, it must be unique
> > > for each set. Flat set organizations have only sets with setSpec that do
> > > not contain any colons [ :].
> > >
> > > to:
> > >
> > > a setSpec -- a colon [:] separated list indicating the path from the root
> > > of the set hierarchy to the respective node. Each element in the list is a
> > > string consisting of any valid URI "unreserved" and "escaped" characters.
> > > A setTag must not contain URI "reserved" characters, for example the colon
> > > [:] which is used to delimit setTags. Since a setSpec forms a unique
> > > identifier for the set within the repository, it must be unique for each
> > > set. Flat set organizations have only sets with setSpec that do not
> > > contain any colons [:].
> > >
> > > The corresponding parts of the specification of allowed characters in URIs
> > > are:
> > >
> > > unreserved = alphanum | mark
> > > mark = "-" | "_" | "." | "!" | "~" | "*" | "'" |
> > > "(" | ")"
> > > escaped = "%" hex hex
> > > hex = digit | "A" | "B" | "C" | "D" | "E" | "F" |
> > > "a" | "b" | "c" | "d" | "e" | "f"
> > >
> > >
> > > 1.3.2) Change definition of setSpecType in the schema to match the definition
> > > from:
> > >
> > > <simpleType name="setSpecType">
> > > <restriction base="string">
> > > <pattern value=
> > > "([A-Za-z0-9_!'$\(\)\+\-\.\*])+(:[A-Za-z0-9_!'$\(\)\+\-\.\*]+)*"/>
> > > </restriction>
> > > </simpleType>
> > >
> > > to:
> > >
> > > <simpleType name="setSpecType">
> > > <restriction base="string">
> > > <pattern value="([A-Za-z0-9\-_\.!~\*'\(\)]|(%[A-Fa-f0-9]{2}))+(:([A-Za-z0-9\-_\.!~\*'\(\)]|(%[A-Fa-f0-9]{2}))+)*"/>
> > > </restriction>
> > > </simpleType>
> > >
> > >
> > > 2) Correct protocol document and schema definition for metadataPrefix to
> > > be consistent, and also to match the revised setSpec definition.
> > >
> > > 2.1) Motivation
> > >
> > > The protocol document uses the same imprecise wording for metadataPrefix
> > > as it does for setSpec ("any characters that are safe in a query
> > > component of a URI") and the schema does not even follow a reasonable
> > > interpretation of this wording. It seems sensible to use the same
> > > character restrictions in a consistent fashion. This will bring the
> > > protocol document in line with the terms "escaped" and "unreserved" as
> > > used in the URI RFC.
> > >
> > > 2.2) Impact
> > >
> > > This change is not expected to impact any known repository. All OAI
> > > software maintainers should, however, review the change and update their
> > > parsing code accordingly.
> > >
> > > 2.3) Changes
> > >
> > > 2.3.1) Change wording in protocol document
> > > http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#metadataPrefix
> > > from:
> > >
> > > The metadataPrefix - a string to specify the metadata format in OAI-PMH
> > > requests issued to the repository. metadataPrefix consists of any
> > > characters that are safe in a query component of a URI. metadataPrefix
> > > arguments are used in ListRecords, ListIdentifiers, and GetRecord
> > > requests to retrieve records, or the headers of records that include
> > > metadata in the format specified by the metadataPrefix;
> > >
> > > to:
> > >
> > > The metadataPrefix - a string to specify the metadata format in OAI-PMH
> > > requests issued to the repository. metadataPrefix consists of any valid
> > > URI "unreserved" and "escaped" characters. A metadataPrefix must not
> > > contain URI "reserved" characters. metadataPrefix arguments are used in
> > > ListRecords, ListIdentifiers, and GetRecord requests to retrieve records,
> > > or the headers of records that include metadata in the format specified
> > > by the metadataPrefix;
> > >
> > > 2.3.2) Change definition of metadataPrefixType in schema to match the
> > > definition from:
> > >
> > > <simpleType name="metadataPrefixType">
> > > <restriction base="string">
> > > <pattern value="[A-Za-z0-9_!'$\(\)\+\-\.\*]+"/>
> > > </restriction>
> > > </simpleType>
> > >
> > > to:
> > >
> > > <simpleType name="metadataPrefixType">
> > > <restriction base="string">
> > > <pattern value="([A-Za-z0-9\-_\.!~\*'\(\)]|(%[A-Fa-f0-9]{2}))+"/>
> > > </restriction>
> > > </simpleType>
>
>
> ----------------------------------------------------------
> Simeon Warner Email: simeon at cs.cornell.edu
> Cornell Information Science Tel: 607-254-8605
> 301 College Ave Fax: 607-255-5196
> Ithaca, NY 14850-4623, USA
>
>