[OAI-implementers] Part II: Proposed corrections/fixes to OAI-PMH
protocol document and schema
Hussein Suleman
hussein at cs.uct.ac.za
Mon Sep 20 12:13:16 EDT 2004
hi Simeon (et al)
to follow on, i agree that we will always need to escape ":" because of
PMH semantics.
the clean solution is to propose the use of a special OAI escape
character, say "!". then, we could use the forward mapping:
: -> !:
! -> !!
then, specify that setSpecs and mdps are simply unrestricted Unicode,
with service providers having to apply URL-encoding when submitting
requests involving setSpecs and mdps, and data providers having to apply
XML encoding when returning such information (with reverse
transformation as needed). there are a few other issues here - like
Unicode use in URLs, but let's punt on that for now ...
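
A minimal Python sketch of the "!" escaping idea above, assuming each element of a setSpec is escaped on its own before the elements are joined with ":". The function names (escape_element, join_setspec, split_setspec) are hypothetical and not part of OAI-PMH; this is one possible reading of the mapping, not a worked-out proposal:

```python
from typing import List

ESC = "!"   # proposed OAI escape character
SEP = ":"   # OAI-PMH hierarchy separator in setSpecs


def escape_element(native: str) -> str:
    """Forward mapping: '!' -> '!!' first, then ':' -> '!:'."""
    return native.replace(ESC, ESC + ESC).replace(SEP, ESC + SEP)


def join_setspec(elements: List[str]) -> str:
    """Build a setSpec from native element names."""
    return SEP.join(escape_element(e) for e in elements)


def split_setspec(spec: str) -> List[str]:
    """Reverse mapping: split on unescaped ':' and undo the escapes."""
    elements, current = [], []
    i = 0
    while i < len(spec):
        ch = spec[i]
        if ch == ESC:
            # An escape must be followed by '!' or ':'.
            if i + 1 >= len(spec) or spec[i + 1] not in (ESC, SEP):
                raise ValueError("dangling escape at position %d" % i)
            current.append(spec[i + 1])
            i += 2
        elif ch == SEP:
            elements.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    elements.append("".join(current))
    return elements
```

With this sketch, join_setspec(["a:b", "c!d"]) gives "a!:b:c!!d", and split_setspec recovers the two native names. URL-encoding (for requests) and XML encoding (for responses) would still be applied on top of this.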
now, i know this proposes to change semantics - i believe we are already
on the slippery slope of trying to patch things up by introducing more
complexity and greater reliance on basic HTTP.
practically, in the short term, i support option 3, to tackle only issue
A and not issue B. in the long term, maybe when we consider SOAP, we
really should clean up the data model.
ttfn,
----hussein
Simeon Warner wrote:
> I'd like to solicit further comment regarding issues 1 and 2 of the set of
> proposed corrections and fixes to the OAI-PMH protocol document and schema
> that I sent back in June (copied below, alternatively see:
> http://openarchives.org/pipermail/oai-implementers/2004-June/001216.html).
> These are really the same issue repeated for both setSpec and
> metadataPrefix. Both cases involve the same two parts which I describe
> below: part A I assume is not controversial; part B Hussein commented on.
> A lack of other comments presumably indicates lack of other objections but
> I'd like to confirm that since the proposal will involve minor changes in
> some implementations.
>
>
> A) The values of setSpec and metadataPrefix permitted by the protocol
> document and by the schema simply do not agree. This should be corrected.
>
> The meaning of the current wording "any characters that are safe in a
> query component of a URI" is unclear and cannot be construed to agree with
> the schema. I suggest the simplest way to clarify and fix this is to
> rephrase as "a string consisting of any valid URI 'unreserved' characters"
> which would give the following changes in allowed values (both of which
> add ~ and disallow $ and + ):
>
> setSpec from:
> <pattern value="([A-Za-z0-9_!'$\(\)\+\-\.\*])+(:[A-Za-z0-9_!'$\(\)\+\-\.\*]+)*"/>
> to:
> <pattern value="([A-Za-z0-9\-_\.!~\*'\(\)])+(:[A-Za-z0-9\-_\.!~\*'\(\)]+)*"/>
>
> metadataPrefix from:
> <pattern value="[A-Za-z0-9_!'$\(\)\+\-\.\*]+"/>
> to:
> <pattern value="[A-Za-z0-9\-_\.!~\*'\(\)]+"/>
>
> The setSpec pattern is more complicated because elements are separated by
> colons [:].
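
To see the effect of the part A change, here is a small Python sketch that transcribes the two setSpec character classes quoted above into Python regular expressions. It assumes that Python's re.fullmatch is a fair stand-in for XML Schema pattern matching on these particular expressions (schema patterns are implicitly anchored); the names are made up for illustration:

```python
import re

# Character classes copied from the old and proposed schema patterns.
OLD_CLASS = r"[A-Za-z0-9_!'$\(\)\+\-\.\*]"
NEW_CLASS = r"[A-Za-z0-9\-_\.!~\*'\(\)]"


def setspec_pattern(char_class: str) -> "re.Pattern[str]":
    """Colon-separated list of non-empty elements drawn from char_class."""
    return re.compile(f"({char_class})+(:{char_class}+)*")


OLD_SETSPEC = setspec_pattern(OLD_CLASS)
NEW_SETSPEC = setspec_pattern(NEW_CLASS)


def allowed(pattern: "re.Pattern[str]", value: str) -> bool:
    """True if the whole value matches, as a schema validator would require."""
    return pattern.fullmatch(value) is not None
```

Under these assumptions, "physics:hep~th" is rejected by the old pattern and accepted by the new one, while "a+b" and "a$b" go the other way.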
>
>
> B) There should be some standard way to permit straightforward use,
> perhaps via escaping, of setSpec and metadataPrefix values native to
> repositories.
>
> The suggestion is to permit URI "escaped" characters (%xx where xx are two
> hex digits). I note that a number of repositories have already adopted
> encoding using hex but that in most cases the escape character is simply
> omitted; in a few cases another escape character has been chosen (e.g. *)
> because % is not permitted. The fact that implementers are already doing
> this demonstrates a desire to encode values native to other systems.
> Permitting URI "escaped" characters is a simple way to standardize this
> using a well-known escaping mechanism without significantly increasing
> complexity.
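
As an illustration of part B, a data provider could derive %xx-escaped values from native names with the Python standard library. This is only a sketch: the helper names are hypothetical, and it assumes native names are encoded as UTF-8 before escaping, which the proposal does not specify:

```python
from urllib.parse import quote, unquote


def encode_native(value: str) -> str:
    """Escape a native name using %xx.

    safe="" makes quote() escape reserved characters such as ':' and '/'
    as well, so the result cannot collide with the setSpec separator.
    """
    return quote(value, safe="")


def decode_native(token: str) -> str:
    """Recover the native name inside the repository."""
    return unquote(token)
```

For example, encode_native("Physics & Astronomy") yields "Physics%20%26%20Astronomy" and encode_native("a:b/c") yields "a%3Ab%2Fc". Note that quote() also escapes "!", "*", "'", "(" and ")", which the proposed pattern would allow unescaped; that is harmless but not required. Harvesters would still treat the result as an opaque token.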
>
> Alternatives include:
>
> 1) Use another escaping mechanism. Another obvious choice would be to use
> XML numeric entities (e.g. '&#58;' (decimal) or '&#x3A;' (hex) for a
> colon). These entities would themselves have to be escaped in
> XML responses (otherwise you have alternative 2) so responses might
> include XML of the form <setSpec>&amp;#x3A;</setSpec> to encode a setSpec
> which is internally a colon [:]. One might also want to restrict to
> just-decimal or just-hex to reduce complexity. It seems to me that one
> ends up with a complex set of restrictions on XML entity encoding which
> largely negate any benefit of adopting that standard. Perhaps there is
> another good option?
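
A short Python sketch of what alternative 1 would mean in practice, assuming the setSpec token itself is the literal text of a numeric entity and that the XML serializer escapes the ampersand as usual. The function names are hypothetical:

```python
import html
import xml.etree.ElementTree as ET


def setspec_element(token: str) -> str:
    """Serialize a setSpec token; the XML writer escapes '&' as '&amp;'."""
    elem = ET.Element("setSpec")
    elem.text = token
    return ET.tostring(elem, encoding="unicode")


def token_from_xml(xml_text: str) -> str:
    """What a harvester sees after ordinary XML parsing."""
    return ET.fromstring(xml_text).text


def native_from_token(token: str) -> str:
    """Second decoding step, applied only inside the repository."""
    return html.unescape(token)
```

Here the token "&#x3A;" is written to the response as <setSpec>&amp;#x3A;</setSpec>, parses back to the token "&#x3A;", and only the second, separate decoding step turns it into ":". The two layers of entity handling are the complexity referred to above.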
>
> 2) Permit a much larger character set in the first place (the limit being
> "anything" - the XML schema "string" type). I see three issues with this.
> First, when OAI-PMH was first designed we decided on a limited character
> set to make implementation easier; I think this still has some merit.
> Second, in the setSpec there will always be a potential need to escape a
> colon [:], since that has special meaning in OAI-PMH (which may not
> correspond to use in values native to a repository). Third, this would be
> a significant change requiring updates to most harvesting software.
> Significant extension of the character set is beyond the scope of the
> present proposal.
>
> 3) Do not include a standard way to permit the use of setSpec and
> metadataPrefix values native to repositories (simply make the protocol
> document and schema agree as described in A).
>
> Note that this issue is quite separate from URL-encoding of OAI requests
> made over HTTP. Characters used in any escaping mechanism for setSpec and
> metadataPrefix may need to be further escaped when used in URLs.
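
The two layers can be seen with a small Python sketch, assuming a harvester builds the query string with the standard library. The setSpec value and the build_query helper are made up for illustration:

```python
from urllib.parse import parse_qs, urlencode


def build_query(set_spec: str, metadata_prefix: str) -> str:
    """URL-encode an OAI-PMH ListRecords request's arguments."""
    return urlencode({
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,
    })


def received_set(query: str) -> str:
    """What the data provider sees after normal URL decoding."""
    return parse_qs(query)["set"][0]
```

A setSpec of "physics%20astro:hep" is sent as set=physics%2520astro%3Ahep: the "%" of the setSpec's own escape is itself escaped for the URL. After URL decoding the repository gets back "physics%20astro:hep" unchanged, still in its setSpec-escaped form.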
>
> On Mon, 21 Jun 2004, Hussein Suleman wrote:
> ...
>
>>1/2: i have some reservations about us requiring URL-encoding within
>>XML. this mixes syntax with intended semantics of use and further
>>entrenches the implicit support for URL-encoding, which is irrelevant
>>if, for example, OAI-PMH makes the jump to using a SOAP request/response
>>model. the model and abstractions must be clean and separable, they
>>aren't quite so already and i would prefer they didn't get more complicated.
>
>
> In response, I don't think the proposal was to _require_ URL-encoding. It
> was to allow it at a data-provider's choice; service providers should (in
> the absence of other information, e.g. oai_dc is special) treat both
> setSpec and metadataPrefix values as opaque tokens. OAI-PMH's special use
> of the colon means that this issue would not entirely go away even if
> OAI-PMH used an XML-clean transport such as SOAP, and we were no longer
> concerned about the burden on harvesters of permitting any string to be
> used.
>
>
> Ug, that got longer than I hoped...
>
> Cheers,
> Simeon
>
>
>
>>Simeon Warner wrote:
>>
>>>...
>>>PROPOSED FIXES TO OAI PROTOCOL DOCUMENT AND SCHEMA
>>>--------------------------------------------------
>>>
>>>1) Correct protocol document and schema definition of setSpec to be
>>>consistent, and also to permit the use of URL encoding.
>>>
>>>1.1) Motivation
>>>
>>>First, the protocol document and the schema simply do not agree. The use
>>>of the wording "any characters that are safe in a query component of a
>>>URI" is unclear and cannot be construed to agree with the schema. Second,
>>>many repositories are using URL-like encoding to create setSpecs so it
>>>seems better to permit the recognized URL encoding. The practical change
>>>to meet both of these criteria is very small: the schema regular
>>>expression should be changed to remove $ and +, and to add ~ and %xx (URL
>>>encoding). This will bring the protocol document in line with the terms
>>>"escaped" and "unreserved" as used in the URI RFC.
>>>
>>>1.2) Impact
>>>
>>>The only conforming repository that we know of using setSpecs affected by
>>>this change is Jeff Young's OpenURL repository
>>>(http://alcme.oclc.org/openurl/servlet/OAIHandler) where he uses '+' as
>>>an encoding for space. Jeff agrees that a change would be sensible and
>>>that he could replace '+' with '%20'. Repositories using URL-like
>>>encodings will not be affected although they may choose to change to use
>>>real URL encoding. All OAI software maintainers should, however, review
>>>the change and update their parsing code accordingly.
>>>
>>>1.3) Changes
>>>
>>>1.3.1) Change wording in protocol document
>>>http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#Set
>>>from:
>>>
>>>a setSpec -- a colon [:] separated list indicating the path from the root
>>>of the set hierarchy to the respective node. Each element in the list is
>>>a string consisting of any characters that are safe in a query component
>>>of a URI, which must not contain any colons [:]. Since a setSpec forms
>>>a unique identifier for the set within the repository, it must be unique
>>>for each set. Flat set organizations have only sets with setSpec that do
>>>not contain any colons [:].
>>>
>>>to:
>>>
>>>a setSpec -- a colon [:] separated list indicating the path from the root
>>>of the set hierarchy to the respective node. Each element in the list is a
>>>string consisting of any valid URI "unreserved" and "escaped" characters.
>>>A setTag must not contain URI "reserved" characters, for example the colon
>>>[:] which is used to delimit setTags. Since a setSpec forms a unique
>>>identifier for the set within the repository, it must be unique for each
>>>set. Flat set organizations have only sets with setSpec that do not
>>>contain any colons [:].
>>>
>>>The corresponding parts of the specification of allowed characters in URIs
>>>are:
>>>
>>>unreserved = alphanum | mark
>>>mark = "-" | "_" | "." | "!" | "~" | "*" | "'" |
>>> "(" | ")"
>>>escaped = "%" hex hex
>>>hex = digit | "A" | "B" | "C" | "D" | "E" | "F" |
>>> "a" | "b" | "c" | "d" | "e" | "f"
>>>
>>>
>>>1.3.2) Change definition of setSpecType in the schema to match the definition
>>>from:
>>>
>>> <simpleType name="setSpecType">
>>> <restriction base="string">
>>> <pattern value=
>>> "([A-Za-z0-9_!'$\(\)\+\-\.\*])+(:[A-Za-z0-9_!'$\(\)\+\-\.\*]+)*"/>
>>> </restriction>
>>> </simpleType>
>>>
>>>to:
>>>
>>> <simpleType name="setSpecType">
>>> <restriction base="string">
>>> <pattern value="([A-Za-z0-9\-_\.!~\*'\(\)]|(%[A-Fa-f0-9]{2}))+(:([A-Za-z0-9\-_\.!~\*'\(\)]|(%[A-Fa-f0-9]{2}))+)*"/>
>>> </restriction>
>>> </simpleType>
>>>
>>>
>>>2) Correct protocol document and schema definition for metadataPrefix to
>>>be consistent, and also to match the revised setSpec definition.
>>>
>>>2.1) Motivation
>>>
>>>The protocol document uses the same imprecise wording for metadataPrefix
>>>as it does for setSpec ("any characters that are safe in a query
>>>component of a URI") and the schema does not even follow a reasonable
>>>interpretation of this wording. It seems sensible to use the same
>>>character restrictions in a consistent fashion. This will bring the
>>>protocol document in line with the terms "escaped" and "unreserved" as
>>>used in the URI RFC.
>>>
>>>2.2) Impact
>>>
>>>This change is not expected to impact any known repository. All OAI
>>>software maintainers should, however, review the change and update their
>>>parsing code accordingly.
>>>
>>>2.3) Changes
>>>
>>>2.3.1) Change wording in protocol document
>>>http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm#metadataPrefix
>>>from:
>>>
>>>The metadataPrefix - a string to specify the metadata format in OAI-PMH
>>>requests issued to the repository. metadataPrefix consists of any
>>>characters that are safe in a query component of a URI. metadataPrefix
>>>arguments are used in ListRecords, ListIdentifiers, and GetRecord
>>>requests to retrieve records, or the headers of records that include
>>>metadata in the format specified by the metadataPrefix;
>>>
>>>to:
>>>
>>>The metadataPrefix - a string to specify the metadata format in OAI-PMH
>>>requests issued to the repository. metadataPrefix consists of any valid
>>>URI "unreserved" and "escaped" characters. A metadataPrefix must not
>>>contain URI "reserved" characters. metadataPrefix arguments are used in
>>>ListRecords, ListIdentifiers, and GetRecord requests to retrieve records,
>>>or the headers of records that include metadata in the format specified
>>>by the metadataPrefix;
>>>
>>>2.3.2) Change definition of metadataPrefixType in schema to match the
>>>definition from:
>>>
>>> <simpleType name="metadataPrefixType">
>>> <restriction base="string">
>>> <pattern value="[A-Za-z0-9_!'$\(\)\+\-\.\*]+"/>
>>> </restriction>
>>> </simpleType>
>>>
>>>to:
>>>
>>> <simpleType name="metadataPrefixType">
>>> <restriction base="string">
>>> <pattern value="([A-Za-z0-9\-_\.!~\*'\(\)]|(%[A-Fa-f0-9]{2}))+"/>
>>> </restriction>
>>> </simpleType>
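
For implementers who want to check values against the revised definitions, the following Python sketch restates the two proposed patterns (unreserved characters plus %xx escapes). It is intended to be equivalent to the schema patterns above but is a restructured transcription, and it assumes re.fullmatch behaves like schema pattern matching for these expressions; all names are hypothetical:

```python
import re

UNRESERVED = r"[A-Za-z0-9\-_\.!~\*'\(\)]"   # URI "unreserved" characters
ESCAPED = r"%[A-Fa-f0-9]{2}"                # URI "escaped": % hex hex
ELEMENT = f"(?:{UNRESERVED}|{ESCAPED})+"    # one non-empty element

# setSpec: elements separated by single colons.
SETSPEC = re.compile(f"{ELEMENT}(?::{ELEMENT})*")
# metadataPrefix: a single element, no colons.
METADATA_PREFIX = re.compile(ELEMENT)


def valid_setspec(value: str) -> bool:
    return SETSPEC.fullmatch(value) is not None


def valid_metadata_prefix(value: str) -> bool:
    return METADATA_PREFIX.fullmatch(value) is not None
```

With this sketch, "a%20b:c" is a valid setSpec while "a%2", "a+b" and "a:" are not; "oai_dc" and "marc%2F21" are valid metadataPrefix values while "oai:dc" is not.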
>
>
>
> ----------------------------------------------------------
> Simeon Warner Email: simeon at cs.cornell.edu
> Cornell Information Science Tel: 607-254-8605
> 301 College Ave Fax: 607-255-5196
> Ithaca, NY 14850-4623, USA
>
>
> _______________________________________________
> OAI-implementers mailing list
> List information, archives, preferences and to unsubscribe:
> http://openarchives.org/mailman/listinfo/oai-implementers
>
--
=====================================================================
hussein suleman ~ hussein at cs.uct.ac.za ~ http://www.husseinsspace.com
=====================================================================