Open Archives Initiative Object Reuse and Exchange
DO NOT USE THIS SPECIFICATION, see instead the CURRENT ORE SPECIFICATIONS.
This document was part of an alpha release and has been superseded.
Crawlers or harvesters must discover Resource Maps (ReMs) before the aggregations described by them can be understood. ReMs can be discovered in any number of ways and this document discusses some of the recommended discovery mechanisms. Other discovery mechanisms may evolve over time and vary based on the practices of particular communities. This user guide is one of several documents comprising the OAI-ORE specification and user guide.
1. Introduction
1.1 Notational Conventions
2. Batch Discovery
2.1 ReMs in OAI-PMH
2.2 ReMs in SiteMaps
2.3 ReMs in Syndication Feeds
2.4 Combining OAI-PMH with Other Approaches
3. Resource Embedding
3.1 HTML Link Element
3.2 HTML A and IMG Elements
3.3 Non-HTML Resources
3.4 Showing ReMs in HTML Pages
4. Response Embedding
4.1 HTTP Link Header
5. Methods Not Recommended for ReM Discovery
5.1 ReMs in Simple Files
5.2 URI Conflation
6. References
A. Acknowledgments
B. Change Log
Discovery of Resource Maps (ReMs) is a precondition of their use. There is no single, best method for discovering ReMs. This document covers a variety of suggested ReM discovery mechanisms, grouped into three categories: Batch Discovery, Resource Embedding, and Response Embedding. Examples are explored for each category. Additional categories and examples are expected to evolve over time.
The keywords "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [IETF RFC 2119].
Batch discovery exists so agents can discover ReMs en masse. Note that ReMs are not limited to describing aggregations on the server where the ReMs reside. Although ReMs can be serialized in a number of formats, the initial serialization is in the Atom Syndication Format [RFC4287]. Thus, each section provides a table that maps the concepts of identification and datestamps between the transport protocol or format and the Resource Map Profile of Atom [ReMProfileofAtom].
It is possible to define a new metadataPrefix in the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) [OAI-PMH] that contains ReMs. For example, this OAI-PMH request:
http://www.foo.edu/oai?verb=GetRecord&identifier=oai:foo.edu:object1&metadataPrefix=oai_rem
Would yield this response:
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
                             http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2007-02-08T08:55:46Z</responseDate>
  <request verb="GetRecord" identifier="oai:foo.edu:object1"
           metadataPrefix="oai_rem">http://foo.edu/oai2</request>
  <GetRecord>
    <record>
      <header>
        <identifier>oai:foo.edu:object1</identifier>
        <datestamp>2007-01-06</datestamp>
      </header>
      <metadata>
        <!-- Insert ReM here -->
      </metadata>
    </record>
  </GetRecord>
</OAI-PMH>
Concept | Mapping |
---|---|
Identification | OAI-PMH record/header/identifier MUST NOT equal either ReM Atom /feed/id or /feed/link[@rel="self"]/@href |
Datestamp | OAI-PMH record/header/datestamp MUST be equal to ReM Atom /feed/updated |
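The mapping in the table above can be checked mechanically. The following is a minimal sketch, not part of this specification: the function names `parse_utc` and `check_pmh_mapping` are hypothetical, and the comparison assumes that a day-granularity OAI-PMH datestamp corresponds to midnight UTC of the Atom timestamp.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

OAI = "{http://www.openarchives.org/OAI/2.0/}"
ATOM = "{http://www.w3.org/2005/Atom}"


def parse_utc(stamp):
    """Parse YYYY-MM-DD or YYYY-MM-DDThh:mm:ssZ as a UTC datetime.

    Assumption: a day-granularity datestamp is read as midnight UTC.
    Other timestamp forms (offsets, fractional seconds) are not handled.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ" if "T" in stamp else "%Y-%m-%d"
    return datetime.strptime(stamp, fmt).replace(tzinfo=timezone.utc)


def check_pmh_mapping(pmh_xml, rem_xml):
    """Return a list of violations of the identification/datestamp mapping."""
    header = ET.fromstring(pmh_xml).find(
        f"{OAI}GetRecord/{OAI}record/{OAI}header")
    rem = ET.fromstring(rem_xml)

    # URIs that the OAI-PMH identifier must differ from.
    rem_uris = {rem.findtext(f"{ATOM}id")}
    rem_uris.update(
        link.get("href")
        for link in rem.findall(f"{ATOM}link")
        if link.get("rel") == "self"
    )

    problems = []
    if header.findtext(f"{OAI}identifier") in rem_uris:
        problems.append("identifier equals ReM id or self link")
    pmh_stamp = parse_utc(header.findtext(f"{OAI}datestamp"))
    rem_stamp = parse_utc(rem.findtext(f"{ATOM}updated"))
    if pmh_stamp != rem_stamp:
        problems.append("datestamp differs from ReM updated")
    return problems
```

An empty list means both rows of the table are satisfied for the given record and ReM.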
It is possible to construct a SiteMap [SiteMap] that consists of just ReMs, or possibly includes ReMs in its list of regular resources. For example, dereferencing this SiteMap URI:
http://www.foo.edu/sitemap-rem.xml
Would yield this response:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.foo.edu/objects/object1.atom</loc>
    <lastmod>2007-01-06</lastmod>
  </url>
  <url>
    <loc>http://www.foo.edu/objects/object2.atom</loc>
    <lastmod>2007-08-11</lastmod>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>http://www.foo.edu/objects/object3.atom</loc>
    <lastmod>2007-03-15T18:30:02Z</lastmod>
    <priority>0.3</priority>
  </url>
  ...
</urlset>
Note that SiteMaps have a URI path hierarchy limitation on the resources they can describe. For example, this SiteMap:
http://www.foo.edu/a/b/sitemap-rem.xml
Can list the ReMs:
http://www.foo.edu/a/b/bar2.atom
and
http://www.foo.edu/a/b/c/bar3.atom
But not:
http://www.foo.edu/bar1.atom
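The path hierarchy rule illustrated above can be expressed as a simple prefix test. This is a minimal sketch under the assumption that a SiteMap may only list URIs on the same scheme and host, at or below its own directory; the function name `sitemap_may_list` is hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def sitemap_may_list(sitemap_url, rem_url):
    """Return True if rem_url lies at or below the directory of sitemap_url."""
    sitemap, rem = urlsplit(sitemap_url), urlsplit(rem_url)
    # Directory that contains the SiteMap file, always ending in "/".
    base_dir = posixpath.dirname(sitemap.path).rstrip("/") + "/"
    same_origin = (sitemap.scheme, sitemap.netloc) == (rem.scheme, rem.netloc)
    return same_origin and rem.path.startswith(base_dir)
```

Applied to the URIs above, the function accepts bar2.atom and bar3.atom and rejects bar1.atom.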
Concept | Mapping |
---|---|
Identification | SiteMap /urlset/url/loc MUST equal /feed/link[@rel="self"]/@href for corresponding ReM, but MUST NOT equal /feed/id |
Datestamp | When present, SiteMap /urlset/url/lastmod MUST be equal to ReM Atom /feed/updated |
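A harvester could collect the ReM URIs and datestamps from a SiteMap along the following lines. This is only an illustrative sketch; the function name `rems_in_sitemap` is hypothetical, and fetching the SiteMap over HTTP is left out.

```python
import xml.etree.ElementTree as ET

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def rems_in_sitemap(sitemap_xml):
    """Return (loc, lastmod) pairs; lastmod is None when the element is absent."""
    root = ET.fromstring(sitemap_xml)
    return [
        (url.findtext(f"{SM}loc"), url.findtext(f"{SM}lastmod"))
        for url in root.findall(f"{SM}url")
    ]
```

Each returned loc is the URI to dereference for the ReM, and each lastmod, when present, is the value to compare with the ReM's /feed/updated.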
Even though the preliminary serialization of ReMs is in the Atom Syndication Format, there is no reason preventing the use of syndication formats such as Atom or RSS [RSS] for ReM discovery. However, care must be taken to separate conceptually the Resource Map from the syndication file listing the Resource Maps. In particular, the id of an Atom entry listing the URI of a Resource Map MUST be neither the URI of the Resource Map nor the Atom feed id of the Resource Map. Furthermore, an explicit difference must be made between the Atom feed used for discovery and the Atom feed that is the ReM. For example, this Atom Feed:
http://www.foo.edu/all-rems.atom
When dereferenced would yield:
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>ReMs at www.foo.edu</title>
  <link href="http://www.foo.edu/" />
  <link href="http://www.foo.edu/all-rems.atom" rel="self"/>
  <updated>2007-08-15T18:30:02Z</updated>
  <author>
    <name>John Doe</name>
    <email>johndoe@foo.edu</email>
  </author>
  <id>urn:uuid:60a76c80-d399-11d9-b91C-0003939e0af6</id>
  <entry>
    <title>ReM For Object1</title>
    <link href="http://www.foo.org/objects/object1.atom"/>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <updated>2007-01-06T00:00:00Z</updated>
  </entry>
  <entry>
    <title>ReM For Object2</title>
    <link href="http://www.foo.org/objects/object2.atom"/>
    <id>urn:uuid:9a2cc699-ccba-9e8b-132e-91da394e9a5c</id>
    <updated>2007-08-11T00:00:00Z</updated>
  </entry>
  <entry>
    <title>ReM For Object3</title>
    <link href="http://www.foo.org/objects/object3.atom"/>
    <id>urn:uuid:5225c895-cab8-8ebb-baaa-90da9d4efa6b</id>
    <updated>2007-03-15T18:30:02Z</updated>
  </entry>
</feed>
Concept | Mapping |
---|---|
Identification | Syndication Atom /feed/entry/id MUST NOT equal ReM Atom /feed/id; Syndication Atom /feed/entry/link/@href MUST equal ReM Atom /feed/link[@rel="self"]/@href |
Datestamp | Syndication Atom /feed/entry/updated MUST equal ReM Atom /feed/updated |
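The separation between the discovery feed and the ReMs it lists can be illustrated with a short sketch. The function name `rems_in_discovery_feed` is hypothetical, and the code assumes that the first link element of each entry carries the ReM URI, as in the example feed above.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"


def rems_in_discovery_feed(feed_xml):
    """Return (ReM URI, updated) pairs for the entries of a discovery feed."""
    feed = ET.fromstring(feed_xml)
    found = []
    for entry in feed.findall(f"{ATOM}entry"):
        link = entry.find(f"{ATOM}link")  # assumption: first link is the ReM
        href = link.get("href") if link is not None else None
        entry_id = entry.findtext(f"{ATOM}id")
        # The entry id must not be the URI of the Resource Map itself.
        if href is None or entry_id == href:
            continue
        found.append((href, entry.findtext(f"{ATOM}updated")))
    return found
```

Checking that the entry id also differs from the ReM's own /feed/id requires dereferencing the ReM, which this sketch does not do.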
The same ReMs could be exposed via RSS 2.0. For example, this RSS feed:
http://www.foo.edu/all-rems.rss
When dereferenced would yield:
<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>ReMs at www.foo.edu</title>
    <link>http://www.foo.edu/</link>
    <description>All of the Resource Maps for resources at www.foo.edu</description>
    <item>
      <title>ReM for Object 1</title>
      <link>http://www.foo.org/objects/object1.atom</link>
      <description>ReM for Object 1</description>
      <pubDate>Sat, 06 Jan 2007 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>ReM for Object 2</title>
      <link>http://www.foo.org/objects/object2.atom</link>
      <description>ReM for Object 2</description>
      <pubDate>Sat, 11 Aug 2007 00:00:00 GMT</pubDate>
    </item>
    <item>
      <title>ReM for Object 3</title>
      <link>http://www.foo.org/objects/object3.atom</link>
      <description>ReM for Object 3</description>
      <pubDate>Thu, 15 Mar 2007 18:30:02 GMT</pubDate>
    </item>
  </channel>
</rss>
Concept | Mapping |
---|---|
Identification | RSS 2.0 /rss/channel/item/link MUST NOT equal ReM Atom /feed/id; RSS 2.0 /rss/channel/item/link MUST equal ReM Atom /feed/link[@rel="self"]/@href |
Datestamp | RSS 2.0 /rss/channel/item/pubDate MUST equal ReM Atom /feed/updated (after conversion from RFC-822 format to ISO 8601 format) |
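The date conversion mentioned in the table can be sketched with the Python standard library. The function name `pubdate_to_atom` is hypothetical, and treating a pubDate without usable zone information as UTC is an assumption of this sketch.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def pubdate_to_atom(pub_date):
    """Convert an RFC-822 style pubDate into an ISO 8601 UTC timestamp."""
    parsed = parsedate_to_datetime(pub_date)
    if parsed.tzinfo is None:
        # Assumption: no usable zone information means UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```

For instance, `pubdate_to_atom("Sat, 06 Jan 2007 00:00:00 GMT")` gives `2007-01-06T00:00:00Z`, which can then be compared with the ReM's /feed/updated value.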
Resource Map Documents [ORE Model] can be included as metadata records in an OAI-PMH response. However, the OAI-PMH constructs must be removed before the Resource Map Document can be used as such. This has implications with respect to embedding the Resource Map in a resource (discussed below). OAI-PMH repositories issue OAI-PMH responses of MIME type text/xml or application/xml. These OAI-PMH responses must be processed into ReM responses (currently in Atom Syndication Format and of MIME type application/atom+xml). We envision gateway services that perform this processing, taking an OAI-PMH GetRecord request as an argument, such as:
http://some.gateway.org/pmh2ore?=http://foo.edu/oai2?verb=GetRecord&metadataPrefix=oai_rem&identifier=oai:foo.edu:object1
OCLC has already developed one such service. It takes an OAI-PMH GetRecord URI as an argument and strips out the OAI-PMH elements, leaving only the child element of the OAI-PMH <metadata> element. For example, this OAI-PMH GetRecord request:
http://alcme.oclc.org/oaicat/OAIHandler?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:oaicat.oclc.org:2002/ocm11992160
When submitted as an argument to the OCLC service, produces just the <oai_dc> element:
http://purl.org/OAIUtil?getRecordURL=http://alcme.oclc.org/oaicat/OAIHandler?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:oaicat.oclc.org:2002/ocm11992160
The values of the OAI-PMH <responseDate> and <request> elements are retained as HTTP response headers. The above example could also be combined with syndication formats. For example, if a repository has its ReMs in OAI-PMH, it could export the ReMs in an Atom Feed for applications that are not OAI-PMH aware:
<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>ReMs at www.foo.edu</title>
  <link href="http://www.foo.edu/" />
  <link href="http://www.foo.edu/all-rems.atom" rel="self"/>
  <updated>2007-08-15T18:30:02Z</updated>
  <author>
    <name>John Doe</name>
    <email>johndoe@foo.edu</email>
  </author>
  <id>urn:uuid:60a76c80-d399-11d9-b91C-0003939e0af6</id>
  <entry>
    <title>ReM For Object1</title>
    <link href="http://purl.org/OAIUtil?getRecordURL=http://foo.edu/oai2?verb=GetRecord&amp;metadataPrefix=oai_rem&amp;identifier=oai:foo.edu:object1"/>
    <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
    <updated>2007-01-06T00:00:00Z</updated>
  </entry>
  <entry>
    <title>ReM For Object2</title>
    <link href="http://purl.org/OAIUtil?getRecordURL=http://foo.edu/oai2?verb=GetRecord&amp;metadataPrefix=oai_rem&amp;identifier=oai:foo.edu:object2"/>
    <id>urn:uuid:9a2cc699-ccba-9e8b-132e-91da394e9a5c</id>
    <updated>2007-08-11T00:00:00Z</updated>
  </entry>
  <entry>
    <title>ReM For Object3</title>
    <link href="http://purl.org/OAIUtil?getRecordURL=http://foo.edu/oai2?verb=GetRecord&amp;metadataPrefix=oai_rem&amp;identifier=oai:foo.edu:object3"/>
    <id>urn:uuid:5225c895-cab8-8ebb-baaa-90da9d4efa6b</id>
    <updated>2007-03-15T18:30:02Z</updated>
  </entry>
</feed>
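The core step such a gateway performs, removing the OAI-PMH constructs and keeping only the child of the metadata element, might look like the sketch below. It is not the OCLC implementation; the function name `strip_pmh_envelope` and the shape of the return value are hypothetical.

```python
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def strip_pmh_envelope(pmh_xml):
    """Return (payload XML, responseDate, request) from a GetRecord response."""
    root = ET.fromstring(pmh_xml)
    metadata = root.find(f"{OAI}GetRecord/{OAI}record/{OAI}metadata")
    if metadata is None or len(metadata) == 0:
        raise ValueError("no metadata payload in OAI-PMH response")
    payload = metadata[0]   # the single child element, e.g. the ReM
    payload.tail = None     # drop whitespace that followed the element
    return (
        ET.tostring(payload, encoding="unicode"),
        root.findtext(f"{OAI}responseDate"),
        root.findtext(f"{OAI}request"),
    )
```

The serialized payload may use generated namespace prefixes, which does not change its meaning. The two remaining values are the ones that, as described above, would be carried in HTTP response headers.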
A common scenario for ReM discovery is for a human readable page in an aggregation to link to its corresponding ReM. This is most commonly accomplished using the HTML link element [HTML]. Alternatively, HTML A and IMG elements may point to ReMs, or the URI of the ReM can be exposed as an opaque string for human agents to paste into ORE-aware utilities.
We also envision the future availability of browser utilities such as Mozilla plugins that detect the presence of corresponding ReMs when embedded in resources and help guide the user in the (re)use of the aggregated resources.
The HTML link element can be used to direct agents from the aggregated HTML file to a corresponding ReM that describes the aggregation of which the HTML file is a part. While this is a common case, there are actually four different scenarios regarding members of an aggregation and knowledge about their corresponding ReMs:
Note that the above scenarios are relative to a particular ReM. It is possible for aggregated resources to simultaneously have full knowledge about one ReM (typically authored by the same creators of the resources) and have zero knowledge about third party ReMs that describe aggregations of the same resources. Below is an example of how an HTML page could link to its corresponding ReM. Assuming this HTML page and its associated JPEGs form the aggregation, and the JPEGs do not use HTTP headers to link to the corresponding ReM (see below), this is an example of a limited knowledge scenario, since only this HTML page links to the ReM.
<html>
  <head>
    <title>Hello World.</title>
    <link href="http://example.net/hw.atom" type="application/atom+xml" rel="resourcemap" >
  </head>
  <body>
    <img src="hello.jpeg">
    <img src="world.jpeg">
  </body>
</html>
In the above example, the HTML page links only to a single ReM. It could link to multiple ReMs, in which case it is the responsibility of the agent to differentiate the aggregations. Next we consider an example where an HTML page is aware that it is aggregated, but does not know the location of its ReM. Instead, it links to a page that does know the location of the ReM. There could be any number of these redirections. It is up to the author or maintainer of the resources and ReMs to choose which scenario best fits their usage profile.
<html>
  <head>
    <title>Chapter Twelve.</title>
    <link href="http://mybook.com/toc.html" type="text/html" rel="indirectresourcemap" >
  </head>
  <body>
    Welcome to chapter twelve...
  </body>
</html>
Since the HTML specification defines the values of rel attributes to be CDATA, we can use values of "resourcemap" and "indirectresourcemap" and still have valid XHTML.
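An ORE-aware agent could look for these link elements as sketched below. This is a minimal illustration using the Python standard library; the class name `ResourceMapLinkFinder` is hypothetical, and following an "indirectresourcemap" link to the next page is left to the caller.

```python
from html.parser import HTMLParser


class ResourceMapLinkFinder(HTMLParser):
    """Collect hrefs of link elements whose rel names a resource map."""

    def __init__(self):
        super().__init__()
        self.direct = []     # rel="resourcemap"
        self.indirect = []   # rel="indirectresourcemap"

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        href = attributes.get("href")
        # rel is a space-separated list of link types.
        rels = (attributes.get("rel") or "").lower().split()
        if not href:
            return
        if "resourcemap" in rels:
            self.direct.append(href)
        if "indirectresourcemap" in rels:
            self.indirect.append(href)
```

After calling `feed()` with the markup of a page, `direct` holds the URIs of ReMs and `indirect` holds the URIs of pages that are expected to know the location of a ReM.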
A similar but different scenario is when it is desirable to acknowledge relationships to other Aggregations [ORE Model]. In this scenario, we wish to cite not the ReM that describes the aggregation containing the current HTML page, but rather the ReM that describes the aggregation where the resource we are linking to (with the A or IMG elements) was originally discovered. This is accomplished using a separate attribute for the A or IMG elements. The example below shows how an HTML page cites the ReMs used to discover a PDF document about frogs and toads, as well as example images of each.
<html>
  ...
  Here is a helpful reference for distinguishing
  <a href="http://example.org/pics/f-t.pdf"
     resourcemap="http://example.org/amphibians.atom">frogs vs. toads</a>.
  <p>
  Here is a frog
  <img src="http://weluvfrogs.org/imgs/frog12.jpeg"
       resourcemap="http://frogs.org/frogs.atom">
  and here is a toad
  <img src="http://toadsrule.org/toad.gif"
       resourcemap="http://toadsrule.org/toads.atom">.
  ...
</html>
This approach uses the non-standard attribute resourcemap. This can be used to provide hints to the ORE-aware user-agent, but is not guaranteed to be recognized, and is not valid XHTML. The only way to unambiguously link to other Aggregations or ReMs is to create a new ReM. See [ORE User Guide Resource Map] for how to do this.
Another approach to specifying the appropriate Resource Map without introducing a non-standard HTML attribute would be to place the Resource Map URI in an existing HTML attribute. For example, the rel attribute for the A element takes a space-separated list of values in which we could place the Resource Map, but the IMG element does not share this attribute. Below is an example of how the Resource Map URI could be placed in the rel attribute, with the IMG elements placed inside an A element (with no href attribute).
<html>
  ...
  Here is a helpful reference for distinguishing
  <a href="http://example.org/pics/f-t.pdf"
     rel="resourcemap=http://example.org/amphibians.atom">frogs vs. toads</a>.
  <p>
  Here is a frog
  <a rel="resourcemap=http://frogs.org/frogs.atom">
    <img src="http://weluvfrogs.org/imgs/frog12.jpeg">
  </a>
  and here is a toad
  <a rel="resourcemap=http://toadsrule.org/toads.atom">
    <img src="http://toadsrule.org/toad.gif">
  </a>.
  ...
</html>
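Both conventions, the non-standard resourcemap attribute and the "resourcemap=URI" token in rel, could be read by an agent along the following lines. This is a hedged sketch only: the class name `CitedResourceMapFinder` is hypothetical, and pairing an IMG with the ReM of its enclosing A element is an interpretation of the example above rather than a defined rule.

```python
from html.parser import HTMLParser

PREFIX = "resourcemap="


class CitedResourceMapFinder(HTMLParser):
    """Collect (resource URI, ReM URI) pairs cited by A and IMG elements."""

    def __init__(self):
        super().__init__()
        self.citations = []
        self._enclosing = []  # ReM URI (or None) for each open A element

    @staticmethod
    def _rem_from_rel(rel):
        # Look for a token of the form resourcemap=URI in the rel list.
        for token in (rel or "").split():
            if token.startswith(PREFIX):
                return token[len(PREFIX):]
        return None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a":
            rem = attributes.get("resourcemap") or self._rem_from_rel(
                attributes.get("rel"))
            self._enclosing.append(rem)
            if rem and attributes.get("href"):
                self.citations.append((attributes["href"], rem))
        elif tag == "img":
            inherited = next((r for r in reversed(self._enclosing) if r), None)
            rem = attributes.get("resourcemap") or inherited
            if rem and attributes.get("src"):
                self.citations.append((attributes["src"], rem))

    def handle_endtag(self, tag):
        if tag == "a" and self._enclosing:
            self._enclosing.pop()
```

After `feed()` is called with a page, `citations` lists each cited resource together with the ReM through which it was discovered.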
It may be possible to embed links to ReMs in non-HTML resources, such as PDF or images, but these methods are considered too preliminary to discuss at this time.
We propose exposing ReM URIs as opaque strings to facilitate future usage scenarios in which people copy and paste ReM URIs into applications such as blogs, forums, or repository systems. This is commonly done on sites such as YouTube and Photobucket, and in classified listings, where strings are provided to the user to facilitate reuse (i.e., copy-and-paste) of the components in email, instant messaging systems, forums, and HTML pages. An arXiv pre-print is used as an example of how this could look.
If we wish to have resources link to their corresponding ReMs, but not all of the aggregated resources are HTML, and thus cannot use the HTML link element, we can embed the link of the ReM in the response. For the moment, this means putting the URI of the ReM in an HTTP response header.
The concept of a link HTTP response header existed in earlier versions of the HTTP protocol [RFC2068], but the lack of a compelling use case probably led to it being removed from the current HTTP specification. A recent Internet Draft proposes a method for converting HTML link element semantics into HTTP Link response headers [HTTP Header Linking]. Although this draft has yet to be promoted to an RFC, the approach is straightforward. If we wanted to promote the hello world example above from limited knowledge to full knowledge, the JPEGs could link to their corresponding ReM with the HTTP link response header. The example below shows an HTTP request and response with the ReM in a link header.
(request)
HEAD http://www.example.net/hello.jpeg HTTP/1.1
Host: www.example.net
Connection: close

(response)
HTTP/1.1 200 OK
Date: Sat, 26 May 2007 22:43:10 GMT
Server: Apache/2.2.0
Last-Modified: Sat, 26 May 2007 19:32:04 GMT
ETag: "c3596-816-92123500"
Accept-Ranges: bytes
Content-Length: 2070
Link: <http://example.net/hw.atom>; type="application/atom+xml"; rel="resourcemap"
Content-Type: image/jpeg
Connection: close
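A client that has received such a response could extract the ReM URI from the Link header value as sketched below. The parsing is deliberately simplified (it does not handle commas or angle brackets inside quoted parameter values), and the function name `resource_maps_from_link_header` is hypothetical.

```python
import re


def resource_maps_from_link_header(value):
    """Return the URIs in a Link header value whose rel includes 'resourcemap'."""
    found = []
    # Each link-value has the form: <URI>; param="value"; param=value
    for uri, params in re.findall(r'<([^>]*)>([^<]*)', value):
        match = re.search(r';\s*rel\s*=\s*"?([^";,]*)"?', params,
                          flags=re.IGNORECASE)
        if match and "resourcemap" in match.group(1).lower().split():
            found.append(uri)
    return found
```

Given the header value from the response above, the function returns a list containing only http://example.net/hw.atom.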
It is possible to create an HTML page consisting of ReMs and link it from a web site for robots to discover, such as:
<a href="http://www.foo.edu/objects/object1.atom">ReM 1</a>
<a href="http://www.foo.edu/objects/object2.atom">ReM 2</a>
<a href="http://www.foo.edu/objects/object3.atom">ReM 3</a>
...
While this would not be incorrect and would result in exposing ReMs to web crawlers, it could lead to confusion if human agents were to accidentally load this page. Attempts to hide such a page from human agents and present it only to crawlers would likely be detected as link spam.
The Data Model document [ORE Model] explicitly prohibits a URI of a ReM (URI-R) from ever returning anything other than a ReM. This allows multiple representations to be associated with URI-R, such as using content negotiation to return ReMs in different languages, character sets, or compression encodings. But it does not allow URI-R to return a human readable "splash page", either by HTTP content negotiation or redirection. For example, clients MUST NOT merge with content negotiation the following URI pair that would correspond to a ReM and a "splash page" for an object:
(ReM) http://www.foo.edu/objects/object1.atom
(Splash Page) http://www.foo.edu/objects/object1.html
(Conflated URI) http://www.foo.edu/objects/object1
Similarly, clients MUST NOT refer to the ReM using the conflated URI constructed along the lines of HTTP 303 redirection [DFKI TM-07-01]:
(ReM) http://www.foo.edu/data/objects/object1
(Splash Page) http://www.foo.edu/page/objects/object1
(Conflated URI) http://www.foo.edu/resource/objects/object1
The purpose of these restrictions is to allow URI-R to be an unambiguous identifier for the ReM and not be conflated with identifiers for other resources (especially resources that are likely to be a member of the aggregation described by the ReM, such as human readable splash pages).
Note that these restrictions do not prevent a ReM from being used as the basis or "ingredient" of a splash page. Servers MAY choose to include stylesheets with ReMs to make them suitable for use by human agents. Although this is an option, clients should note that there is no requirement for ReMs and splash pages to be transformable from one to another; a ReM may not have the same URIs as a splash page and vice versa.
This document is the work of the Open Archives Initiative. Funding for Open Archives Initiative Object Reuse and Exchange is provided by the Andrew W. Mellon Foundation, Microsoft, and the National Science Foundation. Additional support is provided by the Coalition for Networked Information.
This document is based on meetings of the OAI-ORE Technical Committee (ORE-TC), with participation from the OAI-ORE Liaison Group (ORE-LG). Members of the ORE-TC are: Chris Bizer (Freie Universität Berlin), Les Carr (University of Southampton), Tim DiLauro (Johns Hopkins University), Leigh Dodds (Ingenta), David Fulker (UCAR), Tony Hammond (Nature Publishing Group), Pete Johnston (Eduserv Foundation), Richard Jones (Imperial College), Peter Murray (OhioLINK), Michael Nelson (Old Dominion University), Ray Plante (NCSA and National Virtual Observatory), Rob Sanderson (University of Liverpool), Simeon Warner (Cornell University), and Jeff Young (OCLC). Members of ORE-LG are: Leonardo Candela (DRIVER), Tim Cole (DLF Aquifer and UIUC Library), Julie Allinson (JISC), Jane Hunter (DEST), Savas Parastatidis (Microsoft), Sandy Payette (Fedora Commons), Thomas Place (DARE and University of Tilburg), Andy Powell (DCMI), and Robert Tansley (Google, Inc. and DSpace)
We also acknowledge comments from the OAI-ORE Advisory Committee (ORE-AC).
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported License.