DO NOT USE, SEE CURRENT ResourceSync SPECIFICATIONS

ResourceSync Framework Specification - Beta Draft

11 September 2013

This version:
http://www.openarchives.org/rs/0.9.1/resourcesync
Latest version:
http://www.openarchives.org/rs/resourcesync
Previous version:
http://www.openarchives.org/rs/0.9/resourcesync
Editors:
Martin Klein, Robert Sanderson, Herbert Van de Sompel - Los Alamos National Laboratory
Simeon Warner - Cornell University
Graham Klyne - University of Oxford
Bernhard Haslhofer - University of Vienna
Michael Nelson - Old Dominion University
Carl Lagoze - University of Michigan

Abstract

This ResourceSync specification describes a synchronization framework for the web consisting of various capabilities that allow third party systems to remain synchronized with a server's evolving resources. The capabilities can be combined in a modular manner to meet local or community requirements. This specification also describes how a server can advertise the synchronization capabilities it supports and how third party systems can discover this information. The specification repurposes the document formats defined by the Sitemap protocol and introduces extensions for them.

This specification is one of several documents comprising the ResourceSync Framework Specifications. This specification focuses on pull-based methods.

Status of this Document

This specification is a beta draft released for public comment. Feedback is most welcome on the ResourceSync Google Group.

Table of Contents

1. Introduction
    1.1 Motivating Examples
    1.2 Terminology
    1.3 Notational Conventions
2. Walkthrough
3. Synchronization Processes
    3.1 Source Perspective
    3.2 Destination Perspective
    3.3 Summary
4. Framework Organization
    4.1 Structure
    4.2 Navigation
    4.3 Discovery
        4.3.1 ResourceSync Well-Known URI
        4.3.2 Links
        4.3.3 robots.txt
5. Sitemap Document Formats
6. Describing the Source
7. Advertising Capabilities
8. Describing Resources
    8.1 Resource List
    8.2 Resource List Index
9. Packaging Resources
    9.1 Resource Dump
        9.1.1 Resource Dump Manifest
10. Describing Changes
    10.1 Change List
    10.2 Change List Index
11. Packaging Changes
    11.1 Change Dump
        11.1.1 Change Dump Manifest
12. Linking to Related Resources
    12.1 Mirrored Content
    12.2 Alternate Representations
    12.3 Patching Content
    12.4 Resources and Metadata about Resources
    12.5 Prior Versions of Resources
    12.6 Collection Membership
    12.7 Republishing Resources
13. References

Appendices

A. Time Attribute Requirements
B. Acknowledgements
C. Change Log

1. Introduction

The web is highly dynamic, with resources continuously being created, updated, and deleted. As a result, using resources from a remote server involves the challenge of remaining in step with its changing content. In many cases, there is no need to reflect a server's evolving content perfectly, and therefore well-established resource discovery techniques, such as crawling, suffice as an updating mechanism. However, there are significant use cases that require low latency and high accuracy in reflecting a remote server's changing content. These requirements have typically been addressed by ad hoc technical approaches implemented within a small group of collaborating systems. There have been no widely adopted, web-based approaches.

This ResourceSync specification introduces a range of easy-to-implement capabilities that a server may support in order to enable remote systems to remain more tightly in step with its evolving resources. It also describes how a server can advertise the capabilities it supports. Remote systems can inspect this information to determine how best to remain aligned with the evolving data.

Each capability provides a different synchronization functionality, such as a list of the server's resources or its recently changed resources, including what the nature of the change was: create, update, or delete. All capabilities are implemented on the basis of the document formats introduced by the Sitemap protocol. Capabilities can be combined to achieve varying levels of functionality and hence meet different local or community requirements. This modularity provides flexibility and makes ResourceSync suitable for a broad range of use cases.

This document is structured as follows:

1.1. Motivating Examples

Many projects and services have synchronization needs and have implemented ad hoc solutions. ResourceSync provides a standard synchronization method that will reduce implementation effort and facilitate easier reuse of resources. This section describes motivating examples with differing needs and complexities.

Consider first the case of a website for a small museum collection. The website may contain just a few dozen static web pages. The maintainer can create a Resource List of these web pages and expose it to services that leverage ResourceSync.

When building services over Linked Data it is often desirable to maintain a local copy of data for improved access and availability. Harvesting can be enabled by publishing a Resource List for the dataset. In many cases resource representations exposed as Linked Data are small and so retrieving them via individual HTTP GET requests is slow because of the large number of round-trips for a small amount of content. Publishing a Resource Dump that points to content packaged and described in ZIP files makes this more efficient for the client and less burdensome for the server. Continued synchronization is enabled by recurrently publishing an up-to-date Resource List or Resource Dump, or, more efficiently, by publishing a Change List that provides information about resource changes only.

The arXiv.org collection of scientific articles propagates resource changes to a set of mirror sites and interacting services on a daily basis. As of July 2013 the collection contains about 2.6 million resources and there are about 1,600 changes (creates, updates) per day. The mirroring system, in operation since 1994, uses HTTP with custom change descriptions, and occasionally rsync to verify the copies and to cope with any errors in the incremental updates. The approach assumes a tight connection between arXiv.org and its mirrors. It would be desirable to have a solution that allows any third party system to accurately synchronize with arXiv.org using commodity software. arXiv.org could publish both metadata records and full-text content as separate web resources, each with its own URI. Using ResourceSync capabilities including Resource Lists, Resource Dumps, Change Lists, and Change Dumps, both mirrors and new parties could remain accurately in sync with the collection. This would extend the openly available metadata sharing capabilities provided by arXiv.org, currently implemented via OAI-PMH, to full-text sharing in a web-friendly fashion.

1.2. Terminology

The following terms are introduced and are used throughout the ResourceSync Framework Specifications:

1.3. Notational Conventions

This specification uses the terms "resource", "representation", "request", "response", "content negotiation", "client", and "server" as described in [Architecture of the World Wide Web].

Throughout this document, the following namespace prefix bindings are used:

Prefix    Namespace URI                                  Description
(none)    http://www.sitemaps.org/schemas/sitemap/0.9    Sitemap XML elements defined in the Sitemap protocol
rs        http://www.openarchives.org/rs/terms/          Namespace for elements introduced in this specification

Table 1.1: Namespace prefix bindings used in this document

2. Walkthrough

Let's assume a Source, http://example.com/, that exposes changing content that others would like to remain synchronized with. A first step towards making this easy for Destinations is for the Source to publish a Resource List that conveys the URIs of resources available for synchronization. This Resource List is expressed as a Sitemap. As shown in Example 2.1, the Source conveys the URI of each resource as the value of the <loc> child element of a <url> element. Note the <rs:md> child element of the <urlset> root element, which expresses that the Sitemap implements ResourceSync's Resource List capability. It also conveys that the Resource List reflects the state of the Source's resources at the datetime provided in the at attribute. This datetime allows a Destination to quickly determine whether it has previously processed this specific Resource List.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist"
         at="2013-01-03T09:00:00Z"/>
  <url>
      <loc>http://example.com/res1</loc>
  </url>
  <url>
      <loc>http://example.com/res2</loc>
  </url>
</urlset>

Example 2.1: A Resource List
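
As a non-normative illustration of how a Destination might read such a document, the following Python sketch parses Example 2.1 with the standard library and extracts the capability, the at datetime, and the listed URIs. The function name read_resource_list and the local prefix label "sm" are assumptions made for this sketch and are not defined by the specification.

```python
import xml.etree.ElementTree as ET

# Namespace URIs from Table 1.1; "sm" is only a local label chosen for this sketch.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# The document shown in Example 2.1.
RESOURCE_LIST = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist"
         at="2013-01-03T09:00:00Z"/>
  <url>
      <loc>http://example.com/res1</loc>
  </url>
  <url>
      <loc>http://example.com/res2</loc>
  </url>
</urlset>
"""


def read_resource_list(xml_text):
    """Return (capability, at, [resource URIs]) for a <urlset> document."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    # find() without a path prefix only matches direct children of <urlset>.
    md = root.find("rs:md", NS)
    uris = [url.findtext("sm:loc", namespaces=NS)
            for url in root.findall("sm:url", NS)]
    return md.get("capability"), md.get("at"), uris


capability, at, uris = read_resource_list(RESOURCE_LIST)
print(capability, at, uris)
```

A Destination could store the returned at value and skip a Resource List whose at value it has already processed.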

The Source can provide additional information in the Resource List to help the Destination optimize the process of collecting content and verifying its accuracy. For example, when the Source expresses the datetime of the most recent modification for a resource, a Destination can determine whether or not it already holds the current version, minimizing the number of HTTP requests it needs to issue in order to remain up-to-date. Example 2.2 shows this information conveyed using Sitemap's <lastmod> element. When the Source also conveys a hash for a specific bitstream, a Destination can verify whether the process of obtaining it was successful. The example shows this information conveyed using the hash attribute on the <rs:md> element. In addition, the Source can provide links to related resources using the <rs:ln> element. The example shows a link to a mirror copy of the second listed resource, indicating that the Source would prefer a Destination to obtain the resource from it.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist"
         at="2013-01-03T09:00:00Z"/>
  <url>
      <loc>http://example.com/res1</loc>
      <lastmod>2013-01-02T13:00:00Z</lastmod>
      <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"/>
  </url>
  <url>
      <loc>http://example.com/res2</loc>
      <lastmod>2013-01-02T14:00:00Z</lastmod>
      <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e"/>
      <rs:ln rel="duplicate"
             href="http://mirror.example.com/res2"/>
  </url>
</urlset>

Example 2.2: A Resource List with additional information
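
The two optimizations described above can be sketched as follows. This is a non-normative Python illustration under two assumptions: datetimes use the UTC "Z" form shown in the examples, and the hash attribute carries a single md5 value. The helper names parse_utc, needs_fetch, and verify_md5 are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def parse_utc(value):
    """Parse a datetime such as 2013-01-02T13:00:00Z (assumed form, UTC only)."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def needs_fetch(source_lastmod, local_lastmod):
    """True if the Destination holds no copy, or a copy older than the Source's."""
    if local_lastmod is None:
        return True
    return parse_utc(source_lastmod) > parse_utc(local_lastmod)


def verify_md5(content, hash_attr):
    """Compare retrieved bytes against a hash attribute of the form md5:<hex>."""
    prefix = "md5:"
    if not hash_attr.startswith(prefix):
        raise ValueError("this sketch only handles md5 hashes")
    return hashlib.md5(content).hexdigest() == hash_attr[len(prefix):].lower()
```

A Destination would call needs_fetch before issuing an HTTP GET for a listed resource and verify_md5 on the bytes it received.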

In order to describe its changing content in a more timely manner, the Source can increase the frequency at which it publishes an up-to-date Resource List. However, changes may be so frequent or the size of the content collection so vast that regularly updating a complete Resource List may be impractical. In such cases, the Source can implement an additional capability that communicates information about changes only. To this end, ResourceSync introduces Change Lists. A Change List enumerates resource changes, along with the nature of the change (create, update, or delete) and the time that the change occurred. A Destination can recurrently obtain a Change List from the Source, inspect the listed changes to discover those it has already acted upon, and process the remaining ones. Changes in a Change List are provided in forward chronological order, making it straightforward for a Destination to determine which changes it already processed. In addition, a Change List also contains datetimes that convey the start time and the end time of the temporal interval covered by the Change List. These times convey that all resource changes that occurred during the interval are described in the Change List. ResourceSync does not specify for how long Change Lists must continue to be available once they have been produced. The longer that Change Lists are maintained by the Source, the better the odds are for a Destination to catch up on changes it missed because it was offline, for example.

Example 2.3 shows a Change List. The value of the capability attribute of the <rs:md> child element of <urlset> makes it clear that, this time, the Sitemap is a Change List and not a Resource List. The from and until attributes inform about the temporal interval covered by the Change List. The Change List shown below conveys two resource changes, one being an update and the other a deletion, as can be seen from the value of the change attribute of the <rs:md> element. The example also shows the use of the <lastmod> element to convey the time of the changes. Note that these times are used to order the Change List chronologically.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"
         from="2013-01-02T00:00:00Z"
         until="2013-01-03T00:00:00Z"/>
  <url>
      <loc>http://example.com/res2.pdf</loc>
      <lastmod>2013-01-02T13:00:00Z</lastmod>
      <rs:md change="updated"/>
  </url>
  <url>
      <loc>http://example.com/res3.tiff</loc>
      <lastmod>2013-01-02T18:00:00Z</lastmod>
      <rs:md change="deleted"/>
  </url>
</urlset>

Example 2.3: A Change List
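
The Destination-side handling of a Change List can be sketched as below. This non-normative Python example parses Example 2.3, skips changes at or before a datetime the Destination has already processed, and applies the rest to a local record of URIs. It assumes all datetimes share the UTC "Z" form of the examples, so they can be compared as strings. The names pending_changes and apply_changes, and the dictionary used as a local record, are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# The document shown in Example 2.3.
CHANGE_LIST = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"
         from="2013-01-02T00:00:00Z"
         until="2013-01-03T00:00:00Z"/>
  <url>
      <loc>http://example.com/res2.pdf</loc>
      <lastmod>2013-01-02T13:00:00Z</lastmod>
      <rs:md change="updated"/>
  </url>
  <url>
      <loc>http://example.com/res3.tiff</loc>
      <lastmod>2013-01-02T18:00:00Z</lastmod>
      <rs:md change="deleted"/>
  </url>
</urlset>
"""


def pending_changes(xml_text, processed_until=None):
    """Return (lastmod, change, uri) tuples later than processed_until."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    changes = []
    for url in root.findall("sm:url", NS):
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        # String comparison is valid only under the same-format assumption above.
        if processed_until is not None and lastmod <= processed_until:
            continue
        change = url.find("rs:md", NS).get("change")
        changes.append((lastmod, change, url.findtext("sm:loc", namespaces=NS)))
    return changes


def apply_changes(local, changes):
    """Update the local {uri: lastmod} record; return URIs that need an HTTP GET."""
    to_fetch = []
    for lastmod, change, uri in changes:
        if change == "deleted":
            local.pop(uri, None)
            if uri in to_fetch:
                to_fetch.remove(uri)
        else:  # "created" or "updated"
            local[uri] = lastmod
            if uri not in to_fetch:
                to_fetch.append(uri)
    return to_fetch
```

Because entries appear in forward chronological order, the lastmod of the final processed entry can serve as the next processed_until value.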

A Destination can issue HTTP GET requests against each resource URI listed in a Resource List. For large Resource Lists, issuing all of these requests may be cumbersome. Therefore, ResourceSync introduces a capability that a Source can use to make packaged content available. A Resource Dump, implemented as a Sitemap, contains pointers to packaged content. Each content package referenced in a Resource Dump is a ZIP file that contains the Source's bitstreams along with a Resource Dump Manifest that describes each. The Resource Dump Manifest itself is also implemented as a Sitemap. A Destination can retrieve a Resource Dump, obtain content packages by dereferencing the contained pointers, and unpack the retrieved packages. Since the Resource Dump Manifest also lists the URI the Source associates with each bitstream, a Destination is able to achieve the same result as obtaining the data by dereferencing the URIs listed in a Resource List. Example 2.4 shows a Resource Dump that points at a single content package. Dereferencing the URI of that package leads to a ZIP file that contains the Resource Dump Manifest shown in Example 2.5. It indicates that the Source's ZIP file contains two bitstreams. The path attribute of the <rs:md> element conveys the file path of the bitstream in the ZIP file (the relative file system path where the bitstream would reside if the ZIP were unpacked), whereas the <loc> element conveys the URI associated with the bitstream at the Source.

An additional capability, the Change Dump, provides a functionality similar to a Resource Dump but pertains to packaging bitstreams of resources that have changed during a temporal interval, instead of packaging a snapshot of resource bitstreams at a specific moment in time.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcedump"
         at="2013-01-03T09:00:00Z"/>
  <url>
      <loc>http://example.com/resourcedump.zip</loc>
      <lastmod>2013-01-03T09:00:00Z</lastmod>
  </url>
</urlset>

Example 2.4: A Resource Dump

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcedump-manifest"
         at="2013-01-03T09:00:00Z"/>
  <url>
      <loc>http://example.com/res1</loc>
      <lastmod>2013-01-03T03:00:00Z</lastmod>
      <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
             path="/resources/res1"/>
  </url>
  <url>
      <loc>http://example.com/res2</loc>
      <lastmod>2013-01-03T04:00:00Z</lastmod>
      <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e"
             path="/resources/res2"/>
  </url>
</urlset>

Example 2.5: A Resource Dump Manifest detailing the content of a ZIP file
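
Unpacking a content package guided by its manifest can be sketched as follows. This non-normative Python example makes several assumptions that are not stated in this section: the manifest is stored in the ZIP file under the name manifest.xml, a leading "/" in the path attribute is interpreted relative to the package root, and only md5 hashes are checked. The demo package, its bitstream contents, and the names unpack_dump and make_demo_package are invented for illustration.

```python
import hashlib
import io
import zipfile
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS_NS = "http://www.openarchives.org/rs/terms/"
NS = {"sm": SITEMAP_NS, "rs": RS_NS}


def unpack_dump(zip_bytes, manifest_name="manifest.xml"):
    """Map each Source URI listed in the manifest to its packaged bitstream."""
    result = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as package:
        manifest = ET.fromstring(package.read(manifest_name))
        for url in manifest.findall("sm:url", NS):
            uri = url.findtext("sm:loc", namespaces=NS)
            md = url.find("rs:md", NS)
            # Assumption: strip a leading "/" to obtain the ZIP member name.
            content = package.read(md.get("path").lstrip("/"))
            declared = md.get("hash")
            if declared and declared.startswith("md5:"):
                if hashlib.md5(content).hexdigest() != declared[len("md5:"):]:
                    raise ValueError("hash mismatch for " + uri)
            result[uri] = content
    return result


def make_demo_package():
    """Build a small in-memory package shaped like Example 2.5 (invented content)."""
    files = {"resources/res1": b"first bitstream",
             "resources/res2": b"second bitstream"}
    entries = "".join(
        '<url><loc>http://example.com/{0}</loc>'
        '<rs:md hash="md5:{1}" path="/{2}"/></url>'.format(
            path.split("/")[-1], hashlib.md5(data).hexdigest(), path)
        for path, data in files.items())
    manifest = ('<urlset xmlns="{0}" xmlns:rs="{1}">'
                '<rs:md capability="resourcedump-manifest" '
                'at="2013-01-03T09:00:00Z"/>{2}</urlset>').format(
                    SITEMAP_NS, RS_NS, entries)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("manifest.xml", manifest)
        for path, data in files.items():
            package.writestr(path, data)
    return buffer.getvalue()
```

Keying the result by the <loc> value gives the Destination the same URI-to-content mapping it would obtain by dereferencing each URI in a Resource List.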

ResourceSync also introduces a Capability List, which is a way for the Source to describe the capabilities it supports for one set of resources. Example 2.6 shows an example of such a description. It indicates that the Source supports the Resource List, Resource Dump, and Change List capabilities and it lists their respective URIs. Note the inclusion of a <rs:ln> child element of <urlset> that links by means of a describedby relation to a description of the set of resources covered by the Capability List. Because these capabilities are conveyed in the same Capability List, they uniformly apply to this set of resources. For example, if a given resource appears in the Resource List then it must also appear in a Resource Dump and changes to the resource must be reported in the Change List.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="describedby"
         href="http://example.com/info_about_set1_of_resources.xml"/>
  <rs:ln rel="up"
         href="http://example.com/resourcesync_description.xml"/>
  <rs:md capability="capabilitylist"/>
  <url>
      <loc>http://example.com/dataset1/resourcelist.xml</loc>
      <rs:md capability="resourcelist"/>
  </url>
  <url>
      <loc>http://example.com/dataset1/resourcedump.xml</loc>
      <rs:md capability="resourcedump"/>
  </url>
  <url>
      <loc>http://example.com/dataset1/changelist.xml</loc>
      <rs:md capability="changelist"/>
  </url>
</urlset>

Example 2.6: A Capability List enumerating the ResourceSync capabilities a Source supports for a set of its resources
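
A Destination might read a Capability List as sketched below. This non-normative Python example parses Example 2.6 into two dictionaries: the document-level links keyed by relation type, and the capability documents keyed by capability name. The function name read_capability_list is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# The document shown in Example 2.6.
CAPABILITY_LIST = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="describedby"
         href="http://example.com/info_about_set1_of_resources.xml"/>
  <rs:ln rel="up"
         href="http://example.com/resourcesync_description.xml"/>
  <rs:md capability="capabilitylist"/>
  <url>
      <loc>http://example.com/dataset1/resourcelist.xml</loc>
      <rs:md capability="resourcelist"/>
  </url>
  <url>
      <loc>http://example.com/dataset1/resourcedump.xml</loc>
      <rs:md capability="resourcedump"/>
  </url>
  <url>
      <loc>http://example.com/dataset1/changelist.xml</loc>
      <rs:md capability="changelist"/>
  </url>
</urlset>
"""


def read_capability_list(xml_text):
    """Return ({rel: href}, {capability: URI}) for a Capability List."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    # Only <rs:ln> children of <urlset> itself are document-level links.
    links = {ln.get("rel"): ln.get("href") for ln in root.findall("rs:ln", NS)}
    capabilities = {}
    for url in root.findall("sm:url", NS):
        name = url.find("rs:md", NS).get("capability")
        capabilities[name] = url.findtext("sm:loc", namespaces=NS)
    return links, capabilities


links, capabilities = read_capability_list(CAPABILITY_LIST)
print(links["up"], sorted(capabilities))
```

With this mapping a Destination can, for instance, use the Change List when "changelist" is present and fall back to the Resource List otherwise.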

There are three ways by which a Destination can discover whether and how a Source supports ResourceSync: a Source-wide approach, a resource-specific approach, and an approach that leverages existing practice for discovering Sitemaps. The Source-wide approach leverages the well-known URI specification and consists of the Source making a Source Description, like the one shown in Example 2.7, available at /.well-known/resourcesync. The Source Description enumerates the Capability Lists a Source offers, one Capability List per set of resources. If a Source only has one set of resources and hence only one Capability List, the mandatory Source Description contains only one pointer. The resource-specific discovery approach consists of a Source providing a link in an HTML document or in an HTTP Link header that points at a Capability List that covers the resource that provides the link. Note, in Example 2.6, the inclusion of a <rs:ln> child element of <urlset> that links by means of an up relation to the Source Description, allowing for navigation from a Capability List to a Source Description. Yet another approach follows the established practice for discovering Sitemaps via a Source's robots.txt file. Since a Resource List is a Sitemap it can be made discoverable by including its URI in the robots.txt file as the value of the Sitemap directive. A navigational up link included in the Resource List allows discovery of a Capability List pertaining to the set of resources covered by that Resource List, and a further up link in the Capability List leads to the Source Description.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="describedby"
         href="http://example.com/info-about-source.xml"/>
  <rs:md capability="description"/>
  <url>
      <loc>http://example.com/dataset1/capabilitylist.xml</loc>
      <rs:md capability="capabilitylist"/>
      <rs:ln rel="describedby"
             href="http://example.com/info_about_set1_of_resources.xml"/>
  </url>
</urlset>

Example 2.7: A Source Description with a pointer to the Capability List for the single set of resources offered by a Source
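
The Source-wide discovery path can be sketched as follows. This non-normative Python example derives the well-known URI from any URI at the Source and then lists the Capability Lists named in Example 2.7. No network request is made; the function names source_description_uri and capability_list_uris are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# The document shown in Example 2.7.
SOURCE_DESCRIPTION = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="describedby"
         href="http://example.com/info-about-source.xml"/>
  <rs:md capability="description"/>
  <url>
      <loc>http://example.com/dataset1/capabilitylist.xml</loc>
      <rs:md capability="capabilitylist"/>
      <rs:ln rel="describedby"
             href="http://example.com/info_about_set1_of_resources.xml"/>
  </url>
</urlset>
"""


def source_description_uri(any_uri_at_source):
    """Resolve the well-known path against the scheme and host of a Source URI."""
    return urljoin(any_uri_at_source, "/.well-known/resourcesync")


def capability_list_uris(xml_text):
    """Return the URIs of all Capability Lists enumerated in a Source Description."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    uris = []
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        if md is not None and md.get("capability") == "capabilitylist":
            uris.append(url.findtext("sm:loc", namespaces=NS))
    return uris


print(source_description_uri("http://example.com/dataset1/res1"))
print(capability_list_uris(SOURCE_DESCRIPTION))
```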

In some cases, there is a need to split the documents described so far into parts. For example, the Sitemap protocol currently prescribes a maximum of 50,000 resources per Sitemap and a Source may have more resources that are subject to synchronization. The ResourceSync framework follows these community defined limits and hence, in such cases, publishes multiple Resource Lists as well as a Resource List Index that points to each of them. The Resource List Index is expressed using Sitemap's <sitemapindex> document format. Example 2.8 shows a Resource List Index that points at two Resource Lists.

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist"
         at="2013-01-03T09:00:00Z"/>
  <sitemap>
      <loc>http://example.com/resourcelist-part1.xml</loc>
  </sitemap>
  <sitemap>
      <loc>http://example.com/resourcelist-part2.xml</loc>
  </sitemap>
</sitemapindex>

Example 2.8: A Resource List Index expressed using the <sitemapindex> document format
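
On the Source side, splitting a large set of URIs can be sketched as below. This non-normative Python example writes one Resource List per chunk of at most 50,000 URIs and, when more than one chunk is needed, a Resource List Index pointing at each part. The function names, the part_uri template parameter, and the decision to return documents as strings are assumptions of the sketch.

```python
from xml.sax.saxutils import escape, quoteattr

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS_NS = "http://www.openarchives.org/rs/terms/"
MAX_ITEMS = 50000  # per-document limit cited in this specification


def _document(root, entry, uris, at):
    """Serialize a <urlset> or <sitemapindex> carrying the resourcelist capability."""
    lines = ['<?xml version="1.0" encoding="UTF-8"?>',
             '<{0} xmlns="{1}" xmlns:rs="{2}">'.format(root, SITEMAP_NS, RS_NS),
             '  <rs:md capability="resourcelist" at={0}/>'.format(quoteattr(at))]
    for uri in uris:
        lines.append("  <{0}><loc>{1}</loc></{0}>".format(entry, escape(uri)))
    lines.append("</{0}>".format(root))
    return "\n".join(lines)


def publish_resource_lists(uris, at, part_uri, max_items=MAX_ITEMS):
    """Return (index XML or None, {document URI: Resource List XML})."""
    parts = [uris[i:i + max_items] for i in range(0, len(uris), max_items)]
    if len(parts) <= 1:
        # Everything fits in one Resource List, so no index is needed.
        return None, {part_uri.format(n=1): _document("urlset", "url", uris, at)}
    documents = {part_uri.format(n=n): _document("urlset", "url", part, at)
                 for n, part in enumerate(parts, start=1)}
    index = _document("sitemapindex", "sitemap", list(documents), at)
    return index, documents
```

For example, calling publish_resource_lists with 120,000 URIs and the template "http://example.com/resourcelist-part{n}.xml" would yield three parts and an index listing them.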

3. Synchronization Processes

The previous section provides a concrete walkthrough of some capabilities that a Source can implement and describes how a Destination can use those capabilities to remain synchronized with the Source's changing data. This section provides a high-level overview of the various ResourceSync capabilities and shows how these fit in a Destination's processes aimed at remaining in step with changes.

3.1. Source Perspective

From the perspective of a Source, the ResourceSync capabilities that can be supported to enable Destination processes to remain in sync with its changing data can be summarized as follows:

Describing Content - In order to describe its data, a Source can maintain an up-to-date Resource List. A basic Resource List minimally provides the URIs of resources that the Source makes available for synchronization. However, additional information can be added to the Resource List to optimize the Destination's process of obtaining the Source's resources, including the most recent modification time of resources and fixity information such as content-based checksum or hash and length. Figure 1 shows a Source publishing up-to-date Resource Lists at times t2 and t4. At t4, too many resources need to be listed to fit in a single Resource List and hence multiple Resource Lists are published and grouped in a Resource List Index.

Packaging Content - In order to make its data available for download, a Source can recurrently make an up-to-date Resource Dump of its content available. A Resource Dump points at one or more packages, each of which contains bitstreams associated with resources hosted by the Source. Each package also contains a Resource Dump Manifest that provides metadata about the bitstreams contained in the package, minimally including their associated URI and their file path in the ZIP file. Figure 1 shows a Source publishing up-to-date Resource Dumps at times t1 and t3. At time t3, multiple Resource Dumps are published and grouped in a Resource Dump Index.

Describing Changes - In order to achieve lower synchronization latency and/or to improve transfer efficiency, a Source may publish a Change List that provides information about changes to its resources. It is up to the Source to decide what the temporal interval is that is covered by a Change List, for example, expressing all the changes that occurred during the previous hour, the current day, or since the most recent publication of a Resource List. Per resource change, a Change List minimally conveys the URI of the changed resource as well as the datetime and nature of the change (create, update, delete). Since a Change List is organized on the basis of changes, it may list the same resource multiple times, once per change. Figure 1 shows three Change Lists. The first Change List covers resource changes that occurred between t1 and t3, the second between t3 and t5, and the third between t5 and t7. Since too many changes occurred between t5 and t7 to fit in a single Change List, multiple Change Lists are published and grouped in a Change List Index.

Packaging Changes - In order to make content changes available for download, a Source can publish a Change Dump. A Change Dump points at one or more packages, each of which contains bitstreams that correspond to the state of resources after they changed. Each package also contains a Change Dump Manifest that provides metadata about the bitstreams provided in the Change Dump. Per bitstream, the Change Dump Manifest minimally includes the associated URI, the datetime when the change that resulted in the bitstream occurred, the nature of the change (create, update, delete) and, where appropriate, the file path of the bitstream in the ZIP file. It is up to a Source to decide the temporal interval covered by a Change Dump, for example, covering all the resource changes that occurred during the previous hour, the current day, or since the most recent publication of a Resource Dump. Since a Change Dump is organized on the basis of changes, the package(s) it points at may contain multiple bitstreams associated with any given resource, one per change. Figure 1 shows three Change Dumps. The first Change Dump covers resource changes that occurred between t2 and t4, the second between t4 and t6, and the third between t6 and t8. During the time period between t6 and t8, multiple Change Dumps are published and grouped in a Change Dump Index.

Linking to Related Resources - There are several reasons to provide additional links from a resource subject to synchronization to related resources, including:

Figure 1: ResourceSync Source perspective

3.2. Destination Perspective

From the perspective of a Destination, three key processes are enabled by the ResourceSync capabilities; Figure 2 provides an overview:

Baseline Synchronization - In order to become synchronized with a Source, the Destination must make an initial copy of the Source's data. A Destination can obtain the Resource List that conveys the URIs of the Source's resources, and subsequently dereference those URIs one by one. A Destination can also obtain a Resource Dump that conveys the URIs of one or more content packages each of which contains bitstreams associated with the Source's resources. A Destination can dereference those URIs and subsequently unpack the retrieved content packages, guided by the contained Resource Dump Manifest.

Incremental Synchronization - A Destination can remain in sync with a Source by repeatedly performing a Baseline Synchronization. To increase efficiency and decrease latency, a Source may communicate information about changes to its resources via Change Lists. This allows a Destination to obtain up-to-date content by dereferencing the URIs of newly created and updated resources listed in the Change List. It also allows a Destination to remove its copies of deleted resources, if needed. A Source can also make a Change Dump available that points at one or more packages, each of which contains bitstreams that correspond to the state of resources after they changed. In this case the Destination first obtains the Change Dump, then obtains the package(s) by dereferencing the URI(s) listed in the Change Dump, and subsequently unpacks those, guided by the contained Change Dump Manifest.

Audit - In order to verify whether it is in sync with the Source, a Destination must be able to check that the content it obtained matches the current resources hosted by the Source both regarding coverage and accuracy. This requires an up-to-date list of resources hosted by the Source, which can be compiled on the basis of a Resource List and Change Lists. It also requires these Lists to contain metadata per resource that characterizes its most recent state, such as last modification time, length, and content-based hash.
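
The Audit process can be sketched as follows. This non-normative Python example compiles the expected state of the Source from a Resource List and subsequent changes, then compares it with the Destination's record. For brevity both are modeled as dictionaries from URI to last modification time; a fuller audit would also compare length and hash. The names expected_state and audit and the shape of the report are assumptions of the sketch.

```python
def expected_state(resource_list, changes):
    """Apply chronologically ordered (lastmod, change, uri) tuples to {uri: lastmod}."""
    state = dict(resource_list)
    for lastmod, change, uri in changes:
        if change == "deleted":
            state.pop(uri, None)
        else:  # "created" or "updated"
            state[uri] = lastmod
    return state


def audit(local, expected):
    """Report coverage problems (missing, extra) and accuracy problems (stale)."""
    local_uris, expected_uris = set(local), set(expected)
    return {
        "missing": sorted(expected_uris - local_uris),
        "extra": sorted(local_uris - expected_uris),
        "stale": sorted(uri for uri in local_uris & expected_uris
                        if local[uri] != expected[uri]),
    }
```

An empty report in all three categories indicates that, as far as the listed metadata can tell, the Destination is in sync with the Source.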

Figure 2: ResourceSync Destination perspective

3.3. Summary

Table 3.1 provides a summary of this section. The table lists Destination processes as columns and Source capabilities as rows, with cells indicating the applicability of a capability for a given process.

Source Capabilities                         Destination Processes
                                            Baseline Synchronization  Incremental Synchronization  Audit
Describing the Source                       X                         X                            X
Advertising Capabilities                    X                         X                            X
Describing Resources
    Resource List                           X                                                      X
Packaging Resources
    Resource Dump                           X
Describing Changes
    Change List                                                       X                            X
Packaging Changes
    Change Dump                                                       X
Linking to Related Resources
    Mirrored Content                        X                         X                            X
    Alternate Representations               X                         X                            X
    Patching Content                                                  X                            X
    Resources and Metadata about Resources  X                         X                            X
    Prior Versions of Resources             X                         X
    Collection Membership                   X                         X                            X
    Republishing Resources                  X                         X                            X

Table 3.1: Source capabilities versus Destination processes

4. Framework Organization

4.1. Structure

All capabilities in the ResourceSync framework are implemented on the basis of the <urlset> and <sitemapindex> Sitemap document formats. Figure 3 depicts the overall structure of the set of documents that is used:

The Resource List branch of Figure 3 is fully compatible with the existing Sitemap specification, whereas the other branches are extensions introduced to support resource synchronization that leverage the Sitemap document formats.

Figure 3: ResourceSync framework structure

4.2. Navigation

The following mechanisms are introduced to support navigating the document hierarchy described in the previous section; they are illustrated in Figure 4 and Figure 5:

Figure 4: ResourceSync upwards navigation

Figure 5: ResourceSync downwards navigation

4.3. Discovery

ResourceSync provides three ways for a Destination to discover whether and how a Source supports ResourceSync: a Source-wide approach detailed in Section 4.3.1, a resource-specific approach detailed in Section 4.3.2, and an approach that leverages the existing practice of Sitemap discovery via the robots.txt file described in Section 4.3.3. All approaches are summarized in Figure 6.

Figure 6: Discovery of Source Description and Capability List

4.3.1. ResourceSync Well-Known URI

A Source must publish a Source Description, such as the one shown in Example 2.7, and it should be published at the well-known URI [RFC 5785] /.well-known/resourcesync defined in this specification. The Source Description document enumerates a Source's Capability Lists and as such is an appropriate entry point for Destinations interested in understanding a Source's capabilities.

4.3.2. Links

A Capability List can be made discoverable by means of links provided either in an HTML document [HTML Links, XHTML Links] or in an HTTP Link header [RFC 5988].

In order to include a discovery link in an HTML document, a <link> element is introduced in the <head> of the document that points to a Capability List. This <link> must have a rel attribute with a value of resourcesync. The Capability List that is made discoverable in this way must pertain to the resource that provides the link. This means that the resource must be covered by the capabilities listed in the linked Capability List. Example 4.1 shows the structure of a web page that contains a link to a Capability List.

As shown in Example 2.6 the Source Description can be discovered from the Capability List by following the link provided in the <rs:ln> element with the relation type up.

<html>
  <head>
    <link rel="resourcesync"
          href="http://www.example.com/dataset1/capabilitylist.xml"/>
    ...
  </head>
  <body>...</body>
</html>

Example 4.1: Discovery by means of an HTML link
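
A Destination could extract such a link as sketched below. This non-normative Python example uses the standard library HTML parser to collect the href of every <link> element whose rel attribute contains resourcesync, applied to the page from Example 4.1. The class name CapabilityLinkParser is hypothetical.

```python
from html.parser import HTMLParser

# The page shown in Example 4.1.
PAGE = """<html>
  <head>
    <link rel="resourcesync"
          href="http://www.example.com/dataset1/capabilitylist.xml"/>
    ...
  </head>
  <body>...</body>
</html>
"""


class CapabilityLinkParser(HTMLParser):
    """Collect Capability List URIs advertised via <link rel="resourcesync">."""

    def __init__(self):
        super().__init__()
        self.capability_lists = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <link .../> tags are also routed here by HTMLParser.
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated relation types.
        rels = (attributes.get("rel") or "").lower().split()
        if "resourcesync" in rels and attributes.get("href"):
            self.capability_lists.append(attributes["href"])


parser = CapabilityLinkParser()
parser.feed(PAGE)
print(parser.capability_lists)
```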

A Capability List can also be made discoverable by means of an HTTP Link header that can be included with a representation of a resource of any content-type. In order to do so, a link is introduced in the HTTP Link header. The target IRI of this link is the URI of a Capability List and the value of its rel attribute is resourcesync. The Capability List that is made discoverable in this way must pertain to the resource that provides the link. This means that the resource must be covered by the capabilities listed in the linked Capability List. Example 4.2 shows an excerpt of an HTTP response header that illustrates this approach.

HTTP/1.1 200 OK
  Date: Thu, 21 Jan 2010 00:02:12 GMT
  Server: Apache
  Link: <http://www.example.com/dataset1/capabilitylist.xml>;
         rel="resourcesync"
  ...

Example 4.2: Discovery by means of an HTTP link
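
The following sketch shows one way a Destination might pull Capability List URIs out of a Link header value. It is deliberately simplified: the regular expressions cover only the common form of link values and parameters and are not a full implementation of the RFC 5988 grammar. The function name is hypothetical.

```python
import re

# One link-value: <target> followed by zero or more ";name=value" parameters.
LINK_VALUE = re.compile(
    r'<([^>]*)>((?:\s*;\s*[\w*-]+(?:\s*=\s*(?:"[^"]*"|[^\s";,]+))?)*)'
)
# One parameter: name, then either a quoted or an unquoted value.
PARAM = re.compile(r';\s*([\w*-]+)(?:\s*=\s*(?:"([^"]*)"|([^\s";,]+)))?')


def capability_lists_from_link_header(header_value):
    """Return the targets of all links whose rel contains 'resourcesync'."""
    found = []
    for target, params in LINK_VALUE.findall(header_value):
        for name, quoted, bare in PARAM.findall(params):
            if name.lower() != "rel":
                continue
            # A rel value may hold several space-separated relation types.
            if "resourcesync" in (quoted or bare).lower().split():
                found.append(target)
    return found


header = ('<http://www.example.com/dataset1/capabilitylist.xml>;\n'
          '         rel="resourcesync"')
print(capability_lists_from_link_header(header))
```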

4.3.3. robots.txt

A Resource List is a Sitemap and hence can be made discoverable via the established approach of adding a Sitemap directive to a Source's robots.txt file that has the URI of the Resource List as its value. If a Source supports multiple sets of resources, multiple directives can be added, one for each Resource List associated with a specific set of resources. In case a Source supports both regular Sitemaps and ResourceSync Sitemaps (Resource Lists) they can be made discoverable, again, by including multiple Sitemap directives.

Once a Resource List for a set of resources has been discovered in this manner, the corresponding Capability List can be discovered by following a link with the up relation type provided in the Resource List. Next, the Source Description can be discovered by following yet another link with the up relation type provided in the Capability List.

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Sitemap: http://example.com/dataset1/resourcelist.xml

Example 4.3: A robots.txt file that points at a Resource List
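
A minimal sketch of how a Destination might collect the Sitemap directives from a robots.txt file is given below. The function name sitemap_uris is an assumption, and the sketch treats everything after a "#" as a comment. A robots.txt file does not say which of the listed Sitemaps are Resource Lists, so a Destination would presumably still need to retrieve each document and inspect its ResourceSync metadata.

```python
def sitemap_uris(robots_txt):
    """Return the values of all Sitemap directives, in order of appearance."""
    uris = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()      # drop comments and padding
        field, sep, value = line.partition(":")   # split on the first colon only
        if sep and field.strip().lower() == "sitemap" and value.strip():
            uris.append(value.strip())
    return uris


ROBOTS = """User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Sitemap: http://example.com/dataset1/resourcelist.xml
"""

print(sitemap_uris(ROBOTS))
```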

5. Sitemap Document Formats

In order to convey information pertaining to resources in the ResourceSync framework, the Sitemap (root element <urlset>) and Sitemap index (root element <sitemapindex>) document formats introduced by the Sitemap protocol are used for a variety of purposes. The <sitemapindex> document format is used when it is necessary to group multiple documents of the <urlset> format. The ResourceSync framework follows community defined limits for when to publish multiple documents of the <urlset> format. At the time of publication of this specification, the limit is 50,000 items per document and a document size of 50 MB.
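
As an illustration of the item limit, a Source could partition its list of resource URIs before writing <urlset> documents, as in the sketch below. The helper name is hypothetical, and the sketch enforces only the item count; a real implementation would also have to watch the document size limit.

```python
MAX_ITEMS_PER_DOCUMENT = 50_000  # item limit cited in this section


def split_into_documents(uris, max_items=MAX_ITEMS_PER_DOCUMENT):
    """Split a list of URIs into consecutive chunks of at most max_items entries."""
    return [uris[i:i + max_items] for i in range(0, len(uris), max_items)]


uris = ["http://example.com/res%d" % n for n in range(120_001)]
documents = split_into_documents(uris)
# More than one chunk means a <sitemapindex> document is needed to group them.
print(len(documents), [len(d) for d in documents])
```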

The document formats, as well as their ResourceSync extension elements, are shown in Table 5.1. The <rs:md> and <rs:ln> elements are introduced to express metadata and links, respectively. Both are in the ResourceSync XML Namespace and can have attributes. The attributes of these elements defined by ResourceSync are listed in Table 5.2 and detailed below. As shown in the examples, these attributes must not have an XML Namespace prefix. The <rs:ln> element as well as several of the ResourceSync attributes are based upon other specifications and in those cases inherit the semantics defined there; the "Specification" column of Table 5.2 refers to those specifications. Communities may introduce additional attributes when needed but must use an XML Namespace other than that of ResourceSync and must appropriately use namespace prefixes for those attributes.

Sitemap:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md />
  <rs:ln />
  <url>
      <loc />
      <lastmod />
      <rs:md />
      <rs:ln />
  </url>
  <url>
      ...
  </url>
</urlset>

Sitemap Index:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="..."
              xmlns:rs="...">
  <rs:md />
  <rs:ln />
  <sitemap>
      <loc />
      <lastmod />
      <rs:md />
      <rs:ln />
  </sitemap>
  <sitemap>
      ...
  </sitemap>
</sitemapindex>

Table 5.1: The Sitemap document formats including the ResourceSync extensions

The overall structure of the ResourceSync documents is as follows:

Table 5.2 lists the elements used in ResourceSync documents and for each shows the attributes defined by ResourceSync that can be used with them. The "Specification" column refers to the specification where elements or attributes were introduced that ResourceSync equivalents are based upon and inherit their semantics from. A mark in the "Representation" column for an attribute indicates that it can only be used when a specific representation of a resource is concerned, whereas a mark in the "Resource" column indicates it is usable for a resource in general. A W3C XML Schema is provided to validate the elements introduced by ResourceSync.

Relation types other than the ones listed above can be used in the ResourceSync framework. Valid relation types must be registered in the IANA Link Relation Type Registry or expressed as URIs as specified in RFC 5988, Sec. 4.2. The document [Relation Types Used in the ResourceSync Framework] attempts to provide an up-to-date overview.

Element/Attribute             Specification         Resource  Representation
<urlset> or <sitemapindex>    Sitemap protocol
    <rs:md>                   This specification
        at                    This specification
        capability            This specification
        completed             This specification
        from                  This specification
        until                 This specification
    <rs:ln>                   RFC4287
        href                  RFC4287
        rel                   RFC4287
    <url> or <sitemap>        Sitemap protocol
        <loc>                 Sitemap protocol
        <lastmod>             Sitemap protocol
        <changefreq>          Sitemap protocol
        <rs:md>               This specification
            at                This specification
            capability        This specification
            change            This specification    X         X
            completed         This specification
            encoding          RFC2616                         X
            from              This specification
            hash              Atom Link Extensions            X
            length            RFC4287                         X
            path              This specification              X
            type              RFC4287                         X
            until             This specification
        <rs:ln>               This specification
            encoding          RFC2616                         X
            hash              Atom Link Extensions            X
            href              RFC4287               X         X
            length            RFC4287                         X
            modified          Atom Link Extensions  X         X
            path              This specification              X
            pri               RFC6249               X         X
            rel               RFC4287               X         X
            type              RFC4287                         X

Table 5.2: Elements and associated attributes defined for the ResourceSync documents

6. Describing the Source

A Source Description is a mandatory document that enumerates the Capability Lists offered by a Source. Since a Source has one Capability List per set of resources that it distinguishes, the Source Description will enumerate as many Capability Lists as the Source has distinct sets of resources.

The Source Description is based on the <urlset> format. It has the <urlset> root element and the following structure:

The <lastmod> elements should be omitted from the Source Description unless the Source updates the Source Description every time it updates one of the Capability Lists.

Example 6.1 shows a Source Description where the Source offers three Capability Lists.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="describedby"
         href="http://example.com/info_about_source.xml"/>
  <rs:md capability="description"/>
  <url>
      <loc>http://example.com/capabilitylist1.xml</loc>
      <rs:md capability="capabilitylist"/>
      <rs:ln rel="describedby"
             href="http://example.com/info_about_set1_of_resources.xml"/>
  </url>
  <url>
      <loc>http://example.com/capabilitylist2.xml</loc>
      <rs:md capability="capabilitylist"/>
      <rs:ln rel="describedby"
             href="http://example.com/info_about_set2_of_resources.xml"/>
  </url>
  <url>
      <loc>http://example.com/capabilitylist3.xml</loc>
      <rs:md capability="capabilitylist"/>
      <rs:ln rel="describedby"
             href="http://example.com/info_about_set3_of_resources.xml"/>
  </url>
</urlset>

Example 6.1: A Source Description
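
A Destination could read such a document with any namespace-aware XML parser. The sketch below uses Python's xml.etree.ElementTree; the function name capability_list_uris and the abbreviated sample document are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def capability_list_uris(source_description_xml):
    """Return the URIs of all Capability Lists enumerated in a Source Description."""
    root = ET.fromstring(source_description_xml)
    md = root.find("rs:md", NS)
    if md is None or md.get("capability") != "description":
        raise ValueError("document is not a Source Description")
    uris = []
    for url in root.findall("sm:url", NS):
        entry_md = url.find("rs:md", NS)
        if entry_md is not None and entry_md.get("capability") == "capabilitylist":
            uris.append(url.findtext("sm:loc", default="", namespaces=NS).strip())
    return uris


SOURCE_DESCRIPTION = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="description"/>
  <url>
      <loc>http://example.com/capabilitylist1.xml</loc>
      <rs:md capability="capabilitylist"/>
  </url>
  <url>
      <loc>http://example.com/capabilitylist2.xml</loc>
      <rs:md capability="capabilitylist"/>
  </url>
</urlset>"""

print(capability_list_uris(SOURCE_DESCRIPTION))
```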

If a Source needs to or chooses to publish multiple Source Descriptions, it must group them by means of a Source Description Index.

7. Advertising Capabilities

A Capability List is a document that enumerates all capabilities supported by a Source for a specific set of resources. The Source defines which resources are part of the set of resources described by the Capability List. If there is more than one such set, then the Source must distinguish them with different Capability Lists. The choice of which resources are part of which set can derive from a variety of criteria, including media type, collection membership, change frequency, subject of the resource and many others.

A Capability List points at the capability documents for its set of resources: Resource List, Resource Dump, Change List, and Change Dump as introduced in Section 8, Section 9.1, Section 10, and Section 11.1, respectively. A Capability List can only contain one entry per capability.

Capabilities that are conveyed in the same Capability List uniformly apply to the set of resources covered by that Capability List. For example, if a Capability List enumerates a Resource List, a Resource Dump, and a Change List, then a given resource that appears in a Resource List must also appear in a Resource Dump, and changes to the resource must be conveyed in the Change List.

The Capability List is based on the <urlset> format. It has the <urlset> root element and the following structure:

The <lastmod> elements should be omitted from the Capability List unless the Source updates the Capability List every time it updates one of the capability documents.

Example 7.1 shows a Capability List where the Source offers four capabilities: a Resource List, a Resource Dump, a Change List, and a Change Dump. A Destination cannot determine from the Capability List whether a Source provides, for example, a Resource List Index or a single Resource List. The capability document must be downloaded to make this determination: a document with a <sitemapindex> root element is an index, a document with a <urlset> root element is not.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="describedby"
         href="http://example.com/info_about_set1_of_resources.xml"/>
  <rs:ln rel="up"
         href="http://example.com/resourcesync_description.xml"/>
  <rs:md capability="capabilitylist"/>
  <url>
      <loc>http://example.com/dataset1/resourcelist.xml</loc>
      <rs:md capability="resourcelist"/>
  </url>
  <url>
      <loc>http://example.com/dataset1/resourcedump.xml</loc>
      <rs:md capability="resourcedump"/>
  </url>
  <url>
      <loc>http://example.com/dataset1/changelist.xml</loc>
      <rs:md capability="changelist"/>
  </url>
  <url>
      <loc>http://example.com/dataset1/changedump.xml</loc>
      <rs:md capability="changedump"/>
  </url>
</urlset>

Example 7.1: A Capability List
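
The sketch below shows how a Destination might turn a Capability List into a lookup table from capability name to capability document URI, while checking the rule that each capability appears at most once. The function name and the shortened sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def capabilities(capability_list_xml):
    """Map each capability name to the URI of its capability document."""
    root = ET.fromstring(capability_list_xml)
    found = {}
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        name = md.get("capability") if md is not None else None
        if name is None:
            continue  # entry carries no ResourceSync capability; skip it
        if name in found:
            raise ValueError("capability %r listed more than once" % name)
        found[name] = url.findtext("sm:loc", default="", namespaces=NS).strip()
    return found


CAPABILITY_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="capabilitylist"/>
  <url>
      <loc>http://example.com/dataset1/resourcelist.xml</loc>
      <rs:md capability="resourcelist"/>
  </url>
  <url>
      <loc>http://example.com/dataset1/changelist.xml</loc>
      <rs:md capability="changelist"/>
  </url>
</urlset>"""

print(capabilities(CAPABILITY_LIST))
```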

ResourceSync defines only a small number of capabilities, and enumerating those does not approach the limits of a single Capability List. Extensions or revisions of this specification may introduce the use of Capability List Indexes, but Sources should not generate such structures for the features introduced in this version of the ResourceSync specification.

8. Describing Resources

A Source may publish a description of the resources it makes available for synchronization. This information enables a Destination to make an initial copy of some or all of those resources, or to update a local copy to remain synchronized with changes.

8.1. Resource List

A Resource List is introduced to list and describe the resources that a Source makes available for synchronization. It presents a snapshot of a Source's resources at a particular point in time.

A Resource List is based on the <urlset> document format introduced by the Sitemap protocol. It has the <urlset> root element and the following structure:

Example 8.1 shows a Resource List with two resources. The at attribute makes it possible to determine that neither of the listed resources has undergone a change between its last modification datetime, 2013-01-02T13:00:00Z and 2013-01-02T14:00:00Z respectively, and the datetime that is the value of the at attribute, 2013-01-03T09:00:00Z.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up"
         href="http://example.com/dataset1/capabilitylist.xml"/>
  <rs:md capability="resourcelist"
         at="2013-01-03T09:00:00Z"
         completed="2013-01-03T09:01:00Z"/>
  <url>
      <loc>http://example.com/res1</loc>
      <lastmod>2013-01-02T13:00:00Z</lastmod>
      <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
             length="8876"
             type="text/html"/>
  </url>
  <url>
      <loc>http://example.com/res2</loc>
      <lastmod>2013-01-02T14:00:00Z</lastmod>
      <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e 
                   sha-256:854f61290e2e197a11bc91063afce22e43f8ccc655237050ace766adc68dc784"
             length="14599"
             type="application/pdf"/>
  </url>
</urlset>

Example 8.1: A Resource List
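
The hash and length attributes allow a Destination to check a retrieved bitstream. The sketch below is one possible approach: it splits the space-separated hash attribute into algorithm and digest pairs and compares them against digests computed with Python's hashlib. The function names are hypothetical, and the mapping from attribute prefixes to hashlib algorithm names is an assumption; only the md5 and sha-256 prefixes appear in the examples of this specification.

```python
import hashlib

# Assumed mapping from hash-attribute prefixes to hashlib algorithm names.
ALGORITHMS = {"md5": "md5", "sha-1": "sha1", "sha-256": "sha256"}


def parse_hash_attribute(value):
    """Split 'md5:... sha-256:...' into a dict of {prefix: hex digest}."""
    digests = {}
    for token in value.split():
        name, _, hexdigest = token.partition(":")
        digests[name.lower()] = hexdigest.lower()
    return digests


def verify_bitstream(content, hash_attr, length_attr=None):
    """Return True only if the length (when given) and at least one known digest match
    and no known digest mismatches."""
    if length_attr is not None and len(content) != int(length_attr):
        return False
    checked = False
    for name, expected in parse_hash_attribute(hash_attr).items():
        algorithm = ALGORITHMS.get(name)
        if algorithm is None:
            continue  # unknown algorithm: cannot be checked here
        if hashlib.new(algorithm, content).hexdigest() != expected:
            return False
        checked = True
    return checked


body = b"hello"
attr = "md5:%s sha-256:%s" % (hashlib.md5(body).hexdigest(),
                              hashlib.sha256(body).hexdigest())
print(verify_bitstream(body, attr, "5"))
```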

8.2. Resource List Index

The ResourceSync framework adopts the community defined limits for publishing documents of the <urlset> format and introduces a Resource List Index for grouping multiple Resource Lists. The union of the Resource Lists referred to in the Resource List Index represents the entire set of resources that a Source makes available for synchronization. This set of resources, regardless of whether it is conveyed in a single Resource List or in multiple Resource Lists via a Resource List Index, represents the state of the Source's data at a point in time.

A Resource List Index is based on the <sitemapindex> document format introduced by the Sitemap protocol. It has the <sitemapindex> root element and the following structure:

The Destination can determine whether it has reached a Resource List or a Resource List Index based on whether the root element is <urlset> or <sitemapindex> respectively. A Resource List Index that points to three Resource Lists is shown in Example 8.2.

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up"
         href="http://example.com/dataset1/capabilitylist.xml"/>
  <rs:md capability="resourcelist"
         at="2013-01-03T09:00:00Z"
         completed="2013-01-03T09:10:00Z"/>
  <sitemap>
      <loc>http://example.com/resourcelist1.xml</loc>
      <rs:md at="2013-01-03T09:00:00Z"/>
  </sitemap>
  <sitemap>
      <loc>http://example.com/resourcelist2.xml</loc>
      <rs:md at="2013-01-03T09:03:00Z"/>
  </sitemap>
  <sitemap>
      <loc>http://example.com/resourcelist3.xml</loc>
      <rs:md at="2013-01-03T09:07:00Z"/>
  </sitemap>
</sitemapindex>

Example 8.2: A Resource List Index

Example 8.3 shows the content of the Resource List identified by the URI http://example.com/resourcelist1.xml. Structurally, it is identical to the Resource List shown in Example 8.1 but it contains an additional <rs:ln> child element of <urlset> that provides a navigational link with the relation type index to the parent Resource List Index shown in Example 8.2. This link is meant to ease navigation for Destinations and its adoption is therefore recommended.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up"
         href="http://example.com/dataset1/capabilitylist.xml"/>
  <rs:ln rel="index"
         href="http://example.com/dataset1/resourcelist-index.xml"/>
  <rs:md capability="resourcelist"
         at="2013-01-03T09:00:00Z"/>
  <url>
      <loc>http://example.com/res3</loc>
      <lastmod>2013-01-02T13:00:00Z</lastmod>
      <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c8753"
             length="4385"
             type="application/pdf"/>
  </url>
  <url>
      <loc>http://example.com/res4</loc>
      <lastmod>2013-01-02T14:00:00Z</lastmod>
      <rs:md hash="md5:4556abdf8ebdc9802ac0c6a7402c9881"
             length="883"
             type="image/png"/>
  </url>
</urlset>

Example 8.3: A Resource List with a navigational link to its parent Resource List Index

9. Packaging Resources

In order to provide Destinations with an efficient way to copy a Source's data using a small number of HTTP requests, a Source may provide packaged bitstreams for its resources.

9.1. Resource Dump

A Source can publish a Resource Dump, which provides links to packages of the resources' bitstreams. The Resource Dump represents the Source's state at a point in time. It may be used to transfer resources from the Source in bulk, rather than the Destination having to make many separate requests.

The ResourceSync framework specifies the use of the ZIP file format as the packaging format. Communities can define their own packaging format. A Resource Dump should only point to packages of the same format.

A Resource Dump is based on the <urlset> document format introduced by the Sitemap protocol. It has the <urlset> root element and the following structure:

Example 9.1 shows a Resource Dump that points to three ZIP files. Included in each <url> element is a pointer to the Resource Dump Manifest associated with the package. While this pointer is optional and intended for the Destination's convenience, if provided, the Source needs to ensure that the referred Manifest corresponds with the Manifest included in the bitstream package.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up"
         href="http://example.com/dataset1/capabilitylist.xml"/>
  <rs:md capability="resourcedump"
         at="2013-01-03T09:00:00Z"
         completed="2013-01-03T09:04:00Z"/>
  <url>
      <loc>http://example.com/resourcedump-part1.zip</loc>
      <rs:md type="application/zip"
             length="4765"
             at="2013-01-03T09:00:00Z"
             completed="2013-01-03T09:02:00Z"/>
      <rs:ln rel="contents"
             href="http://example.com/resourcedump_manifest-part1.xml"
             type="application/xml"/>
  </url>
  <url>
      <loc>http://example.com/resourcedump-part2.zip</loc>
      <rs:md type="application/zip"
             length="9875"
             at="2013-01-03T09:01:00Z"
             completed="2013-01-03T09:03:00Z"/>
      <rs:ln rel="contents"
             href="http://example.com/resourcedump_manifest-part2.xml"
             type="application/xml"/>
  </url>
  <url>
      <loc>http://example.com/resourcedump-part3.zip</loc>
      <rs:md type="application/zip"
             length="2298"
             at="2013-01-03T09:03:00Z"
             completed="2013-01-03T09:04:00Z"/>
      <rs:ln rel="contents"
             href="http://example.com/resourcedump_manifest-part3.xml"
             type="application/xml"/>
  </url>
</urlset>

Example 9.1: A Resource Dump

If a Source needs to or chooses to publish multiple Resource Dumps, it must group them using a Resource Dump Index, in a manner that is similar to what was described in Section 8.2.

9.1.1. Resource Dump Manifest

Each ZIP package referred to from a Resource Dump must contain a Resource Dump Manifest file that describes the package's constituent bitstreams. The file must be named manifest.xml and must be located at the top level of the ZIP package.

The Resource Dump Manifest is based on the <urlset> format. It has the <urlset> root element and the following structure:

Example 9.2 shows a Resource Dump Manifest for a ZIP file that contains two bitstreams.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up"
         href="http://example.com/dataset1/capabilitylist.xml"/>
  <rs:md capability="resourcedump-manifest"
         at="2013-01-03T09:00:00Z"
         completed="2013-01-03T09:02:00Z"/>
  <url>
      <loc>http://example.com/res1</loc>
      <lastmod>2013-01-02T13:00:00Z</lastmod>
      <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
             length="8876"
             type="text/html"
             path="/resources/res1"/>
  </url>
  <url>
      <loc>http://example.com/res2</loc>
      <lastmod>2013-01-02T14:00:00Z</lastmod>
      <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e
                   sha-256:854f61290e2e197a11bc91063afce22e43f8ccc655237050ace766adc68dc784"
             length="14599"
             type="application/pdf"
             path="/resources/res2"/>
  </url>
</urlset>

Example 9.2: A Resource Dump Manifest
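
The sketch below illustrates how a Destination might unpack one ZIP package by reading manifest.xml and then fetching each bitstream by its path attribute. It builds a small package in memory so that it is self-contained. The function name is hypothetical, and the sketch assumes that the leading slash of the path value is relative to the top level of the package, so it is stripped to obtain the ZIP member name.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def read_dump_package(zip_bytes):
    """Return {resource URI: bitstream bytes} for one ZIP package."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as package:
        manifest = ET.fromstring(package.read("manifest.xml"))
        bitstreams = {}
        for url in manifest.findall("sm:url", NS):
            md = url.find("rs:md", NS)
            if md is None or md.get("path") is None:
                continue  # entry without a bitstream in the package
            uri = url.findtext("sm:loc", default="", namespaces=NS).strip()
            member = md.get("path").lstrip("/")  # assumption: path is package-rooted
            bitstreams[uri] = package.read(member)
        return bitstreams


MANIFEST = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcedump-manifest" at="2013-01-03T09:00:00Z"/>
  <url>
      <loc>http://example.com/res1</loc>
      <rs:md type="text/html" path="/resources/res1"/>
  </url>
</urlset>"""

# Build a tiny package in memory for demonstration purposes.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("manifest.xml", MANIFEST)
    z.writestr("resources/res1", b"<html>res1</html>")
package_bytes = buffer.getvalue()

print(read_dump_package(package_bytes))
```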

10. Describing Changes

A Source may publish a record of the changes to its resources. This enables Destinations to efficiently learn about those changes and hence to synchronize incrementally.

10.1. Change List

A Change List is a document that contains a description of changes to a Source's resources. It is up to the Source to determine the publication frequency of Change Lists, as well as the temporal interval they cover. For example, a Source may choose to publish a fixed number of changes per Change List, or all the changes in a period of fixed length, such as an hour, a day, or a week. All entries in a Change List must be provided in forward chronological order: the least recently changed resource must be listed at the beginning of the Change List, while the most recently changed resource must be listed at the end of the document. If a resource underwent multiple changes in the period covered by a Change List, then it will be listed multiple times, once per change.

A Change List is based on the <urlset> document format introduced by the Sitemap protocol. It has the <urlset> root element and the following structure:

The temporal interval covered by a Change List is conveyed by means of the from and until attributes of the <rs:md> child element of the <urlset> root element. The from attribute indicates that the Change List includes all changes that occurred to the set of resources at the Source since the datetime expressed as the value of the attribute. If it exists, the until attribute indicates that the Change List includes all changes that occurred to the set of resources at the Source up until the datetime expressed as the value of the attribute. Its use is optional for Change Lists:

The from and until attributes help a Destination to determine whether it has or has not fully processed a Change List. The forward chronological order of changes in a Change List, the datetime of a resource change, and the URI of a changed resource help the Destination to determine what the first unprocessed change in a not fully processed Change List is. The Destination can start processing there; it can retrieve a representation of a changed resource by dereferencing its URI provided in the <loc> child element of the <url> element that conveys the change. In order for the determination of the first unprocessed change to be accurate, the combination of the URI of a changed resource and the datetime of its change should be unique. Hence, a Source should provide change datetime values at a sufficiently fine granularity.

Example 10.1 shows a Change List that indicates that four resource changes occurred since 2013-01-03T00:00:00Z: one creation, two updates, and one deletion. One resource underwent two of these changes and hence is listed twice. The Change List has no until attribute, which indicates that it will report further changes; a Destination should keep polling this Change List.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up"
         href="http://example.com/dataset1/capabilitylist.xml"/>
  <rs:md capability="changelist"
         from="2013-01-03T00:00:00Z"/>
  <url>
      <loc>http://example.com/res1.html</loc>
      <lastmod>2013-01-03T11:00:00Z</lastmod>
      <rs:md change="created"/>
  </url>
  <url>
      <loc>http://example.com/res2.pdf</loc>
      <lastmod>2013-01-03T13:00:00Z</lastmod>
      <rs:md change="updated"/>
  </url>
  <url>
      <loc>http://example.com/res3.tiff</loc>
      <lastmod>2013-01-03T18:00:00Z</lastmod>
      <rs:md change="deleted"/>
  </url>
  <url>
      <loc>http://example.com/res2.pdf</loc>
      <lastmod>2013-01-03T21:00:00Z</lastmod>
      <rs:md change="updated"/>
  </url>
</urlset>

Example 10.1: An open Change List describing four resource changes
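
The following sketch illustrates how a Destination might resume processing of a Change List such as the one in Example 10.1. It remembers the URI and datetime of the last change it processed and returns the entries that follow it. The function names are hypothetical, the datetime parser handles only the full "YYYY-MM-DDThh:mm:ssZ" form used in the examples, and falling back to a datetime comparison when the remembered entry is not found is a design choice of this sketch rather than a requirement of the specification.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def parse_datetime(text):
    """Parse a datetime such as 2013-01-03T11:00:00Z into an aware datetime."""
    return datetime.fromisoformat(text.strip().replace("Z", "+00:00"))


def read_changes(change_list_xml):
    """Return a list of (uri, datetime, change) tuples in document order."""
    root = ET.fromstring(change_list_xml)
    changes = []
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        changes.append((
            url.findtext("sm:loc", default="", namespaces=NS).strip(),
            parse_datetime(url.findtext("sm:lastmod", default="", namespaces=NS)),
            md.get("change") if md is not None else None,
        ))
    return changes


def unprocessed_changes(changes, last_processed):
    """Return the changes that follow last_processed, a (uri, datetime) pair or None."""
    if last_processed is None:
        return list(changes)
    for i, (uri, when, _change) in enumerate(changes):
        if (uri, when) == last_processed:
            return changes[i + 1:]
    # Remembered entry not present: keep only strictly later changes.
    return [c for c in changes if c[1] > last_processed[1]]


CHANGE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2013-01-03T00:00:00Z"/>
  <url><loc>http://example.com/res1.html</loc>
       <lastmod>2013-01-03T11:00:00Z</lastmod><rs:md change="created"/></url>
  <url><loc>http://example.com/res2.pdf</loc>
       <lastmod>2013-01-03T13:00:00Z</lastmod><rs:md change="updated"/></url>
  <url><loc>http://example.com/res3.tiff</loc>
       <lastmod>2013-01-03T18:00:00Z</lastmod><rs:md change="deleted"/></url>
  <url><loc>http://example.com/res2.pdf</loc>
       <lastmod>2013-01-03T21:00:00Z</lastmod><rs:md change="updated"/></url>
</urlset>"""

marker = ("http://example.com/res2.pdf", parse_datetime("2013-01-03T13:00:00Z"))
for uri, when, change in unprocessed_changes(read_changes(CHANGE_LIST), marker):
    print(when.isoformat(), change, uri)
```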

10.2. Change List Index

If a Source needs to publish multiple Change Lists, it must group them in a Change List Index. A Change List Index must enumerate Change Lists in forward chronological order.

A Change List Index is based on the <sitemapindex> document format introduced by the Sitemap protocol. It has the <sitemapindex> root element and the following structure:

The Destination can determine whether it has reached a Change List or a Change List Index based on whether the root element is <urlset> or <sitemapindex> respectively.
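
A minimal sketch of this root element check is shown below. The function name document_kind and its return values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def document_kind(xml_text):
    """Return 'index' for a <sitemapindex> root and 'list' for a <urlset> root."""
    root = ET.fromstring(xml_text)
    # ElementTree reports qualified names as "{namespace}localname".
    if root.tag == "{%s}sitemapindex" % SITEMAP_NS:
        return "index"
    if root.tag == "{%s}urlset" % SITEMAP_NS:
        return "list"
    raise ValueError("unexpected root element: %s" % root.tag)


print(document_kind('<urlset xmlns="%s"/>' % SITEMAP_NS))
print(document_kind('<sitemapindex xmlns="%s"/>' % SITEMAP_NS))
```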

A Change List Index that points to three Change Lists is shown in Example 10.2. Two of those Change Lists are closed, as indicated by the presence of the until attribute in their <rs:md> elements, and one is open, as indicated by its absence. The closed Change List http://example.com/20130102-changelist.xml is shown in Example 10.3. Note that the values of the from and until attributes for this Change List in the Change List Index are the same as those in the Change List itself: 2013-01-02T00:00:00Z and 2013-01-03T00:00:00Z. The open Change List could be the one shown in Example 10.1, in which case that list would have an additional link with an index relation type pointing to the Change List Index.

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up"
         href="http://example.com/dataset1/capabilitylist.xml"/>
  <rs:md capability="changelist"
         from="2013-01-01T00:00:00Z"/>
  <sitemap>
      <loc>http://example.com/20130101-changelist.xml</loc>
      <rs:md from="2013-01-01T00:00:00Z" 
             until="2013-01-02T00:00:00Z"/>
  </sitemap>
  <sitemap>
      <loc>http://example.com/20130102-changelist.xml</loc>
      <rs:md from="2013-01-02T00:00:00Z" 
             until="2013-01-03T00:00:00Z"/>
  </sitemap>
  <sitemap>
      <loc>http://example.com/20130103-changelist.xml</loc>
      <rs:md from="2013-01-03T00:00:00Z"/>
  </sitemap>
</sitemapindex>

Example 10.2: A Change List Index

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up"
         href="http://example.com/dataset1/capabilitylist.xml"/>
  <rs:ln rel="index"
         href="http://example.com/dataset1/changelist.xml"/>
  <rs:md capability="changelist"
         from="2013-01-02T00:00:00Z"
         until="2013-01-03T00:00:00Z"/>
  <url>
      <loc>http://example.com/res7.html</loc>
      <lastmod>2013-01-02T12:00:00Z</lastmod>
      <rs:md change="created"/>
  </url>
  <url>
      <loc>http://example.com/res9.pdf</loc>
      <lastmod>2013-01-02T13:00:00Z</lastmod>
      <rs:md change="updated"/>
  </url>
  <url>
      <loc>http://example.com/res5.tiff</loc>
      <lastmod>2013-01-02T19:00:00Z</lastmod>
      <rs:md change="deleted"/>
  </url>
  <url>
      <loc>http://example.com/res7.html</loc>
      <lastmod>2013-01-02T20:00:00Z</lastmod>
      <rs:md change="updated"/>
  </url>
</urlset>

Example 10.3: A closed Change List pointing back to its Index
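
Given a Change List Index such as the one in Example 10.2, a Destination that is already synchronized up to some datetime could limit itself to the Change Lists that may still hold unseen changes. The sketch below selects entries that are open (no until attribute) or whose until value lies after that datetime. The function names and the shortened sample index are hypothetical, and the datetime parser handles only the "YYYY-MM-DDThh:mm:ssZ" form used in the examples.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def parse_datetime(text):
    """Parse a datetime such as 2013-01-02T00:00:00Z into an aware datetime."""
    return datetime.fromisoformat(text.strip().replace("Z", "+00:00"))


def change_lists_to_fetch(index_xml, synced_until):
    """Return URIs of Change Lists that are open or extend beyond synced_until."""
    root = ET.fromstring(index_xml)
    todo = []
    for entry in root.findall("sm:sitemap", NS):
        md = entry.find("rs:md", NS)
        until = md.get("until") if md is not None else None
        if until is None or parse_datetime(until) > synced_until:
            todo.append(entry.findtext("sm:loc", default="", namespaces=NS).strip())
    return todo


INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2013-01-01T00:00:00Z"/>
  <sitemap>
      <loc>http://example.com/20130101-changelist.xml</loc>
      <rs:md from="2013-01-01T00:00:00Z" until="2013-01-02T00:00:00Z"/>
  </sitemap>
  <sitemap>
      <loc>http://example.com/20130102-changelist.xml</loc>
      <rs:md from="2013-01-02T00:00:00Z" until="2013-01-03T00:00:00Z"/>
  </sitemap>
  <sitemap>
      <loc>http://example.com/20130103-changelist.xml</loc>
      <rs:md from="2013-01-03T00:00:00Z"/>
  </sitemap>
</sitemapindex>"""

print(change_lists_to_fetch(INDEX, parse_datetime("2013-01-02T12:00:00Z")))
```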

11. Packaging Changes

In order to reduce the number of requests required to obtain resource changes, a Source may provide packaged bitstreams for changed resources.

11.1. Change Dump

To make content changes available for download, a Source can publish Change Dumps. A Change Dump is a document that points to packages containing bitstreams for the Source's changed resources. The ResourceSync framework specifies the use of the ZIP file format as the packaging format. Communities can define their own packaging format. A Change Dump should only point to packages of the same format.

It is up to the Source to determine the publication frequency of these packages, as well as the temporal interval they cover. For example, a Source may choose to publish a fixed number of changes per package, or all the changes in a period of fixed length, such as an hour, a day, or a week. If a resource underwent multiple changes in the period covered by a package, then the package will contain multiple bitstreams for the resource, one per change. As new packages are published, new entries are added to the Change Dump that points at them. All entries in a Change Dump should be provided in forward chronological order: the least recently published package is listed at the beginning of the Change Dump and the most recently published package at the end.

A Change Dump is based on the <urlset> document format introduced by the Sitemap protocol. It has the <urlset> root element and the following structure:

Example 11.1 shows a Change Dump with pointers to three bitstream packages associated with changed resources. The absence of the until attribute indicates that further packages will be added. The example also includes within each <url> element a pointer to a copy of the Change Dump Manifest associated with the package. This pointer is optional and intended to allow a Destination to determine whether the package should be downloaded. If such pointers are provided, the Source must ensure that the Manifest referred to matches the Manifest included in the bitstream package.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up"
         href="http://example.com/dataset1/capabilitylist.xml"/>
  <rs:md capability="changedump"
         from="2013-01-01T00:00:00Z"/>
  <url>
      <loc>http://example.com/20130101-changedump.zip</loc>
      <lastmod>2013-01-01T23:59:59Z</lastmod>
      <rs:md type="application/zip" 
             length="3109"
             from="2013-01-01T00:00:00Z"
             until="2013-01-02T00:00:00Z"/>           
      <rs:ln rel="contents"
             href="http://example.com/20130101-changedump-manifest.xml"
             type="application/xml"/>
  </url>
  <url>
      <loc>http://example.com/20130102-changedump.zip</loc>
      <lastmod>2013-01-02T23:59:59Z</lastmod>
      <rs:md type="application/zip"
             length="6629"
             from="2013-01-02T00:00:00Z"
             until="2013-01-03T00:00:00Z"/>
      <rs:ln rel="contents"
             href="http://example.com/20130102-changedump-manifest.xml"
             type="application/xml"/>
  </url>
  <url>
      <loc>http://example.com/20130103-changedump.zip</loc>
      <lastmod>2013-01-03T23:59:59Z</lastmod>
      <rs:md type="application/zip"
             length="8124"
             from="2013-01-03T00:00:00Z"
             until="2013-01-04T00:00:00Z"/>
      <rs:ln rel="contents"
             href="http://example.com/20130103-changedump-manifest.xml"
             type="application/xml"/>
  </url>
</urlset>

Example 11.1: A Change Dump

If a Source needs to publish multiple Change Dumps, it must group them in a Change Dump Index, in a manner similar to what was described in Section 10.2.

11.1.1. Change Dump Manifest

Each ZIP package referred to from a Change Dump must contain a Change Dump Manifest file that describes the constituent bitstreams of the package. The file must be named manifest.xml and must be located at the top level of the ZIP package. All entries in a Change Dump Manifest must be provided in forward chronological order: the bitstream associated with the least recent resource change is listed first, and the bitstream associated with the most recent change is listed last.

The Change Dump Manifest is based on the <urlset> format. It has the <urlset> root element and the following structure:

Example 11.2 shows the Change Dump Manifest associated with the second entry in the Change Dump from Example 11.1. The Manifest must be named manifest.xml and located at the top level of the ZIP package. A copy of the Manifest may also be provided at a location indicated by an optional <rs:ln> element with the relation type contents in the Change Dump; in Example 11.1 this is http://example.com/20130102-changedump-manifest.xml. The Manifest covers the same changes as conveyed in the closed Change List of Example 10.3. The resource http://example.com/res7.html is listed twice, once because it was created and once because it was updated. Both entries have the same URI. The ZIP package in which this Manifest is contained has two bitstreams for this resource, available at different paths in the package.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up"
         href="http://example.com/dataset1/capabilitylist.xml"/>
  <rs:md capability="changedump-manifest"
         from="2013-01-02T00:00:00Z"
         until="2013-01-03T00:00:00Z"/>
  <url>
      <loc>http://example.com/res7.html</loc>
      <lastmod>2013-01-02T12:00:00Z</lastmod>
      <rs:md change="created"
             hash="md5:1c1b0e264fa9b7e1e9aa6f9db8d6362b"
             length="4339"
             type="text/html"
             path="/changes/res7.html"/>
  </url>
  <url>
      <loc>http://example.com/res9.pdf</loc>
      <lastmod>2013-01-02T13:00:00Z</lastmod>
      <rs:md change="updated"
             hash="md5:f906610c3d4aa745cb2b986f25b37c5a"
             length="38297"
             type="application/pdf"
             path="/changes/res9.pdf"/>
  </url>
  <url>
      <loc>http://example.com/res5.tiff</loc>
      <lastmod>2013-01-02T19:00:00Z</lastmod>
      <rs:md change="deleted"/>
  </url>
  <url>
      <loc>http://example.com/res7.html</loc>
      <lastmod>2013-01-02T20:00:00Z</lastmod>
      <rs:md change="updated"
             hash="md5:0988647082c8bc51778894a48ec3b576"
             length="5426"
             type="text/html"
             path="/changes/res7-v2.html"/>
  </url>
</urlset>

Example 11.2: A Change Dump Manifest
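
As an illustration of how a Destination might process a Manifest such as the one in Example 11.2, the following Python sketch lists the entries in document order and checks the forward chronological ordering required above. The names parse_utc and read_manifest and the tuple layout are assumptions made for this illustration and are not defined by this specification. The sketch only handles the UTC datetime form used in the examples.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def parse_utc(value):
    """Parse the UTC form used in the examples, e.g. 2013-01-02T12:00:00Z."""
    return datetime.strptime(value.strip(), "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )


def read_manifest(manifest_xml):
    """Return (uri, change, lastmod, path) tuples in document order.

    Raises ValueError when an entry is older than the one before it.
    path is None for entries without a bitstream, such as deletions.
    """
    root = ET.fromstring(manifest_xml)
    entries = []
    previous = None
    for url in root.findall("sm:url", NS):
        uri = url.findtext("sm:loc", namespaces=NS).strip()
        lastmod = parse_utc(url.findtext("sm:lastmod", namespaces=NS))
        if previous is not None and lastmod < previous:
            raise ValueError(f"entry for {uri} is out of chronological order")
        previous = lastmod
        md = url.find("rs:md", NS)
        change = md.get("change") if md is not None else None
        path = md.get("path") if md is not None else None
        entries.append((uri, change, lastmod, path))
    return entries
```

Applied to Example 11.2, such a function would return four entries, with http://example.com/res7.html appearing twice under two different path values. A Destination could then apply the entries in order, so that the later bitstream for a resource replaces the earlier one.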

12. Linking to Related Resources

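In order to facilitate alternative approaches to obtain content for a resource that is subject to synchronization or to provide additional information about it, a Source may provide links from that resource to related resources. Such links can occur in Resource Lists, Resource Dump Manifests, Change Lists, and Change Dump Manifests. The following cases are considered, and detailed (in examples of Change Lists) in the remainder of this section: mirrored content (Section 12.1), alternate representations (Section 12.2), patching content (Section 12.3), resources and metadata about resources (Section 12.4), prior versions of resources (Section 12.5), collection membership (Section 12.6), and republishing resources (Section 12.7).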

As always, the <loc> child element of <url> conveys the URI of the resource that is subject to synchronization. The related resource is provided by means of the <rs:ln> child element of <url>. The possible attributes for <rs:ln> as well as the link relation types used to address the aforementioned use cases are detailed in Section 5. Links to meet needs other than the ones listed may be provided, and appropriate relation types may be selected from the IANA Link Relation Type Registry or expressed as URIs as specified in RFC 5988, Sec. 4.2.

In case a Destination is not able to adequately interpret the information conveyed in a <rs:ln> element, it should refrain from accessing the related resource and rather use the URI provided in <loc> to retrieve the resource.

12.1. Mirrored Content

In order to reduce the load on its primary access mechanism, a Source may convey one or more mirror locations for a resource. A <rs:ln> element is introduced to express each mirror location for the resource. This element has the following attributes:

Example 12.1 shows how a Source conveys information about prioritized mirror locations for a resource. Since the three locations conveyed by <rs:ln> elements point to duplicates of the resource specified in <loc>, the values for each of the attributes of <rs:md> are expected to be identical for the resource and its mirrors. Hence, they should be omitted from the <rs:ln> elements. The last <rs:ln> element points to a mirror location where the resource is accessible via a protocol other than HTTP as can be seen from the URI scheme. Even though the resources are duplicates, their last modified datetimes may vary.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up"
         href="http://example.com/dataset1/capabilitylist.xml"/>
  <rs:md capability="changelist"
         from="2013-01-03T00:00:00Z"/>
  <url>
      <loc>http://example.com/res1</loc>
      <lastmod>2013-01-03T18:00:00Z</lastmod>
      <rs:md change="updated"
             hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
             length="8876"
             type="text/html"/>
      <rs:ln rel="duplicate"
             pri="1"
             href="http://mirror1.example.com/res1"
             modified="2013-01-03T18:00:00Z"/>
      <rs:ln rel="duplicate"
             pri="2"
             href="http://mirror2.example.com/res1"
             modified="2013-01-03T18:00:00Z"/>
      <rs:ln rel="duplicate"
             pri="3"
             href="gsiftp://gridftp.example.com/res1"
             modified="2013-01-03T18:00:00Z"/>
  </url>
</urlset>

Example 12.1: Mirrored content
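
The following Python sketch illustrates one way a Destination might turn an entry like the one in Example 12.1 into an ordered list of URIs to try. It rests on several assumptions that are not stated in this section: that a lower pri value means a more preferred mirror (as in RFC 6249), that a mirror without a pri attribute is tried last, and that the Destination only understands the URI schemes it is given. Mirrors with other schemes are skipped, and the URI in <loc> is always kept as the final fallback. The function name download_candidates is hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def download_candidates(url_elem, supported_schemes=("http", "https")):
    """Return URIs to try: usable mirrors by ascending pri, then the <loc> URI."""
    mirrors = []
    for ln in url_elem.findall("rs:ln", NS):
        if ln.get("rel") != "duplicate":
            continue
        href = ln.get("href")
        if href is None or urlsplit(href).scheme not in supported_schemes:
            continue  # link cannot be interpreted by this Destination
        # Assumption: missing pri sorts after any explicit priority.
        pri = int(ln.get("pri", "999999"))
        mirrors.append((pri, href))
    mirrors.sort()
    loc = url_elem.findtext("sm:loc", namespaces=NS).strip()
    return [href for _, href in mirrors] + [loc]
```

For the entry in Example 12.1 and the default schemes, the gsiftp mirror would be skipped and the two HTTP mirrors would be tried before http://example.com/res1.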

12.2. Alternate Representations

A resource may have multiple representations available from different URIs. A resource may, for example, be identified by a generic URI such as http://example.com/res1. After performing content negotiation with the server, a client may, for example, obtain the resource's HTML representation available from the specific URI http://example.com/res1.html. Another client may ask for and retrieve the PDF representation of the resource from the specific URI http://example.com/res1.pdf. Which representation a client obtains can depend on, among other things, its preferences in terms of Media Type and language, its geographical location, and its device type.

A Source can express that a resource is subject to synchronization by conveying its generic URI in <loc>. In this case, one <rs:ln> element is introduced for each alternate representation that the Source wants to advertise. This element has the following attributes:

Cases exist in which there is no generic URI for a resource, only specific URIs. This may occur, for example, when a resource has different representations available for different devices. In this case the URI in <loc> will be a specific URI, and <rs:ln> elements with an alternate relation type are still used to refer to alternate representations available from other specific URIs.

Example 12.2 shows how to promote a generic URI in <loc> while also pointing to alternate representations available from specific URIs, for example, through content negotiation.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up"
         href="http://example.com/dataset1/capabilitylist.xml"/>
  <rs:md capability="changelist"
         from="2013-01-03T11:00:00Z"/>
  <url>
      <loc>http://example.com/res1</loc>
      <lastmod>2013-01-03T18:00:00Z</lastmod>
      <rs:md change="updated"/>
      <rs:ln rel="alternate"
             href="http://example.com/res1.html"
             modified="2013-01-03T18:00:00Z"
             type="text/html"/>
      <rs:ln rel="alternate"
             href="http://example.com/res1.pdf"
             modified="2013-01-03T18:00:00Z"
             type="application/pdf"/>
  </url>
</urlset>

Example 12.2: Generic URI and alternates with specific URIs
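
As a rough sketch of how a Destination might use the links in Example 12.2, the Python function below picks the specific URI whose Media Type comes first in the Destination's own preference list, and falls back to the generic URI in <loc> when no alternate matches. The function name pick_alternate and the idea of a simple ordered preference list are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def pick_alternate(url_elem, preferred_types):
    """Return the href of the best matching alternate, else the <loc> URI."""
    alternates = {}
    for ln in url_elem.findall("rs:ln", NS):
        if ln.get("rel") == "alternate" and ln.get("type") and ln.get("href"):
            # Keep the first link seen for each Media Type.
            alternates.setdefault(ln.get("type"), ln.get("href"))
    for media_type in preferred_types:
        if media_type in alternates:
            return alternates[media_type]
    return url_elem.findtext("sm:loc", namespaces=NS).strip()
```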

In cases where a particular representation is considered the subject of synchronization, its specific URI is provided in <loc>. The associated generic URI, if one exists, can be provided using a <rs:ln> element. This element has the following attributes:

This approach might be most appropriate for Resource Dump Manifests and Change Dump Manifests that describe bitstreams contained in a ZIP file.

Example 12.3 shows a Source promoting a specific URI in <loc> while also pointing to the resource's generic URI by means of an <rs:ln> element. Metadata pertaining to the representation available from that specific URI is conveyed by means of attributes of the <rs:md> element.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up"
         href="http://example.com/dataset1/capabilitylist.xml"/>
  <rs:md capability="changelist"
         from="2013-01-03T00:00:00Z"/>
  <url>
      <loc>http://example.com/res1.html</loc>
      <lastmod>2013-01-03T18:00:00Z</lastmod>
      <rs:md change="updated"
             hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
             length="8876"/>
      <rs:ln rel="canonical"
             href="http://example.com/res1"
             modified="2013-01-03T18:00:00Z"/>
  </url>
</urlset>

Example 12.3: Specific URI and alternate with generic URI

12.3. Patching Content

In order to increase the efficiency of updating a resource, a Source may make a description of the changes that the resource underwent available, in addition to the entire changed resource. Especially when frequent minor changes and/or changes to large resources are concerned, such an approach may be attractive. It will, however, require an unambiguous way to describe the changes in such a way that a Destination can construct the most recent version of the resource by appropriately patching the previous version with the description of the changes.

A Source can express that it makes a description of resource changes available by providing the URI of the resource in <loc>, as usual, and by introducing a <rs:ln> element with the following attributes:

Example 12.4 shows a Source that conveys the changes that a JSON resource underwent using the application/json-patch Media Type introduced in JSON Patch. It also shows the Source conveying changes to a large TIFF file using an experimental Media Type that may, for example, be described in a community specification. A Destination that does not understand the Media Type should ignore the description of changes and use the URI in <loc> to obtain the most recent version of the resource. Another example of a well-specified Media Type for expressing changes to XML documents is application/patch-ops-error+xml, as specified in RFC 5261.

Expressing resource changes in this manner is only applicable to Change Lists (as in Example 12.4) and Change Dumps. When doing so for a Change Dump, the entry in the Change Dump Manifest must have the path attribute for the <rs:ln> element that points to the change description that is included in the content package.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up"
         href="http://example.com/dataset1/capabilitylist.xml"/>
  <rs:md capability="changelist"
         from="2013-01-03T00:00:00Z"/>
  <url>
      <loc>http://example.com/res4</loc>
      <lastmod>2013-01-03T17:00:00Z</lastmod>
      <rs:md change="updated"
             hash="sha-256:f4OxZX_x_DFGFDgghgdfb6rtSx-iosjf6735432nklj"
             length="56778"
             type="application/json"/>
      <rs:ln rel="http://www.openarchives.org/rs/terms/patch"
             href="http://example.com/res4-json-patch"
             modified="2013-01-03T17:00:00Z"
             hash="sha-256:y66dER_t_HWEIKpesdkeb7rtSc-ippjf9823742opld"
             length="73"
             type="application/json-patch"/>
  </url>
  <url>
      <loc>http://example.com/res5-full.tiff</loc>
      <lastmod>2013-01-03T18:00:00Z</lastmod>
      <rs:md change="updated"
             hash="sha-256:f4OxZX_x_FO5LcGBSKHWXfwtSx-j1ncoSt3SABJtkGk"
             length="9788456778"
             type="image/tiff"/>
      <rs:ln rel="http://www.openarchives.org/rs/terms/patch"
             href="http://example.com/res5-diff"
             modified="2013-01-03T18:00:00Z"
             hash="sha-256:h986gT_t_87HTkjHYE76G558hY-jdfgy76t55sadJUYT"
             length="4533"
             type="application/x-tiff-diff"/>
  </url>
</urlset>

Example 12.4: A Change List with links to documents that detail how to patch resources
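
The fallback behavior described for Example 12.4 can be sketched in Python as follows. The function decides, per <url> entry, whether the Destination should fetch a change description or the full resource. The function name plan_update, the returned tuple, and the notion of a set of understood patch Media Types are assumptions for this illustration. Applying the patch itself is out of scope here.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}
PATCH_REL = "http://www.openarchives.org/rs/terms/patch"


def plan_update(url_elem, understood_patch_types):
    """Return ("patch", href) for a usable change description, else ("full", loc)."""
    loc = url_elem.findtext("sm:loc", namespaces=NS).strip()
    for ln in url_elem.findall("rs:ln", NS):
        if (
            ln.get("rel") == PATCH_REL
            and ln.get("href")
            and ln.get("type") in understood_patch_types
        ):
            return ("patch", ln.get("href"))
    # Unknown or missing patch Media Type: retrieve the whole resource.
    return ("full", loc)
```

A Destination that understands only application/json-patch would, for the entries in Example 12.4, fetch the patch for res4 and the complete TIFF for res5-full.tiff.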

12.4. Resources and Metadata about Resources

Cases exist where both resources and metadata about those resources must be synchronized. From the ResourceSync perspective, both the resource and the metadata about it are regarded as resources with distinct URIs that are subject to synchronization. As usual, each gets its distinct <url> block and each URI is conveyed in a <loc> child element of the respective block. If required, the inter-relationship between both resources is expressed by means of a <rs:ln> element with appropriate relation types added to each block. The <rs:ln> element has the following attributes:

Example 12.5 shows how a Source can express this inter-relationship between the two resources. Note that a Destination can use the metadata that describes a resource as a filtering mechanism to synchronize only with those resources that meet its metadata-based selection criteria. Note also that the <url> element that conveys the metadata record update uses a link with the profile relation type [RFC 6906] to express the kind of metadata that is used to describe the resource, in this case identified by its XML Namespace. The link should provide a URI that supports the Destination in interpreting the metadata information. For example, it could refer to a namespace, an XML schema, or a description of MARC.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up"
         href="http://example.com/dataset1/capabilitylist.xml"/>
  <rs:md capability="changelist"
         from="2013-01-03T00:00:00Z"/>
  <url>
      <loc>http://example.com/res2.pdf</loc>
      <lastmod>2013-01-03T18:00:00Z</lastmod>
      <rs:md change="updated"
             hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
             length="8876"
             type="application/pdf"/>
      <rs:ln rel="describedby"
             href="http://example.com/res2_dublin-core_metadata.xml"
             modified="2013-01-01T12:00:00Z"
             type="application/xml"/>
  </url>
  <url>
      <loc>http://example.com/res2_dublin-core_metadata.xml</loc>
      <lastmod>2013-01-03T19:00:00Z</lastmod>
      <rs:md change="updated"
             type="application/xml"/>
      <rs:ln rel="describes"
             href="http://example.com/res2.pdf"
             modified="2013-01-03T18:00:00Z"
             hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
             length="8876"
             type="application/pdf"/>
      <rs:ln rel="profile"
             href="http://purl.org/dc/elements/1.1/"/>
  </url>
</urlset>

Example 12.5: Linking between a resource and metadata about a resource in a Change List

12.5. Prior Versions of Resources

A Source may provide access to prior versions of a resource to allow Destinations to obtain a historical perspective, rather than just remaining synchronized with the most recent version. The approach to do so leverages a common resource versioning paradigm that consists of:

When communicating about the resource, its time-generic URI is provided in <loc> and <lastmod> must be used to provide the resource's last modification time.

A first approach consists of conveying the time-specific URI of the resource that corresponds with the time of last modification, as given in the <lastmod> element. This is achieved by introducing a single <rs:ln> element with the following attributes:

A second approach consists of pointing to a TimeGate associated with the time-generic resource. A TimeGate supports negotiation in the datetime dimension, as introduced in the Memento protocol [Memento Internet Draft], to obtain a version of the resource as it existed at a specified moment in time. This allows the Destination to obtain the version that existed at the time of last modification by using the <lastmod> value in the datetime negotiation process, but also allows the Destination to obtain other versions by using different datetime values. A pointer to a TimeGate is introduced by using a <rs:ln> element with the following attributes:

A third approach consists of pointing to a TimeMap associated with the time-generic resource. A TimeMap, as introduced in the Memento protocol [Memento Internet Draft], enables Destinations to retrieve a comprehensive list of all time-specific resources known to a server. This allows Destinations to choose a particular version of a resource from that list. A pointer to a TimeMap is introduced by using a <rs:ln> element with the following attributes:

Example 12.6 shows a Change List with a link to a prior version of a resource, a link to a TimeGate, as well as a link to a TimeMap. Note that the values of the hash, length, and type attributes are identical between the <rs:md> child element and the <rs:ln> child element that points to the prior version.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up"
         href="http://example.com/dataset1/capabilitylist.xml"/>
  <rs:md capability="changelist"
         from="2013-01-03T00:00:00Z"/>
  <url>
      <loc>http://example.com/res1</loc>
      <lastmod>2013-01-03T18:00:00Z</lastmod>
      <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
             length="8876"
             type="text/html"
             change="updated"/>
      <rs:ln rel="memento"
             href="http://example.com/20130103070000/res1"
             modified="2013-01-02T18:00:00Z"
             hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
             length="8876"
             type="text/html"/>
      <rs:ln rel="timegate"
             href="http://example.com/timegate/http://example.com/res1"/>
      <rs:ln rel="timemap"
             href="http://example.com/timemap/http://example.com/res1"
             type="application/link-format"/>
  </url>
</urlset>

Example 12.6: Links to a resource version, and a Memento TimeGate and TimeMap
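
To illustrate the second approach, the Python sketch below prepares (but does not send) a request to the TimeGate from Example 12.6, using the <lastmod> value as the desired datetime. The Accept-Datetime request header and its HTTP-date format come from the Memento protocol rather than from this specification, and the function names are hypothetical. The sketch only handles the UTC datetime form used in the examples.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def accept_datetime_value(lastmod):
    """Convert e.g. 2013-01-03T18:00:00Z into an HTTP-date string in GMT."""
    moment = datetime.strptime(lastmod, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc
    )
    return format_datetime(moment, usegmt=True)


def timegate_request(timegate_uri, lastmod):
    """Build a request asking the TimeGate for the version current at lastmod."""
    return Request(
        timegate_uri,
        headers={"Accept-Datetime": accept_datetime_value(lastmod)},
    )
```

Passing a different datetime value to the same function would ask the TimeGate for another version of the resource.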

12.6. Collection Membership

A Source can express that a resource is a member of a collection such as an OAI-ORE Aggregation or an OAI-PMH Set. A Source can express collection membership of a resource that is subject to synchronization by providing the URI of that resource in <loc>, as usual, and by introducing a <rs:ln> element with the following attributes:

Example 12.7 shows a Change List with one resource that is a member of an OAI-ORE Aggregation.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up"
         href="http://example.com/dataset1/capabilitylist.xml"/>
  <rs:md capability="changelist"
         from="2013-01-03T00:00:00Z"/>
  <url>
      <loc>http://example.com/res1</loc>
      <lastmod>2013-01-03T07:00:00Z</lastmod>
      <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
             length="8876"
             type="text/html"
             change="updated"/>
      <rs:ln rel="collection"
             href="http://example.com/aggregation/0601007"/>
  </url>
</urlset>

Example 12.7: A resource as a member of a collection

12.7. Republishing Resources

A special kind of Destination, henceforth called an aggregator, may retrieve content from a Source, republish it, and in its turn act as a Source for the republished content. In such an aggregator scenario, it may be important for a Destination that synchronizes with the aggregator to understand the provenance of the content and to be able to verify its accuracy with the original Source of the content. When communicating about a republished resource, the aggregator can provide such provenance information by introducing a <rs:ln> element with the following attributes:

Example 12.8 shows a Change List in which a Source publishes information about a change to a single resource.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up"
         href="http://example.com/dataset1/capabilitylist.xml"/>
  <rs:md capability="changelist"
         from="2013-01-03T00:00:00Z"/>
  <url>
      <loc>http://original.example.com/res1.html</loc>
      <lastmod>2013-01-03T07:00:00Z</lastmod>
      <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
             length="8876"
             type="text/html"
             change="updated"/>
  </url>
</urlset>

Example 12.8: An original Source publishes

Example 12.9 shows a primary aggregator's Change List that refers to the original Source's resource. It includes a link with the relation type via that has attributes such as href to convey information about the origin of the resource. This information corresponds with the data provided in the <url> block of the Change List shown in Example 12.8. For example, the value of the href attribute in Example 12.9 equals the value of the <loc> child element in the <url> block in Example 12.8.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up"
         href="http://aggregator1.example.com/dataset1/capabilitylist.xml"/>
  <rs:md capability="changelist"
         from="2013-01-03T11:00:00Z"/>
  <url>
      <loc>http://aggregator1.example.com/res1.html</loc>
      <lastmod>2013-01-03T20:00:00Z</lastmod>
      <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
             length="8876"
             type="text/html"
             change="updated"/>
      <rs:ln rel="via"
             href="http://original.example.com/res1.html"
             modified="2013-01-03T07:00:00Z"
             hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
             length="8876"
             type="text/html"/>
  </url>
</urlset>

Example 12.9: A primary aggregator republishes
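
A Destination that wants to verify an aggregator's copy against the information in the via link could compare a digest of the retrieved content with the hash attribute. The Python sketch below does this for MD5 values written as in the examples (md5: followed by a hexadecimal digest). It assumes that the hash attribute may hold several whitespace-separated values, and it does not attempt to handle other algorithms. The function name matches_md5 is hypothetical.

```python
import hashlib


def matches_md5(content, hash_attr):
    """Compare bytes against the md5 value in a hash attribute.

    Returns True or False when an md5 value is present, and None when
    the attribute carries no md5 value to compare against.
    """
    for token in hash_attr.split():
        algorithm, _, digest = token.partition(":")
        if algorithm == "md5":
            return hashlib.md5(content).hexdigest() == digest.lower()
    return None
```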

If a secondary aggregator obtains the changed resource by consuming the Change List of the primary aggregator and republishes its Change List, a chain of aggregations is created. In this case each aggregator should maintain only the existing via link in order to convey information about the origin of the resource.

Example 12.10 shows the Change List of a secondary aggregator with information about the changed resource and a via link identical to the one in Example 12.9. The data conveyed with the link corresponds to the data provided in the <url> block in Example 12.8.

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up"
         href="http://aggregator2.example.com/dataset1/capabilitylist.xml"/>
  <rs:md capability="changelist"
         from="2013-01-03T12:00:00Z"/>
  <url>
      <loc>http://aggregator2.example.com/res1.html</loc>
      <lastmod>2013-01-04T09:00:00Z</lastmod>
      <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
             length="8876"
             type="text/html"
             change="updated"/>
      <rs:ln rel="via"
             href="http://original.example.com/res1.html"
             modified="2013-01-03T07:00:00Z"
             hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
             length="8876"
             type="text/html"/>
  </url>
</urlset>

Example 12.10: A secondary aggregator republishes
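
The handling of the via link along a chain of aggregators, as shown in Examples 12.8 through 12.10, can be sketched in Python as follows. If the entry being republished already carries a via link, that link is kept unchanged; otherwise a via link is created from the entry's own <loc>, <lastmod>, and <rs:md> values. The function name republish_entry is hypothetical, and the sketch assumes the republished copy is identical to what was retrieved, so that the <rs:md> attributes can be carried over as they are in the examples.

```python
import xml.etree.ElementTree as ET

SM = "http://www.sitemaps.org/schemas/sitemap/0.9"
RS = "http://www.openarchives.org/rs/terms/"


def republish_entry(source_url, new_loc, new_lastmod):
    """Build an aggregator's <url> element for a republished resource."""
    src_md = source_url.find(f"{{{RS}}}md")
    md_attrs = dict(src_md.attrib) if src_md is not None else {}

    # Keep an existing via link so it still points at the original Source.
    via = None
    for ln in source_url.findall(f"{{{RS}}}ln"):
        if ln.get("rel") == "via":
            via = dict(ln.attrib)
            break
    if via is None:
        via = {"rel": "via", "href": source_url.findtext(f"{{{SM}}}loc").strip()}
        modified = source_url.findtext(f"{{{SM}}}lastmod")
        if modified is not None:
            via["modified"] = modified.strip()
        for name in ("hash", "length", "type"):
            if name in md_attrs:
                via[name] = md_attrs[name]

    entry = ET.Element(f"{{{SM}}}url")
    ET.SubElement(entry, f"{{{SM}}}loc").text = new_loc
    ET.SubElement(entry, f"{{{SM}}}lastmod").text = new_lastmod
    ET.SubElement(entry, f"{{{RS}}}md", md_attrs)
    ET.SubElement(entry, f"{{{RS}}}ln", via)
    return entry
```

Other links on the source entry are deliberately not copied in this sketch, since each of them would need to be checked for appropriateness from the aggregator's own perspective.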

The values of the at, completed, from and until attributes must always be expressed from the perspective of the Source that publishes the document that contains them. Hence it is possible that the from datetime of a Change List is more recent than the <lastmod> datetime of the original Source's resource described in the Change List, which is conveyed using an <rs:ln> link with the via relation type.

An aggregator should be cautious when inheriting links, other than the one with the via relation type, from a Source that precedes it in an aggregation chain. It should make sure that each such link remains appropriate from its own perspective and refrain from inheriting it when it is not. For example, a link with the relation type collection or canonical expressed by the original Source may not be appropriate in the context of the aggregator's copy, and hence should not be included in the description of the changed resource in the aggregator's capability document.

13. References

[Atom Link Extensions]
Atom Link Extensions, J. Snell, 08 June 2012.
[HTML4.01 Links]
HTML 4.01 Specification: 12 Links, Dave Raggett, Arnaud Le Hors, Ian Jacobs (editors), World Wide Web Consortium, 24 December 1999.
[JSON-Patch]
JSON Patch, P. Bryan, M. Nottingham, Draft, January 2013.
[Memento Internet Draft]
Memento Internet Draft, H. Van de Sompel, M. L. Nelson, R. D. Sanderson, May 2012.
[The Open Archives Initiative Protocol for Metadata Harvesting]
The Open Archives Initiative Protocol for Metadata Harvesting, C. Lagoze, H. Van de Sompel, Michael Nelson, Simeon Warner, December 2008.
[ORE Specification - Abstract Data Model]
ORE Specification - Abstract Data Model, C. Lagoze, H. Van de Sompel, Pete Johnston, Michael Nelson, Robert Sanderson, Simeon Warner, October 2008.
[Relation Types Used in the ResourceSync Framework]
Relation Types Used in the ResourceSync Framework, Martin Klein, Robert Sanderson, Herbert Van de Sompel, Simeon Warner, Graham Klyne, Bernhard Haslhofer, Michael Nelson, Carl Lagoze (editors), 5 August 2013.
[RFC 2616]
IETF RFC 2616 Hypertext Transfer Protocol -- HTTP/1.1, R. Fielding, et al., June 1999.
[RFC 4287]
IETF RFC 4287: The Atom Syndication Format, M. Nottingham, R. Sayre, December 2005.
[RFC 5261]
IETF RFC 5261: An Extensible Markup Language (XML) Patch Operations Framework Utilizing XML Path Language (XPath) Selectors, J. Urpalainen, September 2008.
[RFC 5785]
IETF RFC 5785: Defining Well-Known Uniform Resource Identifiers (URIs), M. Nottingham, E. Hammer-Lahav, April 2010.
[RFC 5988]
IETF RFC 5988: Web Linking, M. Nottingham, October 2010.
[RFC 6249]
IETF RFC 6249: Metalink/HTTP: Mirrors and Hashes, A. Bryan, N. McNab, T. Tsujikawa, P. Poeml, H. Nordstrom, June 2011.
[RFC 6596]
IETF RFC 6596: The Canonical Link Relation, M. Ohye, J. Kupke, April 2012.
[RFC 6906]
IETF RFC 6906: The 'profile' Link Relation Type, E. Wilde, March 2013.
[Sitemaps]
Sitemaps XML format and protocol, sitemaps.org, 27 February 2008.
[W3C Datetime]
Date and Time Formats, Misha Wolf, Charles Wicksteed, 15 September 1997.
[Web Architecture]
Architecture of the World Wide Web, Volume One, I. Jacobs and N. Walsh (editors), World Wide Web Consortium, 15 January 2004.
[XHTML1.1 Links]
XHTML Modularization 1.1 - Second Edition: 5.19. Link Module, Shane McCarron et al. (editors), World Wide Web Consortium, 29 July 2010.
[.ZIP File Format Specification]
.ZIP File Format Specification, PKWARE Inc., September 2012.

A. Time Attribute Requirements

Table A.1 provides an overview of the requirements for use of the at and from attributes in ResourceSync documents. The top label in the column headings represents the <sitemapindex> root element for index documents, and the <urlset> root element for all other documents. The child label in the column headings represents the <sitemap> child element for index documents, and the <url> child element for all other documents.

The optional attributes completed and until are not shown in the table as they can be added wherever the corresponding at and from attributes are mandatory, recommended or optional.

Table A.1 shows that, for example, a Change List must contain the <rs:md> child element of the <urlset> root element with the attribute from to convey the temporal interval covered by the Change List. The table also shows that the <url> child element of the <urlset> root element in a Change List must have the <lastmod> child element to convey the last modification time of a resource. Both mandatory attributes are, for example, shown in Example 10.1 in Section 10.1.

Capability Document     /top/rs:md/@at  /top/rs:md/@from  /top/child/rs:md/@at  /top/child/rs:md/@from  /top/child/lastmod
Resource List           Mandatory       X                 X                     X                       Optional
Resource List Index     Mandatory       X                 Optional              X                       Optional
Resource Dump           Mandatory       X                 Optional              X                       Optional
Resource Dump Index     Mandatory       X                 Optional              X                       Optional
Resource Dump Manifest  Mandatory       X                 X                     X                       Optional
Change List             X               Mandatory         X                     X                       Mandatory
Change List Index       X               Mandatory         X                     Recommended             Optional
Change Dump             X               Mandatory         X                     Recommended             Optional
Change Dump Index       X               Mandatory         X                     Recommended             Optional
Change Dump Manifest    X               Mandatory         X                     X                       Mandatory

Table A.1: Required and optional use of at and from attributes in ResourceSync documents
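
A few rows of Table A.1 can be turned into a simple check, sketched below in Python. Only the mandatory cells for four <urlset>-rooted documents are encoded. The capability values changelist and changedump-manifest appear in the examples of this specification; the values resourcelist and resourcedump-manifest are assumed here to be the corresponding values for the other two documents. The names RULES and check_time_attributes are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# capability -> (mandatory attribute on top-level rs:md, child lastmod mandatory?)
RULES = {
    "resourcelist": ("at", False),           # assumed capability value
    "resourcedump-manifest": ("at", False),  # assumed capability value
    "changelist": ("from", True),
    "changedump-manifest": ("from", True),
}


def check_time_attributes(document_xml):
    """Return a list of problems found; an empty list means none were found."""
    root = ET.fromstring(document_xml)
    md = root.find("rs:md", NS)
    capability = md.get("capability") if md is not None else None
    if capability not in RULES:
        return [f"no rule encoded for capability {capability!r}"]
    top_attr, lastmod_mandatory = RULES[capability]
    problems = []
    if md.get(top_attr) is None:
        problems.append(f"top-level rs:md lacks mandatory '{top_attr}' attribute")
    if lastmod_mandatory:
        for url in root.findall("sm:url", NS):
            if url.find("sm:lastmod", NS) is None:
                loc = url.findtext("sm:loc", default="(no loc)", namespaces=NS)
                problems.append(f"{loc.strip()} lacks mandatory lastmod")
    return problems
```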

B. Acknowledgements

This specification is the collaborative work of NISO and the Open Archives Initiative. Funding for ResourceSync is provided by the Alfred P. Sloan Foundation. UK participation is supported by Jisc.

We also thank numerous individual contributors including: Martin Haye (California Digital Library), Richard Jones (Cottage Labs), Stuart Lewis (University of Edinburgh), Peter Murray (Lyrasis), David Rosenthal (LOCKSS), Shlomo Sanders (Ex Libris, Inc.), Ed Summers (Library of Congress), Paul Walk (UKOLN), Vincent Wehren (Microsoft), Zhiwu Xie (Virginia Tech), and Jeff Young (Online Computer Library Center).

C. Change Log

Date        Editor                             Description
2013-09-11  martin, herbert, simeon            edit section 12.7
2013-09-04  martin, simeon                     fix typo
2013-08-21  martin, herbert, rob, simeon       correct language, typos, namespace description
2013-08-05  martin, herbert, rob, simeon       version 0.9.1
2013-06-07  martin, herbert, rob, simeon       version 0.9
2013-05-01  martin, herbert, rob, simeon       version 0.6
2013-02-01  martin, herbert, rob, simeon       beta spec draft
2012-08-13  martin, herbert, simeon, bernhard  first alpha spec draft

Creative Commons License
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.