Open Archives Initiative ResourceSync Framework Specification
The ResourceSync specifications describe a synchronization framework for the web consisting of various capabilities that allow third party systems to remain synchronized with a server's evolving resources. This ResourceSync Archives specification describes additional capabilities that extend the core specification to provide historical information about a set of resources.
This specification is one of several documents comprising the ResourceSync Framework Specifications.
This specification is a beta draft released for public comment. Feedback is most welcome on the ResourceSync Google Group.
1. Introduction
1.1 Motivating Examples
1.2 Structure
1.3 Notational Conventions
2. Advertising Archive Capabilities
2.1 Inclusion of Archives in a Capability List
2.2 Linking to Archives
3. Resource List Archives
3.1 Resource List Archive Index
4. Resource Dump Archives
5. Change List Archives
6. Change Dump Archives
7. References
A. Acknowledgements
B. Change Log
The ResourceSync specifications introduce a range of easy to implement capabilities that a server may support in order to enable remote systems to remain more tightly in step with its evolving resources. They also describe how a server can advertise the capabilities it supports. Remote systems can inspect this information to determine how best to remain aligned with the evolving data.
This ResourceSync Archives specification adds to the framework capabilities that allow a server to provide historical data based on archives of the core capabilities (Resource Lists, Resource Dumps, Change Lists, and Change Dumps). Like all other capabilities, Archives are implemented using the document formats introduced by the Sitemap protocol. Each archive capability is optional and may be implemented independently of any other archive capability. Archives need not be implemented in order to support synchronization with ResourceSync, but may facilitate certain use cases.
For example, a Change List Archive allows a server to list a timestamped set of historical Change Lists, thus allowing description of changes over an extended period without placing additional requirements on the generation and rotation of the current Change List. A Resource Dump Archive allows a server to list a timestamped set of historical Resource Dumps, providing snapshots of the server's resources at different times. A remote system may select an appropriate historical Resource Dump to synchronize with a past state of the server's resources.
This document is structured as follows:
All archive capabilities may have indexes to allow extension to very large numbers of entries in the same manner as the core capabilities. This is described in detail for the Resource List Archive Index in Section 3.1.
Many projects and services have synchronization needs and have implemented ad hoc solutions. ResourceSync provides a standard synchronization method that will reduce implementation effort and facilitate easier reuse of resources. Archive capabilities allow historical data to be described within the same framework as current synchronization information. This section describes motivating examples for the archive capabilities.
The way in which a ResourceSync Source generates Change Lists will be determined by the particular technical configuration of the Source, the frequency of changes, and the intended use. While Change Lists that use the Sitemap index format and a set of Sitemaps may have a very large number of entries, it may be convenient to rotate individual lists of changes frequently and avoid generating a very large Change List. Change List Archives add flexibility while retaining the ability for a Source to make available a complete change history enabling incremental synchronization from any past state. A Source with very frequent changes might create separate Sitemap files as part of a Change List at hourly intervals, and perhaps each month (about 720 hours) start a new Change List while archiving the old one. If all the resource states were recorded in addition to the change information, then Change Dumps and a Change Dump Archive could be used to optimize download of the changed resources.
Many services provide snapshots of historical content either as stable reference points, or to permit the evolution of the service's resources to be studied in situations where describing all updates would be difficult. Examples include Wikipedia Snapshots and Nature Linked Data Snapshots. The Resource Dump Archive capability provides the opportunity to describe such snapshots in a consistent and machine-navigable way.
Resource List Snapshots provide the ability for servers to describe the state of their resources at particular points in time. This would allow clients to investigate changes expressed in the metadata or to compare the current state with historical state.
The capabilities introduced in this specification extend the framework structure described in ResourceSync Core: Structure. Figure 1 shows how the archive capabilities fit into the ResourceSync Framework:
This specification uses the terms "resource", "representation", "request", "response", "content negotiation", "client", and "server" as described in [Architecture of the World Wide Web].
Throughout this document, the following namespace prefix bindings are used:
Prefix | Namespace URI | Description |
---|---|---|
(default) | http://www.sitemaps.org/schemas/sitemap/0.9 | Sitemap XML elements defined in the Sitemap protocol |
rs | http://www.openarchives.org/rs/terms/ | Namespace for elements introduced by ResourceSync |
In order to make use of the capabilities that a Source provides, a Destination must first determine which capabilities are supported, and the URIs of the corresponding capability documents. The archive capabilities described in this specification may be added to a Capability List in the same manner as other ResourceSync capabilities (see ResourceSync Core: Capability List). Archives may also be linked to from the corresponding core capability documents such as a Resource Dump or a Change List.
The Capability List format is described in detail in ResourceSync Core: Capability List. Each capability is listed within a <url> element that contains the URI of the capability document in a <loc> element, and the capability type in the capability attribute of an <rs:md> element. The four additional archive capabilities described in this specification that a Source can provide are indicated with the capability types resourcelist-archive, resourcedump-archive, changelist-archive, and changedump-archive. These values are shown in the capability attributes of the <rs:md capability="..."> elements in Example 3.1, Example 4.1, Example 5.1, and Example 6.1. A Capability List may contain only one entry per capability.
A resource that is covered by one capability listed in a Capability List must also be covered by all other capabilities that are enumerated in that Capability List. With this understanding, Destinations can select from the capabilities offered the best one to serve their synchronization goal for the particular set of resources.
Example 2.1 shows a Capability List where the Source offers eight capabilities: a Resource List, a Resource Dump, a Change List, a Change Dump, a Resource List Archive, a Resource Dump Archive, a Change List Archive, and a Change Dump Archive.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="describedby"
href="http://example.com/info_about_set1_of_resources.xml"
type="application/xml"/>
<rs:ln rel="up"
href="http://example.com/resourcesync_description.xml"/>
<rs:md capability="capabilitylist"/>
<url>
<loc>http://example.com/dataset1/resourcelist.xml</loc>
<rs:md capability="resourcelist"/>
</url>
<url>
<loc>http://example.com/dataset1/resourcedump.xml</loc>
<rs:md capability="resourcedump"/>
</url>
<url>
<loc>http://example.com/dataset1/changelist.xml</loc>
<rs:md capability="changelist"/>
</url>
<url>
<loc>http://example.com/dataset1/changedump.xml</loc>
<rs:md capability="changedump"/>
</url>
<url>
<loc>http://example.com/dataset1/resourcelist-archive.xml</loc>
<rs:md capability="resourcelist-archive"/>
</url>
<url>
<loc>http://example.com/dataset1/resourcedump-archive.xml</loc>
<rs:md capability="resourcedump-archive"/>
</url>
<url>
<loc>http://example.com/dataset1/changelist-archive.xml</loc>
<rs:md capability="changelist-archive"/>
</url>
<url>
<loc>http://example.com/dataset1/changedump-archive.xml</loc>
<rs:md capability="changedump-archive"/>
</url>
</urlset>
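As an illustration of how a Destination might read such a Capability List, the following Python sketch maps each capability type to the URI of its capability document. It is only a sketch under stated assumptions: the function name read_capabilities and the abbreviated sample document are hypothetical, and the input is assumed to be a well-formed Capability List that has already been retrieved.

```python
import xml.etree.ElementTree as ET

# Namespace bindings used throughout this specification.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def read_capabilities(capability_list_xml):
    """Map each advertised capability type to the URI of its document."""
    root = ET.fromstring(capability_list_xml)
    capabilities = {}
    for url in root.findall("sm:url", NS):
        loc = url.find("sm:loc", NS)
        md = url.find("rs:md", NS)
        if loc is None or md is None or md.get("capability") is None:
            continue  # not a capability entry; ignore it
        # A Capability List may contain only one entry per capability,
        # so a plain dict is sufficient here.
        capabilities[md.get("capability")] = loc.text.strip()
    return capabilities


# Abbreviated, hypothetical sample in the style of Example 2.1.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="capabilitylist"/>
  <url>
    <loc>http://example.com/dataset1/changelist.xml</loc>
    <rs:md capability="changelist"/>
  </url>
  <url>
    <loc>http://example.com/dataset1/changelist-archive.xml</loc>
    <rs:md capability="changelist-archive"/>
  </url>
</urlset>"""

capabilities = read_capabilities(SAMPLE)
# Archive capabilities are recognizable by their "-archive" suffix.
archive_types = sorted(c for c in capabilities if c.endswith("-archive"))
print(archive_types)
```

The top-level <rs:md capability="capabilitylist"/> element is not a child of a <url> element, so it is not included in the resulting mapping.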
The provision of archive capabilities and their inclusion in one or more Capability Lists does not change how a Source would expose a Source Description (see ResourceSync Core: Describing the Source), or the discovery of the Source Description document (see ResourceSync Core: Discovery).
Individual capability documents such as a Change List or Change List Index may provide links to the corresponding archive using a top-level <rs:ln> element with the relation type archives (defined in the Link Relations Registry).
Example 2.2 shows a Change List with a link to a Change List Archive. A Destination cannot determine from the archives link whether a Source provides, for example, a Change List Archive Index or a single Change List Archive. The archive document must be downloaded to make this determination: a document with a <sitemapindex> root element is an index, a document with a <urlset> root element is not.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="up"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:ln rel="archives"
href="http://example.com/dataset1/changelistarchive.xml"/>
<rs:md capability="changelist"
from="2013-01-01T11:00:00Z"
until="2013-01-03T11:00:00Z"/>
<url>
<loc>http://example.com/res2.pdf</loc>
<lastmod>2013-01-02T13:00:00Z</lastmod>
<rs:md change="updated"/>
</url>
<url>
<loc>http://example.com/res3.tiff</loc>
<lastmod>2013-01-02T18:00:00Z</lastmod>
<rs:md change="deleted"/>
</url>
</urlset>
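The two steps described above, following the archives link and then inspecting the root element of the retrieved document, could be sketched in Python as follows. The function names archives_link and is_archive_index and the abbreviated sample documents are hypothetical; retrieval of the documents over HTTP is deliberately left out.

```python
import xml.etree.ElementTree as ET

# Namespaces in the {uri}name form used by ElementTree for tag names.
SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"
RS = "{http://www.openarchives.org/rs/terms/}"


def archives_link(capability_doc_xml):
    """Return the href of the top-level rs:ln with rel="archives", or None."""
    root = ET.fromstring(capability_doc_xml)
    for ln in root.findall(RS + "ln"):  # direct children of the root only
        if ln.get("rel") == "archives":
            return ln.get("href")
    return None


def is_archive_index(archive_xml):
    """True for a <sitemapindex> root (an index), False for a <urlset> root."""
    root = ET.fromstring(archive_xml)
    if root.tag == SM + "sitemapindex":
        return True
    if root.tag == SM + "urlset":
        return False
    raise ValueError("unexpected root element: " + root.tag)


# Abbreviated, hypothetical sample documents.
XMLNS = ('xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" '
         'xmlns:rs="http://www.openarchives.org/rs/terms/"')
CHANGE_LIST = (
    "<urlset " + XMLNS + ">"
    '<rs:ln rel="up" href="http://example.com/dataset1/capabilitylist.xml"/>'
    '<rs:ln rel="archives" '
    'href="http://example.com/dataset1/changelistarchive.xml"/>'
    '<rs:md capability="changelist"/>'
    "</urlset>"
)
ARCHIVE = ("<urlset " + XMLNS + ">"
           '<rs:md capability="changelist-archive"/></urlset>')
ARCHIVE_INDEX = ("<sitemapindex " + XMLNS + ">"
                 '<rs:md capability="changelist-archive"/></sitemapindex>')

print(archives_link(CHANGE_LIST))
print(is_archive_index(ARCHIVE), is_archive_index(ARCHIVE_INDEX))
```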
As part of the regular update of its Resource List, a Source might maintain old Resource Lists (ResourceSync Core: Resource List) to provide historical snapshot views of its content. Such Resource List Archives provide an easy way for a Destination to compare the states of the resources at different times.
A Resource List Archive is based on the <urlset> document format introduced by the Sitemap protocol. It has the <urlset> root element and the following structure:
- An <rs:md> child element of <urlset> must have a capability attribute that has a value of resourcelist-archive.
- An <rs:ln> child element of <urlset> points to the Capability List with the relation type up.
- If a Resource List Archive Index exists, an <rs:ln> child element of <urlset> points to it with the relation type index.
- One <url> child element of <urlset> per Resource List. This element does not have attributes, but uses child elements to convey information about the Resource List. The <url> element has the following child elements:
  - A <loc> child element provides the URI of the Resource List.
  - A <lastmod> child element conveys the last modification time of the resource with the URI provided in <loc>, the Resource List in this case. The value is expressed as a W3C Datetime; the use of a complete date and time expressed in UTC using the format YYYY-MM-DDThh:mm:ss[.s]Z is recommended.
  - An <rs:md> child element with an at attribute and possibly a completed attribute conveys the datetimes at which the process of taking a snapshot of resources to create the archived Resource List started and ended, respectively.
Example 3.1 shows a Resource List Archive that points to three Resource Lists that were saved on a monthly schedule.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="up"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="resourcelist-archive"/>
<url>
<loc>http://example.com/resourcelist1.xml</loc>
<rs:md at="2012-11-03T09:00:00Z"/>
</url>
<url>
<loc>http://example.com/resourcelist2.xml</loc>
<rs:md at="2012-12-03T09:00:00Z"/>
</url>
<url>
<loc>http://example.com/resourcelist3.xml</loc>
<rs:md at="2013-01-03T09:00:00Z"/>
</url>
</urlset>
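A Destination that wants an overview of the available snapshots might extract the URI and at datetime of each archived Resource List. The following Python sketch does so for the document in Example 3.1. The names parse_w3c_datetime and archived_resource_lists are hypothetical, and the sketch assumes that every <url> entry carries an <rs:md> element with an at attribute expressed as a complete UTC datetime.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def parse_w3c_datetime(value):
    """Parse a complete UTC W3C Datetime such as 2012-11-03T09:00:00Z."""
    # Assumption: only the recommended complete UTC form is handled.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def archived_resource_lists(archive_xml):
    """Return (at, uri) pairs for each archived Resource List, oldest first."""
    root = ET.fromstring(archive_xml)
    entries = []
    for url in root.findall("sm:url", NS):
        uri = url.find("sm:loc", NS).text.strip()
        at = url.find("rs:md", NS).get("at")
        entries.append((parse_w3c_datetime(at), uri))
    return sorted(entries)


# The Resource List Archive of Example 3.1.
EXAMPLE_3_1 = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up" href="http://example.com/dataset1/capabilitylist.xml"/>
  <rs:md capability="resourcelist-archive"/>
  <url>
    <loc>http://example.com/resourcelist1.xml</loc>
    <rs:md at="2012-11-03T09:00:00Z"/>
  </url>
  <url>
    <loc>http://example.com/resourcelist2.xml</loc>
    <rs:md at="2012-12-03T09:00:00Z"/>
  </url>
  <url>
    <loc>http://example.com/resourcelist3.xml</loc>
    <rs:md at="2013-01-03T09:00:00Z"/>
  </url>
</urlset>"""

entries = archived_resource_lists(EXAMPLE_3_1)
for at, uri in entries:
    print(at.isoformat(), uri)
```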
The ResourceSync framework adopts the community defined limits for publishing documents of the <urlset> format and uses index documents with <sitemapindex> format for grouping them. Archive Indexes are similar to core capability indexes such as a ResourceSync Core: Resource List Index, and provide for very large archives or flexibility in archiving or rotation schemes.
A Resource List Archive Index is based on the <sitemapindex> document format introduced by the Sitemap protocol. It has the <sitemapindex> root element and the following structure:
- An <rs:md> child element of <sitemapindex> must have a capability attribute that has a value of resourcelist-archive.
- An <rs:ln> child element of <sitemapindex> points to the Capability List with the relation type up.
- One <sitemap> child element of <sitemapindex> per Resource List Archive. This element does not have attributes, but uses child elements to convey information about the Resource List Archive. The <sitemap> element has the following child elements:
  - A <loc> child element provides the URI of the Resource List Archive.
  - A <lastmod> child element with semantics as described in Section 3. A <lastmod> should not be provided unless the Source updates the Resource List Archive Index every time it updates the Resource List Archive.
A Destination can determine whether it has reached a Resource List Archive or a Resource List Archive Index based on whether the root element is <urlset> or <sitemapindex> respectively.
Example 3.2 shows a Resource List Archive Index that points to two Resource List Archives. This specification does not define how a Source should group entries in the Resource List Archives referred to. It might be based simply on the capacity of each Resource List Archive document, or according to the Source's archiving scheme (say yearly collections). In order to discover the times of the archived Resource Lists a Destination must inspect the Resource List Archives referred to, and potentially the Resource Lists that they in turn refer to.
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="up"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="resourcelist-archive"/>
<sitemap>
<loc>http://example.com/resourcelistarchive00001.xml</loc>
</sitemap>
<sitemap>
<loc>http://example.com/resourcelistarchive00002.xml</loc>
</sitemap>
</sitemapindex>
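Because a Destination must inspect the Resource List Archives referred to by an index, it may be convenient to treat an archive and an archive index uniformly. The following Python sketch walks from either kind of document down to the URIs of the archived Resource Lists. The function archived_list_uris, the fetch callable, and the in-memory DOCS mapping with its resource list URIs are hypothetical stand-ins for real HTTP retrieval.

```python
import xml.etree.ElementTree as ET

SM = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def archived_list_uris(uri, fetch):
    """Yield URIs of archived Resource Lists reachable from `uri`.

    `uri` may identify a Resource List Archive or a Resource List
    Archive Index; `fetch` returns the XML text of a document by URI.
    """
    root = ET.fromstring(fetch(uri))
    if root.tag == SM + "sitemapindex":
        # An index: descend into each Resource List Archive it points to.
        for sitemap in root.findall(SM + "sitemap"):
            child_uri = sitemap.find(SM + "loc").text.strip()
            yield from archived_list_uris(child_uri, fetch)
    elif root.tag == SM + "urlset":
        # An archive: each <url> points to one archived Resource List.
        for url in root.findall(SM + "url"):
            yield url.find(SM + "loc").text.strip()
    else:
        raise ValueError("unexpected root element: " + root.tag)


# Hypothetical in-memory documents standing in for HTTP responses.
XMLNS = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
BASE = "http://example.com/"
DOCS = {
    BASE + "index.xml": (
        "<sitemapindex " + XMLNS + ">"
        "<sitemap><loc>" + BASE + "resourcelistarchive00001.xml</loc></sitemap>"
        "<sitemap><loc>" + BASE + "resourcelistarchive00002.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    BASE + "resourcelistarchive00001.xml": (
        "<urlset " + XMLNS + ">"
        "<url><loc>" + BASE + "resourcelist1.xml</loc></url>"
        "<url><loc>" + BASE + "resourcelist2.xml</loc></url>"
        "</urlset>"
    ),
    BASE + "resourcelistarchive00002.xml": (
        "<urlset " + XMLNS + ">"
        "<url><loc>" + BASE + "resourcelist3.xml</loc></url>"
        "</urlset>"
    ),
}

found = list(archived_list_uris(BASE + "index.xml", DOCS.__getitem__))
print(found)
```

Passing the retrieval function as a parameter keeps the traversal independent of any particular HTTP client and makes it straightforward to test against stored documents.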
As part of the regular maintenance of its data, a Source might maintain old Resource Dumps. For a Destination that wishes to compare or archive versions of the data over time, access to these Resource Dumps allows the packaged historical data to be downloaded all at once, rather than requiring the Source to support access to the individual resource versions, and for the Destination to collect them one at a time.
A Resource Dump Archive points to a set of previously created and published Resource Dumps. Each of these Resource Dumps represents a snapshot of the Source's data at a certain point in time as described in ResourceSync Core: Resource Dump.
A Resource Dump Archive is based on the <urlset> document format introduced by the Sitemap protocol. It has the <urlset> root element and the following structure:
- An <rs:md> child element of <urlset> must have a capability attribute that has a value of resourcedump-archive.
- An <rs:ln> child element of <urlset> points to the Capability List with the relation type up.
- If a Resource Dump Archive Index exists, an <rs:ln> child element of <urlset> points to it with the relation type index.
- One <url> child element of <urlset> per Resource Dump. This element does not have attributes, but uses child elements to convey information about the Resource Dump. The <url> element has the following child elements:
  - A <loc> child element provides the URI of the Resource Dump.
  - A <lastmod> child element with semantics as described in Section 3.
  - An <rs:md> child element with an at attribute and possibly a completed attribute conveys the datetimes at which the process of taking a snapshot of resources to create the archived Resource Dump started and ended, respectively.
Example 4.1 shows a Resource Dump Archive that points to two historical Resource Dumps created on a monthly schedule.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="up"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="resourcedump-archive"/>
<url>
<loc>http://example.com/resourcedump1.xml</loc>
<lastmod>2012-11-03T09:05:42Z</lastmod>
<rs:md at="2012-11-03T09:00:00Z"
completed="2012-11-03T09:05:01Z"/>
</url>
<url>
<loc>http://example.com/resourcedump2.xml</loc>
<lastmod>2012-12-03T09:06:12Z</lastmod>
<rs:md at="2012-12-03T09:00:00Z"
completed="2012-12-03T09:05:17Z"/>
</url>
</urlset>
If a Source needs to or chooses to publish multiple Resource Dump Archives, it must group them using a Resource Dump Archive Index, in a manner similar to that described in Section 3.1.
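To synchronize with a past state of the Source's resources, a Destination might pick the most recent Resource Dump whose snapshot started at or before a target time. The following Python sketch applies that rule to the document in Example 4.1. The selection rule itself, the function name select_dump, and the reliance on a complete UTC at datetime on every entry are assumptions made for illustration and are not requirements of this specification.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def parse_w3c_datetime(value):
    """Parse a complete UTC W3C Datetime such as 2012-11-03T09:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def select_dump(archive_xml, target):
    """Return the URI of the latest Resource Dump taken at or before target.

    Returns None when every archived dump is newer than `target`.
    """
    root = ET.fromstring(archive_xml)
    best_at, best_uri = None, None
    for url in root.findall("sm:url", NS):
        at = parse_w3c_datetime(url.find("rs:md", NS).get("at"))
        if at <= target and (best_at is None or at > best_at):
            best_at, best_uri = at, url.find("sm:loc", NS).text.strip()
    return best_uri


# The Resource Dump Archive of Example 4.1.
EXAMPLE_4_1 = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:ln rel="up" href="http://example.com/dataset1/capabilitylist.xml"/>
  <rs:md capability="resourcedump-archive"/>
  <url>
    <loc>http://example.com/resourcedump1.xml</loc>
    <lastmod>2012-11-03T09:05:42Z</lastmod>
    <rs:md at="2012-11-03T09:00:00Z" completed="2012-11-03T09:05:01Z"/>
  </url>
  <url>
    <loc>http://example.com/resourcedump2.xml</loc>
    <lastmod>2012-12-03T09:06:12Z</lastmod>
    <rs:md at="2012-12-03T09:00:00Z" completed="2012-12-03T09:05:17Z"/>
  </url>
</urlset>"""

mid_november = datetime(2012, 11, 20, tzinfo=timezone.utc)
print(select_dump(EXAMPLE_4_1, mid_november))
```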
A Change List (ResourceSync Core: Change List) describes the changes in a Source's resources over a certain period of time. The Source determines the length of that time interval. If a Source wishes to offer Change Lists covering prior temporal intervals, it can provide a Change List Archive. A Change List Archive provides a list of pointers to individual Change Lists which would usually represent consecutive lists of changes.
A Change List Archive is based on the <urlset> document format introduced by the Sitemap protocol. It has the <urlset> root element and the following structure:
- An <rs:md> child element of <urlset> must have a capability attribute that has a value of changelist-archive.
- An <rs:ln> child element of <urlset> points to the Capability List with the relation type up.
- If a Change List Archive Index exists, an <rs:ln> child element of <urlset> points to it with the relation type index.
- One <url> child element of <urlset> per Change List. This element does not have attributes, but uses child elements to convey information about the Change List. The <url> element has the following child elements:
  - A <loc> child element provides the URI of the Change List.
  - A <lastmod> child element with semantics as described in Section 3.
  - An <rs:md> child element with optional from and until attributes conveys the temporal interval covered by the archived Change List.
The pointers in a Change List Archive must be in chronological order. Either by inspecting the from and until attributes provided for each archived Change List, or by downloading the Change Lists, a Destination may determine whether the Change Lists are consecutive and without any time gaps.
Example 5.1 shows a Change List Archive that points to three Change Lists created on consecutive days.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="up"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="changelist-archive"/>
<url>
<loc>http://example.com/changelist1.xml</loc>
<rs:md from="2013-01-01T09:00:00Z"
until="2013-01-02T09:00:00Z"/>
</url>
<url>
<loc>http://example.com/changelist2.xml</loc>
<rs:md from="2013-01-02T09:00:00Z"
until="2013-01-03T09:00:00Z"/>
</url>
<url>
<loc>http://example.com/changelist3.xml</loc>
<rs:md from="2013-01-03T09:00:00Z"
until="2013-01-04T09:00:00Z"/>
</url>
</urlset>
If a Source needs to or chooses to publish multiple Change List Archives, it must group them using a Change List Archive Index, in a manner similar to that described in Section 3.1.
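The check for consecutive Change Lists without time gaps can be sketched in Python when the from and until attributes are present. The names find_discontinuities and make_archive are hypothetical, and the sketch simply raises an error when an entry lacks the attributes, since a Destination would then need to download the Change Lists themselves.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def parse_w3c_datetime(value):
    """Parse a complete UTC W3C Datetime such as 2013-01-01T09:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def find_discontinuities(archive_xml):
    """Return (previous until, next from) pairs that do not line up."""
    root = ET.fromstring(archive_xml)
    intervals = []
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        if md is None or md.get("from") is None or md.get("until") is None:
            raise ValueError("from/until missing; inspect the Change List")
        intervals.append((parse_w3c_datetime(md.get("from")),
                          parse_w3c_datetime(md.get("until"))))
    problems = []
    for (prev_from, prev_until), (next_from, _) in zip(intervals, intervals[1:]):
        if next_from < prev_from:
            raise ValueError("entries are not in chronological order")
        if next_from != prev_until:  # a gap or an overlap
            problems.append((prev_until, next_from))
    return problems


def make_archive(intervals):
    """Build a hypothetical Change List Archive from (from, until) strings."""
    urls = "".join(
        "<url><loc>http://example.com/changelist%d.xml</loc>"
        '<rs:md from="%s" until="%s"/></url>' % (i, start, end)
        for i, (start, end) in enumerate(intervals, 1)
    )
    return ('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" '
            'xmlns:rs="http://www.openarchives.org/rs/terms/">'
            '<rs:md capability="changelist-archive"/>' + urls + "</urlset>")


# Same intervals as Example 5.1: three consecutive days.
CONSECUTIVE = make_archive([
    ("2013-01-01T09:00:00Z", "2013-01-02T09:00:00Z"),
    ("2013-01-02T09:00:00Z", "2013-01-03T09:00:00Z"),
    ("2013-01-03T09:00:00Z", "2013-01-04T09:00:00Z"),
])
# Variant with the middle Change List missing.
WITH_GAP = make_archive([
    ("2013-01-01T09:00:00Z", "2013-01-02T09:00:00Z"),
    ("2013-01-03T09:00:00Z", "2013-01-04T09:00:00Z"),
])

gaps = find_discontinuities(WITH_GAP)
print(find_discontinuities(CONSECUTIVE), gaps)
```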
If a Source decides to offer Change Dumps of prior temporal intervals, it may provide a Change Dump Archive. A Change Dump Archive points to a number of Change Dumps.
A Change Dump Archive is based on the <urlset> document format introduced by the Sitemap protocol. It has the <urlset> root element and the following structure:
- An <rs:md> child element of <urlset> must have a capability attribute that has a value of changedump-archive.
- An <rs:ln> child element of <urlset> points to the Capability List with the relation type up.
- If a Change Dump Archive Index exists, an <rs:ln> child element of <urlset> points to it with the relation type index.
- One <url> child element of <urlset> per Change Dump. This element does not have attributes, but uses child elements to convey information about the Change Dump. The <url> element has the following child elements:
  - A <loc> child element provides the URI of the Change Dump.
  - A <lastmod> child element with semantics as described in Section 3.
  - An <rs:md> child element with optional from and until attributes conveys the temporal interval covered by the archived Change Dump.
The pointers in a Change Dump Archive must be in chronological order. Either by inspecting the from and until attributes provided for each archived Change Dump, or by downloading the Change Dumps, a Destination may determine whether the Change Dumps are consecutive and without any time gaps.
An example for a Change Dump Archive is shown in Example 6.1 below. It points to two Change Dumps that were created for consecutive weeks.
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">
<rs:ln rel="up"
href="http://example.com/dataset1/capabilitylist.xml"/>
<rs:md capability="changedump-archive"/>
<url>
<loc>http://example.com/changedump-w1.xml</loc>
<lastmod>2013-01-20T09:02:43Z</lastmod>
<rs:md from="2013-01-13T09:00:00Z"
until="2013-01-20T09:00:00Z"/>
</url>
<url>
<loc>http://example.com/changedump-w2.xml</loc>
<lastmod>2013-01-27T09:01:57Z</lastmod>
<rs:md from="2013-01-20T09:00:00Z"
until="2013-01-27T09:00:00Z"/>
</url>
</urlset>
If a Source needs to or chooses to publish multiple Change Dump Archives, it must group them using a Change Dump Archive Index, in a manner similar to that described in Section 3.1.
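A Destination that last synchronized at a known time might use a Change Dump Archive to decide which Change Dumps it still needs. The following Python sketch returns, in document order, every archived Change Dump whose interval ends after that time. The function name dumps_since, the selection rule, and the dates in the sample document are illustrative assumptions, and every entry is assumed to carry an until attribute.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}


def parse_w3c_datetime(value):
    """Parse a complete UTC W3C Datetime such as 2013-01-13T09:00:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def dumps_since(archive_xml, since):
    """Return URIs of Change Dumps covering changes made after `since`.

    The archive lists its entries in chronological order, so document
    order is preserved in the result.
    """
    root = ET.fromstring(archive_xml)
    needed = []
    for url in root.findall("sm:url", NS):
        until = parse_w3c_datetime(url.find("rs:md", NS).get("until"))
        if until > since:
            needed.append(url.find("sm:loc", NS).text.strip())
    return needed


# Hypothetical archive listing Change Dumps for two consecutive weeks.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changedump-archive"/>
  <url>
    <loc>http://example.com/changedump-w1.xml</loc>
    <rs:md from="2013-01-13T09:00:00Z" until="2013-01-20T09:00:00Z"/>
  </url>
  <url>
    <loc>http://example.com/changedump-w2.xml</loc>
    <rs:md from="2013-01-20T09:00:00Z" until="2013-01-27T09:00:00Z"/>
  </url>
</urlset>"""

last_sync = datetime(2013, 1, 21, tzinfo=timezone.utc)
to_download = dumps_since(SAMPLE, last_sync)
print(to_download)
```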
This specification is the collaborative work of NISO and the Open Archives Initiative. Funding for ResourceSync is provided by the Alfred P. Sloan Foundation. UK participation is supported by Jisc.
We also thank numerous individual contributors including: Martin Haye (California Digital Library), Richard Jones (Cottage Labs), Stuart Lewis (University of Edinburgh), Peter Murray (Lyrasis), David Rosenthal (LOCKSS), Shlomo Sanders (Ex Libris, Inc.), Ed Summers (Library of Congress), Paul Walk (UKOLN), Vincent Wehren (Microsoft), Zhiwu Xie (Virginia Tech), and Jeff Young (Online Computer Library Center).
Date | Editor | Description |
---|---|---|
2013-08-21 | simeon, martin, herbert | reorder sections, add structure figure |
2013-08-05 | simeon, martin, herbert, rob | version 0.9.1 |
2013-06-07 | simeon | version 0.9 |
2013-05-06 | simeon | separated archives portion for version 0.6 |
2013-02-01 | martin, herbert, rob, simeon | beta spec draft |
2012-08-13 | martin, herbert, simeon, bernhard | first alpha spec draft |
This work is licensed under a Creative Commons Attribution-Share Alike 3.0 Unported License.