[OAI-implementers] repository auto-discovery
John S. Erickson
john.erickson at hp.com
Sun Nov 19 16:00:27 EST 2006
Michael says, "...we see at least 3 possible ways for robots to
"automatically" discover OAI-PMH baseURLs..."
I don't understand why you don't include baseURLs specified within
<friends> elements of an OAI-PMH response as another "way." Granted, this
might be more of a p2p approach, but it technically *could be* a way for
a robot to discover baseURLs.
This is how we're accomplishing "peer federation" in our pf-dspace
project, in which dspace instances tell their peers the baseURLs of
dspaces they know about by publishing lists of <friends> via oai-pmh.
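
As a rough illustration of the <friends> route, here is a minimal Python
sketch that pulls peer baseURLs out of an Identify response. The friends
namespace is the one from the OAI-PMH implementation guidelines; the
function name, the sample response, and the example.org hosts are made up
for illustration and are not taken from pf-dspace.

```python
import xml.etree.ElementTree as ET

# Namespace of the optional "friends" description container.
FRIENDS_NS = "http://www.openarchives.org/OAI/2.0/friends/"


def friends_base_urls(identify_xml):
    """Return the baseURLs listed in <friends> containers, in document order."""
    root = ET.fromstring(identify_xml)
    urls = []
    # Only baseURL elements in the friends namespace are peers; the
    # repository's own baseURL lives in the OAI-PMH namespace and is skipped.
    for element in root.iter("{%s}baseURL" % FRIENDS_NS):
        text = (element.text or "").strip()
        if text and text not in urls:
            urls.append(text)
    return urls


# Hypothetical Identify response (hosts are placeholders).
SAMPLE_IDENTIFY = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <Identify>
    <repositoryName>Example DSpace</repositoryName>
    <baseURL>http://dspace-a.example.org/oai/request</baseURL>
    <description>
      <friends xmlns="http://www.openarchives.org/OAI/2.0/friends/">
        <baseURL>http://dspace-b.example.org/oai/request</baseURL>
        <baseURL>http://dspace-c.example.org/oai/request</baseURL>
      </friends>
    </description>
  </Identify>
</OAI-PMH>
""".strip()

if __name__ == "__main__":
    for url in friends_base_urls(SAMPLE_IDENTIFY):
        print(url)
```

A robot could feed each returned baseURL back into the same function to
walk the peer graph, keeping a visited set so that it terminates.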
Michael Nelson wrote:
>> well, i'm aiming at something much lower: namely, how to get the baseUrl
>> of an OAI PMH data provider? and it seems particularly embarrassing, that
>> i have no standard way to advertise my own service to people (including
>> robots) surfing my own pages.
>
> Herbert and I talked about this some time ago and had a preference for
> adding to robots.txt to inform crawlers about baseURLs. At the time, few
> outside of the DL community were supporting OAI-PMH, but perhaps it is
> time to revisit this. Here is the proposal; the syntax could be tweaked
> w/ robots.txt "Allow:", HTML <link> etc., but this should give the idea:
>
> ===
>
> OAI-PMH baseURL discovery
>
> Drawing from our experience with mod_oai, we see at least 3 possible
> ways for robots to "automatically" discover OAI-PMH baseURLs:
>
> 1. develop a separate file, oaipmh.txt, similar in spirit to robots.txt
>
> 2. add to the existing robots.txt file
>
> 3. use HTML link or META tags for robots
>
> We do not prefer #1 - a separate file for robots to check seems unlikely
> to encourage widespread adoption.
>
> We prefer #2 because it injects OAI-PMH into the regular web
> mechanics where it belongs. Robots already look for this file -
> why not put OAI-PMH statements where they expect to find guidance?
>
> #3 can be used in some cases, but it makes an assumption that every
> repository we would like a robot to find has an HTML presence. #2 and #3
> can be used separately since they address separate use cases.
>
> robots.txt
> ----------
>
> The "problem" with robots.txt is that the syntax is very simple and is
> focused on telling robots what they can't do and not on what they should
> do. So in addition to having a line such as:
>
> OAIPMHbaseURL=http://cs1.ist.psu.edu/cgi-bin/oai.cgi
>
> We would like to expand the syntax of the "Disallow:" tag to include
> alternatives:
>
> Disallow: /citations/ http://cs1.ist.psu.edu/cgi-bin/oai.cgi
>
> Where the 2nd field is the alternate access for how to get at the
> information prohibited by the Disallow. Depending on how robust
> robots are with respect to extended syntax, we could repeat the line
> in case the extended line is not understood:
>
> Disallow: /citations/
> Disallow: /citations/ http://cs1.ist.psu.edu/cgi-bin/oai.cgi
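
To make the proposed robots.txt syntax concrete, here is a minimal Python
sketch of how a robot might read it. It assumes exactly the two forms shown
above (an "OAIPMHbaseURL=" line, and a Disallow line with a second,
whitespace-separated field); neither is part of any robots.txt standard,
and the function name is hypothetical.

```python
def parse_oai_hints(robots_txt):
    """Return (base_urls, alternates) found in a robots.txt body.

    base_urls  -- list of URLs from "OAIPMHbaseURL=" lines
    alternates -- dict mapping a disallowed path to its OAI-PMH baseURL
    """
    base_urls = []
    alternates = {}
    for raw_line in robots_txt.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        if line.startswith("OAIPMHbaseURL="):
            url = line[len("OAIPMHbaseURL="):].strip()
            if url and url not in base_urls:
                base_urls.append(url)
            continue
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow":
            parts = value.split()
            # A plain "Disallow: /path" has one part and carries no hint;
            # the extended form has the path plus an alternate baseURL.
            if len(parts) == 2:
                path, url = parts
                alternates[path] = url
    return base_urls, alternates


SAMPLE_ROBOTS = """\
User-agent: *
OAIPMHbaseURL=http://cs1.ist.psu.edu/cgi-bin/oai.cgi
Disallow: /citations/
Disallow: /citations/ http://cs1.ist.psu.edu/cgi-bin/oai.cgi
"""

if __name__ == "__main__":
    print(parse_oai_hints(SAMPLE_ROBOTS))
```

Whether existing robots would tolerate the two-field Disallow line is
exactly the open question raised above; the sketch only shows that the
repeated-line fallback is easy for an OAI-aware robot to handle.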
>
> HTML Tags for Robots
> --------------------
>
> It would be useful to tie an existing HTML page back to the original
> OAI-PMH repository from which it came, such as:
>
> http://uk.arxiv.org/abs/astro-ph/0502028
>
>
> having something like:
>
> <META NAME="ROBOTS" OAIPMHbaseURL="http://www.arxiv.org/oai2">
>
> It would also be useful to tie the HTML representation back to
> the structured metadata from which it came:
>
> <META NAME="ROBOTS"
> OAIPMHrecord="http://www.arxiv.org/oai2?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:arXiv.org:astro-ph/0502028">
>
> <META NAME="ROBOTS"
> OAIPMHrecord="http://www.arxiv.org/oai2?verb=GetRecord&metadataPrefix=oai_marc&identifier=oai:arXiv.org:astro-ph/0502028">
>
> This is similar to the inverse of a DC.Identifier field -- instead of mapping
> from structured to un/semi-structured, it maps from un/semi-structured
> to structured.
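
For the META variant, a robot-side reader could look roughly like the
Python sketch below, using the standard library's html.parser. The
OAIPMHbaseURL and OAIPMHrecord attributes are the proposed ones from the
text above, not anything browsers or crawlers recognize today; the class
name and the sample page are invented, and the sample writes "&" as
"&amp;" as an HTML attribute normally would.

```python
from html.parser import HTMLParser


class OAIMetaParser(HTMLParser):
    """Collect proposed OAI-PMH hints from <META NAME="ROBOTS" ...> tags."""

    def __init__(self):
        super().__init__()
        self.base_urls = []
        self.records = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        base_url = attributes.get("oaipmhbaseurl")
        if base_url:
            self.base_urls.append(base_url)
        record = attributes.get("oaipmhrecord")
        if record:
            self.records.append(record)


SAMPLE_PAGE = """
<html><head>
<META NAME="ROBOTS" OAIPMHbaseURL="http://www.arxiv.org/oai2">
<META NAME="ROBOTS"
 OAIPMHrecord="http://www.arxiv.org/oai2?verb=GetRecord&amp;metadataPrefix=oai_dc&amp;identifier=oai:arXiv.org:astro-ph/0502028">
</head><body></body></html>
"""

if __name__ == "__main__":
    parser = OAIMetaParser()
    parser.feed(SAMPLE_PAGE)
    print(parser.base_urls)
    print(parser.records)
```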
>
>
>
>
> ----
> Michael L. Nelson mln at cs.odu.edu http://www.cs.odu.edu/~mln/
> Dept of Computer Science, Old Dominion University, Norfolk VA 23529
> +1 757 683 6393 +1 757 683 4900 (f)
>
> _______________________________________________
> OAI-implementers mailing list
> List information, archives, preferences and to unsubscribe:
> http://www.openarchives.org/mailman/listinfo/oai-implementers
>