[OAI-implementers] RSS Feed From the UIUC OAI Registry
Thomas G. Habing
thabing@uiuc.edu
Fri, 21 Nov 2003 10:53:31 -0600
Michael Nelson wrote:
> On Wed, 5 Nov 2003, Thomas G. Habing wrote:
>
>>Last night I also ran my gOAIglePop script. This script programmatically
>>does some Google searches, looking for OAI repositories. If it finds a URL
>>which appears to be an OAI repository it issues an Identify request. If it
>>gets a valid response, it's found a repository. The results of the script
>>run can be found at http://gita.grainger.uiuc.edu/registry/gOAIgle.xml. The
>>latest run found three previously unknown repositories (at least to the
>>registry). If anyone is interested, the best Google query I've found for
>>finding OAI repositories is 'allinurl:verb=Identify'. Type this into the
>>Google query textbox and press Search.
>
>
> also very cool... suggestion: perhaps do some normalization of the URLs?
> or at least normalize based on the Identify response? for example, you
> found at least one of my repos twice:
>
> <baseURL>http://naca.larc.nasa.gov/oai2.0/</baseURL>
>
> <baseURL>http://naca.larc.nasa.gov/oai2.0/index.cgi</baseURL>
>
> which are the same repositories and give the same responses in Identify.
>
This is something I've been struggling with. I've actually done a fair
amount of manual cleanup in my registry to get rid of duplicate repositories
that appear with slightly different baseURLs. Discovering these can
actually be kind of tricky because of domain name aliases, redirects, and
other reasons. In some cases I've found three or more different baseURLs
for the same repository.
Duplicates seem to arise for various reasons:

  - Domain name aliases
  - URLs that sometimes use the numeric IP address and sometimes the domain
    name
  - URLs that sometimes explicitly include port number 80 and sometimes not
  - URLs that sometimes explicitly include the script name and other times
    rely on the default, as in the examples above
  - HTTP redirects
  - Probably other reasons...
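
Some of the purely textual causes above (explicit port 80, explicit default
script name, host name case) could be caught by normalizing each URL before
comparing. The following is only a minimal sketch in Python, not the registry's
actual code. The function name normalize_base_url and the DEFAULT_SCRIPTS list
are assumptions made for illustration, and treating index.cgi as a server
default is a guess that will not hold everywhere, so a match should only flag a
possible duplicate. Domain name aliases, numeric IP addresses, and redirects
are not handled here.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of script names a server might serve by default.
DEFAULT_SCRIPTS = ("index.cgi", "index.php", "index.html")
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_base_url(url):
    """Return a comparison key for a baseURL (a sketch, not a standard)."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()

    # Keep the port only when it differs from the scheme's default.
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"

    # Strip a trailing default script name, keeping the directory slash.
    path = parts.path
    for script in DEFAULT_SCRIPTS:
        if path.endswith("/" + script):
            path = path[: -len(script)]
            break
    if not path:
        path = "/"

    # Drop the query (e.g. verb=Identify) and any fragment.
    return urlunsplit((scheme, netloc, path, "", ""))
```

With this sketch, http://naca.larc.nasa.gov/oai2.0/ and
http://naca.larc.nasa.gov/oai2.0/index.cgi normalize to the same key.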
The rules that I've used for resolving duplicates include:

  - If the baseURL returned by the Identify response is the same regardless
    of the URL originally requested, and that baseURL actually works (on
    rare occasions it hasn't), I use that baseURL.
  - For many repositories, it seems that the baseURL reported in the
    Identify response simply reflects the URL originally used for the
    request. In these cases, if I've discovered multiple URLs for the same
    repository, I use the baseURL which is shortest.
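
The two rules above can be sketched as a small Python function. This is an
illustration only: the name choose_base_url, the shape of the "reported"
mapping, and the "works" callback are all assumptions, and falling back to the
shortest URL in every case not covered by the first rule is a simplification
of what was described.

```python
def choose_base_url(reported, works=lambda url: True):
    """Pick one baseURL for a repository known under several URLs.

    reported maps each URL that was requested to the baseURL that its
    Identify response returned. works is a hypothetical callback that
    checks whether a baseURL actually answers requests.
    """
    distinct = set(reported.values())

    # Rule 1: every request reports the same baseURL, and it works.
    if len(distinct) == 1:
        (candidate,) = distinct
        if works(candidate):
            return candidate

    # Rule 2 (and fallback): use the shortest of the discovered URLs,
    # breaking ties alphabetically so the result is deterministic.
    return min(reported, key=lambda url: (len(url), url))
```

For the two NACA URLs quoted above, if each Identify response merely echoes
the URL that was requested, this returns http://naca.larc.nasa.gov/oai2.0/.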
Anyway, now that I have a good-sized registry built up, I am being more
careful in adding new repositories to prevent duplicates. I am also working
on ideas to better automate the discovery of possible duplicates, such as
URL normalization, domain name lookups, or Identify response comparisons.
If anyone has any ideas please share them.
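
As one possible starting point for the Identify response comparison idea, the
Python sketch below reduces a response to a fingerprint built from fields that
should not depend on which URL was requested. It assumes OAI-PMH 2.0
responses. The function name identify_fingerprint and the choice of compared
fields are assumptions, and two distinct repositories running identically
configured software could collide, so equal fingerprints only suggest a
possible duplicate.

```python
import hashlib
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Identify fields compared; baseURL is left out on purpose because it
# often just echoes the URL that was requested.
COMPARED = (
    "repositoryName",
    "protocolVersion",
    "adminEmail",
    "earliestDatestamp",
    "deletedRecord",
    "granularity",
)


def identify_fingerprint(xml_text):
    """Hash the URL-independent fields of an OAI-PMH 2.0 Identify response."""
    root = ET.fromstring(xml_text)
    identify = root.find(OAI_NS + "Identify")
    if identify is None:
        raise ValueError("not an Identify response")

    values = []
    for name in COMPARED:
        # adminEmail may repeat, so collect every occurrence in order.
        for elem in identify.findall(OAI_NS + name):
            values.append(f"{name}={(elem.text or '').strip()}")
    return hashlib.sha256("\n".join(values).encode("utf-8")).hexdigest()
```

Two URLs whose responses produce the same fingerprint could then be queued
for a manual check rather than merged automatically.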
Thanks,
Tom
--
Thomas Habing
Research Programmer, Digital Library Projects
University of Illinois at Urbana-Champaign
155 Grainger Engineering Library Information Center, MC-274
thabing@uiuc.edu, (217) 244-4425
http://dli.grainger.uiuc.edu