[OAI-implementers] harvester tools
Kat Hagedorn
khage at umich.edu
Mon Aug 9 11:56:01 EDT 2004
Thank you to everyone who responded about harvester tools. Each person
seems to be using a different harvester (!), but the comments have
given us an idea of where we want to focus our attentions. Two people
also let me know in person about harvesters they use. Those are listed
first below.
Thanks again,
- Kat
On Jul 20, 2004, at 2:15 PM, Kat Hagedorn wrote:
> Hello all,
>
> We are investigating switching to a different harvester tool and
> thought that a good first step would be to poll this list about their
> use of harvesters.
>
> If you harvest OAI records:
>
> 1. What harvester tool do you use? Version number?
>
> 2. Are you pleased with the tool? What do you like and not like about
> it?
>
> Please send responses directly to me and I'll summarize for the list.
> (Anonymously if preferred.)
>
> Thanks,
> - Kat
>
> -------------------
> Kat Hagedorn
> OAIster/Metadata Harvesting Librarian
> DLXS Bibliographic Class Coordinator
> DLXS Text Class Collections Co-coordinator
> Digital Library Production Service
> University of Michigan
>
> http://www.oaister.org/
> http://www.dlxs.org/
> email: khage at umich.edu
> phone: 734-615-7618
>
>
> _______________________________________________
> OAI-implementers mailing list
> List information, archives, preferences and to unsubscribe:
> http://openarchives.org/mailman/listinfo/oai-implementers
-------------------------------------------------------------
Virginia Tech Perl Harvester
http://oai.dlib.vt.edu/odl/software/harvest/
-------------------------------------------------------------
Simeon's Perl Harvester
contact Simeon Warner for more info (simeon at cs.cornell.edu)
-------------------------------------------------------------
I really like the simple Perl harvester (MyOAI). I'm not sure if there
is any more development on this harvester, but it is great for
troubleshooting problems with your broker/provider, and it harvests
most of the provider sites. There is no GUI front end, though: it is
pretty much a Unix type of application, where you have to configure
files and then run it from the Unix shell command line.
It was a big help to us when we were setting up seven new data
providers with various problems: we were able to turn logging on and
see each HTTP request being sent for all of the OAI verbs (ListRecords,
ListIdentifiers, etc.).
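
As an illustration of the kind of request logging described above, here
is a minimal Python sketch. It is not taken from MyOAI; the function
name build_request, the logger name, and the example base URL are all
assumptions made for this example.

```python
import logging
from urllib.parse import urlencode

# Hypothetical logger name; any name works.
log = logging.getLogger("oai.harvest")


def build_request(base_url, verb, **params):
    """Return the GET URL for one OAI-PMH request and log it."""
    query = {"verb": verb}
    # Drop unset arguments so the query only carries what was asked for.
    query.update({k: v for k, v in params.items() if v is not None})
    url = base_url + "?" + urlencode(query)
    log.info("OAI request: %s", url)
    return url
```

With logging.basicConfig(level=logging.INFO) in effect, a call such as
build_request("http://example.org/oai", "ListIdentifiers",
metadataPrefix="oai_dc") writes the full request URL to the log before
it is sent, which is usually enough to see what a provider was asked.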
-------------------------------------------------------------
I've written several harvesters now, and I'm not happy with any of
them. The problem is that so many repositories have badly encoded
characters that I can't rely on DOM or SAX during the harvesting
process without having them choke on the bad characters.
Harvesters are trivial to write. Thom Hickey wrote one with a single
page of Python code
(http://www.oclc.org/research/software/oai/2page.htm), and I wrote one
in XSLT that is even simpler, albeit a bit longer
(http://errol.oclc.org/oai:xmlregistry.oclc.org:xoai/xoaiharvester.xsl).
Because they all rely on the data being good, though, they fail way too
often.
My advice is to find an implementation that captures the responses as
raw bytes and then greps for the resumptionToken rather than rely on
XML tools to parse for it. A page or two of code is all it should take.
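
A minimal Python sketch of this raw-bytes approach might look like the
following. It is an illustration under stated assumptions, not any
respondent's code: the name find_resumption_token is made up for the
example, and the pattern assumes the element is written without a
namespace prefix.

```python
import re
from xml.sax.saxutils import unescape

# Match <resumptionToken ...>value</resumptionToken> directly in the bytes.
# Assumption: the element carries no namespace prefix.
TOKEN_RE = re.compile(rb"<resumptionToken[^>]*>([^<]*)</resumptionToken>")


def find_resumption_token(raw):
    """Return the resumptionToken found in a raw response, or None.

    The response is never parsed as XML, so badly encoded characters
    elsewhere in the records cannot make the search fail.
    """
    match = TOKEN_RE.search(raw)
    if match is None:
        return None  # no element, or a self-closing (empty) one
    token = match.group(1).strip().decode("utf-8", errors="replace")
    # Undo &amp;, &lt; and &gt; so the token can be sent back as-is.
    return unescape(token) or None
```

A harvest loop would save each response to disk as bytes, call
find_resumption_token on it, and stop when the result is None. The
saved files can then be cleaned up and parsed separately.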
-------------------------------------------------------------
I just started using REAP from UIUC. It is Windows-based. After using
it for only two days, it seems quite capable. It may prove to be weak
in spots, but I probably won't be aware of those for a few weeks.
-------------------------------------------------------------
We used to use ARC from Old Dominion, but my digital library research
crew now just codes up ad hoc harvesters for different applications.
We've developed various code chunks that do various parts of the
process. We've also experimented with Greenstone's harvester module for
smaller applications.
-------------------------------------------------------------
I use 'harvester2' from Jeff Young (OCLC):
http://www.oclc.org/research/software/oai/harvester2.htm
I like this tool because it is a simple library. I needed this type of
library for my project.
But there are some problems (bugs): with 'retry later', it seems to
retry indefinitely. With compression, if the Content-Encoding header is
null and the content is encoded, it does not detect it.
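
For illustration, both problems can be guarded against in a few lines.
The Python sketch below is not harvester2 code. The names maybe_gunzip
and fetch_with_retries are hypothetical, and fetch stands in for
whatever function performs the HTTP request and returns a (status,
retry_after_seconds, body) tuple.

```python
import gzip
import time

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def maybe_gunzip(body):
    """Decompress body if it looks like gzip, whatever the headers say."""
    if body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body


def fetch_with_retries(fetch, max_attempts=5, sleep=time.sleep):
    """Call fetch() until the status is not 503, at most max_attempts times.

    fetch is a hypothetical callable returning
    (status, retry_after_seconds, body).
    """
    for attempt in range(1, max_attempts + 1):
        status, retry_after, body = fetch()
        if status != 503:
            return maybe_gunzip(body)
        if attempt < max_attempts:
            sleep(retry_after)  # honour the server's Retry-After value
    raise RuntimeError("gave up after %d attempts" % max_attempts)
```

Sniffing the first two bytes works here because an XML response cannot
begin with them, and the attempt cap turns an endless 'retry later'
loop into an error the caller can report.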
-------------------------------------------------------------
Internally, in the LANL repository infrastructure, we use OCLC's
OAIHarvester version 1 for _big time_ harvesting of complex objects
(not DC records but actual content represented using MPEG-21 DIDL).
We have built the OAI-PMH Federator (see our JCDL paper
http://lib-www.lanl.gov/~herbertv/papers/jcdl2004-submitted-draft.pdf)
on the basis of this Harvester.
We love it. It's faster than OAIHarvester2. Jeff Young keeps
supporting it, and actually implemented optimizations as a result of
our feedback. On demand. What more can one ask for!?
-------------------------------------------------------------
I use Celestial (http://celestial.eprints.org/), the latest version,
and I update it when I can.
Yes. [pleased with the tool]
I like the web interface and that it is written in Perl.
I don't like:
- how it manages the '&' sign [it transforms them into "&"]
- that I can't harvest a selection of sets; I need to harvest the whole
  site.