[OAI-implementers] Experience with large-scale harvesting
Hickey, Thom
hickey@oclc.org
Fri, 13 Jun 2003 15:59:59 -0400
Since creating a one-page Python OAI-PMH harvester (see an improved, even
shorter, version at http://purl.oclc.org/net/hickey/oai/harvest.py), I've
been seeing how our OAI repositories perform on full harvests.
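For anyone curious what the basic loop looks like, here is a rough sketch
(not the harvest.py script itself; the endpoint and names below are only
placeholders) of a ListRecords harvest that follows resumptionTokens until
the repository stops issuing them:

  import re
  import urllib.parse
  import urllib.request

  def harvest(base_url, prefix="oai_dc"):
      """Yield each ListRecords response page, following resumptionTokens."""
      params = {"verb": "ListRecords", "metadataPrefix": prefix}
      while True:
          url = base_url + "?" + urllib.parse.urlencode(params)
          with urllib.request.urlopen(url) as resp:
              page = resp.read().decode("utf-8")
          yield page
          # A repository signals completion with an empty or absent token.
          m = re.search(r"<resumptionToken[^>]*>([^<]*)</resumptionToken>", page)
          if not m or not m.group(1).strip():
              break
          params = {"verb": "ListRecords",
                    "resumptionToken": m.group(1).strip()}

  # e.g. save each response page to disk (placeholder endpoint):
  # for i, page in enumerate(harvest("http://example.org/oai")):
  #     open("page%05d.xml" % i, "w", encoding="utf-8").write(page)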
OCLC Research runs two main repositories of metadata about theses and
dissertations:
XTCat (http://alcme.oclc.org/xtcat/) with some 4.3 million bibliographic
records
NDLTD (http://alcme.oclc.org/ndltd/) which has around 38,000 records.
My workstation can harvest XTCat in around 90 minutes if compression is used
(over a 10 megabit line). Without compression it takes at least half again
as long, and my machine is much busier. I was slightly surprised at the
difference in bytes-received that compression makes: 8:1 for the larger
database and 7:1 for the smaller.
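The compression is just ordinary HTTP content negotiation; something along
these lines works, assuming the repository honors Accept-Encoding (again a
sketch, not the actual harvest.py code):

  import gzip
  import urllib.request

  def fetch(url):
      """Fetch a URL, asking for gzip and transparently decompressing it."""
      req = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
      with urllib.request.urlopen(req) as resp:
          data = resp.read()
          if resp.headers.get("Content-Encoding") == "gzip":
              data = gzip.decompress(data)
      return data.decode("utf-8")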
Harvesting at home via a cable modem takes slightly less than 4 hours to
harvest the 4.3 million records. That is about 300 records/second. Each
record is about 1,000 bytes (uncompressed).
The 90 minute harvest is 800 records/second (800,000 bytes/second). The
best time observed for doing two harvests simultaneously was 120 minutes, or
1,200 records/second. The most records/second observed was slightly more
than 1,400 records/second when running four simultaneous harvests, probably
close to the maximum rate the repository can support.
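The simultaneous harvests were nothing fancier than several independent
harvest loops running in parallel. One way to drive that from a single
script, purely as an illustration (placeholder endpoint, and harvest() as
sketched earlier), is with threads:

  import threading
  import time

  def timed_harvest(name, base_url):
      start = time.time()
      pages = sum(1 for _ in harvest(base_url))  # harvest() as sketched above
      print("%s: %d pages in %.0f seconds" % (name, pages, time.time() - start))

  threads = [threading.Thread(target=timed_harvest,
                              args=("run-%d" % i, "http://example.org/oai"))
             for i in range(4)]
  for t in threads:
      t.start()
  for t in threads:
      t.join()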
Running multiple harvests simultaneously did find a weakness in the
repository code, which would occasionally run out of memory. We seem to
have that fixed now, but I expect that error recovery is important for
reliably accomplishing large harvests.
--Th