[OAI-general] De-dupping Records

Tim Brody tim@tim.brody.btinternet.co.uk
Thu, 27 Mar 2003 11:10:37 +0000


In theory this is easy to do. Each time a service sees a new record it 
performs a search in its existing collection for near-matches (e.g. 
title+author query). Where an existing record is found to be very 
similar the service bundles together the OAI records under a single 
point (which then links to each of the different locations).
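
As a rough illustration of the matching step described above, here is a minimal Python sketch. The record fields ("title", "author"), the normalisation rule, the 0.9 threshold, and the Collection class are all assumptions made for the example; it is not how Citebase or any other service actually does it, and a linear scan like this would not scale to a large collection.

```python
import re
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case, drop punctuation and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(cleaned.split())


def similarity(a, b):
    """Return a ratio in [0, 1] describing how alike two strings are."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


class Collection:
    """Hypothetical service-side collection that bundles near-duplicate records."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold  # assumed cut-off for "very similar"
        self.bundles = []           # each bundle is a list of records

    def add(self, record):
        """Attach the record to a matching bundle, or start a new one."""
        for bundle in self.bundles:
            first = bundle[0]
            if (similarity(record["title"], first["title"]) >= self.threshold
                    and similarity(record["author"], first["author"]) >= self.threshold):
                bundle.append(record)
                return bundle
        bundle = [record]
        self.bundles.append(bundle)
        return bundle
```

Each bundle then stands for the single point in the search results, and the records inside it supply the links to the different locations.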

Another advantage of this approach is bundling together multiple parts 
of the same document, or multiple versions (neither of which you want to 
see separately in a search result).

If I recall correctly, Google has looked at this issue as well - but it is faced 
with the technical problem of creating an edit-distance graph between 3 
billion items, whereas for the research literature citations can be used 
to de-dupe.

I am addressing this in my next prototype for Citebase, although things 
become a little hairy building a system which has an abstraction above 
OAI records (a stream of data, which is not necessarily in order, that 
can have dupes, revisions, and parts).
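
To make the "stream of data" problem concrete, here is a small Python sketch of one way such an abstraction could absorb records that arrive out of order. It relies on every OAI record header carrying an identifier and a datestamp; the "doc_key" field (the document a record was matched to by some earlier de-duplication step) and the DocumentStore class are hypothetical names invented for this example, and it assumes all datestamps use the same granularity.

```python
from collections import defaultdict


class DocumentStore:
    """Hypothetical layer above OAI records: one document, many records."""

    def __init__(self):
        # doc_key -> {record identifier -> latest record seen for it}
        self.documents = defaultdict(dict)

    def ingest(self, record):
        """Store the record unless an equal or newer revision is already held.

        Returns True if the store changed, False for a duplicate or a
        stale revision that arrived late.
        """
        versions = self.documents[record["doc_key"]]
        current = versions.get(record["identifier"])
        # UTC datestamps of the same granularity sort correctly as strings.
        if current is None or record["datestamp"] > current["datestamp"]:
            versions[record["identifier"]] = record
            return True
        return False

    def locations(self, doc_key):
        """List the identifiers of every record bundled under a document."""
        return sorted(self.documents.get(doc_key, {}))
```

Parts of a document could be handled the same way by giving each part its own identifier under a shared doc_key, though deciding which records belong together is the hard part and is not shown here.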

All the best,
Tim.
Citebase Search

Peter Green wrote:
> Christopher
> 
> Excuse my ignorance, not currently having a comprehensive understanding of
> OAI, but is the bigger issue not one of duplication (rather than duplication
> of effort). If your physicist places the same item in both archives, and a
> service provider (say OAIster for example) harvests from both archives, is
> there a mechanism to de-dupe the two identical items? If not then it is
> likely that over time duplicate entries will present a major problem.
> 
> Is this being addressed??
> 
> Cheers
> 
> Peter
> 
> -----Original Message-----
> From: Christopher Gutteridge [mailto:cjg@ecs.soton.ac.uk] 
> Sent: Tuesday, 18 March 2003 4:10 PM
> To: BOAI Forum
> Cc: September 1998 American Scientist Forum; oai-eprints@fafner.openlib.org;
> OAI-general@oaisrv.nsdl.cornell.edu; SPARC-IR@arl.org
> Subject: [OAI-general] Re: [BOAI] Re: Cliff Lynch on Institutional Archives
> 
> 
> 
> Disciplinary/subject archives vs. Institutional/Organisation/Region based
> archives. This is going to be a key challenge now open archives begin to
> gain momentum. 
> 
> For example, we are planning a University-wide eprints archive. I am 
> concerned that some physicists will want to place their items in both the
> university eprints service AND the arXiv physics archive. They may 
> be required to use the university service, but want to use arXiv as it is
> the primary source for their discipline. This is a duplication of 
> effort and a potential irritation.
> 
> Ultimately, of course, I'd hope that disciplinary archives will be replaced
> with subject-specific OAI service providers harvesting from the
> institutional archives. But there is going to be a very long transition
> period in which the solution evolves from our experience.
> 
> What I'm asking is: has anyone given consideration to ways of smoothing over
> this duplication of effort? Possibly some negotiated automated process for
> institutional archives uploading to the subject archive, or at least
> assisting the author in the process.
> 
> This isn't the biggest issue, but it'd be good to address it before it
> becomes more of a problem.
> 
>   Christopher Gutteridge
>   GNU EPrints Head Developer
>   http://software.eprints.org/
> 
> On Sun, Mar 16, 2003 at 02:15:56 +0000, Stevan Harnad wrote:
> 
>>On Sat, 15 Mar 2003, Thomas Krichel wrote:
>>
>>
>>>  Stevan Harnad writes:
>>>
>>>sh> There is no need -- in the age of OAI-interoperability -- for 
>>>sh> institutional archives to "feed" central disciplinary archives:
>>>
>>>  I do not share what I see as a  blind faith in interoperability
>>>  through a technical protocol.
>>
>>I am quite happy to defer to the technical OAI experts on this one, 
>>but let us put the question precisely:
>>
>>Thomas Krichel suggests that institutional (OAI) data-archives
>>(full-texts) should "feed" disciplinary (OAI) data-archives, because 
>>OAI-interoperability is somehow not enough. I suggest that 
>>OAI-interoperability (if I understand it correctly) should be enough. 
>>No harm in redundant archiving, of course, for backup and security, 
>>but not necessary for the usage and functionality itself. In fact, if 
>>I understand correctly the intent of the OAI distinction between OAI 
>>data-providers -- http://www.openarchives.org/Register/BrowseSites.pl
>>-- and OAI service-providers -- 
>>http://www.openarchives.org/service/listproviders.html
>>-- it is not the full-texts of data-archives that need to be "fed" to 
>>(i.e., harvested by) the OAI service providers, but only their 
>>metadata.
>>
>>Hence my conclusion that distributed, interoperable OAI institutional 
>>archives are enough (and the fastest route to open-access). No need to 
>>harvest their contents into central OAI discipline-based archives 
>>(except perhaps for redundancy, as backup). Their OAI interoperability 
>>should be enough so that the OAI service-providers can (among other 
>>things) do the "virtual aggregation" by discipline (or any other 
>>computable
>>criterion) by harvesting the metadata alone, without the need to harvest
>>full-text data-contents too.
>>
>>It should be noted, though, that Thomas Krichel's excellent RePec 
>>archive and service in Economics -- http://repec.org/ -- goes well 
>>beyond the confines of OAI-harvesting! RePec harvests non-OAI content 
>>too, along lines similar to the way ResearchIndex/citeseer -- 
>>http://citeseer.nj.nec.com/cs -- harvests non-OAI content in computer 
>>science. What I said about there being no need to "feed" institutional 
>>OAI archive content into disciplinary OAI archives certainly does not 
>>apply to *non-OAI* content, which would otherwise be scattered 
>>willy-nilly all over the net and not integrated in any way. Here 
>>RePec's and ResearchIndex's harvesting is invaluable, especially as 
>>RePec already does (and ResearchIndex has announced that it plans to) 
>>make all its harvested content OAI-compliant!
>>
>>To summarize: The goal is to get all research papers, pre- and 
>>post-peer-review, openly accessible (and OAI-interoperable) as soon as 
>>possible. (These are BOAI Strategies 1 [self-archiving] and 2 
>>[open-access journals]: http://www.soros.org/openaccess/read.shtml
>>). In principle this can be done by (1) self-archiving them in central 
>>OAI disciplinary archives like the Physics arXiv (the biggest and 
>>first of its kind) -- http://arxiv.org/show_monthly_submissions
>>-- by (2) self-archiving them in distributed institutional OAI 
>>Archives -- http://www.ecs.soton.ac.uk/~harnad/Temp/tim.ppt -- by (3) 
>>self-archiving them on arbitrary Web and FTP sites (and hoping they 
>>will be found or harvested by services like Repec or ResearchIndex) or 
>>by (4) publishing them in open-access journals (BOAI Strategy 2: 
>>http://www.soros.org/openaccess/journals.shtml ).
>>
>>My point was only that because researchers and their institutions
>>(*not* their disciplines) have shared interests vested in maximizing 
>>their joint research impact and its rewards, institution-based 
>>self-archiving (2) is a more promising way to go -- in the age of 
>>OAI-interoperability -- than discipline-based self-archiving (1), even 
>>though the latter began earlier. It is also obvious that both (1) and
>>(2) are preferable to arbitrary Web and FTP self-archiving (3), which 
>>began even earlier (although harvesting arbitrary Website and FTP 
>>contents into OAI-compliant Archives is still a welcome makeshift 
>>strategy until the practice of OAI self-archiving is up to speed). 
>>Creating new open-access journals and converting the established 
>>(20,000) toll-access journals to open-access is desirable too, but it 
>>is obviously a much slower and more complicated path to open access 
>>than self-archiving, so should be pursued in parallel.
>>
>>My conclusion in favor of institutional self-archiving is based on the 
>>evidence and on logic, and it represents a change of thinking, for I 
>>had originally advocated (3) Web/FTP self-archiving -- 
>>http://www.arl.org/scomm/subversive/toc.html -- then switched 
>>allegiance to central self-archiving (1), even creating a 
>>discipline-based archive: http://cogprints.ecs.soton.ac.uk/ But with 
>>the advent of OAI in 1999, plus a little reflection, it became 
>>apparent that institutional self-archiving (2) was the fastest, most 
>>direct, and most natural road to open access: http://www.eprints.org/ 
>>And since then its accumulating momentum seems to be confirming that 
>>this is indeed so: 
>>http://www.ecs.soton.ac.uk/~harnad/Hypermail/Amsci/2212.html
>>http://www.ecs.soton.ac.uk/~harnad/Temp/tim.ppt
>>
>>
>>>  The primary sense of belonging
>>>  of a scholar in her research activities is with the disciplinary
>>>  community of which she thinks herself a part... It certainly
>>>  is not with the institution.
>>
>>That may or may not be the case, but in any case it is irrelevant to 
>>the question of which is the more promising route to open-access. Our 
>>primary sense of belonging may be with our family, our community, our 
>>creed, our tribe, or even our species. But our rewards (research grant 
>>funding and overheads, salaries, postdocs and students attracted to 
>>our research, prizes and honors) are intertwined and shared with our 
>>institutions (our employers) and not our disciplines (which are often 
>>in fact the locus of competition for those same rewards!)
>>
>>
>>>  Therefore, if you want to fill
>>>  institutional archives---which I agree is the best long-run way
>>>  to enhance access and preservation to scholarly research--- [the]
>>>  institutional archive has to be accompanied by a discipline-based
>>>  aggregation process.
>>
>>But the question is whether this "aggregation" needs to be the 
>>"feeding" of institutional OAI archive contents into disciplinary OAI 
>>archives, or merely the "feeding" of OAI metadata into OAI services.
>>
>>
>>>   The RePEc project has produced such an aggregator
>>>  for economics for a while now. I am sure that other, similar
>>>  projects will follow the same aims, but, with the benefit of
>>>  hindsight, offer superior service. The lack of such services
>>>  in many disciplines,  or the lack of interoperability between
>>>  disciplinary and institutional archives, are major obstacles to
>>>  filling the institutional archives.  There are no
>>>  inherent contradictions between institution-based archives
>>>  and disciplinary aggregators,
>>
>>There is no contradiction. In fact, I suspect this will prove to be a 
>>non-issue, once we confirm that (a) we agree on the need for 
>>OAI-compliance and (b) "aggregation" amounts to metadata-harvesting 
>>and OAI service-provision when the full-texts in the institutional 
>>archive are OAI-compliant (and calls for full-text harvesting only 
>>if/when they are not). Content "aggregation," in other words, is a 
>>paper-based notion. In the online era, it merely means digital sorting 
>>of the pointers to the content.
>>
>>
>>>  In the paper that Stevan refers to, Cliff Lynch writes,
>>>  at http://www.arl.org/newsltr/226/ir.html
>>>
>>>cl> But consider the plight of a faculty member seeking only broader 
>>>cl> dissemination and availability of his or her traditional journal 
>>>cl> articles, book chapters, or perhaps even monographs through use 
>>>cl> of the network, working in parallel with the traditional 
>>>cl> scholarly publishing system.
>>>
>>>  I am afraid there are more and more such faculty members. Many
>>>  of the research papers found over the Internet are deposited
>>>  in this way. This trend is growing, not declining.
>>
>>You mean self-archiving in arbitrary non-OAI author websites? There is 
>>another reason why institutional OAI archives and official 
>>institutional self-archiving policies (and assistance) are so 
>>important. In reality, it is far easier to deposit and maintain one's 
>>papers in institutional OAI archives like Eprints than to set up and 
>>maintain one's own website. All that is needed is a clear official 
>>institutional policy, plus some startup help in launching it. (No such 
>>thing is possible at a "discipline" level.) 
>>http://www.ecs.soton.ac.uk/~lac/archpol.html
>>http://www.eprints.org/self-faq/#institution-facilitate-filling
>>http://www.ecs.soton.ac.uk/~harnad/Temp/Ariadne-RAE.htm
>>http://paracite.eprints.org/cgi-bin/rae_front.cgi
>>
>>
>>>cl> Such a faculty member faces several time-consuming problems. He 
>>>cl> or she must exercise stewardship over the actual content and its
>>>cl> metadata: migrating the content to new formats as they evolve 
>>>cl> over time, creating metadata describing the content, and ensuring 
>>>cl> the metadata is available in the appropriate schemas and formats 
>>>cl> and through appropriate protocol interfaces such as open archives 
>>>cl> metadata harvesting.
>>>
>>>  Sure, but academics do not like their work-, and certainly
>>>  not their publishing-habits, [to] be interfered with by external
>>>  forces. Organizing academics is like herding cats!
>>
>>I am sure academics didn't like to be herded into publishing with the 
>>threat of perishing either. Nor did they like switching from paper to 
>>word-processors. Their early counterparts probably clung to the oral 
>>tradition, resisting writing too; and monks did not like to be herded 
>>from their peaceful manuscript-illumination chambers to the clamour of 
>>printing presses. But where there is a causal contingency -- as there 
>>is between (a) the research impact and its rewards, which academics 
>>like as much as anyone else, and (b) the accessibility of their 
>>research -- academics are surely no less responsive than Prof. 
>>Skinner's pigeons and rats to those causal contingencies, and which 
>>buttons they will have to press in order to maximize their rewards! 
>>http://www.ecs.soton.ac.uk/~harnad/Temp/self-archiving.htm
>>
>>Besides, it is not *publishing* habits that need to be changed, but
>>*archiving* habits, which are an online supplement, not a substitute, 
>>for existing (and unchanged) publishing habits.
>>
>>
>>>cl> Faculty are typically best at creating new
>>>cl> knowledge, not maintaining the record of this process of 
>>>cl> creation. Worse still, this faculty member must not only manage 
>>>cl> content but must manage a dissemination system such as a personal 
>>>cl> Web site, playing the role of system administrator (or the 
>>>cl> manager of someone serving as a system administrator).
>>>
>>>  There are a lot of ways in which to maintain a web site or to get
>>>  access to a maintained one. It is a customary activity these days and
>>>  no longer requires much technical expertise. A primitive integration
>>>  of the contents can be done by Google; it requires no metadata.
>>>  Academics don't care  about long-run preservation, so that problem
>>>  remains unsolved. In the meantime, the academic who uploads papers to a web
>>>  site takes steps to resolve the most pressing problem, access.
>>
>>Agreed. And uploading it into a departmental OAI Eprints Archive is
>>by far the simplest way and most effective way to do all of that. All it
>>needs is a policy to mandate it:
>>http://www.ecs.soton.ac.uk/~lac/archpol.html
>>
>>
>>>cl> Over the past few years, this has ceased to be a reasonable 
>>>cl> activity for most amateurs; software complexity, security risks, 
>>>cl> backup requirements, and other problems have generally relegated 
>>>cl> effective operation of Web sites to professionals who can exploit 
>>>cl> economies of scale, and who can begin each day with a review of 
>>>cl> recently issued security patches.
>>>
>>>  These are technical concerns. When you operate a linux box
>>>  on the web you simply fire up a script that will download
>>>  the latest version. That is easy enough. Most departments
>>>  have separate web operations. Arguing for one institutional
>>>  archive for digital contents is akin to calling for a single web
>>>  site for an institution. The diseconomies of scale of central
>>>  administration impose other types of costs than the ones that it was to
>>>  reduce. The secret is to find a middle way.
>>
>>I couldn't quite follow all of this. The bottom line is this: The free 
>>Eprints.org software (for example) can be installed within a few days. 
>>It can then be replicated to handle all the departmental or research 
>>group archives a university wants, with minimal maintenance time or 
>>costs. The rest is just down to self-archiving, which takes a few 
>>minutes for the first paper, and even less time for subsequent papers 
>>(as the repeating metadata -- author, institution, etc., can be 
>>"cloned" into each new deposit template). An institution may wish to 
>>impose an institutional "look" on all of its separate eprints 
>>archives; but apart from that, they can be as autonomous and as 
>>distributed and as many as desired: OAI-interoperability works locally 
>>just as well as it does globally.
>>
>>
>>>cl> Today, our faculty time is being wasted, and expended 
>>>cl> ineffectively, on system administration activities and content 
>>>cl> curation. And, because system administration is ineffective, it 
>>>cl> places our institutions at risk: because faculty are generally 
>>>cl> not capable of responding to the endless series of security 
>>>cl> exposures and patches, our university networks are riddled with 
>>>cl> vulnerable faculty machines intended to serve as points of 
>>>cl> distribution for scholarly works.
>>>
>>>  This is the fight many faculty face every day, where they
>>>  want to innovate scholarly communication, but someone
>>>  in the IT department does not give the necessary permission
>>>  for network access...
>>
>>I don't think I need to get into this. It's not specific to 
>>self-archiving, and a tempest in a teapot as far as that is concerned. 
>>An efficient system can and will be worked out once there is an 
>>effective institutional self-archiving policy. There are already 
>>plenty of excellent examples, such as CalTech: 
>>http://library.caltech.edu/digital/
>>See also:
>>http://software.eprints.org/#ep2
>>
>>Stevan Harnad
> 
>