[OAI-implementers] XML encoding problems with DSpace at MIT
Hussein Suleman
hussein@vt.edu
Sat, 15 Feb 2003 17:31:36 +0200
hi
i think Tim poses a very relevant question: do we deal with the
so-called "real-world" encoding problems or do we try to encourage
people to fix their implementations? (of course, for research purposes,
we may end up doing both :))
personally, the code i distribute to others does quite a lot of XML
cleaning in the data provider, but none at all in the harvester. i think
the basic philosophy i'm following is: clean data as close to the source
as possible. also, i believe one of the reasons the adminEmail field in
Identify responses is required is so that a service provider can contact
the administrator if there are problems with the data.
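
a minimal sketch of what "cleaning at the source" could look like in a data provider, assuming Python and XML 1.0 output. this is not the code i distribute and not Simeon's utf8conditioner; the names clean_for_xml and dc_element are hypothetical. it replaces characters that XML 1.0 forbids (such as the 0x0B, 0x0C and 0x0E reported in the log below) with "?" (0x3F), then escapes markup characters before the value is written into an element.

```python
import re
from xml.sax.saxutils import escape

# Characters outside the XML 1.0 "Char" production: C0 controls other than
# tab (0x09), LF (0x0A) and CR (0x0D), lone surrogates, and U+FFFE/U+FFFF.
_XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]")


def clean_for_xml(text, substitute="?"):
    """Replace every character XML 1.0 does not allow with `substitute`."""
    return _XML_ILLEGAL.sub(substitute, text)


def dc_element(name, value):
    """Build one Dublin Core element (hypothetical helper) from raw metadata."""
    # Clean first, then escape &, < and > so the value is safe as element content.
    return "<dc:%s>%s</dc:%s>" % (name, escape(clean_for_xml(value)), name)


if __name__ == "__main__":
    # A form feed (0x0C) in the stored title is substituted before output.
    print(dc_element("title", "R&D report\x0c"))
```

doing the substitution when the record is written, rather than in each harvester, means every service provider sees the same well-formed response.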
and now that the hype about OAI2 is dying down, i wonder how much (if
any) more testing we need. i have some ideas to enhance, complement and
possibly even replace the repository explorer in the next year ... it
all depends on finding time and/or students/colleagues with time :)
ttfn,
----hussein
Tim Brody wrote:
> http://celestial.eprints.org/cgi-bin/status?action=repository;metadataFormat=17
>
>
> Harvested 752 records - I've also implemented some
> character-substitution to fix encoding errors, although this is probably
> not as proficient as Simeon's!
>
> The question is, the more harvesters implement fixes the less pressure
> there is on repositories to fix their output, so should harvesters
> accept bad-XML?
> (once that question is answered, harvesters have to decide how much
> normalisation of metadata they do :-)
>
> All the best,
> Tim.
>
> Simeon Warner wrote:
>
>> In my recent post to oai-general
>> http://www.openarchives.org/pipermail/oai-general/2003-February/000258.html
>>
>> I said I'd post a note about the current output of DSpace at MIT to this
>> list (which seems a more appropriate forum). I just ran a harvest and got
>> the log shown below, I've added comments in [].
>>
>> Cheers,
>> Simeon.
>>
>>
>>
>> simeon@ice 14Feb03>more log
>> oaiharvest.pl: Harvest from http://hpds1.mit.edu/oai/ using POST
>> OAIGet: Doing POST to http://hpds1.mit.edu/oai/ args: verb=Identify
>> OAIGet: Note - Got Content-Encoding 'gzip', decoded with 'gunzip -c'
>> OAIGet: Got 200 OK (627bytes decoded to 1328bytes)
>>
>> [nice, DSpace implements gzip content coding]
>>
>> oaiharvest.pl: Identify reports OAI-PMH version 2.0
>> oaiharvest.pl: Doing complete harvest.
>> OAIGet: Doing POST to http://hpds1.mit.edu/oai/ args:
>> verb=ListMetadataFormats
>> OAIGet: Note - Got Content-Encoding 'gzip', decoded with 'gunzip -c'
>> OAIGet: Got 200 OK (307bytes decoded to 643bytes)
>> OAIGet: Doing POST to http://hpds1.mit.edu/oai/ args: verb=ListSets
>> OAIGet: Note - Got Content-Encoding 'gzip', decoded with 'gunzip -c'
>> OAIGet: Got 200 OK (715bytes decoded to 2140bytes)
>> OAIGet: Doing POST to http://hpds1.mit.edu/oai/ args:
>> metadataPrefix=oai_dc&verb=ListRecords
>> OAIGet: Note - Got Content-Encoding 'gzip', decoded with 'gunzip -c'
>> OAIGet: Got 200 OK (254096bytes decoded to 1392720bytes)
>> oaiharvest.pl: UTF-8/XML errors in ListRecords.1:
>>
>> [oops, expat parser fails on response
>> my harvester now attempts to do replacement on bad XML/UTF8 bytes/chars
>> using my utf8conditioner, details at
>> http://www.cs.cornell.edu/people/simeon/software/utf8conditioner/
>> Unless the response can be parsed we can't even know if there is a
>> resumptionToken...]
>> utf8conditioner: Line 320, char 81453, byte 81491: code not allowed in
>> XML: 0x000C, substituted 0x3F
>> Line 324, char 81713, byte 81751: code not allowed in XML: 0x000B,
>> substituted 0x3F
>> Line 1839, char 559834, byte 559890: code not allowed in XML: 0x000B,
>> substituted 0x3F
>> Line 1840, char 559919, byte 559975: code not allowed in XML: 0x000C,
>> substituted 0x3F
>> Line 1843, char 560213, byte 560269: code not allowed in XML: 0x000B,
>> substituted 0x3F
>> Line 1846, char 560475, byte 560531: code not allowed in XML: 0x000B,
>> substituted 0x3F
>> Line 1850, char 560807, byte 560863: code not allowed in XML: 0x000C,
>> substituted 0x3F
>> Line 1851, char 560911, byte 560967: code not allowed in XML: 0x000B,
>> substituted 0x3F
>> Line 2249, char 658132, byte 658188: code not allowed in XML: 0x000B,
>> substituted 0x3F
>> Line 2250, char 658230, byte 658286: code not allowed in XML: 0x000B,
>> substituted 0x3F
>> Line 2253, char 658449, byte 658505: code not allowed in XML: 0x000B,
>> substituted 0x3F
>> Line 2271, char 662207, byte 662263: code not allowed in XML: 0x000B,
>> substituted 0x3F
>> Line 2274, char 662411, byte 662467: code not allowed in XML: 0x000E,
>> substituted 0x3F
>> Line 2287, char 663373, byte 663429: code not allowed in XML: 0x000B,
>> substituted 0x3F
>>
>> [utf8conditioner detected and did replacements for a number of
>> characters]
>> oaiharvest.pl: Got 752 records (running total: 752)
>> oaiharvest.pl: No resumptionToken, end of complete list.
>>
>> [expat could then parse response extracting 752 records, no
>> resumptionToken]
>>
>> oaiharvest.pl: Done.
>> simeon@ice 14Feb03>
>>
>>
>> [doing the same tests with Xerces...]
>>
>> simeon@ice 14Feb03>xercesCountElements lr
>> [Fatal Error] lr:320:76: An invalid XML character (Unicode: 0xc) was
>> found in the element content of the document.
>>
>> simeon@ice 14Feb03>cat lr | utf8conditioner -x > lrc
>> Line 320, char 81453, byte 81491: code not allowed in XML: 0x000C,
>> substituted 0x3F
>> [..etc, same output as above...]
>> Line 2287, char 663373, byte 663429: code not allowed in XML: 0x000B,
>> substituted 0x3F
>>
>> simeon@ice 14Feb03>xercesCountElements lrc
>> lrc: 4665;519;10 ms (18104 elems, 3013 attrs, 0 spaces, 740197 chars)
>>
>>
>> _______________________________________________
>> OAI-implementers mailing list
>> OAI-implementers@oaisrv.nsdl.cornell.edu
>> http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers
>>
>
--
=====================================================================
hussein suleman ~ hussein@cs.uct.ac.za ~ http://www.husseinsspace.com
=====================================================================