[OAI-implementers] XML encoding problems with DSpace at MIT
Xiaoming Liu
liu_x@cs.odu.edu
Sat, 15 Feb 2003 20:28:32 -0500 (EST)
I have a list of parsing errors from arc's recent harvest.
http://arc.cs.odu.edu/stat/parserror.txt
In summary, about 140 records from 10 archives did not pass the Xerces
parser. This list is far from complete or accurate, and some of the
errors may be our own mistakes.
regards,
liu
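
A minimal sketch of how a per-response check like this could be reproduced,
for anyone who wants to test their own output. It uses Python's bundled expat
bindings rather than Xerces (an assumption made for brevity, not the parser
used for the list above), and the function name first_parse_error is
hypothetical.

```python
from xml.parsers import expat


def first_parse_error(data: bytes):
    """Return (line, column, message) for the first well-formedness
    error that expat reports, or None if the document parses cleanly."""
    parser = expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(data, True)
    except expat.ExpatError as err:
        return err.lineno, err.offset, expat.ErrorString(err.code)
    return None


if __name__ == "__main__":
    good = b"<a><b>ok</b></a>"
    bad = b"<a>\n<b>x\x0cy</b></a>"  # form feed (0x0C) is not allowed in XML 1.0
    print(first_parse_error(good))
    print(first_parse_error(bad))
```

A strict parser stops at the first error, so a report built this way only
shows one problem per response until that problem is fixed or substituted.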
On Sat, 15 Feb 2003, Caroline Arms wrote:
>
> As a data provider, LC would like to know if it is generating invalid
> characters. The gradual migration to UNICODE is going to give us all
> problems, in part BECAUSE some systems work so hard to recognize different
> character encodings and adjust. I'm with Hussein. Notify data providers
> of problems (even if you do adjust) so that the problem can be fixed as
> close to home as possible.
>
> As a related aside, if anyone has a suggestion for an efficient way
> (preferably unix-based) to check that the metadata in a PDF file is stored
> in UTF-8 encoding (or consistently in any other UNICODE encoding), I'd be
> interested.
>
> Caroline Arms
> Office of Strategic Initiatives
> Library of Congress
>
> On Sat, 15 Feb 2003, Hussein Suleman wrote:
>
> > hi
> >
> > i think Tim poses a very relevant question: do we deal with the
> > so-called "real-world" encoding problems or do we try to encourage
> > people to fix their implementations? (of course, for research purposes,
> > we may end up doing both :))
> >
> > personally, the code i distribute to others does quite a lot of XML
> > cleaning in the data provider, but none at all in the harvester. i think
> > the basic philosophy i'm following is: clean data as close to the source
> > as possible. also, i believe one of the reasons the adminEmail field in
> > Identify responses is required is so that a service provider can contact
> > the administrator if there are problems with the data.
> >
> > and now that the hype about OAI2 is dying down, i wonder how much (if
> > any) more testing we need. i have some ideas to enhance, complement and
> > possibly even replace the repository explorer in the next year ... it
> > all depends on finding time and/or students/colleagues with time :)
> >
> > ttfn,
> > ----hussein
> >
> >
> > Tim Brody wrote:
> > >
> > > http://celestial.eprints.org/cgi-bin/status?action=repository;metadataFormat=17
> > >
> > > Harvested 752 records - I've also implemented some
> > > character-substitution to fix encoding errors, although this is probably
> > > not as proficient as Simeon's!
> > >
> > > The question is, the more harvesters implement fixes the less pressure
> > > there is on repositories to fix their output, so should harvesters
> > > accept bad-XML?
> > > (once that question is answered, harvesters have to decide how much
> > > normalisation of metadata they do :-)
> > >
> > > All the best,
> > > Tim.
> > >
> > > Simeon Warner wrote:
> > >
> > >> In my recent post to oai-general
> > >>
> > >> http://www.openarchives.org/pipermail/oai-general/2003-February/000258.html
> > >>
> > >> I said I'd post a note about the current output of DSpace at MIT to this
> > >> list (which seems a more appropriate forum). I just ran a harvest and got
> > >> the log shown below; I've added comments in [].
> > >>
> > >> Cheers,
> > >> Simeon.
> > >>
> > >>
> > >>
> > >> simeon@ice 14Feb03>more log
> > >> oaiharvest.pl: Harvest from http://hpds1.mit.edu/oai/ using POST
> > >> OAIGet: Doing POST to http://hpds1.mit.edu/oai/ args: verb=Identify
> > >> OAIGet: Note - Got Content-Encoding 'gzip', decoded with 'gunzip -c'
> > >> OAIGet: Got 200 OK (627bytes decoded to 1328bytes)
> > >>
> > >> [nice, DSpace implements gzip content coding]
> > >>
> > >> oaiharvest.pl: Identify reports OAI-PMH version 2.0
> > >> oaiharvest.pl: Doing complete harvest.
> > >> OAIGet: Doing POST to http://hpds1.mit.edu/oai/ args:
> > >> verb=ListMetadataFormats
> > >> OAIGet: Note - Got Content-Encoding 'gzip', decoded with 'gunzip -c'
> > >> OAIGet: Got 200 OK (307bytes decoded to 643bytes)
> > >> OAIGet: Doing POST to http://hpds1.mit.edu/oai/ args: verb=ListSets
> > >> OAIGet: Note - Got Content-Encoding 'gzip', decoded with 'gunzip -c'
> > >> OAIGet: Got 200 OK (715bytes decoded to 2140bytes)
> > >> OAIGet: Doing POST to http://hpds1.mit.edu/oai/ args:
> > >> metadataPrefix=oai_dc&verb=ListRecords
> > >> OAIGet: Note - Got Content-Encoding 'gzip', decoded with 'gunzip -c'
> > >> OAIGet: Got 200 OK (254096bytes decoded to 1392720bytes)
> > >> oaiharvest.pl: UTF-8/XML errors in ListRecords.1:
> > >>
> > >> [oops, the expat parser fails on the response.
> > >> My harvester now attempts to replace bad XML/UTF-8 bytes/chars
> > >> using my utf8conditioner, details at
> > >> http://www.cs.cornell.edu/people/simeon/software/utf8conditioner/
> > >> Unless the response can be parsed we can't even know whether there is a
> > >> resumptionToken...]
> > >> utf8conditioner: Line 320, char 81453, byte 81491: code not allowed in
> > >> XML: 0x000C, substituted 0x3F
> > >> Line 324, char 81713, byte 81751: code not allowed in XML: 0x000B,
> > >> substituted 0x3F
> > >> Line 1839, char 559834, byte 559890: code not allowed in XML: 0x000B,
> > >> substituted 0x3F
> > >> Line 1840, char 559919, byte 559975: code not allowed in XML: 0x000C,
> > >> substituted 0x3F
> > >> Line 1843, char 560213, byte 560269: code not allowed in XML: 0x000B,
> > >> substituted 0x3F
> > >> Line 1846, char 560475, byte 560531: code not allowed in XML: 0x000B,
> > >> substituted 0x3F
> > >> Line 1850, char 560807, byte 560863: code not allowed in XML: 0x000C,
> > >> substituted 0x3F
> > >> Line 1851, char 560911, byte 560967: code not allowed in XML: 0x000B,
> > >> substituted 0x3F
> > >> Line 2249, char 658132, byte 658188: code not allowed in XML: 0x000B,
> > >> substituted 0x3F
> > >> Line 2250, char 658230, byte 658286: code not allowed in XML: 0x000B,
> > >> substituted 0x3F
> > >> Line 2253, char 658449, byte 658505: code not allowed in XML: 0x000B,
> > >> substituted 0x3F
> > >> Line 2271, char 662207, byte 662263: code not allowed in XML: 0x000B,
> > >> substituted 0x3F
> > >> Line 2274, char 662411, byte 662467: code not allowed in XML: 0x000E,
> > >> substituted 0x3F
> > >> Line 2287, char 663373, byte 663429: code not allowed in XML: 0x000B,
> > >> substituted 0x3F
> > >>
> > >> [utf8conditioner detected and did replacements for a number of
> > >> characters]
> > >> oaiharvest.pl: Got 752 records (running total: 752)
> > >> oaiharvest.pl: No resumptionToken, end of complete list.
> > >>
> > >> [expat could then parse response extracting 752 records, no
> > >> resumptionToken]
> > >>
> > >> oaiharvest.pl: Done.
> > >> simeon@ice 14Feb03>
> > >>
> > >>
> > >> [doing the same tests with Xerces...]
> > >>
> > >> simeon@ice 14Feb03>xercesCountElements lr
> > >> [Fatal Error] lr:320:76: An invalid XML character (Unicode: 0xc) was
> > >> found in the element content of the document.
> > >>
> > >> simeon@ice 14Feb03>cat lr | utf8conditioner -x > lrc
> > >> Line 320, char 81453, byte 81491: code not allowed in XML: 0x000C,
> > >> substituted 0x3F
> > >> [..etc, same output as above...]
> > >> Line 2287, char 663373, byte 663429: code not allowed in XML: 0x000B,
> > >> substituted 0x3F
> > >>
> > >> simeon@ice 14Feb03>xercesCountElements lrc
> > >> lrc: 4665;519;10 ms (18104 elems, 3013 attrs, 0 spaces, 740197 chars)
> > >>
> > >>
> > >> _______________________________________________
> > >> OAI-implementers mailing list
> > >> OAI-implementers@oaisrv.nsdl.cornell.edu
> > >> http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers
> > >>
> > >
> >
> >
> > --
> > =====================================================================
> > hussein suleman ~ hussein@cs.uct.ac.za ~ http://www.husseinsspace.com
> > =====================================================================
> >
> >
> >
>
>
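
The substitution step in Simeon's log above (codes such as 0x000B and 0x000C
replaced with 0x3F) can be sketched roughly as follows. This is a minimal
Python illustration, not utf8conditioner's actual code; the names is_xml_char
and condition are hypothetical. It assumes the XML 1.0 Char ranges and, unlike
the log above, counts characters from 0 and reports no byte offsets.

```python
def is_xml_char(cp: int) -> bool:
    """True if the code point is allowed by the XML 1.0 Char production:
    #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]"""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def condition(data: bytes, substitute: str = "?"):
    """Decode data as UTF-8 and replace characters not allowed in XML.

    Returns (cleaned_text, report), where report is a list of
    (line, char_index, code_point) tuples; lines count from 1 and
    character indexes from 0. Malformed UTF-8 byte sequences are turned
    into U+FFFD by the decoder and are not listed in the report.
    """
    text = data.decode("utf-8", errors="replace")
    cleaned = []
    report = []
    line = 1
    for index, ch in enumerate(text):
        if is_xml_char(ord(ch)):
            cleaned.append(ch)
            if ch == "\n":
                line += 1
        else:
            report.append((line, index, ord(ch)))
            cleaned.append(substitute)
    return "".join(cleaned), report


if __name__ == "__main__":
    sample = b"<a>x\x0cy\n\x0bz</a>"
    cleaned, report = condition(sample)
    for line, index, cp in report:
        print(f"Line {line}, char {index}: code not allowed in XML: "
              f"0x{cp:04X}, substituted 0x{ord('?'):02X}")
```

Substituting on the harvester side only makes the response parseable; the
report is what a service provider would pass back to the repository
administrator so the data can be fixed at the source.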