[OAI-implementers] XML encoding problems with DSpace at MIT
Tim Brody
tim@tim.brody.btinternet.co.uk
Tue, 18 Feb 2003 15:30:15 +0000
Celestial keeps a record of errors that occurred during harvesting:
http://celestial.eprints.org/cgi-bin/status
I reset the errors occasionally to save space.
The mods format appears to be AWOL:
http://celestial.eprints.org/cgi-bin/status?action=repository;metadataFormat=66
The OAI 1.1 memory.loc.gov interface is returning internal server
errors. Has this interface been removed (does lcoa1 supersede it)?
Determining what character encoding a PDF uses probably depends on
your PDF tool (unless you fancy writing a PDF parser :-)
Reading the PDF spec:
http://partners.adobe.com/asn/developer/acrosdk/docs/pdfspec.pdf
The default encoding is ISOLatin1; otherwise, quoting the doc:
"If text is encoded in Unicode the first two bytes of the text must be
the Unicode Byte Order marker, <FE FF>."
I guess that if a text object in a PDF is in Unicode it uses UTF-16,
but I've not done enough with PDF metadata to know for certain.
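
For what it's worth, a rough sketch of that check in Python might look
like the following (the function name and the fallback order are just
assumptions on my part; getting the raw string values out of the PDF in
the first place is left to whatever PDF tool you use):

  def classify_pdf_string(raw: bytes) -> str:
      """Guess the encoding of a raw PDF text-string value."""
      if raw.startswith(b"\xfe\xff"):
          # Per the spec quote above: a leading <FE FF> byte order
          # marker indicates Unicode (UTF-16BE) text.
          return "UTF-16BE (Unicode, byte order marker present)"
      try:
          raw.decode("utf-8")
          return "valid UTF-8 (possibly just plain ASCII)"
      except UnicodeDecodeError:
          # Anything else is presumably the single-byte default.
          return "default single-byte encoding (ISOLatin1?)"

  if __name__ == "__main__":
      print(classify_pdf_string(b"\xfe\xff\x00H\x00i"))   # UTF-16BE
      print(classify_pdf_string("café".encode("utf-8")))  # valid UTF-8
      print(classify_pdf_string(b"caf\xe9"))              # Latin-1 style

Note that pure ASCII passes the UTF-8 test as well, so "valid UTF-8"
only means the bytes are consistent with UTF-8, not that the producer
intended it.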
All the best,
Tim.
Caroline Arms wrote:
> As a data provider, LC would like to know if it is generating invalid
> characters. The gradual migration to UNICODE is going to give us all
> problems, in part BECAUSE some systems work so hard to recognize different
> character encodings and adjust. I'm with Hussein. Notify data providers
> of problems (even if you do adjust) so that the problem can be fixed as
> close to home as possible.
>
> As a related aside, if anyone has a suggestion for an efficient way
> (preferably unix-based) to check that the metadata in a PDF file is stored
> in UTF-8 encoding (or consistently in any other UNICODE encoding), I'd be
> interested.
>
> Caroline Arms
> Office of Strategic Initiatives
> Library of Congress