[OAI-implementers] Trouble parsing records with apache commons digester : UTF8 and xerces UTFDataFormatException

Thomas Krämer kraemert@smail.uni-koeln.de
Thu, 15 Jan 2004 17:49:00 +0100


Hello,

I am trying to parse records with the Commons Digester, which works fine as long as you are not
handling special characters such as German umlauts, French accents, etc.

I found a hint at:

http://www.mail-archive.com/oxf-users@orbeon.com/msg00297.html

but it is not suitable for harvester applications.

Shouldn't the providers be aware of the right character encoding?
And does anyone know how to handle this?

I am not sure whether I am making wrong assumptions or whether the handling of character encoding
is not standardized yet.

An example:

I try to parse metadata records with the Apache Commons Digester, which uses Xerces.

Unfortunately, all of that metadata is declared as UTF-8, which causes a


java.io.UTFDataFormatException: Invalid byte 2 of 2-byte UTF-8 sequence.
     at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)


when I try to read an XML file such as the one attached below.


Any suggestions?



<?xml version="1.0" encoding="utf-8"?>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Medienphilosophie(n)</dc:title>         <dc:creator>Hartmann, Dr.
Frank</dc:creator>         <dc:subject>Medienphilosophie, Theorie der
Virtualität, Cyberphilosophie</dc:subject>         <dc:description>Die Frage, ob

...

wird, auflösen wird lassen. Eine Rekonstruktion relevanter
Positionen.</dc:description>         <dc:date>2002-01-01</dc:date>
<dc:type>Book Chapter</dc:type>
<dc:identifier>http://sammelpunkt.philo.at:8080/archive/00000103/</dc:identifier> <dc:format>html 
http://sammelpunkt.philo.at:8080/archive/00000103/01/medienphilosophie.html</dc:format></oai_dc:dc>
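
To make the question more concrete, here is a minimal sketch of the kind of workaround I could imagine on the harvester side. It assumes (this is only a guess, I have not verified it) that the offending records are really ISO-8859-1 bytes that are merely declared as UTF-8. The class and method names are hypothetical and not part of Digester or Xerces; only standard JDK classes are used.

```java
import java.io.Reader;
import java.io.StringReader;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Digester or Xerces.
final class EncodingFallback {

    // Returns true if the bytes form a strictly valid UTF-8 sequence.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Decodes as UTF-8 when the bytes are valid UTF-8.
    // ASSUMPTION: anything else is treated as ISO-8859-1, which may be wrong
    // for a given provider.
    static String decode(byte[] data) {
        Charset charset = isValidUtf8(data)
                ? StandardCharsets.UTF_8
                : StandardCharsets.ISO_8859_1;
        return new String(data, charset);
    }

    // Wraps the decoded text in a Reader that can be handed to a parser.
    static Reader openReader(byte[] data) {
        return new StringReader(decode(data));
    }
}
```

The Reader could then be wrapped in an org.xml.sax.InputSource and passed to the parser. As far as I understand, a parser that is given a character stream reads it directly and does not decode the bytes again according to the encoding declaration, but this only hides the problem on the harvester side and does not fix the provider.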




kind regards

thomas