[OAI-implementers] Trouble parsing records with apache commons digester : UTF8 and xerces
UTFDataFormatException
Thomas Krämer
kraemert@smail.uni-koeln.de
Mon, 12 Jan 2004 16:44:29 +0100
Hello,
Parsing records with the Commons Digester works fine, provided you are not handling special
characters such as German umlauts, French accents, etc.
I found a hint at:
http://www.mail-archive.com/oxf-users@orbeon.com/msg00297.html but it is not suitable for harvester
applications.
Shouldn't the providers be aware of the right character encoding?
And does anyone know how to handle this?
(For further details, see the attachment.)
kind regards
thomas
Message-ID: <4002BD8E.10709@smail.uni-koeln.de>
Date: Mon, 12 Jan 2004 16:30:22 +0100
From: =?ISO-8859-1?Q?Thomas_Kr=E4mer?= <kraemert@smail.uni-koeln.de>
To: general@xml.apache.org
Subject: UTF8 and xercers exception : UTFDataFormatException
Hello,
I am trying to parse metadata records with the Apache Commons Digester, which uses Xerces.
Unfortunately, almost all of that metadata is declared as UTF-8, which causes a
java.io.UTFDataFormatException: Invalid byte 2 of 2-byte UTF-8 sequence.
at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
when I try to read an XML file such as the one attached below.
In the archive I found a hint that recommends changing the encoding to ISO-8859-1,
but of course this does not help if done at digestion time.
Any suggestions?
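
The ISO-8859-1 hint suggests one possible cause: the record declares utf-8 but its bytes were actually written as ISO-8859-1. That is an assumption, not something confirmed in this thread. A minimal sketch of how one might check it with the JDK alone follows; the class name Utf8Check and the method isValidUtf8 are made up for illustration, and the test word is taken from the sample record.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Digester or Xerces.
public class Utf8Check {

    /** Returns true only if the bytes form a well-formed UTF-8 sequence. */
    public static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String word = "Virtualit\u00e4t"; // contains a-umlaut, as in the sample record

        // In ISO-8859-1 the umlaut is the single byte 0xE4, which UTF-8 treats as
        // the start of a multi-byte sequence; the following 't' is not a valid
        // continuation byte, so strict decoding fails.
        byte[] latin1 = word.getBytes(StandardCharsets.ISO_8859_1);
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);

        System.out.println("ISO-8859-1 bytes valid as UTF-8: " + isValidUtf8(latin1)); // false
        System.out.println("UTF-8 bytes valid as UTF-8: " + isValidUtf8(utf8));        // true
    }
}
```

Running the same check over the raw bytes of a harvested record would show whether the provider's utf-8 declaration matches what was actually sent.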
kind regards
thomas
<?xml version="1.0" encoding="utf-8"?>
<oai_dc:dc xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
<dc:title>Medienphilosophie(n)</dc:title>
<dc:creator>Hartmann, Dr. Frank</dc:creator>
<dc:subject>Medienphilosophie, Theorie der Virtualität, Cyberphilosophie</dc:subject>
<dc:description>Die Frage, ob
...
wird, auflösen wird lassen. Eine Rekonstruktion relevanter Positionen.</dc:description>
<dc:date>2002-01-01</dc:date>
<dc:type>Book Chapter</dc:type>
<dc:identifier>http://sammelpunkt.philo.at:8080/archive/00000103/</dc:identifier>
<dc:format>html http://sammelpunkt.philo.at:8080/archive/00000103/01/medienphilosophie.html</dc:format>
</oai_dc:dc>
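
For a harvester that cannot fix the provider's data, one possible workaround is to decode the bytes itself and hand the parser a character stream. SAX parsers read a character stream directly and disregard the encoding declaration in the document. The sketch below assumes the mislabelled records are really ISO-8859-1, which is a guess; the names LenientRecordParser, decodeLeniently and firstSubject are hypothetical, and it uses the JDK's SAX parser rather than Digester so that it runs on its own.

```java
import java.io.StringReader;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only.
public class LenientRecordParser {

    /** Decodes as strict UTF-8; on malformed input, falls back to ISO-8859-1 (an assumption). */
    public static String decodeLeniently(byte[] raw) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // ISO-8859-1 maps every byte to a character, so this never fails,
            // but the result is wrong if the real encoding is something else.
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }

    /** Parses a record and returns the text of its dc:subject elements. */
    public static String firstSubject(byte[] raw) {
        final StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inSubject;

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                if ("dc:subject".equals(qName)) inSubject = true;
            }

            @Override
            public void endElement(String uri, String local, String qName) {
                if ("dc:subject".equals(qName)) inSubject = false;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                if (inSubject) text.append(ch, start, length);
            }
        };
        try {
            // A character stream makes the parser skip its own byte decoding.
            InputSource source = new InputSource(new StringReader(decodeLeniently(raw)));
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse record", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // A record that claims utf-8 but is serialized as ISO-8859-1.
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
                + "<oai_dc:dc xmlns:dc=\"http://purl.org/dc/elements/1.1/\""
                + " xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\">"
                + "<dc:subject>Virtualit\u00e4t</dc:subject></oai_dc:dc>";
        byte[] raw = xml.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(firstSubject(raw));
    }
}
```

With Digester, the same InputSource could presumably be passed to Digester.parse(InputSource) in place of the SAX parser call. The fallback is a heuristic and can silently produce wrong characters, so logging which records needed it would help when reporting the problem to the data provider.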