[OAI-implementers] valid character encoding
Simeon Warner
simeon@cs.cornell.edu
Wed, 13 Aug 2003 11:27:55 -0400 (EDT)
On Wed, 13 Aug 2003, Todd White wrote:
> On Wed, 13 Aug 2003, Thomas G. Habing wrote:
>
> > The OAI spec mandates that all XML responses must be encoded as UTF-8.
>
> here's an example of a record that has a special character. i'm not sure if
> i'm handling it correctly. can anyone confirm?
>
> http://michiganteacher.net/oai?verb=GetRecord&metadataPrefix=oai_dc&identifier=oai:michiganteacher.net:120
You have "mus\'ee" in the title and the e acute is not UTF-8 encoded. You
have the single byte 0xE9, which is the ISO-8859-1 encoding of this character:
0xE9 0x00E9 #LATIN SMALL LETTER E WITH ACUTE
You might find my little utf8conditioner code
(http://www.cs.cornell.edu/people/simeon/software/utf8conditioner/)
helpful for checking your UTF-8 output:
simeon@ice ~>cat oai.xml | ~/src/utf8/utf8conditioner -c
Line 22, char 1181, byte 1181: byte 2 isn't continuation: 0xE9 0x65, restart at 0x65, substituted 0x3F
The correct UTF-8 encoding for character code E9 (U+00E9) is the two-byte
sequence C3 A9.
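If you don't have utf8conditioner to hand, the same check can be roughed out in a few lines of Python. This is only a sketch, not how utf8conditioner works internally: the function names first_invalid_utf8 and latin1_to_utf8 are made up for illustration, and the repair step assumes the stray bytes really are ISO-8859-1, which looks likely here but is a guess about your data.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, byte value) of the first byte that breaks UTF-8
    decoding, or None if the whole input is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode bytes as UTF-8, ASSUMING the input is ISO-8859-1."""
    return data.decode("iso-8859-1").encode("utf-8")


if __name__ == "__main__":
    bad = b"mus\xe9e"                 # e acute as a single Latin-1 byte
    print(first_invalid_utf8(bad))    # offset and value of the offending byte
    print(latin1_to_utf8(bad).hex())  # hex of the repaired UTF-8 bytes
```

The cleaner fix is to encode as UTF-8 when the response is generated, rather than patching the bytes afterwards. Running a blanket Latin-1 conversion over data that is already partly UTF-8 would double-encode the good characters.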
Cheers,
Simeon