[OAI-implementers] valid character encoding
Simeon Warner
simeon@cs.cornell.edu
Wed, 13 Aug 2003 22:46:48 -0400 (EDT)
On Thu, 14 Aug 2003, Steve Thomas wrote:
> While we're on the topic, I have records with the name Niccolò in them (as in
> Machiavelli) -- there's a grave accent over the final "o". But this doesn't
> seem to be part of UTF-8, or your conditioner doesn't recognise it. (Although
> it displays correctly everywhere.)
>
> Is this invalid in UTF-8, or ... what?
>
> When I dump it in Unix, the character is \xf2, apparently.
0xF2 is NOT a valid UTF-8 sequence.
No single byte in the range 0x80--0xFF is a valid UTF-8 sequence. 0xF2 is
the Latin 1, CP1252 and Unicode code for o grave and is represented as a
two-byte sequence in UTF-8 (0xC3 0xB2).
If you have data in Latin 1, it is trivial to convert it to UTF-8, but you
must do the conversion before writing XML records for OAI use!
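To make the point above concrete, here is a minimal Python sketch (Python is
my choice for illustration, and the helper names is_valid_utf8 and
latin1_to_utf8 are made up for this example). It checks whether a byte string
is valid UTF-8 and converts Latin 1 bytes to UTF-8, assuming the input really
is Latin 1.

```python
# Sketch only: 'latin1_bytes' stands in for data read from a Latin 1 source.
latin1_bytes = b"Niccol\xf2"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def latin1_to_utf8(data: bytes) -> bytes:
    """Interpret the bytes as Latin 1 characters, then encode them as UTF-8."""
    return data.decode("iso-8859-1").encode("utf-8")


utf8_bytes = latin1_to_utf8(latin1_bytes)
print(is_valid_utf8(latin1_bytes))  # False: a lone 0xF2 is not valid UTF-8
print(utf8_bytes)                   # b'Niccol\xc3\xb2'
print(is_valid_utf8(utf8_bytes))    # True
```

Running a check like this before writing the XML records catches stray
Latin 1 bytes early. Note that the conversion is only correct if the source
data really is Latin 1; applying it to data that is already UTF-8 would
double-encode it.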
There seems to be some confusion about these issues so I'll attempt to
summarize a few key points:
o UTF-8 is a particular ENCODING of Unicode (UCS, ISO 10646). Individual
characters are represented by a sequence of between 1 and 6 bytes (RFC 2279
allows up to 6; code points up to U+10FFFF need at most 4). Any
byte >= 0x80 is part of a multi-byte sequence.
o The ASCII characters (0x00-0x7F) have the same codes in Latin 1 (aka ISO
8859-1) and Unicode. They are also represented by single bytes with the
same values in a UTF-8 stream.
o The upper-half Latin 1 characters (0xA0-0xFF) have the same codes in
Unicode (U+00A0-U+00FF). In UTF-8 streams they are encoded as two-byte
sequences. (Direct inclusion of these bytes in a UTF-8 stream will likely
result in invalid UTF-8 sequences and will certainly not be correctly
interpreted.)
o Almost every other character set can be mapped to Unicode but may
require look-up-tables.
o There are libraries and tools to do character set conversion and
encoding in most common languages. For example, Perl permits quite general
conversion; say Latin 1 to UTF-8:
  # see http://search.cpan.org/author/JHI/perl-5.8.0/ext/Encode/Encode.pm
  use Encode;
  # decode the Latin 1 bytes to characters, then encode them as UTF-8 bytes
  $utf8data = encode("utf8", decode("iso-8859-1", $latin1data));
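To illustrate how the encoding described in the points above works at the bit
level, here is a rough Python sketch of a hand-written encoder. The function
name utf8_encode_code_point is hypothetical, and the sketch covers only code
points up to U+10FFFF (at most 4 bytes); the longer 5 and 6 byte forms
permitted by RFC 2279 are left out. In practice you would use a library
routine rather than code like this.

```python
def utf8_encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 bytes (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:
        # ASCII range: a single byte with the same value.
        return bytes([cp])
    if cp < 0x800:
        # Two bytes: 110xxxxx 10xxxxxx (covers the Latin 1 upper half).
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx.
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx.
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


for cp in (0x41, 0xF2, 0x20AC):
    encoded = utf8_encode_code_point(cp)
    print(f"U+{cp:04X} ->", " ".join(f"{b:02X}" for b in encoded))
# U+0041 -> 41
# U+00F2 -> C3 B2
# U+20AC -> E2 82 AC
```

The middle line shows why o grave (code 0xF2) appears as the two bytes
0xC3 0xB2 in a UTF-8 stream: every byte of a multi-byte sequence has its high
bit set, so none of them can be confused with an ASCII character.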
For more details see:
http://www.cl.cam.ac.uk/~mgk25/unicode.html (FAQ)
http://www.ietf.org/rfc/rfc2279.txt (UTF-8)
http://www.unicode.org/standard/standard.html (Unicode)
I hope this helps.
Cheers,
Simeon.