<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<META NAME="Generator" CONTENT="MS Exchange Server version 6.5.7226.0">
<TITLE>RE: [OAI-implementers] implementation of non-English characters w/UTF-8?</TITLE>
</HEAD>
<BODY>
<!-- Converted from text/plain format -->
<P><FONT SIZE=2>Hi Jewel,<BR>
<BR>
UTF-8 can encode any Unicode character ("e" with an accent, and thousands more from many languages). As long as the bytes encoding your characters constitute valid UTF-8, you should be set. The problem often arises when you think you have UTF-8 to begin with, but your source data is actually using some other encoding. Often the problem isn't apparent until you get to the non-ASCII characters, because several different encodings (ISO-8859-1 and UTF-8 among them) represent "low-ASCII", the first 128 characters, in the same way. It sounds like that might be what's happening in your case.<BR>
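<BR>
To illustrate, here is a minimal Python sketch of that situation. It assumes, purely for the example, that the source data is really ISO-8859-1; the helper name is_valid_utf8 is made up for illustration and is not part of any OAI tool.<BR>
<PRE>
```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Plain ASCII text has identical bytes in ISO-8859-1 and UTF-8,
# so a file containing only ASCII looks fine either way.
ascii_text = "Elmo"
assert ascii_text.encode("iso-8859-1") == ascii_text.encode("utf-8")

# An accented "e" (U+00E9) is where the two encodings diverge.
accented = "\u00e9"
latin1_bytes = accented.encode("iso-8859-1")  # b'\xe9' (one byte)
utf8_bytes = accented.encode("utf-8")         # b'\xc3\xa9' (two bytes)

print(is_valid_utf8(utf8_bytes))    # the real UTF-8 bytes pass
print(is_valid_utf8(latin1_bytes))  # the ISO-8859-1 byte does not
```
</PRE>
A lone 0xE9 byte announces a multi-byte UTF-8 sequence that never arrives, which is the kind of thing that makes an XML parser "choke".<BR>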
<BR>
If so, the best thing to do (and this is sometimes really hard) is to find out what encoding the original provider of the file used. If you know that, then you can convert it to UTF-8 using a tool designed for that job[1]. If you're unable to determine what the original encoding was, you can at least make the file validate by replacing the odd characters with valid (though probably incorrect) UTF-8 ones[2].<BR>
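<BR>
Both options can be sketched in a few lines of Python. This is only an illustration, not either of the tools in the footnotes: the function names are hypothetical, and ISO-8859-1 in the usage lines is an assumed source encoding, not something known about your data.<BR>
<PRE>
```python
def convert_to_utf8(data, source_encoding):
    """Re-encode bytes from a known source encoding into UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


def force_valid_utf8(data):
    """Fallback when the source encoding is unknown.

    Bytes that are not valid UTF-8 are replaced with U+FFFD
    (the Unicode replacement character), so the result is valid
    UTF-8 but the original characters are lost.
    """
    return data.decode("utf-8", errors="replace").encode("utf-8")


raw = b"Ren\xe9"  # "e" with an acute accent, stored as ISO-8859-1

# Option 1: the encoding is known, so the character survives.
converted = convert_to_utf8(raw, "iso-8859-1")

# Option 2: the encoding is unknown, so the bad byte is replaced.
conditioned = force_valid_utf8(raw)

print(converted.decode("utf-8"))
print(conditioned.decode("utf-8"))
```
</PRE>
The first option keeps the accented character; the second only guarantees that the output validates.<BR>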
<BR>
- Chris<BR>
<BR>
[1] Like this one that Google told me about: <A HREF="http://www.chilkatsoft.com/CharsetStudio.asp">http://www.chilkatsoft.com/CharsetStudio.asp</A><BR>
[2] Simeon here at Cornell wrote a nice utility for this: <A HREF="http://www.cs.cornell.edu/people/simeon/software/utf8conditioner/">http://www.cs.cornell.edu/people/simeon/software/utf8conditioner/</A><BR>
<BR>
-----Original Message-----<BR>
From: oai-implementers-bounces@openarchives.org on behalf of Jewel Ward<BR>
Sent: Tue 9/13/2005 3:29 PM<BR>
To: OAI-implementers<BR>
Subject: [OAI-implementers] implementation of non-English characters w/UTF-8?<BR>
<BR>
<BR>
How have other people implemented "non-UTF-8" characters in their DP<BR>
records?<BR>
<BR>
Meaning, we have non-English characters that are "choking" when we test<BR>
our Data Provider. [Think "e" with the accent over it<BR>
<A HREF="http://lib-app1.usc.edu:8085/oaidp?verb=GetRecord&identifier=oai:usc:digitalarchive:bhe/bhe-m27&metadataPrefix=oai_dc">http://lib-app1.usc.edu:8085/oaidp?verb=GetRecord&identifier=oai:usc:digitalarchive:bhe/bhe-m27&metadataPrefix=oai_dc</A><BR>
(surname after first name of "Elmo").] Eventually, we will have several<BR>
Asian language character sets, as well as the current non-English<BR>
characters.<BR>
<BR>
I have looked over the protocol, looked at various tutorials, the<BR>
oai-implementers archives, and the OAI Best Practices site, and have not<BR>
seen any guidelines other than this thread:<BR>
<BR>
<A HREF="http://www.openarchives.org/pipermail/oai-implementers/2001-April/000093.html">http://www.openarchives.org/pipermail/oai-implementers/2001-April/000093.html</A><BR>
<BR>
I'm also looking at OLAC and some of the DP implementations in Japan,<BR>
but have not [yet] found the solution. [Like this:<BR>
<A HREF="http://mitizane.ll.chiba-u.jp/cgi-bin/oai/oai2.0?verb=ListRecords&metadataPrefix=oai_dc">http://mitizane.ll.chiba-u.jp/cgi-bin/oai/oai2.0?verb=ListRecords&metadataPrefix=oai_dc</A><BR>
.]<BR>
<BR>
Will we just have to locate the individual characters that are choking<BR>
and encode those a specific way?<BR>
<BR>
Thanks in advance,<BR>
<BR>
Jewel<BR>
<BR>
--<BR>
Jewel H. Ward<BR>
Program Manager, USC Digital Archive<BR>
Leavey Library, Information Services Division<BR>
University of Southern California<BR>
Tel: (213) 821-2298 Cell: (213) 219-2784<BR>
<BR>
</FONT>
</P>
</BODY>
</HTML>