<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">

<HTML>

<HEAD>

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">

<META NAME="Generator" CONTENT="MS Exchange Server version 6.5.7226.0">

<TITLE>RE: [OAI-implementers] implementation of non-English characters w/UTF-8?</TITLE>

</HEAD>

<BODY>



<P><FONT SIZE=2>Hi Jewel,<BR>

<BR>

UTF-8 can handle any Unicode character (&quot;e&quot; with an accent, and thousands more from many languages).&nbsp; As long as the bytes that encode your characters constitute valid UTF-8, you should be set.&nbsp; The problem often arises when you think you have UTF-8 to begin with, but your source data is actually using some other encoding.&nbsp; Often the problem isn't apparent until you get to the non-ASCII characters, because several different encodings (ISO-8859-1 and UTF-8, for example) represent &quot;low-ASCII&quot; (the first 128 characters) in the same way.&nbsp; It sounds like that might be what's happening in your case.<BR>
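<BR>
To make that concrete, here is a minimal Python sketch.&nbsp; The sample strings are my own illustration, not bytes taken from your records, and is_valid_utf8 is just a hypothetical helper name.&nbsp; It shows plain ASCII coming out byte-for-byte identical in ISO-8859-1 and UTF-8, while an accented &quot;e&quot; stored as ISO-8859-1 fails a UTF-8 check:<BR>
<BR>
```python
# "e" with an acute accent (U+00E9), encoded two different ways.
latin1_bytes = "\u00e9".encode("iso-8859-1")  # one byte: 0xE9
utf8_bytes = "\u00e9".encode("utf-8")         # two bytes: 0xC3 0xA9


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Pure ASCII is identical in both encodings, so a mislabeled file
# looks fine until the first accented character shows up.
ascii_only = "Elmo".encode("iso-8859-1")

print(ascii_only == "Elmo".encode("utf-8"))  # True
print(is_valid_utf8(ascii_only))             # True
print(is_valid_utf8(utf8_bytes))             # True
print(is_valid_utf8(latin1_bytes))           # False
```
<BR>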

<BR>

If so, the best thing to do (and this is sometimes really hard) is to find out what encoding the original provider of the file used.&nbsp; If you know that, then you can convert it to UTF-8 using a tool designed for that job[1].&nbsp; If you're unable to determine what the original encoding was, you can at least make the file validate by replacing the invalid bytes with valid (though probably incorrect) UTF-8 characters[2].<BR>
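<BR>
Both options can be sketched in a few lines of Python.&nbsp; This is only an illustration under my own assumptions, not a description of how either tool below works: the function names and the sample bytes are hypothetical, and the sample assumes the source turned out to be ISO-8859-1.<BR>
<BR>
```python
def transcode_to_utf8(data, source_encoding):
    """Re-encode bytes from a known source encoding as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


def force_valid_utf8(data):
    """Swap undecodable bytes for U+FFFD so the result is valid UTF-8."""
    return data.decode("utf-8", errors="replace").encode("utf-8")


# Hypothetical sample: "Rene" with an accented final "e", stored as ISO-8859-1.
raw = b"Ren\xe9"

# Option 1: the original encoding is known, so the character survives.
fixed = transcode_to_utf8(raw, "iso-8859-1")
print(fixed)    # b'Ren\xc3\xa9'

# Option 2: the encoding is unknown, so the output validates but the
# accented letter is lost to the replacement character.
patched = force_valid_utf8(raw)
print(patched)  # b'Ren\xef\xbf\xbd'
```
<BR>
The second option is lossy, which is why tracking down the real source encoding is worth the effort when you can manage it.<BR>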

<BR>

- Chris<BR>

<BR>

[1] Like this one that Google told me about: <A HREF="http://www.chilkatsoft.com/CharsetStudio.asp">http://www.chilkatsoft.com/CharsetStudio.asp</A><BR>

[2] Simeon here at Cornell wrote a nice utility for this: <A HREF="http://www.cs.cornell.edu/people/simeon/software/utf8conditioner/">http://www.cs.cornell.edu/people/simeon/software/utf8conditioner/</A><BR>

<BR>

-----Original Message-----<BR>

From: oai-implementers-bounces@openarchives.org on behalf of Jewel Ward<BR>

Sent: Tue 9/13/2005 3:29 PM<BR>

To: OAI-implementers<BR>

Subject: [OAI-implementers] implementation of non-English characters w/UTF-8?<BR>

<BR>

<BR>

How have other people implemented &quot;non-UTF-8&quot; characters in their DP<BR>

records?<BR>

<BR>

Meaning, we have non-English characters that are &quot;choking&quot; when we test<BR>

our Data Provider.&nbsp; [Think &quot;e&quot; with the accent over it<BR>

<A HREF="http://lib-app1.usc.edu:8085/oaidp?verb=GetRecord&identifier=oai:usc:digitalarchive:bhe/bhe-m27&metadataPrefix=oai_dc">http://lib-app1.usc.edu:8085/oaidp?verb=GetRecord&identifier=oai:usc:digitalarchive:bhe/bhe-m27&metadataPrefix=oai_dc</A><BR>

(surname after first name of &quot;Elmo&quot;).]&nbsp; Eventually, we will have several<BR>

Asian language character sets, as well as the current non-English<BR>

characters.<BR>

<BR>

I have looked over the protocol, looked at various tutorials, the<BR>

oai-implementers archives, and the OAI Best Practices site, and have not<BR>

seen any guidelines other than this thread:<BR>

<BR>

<A HREF="http://www.openarchives.org/pipermail/oai-implementers/2001-April/000093.html">http://www.openarchives.org/pipermail/oai-implementers/2001-April/000093.html</A><BR>

<BR>

I'm also looking at OLAC and some of the DP implementations in Japan,<BR>

but have not [yet] found the solution.&nbsp; [Like this:<BR>

<A HREF="http://mitizane.ll.chiba-u.jp/cgi-bin/oai/oai2.0?verb=ListRecords&metadataPrefix=oai_dc">http://mitizane.ll.chiba-u.jp/cgi-bin/oai/oai2.0?verb=ListRecords&metadataPrefix=oai_dc</A><BR>

.]<BR>

<BR>

Will we just have to locate the individual characters that are choking<BR>

and encode those a specific way?<BR>

<BR>

Thanks in advance,<BR>

<BR>

Jewel<BR>

<BR>

--<BR>

Jewel H. Ward<BR>

Program Manager, USC Digital Archive<BR>

Leavey Library, Information Services Division<BR>

University of Southern California<BR>

Tel: (213) 821-2298&nbsp;&nbsp; Cell: (213) 219-2784<BR>

<BR>

_______________________________________________<BR>

OAI-implementers mailing list<BR>

List information, archives, preferences and to unsubscribe:<BR>

<A HREF="http://www.openarchives.org/mailman/listinfo/oai-implementers">http://www.openarchives.org/mailman/listinfo/oai-implementers</A><BR>

<BR>

<BR>

</FONT>

</P>


</BODY>

</HTML>