[OAI-implementers] character encoding
Tim Brody
tdb01r@ecs.soton.ac.uk
Fri, 31 Oct 2003 12:18:30 -0000
----- Original Message -----
From: "Todd White" <tmwhite@merit.edu>
> i sent a message to the list some time ago and, while working on other
> non-XML and non-OAI projects, i've been closely watching the list in hopes
> of finding the solution to my encoding problem. i'm embarrassed to admit
> that this encoding problem remains.
>
> perhaps i should provide some details...
>
> DATA STORAGE: Oracle
> DATA DELIVERY: DBI.pm
> OAI CONSTRUCTOR: Perl script (using Embperl)
> WEB SERVER: Apache
>
> in other words, i have a single Perl script, in the form of an Embperl file,
> that draws the data from Oracle, via DBI, then i simply loop through the
> data and wrap each element with the appropriate XML tag before returning
> the whole mess through STDOUT.
>
> i'm guessing that i should encode each character to UTF-8 as it passes
> through the script, but as yet, i'm not sure how to best do this.
>
> any helpful tips, advice, rants, etc. will be most welcome. i thank you
> in advance.
>
> -Todd
I strongly urge you to use a 5.8.x version of Perl, as it has built-in
support for UTF-8.
As you are outputting via STDOUT you should use:
binmode(STDOUT,":utf8");
Which is pretty self-explanatory :-)
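For anyone who wants to see what that layer actually does, here is a minimal sketch. It writes to an in-memory filehandle instead of STDOUT so the resulting bytes can be inspected; the variable names are only illustrative.

```perl
use strict;
use warnings;

# One character: U+00E9 (e with acute accent).
my $char = "\x{E9}";

# Open an in-memory filehandle so the output bytes can be examined.
open(my $fh, '>', \my $buffer) or die "cannot open in-memory handle: $!";
binmode($fh, ":utf8");    # same layer as binmode(STDOUT, ":utf8")
print $fh $char;
close($fh);

# The single character was written as the two UTF-8 bytes C3 A9.
printf "%d byte(s): %s\n", length($buffer), uc unpack("H*", $buffer);
```

The same thing happens on STDOUT once the layer is in place: Perl character strings are encoded to UTF-8 bytes on the way out.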
You need to find out what character coding your data is in, and convert it
into UTF-8. e.g. if your data is in ISO-8859-1 ("Latin-1, West Europe") you
would do something like:
use Encode; # Functions for converting strings between encodings
use utf8; # Tell Perl that this program's own source is written in UTF-8
use DBI;
binmode(STDOUT, ":utf8"); # Encode everything printed to STDOUT as UTF-8
my $sth = DBI->connect(...)->prepare("SELECT FROM DB"); # placeholder connection and query
$sth->execute;
my ($str_latin) = $sth->fetchrow_array();
my $str_utf8 = decode("iso-8859-1", $str_latin); # Latin-1 bytes -> Perl character string
print $str_utf8; # n.b. you will still need to escape <>"& in string data
__END__
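The escaping mentioned in that last comment could look something like the sketch below. xml_escape is a hypothetical helper name, not something from Encode or DBI, and the dc:title element is only an example.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in XML
# text and attribute values. "&" is replaced first so that the entities
# produced by the later substitutions are not escaped a second time.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

binmode(STDOUT, ":utf8");
print "<dc:title>", xml_escape('Law & "Order" <draft>'), "</dc:title>\n";
# prints: <dc:title>Law &amp; &quot;Order&quot; &lt;draft&gt;</dc:title>
```

Escape after decoding and just before wrapping the value in its tag, so the markup you add yourself is left alone.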
XML 1.0 also forbids most control characters, so you may need to do something
like:
# Remove all control characters except tab (\t) and newline (\n)
$str_utf8 =~ s/[\x00-\x08\x0b-\x1f]//g;
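Note that the substitution above also drops carriage return (\x0D), which XML 1.0 does permit. If you would rather keep everything XML 1.0 allows in that range (tab, newline and carriage return), a variant could be sketched as follows; strip_xml_controls is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: remove the C0 control characters that XML 1.0
# does not allow, keeping tab (\x09), newline (\x0A) and carriage
# return (\x0D).
sub strip_xml_controls {
    my ($text) = @_;
    $text =~ s/[\x00-\x08\x0B\x0C\x0E-\x1F]//g;
    return $text;
}

my $dirty = "line one\x00\x07\nline\x0B two\ttabbed\r\n";
my $clean = strip_xml_controls($dirty);
# $clean is now "line one\nline two\ttabbed\r\n"
```

Whether to keep \r is a judgment call; XML parsers normalize line endings anyway.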
There are quite a few utility functions in Encode for handling encodings, so
it is well worth taking a look at its documentation (perldoc Encode).
(A gotcha I have noticed is that Perl modules written in C may not flag
a string as UTF-8, even though the data is. There are functions in Encode for
changing this flag, but they should be used with caution!)
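To illustrate the flag issue, here is a small sketch. It fakes the unflagged data with a literal byte string rather than a real C-based module, and it uses decode() instead of switching the flag by hand, since decode() also checks the bytes.

```perl
use strict;
use warnings;
use Encode;

# Stand-in for what a C-based module might return: the UTF-8 bytes of
# U+00E9 (0xC3 0xA9) in a scalar that is not flagged as UTF-8.
my $bytes = "\xC3\xA9";

print Encode::is_utf8($bytes) ? "flagged" : "not flagged", "\n";
print length($bytes), "\n";    # counts bytes: 2

# decode() interprets the bytes as UTF-8 and returns a character string.
my $chars = decode("UTF-8", $bytes);

print Encode::is_utf8($chars) ? "flagged" : "not flagged", "\n";
print length($chars), "\n";    # counts characters: 1
```

If an unflagged string like $bytes is printed through a :utf8 layer, each byte is treated as a Latin-1 character and encoded again, which is where doubly encoded output usually comes from.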
All the best,
Tim.