[OAI-implementers] SPECIAL CHARACTERS...
Simeon Warner
simeon@cs.cornell.edu
Wed, 2 Oct 2002 10:27:42 -0400 (EDT)
Tim mentioned encoding support in perl5.8 in his earlier post and I tried
it out some time ago. It seems pretty good and is probably a good
solution if the data coming from your database is in a well-defined (and
supported) encoding such as latin1.
I played with the "from_to" function supplied by the "Encode" module and
it seems very easy to use. It writes multi-byte UTF-8 characters rather
than numeric entities, but that is fine.
The entity encoding of gt, lt, amp and quot is a separate XML issue which
should be handled by whatever XML writing code you are using.
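If no XML writing library is in use, that escaping step might look roughly like the sketch below. The helper name xml_escape is hypothetical (it is not part of Encode or of any module mentioned in this thread), and the sketch only covers the four characters named above.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only: escape the four characters
# mentioned above so a string is safe in XML text or attribute values.
sub xml_escape {
    my ($s) = @_;
    $s =~ s/&/&amp;/g;     # must run first, or the entities added below get re-escaped
    $s =~ s/</&lt;/g;
    $s =~ s/>/&gt;/g;
    $s =~ s/"/&quot;/g;
    return $s;
}

print xml_escape('Smith & Jones <"ed.">'), "\n";
# prints: Smith &amp; Jones &lt;&quot;ed.&quot;&gt;
```

This is independent of the character encoding question: it can be applied before or after the conversion to UTF-8, since all four characters are plain ASCII.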
Cheers,
Simeon.
Code I played with is below, test with:
simeon@ice ~>echo "çãö" | convert-encoding.pl -f ISO-8859-1 -t utf8
Ã§Ã£Ã¶
simeon@ice ~>
where the gibberish Ã§Ã£Ã¶ is actually the correct utf8 bytes displayed
incorrectly on my terminal. Perhaps octal makes it more obvious:
simeon@ice ~>echo "çãö" | convert-encoding.pl -f ISO-8859-1 -t utf8 | hexdump -c
0000000 303 247 303 243 303 266 \n
#!/usr/bin/perl5.8.0
#
# Convert a bytestream on STDIN from one encoding to another.
use strict;
use Getopt::Std;
use Encode 'from_to';
use vars qw($opt_f $opt_t $opt_h);

my $FROM='utf8';
my $TO='utf8';

unless (getopts('f:t:h') && !$opt_h) {
  die "usage: $0 [-f from] [-t to] [-h]\n
Convert bytestream from one encoding to another.
  -f from  set incoming encoding [default $FROM]
  -t to    set outgoing encoding [default $TO]
  -h       this help.\n";
}

my $from = $opt_f || $FROM;
my $to = $opt_t || $TO;

undef $/;  # slurp mode: read all of STDIN into one string
my $data=<STDIN>;
# from_to converts $data in place and returns undef on failure
defined(from_to($data, $from, $to))
  or die "$0: conversion from $from to $to failed\n";
print $data;
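The conversion at the heart of the script can also be checked without the command line. The following is a minimal sketch that assumes the input really is ISO-8859-1; the byte values for "çãö" are written out explicitly so that the result does not depend on the encoding of the source file or terminal.

```perl
use strict;
use warnings;
use Encode 'from_to';

# "çãö" as ISO-8859-1 bytes (one byte per character)
my $data = "\xe7\xe3\xf6";

# from_to converts in place and returns the new length in octets
my $octets = from_to($data, 'ISO-8859-1', 'utf8');

# Print each resulting byte in octal, for comparison with the hexdump above
print join(' ', map { sprintf('%03o', ord($_)) } split(//, $data)), "\n";
print "$octets octets\n";
# prints:
# 303 247 303 243 303 266
# 6 octets
```

Each Latin-1 character above 127 becomes two bytes in UTF-8, which is why three input bytes turn into six.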
On Wed, 2 Oct 2002, Tim Brody wrote:
> (Only tested using the Perl expat parser ...)
>
> I don't *think* your solution will cover all situations (e.g., it didn't
> encode the last of the three example latin characters). Exhaustively parsing
> all 8-bit character codes produces the following required regexps to go from
> raw any-ascii text to UTF-8 parsable (i.e. a shot-gun approach):
>
> s/&/&amp;/sg;
> s/</&lt;/sg;
> s/>/&gt;/sg;
> s/[\x00-\x08\x0b-\x0c\x0e-\x1f]//sg;
> s/([\x80-\xff])/sprintf("&#x%04x;",ord($1))/seg;
>
> This will delete any control characters that aren't valid in XML, and
> entity-encode characters above 127 (note, there are control characters above
> 127 in the Unicode database but these seem to be accepted by the parser
> ...).
>
> It would still be better to use a proper encoding transform than rely on
> regexps :-)
>
> Regards,
> Tim.
>
> ----- Original Message -----
> From: "Marina Muilwijk" <m.muilwijk@library.uu.nl>
> To: "OAI Implementers" <oai-implementers@oaisrv.nsdl.cornell.edu>
> Sent: Friday, September 27, 2002 2:46 PM
> Subject: Re: [OAI-implementers] SPECIAL CHARACTERS...
>
>
> On 27 Sep 2002 at 10:06, Ramon Martins Sodoma da Fonseca wrote:
>
> > We are having problems with the character encoding.
> > We need to display special characters, like "ç, ã, ö", and others, and
> > our question is:
>
> We use Perl's sprintf function. For instance:
> $creators =~ s/([^<>:a-zA-Z, .\/-])/sprintf "&#x%04X;", ord($1)/ei;
>
> which converts everything but the characters within brackets to their
> hexadecimal value and adds the "&#x" prefix required for a numeric character reference.
>
>
>
> _______________________________________________
> OAI-implementers mailing list
> OAI-implementers@oaisrv.nsdl.cornell.edu
> http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers
>