[OAI-implementers] character vs entity references
Thomas G. Habing
thabing@uiuc.edu
Wed, 05 Nov 2003 12:30:52 -0600
Heinrich Stamerjohanns wrote:
> On Wed, 5 Nov 2003, Tim Brody wrote:
>
>
>>AFAIK a character reference is a reference into the Unicode character set,
>>so it's invalid whether it's in &#xx; form, utf-8, utf-16 or whatever.
>>
>
> I do not know exactly what you mean by that, but "&#241;" is certainly a
> correct character reference. The byte representation of characters
> above 127 is just different (ISO-8859-1: 1 byte, UTF-8: more bytes),
> but the character reference &#241; represents the same character in
> XML(iso-8859-1) and XML(UTF-8).
>
>
>>You should either remove the characters or convert the character to its
>>nearest equivalent in Unicode (for control characters there probably isn't
>>one).
>
>
> I remove invalid characters with this (PHP code with perlregex):
> // just remove invalid control characters
> $pattern = '/[\x00-\x08\x0b\x0c\x0e-\x1f]/';
> $string = preg_replace($pattern,'',$string);
>
> Heinrich
>
>
You need to be careful with characters in the x7F-x9F range. In Unicode
these are all control characters; XML 1.0 does not strictly forbid them,
but they are almost never what you mean. In many charsets these code points
are occupied by printable characters, such as in the Windows Western charset
(windows-1252) where, for example, x8A is the S with caron, which in Unicode
is x160. If you just took this byte and turned it into the character
reference &#x8A;, the resulting XML would still parse but would refer to a
control character rather than the S with caron; the correct reference is
&#x160;.
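
A minimal sketch of that conversion, in Python rather than the PHP quoted
above. It assumes the incoming bytes are Windows-1252; the function name
to_xml_text and the source_charset parameter are made up for illustration.
The idea is to decode to Unicode first and only then build character
references, so that byte x8A comes out as &#x160; and never as &#x8A;.

```python
import re

# C0 control characters that XML 1.0 does not allow at all
# (everything below x20 except TAB, LF and CR).
_XML10_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def to_xml_text(raw: bytes, source_charset: str = "cp1252") -> str:
    """Decode raw bytes from source_charset and return ASCII-only XML text.

    Assumption: the caller knows the real charset of the data. Bytes that
    have no mapping in that charset become U+FFFD (errors="replace").
    """
    # Step 1: map charset bytes to Unicode code points (x8A -> U+0160).
    text = raw.decode(source_charset, errors="replace")

    # Step 2: drop the control characters XML 1.0 cannot carry.
    text = _XML10_ILLEGAL.sub("", text)

    # Step 3: escape markup characters and reference everything non-ASCII
    # by its Unicode code point, not by its original byte value.
    out = []
    for ch in text:
        if ch == "&":
            out.append("&amp;")
        elif ch == "<":
            out.append("&lt;")
        elif ch == ">":
            out.append("&gt;")
        elif ord(ch) < 0x7F:
            out.append(ch)
        else:
            out.append("&#x%X;" % ord(ch))
    return "".join(out)


if __name__ == "__main__":
    print(to_xml_text(b"\x8a"))        # S with caron in windows-1252
    print(to_xml_text(b"Espa\xf1a"))   # n with tilde
```

If the data is really ISO-8859-1, passing source_charset="latin-1" would
leave x80-x9F as C1 control characters, which is exactly the case to watch
for; picking the right source charset is the part this sketch cannot do
for you.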
--
Thomas Habing
Research Programmer, Digital Library Projects
University of Illinois at Urbana-Champaign
155 Grainger Engineering Library Information Center, MC-274
thabing@uiuc.edu, (217) 244-4425
http://dli.grainger.uiuc.edu