[OAI-implementers] character vs entity references
Thomas G. Habing
thabing@uiuc.edu
Fri, 07 Nov 2003 12:05:47 -0600
Thomas G. Habing wrote:
>
> You need to be careful of characters in the x7F-x9F range. In Unicode
> these are all control characters and are forbidden in XML 1.0. But in
> many charsets these points are occupied by printable characters, such as
> in the Windows:Western charset where, for example, x8A is the S with
> caron, but in Unicode this needs to be converted to x160. If you just
> took this character and turned it into entity &#x8A; the resulting XML
> would not be valid.
>
Hi all,
I need to amend this slightly. Characters in the range x7F-x9F are legal in
XML 1.0, and a compliant parser shouldn't complain about them (although I am
pretty certain that some earlier XML parsers did). In any case, you still
need to be careful with these characters if you are converting from one of
the Windows character sets. A good description of the issue can be found at
http://www.w3.org/International/questions/qa-controls.html. Note that XML
1.1 treats control characters somewhat differently from 1.0: it allows
them, but only when they are written as numeric character references.
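
To make the conversion issue concrete, here is a minimal Python sketch of my
own (not taken from any OAI toolkit). It assumes the source bytes really are
Windows-1252; the function names to_xml_text and c1_controls are hypothetical.
The point is to decode with the real charset first and only then write
numeric character references, so that byte x8A comes out as &#x160; rather
than &#x8A;.

```python
from xml.sax.saxutils import escape


def to_xml_text(raw: bytes, charset: str = "cp1252") -> str:
    """Decode bytes using their real charset, then emit ASCII-only XML text."""
    text = raw.decode(charset)   # cp1252 maps byte 0x8A to U+0160
    text = escape(text)          # escape &, < and > before adding references
    # Write every non-ASCII character as a hexadecimal numeric character reference.
    return "".join(ch if ord(ch) < 0x80 else f"&#x{ord(ch):X};" for ch in text)


def c1_controls(text: str) -> list:
    """List code points in x7F-x9F found in text; these hint at a wrong decode."""
    return [f"x{ord(ch):02X}" for ch in text if 0x7F <= ord(ch) <= 0x9F]


raw = b"\x8a & co"
print(to_xml_text(raw))                    # decoded as Windows-1252
print(to_xml_text(raw, "latin-1"))         # naive: byte value copied straight through
print(c1_controls(raw.decode("latin-1")))  # flags the suspicious code point
```

The second call shows the mistake described above: treating the byte value as
the code point yields a reference to a control character, which parses but
does not mean S with caron. Note that Python's cp1252 codec raises
UnicodeDecodeError on the five byte values that Windows-1252 leaves undefined
(x81, x8D, x8F, x90, x9D), so real data may need an error-handling policy.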
Regards,
Tom
--
Thomas Habing
Research Programmer, Digital Library Projects
University of Illinois at Urbana-Champaign
155 Grainger Engineering Library Information Center, MC-274
thabing@uiuc.edu, (217) 244-4425
http://dli.grainger.uiuc.edu