regular expressions for cleanup was: Re: [OAI-implementers] XML
encoding problems with DSpace at MIT
Hussein Suleman
hussein@cs.uct.ac.za
Thu, 20 Feb 2003 17:18:24 +0200
hi
Brian Tingle wrote:
> The most common problems I've had as a provider so far have had to do
> with the ampersands in non-XML data that I want to expose.
... (see rest below)
this will work some of the time, but there will be problems if you have
XML/HTML/SGML entities that are other than the standard ones (eg. i
believe &copy; will cause problems) ... maybe you are already addressing
this, but if not, read on ...
XML has only 5 predefined entities (quot, lt, gt, amp, apos) - anything
else requires an entity declaration in a DTD, and OAI requires using
numerical entities instead of those (see start of section 3.2 of the
protocol). the clean solution is to convert any suspected
entities (Latin-1 seems to pop up in many places because of HTML) into
numerical Unicode entities, and then double-escape anything you don't
recognise ... best effort is probably not good enough - if in doubt,
it's better to produce slightly over-escaped valid XML than originally
encoded but possibly invalid XML :)
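
to make the convert-then-double-escape idea concrete, here is a minimal sketch in Python (not Perl, and not the code from any OAI toolkit). the function name clean_entities is made up for illustration, and it assumes that the HTML 4 entity table shipped with Python (html.entities.name2codepoint) is a good enough list of "suspected" entities. it does not check that existing numeric references point at characters XML allows.

```python
import re
from html.entities import name2codepoint

# The five entities every XML parser knows without a DTD.
XML_PREDEFINED = {"quot", "lt", "gt", "amp", "apos"}

# Either a complete reference (&name; / &#123; / &#x7B;) or a lone ampersand.
_REFERENCE = re.compile(r"&(#[0-9]+|#[xX][0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);|&")


def clean_entities(text):
    """Hypothetical helper: make every '&' in text safe for XML output."""

    def fix(match):
        name = match.group(1)
        if name is None:
            return "&amp;"               # bare ampersand: escape it
        if name.startswith("#"):
            return match.group(0)        # already numeric: keep as-is
        if name in XML_PREDEFINED:
            return match.group(0)        # predefined in XML: keep as-is
        if name in name2codepoint:
            # known HTML entity (e.g. copy): convert to a numeric reference
            return "&#%d;" % name2codepoint[name]
        # unknown name: over-escape rather than emit invalid XML
        return "&amp;" + name + ";"

    return _REFERENCE.sub(fix, text)


print(clean_entities("&copy; 2003 A&B &foo; &lt;"))
```

anything the table does not know ends up double-escaped, which matches the "if in doubt, over-escape" rule above.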
but, hey, don't reinvent the wheel ... look at the code templates
available on the OAI website. most of the toolkits do some degree of
data cleaning. if you use Perl, the VTOAI template i wrote has a
"Utility.pm" module for data cleaning which does all of the above/below
plus much more.
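
for the "below" part (the control-character removal in Heinrich's quoted message), a rough Python equivalent of his pattern could look like this. it is only a sketch: the name strip_invalid_xml_chars is hypothetical, and it covers just the low control range from his pattern, not everything the XML 1.0 Char production excludes (surrogates, U+FFFE, U+FFFF).

```python
import re

# C0 control characters that XML 1.0 does not allow.
# Tab (0x09), LF (0x0A) and CR (0x0D) are legal and therefore left out.
_INVALID_XML_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_invalid_xml_chars(text):
    """Hypothetical helper: drop control characters XML 1.0 forbids."""
    return _INVALID_XML_CHARS.sub("", text)


print(repr(strip_invalid_xml_chars("a\x00b\tc\x1fd\n")))
```

running this before the entity cleanup keeps the two steps independent of each other.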
ttfn,
----hussein
--
=====================================================================
hussein suleman ~ hussein@cs.uct.ac.za ~ http://www.husseinsspace.com
=====================================================================
> This regular expression is what I use to take non-XML data that has lots
> of ampersands and turn them to &amp; but it will not "double" escape
> &quot; &c. that might already be in there.
>
> $content =
> (Dad& &Mac & &quot;Jake&quot; JonesMr. A. Birch--S.U.B.&T; B.&T. &T.; Co. :127a)
>
> turns to $content=
> (Dad&amp; &amp;Mac &amp; &quot;Jake&quot; JonesMr. A. Birch--S.U.B.&amp;T; B.&amp;T. &amp;T.; Co. :127a)
>
> my $ident = '[:_A-Za-z][:A-Za-z0-9\-\_]+';
> $content =~ s,\&(?!$ident;),&amp;,sg;
>
>
> Heinrich Stamerjohanns <stamer@uni-oldenburg.de> wrote:
>
>>The most common problem seems to me (I cannot get to arc.cs.odu.edu, to
>>see the parsing errors) that people create Unicode from their databases
>>but forget to remove ISO-control characters, which are not valid in XML
>>(the comment in XML 1.0 spec was irritating and has been changed in XML
>>1.1 spec). Maybe this should be explicitly pointed out in the
>>documentation of the protocol.
>>
>>So to produce valid xml, something like this should be applied before you
>>send out the data (this is in php, but is a perlre pattern):
>>
>>
>> // remove control characters that are not allowed in XML 1.0
>> // (tab \x09, LF \x0a and CR \x0d are kept)
>> $pattern = '/[\x00-\x08\x0b\x0c\x0e-\x1f]/';
>> $string = preg_replace($pattern, '', $string);
>
>
>
>
>
> On Mon, 17 Feb 2003, Heinrich Stamerjohanns wrote:
>
>
>>On Sat, 15 Feb 2003, Hussein Suleman wrote:
>>
>>
>>> > The question is, the more harvesters implement fixes the less pressure
>>> > there is on repositories to fix their output, so should harvesters
>>> > accept bad-XML?
>>
>>>hi
>>>
>>>i think Tim poses a very relevant question: do we deal with the
>>>so-called "real-world" encoding problems or do we try to encourage
>>>people to fix their implementations? (of course, for research purposes,
>>>we may end up doing both :))
>>>
>>
>>Hi,
>>
>>If you want a working protocol, you must insist that the data-providers
>>deliver valid XML.
>>If they don't deliver valid XML, they are not OAI-compliant, thus some
>>harvesters will choke, some who try to fix the XML, might not.
>>
>>The most common problem seems to me (I cannot get to arc.cs.odu.edu, to
>>see the parsing errors) that people create Unicode from their databases
>>but forget to remove ISO-control characters, which are not valid in XML
>>(the comment in XML 1.0 spec was irritating and has been changed in XML
>>1.1 spec). Maybe this should be explicitly pointed out in the
>>documentation of the protocol.
>>
>>So to produce valid xml, something like this should be applied before you
>>send out the data (this is in php, but is a perlre pattern):
>>
>>
>> // remove control characters that are not allowed in XML 1.0
>> // (tab \x09, LF \x0a and CR \x0d are kept)
>> $pattern = '/[\x00-\x08\x0b\x0c\x0e-\x1f]/';
>> $string = preg_replace($pattern, '', $string);
>>
>>
>>Greetings, Heinrich
>>
>>
>>--
>> Dr. Heinrich Stamerjohanns Tel. +49-441-798-4276
>> Institute for Science Networking stamer@uni-oldenburg.de
>> University of Oldenburg http://isn.uni-oldenburg.de/~stamer
>>
>>
>>
>>_______________________________________________
>>OAI-implementers mailing list
>>OAI-implementers@oaisrv.nsdl.cornell.edu
>>http://oaisrv.nsdl.cornell.edu/mailman/listinfo/oai-implementers
>>