regular expressions for cleanup was: Re: [OAI-implementers] XML encoding problems with DSpace at MIT

Brian Tingle btingle@hades.ucop.edu
Mon, 17 Feb 2003 05:44:47 -0800 (PST)


The most common problems I've had as a provider so far have had to do 
with the ampersands in non-XML data that I want to expose.

This regular expression is what I use to take non-XML data that has lots 
of ampersands and turn them into &amp; but it will not "double" escape
&quot; &c. that might already be in there.

$content =
(Dad& &Mac & "Jake" JonesMr. A. Birch--S.U.B.&T; B.&T. &T.; Co.  :127a) 

turns to $content =
(Dad&amp; &amp;Mac &amp; "Jake" JonesMr. A. Birch--S.U.B.&amp;T; B.&amp;T. &amp;T.; Co.  :127a)

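        # an entity name: a start character followed by one or more name characters
        my $ident = '[:_A-Za-z][:A-Za-z0-9\-\_]+';
        # escape every & that does not begin an entity reference like &quot;
        $content =~ s,\&(?!$ident;),&amp;,sg;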
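
For anyone not working in Perl, the same negative-lookahead idea can be
sketched in Python. This is only an illustration under the same
assumptions as the substitution above; the names IDENT, BARE_AMP and
escape_bare_ampersands are made up for this example. Note that, like the
Perl pattern, it does not recognize numeric character references such as
&#38; and will escape their ampersand too.

```python
import re

# Same shape as the Perl $ident: a start character followed by one or more
# name characters, so a one-letter "&T;" is still treated as a bare ampersand.
IDENT = r"[:_A-Za-z][:A-Za-z0-9\-_]+"

# Match an ampersand only when it does NOT begin something that looks like
# an entity reference, e.g. "&quot;" or "&amp;".
BARE_AMP = re.compile(r"&(?!" + IDENT + r";)")


def escape_bare_ampersands(text: str) -> str:
    """Replace each '&' that does not start an entity reference with '&amp;'."""
    return BARE_AMP.sub("&amp;", text)
```

Because the lookahead consumes nothing, running the function a second time
over already-escaped text leaves it unchanged.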


Heinrich Stamerjohanns <stamer@uni-oldenburg.de> wrote:
> The most common problem, it seems to me (I cannot get to arc.cs.odu.edu to
> see the parsing errors), is that people create Unicode from their databases
> but forget to remove ISO control characters, which are not valid in XML
> (the comment in the XML 1.0 spec was confusing and has been changed in the
> XML 1.1 spec). Maybe this should be explicitly pointed out in the
> documentation of the protocol.
> 
> So to produce valid xml, something like this should be applied before you
> send out the data (this is in php, but is a perlre pattern):
> 
>   
>         // just remove invalid characters
>         $pattern = '/[\x00-\x08\x0b\x0c\x0e-\x1f]/';
>         $string = preg_replace($pattern,'',$string);
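
The control-character cleanup quoted above can be sketched in Python as
well. This is a rough illustration, not a complete validity check: it
only removes the characters below U+0020 that XML 1.0 disallows (all of
them except tab, line feed and carriage return) and does not look at
other excluded code points such as U+FFFE and U+FFFF. The names
INVALID_XML_CONTROLS and strip_invalid_controls are made up for this
example.

```python
import re

# Characters below U+0020 that XML 1.0 does not allow: everything except
# tab (0x09), line feed (0x0A) and carriage return (0x0D).
INVALID_XML_CONTROLS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_invalid_controls(text: str) -> str:
    """Remove the control characters that are not legal in XML 1.0."""
    return INVALID_XML_CONTROLS.sub("", text)
```

It would be applied to each field before the record is serialized, in the
same place as the preg_replace call.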




On Mon, 17 Feb 2003, Heinrich Stamerjohanns wrote:

> On Sat, 15 Feb 2003, Hussein Suleman wrote:
> 
> >  > The question is, the more harvesters implement fixes the less pressure
> >  > there is on repositories to fix their output, so should harvesters
> >  > accept bad-XML?
> 
> > hi
> >
> > i think Tim poses a very relevant question: do we deal with the
> > so-called "real-world" encoding problems or do we try to encourage
> > people to fix their implementations? (of course, for research purposes,
> > we may end up doing both :))
> >
> 
> Hi,
> 
> If you want a working protocol, you must insist that the data-providers
> deliver valid XML.
> If they don't deliver valid XML, they are not OAI-compliant, so some
> harvesters will choke, while others that try to fix the XML might not.
> 
> The most common problem, it seems to me (I cannot get to arc.cs.odu.edu to
> see the parsing errors), is that people create Unicode from their databases
> but forget to remove ISO control characters, which are not valid in XML
> (the comment in the XML 1.0 spec was confusing and has been changed in the
> XML 1.1 spec). Maybe this should be explicitly pointed out in the
> documentation of the protocol.
> 
> So to produce valid xml, something like this should be applied before you
> send out the data (this is in php, but is a perlre pattern):
> 
> 
>         // just remove invalid characters
>         $pattern = '/[\x00-\x08\x0b\x0c\x0e-\x1f]/';
>         $string = preg_replace($pattern,'',$string);
> 
> 
> Greetings, Heinrich
> 
> 
> --
>   Dr. Heinrich Stamerjohanns        Tel. +49-441-798-4276
>   Institute for Science Networking  stamer@uni-oldenburg.de
>   University of Oldenburg           http://isn.uni-oldenburg.de/~stamer
> 