regular expressions for cleanup was: Re: [OAI-implementers] XML encoding problems with DSpace at MIT
Brian Tingle
btingle@hades.ucop.edu
Mon, 17 Feb 2003 05:44:47 -0800 (PST)
The most common problems I've had as a provider so far have had to do
with the ampersands in non-XML data that I want to expose.
This regular expression is what I use to take non-XML data that has lots
of ampersands and turn them into &amp;, but it will not "double" escape
&quot; &c. that might already be in there.
$content =
(Dad& &Mac &amp; &quot;Jake&quot; JonesMr. A. Birch--S.U.B.&T; B.&T. &T.; Co. :127a)
turns to $content=
(Dad&amp; &amp;Mac &amp; &quot;Jake&quot; JonesMr. A. Birch--S.U.B.&amp;T; B.&amp;T. &amp;T.; Co. :127a)
my $ident = '[:_A-Za-z][:A-Za-z0-9\-\_]+';
# escape every & that is not already the start of an entity such as &amp;
$content =~ s,\&(?!$ident;),&amp;,sg;
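For anyone who wants to try the idea in isolation, here is a minimal sketch of the same negative-lookahead substitution wrapped in a sub. The name escape_bare_ampersands and the extra handling of numeric character references (&#38; or &#x26;) are my own assumptions, not part of the one-liner above, which would escape those as well.

```perl
use strict;
use warnings;

# Hypothetical helper (name chosen for illustration): escape every "&"
# that does not already start an entity or character reference.
sub escape_bare_ampersands {
    my ($text) = @_;
    # Same character sets as $ident above: one name-start character
    # followed by one or more name characters. A one-letter name such
    # as "&T;" therefore counts as a bare ampersand and is escaped.
    my $ident = qr/[A-Za-z_:][A-Za-z0-9_:-]+/;
    # Assumption: numeric references like &#38; or &#x26; should survive.
    my $numeric = qr/#(?:[0-9]+|x[0-9A-Fa-f]+)/;
    # Negative lookahead: only touch "&" when no "name;" follows it.
    $text =~ s/&(?!(?:$ident|$numeric);)/&amp;/g;
    return $text;
}

print escape_bare_ampersands('Dad& &Mac &amp; &quot;Jake&quot; B.&T. &#38;'), "\n";
```

Running the sub a second time on its own output should change nothing, which is the whole point of the lookahead.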
Heinrich Stamerjohanns <stamer@uni-oldenburg.de> wrote:
> The most common problem seems to me (I cannot get to arc.cs.odu.edu, to
> see the parsing errors) that people create Unicode from their databases
> but forget to remove ISO-control characters, which are not valid in XML
> (the comment in XML 1.0 spec was confusing and has been changed in XML
> 1.1 spec). Maybe this should be explicitly pointed out in the
> documentation of the protocol.
>
> So to produce valid xml, something like this should be applied before you
> send out the data (this is in php, but is a perlre pattern):
>
>
> // just remove invalid characters
> $pattern = '/[\x00-\x08\x0b-\x0c\x0e-\x1f]/';
> $string = preg_replace($pattern,'',$string);
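The same cleanup is easy to do on the Perl side. The following is only a sketch, assuming the data is already decoded text; the sub name strip_invalid_xml_chars is made up for illustration. It deletes the control characters below 0x20 that XML 1.0 does not allow and keeps tab, line feed and carriage return, which are legal.

```perl
use strict;
use warnings;

# Hypothetical helper: delete the C0 control characters that XML 1.0
# forbids. Tab (0x09), LF (0x0A) and CR (0x0D) are allowed and kept.
sub strip_invalid_xml_chars {
    my ($text) = @_;
    # tr///d deletes every character in the list; no regex engine needed.
    $text =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F//d;
    return $text;
}

my $dirty = "Title\x00 with\x0B stray\x1F controls\tand a tab\n";
print strip_invalid_xml_chars($dirty);
```

I would run this before the ampersand escaping, so that both steps see the same text.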
On Mon, 17 Feb 2003, Heinrich Stamerjohanns wrote:
> On Sat, 15 Feb 2003, Hussein Suleman wrote:
>
> > > The question is, the more harvesters implement fixes the less pressure
> > > there is on repositories to fix their output, so should harvesters
> > > accept bad-XML?
>
> > hi
> >
> > i think Tim poses a very relevant question: do we deal with the
> > so-called "real-world" encoding problems or do we try to encourage
> > people to fix their implementations? (of course, for research purposes,
> > we may end up doing both :))
> >
>
> Hi,
>
> If you want a working protocol, you must insist that the data-providers
> deliver valid XML.
> If they don't deliver valid XML, they are not OAI-compliant, so some
> harvesters will choke, while others that try to fix the XML might not.
>
>
> Greetings, Heinrich
>
>
> --
> Dr. Heinrich Stamerjohanns Tel. +49-441-798-4276
> Institute for Science Networking stamer@uni-oldenburg.de
> University of Oldenburg http://isn.uni-oldenburg.de/~stamer