[OAI-implementers] Re: oai_dc records (was Re: [EP-tech] Free Text Indexing)

Andy Powell a.powell@ukoln.ac.uk
Tue, 9 Jul 2002 21:16:09 +0100 (BST)


On Mon, 8 Jul 2002, ePrints Support wrote:

> X-Posted to OAI-Implementers from eprints-tech
> 
> I've been thinking hard on this issue.
> 
> I think that it is futile to build anything "clever" on top of
> unqualified DC.
> 
> What would be far more useful is if a number of interested parties
> could agree on a value-added metadata format for doing "clever stuff".

I guess this is what the Academic Metadata Format (AMF) people have been
trying to do?

  http://amf.openlib.org/doc/ebisu.html

I don't have an objection to what you suggest.

Just to clarify the rationale behind my suggestions about the current
oai_dc defaults in the eprints.org software...

1) all eprint archives that support OAI-PMH *must* support oai_dc (to be
compliant with the protocol)

2) we might as well make the default configuration of oai_dc in the
eprints.org software generate metadata that is as useful as possible (even
though it might not be as useful as some other, richer, format).

3) the oai_dc metadata generated by the eprints.org software should
conform to DCMI semantics and guidelines on best practice.

That's all my suggestions were trying to do.  I hope this makes sense?

Regards,

Andy.

> E.g. you, the AKT project and the Citebase project at Southampton, and
> similar projects.
> 
> If a good number of service providers all use the same advanced 
> metadata type then this will be an incentive for OAI archives to
> support it (esp. if EPrints leads the way). 
> 
> The alternative situation is that every archive supports oai_dc or
> oai_dc + 1 random other. Which is (nearly) useless for federating
> richer metadata.
> 
> So the real question is what would people want to use OAI for
> beyond "dumb text search" resource discovery. Things which would
> involve more than just storing and searching the data but actually
> processing it. And what metadata schema would best suit the 
> majority of these needs?
> 
> 
> On Sat, Jul 06, 2002 at 10:44:33AM +0100, Andy Powell wrote:
> > On Sat, 6 Jul 2002, Andy Powell wrote:
> > 
> > > On Fri, 5 Jul 2002, ePrints Support wrote:
> > > 
> > > > Has anybody actually modified the ArchiveOAIConfig.pm module? If none or 
> > > > > very few have, I will feel confident to totally replace it with a different
> > > > system.
> > > > 
> > > > If anyone made any *good* changes, lemmie know, they might be useful!
> > > 
> > > Not sure if this is what you were asking for, but I have some
> > > comments/questions about the use of DC in the records exposed using
> > > OAI by default (i.e. without modifying the OAI config in 2.0.1).
> > 
> > > A simple set of context diffs to achieve most of what I say below (for
> > v2.0.1, sorry I haven't upgraded yet!) are attached.  Note, probably
> > better to handle mime typing thru normal mime.types or whatever if
> > possible?  Also, I didn't know how to get at the alternative URLs.
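
On the mime.types point, here is a rough sketch of reading extension-to-type pairs from mime.types-style text instead of hardcoding a hash. The sub name parse_mime_types and the sample entries are made up for illustration, and it assumes the eprints format name (e.g. "pdf") matches a file extension, which may not hold for every format. A real setup would read the system's own mime.types file, whose location varies.

```perl
use strict;
use warnings;

# Parse mime.types-style text ("type ext1 ext2 ...", "#" starts a comment)
# into a hash keyed by lower-cased extension.
# Hypothetical helper, not part of the eprints.org software.
sub parse_mime_types {
    my ($text) = @_;
    my %by_ext;
    foreach my $line ( split /\n/, $text ) {
        $line =~ s/#.*//;                          # drop comments
        my ( $type, @exts ) = split ' ', $line;    # whitespace-separated fields
        next unless defined $type;                 # skip blank lines
        $by_ext{ lc $_ } = $type foreach @exts;
    }
    return %by_ext;
}

# Sample entries in the usual mime.types layout (illustrative only).
my %by_ext = parse_mime_types( <<'END' );
# type                  extensions
application/pdf         pdf
application/postscript  ps eps ai
text/html               html htm
END

print $by_ext{pdf}, "\n";
```

Formats with no entry simply come back undefined, so the caller can leave dc:format out rather than emit a guess.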
> > 
> > Results can be viewed using repository explorer
> > 
> > http://oai.dlib.vt.edu/cgi-bin/Explorer/oai2.0/testoai
> > 
> > against
> > 
> > http://eprints.bath.ac.uk/perl/oai
> > 
> > Andy.
> > 
> > > Here's a text view (cut-and-paste from the repository explorer) of a
> > > record from ePrints@Bath:
> > > 
> > > 
> > >       title: An OAI Approach to Sharing Subject Gateway Content
> > >       creator: Powell, Andy
> > >       subject: UKOLN
> > >       description: The Resource Discovery Network (RDN) has taken a...
> > >       date: 2001-01-01
> > >       type: Conference Poster
> > >       identifier: http://eprints.bath.ac.uk/archive/00000003/
> > >       format: pdf http://eprints.bath.ac.uk/archive/00000003/01/1097.pdf
> > > 
> > > I think that dc:identifier should be the URI of the item not the URI of
> > > the abstract page about the item.  For multiple-format items, simply
> > > > repeat dc:identifier (as you currently do in dc:format).
> > > 
> > > Your current use of dc:format is not ideal. It would be better (i.e.
> > > conform more with DCMI recommendations) to put a MIME type in dc:format -
> > > putting both a type and a URI in dc:format doesn't match with DCMI
> > > recommendations.
> > > 
> > > I think the abstract page is related to the resource being described by
> > > the metadata (i.e. the abstract page is related to the item(s)), therefore
> > > it would be better to put the URI of the abstract into dc:relation.
> > > 
> > > So, for the record above, I'd prefer to see
> > > 
> > >       identifier: http://eprints.bath.ac.uk/archive/00000003/01/1097.pdf
> > >       format: application/pdf
> > >       relation: http://eprints.bath.ac.uk/archive/00000003/
> > > 
> > > with identifier and format repeated if more than one format is available.
> > > This isn't perfect (because there's no strong tie between the
> > > format/identifier pairs) but it is more in line with DCMI recommendations
> > > and semantics.
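
A minimal sketch of that mapping, written against plain hashes so that it runs without the EPrints API. The record layout (abstract_url, documents, url, format), the sub name dc_for_record and the contents of %mime_types are assumptions for illustration, not names taken from the eprints.org code.

```perl
use strict;
use warnings;

# Hypothetical lookup from an eprints format name to a MIME type.
my %mime_types = (
    pdf  => "application/pdf",
    html => "text/html",
    ps   => "application/postscript",
);

# Build a list of [ element, value ] pairs for one record.
# $record is a plain hash reference standing in for an eprint object.
sub dc_for_record {
    my ($record) = @_;
    my @dcdata;

    # One dc:identifier per document, followed by dc:format
    # when a MIME type is known for that document's format.
    foreach my $doc ( @{ $record->{documents} } ) {
        push @dcdata, [ "identifier", $doc->{url} ];
        my $mime = defined $doc->{format} ? $mime_types{ $doc->{format} } : undef;
        push @dcdata, [ "format", $mime ] if defined $mime;
    }

    # The abstract page is related to the item, so it goes in dc:relation.
    push @dcdata, [ "relation", $record->{abstract_url} ];

    return @dcdata;
}

my %record = (
    abstract_url => "http://eprints.bath.ac.uk/archive/00000003/",
    documents    => [
        {
            format => "pdf",
            url    => "http://eprints.bath.ac.uk/archive/00000003/01/1097.pdf",
        },
    ],
);

my @dc = dc_for_record( \%record );
print "$_->[0]: $_->[1]\n" foreach @dc;
```

For the Bath record this prints the identifier, format and relation lines in the order given above. With several documents the identifier/format pairs simply repeat, which is exactly where the weak tie between the two elements shows up.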
> > > 
> > > The URIs for 'alternative' locations of the item (e.g. URIs external to
> > > the eprint archive) should also appear in repeated dc:identifier elements 
> > > IMHO.
> > > 
> > > (Note: there will be some in the DC-camp that would say that in the case
> > > of multiple formats being available, you should expose multiple DC
> > > metadata records using OAI-PMH (one for each available format), perhaps
> > > using dc:relation to tie them together.  I would argue that this is
> > > probably over the top and would make life quite difficult for software
> > > trying to process the multiple metadata records).
> > > 
> > > Finally, you seem to hardcode the day in the date to "01" (I think).  Note
> > > that  2001-09  (i.e. a date without a day) is a perfectly valid ISO8601
> > > date so you don't really need to force the '-01' bit.
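
To illustrate the date point, a small sketch that emits only as much of the date as is actually known. The helper name iso_date is hypothetical, and it assumes the year, month and day arrive as plain numbers or undef, which may not match how the archive stores them.

```perl
use strict;
use warnings;

# Build a date string from whichever parts are known:
# "YYYY", "YYYY-MM" or "YYYY-MM-DD". Hypothetical helper.
sub iso_date {
    my ( $year, $month, $day ) = @_;
    return unless defined $year;           # nothing usable without a year

    my $date = sprintf( "%04d", $year );
    return $date unless defined $month;    # year only

    $date .= sprintf( "-%02d", $month );
    return $date unless defined $day;      # year and month

    return $date . sprintf( "-%02d", $day );
}

print iso_date( 2001 ),        "\n";
print iso_date( 2001, 9 ),     "\n";
print iso_date( 2001, 9, 15 ), "\n";
```

Nothing is padded out with an invented "-01", so a harvester can tell a record dated September 2001 from one dated 1 September 2001.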
> > > 
> > > > I'd be interested in others' views on all this...
> > > 
> > > Andy
> > > --
> > > Distributed Systems, UKOLN, University of Bath, Bath, BA2 7AY, UK
> > > http://www.ukoln.ac.uk/ukoln/staff/a.powell       +44 1225 383933
> > > Resource Discovery Network http://www.rdn.ac.uk/
> > > 
> > > 
> > > 
> > 
> > 
> 
> > *** ArchiveOAIConfig.pm	Wed Jun  5 12:43:09 2002
> > --- ArchiveOAIConfig.pm.orig	Thu Jun 13 17:29:33 2002
> > ***************
> > *** 280,293 ****
> >   
> >   		$month = "01" if( !defined $month );
> >   		
> > ! 		push @dcdata, [ "date", "$year-$month" ];
> >   	}
> >   
> >   	my $ds = $eprint->get_dataset();
> >   	push @dcdata, [ "type", $ds->get_type_name( $session, $eprint->get_value( "type" ) ) ];
> >   
> > ! 	# dc:relation is the URL of the abstract
> > ! 	push @dcdata, [ "relation", $eprint->get_url() ];
> >   
> >   	# Export the type and URL of each actual document, this
> >   	# is far from ideal, but DC offers no easy solution to
> > --- 280,294 ----
> >   
> >   		$month = "01" if( !defined $month );
> >   		
> > ! 		push @dcdata, [ "date", "$year-$month-01" ];
> >   	}
> >   
> >   	my $ds = $eprint->get_dataset();
> >   	push @dcdata, [ "type", $ds->get_type_name( $session, $eprint->get_value( "type" ) ) ];
> >   
> > ! 	# The identifier is the URL of the abstract page.
> > ! 	# possibly this should be the OAI ID, or both.
> > ! 	push @dcdata, [ "identifier", $eprint->get_url() ];
> >   
> >   	# Export the type and URL of each actual document, this
> >   	# is far from ideal, but DC offers no easy solution to
> > ***************
> > *** 295,311 ****
> >   	# citation linking systems, so better to have it than not.
> >   
> >   	my @documents = $eprint->get_all_documents();
> > - 	my %mime_types = (
> > - 		pdf => "application/pdf"
> > - 		);
> >   	foreach( @documents )
> >   	{
> > ! 		push @dcdata, [ "identifier", $_->get_url() ];
> > ! 		if( $_->is_set( "format" ) )
> > ! 		{
> > ! 			my $format = $mime_types{$_->get_value( "format" )};
> > ! 			push @dcdata, [ "format", $format ] if $format;
> > ! 		}
> >   	}
> >   		
> >   	return @dcdata;
> > --- 296,304 ----
> >   	# citation linking systems, so better to have it than not.
> >   
> >   	my @documents = $eprint->get_all_documents();
> >   	foreach( @documents )
> >   	{
> > ! 		push @dcdata, [ "format", $_->get_value( "format" )." ".$_->get_url() ];
> >   	}
> >   		
> >   	return @dcdata;
> 
> 
> -- 
> 
>  Christopher Gutteridge                   eprints-support@ecs.soton.ac.uk
>  ePrints2 Coder, Support and Stuff        +44 23 8059 4833
> 
> 

Andy
--
Distributed Systems, UKOLN, University of Bath, Bath, BA2 7AY, UK
http://www.ukoln.ac.uk/ukoln/staff/a.powell       +44 1225 383933
Resource Discovery Network http://www.rdn.ac.uk/