[UPS] research proposal for NSF ITR? deadline for letter of intent is Nov. 15

Wed, 10 Nov 1999 08:59:03 -0500

Hi!

I am writing to suggest a serious collaborative research proposal
effort. Please read what is below and send me a brief reply
no later than Friday 11/12.

After our meeting, a number of participants went on to the Dublin Core 7
meeting in Frankfurt and continued some of the discussion.  Also, there has
been some email exchange involving Herbert and Carl, as well as
several people at the University of Virginia, whom Carl recommended become
involved because of their enhancements to and application of the Fedora
work from Cornell.

I've tried to summarize my view of all these discussions, in two
parts, below. The first part, A, addresses the Problems we are
tackling, couched in broad terms related to measuring impact on
society.  The second part, B, elaborates the vision that I think
a number of us were trying to develop in terms of open digital libraries
using a component-based distributed architecture.

These are both rough, so while I encourage detailed comments, please
note that this was sent without the weeks of polishing needed; consider
the general ideas and see if they resonate.

Now, my question is, should our group write a serious proposal to NSF,
perhaps for the ITR program, that is NSF 99-167
    http://www.nsf.gov/cgi-bin/getpub?nsf99167
I believe that we might have a chance to get funding to do some or
all of the research that is discussed below.  Note that a letter of
intent is due Nov. 15, so we need to decide this soon.

I am willing to take the lead on this, but only if no one else wants
to.  Carl is too busy, and I believe it would be hard for either
Herbert or the LANL group to get NSF support.  If others think that
they are in a better position, I'd happily pass on this opportunity :-)

I realize that many in the group want their archives to grow and
prosper and while they support UPS, don't view that as their central
interest. Others may view the integrated UPS as an important area
of research in its own right. I think we need both groups, working
together.

Anyhow, I now call for brief replies from each interested party to
answer the following questions.  I believe a reply could fit in a
short paragraph if you are pressed for time - I hope everyone will
reply no later than Friday 11/12.

1. Do you think we have a chance of getting NSF support through 99-167?
2. Do you think the issues raised in part A below define key
   research questions?
3. Does the outline in part B define a sensible follow-on for research
   according to our group vision?
4. Would you like to be involved in this? If so, what would your role
   be?  (answer with as many as apply)
   a- Work on own archive, but integrate it with UPS.
   b- Focus on UPS and its architecture, development, evaluation.
   c- Research, mostly on the architecture and software side.
   d- Research, mostly on the sociological and usage side.
5. What other comments and suggestions do you have?

I look forward to your replies. Many thanks, Ed

- - - - - - - - A. Thoughts about research follow - - - - - - - -
If we go after funding for the UPS initiative, what are the most important
research questions, whose resolution will have the greatest impact, and
will help advance our understanding the most?  Below are some candidates -
suggestions on others are welcome:

1. Research into the nature of scholarship and its change as a result of
UPS:
    - Will research build upon newer work than was possible before,
       since the delays in learning about scholarly efforts are reduced?
       Will that happen universally, or only in some situations?
       For example, will that happen only in less well known places
       that would not have heard through "invisible colleges", thus
       enfranchising smaller groups?
    - Will scholars look at more works than before since it is easier?
       If so, how much of each work will be examined? Which parts?
    - Will looking at such works not previously used (e.g., theses)
       provide real benefit?  Which types or genres are most beneficial?
       Or what combinations?
    - Will scholarly habits shift in a significant way to use UPS?
        Instead of works that cost (lots) more? Instead of works that are
        not as readily available (e.g., journals not available
        electronically)?
        For what types of scholarly activities / tasks?  For what learning
        activities?
        Will there be more cross-disciplinary research?

2. Digital library architecture
    - What is the "right" component-wise decomposition for digital libraries
        to support interoperability most easily and effectively?  Can we
        build it?
    - Can we demonstrate its practicality? Scalability? Efficiency? The ease
        with which new collections are made available? New virtual
        collections? New services? New combinations of services?
    - Can we demonstrate its usability? Effectiveness?  What are the effects
         of the decomposition on the complexity and performance of services?
         What are the effects on users of this decomposition / synthesis of
         services - do they become hard to understand? Hard to manage?
         How long does learning take?
    - Will it be easily adopted by many repository managers?  Which ones?
        Why? Why not - in case some don't support it?
    - How do we deal with missing metadata - can it be synthesized?  What
        are the effects of lack of metadata? Of very detailed metadata? How
        much of it is used? How often? How does this compare to only having
        full-text searching and linking?
    - How does this compare with the current situation with many separate 
        collections and services?

3. Studies possible
    - Study for various user communities, activities, tasks, periods
        of time:
    - Measure efficiency and effectiveness.
    - Determine relative effectiveness across user communities.
    - What combinations of collections and services into virtual collections
       are most popular? most beneficial?
    - Which services are most popular, beneficial?
    - What combinations of services are most popular, beneficial? What usage
        scenarios evolve (e.g., visualize collection, browse, search for
        similar items to ones identified, and then search with
        restrictions)?
    - How do patterns of use of services and their combinations vary across
        the communities, activities, tasks, etc.?

- - - - - - - - B. One possible approach to funding - - - - - - - -

Title: Covering the Grey - from Santa Fe:
   Universal, Heterogeneous, Distributed, Collaborative Self-Archiving

Overview: This project aims to support the objectives articulated at the
  Santa Fe meeting of October 21-22, 1999 to build a worldwide infrastructure
  for integrated access to the gray and related literature (theses,
  dissertations, e-prints, ...) wherein authors are involved in
  self-archiving processes. It focuses on research related to architectures,
  collections, services, and tools. It emphasizes information management,
  with middleware that is part of a scalable information infrastructure,
  supported with effective human-computer interfaces. It strongly involves
  HCI experts, from design through large-scale analysis of real usage.  It
  extends the scholarly community's ability to access the latest research
  results, both according to traditional disciplinary boundaries and through
  new cross-disciplinary knowledge representations and organizations.

A. Architectures
This project will support a variety of architectural instantiations, based
on the following underlying assumptions:
* Distributed
   - Content and services will make use of the distributed capabilities of
      the Internet.
   - "Collections" refer to containers and other structures related to
      content.
* (Virtual) Collections
   - We can have a directed acyclic graph of arbitrary complexity, building
      virtual collections from other collections, along with query-based
      restrictions/views of collections built upon.
   - See sections B and C below for more details.
* Components
   - A variety of DL-related content elements, software tools, and
        middleware services exist or will be developed.
   - These should be composable easily as components of larger, more
        powerful services and systems.
* Services
   - Key services will be supported with high performance, scalable to
        serve large numbers of users.
   - Services range from those surrounding a "raw" collection, to those
        around a virtual collection, to those transforming or delivering
        information, to those supporting individuals or communities of
        users, etc.
        See section C below for more details.
* Registries
   - Components of a given type will be registered so they can be located
      and used.
Note: Approaches based on agent technology, federation, centralization with
      replication, information buses, etc. can all be supported.  Buckets
      can be supported too, but this effort is neutral regarding the use of
      buckets as in the SODA effort, etc.
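
As a rough illustration of the (virtual) collections assumption above, here
is a minimal Python sketch of a directed acyclic graph of collections with
query-based restrictions. The class name `Collection`, the record fields, and
the sample archives `archive-a` and `archive-b` are all made up for
illustration; nothing here is an agreed design.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterator, List, Optional

Record = Dict[str, str]  # assumed shape: a flat metadata record with an "id" key


@dataclass
class Collection:
    """One node in the graph (hypothetical).

    `records` holds local (physical) content; `sources` lists the collections
    a virtual collection is built from; `keep` is an optional query-based
    restriction applied to everything visible at this node.
    """
    name: str
    records: List[Record] = field(default_factory=list)
    sources: List["Collection"] = field(default_factory=list)
    keep: Optional[Callable[[Record], bool]] = None

    def _walk(self) -> Iterator[Record]:
        # Gather local records plus everything visible through the sources,
        # then apply this node's restriction (if any).
        candidates = list(self.records)
        for source in self.sources:
            candidates.extend(source._walk())
        for record in candidates:
            if self.keep is None or self.keep(record):
                yield record

    def items(self) -> List[Record]:
        # A record reachable along several paths of the DAG is returned once.
        seen = set()
        unique = []
        for record in self._walk():
            if record["id"] not in seen:
                seen.add(record["id"])
                unique.append(record)
        return unique


# Invented sample data, only to exercise the sketch.
archive_a = Collection("archive-a", records=[
    {"id": "a:1", "subject": "computing", "genre": "report"},
    {"id": "a:2", "subject": "computing", "genre": "thesis"},
])
archive_b = Collection("archive-b", records=[
    {"id": "b:1", "subject": "physics", "genre": "thesis"},
    {"id": "b:2", "subject": "computing", "genre": "thesis"},
])
everything = Collection("everything", sources=[archive_a, archive_b])
cs_theses = Collection(
    "cs-theses",
    sources=[everything],
    keep=lambda r: r["subject"] == "computing" and r["genre"] == "thesis",
)
# A "diamond": archive_a is reachable directly and through `everything`.
diamond = Collection("diamond", sources=[everything, archive_a])
```

Here `cs_theses` is a view built on another virtual collection, and `diamond`
shows why de-duplication is needed once the graph is not a simple tree.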

B. Collections:
* Collections all support a repository access protocol
   - Given an ID, return one or both of:
       - native form of its content, which hopefully will include
             provenance, either implicitly or explicitly
       - standard form (according to collection type) of its content
   - All collections are self-describing, and can describe their
      subdivisions.
   - Identifiers used are persistent.
* Collections conform to a multiple inheritance taxonomy, with various
   facets:
   - One facet specifies organizations, e.g., university, college, dept.,
      group, individual
     Note: Case study work by Neill Kipp at Virginia Tech, looking for
      patterns in the digital library field, has identified the following
      types and examples of digital libraries. Some of the facets below
      relate to this list too.
         Community (NDLTD, CSTC, CDDC)
         Publisher (ACM-DL, D-Lib, Lexis)
         Warehouse (NZDL Music Library, Amazon.com, Marian)
         Museum (VTSF, Blake Archive)
         Library (NCSU MyLibrary, LA Courts Information Support)
   - One facet specifies subjects/disciplines, which can be structured
      hierarchically, e.g., science then computing,
      digital libraries then metadata then DC then DC.title
   - One facet distinguishes physical (e.g., LoC, Virginia Tech) or
      virtual (NDLTD, NCSTRL)
     Note: virtual collections can be constructed from physical or virtual
      collections, by identifying them, possibly with a filter that
      identifies a subset of interest.
   - One facet distinguishes types, with special methods as appropriate
      for the type
      - metadata - handle Dublin Core qualifiers and RDF in an intelligent
        fashion
      - document - returning all or part(s)
      - multimedia - with methods for returning objects (compressed,
        uncompressed), or for streaming
      - authority control - with de-duping
      - terms and conditions - for simple types that can easily be managed,
        like worldwide, educational use, for campus community, for
        author-specified group
      - thesaurus/cluster - with methods for returning an object, or a
        neighborhood of objects
* Collections included in the development and testing in this project
   include
   - NCSTRL - distributed across organizations, single discipline (CS)
   - NDLTD - distributed across organizations, genre (ETDs) across all
      disciplines
   - LTRS - NASA reports collection, from a single organization (multiple
      sites), on a (broad) discipline
   - xxx - centralized repository, serving multiple disciplines (Physics,
      CORR, ...)
   - Economics preprints - RePEc collection harvested from multiple
      organizations, single discipline
   - CogPrints - Cog. Sci. collection harvested from multiple organizations,
      single discipline
   - SLAC/SPIRES - physics collection from multiple organizations, single
      discipline
   - International physics departments - Harvest collection from multiple
      organizations, single discipline
   - other groups from Santa Fe meeting, plus additional volunteers
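
To make the repository access protocol in this section more concrete, here
is a small Python sketch of one possible reading: given an ID, return the
native form, the standard form, or both; let a collection describe itself and
its subdivisions; and never reassign an identifier. The names `Repository`,
`Item`, `get`, `describe`, the `form` parameter, and the sample identifier
are assumptions for illustration, not part of any agreed protocol.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Item:
    native: str                # content as the archive stores it
    standard: Dict[str, str]   # content in the collection type's standard form


@dataclass
class Repository:
    """Hypothetical in-memory stand-in for one archive or subdivision."""
    name: str
    description: str
    subdivisions: List["Repository"] = field(default_factory=list)
    _items: Dict[str, Item] = field(default_factory=dict)

    def add(self, identifier: str, item: Item) -> None:
        # Persistent identifiers: once assigned, an ID is never rebound.
        if identifier in self._items:
            raise ValueError(f"identifier {identifier!r} is already assigned")
        self._items[identifier] = item

    def describe(self) -> Dict[str, object]:
        # Self-description, recursing into subdivisions.
        return {
            "name": self.name,
            "description": self.description,
            "size": len(self._items),
            "subdivisions": [sub.describe() for sub in self.subdivisions],
        }

    def get(self, identifier: str, form: str = "both") -> Dict[str, object]:
        # Given an ID, return one or both forms of the content.
        item = self._items[identifier]  # raises KeyError for unknown IDs
        if form == "native":
            return {"native": item.native}
        if form == "standard":
            return {"standard": item.standard}
        if form == "both":
            return {"native": item.native, "standard": item.standard}
        raise ValueError(f"unknown form {form!r}")


# Invented sample content, only to exercise the sketch.
theses = Repository("theses", "Hypothetical thesis subdivision")
archive = Repository("example-archive", "Hypothetical departmental archive",
                     subdivisions=[theses])
theses.add("example:thesis/1",
           Item(native="<raw thesis file>",
                standard={"title": "An Example Thesis"}))
```

A real protocol would of course run over the network; the in-memory class
only shows which operations a client might expect from every collection.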

C. Services
* Support programmatic access
* Existing foundations for services include the following software
   - Dienst
   - Dienst extended for the Santa Fe initiative
   - SFX (see recent D-Lib Magazine article and 2 earlier this year)
   - MARIAN, NDLTD, and other efforts at Virginia Tech
* Types of services
   - Authority
      - Maybe Library of Congress will assist with a pilot - ref. Caroline
        Arms.
      - Maybe OCLC will assist with its authority information - ref. T.
        Hickey.
      - In connection with NDLTD and work in Germany, their server about
        teachers may be supporting this.
   - Statistical - analyzing properties and reporting, to help with other
      services such as visualization.
   - Clustering - perhaps using software from H. Chen in Arizona.
   - Summarization - perhaps using Stanford or Xerox tools
   - Index - MARIAN and other systems
   - Search - MARIAN and other systems
      - Including sophisticated use of content+context+links,
         metadata+text+multimedia
   - Disseminate - providing various forms and versions
   - Transform - supporting dissemination and archiving/preservation
      - manage conversions among MARC, Dublin Core, ReDIF, RFC-1807, ...
   - Thesaurus/concept space - manage MeSH, ERIC, ACM categories, ...
   - Browse - support navigation through thesaurus/concept space,
      document space, ...
   - Visualize - provide special support to manage collections, results
      sets, concepts, ...
   - Certification - authorization, authentication, and resolution, so we
      know the best terms and conditions for a given user regarding any
      restricted digital object (e.g., for SFX)
   - Link (e.g., using SFX) - from citations directly to digital objects
      in best form(s), using certification services
   - Mirroring, replication - for robustness, for political and
      efficiency purposes
   - Archiving and preservation
        - maybe connect harvesting activities at time of collection/cleanup
          to one or more 3rd party services to ensure preservation of
          archived content
        - use transform services to shift archive contents to newer
          representations as needed
   - Annotation
        - by author, those authorized by author, for public view
        - by anyone, for note-taking, for personal collection
   - Editorial processing and review, certification of quality
   - Other workflow management services
        - For educational resources, like CSTC (www.cstc.org)
        - For theses and dissertations (see software provided to
           administrators, from link near bottom right of page
           www.ndltd.org)
   - Submission by author
        - Of digital objects and metadata
        - Including accurate identification using thesaurus/concept
           space services
        - Including careful identification of every reference/citation
           so each can be easily resolved (e.g., with SFX)
        - With educational/training support about the process and principles

D. Tools
* The underlying mechanisms to make all this work involve principles and
   support of
   - Sharing/collaboration
   - Aggregation (at varying levels of static -> dynamic, with harvesting
      and federated search)
   - Automation (improving workflow, shifting to dynamic capabilities
      like SFX)
* We build upon various projects related to DC, RDF, XML
* We build upon relevant work at Cornell, Stanford, OCLC, ...

E. HCI
* HCI involved from beginning of design of the services and tools
* Remote evaluation of users working with the testbed as part of normal
   activities
* Evaluation of users in NSF-funded HCI labs at Virginia Tech 

F. Impact
* Of the research to be undertaken
   - Promote a DL industry
   - Promote scholars making their research available sooner, in more
        detail, in ways wherein discovery and reuse are supported better
* Of the services to be developed (testbed)
   - Promote sharing of research results
      - Reduce costs
      - Increase speed and convenience
      - Increase amount of electronic publishing
   - Promote cross-disciplinary work
   - Promote feasible efforts to develop new archives and services
   - Promote building of infrastructure at research universities
   - Promote scholars' knowledge of epublishing, digital libraries, IPR, ...

--
Edward A. Fox
Professor, Computer Science, Virginia Tech
fox@vt.edu