NESCent BDI: A Digital Repository for Preservation and Sharing of Data Underlying Published Works in Evolutionary Biology
The field of evolutionary biology is suffering from a crisis of data attrition. The problem is particularly evident when a researcher unsuccessfully attempts to obtain data sets associated with a published journal article. Though specialized databases exist for some of the most commonly seen data types (such as DNA sequences, character state matrices, and phylogenetic trees), it is rare that every dataset associated with a published paper has a suitable permanent home. Furthermore, while many evolutionary biology journals have policies that encourage authors to make their data accessible on-line, many individual researchers lack the technological means and sustainable infrastructure to ensure preservation and availability of their data over the long term. In this respect, evolutionary biology is typical of “small science” disciplines -- individuals or small groups collect much of the data manually, datasets are highly idiosyncratic in composition and format, and there is little infrastructure available for authors to share published data. As a result, much of the data underlying published works in the field is unavailable for future researchers to validate controversial findings, to reuse for studies that build upon the published work, to reanalyze as new methods and ideas are introduced, and to synthesize for the discovery of emergent trends. At the behest of major journals and societies in evolutionary biology, NESCent has begun development of a digital repository, called Dryad, for the preservation, discovery and sharing of data underlying published works throughout the discipline.
The overall aim of this proposal is to facilitate data sharing upon publication by the evolutionary biology community by addressing the major hurdles to adoption of Dryad, both technical and otherwise, in three broad areas: i) deposition and access interface, ii) incentives and interoperability, and iii) sustainability. We will also promote the use of Dryad as an educational tool to teach future scientists about the value of digital data archives. To achieve these goals, we propose the following specific aims (SA).
1. Deposition and access interface. Dryad aspires to provide a way for researchers to deposit their data in a usable form with minimal burden, and to take fuller advantage of existing technologies for information retrieval. Data deposition will be coordinated with the manuscript submission process. This will enable reliable bibliographic metadata (e.g. author and title) to be automatically stored by Dryad, and the citation for the data objects to be automatically included in the article. We will explore means of assisting the capture of scientific metadata (e.g. geo-spatial information, taxonomic scope) from authors using various approaches to automatic metadata generation. To maintain both data integrity and metadata quality, data curators will validate and, if necessary, edit submissions to the repository. A retrieval interface will be developed that uses the available metadata more fully and draws on relevant vocabularies, both existing and newly developed, to augment queries. Evaluations and user-testing will be employed during the design and implementation process, including studies of automated and user-generated metadata quality, the accuracy and recall of information retrieval, and usability studies of both the deposition and retrieval interfaces.
2. Incentives and interoperability. A major incentive to adoption is to implement, as far as possible, "one-stop shopping" for the deposition, discovery and retrieval of data. Towards this goal, we will enable interoperability with specialized databases and with metadata registries in related disciplines. As proof-of-concept for one-stop deposition, we will implement handshaking mechanisms with GenBank, for sequence data, and TreeBASE, for phylogenetic data, so that, where required by the journal or requested by the author, data will simultaneously be deposited in Dryad and either GenBank or TreeBASE. Handshaking will include automatic reuse of bibliographic metadata and identifiers, greatly simplifying the task of data deposition for the author. Dryad will assign globally unique, stable, and resolvable identifiers for datasets. These identifiers will enable Dryad to broker among the data objects related to a single paper, whether they be within Dryad itself or in a specialized repository. The identifiers will also provide a mechanism for data citations. Interoperability of Dryad with other digital collections in biology and beyond will be achieved, in part, by implementing the OAI-PMH protocol for metadata harvesting. As a proof-of-concept, we will add full compliance with OAI-PMH to Dryad, TreeBASE and Metacat, the premier metadata registry and data repository for ecology. Dryad and Metacat will also implement the Library of Congress Search and Retrieve via URL standard, which will allow on-the-fly access to repository contents by third parties through a web-service protocol, and will also enable syndication of repository contents.
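As a rough illustration of the metadata harvesting described above, the following Python sketch builds an OAI-PMH `ListRecords` request and extracts identifiers and Dublin Core titles from a response. The verb, the `metadataPrefix` and `resumptionToken` arguments, and the XML namespaces come from the OAI-PMH 2.0 specification. The base URL, the function names, and the sample record are hypothetical and do not describe the actual Dryad, TreeBASE, or Metacat endpoints.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and unqualified Dublin Core.
OAI_NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for a (hypothetical) OAI-PMH endpoint."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it is sent with the verb only.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response body."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", OAI_NS):
        identifier = rec.findtext("oai:header/oai:identifier", namespaces=OAI_NS)
        titles = [t.text for t in rec.iterfind(".//dc:title", OAI_NS)]
        records.append({"identifier": identifier, "titles": titles})
    token = root.findtext(".//oai:resumptionToken", namespaces=OAI_NS)
    # An empty or missing token means the list is complete.
    return records, (token or None)


# Invented sample response, used only to exercise the parser.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:dataset-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example character state matrix</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""
```

A harvester would typically fetch `list_records_url(...)`, pass the body to `parse_list_records`, and repeat with the returned token until no token remains.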
3. Sustainability. We propose a governance model and one technical experiment designed to ensure data preservation and sharing in perpetuity. Dryad will be overseen by a Management Board (MB) of stakeholders from evolutionary biology journals and societies, advised by information science experts and representatives from other scientific data sharing initiatives, who will set policy and plan for the financial self-sufficiency of the repository beyond the life of this project. We will explore technical advances in the long-term stewardship of digital data collections by implementing a distributed data preservation system following the LOCKSS (Lots of Copies Keep Stuff Safe) model, in addition to managing a more standard architecture of redundant production and backup systems within the North Carolina State University Libraries.
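The idea behind keeping many copies can be illustrated with a deliberately simplified Python sketch: each replica of a file is hashed, the digests are compared, and any copy that disagrees with the majority is replaced by a majority copy. This is not the LOCKSS polling protocol itself, only a toy model of the principle. The site names and the function `audit_and_repair` are hypothetical.

```python
import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Content fingerprint used to compare replicas."""
    return hashlib.sha256(data).hexdigest()


def audit_and_repair(replicas: dict[str, bytes]) -> list[str]:
    """Overwrite minority copies with the majority content.

    `replicas` maps a (hypothetical) site name to the bytes it holds.
    Returns the names of the sites whose copies were repaired.
    """
    digests = {site: digest(content) for site, content in replicas.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(replicas) / 2:
        # Without a strict majority there is no trustworthy copy to repair from.
        raise ValueError("no majority among replicas")
    good_site = next(site for site, d in digests.items() if d == winner)
    repaired = []
    for site, d in digests.items():
        if d != winner:
            replicas[site] = replicas[good_site]
            repaired.append(site)
    return repaired
```

For example, with three sites holding a dataset and one copy silently corrupted, the function restores the damaged copy from an agreeing site and reports which site was repaired.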
4. Community engagement. Community engagement is an integral component of the project and is critical both to short-term adoption by the user community and to the repository's long-term success. Datasets of special educational value will receive extra curatorial attention and be presented for student use through a dedicated education section of the repository, acclimating future investigators to a scientific culture in which digitally shared data will play an increasingly important role. Dryad tutorials will be presented at major evolutionary biology conferences to promote adoption and increase the extent and quality of the metadata provided by authors. NESCent will host annual workshops to support emerging metadata and interoperability standards in the field of evolutionary biology, and to plan for future handshaking efforts.
The work proposed here will have a broad and transformative impact by enabling the preservation, discovery, sharing and reuse of data for an entire biological discipline. It represents a unique collaboration among diverse institutions (academic journals and associated scientific societies, a national synthesis center and research network, a major community database) and expert communities (evolutionary biologists, information scientists and research librarians) and a pioneering application of digital library technology to data sharing for “small science”. We intend that this will serve as a model for efforts to preserve and share data in other disciplines facing a similar crisis of data attrition.
The LNO, in collaboration with NESCent, will focus on three specific software tasks:
- Crosswalks that transform metadata content from the Ecological Metadata Language (EML) to the Dublin Core standard and from Dublin Core to EML,
- Extension of the Metacat metadata repository so that it is compliant with the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), and
- Extension of the Metacat metadata repository so that its metadata querying capabilities include the Library of Congress standards Search and Retrieve via URL (SRU) and Search and Retrieve via Web-services (SRW).
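To make the first task more concrete, here is a minimal Python sketch of a crosswalk in one direction, from EML to unqualified Dublin Core. The element paths follow the general shape of an EML `dataset` module, and the Dublin Core and `oai_dc` namespaces are the standard ones. The mapping table, the function name, and the sample document are assumptions chosen for illustration; a production crosswalk would cover many more fields and the reverse direction.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"

# Illustrative mapping only: (path relative to <dataset>, Dublin Core element).
EML_TO_DC = [
    ("title", "title"),
    ("creator/individualName/surName", "creator"),
    ("abstract/para", "description"),
    ("keywordSet/keyword", "subject"),
]


def eml_to_dublin_core(eml_text):
    """Convert a small subset of an EML document to an oai_dc:dc element."""
    eml_root = ET.fromstring(eml_text)
    dataset = eml_root.find("dataset")  # EML child elements are unqualified
    if dataset is None:
        raise ValueError("EML document has no <dataset> element")
    # Register prefixes so serialized output uses dc: and oai_dc:.
    ET.register_namespace("dc", DC_NS)
    ET.register_namespace("oai_dc", OAI_DC_NS)
    dc_root = ET.Element(f"{{{OAI_DC_NS}}}dc")
    for path, dc_name in EML_TO_DC:
        for node in dataset.iterfind(path):
            text = (node.text or "").strip()
            if text:
                ET.SubElement(dc_root, f"{{{DC_NS}}}{dc_name}").text = text
    return dc_root


# Invented sample EML document, used only to exercise the crosswalk.
SAMPLE_EML = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.0.1"
    packageId="example.1.1" system="example">
  <dataset>
    <title>Example plant survey</title>
    <creator>
      <individualName><givenName>Ada</givenName><surName>Example</surName></individualName>
    </creator>
    <abstract><para>Counts of plants in hypothetical plots.</para></abstract>
    <keywordSet><keyword>plants</keyword><keyword>survey</keyword></keywordSet>
  </dataset>
</eml:eml>"""
```

Calling `ET.tostring(eml_to_dublin_core(SAMPLE_EML), encoding="unicode")` serializes the result. Keeping the mapping in a table rather than in code makes it easier to review against both standards.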
