
PASTA - Provenance Aware Synthesis Tracking Architecture

Overview

The Provenance Aware Synthesis Tracking Architecture (PASTA) is a developing model for dynamically harvesting and archiving site-based data and metadata of the LTER Network for use in generating synthetically derived data products. These derived data products are then accessed through multiple user and machine interfaces. All derived data are described by a rich and structured Ecological Metadata Language (EML) document, which emphasizes the product's origin and processing history, that is, its “provenance”. PASTA is modular by design (see Figure) and can be implemented as many software components that fit together within the broader model. This modularity gives PASTA the flexibility and extensibility to be applied to a variety of data life-cycle problems. As such, PASTA allows the LTER Network Information System (NIS) to choose the right set of software tools, commercial or open-source, for current and future NIS modules.

Site-produced data are loaded into a common data “Cache”. The “Cache” exposes data through a standard interface, which analytical applications can use to produce derived data. Metadata for all derived data products are created and then harvested into Metacat, thereby providing resources for data discovery. Once discovered, the data are made accessible through various interfaces, such as a web browser or other web-service-based tools. The following sections briefly discuss the highlights of the PASTA model and give examples of implementation.

Existing Technologies

PASTA leverages existing technologies developed within the ecological community, including the Ecological Metadata Language, Metacat, and the Metacat-Harvester. To actively participate in the NIS, a site must describe its data with rich EML, including a complete description of all data tables. The site must also ensure that data are available for loading through standard network protocols, such as the Hypertext Transfer Protocol (HTTP) commonly used by web servers. PASTA was designed specifically to minimize impact on individual sites; sites that fully participate in the LTER initiative to describe data with EML should also be fully compatible with PASTA.

New Infrastructure

PASTA identifies site data as new or modified when an updated EML document is harvested into Metacat. The “Parser-Loader” uses the data table description found in the EML document to create a new relational database table in the data “Cache”. Once the new table is created, the “Parser-Loader” uses the network access point defined in the EML document to copy the site data into the data “Cache” table. The “Parser-Loader” module was developed in early 2007 through a collaborative effort between the LTER Network Office and the National Center for Ecological Analysis and Synthesis. Also known as the EML Data Manager Library, this software package is officially part of the EML software distribution under the NSF-funded Knowledge Network for Biocomplexity project. This infrastructure allows the dynamic creation of a persistent archive of LTER data that can serve the ecological community in perpetuity.
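The two “Parser-Loader” steps described above (create a table from the EML data table description, then copy the site data into it) can be illustrated with a minimal Python sketch. This is not the EML Data Manager Library or its API. The EML fragment is heavily simplified, the type mapping, site codes, and values are hypothetical, an in-memory SQLite database stands in for the data “Cache”, and an inline CSV string stands in for data fetched from the site's network access point.

```python
import csv
import io
import sqlite3
import xml.etree.ElementTree as ET

# Simplified, hypothetical EML-like fragment; real EML documents are far richer.
EML_FRAGMENT = """
<dataTable>
  <entityName>air_temperature</entityName>
  <attributeList>
    <attribute><attributeName>site</attributeName><storageType>string</storageType></attribute>
    <attribute><attributeName>year</attributeName><storageType>integer</storageType></attribute>
    <attribute><attributeName>mean_temp_c</attributeName><storageType>float</storageType></attribute>
  </attributeList>
</dataTable>
"""

# Assumed mapping from storage types to SQLite column types (illustrative only).
SQL_TYPES = {"string": "TEXT", "integer": "INTEGER", "float": "REAL"}


def quote_ident(name):
    """Quote an SQL identifier so metadata-supplied names cannot break the statement."""
    return '"' + name.replace('"', '""') + '"'


def create_table_from_eml(conn, eml_text):
    """Create a table whose columns follow the attribute list in the metadata."""
    root = ET.fromstring(eml_text)
    table = root.findtext("entityName").strip()
    columns = []
    for attr in root.iter("attribute"):
        name = attr.findtext("attributeName").strip()
        storage = attr.findtext("storageType", "string").strip()
        columns.append((name, SQL_TYPES.get(storage, "TEXT")))
    ddl = ", ".join(f"{quote_ident(n)} {t}" for n, t in columns)
    conn.execute(f"CREATE TABLE {quote_ident(table)} ({ddl})")
    return table, [n for n, _ in columns]


def load_csv(conn, table, column_names, csv_text):
    """Copy delimited site data into the table; returns the number of rows loaded."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    rows = [row for row in reader if row]
    placeholders = ", ".join("?" for _ in column_names)
    conn.executemany(
        f"INSERT INTO {quote_ident(table)} VALUES ({placeholders})", rows
    )
    return len(rows)


# Made-up site data; a real loader would fetch this over HTTP from the
# access point named in the EML document.
SITE_CSV = "site,year,mean_temp_c\nXYZ,2005,9.1\nXYZ,2006,9.4\n"

conn = sqlite3.connect(":memory:")  # stand-in for the data "Cache"
table, cols = create_table_from_eml(conn, EML_FRAGMENT)
loaded = load_csv(conn, table, cols, SITE_CSV)
```

Driving the table definition from the metadata rather than from hand-written schemas is what lets new site data sets be loaded without per-site code, which is consistent with the goal of minimizing impact on individual sites.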

Pluggable Workflows

PASTA's data “Cache” acts as a consolidated LTER data archive that is accessible by various applications, such as workflow engines or external analytical processors. These applications can directly access site data in the “Cache” for further processing or integration into synthetically derived data products. Examples of such applications include “Kepler”, a graphical workflow engine based on the Ptolemy project, and the open-source statistical analysis package “R”, which is being used as part of the EcoTrends project. Derived data products can be streamed to other applications or stored for distribution through PASTA's external interfaces. PASTA achieves data interoperability with other projects, information systems, and workflow systems by using standard database connection information and table descriptions.
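As a rough illustration of how an analytical application might use a standard database connection to turn cached site data into a derived product, the following Python sketch averages a variable across sites by year. Everything here is an assumption for illustration: the table name, columns, site codes, and values are made up, and an in-memory SQLite database stands in for the data “Cache”. A real client would connect with whatever connection information the “Cache” publishes.

```python
import sqlite3

# Stand-in for the data "Cache", seeded with made-up rows for two hypothetical sites.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE air_temperature (site TEXT, year INTEGER, mean_temp_c REAL)"
)
conn.executemany(
    "INSERT INTO air_temperature VALUES (?, ?, ?)",
    [
        ("XYZ", 2005, 9.0),
        ("XYZ", 2006, 10.0),
        ("ABC", 2005, 11.0),
        ("ABC", 2006, 12.0),
    ],
)


def yearly_network_mean(connection):
    """Derive a cross-site yearly mean from the cached site tables."""
    rows = connection.execute(
        "SELECT year, AVG(mean_temp_c) FROM air_temperature "
        "GROUP BY year ORDER BY year"
    ).fetchall()
    return dict(rows)


derived = yearly_network_mean(conn)
```

Because the application only needs connection information and a table description, the same query could be issued from a workflow engine or a statistical package instead of Python.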

Derived Data/Metadata Management

Derived data in PASTA may consist of an integrated, value-added product that has been processed through a scientific workflow or simply a “pass-through” of the original site data restructured and presented in a consistent format. Information about processes that “touch” the data, including information regarding the origin of the data, is captured as part of the provenance metadata trail. Metadata describing derived data is stored in Metacat as EML, thereby enabling standard programming interfaces for the discovery of all data. Derived data within the same project framework (e.g., EcoTrends) is stored in a uniform global schema; in other words, all derived data in PASTA is stored using the same canonical structure when viewed by internal or external applications.
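The idea of a provenance trail, in which each process that “touches” the data is recorded alongside the origin of the data, can be sketched with a small Python data structure. This is only a conceptual illustration: PASTA stores provenance as EML in Metacat, whereas the class names, fields, and identifiers below are hypothetical and do not follow the EML schema.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class ProcessingStep:
    """One process that touched the data (hypothetical structure)."""
    tool: str
    description: str


@dataclass
class ProvenanceRecord:
    """Origin and processing history of one derived data product."""
    product_id: str
    source_ids: list
    steps: list = field(default_factory=list)

    def touch(self, tool, description):
        """Append a processing step, preserving the order in which steps ran."""
        self.steps.append(ProcessingStep(tool, description))
        return self


# Identifiers below are invented for illustration.
record = ProvenanceRecord("derived-temp.1", ["site-xyz-temp.3", "site-abc-temp.7"])
record.touch("parser-loader", "Copied site tables into the cache")
record.touch("R", "Averaged yearly temperature across sites")

# A plain dictionary is easy to serialize into whatever metadata format is used.
trail = asdict(record)
```

Keeping the source identifiers and the ordered list of steps together means a consumer of the derived product can trace it back to the site data sets it came from.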

User Interfaces

Access to derived data and metadata is provided through web-based interfaces using a standard web browser, or through custom services that request data and metadata directly through well-defined application programming interfaces (APIs). Web interfaces can provide a rich set of discovery tools for locating derived data, dynamic plotting of single and combined data sets, and direct access to data through a “download” dialog. All data access is controlled through industry-standard identity management that confirms user acceptance of the local data use policy before allowing the data to be viewed. Each data access event is logged into an informational database for “use” analysis. Access events can also trigger notifications to the data set owner/creator, thereby promoting accurate and timely attribution when data is used in public media. Perhaps the best example of a PASTA web-based interface is the EcoTrends project web site (www.ecotrends.info).
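The access flow described above (confirm acceptance of the data use policy, log the access event, notify the data set owner) might look roughly like the following Python sketch. The function name, parameters, and in-memory lists are hypothetical stand-ins. A real deployment would rely on an identity management service, a logging database, and a messaging system.

```python
from datetime import datetime, timezone

access_log = []      # stand-in for the informational "use" database
notifications = []   # stand-in for messages sent to data set owners


def request_data(user, dataset_id, owner, accepted_policy):
    """Gate access on policy acceptance, then log the event and notify the owner."""
    if not accepted_policy:
        # Nothing is logged or sent when access is refused.
        raise PermissionError(f"{user} has not accepted the data use policy")
    event = {
        "user": user,
        "dataset": dataset_id,
        "time": datetime.now(timezone.utc).isoformat(),
    }
    access_log.append(event)
    notifications.append((owner, f"{user} accessed {dataset_id}"))
    return event


# Example request with invented user, data set, and owner values.
event = request_data(
    "alice", "derived-temp.1", "owner@example.org", accepted_policy=True
)
```

Checking the policy before anything else ensures that the log and the owner notifications only ever reflect accesses that were actually granted.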

For additional information, contact Mark Servilla, LTER Network Office (servilla@lternet.edu).