Thematic Real-time Environmental Distributed Data Services (THREDDS): Domenico et al.: JoDI

Abstract

The overarching goal of Unidata's Thematic Real-time Environmental Distributed Data Services (THREDDS) is to provide students, educators and researchers with coherent access to a large collection of real-time and archived datasets from a variety of environmental data sources at a number of distributed server sites. The datasets will be conveniently accessible from a collection of THREDDS-enabled data analysis and display tools. THREDDS will provide real-time data delivery via reliable, event-driven "push" technology as well as transparent access to datasets using "pull" systems that make it possible to access data on remote servers as if they were on the user's own computer. The system will be built on a set of software components and data servers that are already in operation or under development. The heart of THREDDS is metadata contained in publishable inventories and catalogs (PICats). The creation, publication and distribution of PICats will be facilitated by the discovery system and services provided by DLESE. For example, sites receiving real-time environmental data can create PICats describing data products automatically as they arrive using decoders and crawlers. On the other hand, since PICats do not have to reside on the server with the data, researchers will be able to create PICats for online publications that point to datasets residing on several data servers. Similarly, educators will incorporate PICats of illustrative datasets into modules that also include tools for data analysis and visualization, and students will be able to use PICats to point to datasets related to their research projects, just as they now use URLs to point to relevant documents. This paper presents an overview of THREDDS and an update on the current status.

1 Overview

In "The Absorbent Mind" Maria Montessori described education simply and elegantly: "It is not acquired by listening to words, but in virtue of experiences in which the child acts on his environment". On a different level with more detail, the National Science Education Standards describe a learning process based on inquiry: "Inquiry is a multifaceted activity that involves ... using tools to gather, analyze, and interpret data; proposing answers, explanations, and predictions; and communicating the results". These quotes capture the essence of the interactive data environment that Thematic Real-time Environmental Distributed Data Services (THREDDS) will foster.

Each second of each day, observing systems around the globe are gathering data that provide snapshots of almost every measurable aspect of our environment: satellites monitor cloud movements, atmospheric constituents and the temperature of the land and ocean surfaces. Lightning strikes are recorded as they occur throughout the country. Global positioning system and seismic sensors monitor tiny movements as well as major shifts of the planet's tectonic plates. Modeling programs are being developed that use the current data to forecast future evolution on scales ranging from short-term weather forecasts to very long-term climatic changes.

The goal of this work is to expand the means by which learners -- including students, educators, scientists and the general public -- can use these vast resources to perform their own inquiries, i.e. to "act on their environment". Figure 1, a screen dump from a prototype of one of the THREDDS interactive data analysis and display applications, illustrates a few of the ways in which users can interact with environmental datasets that are accessed from remote servers as if they were on local disks. In this particular instance, the display is a 3D rendering of the jet stream as predicted by a supercomputer model dataset on a server at the National Center for Atmospheric Research (NCAR).

Figure 1. Interactive data analysis and display application (the screen image above was created by software engineer Stuart Wier of the Unidata Program Center MetApps project)

Data collections are a cornerstone of the scientific research and education environment. While the amount and variety of earth system data are increasing daily, the systems for making these data readily available and useful to the academic community have not kept pace. We envision a framework -- a scientific data web -- that will allow faculty and students to search (in the vocabulary of their particular discipline) for available data and to find them, regardless of where the data reside. Just having the data is not enough, however. Even the many spectacular pictures generated from datasets available on the Web present an essentially passive view of what is happening. To interact with the environmental phenomena represented by the data, users need specialized visualization and analysis tools that enable them to manipulate and examine the datasets themselves. They need to create their own visual images, and they must be able to manipulate those images in 3D space and perhaps even "fly" through and around them. It should be possible to move a probe around in the image to see how the temperature or pressure changes with depth in the ocean or height in the atmosphere at different points on the globe. Moreover, it is important to overlay images of data from different sources. For example, at the time of a severe thunderstorm, one might ask how the information about rainfall from a nearby radar site correlates with measurements of stream flows in the local river basin. If those measurements indicate a problem is arising, it would be valuable to overlay predictions from forecast (meteorological and hydrological) models. Ultimately it may be important to include demographic information about populations in threatened areas.

As a two-year project with limited resources, THREDDS clearly will not do all of this. However, our goal is to build key components that will make such a system possible and to incorporate them into a working prototype that includes a large number of data providers, a group of interactive tool builders, metadata experts, and representatives of the digital library community. The broad access to data and analysis tools envisioned in the prototype scientific data web will enable educators to work with data in classrooms, scientists to examine and incorporate data from other disciplines, and students to explore and test their ideas using the yardstick of data. Indeed, in the end, anyone with Internet access will be able to incorporate scientific data into their everyday lives more easily.

2 Strategy: a Variety of Tools and Data Sources Bound by Metadata Catalogs

2.1 Interactive Data Analysis and Display Tools

The strategic goal of THREDDS is to provide students, educators and researchers with coherent access to a large collection of real-time and archived datasets from a variety of environmental data sources at a number of distributed server sites. The datasets will be conveniently accessible from a collection of THREDDS-enabled data analysis and display tools. The arsenal of tools includes Web-based "thin" clients that allow the learner to browse and manipulate data using the processing power on the servers; interactive data analysis applets that can be embedded directly into HTML educational documents; and full "thick" client applications that harness the computing power and flexibility of the user's own workstation while accessing data from a collection of remote servers.

2.1.1 "Thin" Client Browser-based Analysis and Display Systems

On a superficial level, the browser-accessible data analysis and display tools look similar to the more traditional Web sites that offer a display of images generated from data. There is one important difference: these thin clients enable the user to interact directly with the data by using a set of analysis tools that run on the server. An example of this powerful server-based approach resides at the Climate Data Library of the International Research Institute (IRI) for Climate Prediction at Lamont-Doherty Earth Observatory (LDEO). The Climate Data Library enables interactive analysis of datasets on the server via the INGRID system developed by Benno Blumenthal. A second example is the Live Access Server (LAS), which was developed at the Pacific Marine Environmental Laboratory (PMEL) under the direction of Steve Hankin.

2.1.2 Interactive Data Analysis Applets Embedded in Educational Materials

The screen shot in Figure 2 is part of a Web page from the collection of interactive WeatherWise (WXWise) applets developed by a team led by Tom Whittaker and Steve Ackermann for use in courses at the University of Wisconsin-Madison. This particular applet accesses a current infrared satellite image and allows the learner to see how a portion of the image would change if the temperature were higher or lower than it actually is. The learner is then asked to respond to questions at the bottom of the page. It is an illustration of an embedded Java applet that allows for direct interaction with real-time environmental data stored on THREDDS servers. You can activate the WeatherWise applet in a Java-enabled browser by clicking on the image.

Figure 2. Interactive applet embedded in educational module Web page (click on the image to activate the applet)

2.1.3 Fully Interactive "Thick" Client Applications

The animated loop in Figure 3 is a series of screen dumps from a prototype application of the Unidata MetApps project. The loop shows how the user can interact with data on a remote server. The panels on the left show the parameters available in the dataset under investigation, along with a set of options for viewing the data. The data selected for the 3D rendering are views of the jet stream predicted by a supercomputer forecast model run at the National Centers for Environmental Prediction and delivered to a THREDDS server at NCAR via Unidata's Internet Data Distribution (IDD) system. Using the Distributed Oceanographic Data System (DODS) client-server protocol, the application was able to bring across only the subset of the data needed for the visualization. The loop illustrates several views of the image that the user generated by manipulating the 3D rendering with her mouse.

Figure 3. Fully interactive "thick" client application (the image above is another screen dump by Stuart Wier of the Unidata Program Center MetApps project)
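
The subsetting described above depends on the client asking the server for an index range of one variable rather than for a whole file. As a minimal sketch, assuming the DODS convention of appending a constraint expression of the form variable[start:stride:stop] to the dataset URL, the helper below builds such a request and counts the values it selects. The host name, dataset path, variable name and index ranges are hypothetical placeholders, not the layout of an actual THREDDS server.

```python
def dods_subset_url(dataset_url, variable, ranges):
    """Build a DODS request URL for a hyperslab of one variable.

    ranges is a list of (start, stride, stop) index triples, one per
    dimension; stop is inclusive, as in DODS constraint expressions.
    """
    hyperslab = "".join(
        "[{}:{}:{}]".format(start, stride, stop) for start, stride, stop in ranges
    )
    # ".dods" requests binary data; the text after "?" is the constraint.
    return "{}.dods?{}{}".format(dataset_url, variable, hyperslab)


def subset_size(ranges):
    """Number of values selected by the given index ranges."""
    count = 1
    for start, stride, stop in ranges:
        count *= (stop - start) // stride + 1
    return count


# Hypothetical dataset, variable and index ranges, for illustration only:
# one time step, eleven levels, every second point in each horizontal direction.
RANGES = [(0, 1, 0), (10, 1, 20), (0, 2, 100), (0, 2, 200)]
url = dods_subset_url(
    "http://data.example.edu/dods/model/forecast.nc", "wind_speed", RANGES
)
print(url)
print(subset_size(RANGES), "values requested")
```

Only the values inside the requested ranges cross the network, which is why a client can render one feature of a large model run without downloading the full dataset.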

2.1.4 Embedding Interactive Data Analysis Applications into Publications

In the long term, the intention is to develop THREDDS capabilities to the point where one can embed pointers to datasets and tools into online publications such as this one. In the meantime, it is still necessary to install some client-side software components on your own computer. If you are interested, this can be done for the current beta test version of at least one of the client applications. There are two approaches. One is to get the full Java application running on your own computer. The other is to use a Java application startup facility called WebStart. Both approaches are described by Stuart Wier at http://www.unidata.ucar.edu/staff/wier/index.html.

2.2 Distributed Data Sources

The schematic in Figure 4 shows how a user running a THREDDS client on a local workstation can access data from a number of distributed servers, each of which has its own emphasis or "theme". Many of the servers are in turn populated with environmental data in real time via the IDD system that has been delivering data to nearly 100 universities for the last seven years. A few of these servers already exist, others are being built, and a couple (the streamflow and demographic data servers) are still in the formative idea stage.

Figure 4. Client data access from distributed data servers

Figure 5 shows how data from a set of servers can be plotted together in an interactive application. Only the required portions of the datasets are transmitted over the network and the application can allow for the wide variety of spatial and temporal resolutions for each data element. This particular screen image is one frame from an animation showing the evolution of the data over time.

Figure 5. Interactive analysis and visualization of data from distributed servers (the screen image above was created by Don Murray, lead software engineer on the Unidata Program Center MetApps project. The prototype application that generated the image was developed by Unidata in collaboration with the Atmospheric Technology Division at the National Center for Atmospheric Research)

2.3 Metadata Catalogs

At the heart of THREDDS is metadata contained in publishable inventories and catalogs. Based on XML, these inventories and catalogs can be created in many different ways. Data providers receiving real-time environmental data are instrumenting decoders to create entries describing data products as they arrive and become part of the data server inventory. Crawlers are being implemented to create inventories by traversing existing retrospective data collections. Since catalogs do not have to reside on the data servers, researchers will be able to create specialized or personal catalogs for research publications that point to datasets residing on several data servers. Educators will incorporate catalogs of illustrative datasets into educational modules that also include tools for data analysis and visualization. Just as they now use URLs to point to relevant documents, students will eventually be able to reference datasets and analysis tools related to their research projects. Since the inventories and catalogs are text-based, they can be "harvested" and indexed into the Digital Library for Earth System Education (DLESE) and other digital libraries.
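
To make the crawler idea concrete, the following sketch walks a directory tree and writes one XML entry per file found. It is an illustration under stated assumptions, not the THREDDS implementation: the element and attribute names (inventory, dataset, name, path, bytes), the server URL and the file names are all invented here, and a real crawler would also record descriptive metadata read from the files themselves.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def crawl_to_inventory(root_dir, server_base_url):
    """Walk root_dir and return an XML inventory of the files found."""
    inventory = ET.Element("inventory", {"base": server_base_url})
    for dirpath, dirnames, filenames in os.walk(root_dir):
        dirnames.sort()  # visit subdirectories in a deterministic order
        for name in sorted(filenames):
            full_path = os.path.join(dirpath, name)
            rel_path = os.path.relpath(full_path, root_dir).replace(os.sep, "/")
            ET.SubElement(inventory, "dataset", {
                "name": name,
                "path": rel_path,
                "bytes": str(os.path.getsize(full_path)),
            })
    return inventory


# Build a tiny stand-in archive so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as archive:
    os.makedirs(os.path.join(archive, "model"))
    os.makedirs(os.path.join(archive, "radar"))
    for rel in ("model/run_00.nc", "model/run_12.nc", "radar/scan_01.dat"):
        with open(os.path.join(archive, rel), "w") as handle:
            handle.write("placeholder")
    inventory = crawl_to_inventory(archive, "http://data.example.edu/archive/")
    inventory_xml = ET.tostring(inventory, encoding="unicode")

print(inventory_xml)
```

Because the result is plain text, the same inventory could be copied into a personal catalog or harvested by a digital library without access to the data files.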

The screen shot in Figure 6 is also from a prototype client data analysis application, part of the Unidata MetApps development project. The screen illustrates key aspects of THREDDS data catalog access from within a client application. First, the pop-up "Choose DODS Dataset" window enables access to several catalog servers on different machines on the Internet. The lower part of the pop-up window shows a menu of data items available on one of the servers. This particular catalog has dataset entries arranged three different ways: by variable, by model, and by experiment. The details of the individual catalog entries are not important, but one should note that the words associated with each dataset or collection of datasets can be chosen by the creator of the catalog and that the catalog itself can refer to datasets and collections of datasets on a variety of data servers.

Figure 6. Searching distributed data catalogs from within applications programs

Figure 7 is a screen shot from another MetApps client which depicts a catalog that is automatically generated as real-time weather forecast model data arrives at the motherlode server at NCAR. In this case, the main menu items are the names of the various models and one of the model collections, SST-A, has been opened to show the individual datasets available on the server. In essence, the hierarchical list in this case comprises an inventory of the model output datasets available on the server at the time.

Figure 7. Data server inventory listing as seen in analysis and display tool (click on the image to see the current version of the catalog - needs an up-to-date version of Internet Explorer)

Figure 8 is a different view of the catalog that Figure 7 shows from within an application. The view below shows the actual XML code for the catalog as seen from within the Internet Explorer browser. If you are viewing this page with a recent version of Internet Explorer, you should be able to look at the current version of the catalog by clicking on either Figure 7 or Figure 8.

Figure 8. Data server catalog in native XML form (click on the image to see the current version of the catalog - needs an up-to-date version of Internet Explorer)
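
The relationship between the two views, raw XML and a hierarchical menu, can be sketched in a few lines. The example below parses a small hand-written catalog and prints it as an indented list, much as a client menu might. The element and attribute names (catalog, collection, dataset, name, url), the server names and every entry other than the SST-A label are invented for illustration and are not taken from the actual THREDDS catalog format.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a server catalog; datasets live on two servers.
CATALOG_XML = """
<catalog name="Model output (illustrative)">
  <collection name="SST-A">
    <dataset name="run 00Z" url="http://server-a.example.edu/dods/sst_a_00.nc"/>
    <dataset name="run 12Z" url="http://server-a.example.edu/dods/sst_a_12.nc"/>
  </collection>
  <collection name="Model-B">
    <dataset name="run 00Z" url="http://server-b.example.edu/dods/model_b_00.nc"/>
  </collection>
</catalog>
"""


def list_entries(element, depth=0):
    """Return (depth, name, url) for element and all of its descendants."""
    entries = [(depth, element.get("name"), element.get("url"))]
    for child in element:
        entries.extend(list_entries(child, depth + 1))
    return entries


catalog = ET.fromstring(CATALOG_XML)
entries = list_entries(catalog)
for depth, name, url in entries:
    suffix = "  -> " + url if url else ""
    print("  " * depth + name + suffix)
```

The labels shown to the user come from the name attributes chosen by whoever wrote the catalog, while the url attributes are free to point at different servers.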

3 Teams

THREDDS is a highly collaborative project, and this section lists the partners working on the three main areas of THREDDS development: a set of data provider sites; a group of software developers working on systems for data analysis and display; and a set of metadata experts relating to Earth system data collections.

3.1 Data Providers

The following institutions have agreed to be data-server partners:

The National Climatic Data Center, NCDC, including the NOAA Operational Model Archive and Distribution System NOMADS
The National Geophysical Data Center, NGDC
The Space Science and Engineering Center, SSEC, at the University of Wisconsin-Madison for GOES satellite data
The International Research Institute/Lamont-Doherty Earth Observatory, IRI/LDEO
The Pacific Marine Environmental Laboratory, PMEL
The National Center for Atmospheric Research, NCAR
The Climate Diagnostics Center, CDC
Fleet Numerical Meteorological and Oceanographic Center, FNMOC
George Mason University/Center for Ocean-Land-Atmosphere Studies, GMU/COLA
University of Alabama Huntsville for satellite and hydrology data
The Unidata community of 90 universities via their Abstract Data Distribution Environment (ADDE) servers.

Note that NCAR and SSEC will serve as testbed sites for server-side software. As the project progresses and the common underpinnings are tested at the initial sites, additional sites will be added. Sites under consideration are:

Incorporated Research Institutions for Seismology Data Management Center, IRIS DMC
University of Oklahoma for radar data
Atmospheric Radiation Measurement, ARM
University of Florence Interoperability System for supporting the Italian Scientific Community working in the Earth Observation from the Space (SINOTS) for European satellite data

It is not possible in this article to provide a detailed description of the content of each of these sites. Some are large national data centers. To give a sense of the magnitude and breadth of a typical THREDDS server, the prototype systems at NCAR are initially targeted to handle about 1 terabyte of data online. This will hold several months of data arriving at the site at a rate of about 10 gigabytes each day. During busy hours, more than 1 gigabyte of data arrives at the server, with several products each second. The products range from satellite images and the output of numerical weather prediction models that are hundreds of megabytes to 80-character reports from individual weather reporting stations from around the world. In between, the product list includes lightning strike data; images and four-dimensional volume scans from NEXRAD radar sites; atmospheric data recorded by commercial aircraft in flight; and vertical profiles taken by weather balloons. By the end of the project, we hope to find resources to be able to store a full year of data on the prototype server. The reader is encouraged to visit the sites to get a more detailed understanding of the holdings.
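
As a quick consistency check on these figures, retention time is simply capacity divided by ingest rate. The sketch below assumes decimal units (1 terabyte = 1000 gigabytes) and a steady ingest rate, both simplifications. At roughly 10 gigabytes per day, 1 terabyte lasts about 100 days, which matches "several months"; the same figure per hour would fill the disk in about four days.

```python
def retention_days(capacity_gb, ingest_gb_per_day):
    """Days of data that fit in the given capacity at a steady ingest rate."""
    return capacity_gb / ingest_gb_per_day


def capacity_needed_gb(days, ingest_gb_per_day):
    """Storage needed to hold the given number of days of data."""
    return days * ingest_gb_per_day


ONE_TERABYTE_GB = 1000  # decimal units, an assumption of this sketch

daily = retention_days(ONE_TERABYTE_GB, 10)        # 10 GB arriving per day
hourly = retention_days(ONE_TERABYTE_GB, 10 * 24)  # 10 GB arriving per hour
full_year_gb = capacity_needed_gb(365, 10)         # one year at 10 GB per day

print("10 GB/day fills 1 TB in about {:.0f} days".format(daily))
print("10 GB/hour fills 1 TB in about {:.1f} days".format(hourly))
print("A full year at 10 GB/day needs about {:.2f} TB".format(full_year_gb / 1000))
```

By the same arithmetic, holding a full year of data at this rate would take between three and four terabytes.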

3.2 Client Analysis and Display Tools

The THREDDS prototype will provide examples of a wide variety of working applications that use our metadata framework to find, analyze and display data from server sites. This will demonstrate an end-to-end system for data access and visualization. The following developers will incorporate our client-side data-access components (class libraries and metadata access) into their own data manipulation tools:

Live Access Server (LAS, PMEL, Steve Hankin). LAS illustrates the use of a Web-based (thin) client with the bulk of the analysis and display generation done on the server side.
Ingrid (IRI/LDEO, Benno Blumenthal). This is another example of a system enabling analysis and display of data via a Web browser.
WXWise applets (the University of Wisconsin-Madison, Tom Whittaker). These applets illustrate the use of Java to embed data-analysis and display tools directly into educational modules on a Web site.
Virtual Geophysical Exploration Environment (VGEE, formerly The Virtual Exploratorium, the University of Illinois, West Chester State, DLESE, and NCAR, Don Middleton). This application incorporates the educational functions directly into the data analysis and display tool itself.
Data Discovery Toolkit and Foundry based on EDMI (Earth Data Multimedia Instrument, New Media Studio, Bruce Caron). These are a set of data-analysis and display tools based on IDL and Macromedia Director. They can be used to generate elaborate educational modules.
Meteorological Applications (MetApps) (Unidata Program Center, Don Murray). A set of pure Java, platform-independent, two- and three-dimensional data-analysis and display tools based on the VisAD infrastructure.
Visualization for Algorithm Development (VisAD) infrastructure from SSEC (Bill Hibbard of the University of Wisconsin-Madison in conjunction with the Unidata Program Center).
Others: Some software packages (MATLAB, Interactive Data Language (IDL), Man-computer Interactive Data Access System (McIDAS)) have already been adapted to acquire remote data via DODS or ADDE. Even if these systems are not adapted to take direct advantage of catalogs or other THREDDS advances, their users will benefit from data available on THREDDS servers.

3.3 Metadata Expertise

As noted earlier, the technological core of this initiative, the crucial component now under development, is a system for adding the semantic description of scientific datasets necessary for data manipulation and discovery. It must interoperate with data providers, data servers, data clients, catalog servers, discovery systems and other middleware components. Investigators will select key scientific datasets and semantic descriptions developed for an end-to-end demonstration of the utility of this approach. Unidata staff will work closely with DLESE to ensure that the resulting metadata system will interoperate effectively with the National STEM (Science Technology Engineering Math) Digital Library (NSDL).

Partners with whom we will consult on matters of metadata and interoperability are:

The Earth System Markup Language (ESML, University of Alabama-Huntsville);
The DIstributed MEtadata System (DIMES, George Mason University);
The aggregation data catalog that is part of DODS (University of Rhode Island, Unidata);
Digital Library for Earth System Education, DLESE;
The University of Florence (Italy). Prof. Stefano Nativi is acting as a liaison with the international metadata standards community.

4 Conclusions

In perhaps a different way than Maria Montessori originally envisioned, THREDDS will provide a way in which we can learn by "acting on our environment". Much work remains to be done to achieve the long-range THREDDS mission of developing an environmental data web that allows learners of all ages to find and interact with datasets that illustrate the current state of the global environment, but we have designed the system and have begun construction. This article provides a glimpse of the interactivity that will be possible, a sense of the range of data types and partners involved in the effort, and a basic understanding of the architecture of the system and the approach being taken to make it a reality.

Thematic Real-time Environmental Distributed Data Services (THREDDS): Incorporating Interactive Analysis Tools into NSDL