
Recommended Data Sharing Practices Whitepaper

Prepared by Jesse Cleary, UNC Chapel Hill, Department of Marine Sciences and Jeremy Cothran, USC, Baruch Institute for Marine & Coastal Sciences, on September 15, 2006




This document presents a set of data sharing recommendations and standards that are ready for regional consideration as the Southeast Coastal Ocean Observing Regional Association (SECOORA) begins to address its data management role. Because the Southeast U.S. Atlantic Coastal Ocean Observing System (SEACOOS), a regional OOS, has been engaged in the role of data aggregator over the last several years, this document draws heavily on the experience of that group. This work is also shaped by existing IOOS Data Management and Communication (DMAC) documentation as it relates to Metadata, Data Discovery, Data Transport, Online Access, and Data Archive. However, the IOOS DMAC recommendations are developing in tandem with the efforts contained herein, making for an iterative development process of waiting for top-down standards to emerge while also suggesting workable solutions in a grassroots manner.

This document will outline a template of recommendations for the emerging SE Regional Association as well as other emerging Regional Associations (RAs) and potential Sub-Regional Data Providers (SRDPs). Future improvements are discussed where SEACOOS has encountered operational challenges requiring additional system development. Many of these improvements are under various stages of development ranging from early brainstorming to beta ready maturity.

This document complements several existing SEACOOS documents.

1.1 Acknowledgements

It should be mentioned that the efforts contained herein are the work of a large group of researchers throughout the southeast US. Institutions acting jointly under the SEACOOS banner included the University of North Carolina at Chapel Hill, University of South Carolina, Skidaway Institute of Oceanography, University of South Florida, and the University of Miami. Contributions were also received from other regional ocean observation institutions, private companies and federal agencies. Without the ongoing efforts of this dedicated group, none of the following would have been possible.

1.2 Regional Background

The Southeast Atlantic Coastal Ocean Observing System (SEACOOS) is a distributed near real-time ocean observations and modeling program that is being developed for a four-state region of the Southeast US (FL,GA,SC,NC), encompassing the coastal ocean from the eastern Gulf of Mexico to beyond Cape Hatteras. SEACOOS was presented with the chance to define data standards and integrate in-situ observations, model output, and remote sensing data products from several institutions and programs. The integration of a near real-time data stream and subsequent GIS visualization provides immediate feedback and validation as to the effectiveness of this regional observation and modeling effort. Additional distribution of these aggregated datasets relies on these standards to integrate SEACOOS data into multiple external projects of scientific and societal importance.

1.3 Technology Overview

This section covers the major steps SEACOOS followed to create and serve a near real-time data stream of in-situ observations, model output, and remotely sensed imagery. A review of the current set of IOOS DMAC documentation is recommended as these standards are under development and may influence the steps and solutions discussed below. Our hope is that an understanding of this process will help RAs as they formulate their initial data management and data sharing technology strategies. This process is outlined in a linear fashion below, while recognizing that iteration between several steps at once is likely.

  1. Conceptualize available data types and consider possible storage schemas: SEACOOS collects data on ocean state variables and biogeochemistry in the form of in-situ observations, circulation models, and remote sensing. The spatial and temporal frequency of these data is highly variable and required considerable forethought to address all possible data configurations.
  2. Develop and standardize data vocabularies, file formats, and transport protocols: Developing a standard vocabulary or data dictionary of common language to refer to our disparate data was a significant achievement. This was critical to the further development of SEACOOS data-specific file formats using the netCDF format and DODS/OPeNDAP transport protocol (per IOOS DMAC).
  3. Determine desired applications and requisite software packages: SEACOOS visualizes data spatially and graphically, providing researchers and external audiences with access to this information in near real-time. Open source GIS and graphics packages are used to drive these applications wherever possible.
  4. Determine database schemas for observations, model output, satellite imagery, and metadata: With both the data and application ends of the data stream conceptualized, a database schema to enable their connection was developed. The open source PostgreSQL database, with PostGIS extension for geospatial indexing and mapping, is used by SEACOOS.
  5. Address hardware needs of particular database and application configurations: SEACOOS utilizes separate servers to house the database(s), web mapping application, and project website. Planning for separate-site hardware redundancy should also be incorporated.
  6. Implement schemas and applications: Intermediary code development is crucial in automating and connecting these disparate technologies to handle a near real-time data stream. SEACOOS uses perl as the primary scripting language to aggregate, parse, and store incoming data in the PostgreSQL database. PHP/MapScript is used to create interactive mapping applications, embedding MapServer GIS controls within HTML pages.
  7. Disseminate data “outward” to external audiences: As part of the IOOS push, SEACOOS cascades its data into other national data aggregation efforts. Open Geospatial Consortium (OGC) services are utilized to transfer map images (Web Mapping Service) and raw data (Web Feature Service) to other GIS applications. SEACOOS is also active in making collected data available by a variety of requests and formats depending on audience need.
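As a minimal illustration of step 2, a data dictionary can be reduced to a mapping from provider-local variable names to a shared standard vocabulary. The sketch below is hypothetical (SEACOOS scripting is actually in Perl, and these are not the real SEACOOS Data Dictionary entries); it only shows the normalization idea.

```python
# Hypothetical data-dictionary lookup: provider-local variable names
# are normalized to a shared standard vocabulary before aggregation.
# Names are illustrative, not the actual SEACOOS Data Dictionary.

STANDARD_NAMES = {
    "wspd": "wind_speed",
    "wind_spd": "wind_speed",
    "sst": "sea_surface_temperature",
    "water_temp": "sea_surface_temperature",
}

def to_standard_name(local_name: str) -> str:
    """Map a provider's variable name to the regional standard name."""
    try:
        return STANDARD_NAMES[local_name.lower()]
    except KeyError:
        # Unknown names are rejected rather than guessed at, so the
        # provider can be alerted and the dictionary extended.
        raise ValueError(f"no standard name for {local_name!r}")

print(to_standard_name("SST"))  # sea_surface_temperature
```

Rejecting unknown names, rather than passing them through, is what keeps the vocabulary authoritative as new providers join.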



As the Regional Association (RA) spins up to assume data management oversight from the existing OOS, a clear delineation of RA and Sub-Regional Data Provider (SRDP) roles and responsibilities is a useful construct to guide this emergence process. This section incorporates discussions from recent SECOORA/SEACOOS data sharing meetings in addition to IOOS DMAC documents. As a general statement, participants in these meetings wanted responsibilities to reside alongside the expertise and knowledge to implement them. While setting the technical stage for a seamless handshake of data, the RA also looks outward toward national discussions, data dissemination, and standards that will affect the region. SRDPs provide their end of the data handshake, while focusing locally on the observations they make, most QA/QC processes, and local user needs. It should be acknowledged that the RA may be no more than different groups of SRDP representatives and thus many decisions will be made by researchers serving multiple constituencies.

2.1 Regional Data Center

The key data sharing role taken on by the RA is the creation and hosting of a data management and aggregation center. In the southeast, this role is currently filled by the regional OOS (SEACOOS). Responsibilities here include the creation and oversight of a centralized repository of aggregated regional data. Toward this end the RA should provide the technical guidance to help each SRDP populate this database and to ensure the data contained therein is standardized and useful for both regional and extra-regional data users.

The RA should also facilitate the development of Best Practices and Requirements for the sub-regional data providers. This includes the incorporation of relevant national standards and recommendations wherever possible. In the SEACOOS project this was made successful by following a participatory model driven by representatives from each SRDP. Meeting participants also suggested that the RA could house a collection of useful software tools (data analysis and visualization) and schemas for data management and sharing. This resource base could be accessed by new SRDPs to help speed their spin-up process.

2.1.1 Data Aggregation

The RA should also implement (or develop if needed) the requirements for data sharing formats and transport protocols. SEACOOS has developed a convention for the netCDF format that all data must follow to be processed at the regional data center. Extension of the convention is done through a collaborative process as new variables and QA/QC methods are included. The SEACOOS netCDF convention is an extension of the Climate and Forecast metadata convention (CF 1.0), itself an extension of the COARDS standards. The RA should ensure that such existing standards for data sharing and transport are adhered to and incorporated into the RA Best Practices and Requirements.

The RA should also set the requirements and tests for the QA/QC of aggregated data. The RA should help to inform national discussion on the subject (QARTOD meetings) and help apply those recommendations at the regional scale. The RA should leave the initial QA/QC of data to each SRDP who are more intimately familiar with their data and the manner of its collection. The RA may also perform secondary QA/QC procedures that are dependent upon multiple data points (e.g. nearest neighbor) or external datasets (e.g. model comparisons).
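The SRDP/RA split in QA/QC responsibility described above can be sketched as two stages: a single-value test applied where the sensor knowledge lives, and a multi-station test applied where the aggregate data lives. The functions, limits, and flag labels below are invented for illustration; they are not the QARTOD or SEACOOS test definitions.

```python
# Illustrative two-stage QA/QC split (values and limits invented):
# the SRDP applies a gross range test it knows best, while the RA
# can apply a secondary test that needs multiple stations, e.g.
# comparison against nearby observations.

def gross_range_flag(value, lo, hi):
    """SRDP-level test: flag a single value against sensor limits."""
    return "pass" if lo <= value <= hi else "fail"

def neighbor_flag(value, neighbor_values, tolerance):
    """RA-level test: flag a value far from the neighbor median."""
    s = sorted(neighbor_values)
    median = s[len(s) // 2]
    return "pass" if abs(value - median) <= tolerance else "suspect"

print(gross_range_flag(21.5, lo=-5.0, hi=40.0))      # pass
print(neighbor_flag(29.0, [21.0, 21.4, 21.8], 3.0))  # suspect
```

Note that the RA-level test can only flag a value as suspect, not correct it; the authoritative record remains with the SRDP.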

As an aggregator of sub-regional data, the RA is responsible for maintaining one or more databases of aggregated observations, which may be organized centrally or distributed. This includes a schema to cover observation types as they develop, as well as automated transport tools and format parsers to harvest data from each SRDP and prepare it for addition to the aggregate database(s). Limited archiving should take place on this aggregated dataset to preserve the value added in the aggregation process (reformatting, unit conversion, QA/QC procedures), but the bulk of archiving should be left to each SRDP to implement. Data managers at the RA level should be active in these Best Practices discussions to ensure that impacts on the database population process are addressed.

2.1.2 Data Dissemination

The RA Regional Data Center should also be responsible for dissemination of the aggregated dataset to other national and sub-regional projects. Several different transport methods should be provided, including OPeNDAP/DODS (IOOS DMAC recommendation, USCG request), HTTP for raw file download, and XML web services (OGC and SOAP, both IOOS DMAC recommendations). These methods should access and export data from the database in a number of common formats – netCDF (USCG preferred), ASCII, CSV, XML/GML, and ESRI shapefile are a short list.
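One of the simpler export paths named above can be sketched with the standard CSV machinery: rows pulled from the aggregate database are written out column-for-column. The column names below are illustrative, not the actual SEACOOS export schema.

```python
# Sketch of a CSV export path: database rows (represented here as
# dicts) written out for download. Column names are illustrative.
import csv, io

rows = [
    {"platform_id": 101, "measurement_date": "2006-09-15T12:00:00Z",
     "wind_speed": 7.2, "latitude": 32.75, "longitude": -79.85},
]

buf = io.StringIO()
writer = csv.DictWriter(
    buf, fieldnames=["platform_id", "measurement_date",
                     "wind_speed", "latitude", "longitude"])
writer.writeheader()
writer.writerows(rows)
print(buf.getvalue().strip())
```

The same row dictionaries can feed the other listed formats (XML/GML, shapefile attribute tables), which is one argument for keeping the database query layer format-agnostic.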

The IOOS DMAC recommends that the RA should also be responsible for the creation of observation and sensor catalog records. Several catalogs of this type already exist and responsibility for regional entries therein should be an RA task. This also might cover the creation of dataset level FGDC compliant metadata records and the appropriate distribution of those records to various marine data catalogs and clearing houses. More detailed metadata records such as those from a region wide sensor inventory should also be organized and distributed at the RA level.

2.2 Sub-Regional Data Providers

Of primary importance to the Sub-Regional Data Providers (SRDPs) is the operation of their local observing activities and raw data processing resources. Diversity in approach at this level is expected and is seen as an asset. For most of these institutions this is a very full-time endeavor, so additional RA-required tasks should attempt to run parallel with these ongoing and primary efforts. Many regional institutions are also engaged in various data mining from Federal data sources – NOS, NDBC, USGS, and NWS. These data streams are passed along to the RA alongside datasets created at each institution. While this re-collection is expected to cease as federal data providers stand up more robust transport and access mechanisms, SRDPs should expect to meet this need in the interim.

An additional important role to be filled by each SRDP is to be an active participant in the technical discussions at the RA level. This participatory model has worked very well for the SEACOOS project and has enabled the rapid development and subsequent acceptance of new data management and data sharing procedures.

Other responsibilities are generally to follow the Best Practices as developed at the RA level. The IOOS DMAC does not make many specific recommendations at this stage of the eventual IOOS data flow, but does set some data and metadata standards that require actions from this level. This includes the processing of raw observation data into the file format(s) and convention required by the RA to perform data aggregation. Archiving of raw data also is left to each individual SRDP. Limited archiving of the aggregated regional dataset may occur at the RA level but the most thorough and deepest archives should live with the institutions that originally collected the data. SRDPs are also responsible for implementing the bulk of required QA/QC procedures as the local knowledge to best perform these tests resides at the institutions collecting the data. The RA will guide this process and issue requirements, but the performance of the tests occurs locally. Implementing these tests may require the SRDP to build local climatologies of observations and inventory sensor specifications and tolerances. Each SRDP will also be responsible for the setup of websites and servers capable of meeting the regionally established transport protocol(s).



Many distinct institutions create and collect oceanographic data across the Southeast US coastal ocean. These observations are made through a wide array of instrumentation, under a variety of data collection, data transport, and storage schemas. An initial challenge is to maintain this diversity while also encouraging aggregation of this disparate data into the desired regional dataset(s). With some foresight about current and possible data types, the RA can craft a flexible and extensible aggregation schema. This begins with a look at the data types collected by partner institutions and adapting transport formats to these types.

One key consideration is to develop solutions for the most complex data model and let everything else fall out as a subset of that case. With this in mind, SEACOOS chose to model the data by making all variables (including latitude, longitude and depth) a function of time. Other data forms allowed for programmatic 'shortcuts' based on the types of dimensions presented in the file - for instance, if latitude and longitude were each a dimension of 1 point, then the file was processed as a fixed station. Most of the debate centered on whether descriptions should be carried in the variable attributes or in the dimension or variable naming convention. Presented below is the rationale behind the data format and types addressed within the SEACOOS project.
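The dimension-based 'shortcut' described above can be sketched as a simple classifier: the file's dimension lengths alone suggest which case applies. Real SEACOOS processing inspects netCDF dimensions (in Perl); plain dictionaries stand in here, and the case names are simplified from the actual convention.

```python
# Sketch of dimension-based case detection (names simplified from
# the SEACOOS convention; dicts stand in for netCDF dimensions).

def classify(dims):
    """Guess the observation case from dimension lengths."""
    lat, lon, z, t = dims["lat"], dims["lon"], dims["z"], dims["time"]
    if lat == 1 and lon == 1:
        # single horizontal location: fixed station, maybe profiling
        return "fixed-profiler" if z > 1 else "fixed-point"
    if lat == t and lon == t:
        # a position recorded per time step implies a moving platform
        return "moving-point"
    return "fixed-map"

print(classify({"lat": 1, "lon": 1, "z": 1, "time": 240}))      # fixed-point
print(classify({"lat": 240, "lon": 240, "z": 1, "time": 240}))  # moving-point
```

This is the sense in which the most complex case (everything a function of time) subsumes the rest: each simpler case is just a degenerate set of dimension lengths.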

3.1.1 netCDF Standard

The discussion of data types is informed by the file format standard that these types will inhabit. After much discussion between SEACOOS institutions, data standard proposals for a SEACOOS in-situ netCDF format were developed. NetCDF was chosen as it is well documented, commonly used in oceanography, and a well-tested format for implementation using the DMAC-recommended OPeNDAP protocol. The resultant SEACOOS netCDF convention serves as a specific syntax of variable dimensions and attributes (Southeast Atlantic Coastal Ocean Observing System netCDF Standard: SEACOOS CDL v2.0). The SEACOOS netCDF relied upon the Climate and Forecast (CF) Metadata netCDF Conventions v1.0 wherever possible (per IOOS DMAC). The purpose of the CF conventions is to require conforming datasets to contain sufficient metadata to be self-describing: each variable in the file has an associated description of what it represents, including physical units if appropriate, and each value can be located in space (relative to earth-based coordinates) and time. The main digression from CF that SEACOOS took is that SEACOOS netCDF gets standard names from the SEACOOS Data Dictionary instead of the CF Standard Name Table. Development of a SEACOOS CDL v3.0 to include language specifications for subsequent QA/QC testing is nearing completion and expected to be released in the next two months. This language specification is also covered within the newly released SEACOOS QA/QC whitepaper.

3.1.2 In-situ Observations

The SEACOOS netCDF format currently focuses on a few different cases: fixed-point, fixed-profiler, fixed-map, moving-point-2D, moving-point-3D and moving-profiler. This format can be programmatically parsed by the SEACOOS "data scout" (perl code here) which downloads the netCDF files via HTTP from data providers and populates the aggregate database or alerts the provider when there is a problem with the data being input.

3.1.3 Model Output

SEACOOS modeling groups had a different set of integration issues to resolve. Model resolution and interpolation issues regarding time and space were discussed as they related to model output, region overlaps, and their display. Display of the results via the GIS helped articulate these various projection, alignment, and resolution issues. Since all the model data was homogeneous area/field oriented data, deciding on a common netCDF representation was fairly straightforward.

3.1.4 Remote Sensing

To complement the in-situ observations being collected in this domain, SEACOOS has engaged real-time satellite remote sensing capabilities, including redundant ground station facilities. The integration of remotely sensed data into the SEACOOS program has provided a regional context for in-situ time series data, as well as a historical baseline for the region’s surface waters. SEACOOS partners are engaged in the collection of real-time satellite data (some tailored for the SEACOOS domain), the production of derived data products, and the rapid delivery of these products via the SEACOOS web portal. Formatting decisions are left to the data providers and image transport is handled by FTP of images as georeferenced PNG files. Currently SEACOOS is ingesting remotely sensed satellite images from the MODIS, AVHRR, and QuikSCAT platforms.

SEACOOS partners also maintain several HF radar arrays (CODAR and WERA systems). Remotely sensed data from these fixed platforms is treated much like in-situ data. Totals data is coded into a netCDF file, harvested by the SEACOOS data scout, and parsed into the SEACOOS relational database for access by mapping and query applications.

3.2 Metadata

Most observation specific metadata (“child metadata” in IOOS DMAC documentation) are maintained at the SRDP level. However, metadata about the aggregated dataset (“parent metadata”) and tools for better metadata organization have been developed at the SEACOOS level. In addition, SEACOOS will soon be collecting selected observation metadata and QA/QC processing metadata (tests, ranges used etc). Applications to utilize this newly dynamic and detailed metadata are also being developed.

One of the initial metadata concerns was to provide an online browser-based tool for users to create and manage IOOS-recommended FGDC (Federal Geographic Data Committee) metadata records for data discovery purposes. One application that satisfies this need is Meta-Door, an open source application publicly available for others to utilize. Meta-Door also allows users to administer groups of users and manage some basic platform and sensor metadata. It is capable of sharing these metadata records with other applications via XML import and export. There are also several other metadata maintenance tools available or under development.

A key recommendation from the SECOORA Data Sharing workshop and IOOS DMAC plan was that an inventory of RA observing assets and measured variables be created. The need for this type of metadata was recognized early on within the SEACOOS project and a temporally static snapshot was created. This inventory became a database and web map housing detailed information about observation platforms, sensor equipment, and environmental variables measured across the SEACOOS region. It presents the spatial resolution of sensors and variable measurements while also serving to facilitate technical discussion amongst the SEACOOS observation community (SEACOOS Equipment and Variable Inventory).

Several new metadata components are currently under development. These components will organize and serve important pieces of metadata for external catalog efforts and internal project monitoring.

A metadata effort geared toward data discovery has recently emerged to populate a simple IOOS regional ocean observing system catalog of existing or planned observation types (June 2006 CSC workshop). This was discussed at a two-day conference at WHOI organized by the NOAA Coastal Services Center (CSC). The website documentation (see here) details an experimental draft of a possible simple observation-type metadata CSV format and a visualization product which utilizes this format. This catalog record format is intentionally minimal and simple to help encourage wide participation and ease of adoption.

As part of the transition from SEACOOS to SECOORA, the initial observations metadata snapshot (SEACOOS Equipment and Variable Inventory) needs to be reorganized to become dynamically updated and machine readable. This information can be rather complex, so a new relational database schema is needed beyond the normal process of storing metadata in flat files. An additional goal is to explicitly link this metadata to specific observations and thus assess the quality of measurements made by specific sensors over time. Other uses of such joined sensor metadata include monitoring system-wide performance down to the sensor level. Several other external applications might be employed to help organize the vocabulary in this sensor inventory (MMI ontology and crosswalk products) and also to help serve this metadata (SensorML machine readable metadata wrapper).



Using a relational database to store regional observations maximizes future data flexibility, handles unit conversion and GIS format processing, efficiently stores a wide range of data types, and follows an IOOS DMAC recommendation. SEACOOS uses the open source PostgreSQL relational database to aggregate and store project partners’ in-situ observations, model output, and references to remotely sensed image data. PostgreSQL can be accessed by a number of front-end applications via standard SQL statements and spatially extended to include geospatial datatypes and indexes using the PostGIS extension.

4.1 PostgreSQL Database with PostGIS Extension

SEACOOS data are stored in two PostgreSQL database instances. One instance contains the in-situ observations data and image file references to remotely sensed data. The other contains model output data and duplicate in-situ observations, used for “round-robin” updating. The databases are partitioned into separate tables for each in-situ observation variable, remotely sensed raster layer, and model variable layer per hour. The remotely sensed tables do not house the actual images but pointers to the image files and their ancillary boundary files. The remotely sensed data tables are used to execute raster queries, which require the image RGB values to be referenced against a look-up table of actual measured values.

The PostgreSQL database is “spatially enabled” using the PostGIS extension for PostgreSQL. PostGIS adds several geospatial objects to the supported data types in PostgreSQL. This functions as the spatial database engine for all subsequent GIS data visualization. This extension encodes the text locations of SEACOOS observation data as geospatially indexed geometry columns, better enabling mapping and spatial query functionality. GIS mapping applications utilize these geometry columns to render the associated map locations of data. PostGIS fields can also be imported from and exported to other common GIS data formats such as ESRI shapefiles.
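A minimal sketch of how a text lat/lon pair becomes such a geometry: the observation's coordinates are rendered as well-known text (WKT), which PostGIS can ingest (e.g. via its GeomFromText function) into an indexed geometry column. The helper function and the use of SRID 4326 (WGS 84) here are assumptions for illustration.

```python
# Hypothetical helper: render a lat/lon pair as WKT, the text form
# PostGIS accepts when populating a geometry column.

def point_wkt(latitude: float, longitude: float) -> str:
    """Render a lat/lon as WKT; x (longitude) comes first per the spec."""
    return f"POINT({longitude} {latitude})"

wkt = point_wkt(32.75, -79.85)
print(wkt)  # POINT(-79.85 32.75)
# e.g. INSERT ... GeomFromText('POINT(-79.85 32.75)', 4326)
```

Getting the longitude-first ordering right at this one choke point avoids a whole class of swapped-coordinate mapping bugs downstream.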

4.2 Data Structures / Canonical Forms

Ideally, the structure of temporal, geospatial data as stored in various formats can be described by a handful of canonical forms. Describing and labeling these forms (and deciding what should be abstracted away) are the first steps before automated programmatic conventions, labels, and processing can be utilized in data transformation.

As an example, two predictable forms for storing buoy data are:

  • 'by station' where the tablename is that of the station and each row corresponds to all the variable readings for a given time measurement
  • 'by variable' where the tablename is that of the variable measured and each row corresponds to a measurement time, station id, and possibly lat, long, and depth describing the measurement point and the measurand value. The ‘by variable’ form is the same as ‘point form’ and an example is listed below.

Currently the GIS favors a 'by variable' approach which corresponds to variable data layers. This format is concise and amenable to query and resultset packaging (the ability to mix and match variables which have a similar reference scheme on each variable table). Issues of varying temporal sampling resolutions across multiple stations are also better handled in this form. SEACOOS is developing programs to convert other table formats to this format. See here.
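The conversion between the two forms above is essentially an unpivot: a 'by station' row holding several readings becomes one 'by variable' row per measurand. The sketch below is a hedged illustration (SEACOOS's converters are Perl, and these field names are invented), but the reshaping logic is the same.

```python
# Illustrative unpivot from 'by station' form to 'by variable'
# (point) form. Field names are invented for the example.

def by_station_to_by_variable(station_id, row):
    """Unpivot one wide station row into per-variable rows."""
    out = []
    when = row["measurement_date"]
    for var, value in row.items():
        if var == "measurement_date" or value is None:
            continue  # skip the time key and missing readings
        out.append({"platform_id": station_id,
                    "measurement_date": when,
                    "variable": var,
                    "value": value})
    return out

rows = by_station_to_by_variable(
    101, {"measurement_date": "2006-09-15T12:00:00Z",
          "wind_speed": 7.2, "air_temp": None, "sst": 27.9})
print([r["variable"] for r in rows])  # ['wind_speed', 'sst']
```

Dropping missing readings during the unpivot is one reason the 'by variable' form copes better with stations that sample different variables at different rates.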

Click here for database descriptions of the wind and SST tables that SEACOOS currently utilizes. A full listing of the archival SEACOOS observation database schema is listed here. Efforts are made to keep from 'normalizing' the table into subtables, preferring a single-table approach with redundancy in certain fields. Since the storage needs are initially low, the database remains conceptually and operationally simple. Table performance can be further optimized by partitioning, by use of the VACUUM, COPY, and CLUSTER commands, and by other indexing schemes applied similarly across these repeated table structures.

4.2.1 Point Form Example

(By variable form, used with point and moving point data)

The following represents a basic table layout which might be implemented on a PostgreSQL database. Click here for generic table creation details.

row_entry_date TIMESTAMP with time zone,
row_update_date TIMESTAMP with time zone,
platform_id INT NOT NULL,
sensor_id INT,
measurement_date TIMESTAMP with time zone,
measurement_value_ FLOAT,
-- other associated measurement_value_ added here as well
latitude FLOAT,
longitude FLOAT,
z_desc VARCHAR(20),
qc_level INT,
qc_flag VARCHAR(32)
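The column listing above can be exercised as a runnable toy, using SQLite in place of PostgreSQL (types are simplified accordingly; the real schema uses TIMESTAMP WITH TIME ZONE and FLOAT as listed, and the table name here is invented):

```python
# Toy point-form table: SQLite stands in for PostgreSQL, so the
# timestamp columns are plain TEXT. Table name is illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE point_obs (
        row_entry_date    TEXT,
        platform_id       INTEGER NOT NULL,
        sensor_id         INTEGER,
        measurement_date  TEXT,
        measurement_value REAL,
        latitude          REAL,
        longitude         REAL,
        z_desc            VARCHAR(20),
        qc_level          INTEGER,
        qc_flag           VARCHAR(32)
    )""")
conn.execute(
    "INSERT INTO point_obs VALUES (?,?,?,?,?,?,?,?,?,?)",
    ("2006-09-15T12:05:00Z", 101, 7, "2006-09-15T12:00:00Z",
     7.2, 32.75, -79.85, "2m above sea level", 1, "pass"))
value, = conn.execute(
    "SELECT measurement_value FROM point_obs "
    "WHERE platform_id = 101").fetchone()
print(value)  # 7.2
```

Note how `row_entry_date` (when the row arrived) and `measurement_date` (when the value was observed) are kept separate, which matters for near real-time feeds where data can arrive late.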

4.2.2 Multi_Obs Form

The ‘point form’ approach represents an initial approach that needed to be modified slightly to more easily accommodate new data of similar datatypes. The initial approach was to use one table instance per observation type, but this has created too much development and maintenance overhead as we continue to add more observations to our aggregations and products. The new approach is to add an observation index column to a generalized observation table, which allows us to reuse the same singular ‘point form’ table schema against multiple observations (‘multi_obs’) and groupings (vectors, for example) of observation datatypes. The advantages to this approach are easier data and product development and less database maintenance, as there are fewer individual table references involved. Development with this approach should be simpler and faster because only a new observation type index is added within a generally supported table schema, rather than adding new tables or table-specific products. See Appendix Figure 3 or the notes at MultiObsSchema for a sample schema and implementation.
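The generalization can be sketched in miniature: a single table with an observation-type index column replaces one table per variable. The column and table names below are illustrative stand-ins, not the actual multi_obs schema, and SQLite stands in for PostgreSQL.

```python
# Hypothetical 'multi_obs' sketch: one generalized table with an
# observation-type index instead of one table per observation type.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE multi_obs (
        platform_id      INTEGER NOT NULL,
        m_type_id        INTEGER NOT NULL,  -- observation type index
        measurement_date TEXT,
        m_value          REAL
    )""")

WIND_SPEED, SST = 1, 2  # illustrative observation-type ids
conn.executemany(
    "INSERT INTO multi_obs VALUES (?,?,?,?)",
    [(101, WIND_SPEED, "2006-09-15T12:00:00Z", 7.2),
     (101, SST,        "2006-09-15T12:00:00Z", 27.9)])

# Adding a new observation type is a new m_type_id, not a new table.
rows = conn.execute(
    "SELECT m_value FROM multi_obs WHERE m_type_id = ?",
    (SST,)).fetchall()
print(rows)  # [(27.9,)]
```

Every product query then filters on the type index, which is the trade-off: simpler maintenance in exchange for one extra predicate (and index) on every access path.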

4.2.3 Xenia Package

While developing relational database schemas to support SEACOOS efforts, it is beneficial to review, document, and share those schemas with other groups for their development purposes, and also to share any coding benefit derived from products or services that share those schemas in common. The moniker for the general SEACOOS reference database schema and support scripts that we are developing against is ‘Xenia’ (see XeniaPackage or Appendix Figure 4).

Xenia will use the earlier mentioned ‘Multi_Obs Form’ tables while extending additional support tables for the performance and notification of quality control tests performed against collected data. Xenia should also provide some minimal product functionality in terms of mapping and graph products and web services for data dissemination and sharing developed against the schema. Xenia should hold some basic platform and sensor metadata, such as location, that is critical to observation data mapping and graphing products. Xenia may also support the concept of users and groups in regards to observation event or quality control notifications.

Xenia will likely be developed as both a ‘basic’ version addressing more common data observation issues regarding time and location and more customized versions addressing more specific datatypes or functionalities.

4.3 Maintenance Processes

An advantage of relational databases is their ability to use multiple table search indexes to quickly retrieve or sort query data. Toward this end the SEACOOS databases regularly have a PostgreSQL VACUUM process run against them; this automated maintenance tool removes deleted data and maintains the integrity of search indexes while new data continue to be added or changed. Gathered data are also loaded into the databases using the COPY command, which is much better suited to extremely large (millions of records) batch file processing, mainly with regard to high-volume model data.

SEACOOS databases and servers are organized to help distribute specific workloads with maintenance tasks specific to those functions. The data ‘scout’ server is continually scanning online for new data and preparing this data for aggregation by the system. The web server accepts and directs web page queries towards the appropriate resources. Two in-situ databases play a round-robin role as one accepts queries while the other is being loaded and then roles are reversed. One database is specifically tasked with processing a 2-week window of model data products.

SEACOOS addresses the data management issue of constantly accumulating observation data by reusing the same table schema while limiting each table instance's time index to the latest data, a short prior interval (the past two weeks). Older archival data may be further subdivided into more manageable monthly or annual periods. This allows queries on recent data to respond quickly and places a limit on how large any one table can get for indexing or backup purposes.
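The time-window routing described above can be sketched as a small dispatch function: recent observations go to a small 'latest' table, older data to monthly tables. The table-naming scheme and 14-day cutoff below are illustrative, not the actual SEACOOS configuration.

```python
# Sketch of time-window table routing: recent rows stay in a small
# fast table, archival rows are binned by month. Names illustrative.
from datetime import datetime, timedelta

def target_table(measured: datetime, now: datetime) -> str:
    """Pick a destination table for a record based on its age."""
    if now - measured <= timedelta(days=14):
        return "obs_latest"
    return measured.strftime("obs_%Y_%m")  # e.g. obs_2006_07

now = datetime(2006, 9, 15)
print(target_table(datetime(2006, 9, 10), now))  # obs_latest
print(target_table(datetime(2006, 7, 4), now))   # obs_2006_07
```

Because every partition shares one schema, the same queries, indexes, and backup scripts apply to each table unchanged; only the table name varies.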

4.4 Database Redundancy

After several years of operation it has been recognized that the pace and volume of data flow can be difficult to maintain at only one location. Uptime has remained in the 90% range, despite an increase in observations (~1000 in-situ stations collected per hour) and downstream consumers of data. A parallel database was therefore implemented as a method to improve this operating figure and provide failover redundancy during these limited downtime cases. Database redundancy might also improve the speed of map and query requests by tasking the redundant database server with more mundane web site image creation tasks, freeing the primary database server to power the interactive maps and external data feeds. More extensive database redundancy could also help promote a more ‘distributed’ concept of the overall system as a fault-tolerant series of similarly useful aggregations and products.

SEACOOS data managers decided that an easily manageable failover system needed to go beyond the database alone: the visualization and mapping components must also be replicated to front the redundant database. This visualization “stack” (PostgreSQL+PostGIS database + data scout/parser + MapServer + interactive map code) can then be pointed to from the project website level when a data flow problem is detected. The SEACOOS data scout code and database schema have been duplicated to date and have proved quite portable. A full backup implementation of the total stack is expected in the next several months. New challenges will emerge in keeping the code base identical between both sets of servers, and a system failover switch will need to be designed before this redundant system comes online.



Initializing the SEACOOS data stream required the implementation of the recommended data transport protocols, data formats, and destination database schemas.

5.1 Data Transport Protocol

The OPeNDAP protocol has been designated by the IOOS DMAC as a component for the delivery of data in a sustainable Integrated Ocean Observing System (IOOS). SEACOOS data providers decided to establish a DODS/OPeNDAP server at each institution to serve observation data using the netCDF file format. SEACOOS implemented this transport recommendation and established a SEACOOS netCDF data format convention and data dictionary that extends the CF1.0 netCDF standard. The development of this transport method, format standard, and data dictionary enabled the smooth transport and aggregation of these netCDF files to a centralized server and relational database. Expansion of the data sharing commons to a larger regional audience may necessitate the revisiting and extension of these standards although they have proven robust to date.

To reduce the scripting burden of retrieving federal data for the data commons, development and adoption of XML and web services at the federal level would lessen the need for ‘screen-scraping’ and other inefficient acquisition techniques. This changeover follows one of several data transport web service recommendations in the IOOS DMAC documents. Within this context it would be especially helpful to have a national consensus on a small handful of data/metadata request and response models. OGC specifications such as the Web Map Service (WMS) and Sensor Web Enablement (SWE) have been helpful toward these ends. XML responses could also be compressed or zipped to offset the format's higher bandwidth requirements.

While discussion will continue about the best methods of data transport, the more critical issue is data content and how it is represented with a standard format and vocabulary. Awareness and agreement are needed both on how data are represented and on how they are provided. XML formats offer significant advantages: a flexible, extensible record format with standard tools for validating and processing record elements. Simpler ASCII or Comma Separated Value (CSV) formats and HTTP/XML access methods similar to Really Simple Syndication (RSS) can be used when simplicity is the goal or when technical resources to support more complex data formats and protocols are not available.
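As a minimal example of the simpler delivery path, the sketch below renders observation rows as CSV text using only the standard library; the column names are assumptions for illustration, not a SEACOOS convention.

```python
import csv
from io import StringIO

def observations_to_csv(rows, header=("platform", "time_utc", "lat", "lon", "value")):
    """Render observation rows as CSV text, the low-barrier delivery
    format suggested for providers lacking the resources for netCDF or
    full XML web services."""
    out = StringIO()
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(rows)
    return out.getvalue()
```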

5.2 In-situ Observations and Model Output

The process SEACOOS followed to prepare for in-situ and model output data streams formatted to netCDF files is as follows:

  1. Database schema preparation: Pick a physical variable of interest (like wind speed & direction, sea surface temperature). Each variable is defined within a separate database table (one record for each measurement). One table would contain station id, time, latitude, longitude, depth, wind_speed, wind_direction, and associated metadata fields. Another table would contain station id, time, latitude, longitude, depth, sea_surface_temperature, and associated metadata fields. Table joins are possible using SQL, but are not currently used. Instead each separate table generates a GIS layer which can then be superimposed.
  2. Determine how the measurements will be defined in time and space: SEACOOS uses the standard UNIX or POSIX time epoch (seconds elapsed since 00:00:00 UTC on 1970-01-01). This can be a floating point number for subsecond intervals. For spatial considerations SEACOOS has developed a netCDF convention for datatypes relating to the degrees of freedom of the measurement point(s). This netCDF convention provides guidelines on how these should be defined in a netCDF file via dimensions and attributes.
  3. Additional considerations for display purposes: It's important to note that, out of the box, MapServer only provides visualization for x and y (latitude and longitude). One of the real strengths of the SEACOOS visualization is the inclusion of time and depth. Unfortunately, this also makes the data flow more complicated. Metadata fields are added which take into consideration:
    • Whether the data point should be shown in the display
    • Whether the data point can be normalized given an agreed upon normalization formula
    • How the data point is oriented as it relates to a specific coordinate reference system
    • How the data are interpolated and chosen for display

To add a new physical in-situ variable, aside from addressing any new naming conventions, step 3 is the only step that should be required. Steps 1 & 2 are initial group discussion/decision processes, subject to periodic reconsideration and revision if needed. Step 3 takes product considerations (GIS in this case) into account, whereas the work accomplished in steps 1 & 2 should be universally applicable for aggregation needs across a variety of products.
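The per-variable table layout of step 1 and the epoch time convention of step 2 can be sketched together; SQLite stands in for PostgreSQL here, and the column names follow the lists above, so treat this as an illustration rather than the production schema.

```python
import sqlite3
import calendar

# One table per physical variable, each row one measurement, matching
# the per-variable layout described in step 1.
SCHEMAS = {
    "wind": """CREATE TABLE wind (
        station_id TEXT, obs_time REAL, latitude REAL, longitude REAL,
        depth REAL, wind_speed REAL, wind_direction REAL)""",
    "sst": """CREATE TABLE sst (
        station_id TEXT, obs_time REAL, latitude REAL, longitude REAL,
        depth REAL, sea_surface_temperature REAL)""",
}

def epoch(y, mo, d, h=0, mi=0, s=0):
    """Seconds since 1970-01-01 00:00:00 UTC (step 2's time convention)."""
    return calendar.timegm((y, mo, d, h, mi, s))

db = sqlite3.connect(":memory:")
for ddl in SCHEMAS.values():
    db.execute(ddl)
db.execute("INSERT INTO sst VALUES (?, ?, ?, ?, ?, ?)",
           ("nc-wb1", epoch(2006, 8, 19, 23), 34.2, -77.8, 0.0, 28.5))
```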

Note that the above three steps are being increasingly ‘built in’ to the new relational database schemas, such as the ‘Xenia’ schema mentioned earlier, which uses an observational index over the same data structures. We would like to build these repeating structures and elements into the system (schema and code reuse) to speed data development and reduce maintenance.

For in-situ observations and model data, each partner institution set up a DODS/OPeNDAP netCDF server to share a netCDF file representing the past 2 days' worth of data. The data are still available via this interface, but since each transmission involves only a few kilobytes, the files are currently fetched directly via HTTP. So, for performance reasons, when aggregating the data at the central relational database, the netCDF files are retrieved directly (bypassing the DODS/OPeNDAP API) and parsed with perl netCDF libraries. Data providers could be alerted when there is a problem with the data made available to the data scout.

  • The SEACOOS perl data scout gathers the latest data files from providers on a periodic basis and converts these netCDF files to SQL INSERT statements to populate the relational database.
  • Documentation exists for the SEACOOS CDL v2.0 netCDF format convention; SEACOOS CDL v3.0 is under development and due for release in the next two months.

5.3 Remote Sensing

The aggregation process for remotely sensed data differs from the temporally regular data streams above, since satellite overpasses may occur only once or twice a day. Images (usually PNG files) are fetched as they are made available from SEACOOS partners. Each image filename contains a reference to the product type and a timestamp (e.g. ‘avhrr_sst_200608192300.png’), has an associated WLD file for georeferencing, and has a matching reference created in a remote sensing database lookup table. These lookup tables contain file pointers to all remotely sensed images on the file system, indexed by timestamp. The timestamp is used to determine which image should be displayed for given temporal conditions in SEACOOS mapping applications.
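Parsing the product type and timestamp out of such a filename might look like the following; the filename pattern is inferred from the single example above, so treat it as an assumption.

```python
import re
from datetime import datetime, timezone

# Pattern inferred from filenames like 'avhrr_sst_200608192300.png'
FILENAME_RE = re.compile(r"^(?P<product>[a-z0-9_]+)_(?P<ts>\d{12})\.png$")

def parse_image_filename(name):
    """Extract the product type and UTC timestamp used to index remote
    sensing images in the lookup tables; returns None on no match."""
    m = FILENAME_RE.match(name)
    if not m:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y%m%d%H%M").replace(tzinfo=timezone.utc)
    return m.group("product"), ts
```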

5.4 Archive Process

The initial SEACOOS experiment was to aggregate and display data, with aggregated data held for only two weeks before being removed from the system. The group decided that the aggregation was valuable as a product in itself, both for the usefulness of a common table format for observation data and for any conversion or quality control scripts run against the aggregation as a whole. With this in mind, several of the observation types have been archived on an ongoing basis since September 2004, as system processing and storage resources have permitted.

The primary archive responsibilities remain with the SRDPs as they are always the most familiar with their own data quality and processing needs. Regionally aggregated data is archived as follows:

As in-situ data becomes older than two weeks, it is moved to a similarly structured but separate database for archival records. This processing can happen during slower hours as part of general system maintenance.
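The archival move can be sketched as a pair of SQL statements run against identically structured live and archive tables; SQLite stands in for PostgreSQL, and the table names are illustrative.

```python
import sqlite3

TWO_WEEKS = 14 * 86400  # seconds

def archive_old_rows(db, now_epoch):
    """Move in-situ rows older than two weeks from the live table to a
    similarly structured archive table, as in the maintenance step
    described above."""
    cutoff = now_epoch - TWO_WEEKS
    db.execute("INSERT INTO obs_archive SELECT * FROM obs WHERE obs_time < ?",
               (cutoff,))
    db.execute("DELETE FROM obs WHERE obs_time < ?", (cutoff,))
    db.commit()

db = sqlite3.connect(":memory:")
for t in ("obs", "obs_archive"):
    db.execute("CREATE TABLE %s (station_id TEXT, obs_time REAL, value REAL)" % t)
db.executemany("INSERT INTO obs VALUES (?, ?, ?)",
               [("a", 1000000, 1.0), ("b", 3000000, 2.0)])
archive_old_rows(db, now_epoch=3000000)
```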

Remote sensing imagery is kept on the primary file system as storage resources permit and may be moved to cheaper, slower offline storage media after a year. Large external USB storage devices (> 300 Gigabytes) have proven useful as an inexpensive (< $1 per Gigabyte) secondary storage medium. These and methods like Storage Area Networks (SANs) are an improvement over manually loaded and hard-to-manage media such as tapes and discs. These older data should also be provided to national archive centers like the National Oceanographic Data Center (NODC) after the represented data providers have had a chance to review and edit or remove their data from the archive records. The IOOS DMAC plan contains limited specifications for archive data centers.

Model data, due to its large volume, is not archived. Responsibility is left with the modeling teams to supply the initial model starting conditions needed to recreate earlier model output if required.

5.5 QA/QC Methods and Requirements

As discussed above, the bulk of QA/QC testing resides with the SRDP providing the information to the RA. This information will be transmitted as part of the regional netCDF-convention-compliant files collected and parsed by the RA. The RA centralized database must, however, contain the proper schema to receive and store the test results, as well as the capability to perform any remaining tests (nearest neighbor or model comparisons). These placeholder attributes or columns will be populated as the regional guidelines for QA/QC processing come into production. Additional scripts and routines are being written to perform RA-level testing and to include QA/QC information in data filtering for mapping and dissemination. These processes are underway in SEACOOS, and recommendations for how SECOORA might implement them are still being developed and tested. A more detailed discussion of recommended QA/QC standards and procedures can be found in the SEACOOS/SECOORA QA/QC whitepaper (in progress). Additional research is underway to determine how best to translate QA/QC results into the robust error estimates requested by several regional data consumers (the US Coast Guard, for example).
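As a trivial illustration of the kind of test result the RA schema must store, a gross range check might look like this; the flag vocabulary is an assumption for illustration, not a SEACOOS or SECOORA standard.

```python
def range_test(value, lo, hi):
    """A basic QA/QC range check of the kind run at the SRDP level,
    returning a flag the RA database can store alongside the value."""
    if value is None:
        return "missing"
    return "pass" if lo <= value <= hi else "fail"
```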

Ancillary metadata about the sensors measuring each observation is also part of the latest netCDF convention specifications for QA/QC (described in SEACOOS CDL v3.0, release date TBD). These data will become the source for an updated region-wide sensor inventory to replace the previously static SEACOOS Equipment Inventory. Schema development is underway to create a database home for these data as they come online. Including sensor data in the observations data stream eliminates many of the prior application's difficulties with currency and with linkages between observations and each sensor. Many of the recommended metadata and monitoring procedures mentioned throughout this paper could be improved by relying on this dataset. These additional metadata are also recommended by the IOOS DMAC on a per-observation basis.

5.6 Performance Monitoring

There is also regional interest in the RA compiling some benchmarks on system performance. The concept of a virtual operations center that monitors the entire data sharing matrix in real time could both provide feedback on the entire system as well as notifications to individual partners when data flow problems are detected. Developing performance metrics is a key component in evaluating the RA’s performance in the eyes of funding agencies and potential operational data consumers. For example, the US Coast Guard requires at least 95% data uptime before including data providers in their new SAROPS environmental data catalog - how close is the RA to achieving this mark and are we getting closer?

A rudimentary system is in place within the SEACOOS project that monitors several outgoing data streams and alerts consumers of those streams when problems are detected. This system also links the latest records in the aggregated observation database for each known sensor, potentially identifying breakdowns in sensor performance, data transmission, database population, and data re-distribution. Such an internal warning requires an up-to-date snapshot of the platforms and sensors operating in the region and is one of the justifications for pushing forward on the RA sensor metadata project mentioned earlier.
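The staleness check at the heart of such a warning system can be sketched as a comparison of each sensor's newest aggregated record against a threshold; the one-hour threshold is illustrative, not a SEACOOS setting.

```python
def stale_sensors(last_seen, now_epoch, max_age=3600):
    """Flag sensors whose newest aggregated record is older than
    max_age seconds, the kind of check the internal warning system
    performs against the regional platform/sensor snapshot."""
    return sorted(s for s, t in last_seen.items() if now_epoch - t > max_age)
```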



The visualization efforts underway in SEACOOS are one of the main successes of the project and have led several OOSs to adopt pieces of this visualization stack and methodology. Major changes to this system during a transition to SECOORA should be made very carefully.

After SEACOOS data are collected and aggregated, visualization procedures are implemented to represent this data for constituent user groups. These procedures provide immediate feedback and validation of SEACOOS data aggregation efforts, quickly addressing integration issues about data projection and resolution. These procedures use open source software whenever possible. The methods presented below encompass a significant amount of development work that has coalesced into a robust data visualization effort. This effort is the key step toward leveraging SEACOOS project data into national and international ocean observing efforts. Of particular interest, SEACOOS data currently flows into the test bed OpenIOOS map interface and is a keystone in that application’s data flow.

6.1 Web Mapping with MapServer

Visualization of SEACOOS data over the web uses the Minnesota MapServer open source mapping platform. MapServer is well adapted for use with PostgreSQL (via PostGIS) and can serve web mapping services within a flexible, scriptable environment. Although MapServer can parse ESRI data formats, these are not required and data source customization is encouraged. MapServer uses a “mapfile” (*.map) to set up parameters and control visualization details for each map instance. The instance powering SEACOOS maps is housed alongside the SEACOOS aggregated database. A redundant visualization node is being developed to front the redundant database server at the backup location.
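For orientation, a minimal mapfile LAYER fragment for a PostGIS-backed point layer might look like the following; the connection string, layer name, geometry column, and styling are placeholders, not the actual SEACOOS configuration.

```
LAYER
  NAME "wind_obs"
  TYPE POINT
  CONNECTIONTYPE POSTGIS
  CONNECTION "host=localhost dbname=seacoos user=mapserver"
  DATA "the_geom FROM wind"
  CLASS
    SYMBOL "circle"
    SIZE 6
    COLOR 0 0 255
  END
END
```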

6.2 Support Visualization Applications

Several other open source applications are used to graph, animate, and query SEACOOS data. GIFsicle and AnimationS (AniS) are used to create and control data animations over the web. ImageMagick is used for image manipulation and to execute raster data queries. Mouseover query functionality is enabled with the searchmap component of MapServer, which creates an imagemap of the existing map image for queries. Gnuplot is used to generate time series graphs. All of these tools are scriptable and run behind the SEACOOS interactive maps.

6.3 Data Exploration Applications

Further data exploration and visualization has been enabled to allow researchers quick access to the SEACOOS database. These tools are automated web pages that rely on PHP to interact with the PostgreSQL database and MapServer, presenting database content ranges and simple maps. A similar suite of automated internal data visualization tools should be developed for SECOORA to provide easy exploration and validation of data for project researchers. The following pages are in use by SEACOOS researchers and are updated in near real time:

A data overview page displays a list of min and max timestamps for all SEACOOS interactive map data (model input data and observations data). It also provides links to individual pages for each data layer displaying the specific time slices available for each individual layer. Access is also available to map images of each layer and each time and date stamp, across 3 selectable regions. While these pages are not intended for the general public, they provide on-demand access and visualization to the entire SEACOOS database for our distributed research community.

The data animation page takes URL query string parameters and creates animations of data ingested by SEACOOS. The animation routine combines maps and graphs for most SEACOOS data. Users have control over the GIS layers, scale, platforms to graph, and time step. The animations are then served via another PHP-generated page with full playback controls. They are created, stored, and served at USC until the user asks for them to be removed.

A cached observation page serves static images each hour for a variety of SEACOOS data layers and sub regions. This page is supplied by a script that sends modified URL query strings to the MapServer and caches the map images that are returned.
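The caching step can be sketched as deriving a deterministic cache filename from the modified query string; the base URL and parameter names are placeholders, not the actual SEACOOS script.

```python
import hashlib
from urllib.parse import urlencode

BASE_URL = "http://example.org/cgi-bin/mapserv"  # placeholder host

def cached_map_request(params):
    """Build the modified MapServer query URL and a deterministic cache
    filename for it, so the hourly script can reuse images already
    returned for the same layer/region combination."""
    query = urlencode(sorted(params.items()))
    name = hashlib.md5(query.encode()).hexdigest() + ".png"
    return BASE_URL + "?" + query, name
```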

6.4 OGC Web Services

Further external presentation of SEACOOS data is enabled through web service standards set by the Open Geospatial Consortium (OGC). OGC web mapping services are intended to be platform-independent, web-enabled representations of GIS data and are a key component of the IOOS DMAC implementation. The services can be accessed, controlled, and presented by web browsers as well as other GIS software platforms (e.g. the ESRI Interoperability toolbar or the GAIA application). SEACOOS OGC services rely on the MapServer CGI engine. SEACOOS provides Web Map Service (WMS) and Web Feature Service (WFS) feeds. The WMS feed returns static, georeferenced images to a user's browser or GIS platform, while the WFS feed returns the actual feature data, allowing visualization control and spatial analysis on these data externally. Both services are heavily used by sub-regional and national projects (e.g. the OpenIOOS testbed).

SEACOOS has extended its MapServer GIS and OGC web services to incorporate the time component of observed data, and we would like to see a time point or range index eventually included in the OGC services themselves. A recommendation for the temporal extension of WMS exists in the WMS 1.1.1 specification but is not yet implemented in SEACOOS. In addition, IOOS DMAC is recommending the OGC Web Coverage Service (WCS) as a research implementation for serving raster data as a web service. As OGC specifications continue to develop, the RA must ensure that regional web services remain compliant and take full advantage of recommended functionality.
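A WMS 1.1.1 GetMap request carrying the optional TIME dimension parameter described in the specification can be composed as follows; the base URL and layer name are placeholders.

```python
from urllib.parse import urlencode

def wms_getmap_url(base, layer, bbox, time_utc,
                   width=600, height=400, srs="EPSG:4326"):
    """Compose a WMS 1.1.1 GetMap request including the optional TIME
    dimension parameter from the specification's temporal extension."""
    params = {
        "SERVICE": "WMS", "VERSION": "1.1.1", "REQUEST": "GetMap",
        "LAYERS": layer, "SRS": srs,
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width, "HEIGHT": height,
        "FORMAT": "image/png",
        "TIME": time_utc,  # ISO 8601 instant (or start/end range)
    }
    return base + "?" + urlencode(params)
```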

6.5 XML Web Services

SEACOOS involvement in providing OGC WMS/WFS feeds was part of a general community interest in standing up experimental web services following the direction set out by the IOOS DMAC. After a 2005 OOSTech meeting on web services in Baltimore, Maryland, several attending observing system technical representatives organized an ‘OOSTech Service Definition Team’, which has developed a simple web service for sharing the latest salinity measurements on a national map.

This development led to the XML and web service enabling of the SEACOOS database of observations. Development of this type follows the lines of Service Oriented Architecture (SOA) and other data integration efforts that are XML and web service specific, and constitutes a key IOOS DMAC recommendation. The cornerstone of these technologies is the sharing of data (or of critical processing metadata along with binary objects) using XML and XML-specific technologies for validation and processing. The earlier SEACOOS netCDF convention and data dictionary could eventually be aided or supplanted by these wider data standards.

SEACOOS has also moved forward with making other popular XML data feeds available via HTTP; this common, simple style of data sharing is often referred to as REST. The existing SEACOOS XML observation feed has also been converted to Keyhole Markup Language (KML), which allows the latest SEACOOS collected observations to be viewed within the Google Earth product and other 3D geospatial browsers. GeoRSS presents another simple but effective data sharing model that may become more widespread.
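A sketch of wrapping one latest-observation record as a KML Placemark; the namespace version and element content are minimal illustrations of the format, not the actual SEACOOS feed conversion.

```python
import xml.etree.ElementTree as ET

# KML namespace of the era; version is an assumption for illustration.
KML_NS = "http://earth.google.com/kml/2.1"

def observation_to_kml(station_id, lat, lon, description):
    """Wrap one latest-observation record as a KML Placemark suitable
    for viewing in Google Earth and other geospatial browsers."""
    ET.register_namespace("", KML_NS)
    kml = ET.Element("{%s}kml" % KML_NS)
    doc = ET.SubElement(kml, "{%s}Document" % KML_NS)
    pm = ET.SubElement(doc, "{%s}Placemark" % KML_NS)
    ET.SubElement(pm, "{%s}name" % KML_NS).text = station_id
    ET.SubElement(pm, "{%s}description" % KML_NS).text = description
    point = ET.SubElement(pm, "{%s}Point" % KML_NS)
    # KML coordinates are longitude,latitude[,altitude]
    ET.SubElement(point, "{%s}coordinates" % KML_NS).text = "%s,%s" % (lon, lat)
    return ET.tostring(kml, encoding="unicode")
```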

There is also a focus on packaging and web service enabling the existing functionalities provided throughout the system. Packaging and documentation can help produce redundant data feeds, aggregations, products, and services. These redundancies can create a network in which applications gracefully switch between available data sources and services.

Web service enablement allows functional components that are often built into an application or system, such as quality control processing, notification, or visualization, to be shared and reused by other system workflows (pipeline/component-style processing). This helps make processes, as well as data, more machine-to-machine interoperable and widely useful.

6.6 Existing Delivery Formats

In addition to the web service data query methods listed earlier, there are several other methods and formats for delivery of SEACOOS data. These additional formats respond to data consumers' requests for additional translation tools and format options.

The SEACOOS relational database provides an OPeNDAP relational database query interface for querying the aggregated observation tables. SEACOOS hopes to expand its OPeNDAP “out” services to also serve netCDF files of the various aggregated datasets (requested by several data consumers, the USCG for example). In addition, IOOS DMAC recommends that the data center export gridded datasets (model output, remotely sensed images) via OPeNDAP.

Data are also exported on a daily and monthly basis from the database tables to CSV files available over HTTP. CSV and other column-oriented formats are most useful to the modeling and research communities, who prefer well-structured data for batch-oriented processing in their models and tools. While public web displays are often oriented toward local time, GMT/UTC time and SI measurement units are the preferred output conventions for modeling and research data delivery.

SEACOOS has also provided data conversion filters in the past to address problems requiring some conversion of data to a more useable format and will continue to provide these conversions or tools where a clear need exists.

6.7 Near Real-Time and Archival Data

From an aggregation perspective, establishing near real time in-situ observation data flows has been relatively manageable because telemetry bandwidth and operational power constraints limit the amount of data the instrumentation passes. Similar hourly-collected, low-volume data streams should continue to be easily collected, processed, and archived. These near real time streams will remain of immediate interest to several audiences, such as recreational, emergency management, search and rescue, and circulation model users, provided the streams are reliable (always available) and credible (quality controlled).

A complementary set of data aggregated over longer time scales is also useful to researchers developing or requiring climatologies for their applications (fisheries managers, for example). Such temporal aggregation of the existing spatially aggregated dataset is needed and requires an additional set of resources to store and serve. SEACOOS has provided such data in monthly sets and on an ad hoc basis when requested. The emerging RA may be expected to expand this effort to longer time steps on a more regular basis.

The more difficult archival issues concern high data volume and complexity. Full in-situ datasets collected during instrumentation ‘turn-arounds’, remote sensing imagery and products, and modeling data and products can all quickly outgrow most regionally provided processing resources. These types of data will continue to need specialized modeling and user resource centers that address processing and product needs falling outside a more general regional focus. A general regional data center can help coordinate the combined display and sharing of derived products from these datatypes (as images, etc.), where the focus is on display and interpretation of data products rather than on processing and archiving of primary data.


7. Current Developments

While the activities described above have established a successful and robust ocean observing system for the Southeast US, improvements are ongoing. As SECOORA data management emerges, these research activities should be carried forward under the new data management hierarchy. Where relevant, attention should be given to the guidance provided by the IOOS DMAC planning documents, which are under constant development. These activities are explained above but merit collective presentation as current developments.

Several metadata organization and distribution projects are currently underway. These include input to IOOS national ocean observation catalog efforts (i.e., CSC) and a more thorough way of accounting for the sensor-specific metadata soon to be transmitted as part of the QA/QC process. (Section 3.5)

Database-specific changes deal with repackaging and simplifying the existing schemas to make them more modular, which will ease the addition of new variables to the aggregated data commons. In addition, database redundancy and failover capabilities are being developed to better distribute the increasingly heavy load of aggregating and serving project data. (Sections 4.2 and 4.4)

Changes to the aggregation process include the incorporation of new QA/QC procedures both at the SRDP level and at the RA level. These new procedures and ancillary data will also be used to develop better performance monitoring metrics and a dynamic sensor inventory. (Sections 5.5 and 5.6)

The last set of current developments deals with data dissemination. The existing OGC data feeds may be reworked to better support the WMS time specification where possible. WCS should also be explored as a method for serving raster data via a web service. In addition, several other flavors of XML web services are being developed as part of a national push toward Service Oriented Architecture for data dissemination. There is also interest in expanding the range of disseminated data formats to include netCDF files of the aggregated data via DODS/OPeNDAP, and datasets aggregated over longer time scales. (Sections 6.2 – 6.7)




Figure 1. SEACOOS Data Flow (In-Situ Data, Model Output)


Figure 2. SEACOOS Data Flow (Remote Sensing Data, Raster Images)


Figure 3. Multi_Obs schema


Figure 4. Xenia schema

