This paper defines the best practices for documenting ocean acidification (OA) data and presents a framework for an OA metadata template. Metadata is structured information that describes and locates an information resource. It is the key to ensuring that a data set will be accessible into the future. With the rapid expansion of studies on biological responses to OA, the lack of a common metadata template to document the resulting data poses a significant hindrance to effective OA data management efforts. In this paper, we present a metadata template that can be applied to a broad spectrum of OA studies, including those studying the biological responses to OA. The “variable metadata section”, which includes the variable name, observation type, whether the variable is a manipulation condition or response variable, and the biological subject on which the variable is studied, forms the core of this metadata template. Additional metadata elements, such as investigators, temporal and spatial coverage, and data citation, are essential components to complete the template. We explain the structure of the template, and define many metadata elements that may be unfamiliar to researchers.
Available at NOAA Institutional Repository Accession number: ocn881471371
Since the start of the Industrial Revolution, human activities have released
large amounts of carbon dioxide (
Over the past 10 years, OA studies have expanded significantly. Based on the bibliographic database of the European Project on Ocean Acidification (EPOCA), the number of OA publications averaged about 10–20 per year from 1990 to 2005, and then increased sharply to about 270 publications per year by 2011 (Laffoley and Baxter, 2012). Much of this increase was from studies on biological responses to OA (Nisumaa et al., 2010, 2012). For example, publications of this type accounted for over 80 % of all OA papers in 2011 (Laffoley and Baxter, 2012). With the rapid growth in research and the parallel rise in publications, the need for a comprehensive OA metadata template to facilitate archiving and access to these data has been recognized by the international ocean acidification data management community (Hansson et al., 2014).
Metadata is structured information that describes an information resource (e.g., an oceanographic data set), enabling its discovery and access (Guenther and Radebaugh, 2004). For OA studies, a metadata record documents such information as what was measured; by whom; when (temporal coverage); where (geographic coverage); how it was sampled and analyzed; with what instruments and following what protocol; and finally its units of measure and data quality (Pesant et al., 2010).
Metadata is critical to data discovery. It enables data sets to be found through relevant criteria, and provides information about locations of the data sets. Metadata also helps to document information about the data sets, so that they can be understood and utilized beyond the original use. Metadata plays an extremely important role in supporting archiving and preservation of data, facilitating interoperability, and synthesizing legacy data. It serves as the key to ensuring that a data set will be accessible by future researchers (Guenther and Radebaugh, 2004).
While metadata templates for chemical OA data have been available for a long time, the lack of a template for biological response OA data is a significant hindrance to the effective management of these data. Establishing a metadata template is thus fundamental for research into biological responses to OA. We present a metadata template developed in collaboration with many OA researchers. It is applicable to a broad spectrum of OA data sets, including those from studies of biological responses to OA.
The envisioned OA metadata template responds to the needs expressed by the community to meet three requirements: (1) to enable data discovery, (2) to document information about OA data sets in a consistent manner, and (3) to be broadly applicable to many types of OA studies. These three requirements served as the guiding principles in the development of the OA metadata template.
One of the main functions of metadata is to enable data discovery. During the OA metadata template development, data discovery was emphasized when decisions were made on the selection of elements and their organization.
Another important role of the metadata template is to document information about a data set. The value of a data set increases significantly if it comes with the documentation needed to understand and use it. If such information could be collected, stored, discovered, and accessed in a consistent manner, the data set would be more readily available to support improved assessments of marine ecosystem vulnerability and better OA forecasting capabilities.
The template targets a broad spectrum of OA data sets. Ocean acidification covers a wide range of oceanographic subject areas, including chemical observations, biological monitoring, physiological response experiments, model studies, and paleoceanography studies. If the metadata template can be constructed to apply to many types of OA data sets, the OA data management effort will be much more effective.
The OA metadata template development involved two steps: developing the content standard and mapping its elements to a metadata format standard.
Developing the content standard involved selecting the metadata elements and establishing their hierarchical relationships. This process involved five steps: (1) studying the experiment setup and analyzing available data sets for biological response OA studies; (2) building on top of existing chemical OA metadata templates; (3) obtaining feedback from biological OA experts; (4) testing the template by using sample information from publications on biological response OA studies; and (5) finalizing the template and adding definitions for each metadata element.
Numerous metadata format standards exist for documenting geospatial metadata. One of the most complete and widely applicable standards is the International Organization for Standardization (ISO) 191** series, which is designed for geospatial metadata associated with positioning information on the surface of the globe (International Organization for Standardization, 2009). The ISO standards have been adopted throughout the international environmental community and were officially endorsed as US standards by the Federal Geographic Data Committee (FGDC) in September 2010.
Metadata elements from the content standard development were mapped to their corresponding fields in the ISO 19115-2 standard (International Organization for Standardization, 2009; Mize et al., 2011) to generate the ocean acidification ISO metadata template. The ISO 19115-2 standard was chosen to take advantage of such sections as “MI:Acquisition”, which allows data providers to capture information about ships and other data collection platforms in a structured, machine-readable format. All of the fields in the original ISO 19115 standard are captured within ISO 19115-2, so no content is lost by choosing this version of the standard.
The OA metadata template (Ocean Acidification Data Stewardship Team, 2014)
consists of three files:
a an an
In the following sections, the main structure of the metadata template, its
elements, and their hierarchical relationships will be described.
Table 1. Some commonly used ocean acidification variables, their definitions, and recommended abbreviations.
Table 2. Variable metadata section, with child metadata elements organized around the variable/parameter.
Table 3. Commonly used observation types of a variable in ocean acidification studies.
The term “Variables” (or “Parameters”) refers to the observed or derived properties of a study (e.g., temperature, salinity, dissolved oxygen (DO), chlorophyll, and larval survival rate). Hereafter, we will use the word Variable(s), although Parameter(s) is an acceptable synonym for this discussion. Table 1 lists some commonly used variables in OA research. Variables are treated as the focal point of the entire metadata template because we expect them to be the single most important metadata elements that would be used as search terms to locate a data set. Furthermore, all types of OA research generate some kind of variables, regardless of their sampling scheme, experiment setup, or model inputs. Therefore, the treatment of variables as the focal point of the template allows the template to apply to many types of OA data.
In order to meet the data discovery goal, it is important to maintain controlled vocabularies for the variables (Table 2). As an example, the variable – dissolved inorganic carbon – is also commonly referred to as total carbon dioxide. If no controlled vocabulary is used, data consumers would have to search the database using all possible variations of the term, in order to locate every available data set. Controlled vocabularies, however, would allow data consumers to locate the data sets by using the corresponding term in the controlled vocabulary.
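The role of a controlled vocabulary in data discovery can be illustrated with a minimal sketch. The vocabulary below is hypothetical and is not the OA-ICC or SeaDataNet vocabulary; the only synonym taken from the text is “total carbon dioxide”, and the abbreviations and function names are assumptions made for illustration.

```python
# Hypothetical controlled vocabulary: preferred term -> accepted synonyms.
# Only "total carbon dioxide" comes from the text; the rest is illustrative.
CONTROLLED_VOCABULARY = {
    "Dissolved inorganic carbon": ["total carbon dioxide", "DIC", "TCO2"],
}


def build_lookup(vocabulary):
    """Map every known spelling (case-insensitive) to its preferred term."""
    lookup = {}
    for preferred, synonyms in vocabulary.items():
        for term in [preferred, *synonyms]:
            lookup[term.strip().lower()] = preferred
    return lookup


def normalize_variable(name, lookup):
    """Return the preferred term for a variable name, or None if unknown."""
    return lookup.get(name.strip().lower())


LOOKUP = build_lookup(CONTROLLED_VOCABULARY)
```

With such a lookup, a search for “Total carbon dioxide” and a search for “DIC” would both resolve to the same preferred term, so a data consumer would not need to try every variation.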
Ideally, controlled vocabularies should be standardized across all OA data centers. The Ocean Acidification International Coordination Centre (OA-ICC) has been leading the efforts in developing controlled vocabularies for OA data documentation. SeaDataNet, the “Pan-European infrastructure for ocean and marine data management”, also provides controlled vocabularies for many kinds of broader oceanographic services.
The child metadata elements of a variable are organized around the variable itself to form a “variable metadata section” (Table 2). Together with the variable name, three fields form the skeleton structure of the variable metadata section: the observation type, whether the variable is a manipulation condition or a response variable, and the biological subject on which the variable is studied.
“Observation type” identifies the way a variable was captured in relation to its observational context. It is typically a generic term that describes how a variable is collected. For example, for chemical OA studies, the observation type could be “Surface underway”, “Time series”, or “Profile” (Table 3). For physiological response OA studies, such terms as “Laboratory experiment”, “Pelagic mesocosm”, “Benthic mesocosm”, or “Natural perturbation site study” could be used (Pesant et al., 2010).
A metadata element called “In-situ/manipulation/response” is also added as
a child element of a variable. In physiological response OA studies,
variables could fall into several categories. For example, carbon-related
variables, e.g., pH, partial pressure of carbon dioxide (
In biological studies, many of the measured variables are attached to a specific organism or a biological community. For example, the variable “Larval survival rate” is not detailed enough without mentioning species on which the larval survival rate was studied. An element, called “Biological subject”, is where users identify an organism or a biological community, to which the observation applies.
The four metadata elements discussed above – variable name, observation type, in-situ/manipulation/response, and biological subject – form the skeleton structure of any “variable metadata section”. They are also the main discovery metadata elements that would be used to locate data in a data search portal. However, additional metadata elements are needed to make the variable independently understandable.
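As a rough sketch of how these four skeleton elements might be represented and searched, the example below uses a simple Python data class. The class name, field names, the category values for “In-situ/manipulation/response”, and the sample records are all assumptions made for illustration; they are not part of the published template files.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VariableMetadata:
    """Skeleton of a hypothetical 'variable metadata section'."""
    variable_name: str
    observation_type: str
    in_situ_manipulation_response: str  # category labels are assumed
    biological_subject: Optional[str] = None  # only for biological variables


def find_variables(records: List[VariableMetadata], **criteria) -> List[VariableMetadata]:
    """Return the records whose fields equal every given search criterion."""
    return [
        record
        for record in records
        if all(getattr(record, field) == value for field, value in criteria.items())
    ]


# Illustrative records only; they do not describe a real data set.
RECORDS = [
    VariableMetadata("pH", "Laboratory experiment", "Manipulation condition"),
    VariableMetadata(
        "Larval survival rate",
        "Laboratory experiment",
        "Response variable",
        biological_subject="Hypothetical bivalve species",
    ),
    VariableMetadata("Dissolved inorganic carbon", "Profile", "In-situ observation"),
]
```

A call such as `find_variables(RECORDS, observation_type="Laboratory experiment")` would then return the two laboratory records, which mirrors how a data search portal could filter on the discovery elements.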
“Variable abbreviation” documents the abbreviation or formula of a variable in the data files. “Full variable name” spells out the detailed descriptive name of the variable. For manipulation condition variables, their manipulation methods, e.g., bubbling carbon dioxide, adding acid or base to the solution, can be recorded in a field called “Manipulation method”. “Units” should be reported in accordance with the National Institute of Standards and Technology (NIST) International System of Units (SI). In addition, whether a variable is “Measured or calculated”, and its “Calculation method and parameters” are also important pieces of information to document (Table 2).
Instrumentation is split into two categories: “Sampling instruments” and “Analyzing instruments”. A common mistake in data management practices is that sampling and analyzing instruments are used interchangeably. For example, if a researcher measured dissolved oxygen (DO) using an oxygen sensor attached to a conductivity, temperature, depth (CTD) rosette, some people may document the instrument of the study as CTD rosette, but others may think the instrument is the oxygen sensor, even though both should be recorded: CTD rosette as the sampling instrument and oxygen sensor as the analyzing instrument. Instruments that are used to collect water samples or deploy sensors are here defined as sampling instruments. Examples of sampling instruments include a CTD rosette, a Niskin bottle, and a flow-through pump onboard a research vessel. The term Analyzing instruments, however, refers to instruments that are used to analyze water samples collected with the sampling instruments, or sensors that are mounted on the sampling instruments to measure some variables of the water. For example, the analyzing instrument for a pH measurement could be a glass electrode coupled with a pH meter, or a spectrophotometer. In addition, we also created a free text field called “Detailed sampling and analyzing information” to allow users to capture additional details of their sampling and analyzing procedures beyond what instruments are used.
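The distinction between the two instrument categories can be sketched as follows, using the dissolved oxygen example from the text. The class, field names, and the completeness check are hypothetical and only illustrate the idea that both categories should be recorded.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instrumentation:
    """Hypothetical container for the instrument fields of one variable."""
    sampling_instrument: str = ""
    analyzing_instrument: str = ""
    detailed_sampling_and_analyzing_information: str = ""


def missing_instrument_fields(record: Instrumentation) -> List[str]:
    """List the instrument categories that were left empty."""
    missing = []
    if not record.sampling_instrument.strip():
        missing.append("Sampling instruments")
    if not record.analyzing_instrument.strip():
        missing.append("Analyzing instruments")
    return missing


# Dissolved oxygen measured with an oxygen sensor mounted on a CTD rosette.
DO_RECORD = Instrumentation(
    sampling_instrument="CTD rosette",
    analyzing_instrument="Oxygen sensor",
)

# A record that names only the sensor, as in the common mistake described above.
INCOMPLETE_RECORD = Instrumentation(analyzing_instrument="Oxygen sensor")
```

Here `missing_instrument_fields(INCOMPLETE_RECORD)` reports that the sampling instrument is missing, while the complete dissolved oxygen record passes the check.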
Several elements that elaborate on the sample size and data quality are also
included in the template. “Uncertainty” is an open-text field that allows
users to document information about the data quality of the variable. Input
to this field could be the standard deviation of the measurements (e.g.,
1 %, 2
In addition to “Biological subject”, a field called “Species
identification code” is added to document the standard species IDs, if such
information is available (Table 2). Using the reference databases from the
Integrated Taxonomic Information System (
“Method reference” is reserved to hold the bibliographic citation information of the method used to measure the variable. Most OA studies involve collaboration from multiple scientists. Therefore, it is important to document the researchers' information for each variable, to give them credit for their sampling and analyzing efforts and to allow traceability for future questions. For the sake of brevity, only researchers' names and their affiliated institutions are recorded.
“Investigators” information is needed to credit researchers for their overall data collection and analysis efforts. In addition to the basic address and contact information, we recommend the use of personal identifiers (e.g., ORCID, ResearcherID) to unambiguously define the investigator, and recommend using a controlled vocabulary for organizations as well.
Temporal and spatial information provide important data constraints. The use
of ISO-8601 date (YYYY-MM-DD, e.g., 1997-07-16) and date plus hours, minutes,
and seconds (YYYY-MM-DDThh:mm:ssTZD, e.g., 1997-07-16T19:20:30
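A minimal sketch of writing and reading ISO-8601 values with the Python standard library is given below. The function names are hypothetical, and the UTC offset used in the usage note is an assumption chosen for illustration, not a value prescribed by the template.

```python
from datetime import date, datetime


def format_date(value: date) -> str:
    """Format a calendar date as YYYY-MM-DD."""
    return value.isoformat()


def format_timestamp(value: datetime) -> str:
    """Format a time-zone-aware timestamp as YYYY-MM-DDThh:mm:ss plus offset."""
    if value.tzinfo is None:
        # Refuse naive timestamps so that the time zone is never left ambiguous.
        raise ValueError("timestamp must carry a time zone")
    return value.isoformat(timespec="seconds")


def parse_timestamp(text: str) -> datetime:
    """Parse a timestamp written by format_timestamp."""
    return datetime.fromisoformat(text)
```

For instance, `format_date(date(1997, 7, 16))` yields `1997-07-16`, and a timestamp created with `tzinfo=timezone.utc` is written with a `+00:00` offset.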
“Platforms” often refer to the research vessels that carry out the
research. However, platforms could be something other than a ship (e.g.,
glider, Argo, satellite) or something that is fixed (e.g., moored buoys,
towers). “Expedition code (EXPOCODE)” consists of the four-character
International Council for the Exploration of the Sea (ICES) platform code
and the sailing date in the YYYYMMDD format. For a list of ICES platform
codes, please refer to
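The construction of an EXPOCODE described above can be sketched as follows. The function names are hypothetical, the check assumes that a platform code is four uppercase alphanumeric characters, and the platform code in the usage note is an illustrative value that is not taken from this paper.

```python
import re
from datetime import date, datetime


def build_expocode(platform_code: str, sailing_date: date) -> str:
    """Concatenate a platform code and the sailing date as YYYYMMDD."""
    # Assumption: platform codes are four uppercase alphanumeric characters.
    if not re.fullmatch(r"[0-9A-Z]{4}", platform_code):
        raise ValueError("platform code must be four alphanumeric characters")
    return f"{platform_code}{sailing_date:%Y%m%d}"


def split_expocode(expocode: str):
    """Split an EXPOCODE back into its platform code and sailing date."""
    if len(expocode) != 12:
        raise ValueError("an EXPOCODE is expected to have 12 characters")
    platform_code = expocode[:4]
    sailing_date = datetime.strptime(expocode[4:], "%Y%m%d").date()
    return platform_code, sailing_date
```

For example, `build_expocode("33RO", date(2007, 12, 15))` gives `33RO20071215`, and `split_expocode` recovers the two parts from that string.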
“Citation” refers to bibliographic information about how the data set
should be cited. When working with a data center to publish their data, data
producers may only need to provide the author list to complete the citation
as the other portions of the citation are typically the responsibility of
the data publisher. When compiling an author list, we recommend using the
format of Lastname1, Firstname1 Middlename1; Lastname2, Firstname2
Middlename2; … For data centers, we recommend the use of the styles
described in the FORCE11 Joint Declaration of Data Citation Principles
(
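The recommended author list format can be produced with a short helper. This is only a sketch; the function names and the tuple layout of the input are assumptions.

```python
def format_author(last_name: str, first_name: str, middle_name: str = "") -> str:
    """Format one author as 'Lastname, Firstname Middlename'."""
    given_names = " ".join(part for part in (first_name, middle_name) if part)
    return f"{last_name}, {given_names}"


def format_author_list(authors) -> str:
    """Join authors, given as (last, first[, middle]) tuples, with semicolons."""
    return "; ".join(format_author(*author) for author in authors)
```

Calling `format_author_list([("Lastname1", "Firstname1", "Middlename1"), ("Lastname2", "Firstname2")])` returns `Lastname1, Firstname1 Middlename1; Lastname2, Firstname2`.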
“References” are bibliographic citations of publications, e.g., papers, cruise reports, that describe the data set. Researchers often submit their data to an archive after their work has been published. It is important to share related publications to help future users better understand the data set. The “Supplementary information” field is reserved for any information critical to understanding the data set that does not fit into any other existing fields.
The creation of a common metadata template to manage biological response OA data sets is a major effort by the OA research and data management community. We described a metadata template that applies to many types of OA data sets, including chemical OA data sets and those describing the biological responses to OA. In addition to serving OA data management efforts effectively, the template can be used by the OA research community for documenting their OA data sets, sharing data sets among researchers, and submitting data sets to data centers. The metadata template files are stored at the National Oceanic and Atmospheric Administration institutional repository with a digital object identifier (DOI) of 10.7289/V5C24TCK. The metadata development approach documented here can benefit other scientific data management programs in terms of metadata template development.
This work was supported by the Ocean Acidification Program (OAP) of the National Oceanic and Atmospheric Administration (NOAA). Discussions with Hernan E. Garcia (National Centers for Environmental Information) benefitted the metadata template development. We thank Philip Goldstein (University of Colorado) for his contribution to the metadata template development and his comments on an earlier draft of the paper. Alex Kozyr (Carbon Dioxide Information Analysis Center) provided help with the documentation of chemical OA data sets. We are grateful to Andrew Dickson of Scripps Institution of Oceanography and Fiz Pérez of the Spanish National Research Council for their insightful comments on the template. Jacqueline Mize (National Coastal Data Development Center) and Sheri Philips (National Centers for Environmental Information) provided tremendous help on the use of the ISO 19115-2 format and the creation of the ISO version of the template. We thank the associate editor, Robert Key; two reviewers, Cynthia Chandler and Anton Velo; and two readers, J.-P. Gattuso and Yan Yang, for their excellent comments that helped to improve both the template and the paper. We thank Linda Jenkins (National Coastal Data Development Center) for her technical editing. We also thank Chris Chambers, Thomas Hurst, Chris Long, Annette DesRochers, Molly Timmers, Nina Bednaršek, Donald Christopher Melrose, Derek Manzello, Renee Carlton, Jessica Morgan (NOAA), and David Kline (Scripps Institution of Oceanography) for their comments on the template. Rob Ragsdale (US Integrated Ocean Observing System), Emilio Mayoga (University of Washington), and Sara Haines (University of North Carolina) contributed to the development of definitions for some of the variables. We are indebted to Dean Perry and Dylan Redman (NOAA Northeast Fisheries Science Center) for allowing us to use their metadata records as a real world example of our metadata submission form. 
We thank Mark Fornwall (United States Geological Survey) for his comments to improve the paper. Edited by: R. Key