The MAREDAT global database of high performance liquid chromatography marine pigment measurements

Abstract. A global pigment database consisting of 35 634 pigment suites measured by high performance liquid chromatography was assembled in support of the MARine Ecosystem DATa (MAREDAT) initiative. These data originate from 136 field surveys within the global ocean; they were solicited from investigators and databases, compiled, and then quality controlled. Nearly one quarter of the data originates from the Laboratoire d'Océanographie de Villefranche (LOV), with an additional 17 % and 19 % stemming from the US JGOFS and LTER programs, respectively. The MAREDAT pigment database provides high quality measurements of the major taxonomic pigments including chlorophylls a and b, 19'-butanoyloxyfucoxanthin, 19'-hexanoyloxyfucoxanthin, alloxanthin, divinyl chlorophyll a, fucoxanthin, lutein, peridinin, prasinoxanthin, violaxanthin and zeaxanthin, which may be used in varying combinations to estimate phytoplankton community composition. Quality control measures consisted of flagging samples that had a total chlorophyll a concentration of zero, had fewer than four reported accessory pigments, or exceeded two standard deviations of the log-linear regression of total chlorophyll a with total accessory pigment concentrations. We anticipate the MAREDAT pigment database to be of use in the marine ecology, remote sensing and ecological modeling communities, where it will support model validation and advance our global perspective on marine biodiversity. The original dataset together with quality control flags as well as the gridded MAREDAT pigment data may be downloaded from PANGAEA: http://doi.pangaea.de/10.1594/PANGAEA.793246.


Introduction
The recognition of the role of phytoplankton functional groups in controlling the biogeochemical cycles of critical elements has created a need to constrain the global distribution and abundance of these groups (Doney et al., 2009; Deutsch, 2010, 2012). Marine ecosystem modelers now estimate multiple phytoplankton functional types (PFTs) within biogeochemical models, but until recently there existed very limited data with which to evaluate the modeled PFTs (Le Quéré et al., 2005; Hood et al., 2006; Anderson, 2005; Buitenhuis et al., 2012b). Great promise lies with indirect estimates of phytoplankton classes from satellite retrievals. Fields of chlorophyll and/or biomass specific to PFTs (Alvain et al., 2005, 2008; Raitsos et al., 2008; Demarcq et al., 2012) or to different size classes of phytoplankton (Ciotti and Bricaud, 2006; Uitz et al., 2006; Hirata et al., 2008, 2011; Kostadinov et al., 2009; Brewin et al., 2010) can be quantified via remote sensing methods. However, at present these methods lack sufficient global validation datasets, and are restricted to the surface ocean.
Depth-resolved phytoplankton community structure can be determined in field samples through electron or light microscopy, flow cytometry, genetic analysis or high performance liquid chromatography (HPLC). Microscopy and genomics allow for direct identification of algal species and morphology; however, these methods are time-consuming for large-scale surveys, and certain types of taxa or features (e.g., flagella) may be lost or damaged depending on preservation and handling (Havskum et al., 2004). Flow cytometry allows for high sample throughput and information on size and pigment content, but is limited to the smaller size classes of plankton (Li and Wood, 1988). In contrast, HPLC allows for a high sample throughput and yields pigment concentrations covering all size ranges that may be chemotaxonomically interpreted to quantify marine phytoplankton community composition (Mackey et al., 1996; Van den Meersche et al., 2008; Wright et al., 2010; Hirata et al., 2011). Furthermore, pigments are sampled relatively frequently during oceanographic field campaigns. HPLC thus yields a data product with large potential, particularly given its advantage over satellite methods for resolving depth distribution.
Several of the major accessory pigments can provide a rough indication of the presence of specific taxonomic groups and size fractions (Vidussi et al., 2001), although a recent review has indicated that relatively few of these pigments are specific to individual phytoplankton taxa (Higgins et al., 2011). For this reason, methods employing the chemotaxonomic identification of pigment ratios specific to algal classes are recommended for interpretation of pigment data. Identification usually involves a method that determines the overall pigment structure as a linear sum of contributions from all groups, and then attempts to estimate these contributions through an (inverse) optimization method, such as is implemented in the widely used CHEMTAX program (Mackey et al., 1996). Alternative, more direct, but also less quantitative methods include the association of PFTs with certain pigment ratios that are unique for a particular group (Claustre, 1994; Uitz et al., 2006). In all cases, the pigment data need to be of high quality and consistent across all considered pigments.
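The linear-sum formulation mentioned above can be written compactly. The following is only a generic sketch of a constrained least-squares problem; the symbols are introduced here for illustration and are not taken from Mackey et al. (1996). Let p_j be the measured concentration of pigment j, r_kj an assumed ratio of pigment j to Chl a in algal group k, and c_k the unknown Chl a attributable to group k:

```latex
p_j \approx \sum_{k=1}^{K} c_k \, r_{kj}, \qquad j = 1, \dots, J,
\qquad\qquad
\hat{\mathbf{c}} = \arg\min_{\mathbf{c} \,\ge\, 0}
\sum_{j=1}^{J} \Bigl( p_j - \sum_{k=1}^{K} c_k \, r_{kj} \Bigr)^{2}
```

In this simplified form the ratios r_kj are held fixed. CHEMTAX additionally adjusts the ratio matrix itself during the fit, which this sketch does not show.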
We have combined 136 independent field datasets totaling 35 634 quality-controlled HPLC pigment suites from the world ocean for the MARine Ecosystem DATa (MAREDAT) initiative (Table 1). The largest fraction (22 %) was contributed by the Laboratoire d'Océanographie de Villefranche (LOV). Additional major contributions were made by the Palmer LTER project (19 %), the US JGOFS program (17 %), the collective US BATS and HOT time series (13 %), and the AMT cruises (14 %), with the remaining significant fraction coming from individual investigators and cruises (see Table 1). The data cover nearly two decades and extend over a wide range of trophic regimes from extremely oligotrophic to near-coastal eutrophic. MAREDAT pigment data include both HPLC-derived chlorophyll a and accessory pigments (e.g., chlorophyll b, 19'-butanoyloxyfucoxanthin, 19'-hexanoyloxyfucoxanthin, alloxanthin, divinyl chlorophyll a, fucoxanthin, lutein, peridinin, prasinoxanthin, violaxanthin and zeaxanthin) in varying proportions within samples, and in varying locations, depths, and months among samples. This suggests the utility of MAREDAT's pigment database for assessing regional, vertical and seasonal scales of phytoplankton community structure and succession.
The following sections describe the data distributions and quality control procedures pertaining to the MAREDAT pigment database. Overall, we intend for the data to be readily exploited by the observational, modeling, and remote sensing communities to deepen understanding of the temporal and spatial diversity of primary producers in the ocean, and the consequences for global carbon fixation.

HPLC analytical methods
The majority of the HPLC data in the database were analyzed following standard protocols, with details provided in the original publications (Table 1). The method employed by the Laboratoire d'Océanographie de Villefranche has not been previously described in detail, and is therefore elaborated here.
Generally, 2.8 L of seawater were filtered onto GF/F Whatman 25 mm filters and frozen until analysis. Prior to 2002, samples were frozen in liquid nitrogen onboard ship, and then stored at −20 °C prior to analysis. After 2002, samples were frozen at −80 °C until analysis. All samples were extracted in 100 % methanol; however, four different analytical protocols (LOV-A, B, C and D) have been applied at the LOV since 1990. Method LOV-A is described in Mantoura and Llewellyn (1983) with gradient elution as in Williams and Claustre (1991). This reversed-phase C18 method only allowed for a partial resolution of divinyl chlorophyll a (DV Chl a) from chlorophyll a (Chl a). Method LOV-B, based on a reversed-phase C8 column, has been described by Vidussi et al. (1996). It is characterized by complete resolution between DV Chl a and Chl a and partial resolution between zeaxanthin (Zeax) and lutein (Lut). Method LOV-C is a modified version of the Vidussi et al. (1996) method and is described by Claustre et al. (2004). It is characterized by a significant gain in sensitivity. However, inaccuracies with this method arise due to coelution of the 19'-hexanoyloxyfucoxanthin (19Hex) and prasinoxanthin (Pras) peaks. Method LOV-D (associated with the BIOSOPE cruise) is a modified version of the C8 reversed-phase Van Heukelem and Thomas (2001) method. It is described by Ras et al. (2008) and is characterized by its capacity for quantifying more than 26 pigments, and for its good resolution of the Zeax/Lut and 19Hex/Pras pairs. Before their submission to MAREDAT, the data underwent quality control procedures including comparison to CTD fluorescence data, detection and elimination of outliers and visual control of each vertical profile.
Since 1999, the LOV analytical procedures have undergone regular unbiased evaluations in the framework of international intercomparison exercises ("round-robins") that comprise NASA's SeaWiFS HPLC Analysis Round-Robin Experiment (SeaHARRE) (see Van Heukelem and Hooker, 2011). SeaHARRE-1 was associated with the PROSOPE project (Hooker et al., 2000), SeaHARRE-2 with the Bencal cruise (Hooker et al., 2005) and SeaHARRE-3 with the BIOSOPE project (Hooker et al., 2009). During the round-robins, the participating laboratories analyzed several series of triplicate samples, resulting in a level of performance associated with each laboratory for a given type of sample. The primary objectives of SeaHARRE were to achieve, with regard to the Sea-viewing Wide Field of View Sensor (SeaWiFS) Project, the pigment accuracy range of 20-25 %, with a 15 % range for significant algorithm refinement. Most of the participants have been within the latter range whether in coastal or offshore waters, and the successive SeaHARRE exercises progressively resulted in the improvement of the different methodologies in order to satisfy a number of quality criteria, to reduce uncertainties and to reach optimal performance metrics. Hence, on an international basis, since the LOV-D method was developed for the BIOSOPE cruise, the analytical performance of the LOV has been ranked "state of the art" (Hooker et al., 2005, 2009, 2010). Further, the LOV dataset will be regularly updated in the future with the growing amount of data (see Appendix A1).

Quality control for the MAREDAT pigment database
Marine HPLC measurements from samples taken at various depths and locations around the globe were collected from individual investigators and online data repositories for the individual field campaigns summarized in Table 1. Sampling years spanned from 1989 to 2008. The full suite of pigments available from each contribution was collected in order to conduct the quality control (QC) analysis. QC procedures followed the protocols of Trees et al. (2000) and Uitz et al. (2006) for data compilations involving contributions from multiple institutes, technicians, and methodologies. A total number of 40 536 discrete samples from 136 field surveys were initially compiled for MAREDAT. The geographic locations of the data that passed the full QC analysis are illustrated in Fig. 1.
The first measure of QC flagged samples ("F_TC") in which total Chl a concentration (TCHLA; mg m⁻³) was zero or less (n = 352), an indication that pigment concentrations were at or below HPLC instrumental detection limits, and thus not resolvable in the sample. TCHLA encompassed all reported Chl a derivatives, including divinyl chlorophyll a, epimers, allomers, and chlorophyllide a. The second QC procedure flagged samples ("F_TA") for which fewer than four non-zero accessory pigments were reported (n = 2306), as prior work has indicated that most algal types possess at least four accessory pigments detectable by HPLC. The third QC measure was based on the log-linear relationship between TCHLA and total accessory pigment concentration (TACC). TACC included photosynthetic carotenoids (19'-butanoyloxyfucoxanthin, 19'-hexanoyloxyfucoxanthin, fucoxanthin, peridinin, prasinoxanthin) and photoprotective carotenoids (alloxanthin, diadinoxanthin, diatoxanthin, lutein, neoxanthin, violaxanthin, zeaxanthin, and carotenes), chlorophylls b and c, and phaeopigments. Log transformation was appropriate for the regression as TCHLA varies by over four orders of magnitude globally (Campbell, 1995; Yoder et al., 1993). This QC criterion was built on the observation that phytoplankton maintain a relatively close ratio between their ancillary pigment and total Chl a concentrations as a consequence of photoacclimation. The least-squares log-linear fit between TCHLA and TACC ($\log_{10}(\mathrm{TACC}) = 0.92\,\log_{10}(\mathrm{TCHLA}) + 0.03$; $r^2 = 0.89$; $p = 0$; $n = 38\,221$) is displayed with the data in Fig. 2. The slope of the regression remains remarkably consistent with that observed by Trees et al. (2000) for a smaller subset of global ocean HPLC measurements. Samples that fell outside the range of two standard deviations of the regression line were flagged ("F_Rat"; n = 2001).
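The three sample-level checks described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used to build the database: the dictionary keys `tchla` and `accessory`, the function name `qc_flags`, and the flag labels as strings are hypothetical. It also assumes that only samples passing the first two checks enter the regression, and that "two standard deviations" refers to the residual standard deviation about the fitted line; the text does not specify either detail.

```python
import math


def qc_flags(samples, n_sd=2.0, min_accessory=4):
    """Apply three sample-level QC checks to a list of pigment samples.

    Each sample is a dict with two hypothetical keys:
      'tchla'     total chlorophyll a (mg m^-3)
      'accessory' list of individual accessory pigment concentrations
    Returns (flags, fit): flags is a list of sets of flag names, fit is
    (slope, intercept, residual_sd) of the log10-log10 regression, or
    None when a fit is not possible.
    """
    flags = [set() for _ in samples]
    fit_rows = []  # (sample index, log10 TCHLA, log10 TACC)
    for i, sample in enumerate(samples):
        nonzero = [c for c in sample["accessory"] if c > 0]
        if sample["tchla"] <= 0:
            flags[i].add("F_TC")
        if len(nonzero) < min_accessory:
            flags[i].add("F_TA")
        # Assumption: only samples passing both checks enter the regression.
        if not flags[i]:
            fit_rows.append(
                (i, math.log10(sample["tchla"]), math.log10(sum(nonzero)))
            )

    n = len(fit_rows)
    if n < 3:
        return flags, None
    x_mean = sum(row[1] for row in fit_rows) / n
    y_mean = sum(row[2] for row in fit_rows) / n
    sxx = sum((row[1] - x_mean) ** 2 for row in fit_rows)
    if sxx == 0:
        return flags, None
    sxy = sum((row[1] - x_mean) * (row[2] - y_mean) for row in fit_rows)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean

    residuals = [(i, y - (slope * x + intercept)) for i, x, y in fit_rows]
    # Assumption: the threshold uses the residual SD about the fitted line.
    resid_sd = math.sqrt(sum(r ** 2 for _, r in residuals) / (n - 2))
    for i, r in residuals:
        if abs(r) > n_sd * resid_sd:
            flags[i].add("F_Rat")
    return flags, (slope, intercept, resid_sd)
```

Returning a set of flags per sample, rather than dropping samples, mirrors the approach of the database, where the original data are distributed together with their quality control flags.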
The criterion of two standard deviations was chosen to evaluate outlying (i.e., potentially erroneous) measurements without compromising observation of the natural fluctuation about the mean relationship (Uitz et al., 2006). As a final QC criterion, if more than 35 % of the samples from a given field campaign (e.g., cruise or time series) were flagged during the third QC step, the entire campaign's samples were flagged ("F_Cr"). This threshold was based on the result of applying Chauvenet's criterion (Buitenhuis et al., 2012b) to the cruise percentages of F_Rat flags, which identified 8 outlying cruises, thus flagging an additional 589 samples (Table 1). Overall, 4902 samples were flagged in the QC process, representing 12.1 % of the original collection. The resulting MAREDAT pigment database thus contains 35 634 high quality HPLC pigment measurement suites.
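The campaign-level step can be illustrated with a short sketch. The function names are hypothetical, and the details are assumptions rather than a description of the procedure actually applied: Chauvenet's criterion is implemented here in its textbook form (a value is rejected when the expected number of equally extreme values among N normally distributed points falls below 0.5), using the sample standard deviation.

```python
import math


def flagged_fraction_by_campaign(campaign_ids, f_rat):
    """Fraction of samples carrying the regression flag, per campaign."""
    counts = {}
    for cid, flagged in zip(campaign_ids, f_rat):
        total, bad = counts.get(cid, (0, 0))
        counts[cid] = (total + 1, bad + (1 if flagged else 0))
    return {cid: bad / total for cid, (total, bad) in counts.items()}


def campaigns_over_threshold(fractions, threshold=0.35):
    """Campaigns whose flagged fraction exceeds the threshold."""
    return {cid for cid, frac in fractions.items() if frac > threshold}


def chauvenet_outliers(values):
    """Textbook Chauvenet's criterion; returns one boolean per value."""
    n = len(values)
    if n < 2:
        return [False] * n
    mean = sum(values) / n
    # Assumption: sample (n - 1) standard deviation.
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if sd == 0:
        return [False] * n
    result = []
    for v in values:
        # Two-sided normal tail probability of a deviation this large.
        p = math.erfc(abs(v - mean) / (sd * math.sqrt(2)))
        result.append(n * p < 0.5)
    return result
```

As a consistency check on the reported totals, 40 536 compiled samples minus 4902 flagged samples gives 35 634, and 4902 / 40 536 is approximately 12.1 %.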
There is a clear trend of increased frequency of flagged samples with increased sample depth, mostly driven by the increase in F_TA flags with depth (Fig. 3a). This is consistent with pigment concentrations decreasing to detection limits well below the base of the euphotic zone. The frequency of F_Rat, the flag most closely related to HPLC analytical uncertainty, occurs roughly in proportion to TCHLA, indicating that analytical fidelity is not specifically biased toward low or high chlorophyll levels except within the minimum TCHLA bin (Fig. 3b).

Properties of the quality-controlled MAREDAT pigment database
The majority of the quality-controlled samples fall within the 0-75 m depth range, with the upper 20 m of the water column accounting for 34 % of the dataset. The Southern Hemisphere data are biased toward the polar region due to the collectively large contribution of samples from the Palmer Station LTER (Fig. 4a; Table 1). The data distribution for the Southern Hemisphere is also skewed toward the month of January for this reason, as Antarctic waters were typically accessed for HPLC sample collection during austral summer. Overall, samples are evenly distributed between the two decades of collection (Fig. 4b). In terms of the seasonal distribution and range in TCHLA values, the MAREDAT pigment database contains good coverage for both hemispheres. The higher variability in chlorophyll concentration in the Northern Hemisphere than in the Southern Hemisphere during each season, as seen from ocean color observations (Yoder et al., 1993), is also represented by the MAREDAT chlorophyll data (Fig. 5a and b). This relatively lower annual chlorophyll variability in the Southern Hemisphere is due to large areas of Southern Ocean iron depletion limiting the spring phytoplankton blooms (Behrenfeld and Kolber, 1999).

Distribution of accessory pigments
The number of samples within the MAREDAT pigment database that contribute to the distribution of each of the accessory pigments is not equal (Fig. 7), as we did not discriminate against contributions without a full suite of accessory pigments. The concentrations of chlorophyll a (Fig. 7a) and the key accessory pigments (Fig. 7b-l) generally exhibit a log-normal distribution about their mean. The DV Chl a distribution has the fewest observations (Fig. 7f), potentially because this pigment is not routinely measured during HPLC analysis on high latitude samples. Figure 8 displays surface concentrations (as the average over the upper 20 m) of the main taxonomic pigments from the entire MAREDAT dataset. Figure 9 displays the zonal annual mean depth distributions for the upper 250 m, also using the entire dataset for averaging. Interpretation of the pigment data in terms of phytoplankton taxonomy (e.g., using an inverse method such as CHEMTAX) is beyond the scope of this ESSD paper. It should also be noted that caution must be exercised when using a single pigment distribution as an unambiguous marker for a given algal taxon (Higgins et al., 2011). For example, the distribution of 19Hex, once proposed as relatively indicative of autotrophic flagellates from the nanoplankton size class (Vidussi et al., 2001), is now thought to arise from many phytoplankton sources (Higgins et al., 2011), and is characterized by a rather ubiquitous global distribution (Liu et al., 2009). This conclusion is further supported by the relatively homogeneous spatial distribution of 19Hex from MAREDAT (Figs. 8c and 9c).
Recognizing the aforementioned precautions, it is nevertheless interesting to consider the distributions of a few key accessory pigments within MAREDAT that may serve as relative markers for the broader phytoplankton groups given in Table 2 (adapted from Vidussi et al., 2001). For example, DV Chl a, which is indicative of the presence of Prochlorococcus spp. (Vidussi et al., 2001), shows highest concentration in   (Figs. 8f and 9f). This finding is consistent with prior reports on Prochlorococcus spp. distributions based on cell counts (Goericke et al., 2000; Buitenhuis et al., 2012a). Similarly, fucoxanthin (Fuco) is widely considered to be diagnostic of diatoms in regions where diatoms dominate the autotrophic abundance (e.g., high latitude productive areas; Wright et al., 2010), and accordingly the MAREDAT data show a bias for Fuco toward polar regions (Figs. 8g and 9g). Finally, the MAREDAT pigment dataset shows that Zeax is relatively confined to the warm tropical and subtropical waters of each basin (Figs. 8l and 9l). Zeax roughly marks the presence of cyanobacteria, which are recognized dominants in these lower latitude regions (Claustre and Marty, 1995; Vidussi et al., 2001; see also global biomass assessment of Synechococcus and Prochlorococcus spp. in Buitenhuis et al., 2012a).

Recommendations for use
The quality-controlled MAREDAT pigment database may be exploited using any number of approaches to estimate phytoplankton community structure in the global ocean. Some of these approaches include the diagnostic pigment method to determine size fractions (Vidussi et al., 2001; Uitz et al., 2006), least-squares solutions (Letelier et al., 1993; Andersen et al., 1996), CHEMTAX analysis (Mackey et al., 1996) and similar approaches (Goericke and Montoya, 1998; Van den Meersche et al., 2008). Regardless of method, care must always be taken to ensure that the regional or global algorithms applied are appropriate for the temporal or spatial scale in question. Furthermore, results should be validated with independent measures of assemblage composition (e.g., flow cell counts, microscopy) when possible. New global MAREDAT compilations of phytoplankton species abundance and biomass data from microscopic and cell counting methods (see this special issue) will aid in mapping discrete populations sampled by the pigment database for an effective global chemotaxonomic analysis. The presence/absence species information from MAREDAT, for example, may better inform accurate selection of "seed" pigment ratios that are needed for achieving robust quantitative results from CHEMTAX (Mackey et al., 1996; Wright et al., 2010). In contrast, abundance-based biomass data from MAREDAT may allow for quantitative comparison with the CHEMTAX results (e.g., Llewellyn et al., 2005). Future assessments will benefit greatly from updated summaries on plankton pigment ratios (Higgins et al., 2011).
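As an illustration of the simplest of these approaches, the sketch below computes size-class fractions from seven diagnostic pigments. The grouping follows the commonly cited scheme of Vidussi et al. (2001) as we understand it and should be checked against Table 2 and the original publication. The pigment keys, the function name `size_fractions`, and the optional `weights` argument are hypothetical. By default all pigments are weighted equally; a weighted variant in the spirit of Uitz et al. (2006) would pass coefficients taken from that paper.

```python
# Assumed grouping of diagnostic pigments into size classes.
DIAGNOSTIC_GROUPS = {
    "micro": ("fuco", "perid"),         # fucoxanthin, peridinin
    "nano": ("allo", "but19", "hex19"),  # alloxanthin, 19'-BF, 19'-HF
    "pico": ("zeax", "tchlb"),          # zeaxanthin, total chlorophyll b
}


def size_fractions(pigments, weights=None):
    """Fraction of the diagnostic pigment sum assigned to each size class.

    pigments: dict of pigment name -> concentration (mg m^-3); missing
              pigments are treated as zero.
    weights:  optional dict of pigment name -> coefficient (default 1.0).
    Returns a dict with keys 'micro', 'nano', 'pico', or None when the
    diagnostic pigment sum is not positive.
    """
    weights = weights or {}
    weighted = {
        name: weights.get(name, 1.0) * pigments.get(name, 0.0)
        for names in DIAGNOSTIC_GROUPS.values()
        for name in names
    }
    total = sum(weighted.values())
    if total <= 0:
        return None
    return {
        size: sum(weighted[name] for name in names) / total
        for size, names in DIAGNOSTIC_GROUPS.items()
    }
```

For example, a sample with 0.2 mg m⁻³ fucoxanthin, 0.1 mg m⁻³ 19'-hexanoyloxyfucoxanthin, and 0.05 mg m⁻³ each of alloxanthin, 19'-butanoyloxyfucoxanthin, zeaxanthin and total chlorophyll b would be assigned 40 % micro-, 40 % nano- and 20 % picophytoplankton under equal weighting.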
Ideal applications for chemotaxonomic workup results of the MAREDAT data are to (a) better ground-truth satellite estimates of phytoplankton functional types or size fractions, (b) determine biological niches in the ocean from phytoplankton community structure, and (c) evaluate MAREDAT-based results against output from ecosystem models that translate physiological observations into phytoplankton community distributions. While climatological mean Chl a signals from both MAREDAT and SeaWiFS are in good agreement, chlorophyll and accessory pigment concentrations in a given region vary widely in time, especially in areas with large seasonality in phytoplankton composition (Yoder et al., 1993). Thus, if results from MAREDAT pigment analysis are to be used for model comparisons, a point-to-point evaluation by month is recommended.
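A point-to-point evaluation by month might be organized as in the following sketch. All names are hypothetical, and the grid convention (1° cells, rows counted from 90° S, columns from 180° W) is an assumption made for illustration rather than a description of any particular model or of the MAREDAT gridded product. The comparison is made in log space because chlorophyll spans several orders of magnitude.

```python
import math


def cell_index(lat, lon):
    """Row and column of the 1-degree cell containing (lat, lon).

    Assumes latitude in [-90, 90] and longitude in [-180, 180].
    """
    row = min(int(math.floor(lat + 90.0)), 179)
    col = min(int(math.floor(lon + 180.0)), 359)
    return row, col


def monthly_matchups(observations, model):
    """Pair each observation with the model value for its month and cell.

    observations: iterable of (month 1-12, lat, lon, value)
    model:        dict mapping (month, row, col) -> value
    Only positive pairs are kept so that they can be log-transformed.
    """
    pairs = []
    for month, lat, lon, obs in observations:
        key = (month,) + cell_index(lat, lon)
        mod = model.get(key)
        if mod is not None and obs > 0 and mod > 0:
            pairs.append((obs, mod))
    return pairs


def mean_log10_bias(pairs):
    """Mean of log10(model / observation) over all matched pairs."""
    return sum(math.log10(mod / obs) for obs, mod in pairs) / len(pairs)
```

Matching on the sampling month, rather than on an annual mean, keeps a model that reproduces the climatological mean but misses the seasonal cycle from appearing unbiased.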
The MAREDAT pigment database is viewed as a living dataset, with the expectation of new HPLC data contributions and improvements in the future. We anticipate that new product-specific datasets, such as those forming the MAREDAT initiative, will permit exciting synergies and scientific gains across several disciplines.

A1 Data table
A full data table containing all pigment suites can be downloaded from the data archive PANGAEA: http://doi.pangaea.de/10.1594/PANGAEA.793246. The data file contains longitude, latitude, depth, sampling date and time, total chlorophyll a and total accessory pigment concentrations, the individual pigment concentrations, all quality flags, as well as the full data references. As a subset, a distinct archive reference is dedicated to the dataset originating from the LOV: http://doi.pangaea.de/10.1594/PANGAEA.808535.

A2 Gridded netCDF product
HPLC-derived pigment concentrations (chlorophyll a, 19'-butanoyloxyfucoxanthin, 19'-hexanoyloxyfucoxanthin, alloxanthin, chlorophyll b, divinyl chlorophyll a, fucoxanthin, lutein, peridinin, prasinoxanthin, violaxanthin, and zeaxanthin) have been gridded onto a 360 × 180° grid, with a vertical resolution of 33 depth levels (equivalent to World Ocean Atlas depths) and a temporal resolution of 12 months (climatological monthly means). Data have been converted to netCDF format for easy use in model evaluation exercises. The