Vertical distribution of chlorophyll a concentration and phytoplankton community composition from in situ fluorescence profiles: a first database for the global ocean

In vivo chlorophyll a fluorescence is a proxy of chlorophyll a concentration and is one of the most frequently measured biogeochemical properties in the ocean. Thousands of profiles are available from historical databases, and the integration of fluorescence sensors on autonomous platforms has led to a significant increase in chlorophyll fluorescence profile acquisition. To our knowledge, this important source of environmental data has not yet been included in global analyses. A total of 268 127 chlorophyll fluorescence profiles from several databases as well as published and unpublished individual sources were compiled. Following a robust quality control procedure detailed in the present paper, about 49 000 chlorophyll fluorescence profiles were converted into phytoplankton biomass (i.e. chlorophyll a concentration) and size-based community composition (i.e. microphytoplankton, nanophytoplankton and picophytoplankton), using a method specifically developed to harmonize fluorescence profiles from diverse sources. The data span over five decades, from 1958 to 2015, and include observations from all major oceanic basins and all seasons, at depths ranging from the surface to a median maximum sampling depth of around 700 m. Global maps of chlorophyll a concentration and phytoplankton community composition are presented here for the first time.


Introduction
Phytoplankton biomass is generally recognized to play a key role in the global carbon cycle, which stresses the need for a better understanding of its spatio-temporal distribution and variability in the global ocean. Chlorophyll a concentration is widely used as a proxy for phytoplankton biomass. The geographic and temporal distribution of this proxy is already well documented at the global scale thanks to synoptic remote sensing observations by Ocean Color Radiometry (OCR; McClain, 2009; Siegel et al., 2013). Nevertheless, OCR observations are restricted to the ocean surface layer, "sensing" only one fifth of the so-called euphotic layer, where phytoplankton photosynthesis takes place and which can sometimes extend well below 100 m (Gordon and McCluney, 1975; Morel and Berthon, 1989). It is therefore essential to better resolve the vertical dimension of the global distribution of phytoplankton biomass.
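The "one fifth" figure quoted above can be recovered from a simple optical argument. The sketch below assumes a vertically uniform diffuse attenuation coefficient, written K_d here (a symbol introduced for illustration only), takes the satellite-sensed layer to be the first optical depth z_pd, and uses the 1 % irradiance definition of the euphotic depth Z_e:

```latex
E(z) = E(0)\,e^{-K_d z}, \qquad
z_{pd} = \frac{1}{K_d}, \qquad
Z_e = \frac{\ln 100}{K_d} \approx \frac{4.6}{K_d}, \qquad
\frac{z_{pd}}{Z_e} \approx \frac{1}{4.6} \approx 0.22
```

Under these assumptions the remotely sensed layer covers roughly 20 % of the euphotic layer. In stratified waters K_d varies with depth, so the ratio is only indicative.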
[ESSDD 8, 2015: Database of global ocean chlorophyll a profiles, R. Sauzède et al.]

The vertical distribution of chlorophyll a can be estimated with the greatest accuracy from the analysis of water samples by High Performance Liquid Chromatography (HPLC; Claustre et al., 2004; Peloquin et al., 2013). However, these in situ measurements are relatively scarce because their acquisition requires ship-based sampling and their analysis is costly. Moreover, because these measurements are made on water samples, the vertical resolution is generally coarse (e.g. around one measurement every 10 m). The measurement of in vivo chlorophyll a fluorescence is widely used as a proxy for chlorophyll a concentration (Lorenzen, 1966). Apart from dissolved oxygen concentration, fluorescence is the most frequently measured biogeochemical property in the global ocean. The advantages of this method are as follows: (1) it can be easily measured in situ using reliable sensors; (2) the vertical resolution is high, yielding several values per meter; and (3) data are available in digital format immediately after their acquisition. The integration of fluorescence sensors on autonomous platforms (e.g. profiling floats, animals, gliders) has recently led to a burst in the acquisition of in vivo chlorophyll a fluorescence data (Claustre et al., 2010a). However, the relationship between chlorophyll a fluorescence and phytoplankton biomass is highly variable and depends on several factors, including phytoplankton physiological state and community composition (Cunningham, 1996; Falkowski et al., 1985; Kiefer, 1973). The conversion of in situ chlorophyll a fluorescence measurements into phytoplankton biomass must therefore be done with great care. FLAVOR (Fluorescence to Algal communities Vertical distribution in the Oceanic Realm) is a method developed to transform and combine large numbers of fluorescence profiles from various sampling sensors and platforms (Sauzède et al., 2015a).
This neural network-based method generates vertical distributions of (1) chlorophyll a concentration and (2) phytoplankton community size indices (i.e. microphytoplankton, nanophytoplankton and picophytoplankton) based on the shape of in situ fluorescence profiles (i.e. normalized profiles) and the day and location of acquisition. In addition to chlorophyll a concentration, community composition is an essential variable determining the possible impact of phytoplankton on oceanic carbon fluxes and climate change scenarios (e.g. Le Quere et al., 2005). Global data compilations of phytoplankton community composition from discrete water samples have recently been published in ESSD (Peloquin et al., 2013), but data remain rather sparse. A database of phytoplankton community size indices with the same spatio-temporal resolution as the fluorescence datasets would be an invaluable source of information. Using the FLAVOR method, it now becomes possible to transform and combine all available in situ fluorescence data into a single reference database that comprises essential information on the vertical distributions of chlorophyll a concentration and phytoplankton community size indices.

Presently, the widely used climatology of the global vertical distribution of chlorophyll a concentration is the one published in the World Ocean Atlas 2001. This climatology is based on estimates from analyzed water samples available in the World Ocean Database (WOD; Levitus et al., 2013) and the World Data Center (WDC, http://gcmd.gsfc.nasa.gov/). Based on seven discrete depths (0, 10, 20, 30, 50, 75 and 100 m), it is mainly limited by the scarcity of in situ estimates of chlorophyll a concentration, which leads to strong spatial interpolation of the data. Moreover, the discrete depths used to compute the climatology fail to finely reproduce the vertical distribution of the phytoplankton biomass, especially in areas characterized by very deep (> 100 m) Deep Chlorophyll Maxima (DCM), such as the core of subtropical oligotrophic gyres. Using FLAVOR, the high vertical (around one data point per meter) and spatial resolution of chlorophyll fluorescence measurements could significantly improve 3-D climatologies of chlorophyll a concentration.

Moreover, climatologies of phytoplankton community size indices could be created with a similar spatio-temporal resolution. This paper presents a global compilation of chlorophyll fluorescence profiles obtained from online databases and from published and unpublished individual sources. These were converted into a global compilation of phytoplankton biomass (i.e. chlorophyll a concentration) and community composition using the FLAVOR method. Prior to the application of FLAVOR, a ten-step quality control procedure was specifically developed. The profiles that passed it were then analyzed. As examples of application, we present the first maps of global mean chlorophyll a concentration for several oceanic layers as well as global maps of phytoplankton community size indices. To further assess the quality of the resulting database, the climatological chlorophyll a concentration computed here for the surface layer is compared to the climatological remotely sensed chlorophyll a concentration available from MODIS Aqua. Moreover, monthly 3-D climatologies of chlorophyll a concentration and associated phytoplankton community size indices are analyzed for several ecological provinces defined by Longhurst (2010). Overall, the dataset presented here can be readily exploited to deepen our understanding of the spatio-temporal distribution and variability of phytoplankton biomass and associated community composition in the global ocean. It is a first step towards a database that will be regularly improved thanks to the ongoing intensification of chlorophyll a fluorescence profile acquisition by Bio-Argo profiling floats, gliders and instrumented mammals.

[...] Sauzède et al., 2015c). The in situ vertical fluorescence profiles compiled to create the raw database were obtained from several available online databases as well as published and unpublished individual sources.

Duplicates and single-surface values, which are not vertical profiles, were automatically removed (i.e. not integrated into the raw database). In total, the raw database contains 268 127 fluorescence profiles. Following a robust quality control procedure detailed hereafter (Sect. 2.2), about 49 000 chlorophyll fluorescence profiles were converted into phytoplankton biomass (i.e. chlorophyll a concentration) and size-based community composition (i.e. microphytoplankton, nanophytoplankton and picophytoplankton). The origin of this calibrated database is summarized in Table 1. The majority of the data come from the National Oceanographic Data Center (NODC) and from the fluorescence profiles acquired by Bio-Argo floats available on the Oceanographic Autonomous Observations (OAO) web platform (63.7 and 12.5 %, respectively; Table 1 gives the percentage of data in the database by origin). Different modes of acquisition were used to collect the data presented in this study: (1) the CTD profiles are acquired using a fluorometer mounted on a CTD-rosette, (2) the OSD (Ocean Station Data) profiles are derived from water samples analyzed by fluorometry and are defined as "low" resolution profiles (Boyer et al., 2009) [...] Guinet et al., 2012). Table 2 lists the number of fluorescence profiles in the raw database according to these four modes of acquisition.
It is worth noting that the data acquired from gliders were not included in the database. Although glider data are extremely numerous, they are restricted to a very small spatio-temporal window. As a consequence, a database including glider data would likely be spatially and temporally biased, in contradiction with our primary aim of building a global climatological database.

In order to use the FLAVOR method (see details in Sect. 2.3), a specific and adapted data quality control procedure was developed and applied to each in situ chlorophyll fluorescence profile. This procedure was organized into four main steps of data control (Fig. 1), each step being designed to discard most, if not all, spurious fluorescence profiles that would degrade the quality of the database. Firstly, several basic tests were applied: (1) duplicates and single-surface values, which are not vertical profiles, were removed (these profiles were removed at the very beginning of the process, so they are not included in the so-called raw database); (2) coastal profiles were removed using a bathymetric mask of 500 m depth; (3) the uppermost measurement had to be located within the 0-10 m layer, while the deepest measurement had to be at or below 100 m. Secondly, tests on the vertical resolution of the profile were applied: (4) a minimum of ten values per profile was required (i.e. a condition on the vertical resolution of acquisition); (5) a minimum of five different values per profile was required (i.e. a condition on the sensor resolution). Then, several tests on the shape of the fluorescence profile were applied. These conditions are based on the parameter used for the [...] 20 m has to be greater than the median of the values of the last 10 % of the deepest samples of the profile (see Fig. 1a); (7) the depth Z0 has to be within the last 10 % of the deepest samples of the profile (see Fig. 1b).
Finally, a test on the noise of the profiles was developed and applied: (8) profiles with aberrant data caused by electronic noise are removed (i.e. variability greater than 20 % of the total profile range, see Fig. 1c).
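As an illustration, tests (3) to (5) and the noise test (8) described above could be sketched as follows. This is a minimal sketch and not the authors' implementation: the function names and the list-based profile layout are hypothetical, and the noise metric is an assumption, because the text does not define "variability" precisely. Tests (1), (2), (6) and (7) are not covered.

```python
def passes_basic_tests(depths, fluo):
    """Return True if a profile passes tests (3) to (5) from the text.

    depths: sample depths in metres, positive downward (hypothetical layout)
    fluo:   fluorescence values, same length as depths
    """
    if len(depths) != len(fluo) or not depths:
        return False
    # (3) shallowest sample within 0-10 m, deepest sample at or below 100 m
    if not (0.0 <= min(depths) <= 10.0 and max(depths) >= 100.0):
        return False
    # (4) at least ten values per profile
    if len(fluo) < 10:
        return False
    # (5) at least five different values per profile
    if len(set(fluo)) < 5:
        return False
    return True


def is_noisy(fluo, threshold=0.20):
    """Test (8), under one ASSUMED reading of 'variability'.

    Assumption: a profile is flagged as noisy when the largest jump between
    two consecutive samples exceeds `threshold` times the total profile
    range. The source does not define the metric; this is illustrative only.
    """
    total_range = max(fluo) - min(fluo)
    if total_range == 0:
        return False
    largest_jump = max(abs(b - a) for a, b in zip(fluo, fluo[1:]))
    return largest_jump > threshold * total_range
```

A profile would be retained at this stage when `passes_basic_tests` returns True and `is_noisy` returns False. The 20 % threshold comes from the text, whereas the consecutive-sample metric is a placeholder to be replaced by the actual definition.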

Finally, all remaining fluorescence profiles were checked visually. The number of fluorescence profiles rejected at each step of the quality control procedure is presented in Table 3. Around 80 % of the raw fluorescence profiles were removed by this procedure. This step is an essential prerequisite for the development of a "clean" database of vertical distributions of phytoplankton biomass and community composition in the global ocean. The quality control procedure removed 77, 71, 28 and 25 % of the OSD, UOR, AP and CTD profiles, respectively, not counting the profiles removed by the bathymetry test.

In order to assess the vertical distribution of the total chlorophyll a concentration (hereafter [TChl]) and of the chlorophyll a concentration associated with each phytoplankton size index (hereafter [microChl], [nanoChl] and [picoChl] for microphytoplankton, nanophytoplankton and picophytoplankton, respectively), the FLAVOR method (Sauzède et al., 2015a) is applied to each chlorophyll fluorescence profile satisfying the quality control procedure (see Sect. 2.2). In summary, FLAVOR is a neural network-based method which uses as inputs (1) [...] (Sauzède et al., 2015a). It is important to note that one of the main limitations of FLAVOR is that the impact of daytime Non-Photochemical Quenching (NPQ; see e.g. Cullen and Lewis, 1995), which is responsible for a decrease of chlorophyll fluorescence values at high irradiance, is not accounted for by the method. If density profiles are available together with fluorescence profiles, the NPQ could be corrected using the method of Xing et al. (2012), which involves substituting the fluorescence values acquired within the mixed layer by the maximum value within this layer. An additional quality control step is applied once the FLAVOR method has been run.
This additional step is based on Chauvenet's criterion, which is used to identify statistical outliers in the retrieved biomass data (Buitenhuis et al., 2013; Glover et al., 2011; O'Brien et al., 2013). The criterion was applied to the surface data of each profile (median of values from the surface down to 20 m). As Chauvenet's criterion assumes that the data follow a normal distribution, the analysis was performed on the log-transformed surface [TChl] values. Such a criterion removes aberrant data partially caused by failures of the FLAVOR method (see the number of profiles removed by Chauvenet's criterion in Table 3).

The present database corresponds to model outputs obtained once the FLAVOR method has been applied. It has been mentioned previously that this method is not suited to the retrieval of chlorophyll a concentration on a profile-by-profile basis (Sauzède et al., 2015a). Rather, FLAVOR and hence the resulting database are relevant for large-scale investigations, e.g. the development of climatologies of the vertical distribution of chlorophyll a from which regional anomalies or temporal trends might be identified.

[...] illustrates the potential of this new type of acquisition, which is expected to dramatically increase the number of collected fluorescence profiles in the future. Vertically, the database includes values of total chlorophyll a concentration and associated phytoplankton community composition from the surface down to a mean sampling depth of 743 m (with a maximum sampling depth ranging from 100 to 6000 m; Fig. 5b).
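The outlier screening by Chauvenet's criterion mentioned above can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name is hypothetical, the base-10 logarithm is an assumption (the text only states that the values were log-transformed), and the input is assumed to hold one positive surface [TChl] value per profile, with at least two profiles.

```python
import math
from statistics import mean, stdev


def chauvenet_keep_mask(surface_tchl):
    """Flag values to keep under Chauvenet's criterion on log10 data.

    surface_tchl: positive surface [TChl] values, one per profile
                  (at least two values are needed for the standard deviation)
    Returns a list of booleans (True = keep, False = outlier).
    """
    logs = [math.log10(v) for v in surface_tchl]
    n = len(logs)
    mu = mean(logs)
    sigma = stdev(logs)  # sample standard deviation
    if sigma == 0:
        return [True] * n
    keep = []
    for x in logs:
        z = abs(x - mu) / sigma
        # Two-sided probability that a standard normal deviate exceeds |z|
        tail = math.erfc(z / math.sqrt(2.0))
        # Chauvenet: reject when fewer than 0.5 such values are expected
        keep.append(n * tail >= 0.5)
    return keep
```

The criterion is applied here in a single pass. Whether the original procedure iterated after removing outliers is not stated in the text.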

Vertical distribution of the chlorophyll biomass
We present the database with respect to the vertical distribution of the total chlorophyll a concentration ([TChl]). Figure [...] that is typical of these oligotrophic regions (e.g. Cullen, 1982; Mignot et al., 2011, 2014). The global distribution of the phytoplankton community composition, given in terms of the fraction of chlorophyll a concentration associated with micro-, nano- and picophytoplankton, is presented for the 0-1.5 Ze layer (Fig. 7a-c, respectively). Here Ze, the euphotic depth, is defined as the depth at which the irradiance is reduced to 1 % of its surface value. It was estimated according to the method of Morel and Berthon (1989), using the [TChl] profiles derived from FLAVOR. Figure 7 reveals general geographic patterns that are consistent with current knowledge of ecological domains and biogeochemical provinces (e.g. Longhurst, 2010). On average, microphytoplankton are dominant in the subarctic zone, with a relative contribution to the chlorophyll biomass reaching more than 70 % in these areas (Fig. 7a) [...] a contribution reaching 45-55 % (Fig. 7c). Nanophytoplankton appear to be ubiquitous, with a relatively stable contribution to biomass of 40-50 % (Fig. 7b).
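The size-class fractions over the 0-1.5 Ze layer discussed above could be computed from a single set of retrieved profiles along the following lines. This is a hedged sketch: the function names are hypothetical, Ze is taken as a given input rather than derived with the Morel and Berthon (1989) method, depths are assumed to be strictly increasing, and the fractions are expressed relative to the sum of the three size classes, which is an assumption about how the fractions were defined.

```python
def integrate_to_depth(depths, values, z_max):
    """Trapezoidal integral of `values` over depth, from depths[0] to z_max.

    If the profile ends above z_max, integration stops at the last sample.
    """
    total = 0.0
    for i in range(len(depths) - 1):
        z0, z1 = depths[i], depths[i + 1]
        v0, v1 = values[i], values[i + 1]
        if z0 >= z_max:
            break
        if z1 > z_max:
            # Interpolate linearly to z_max and truncate the last segment
            v1 = v0 + (v1 - v0) * (z_max - z0) / (z1 - z0)
            z1 = z_max
        total += 0.5 * (v0 + v1) * (z1 - z0)
    return total


def size_fractions(depths, micro_chl, nano_chl, pico_chl, z_e):
    """Fraction of depth-integrated chlorophyll per size class over 0-1.5 Ze."""
    z_max = 1.5 * z_e
    integrals = {
        "micro": integrate_to_depth(depths, micro_chl, z_max),
        "nano": integrate_to_depth(depths, nano_chl, z_max),
        "pico": integrate_to_depth(depths, pico_chl, z_max),
    }
    total = sum(integrals.values())  # assumed non-zero
    return {name: value / total for name, value in integrals.items()}
```

Integrating before taking the ratio weights each depth by its chlorophyll content. Averaging the per-depth fractions instead would give a different number, and the text does not say which convention was used for Fig. 7.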
To further assess the quality, range and representativeness of the FLAVOR-retrieved [TChl] database presented in this study, the retrieved surface [TChl] is compared to the remotely sensed [TChl]. To this end, the climatological mean [TChl] was extracted at a 9 km spatial resolution from the NASA MODIS Aqua archive for the period 2002 to 2014. The extracted satellite [TChl] data were re-gridded to a 3° × 3° spatial resolution. Similarly, the FLAVOR-retrieved [TChl] values for the upper layer of the database (i.e. the mean value calculated between the surface and 20 m) for the same period were re-gridded to 3° × 3° squares. Figures 8 and 9 show that the climatologically averaged [TChl] from MODIS Aqua and from the present database are generally consistent (Fig. 8a and b). The log-transformed ratio of the MODIS Aqua to the database [TChl] estimates reveals a rather good agreement, with a median value of −0.16 and a standard deviation of 0.58 (see histogram in Fig. 8c). Figure 9 displays the geographic distribution of the log-transformed ratio between the MODIS Aqua and the database estimates of climatological surface [TChl]. The ratio shows no specific spatial bias. The FLAVOR method was validated in all ocean basins individually. The retrievals obtained for the Southern and the Arctic Ocean were slightly less accurate than for the other basins. It is therefore possible that the estimation errors are greater in these areas, particularly because of insufficient data density. This observation has also to be nu[...]
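The comparison described above, in which both datasets are re-gridded to 3° × 3° cells before the log-transformed ratio is summarized, could be sketched as follows. The function names and the `(lat, lon, value)` record layout are hypothetical, the per-cell arithmetic mean and the base-10 logarithm are assumptions, and at least two shared cells are needed for the standard deviation.

```python
import math
from statistics import mean, median, stdev


def grid_cell(lat, lon, size=3.0):
    """Index of the size x size degree cell containing (lat, lon)."""
    return (math.floor((lat + 90.0) / size), math.floor((lon + 180.0) / size))


def grid_mean(records, size=3.0):
    """Average values per grid cell; records are (lat, lon, value) tuples."""
    cells = {}
    for lat, lon, value in records:
        cells.setdefault(grid_cell(lat, lon, size), []).append(value)
    return {cell: mean(vals) for cell, vals in cells.items()}


def log_ratio_stats(satellite, database):
    """Median and standard deviation of log10(satellite / database).

    Both arguments map grid cells to mean [TChl]; only cells present in
    both are compared.
    """
    shared = sorted(set(satellite) & set(database))
    ratios = [math.log10(satellite[c] / database[c]) for c in shared]
    return median(ratios), stdev(ratios)
```

With this sign convention, a negative median such as the −0.16 reported in the text corresponds to satellite values that are, in the median, lower than the database values.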

Example of application: climatological time series of the vertical distribution of chlorophyll a concentration and phytoplankton community composition
As an example of application, monthly climatologies were computed for three ecological provinces defined by Longhurst (2010; Fig. 10a): (1) the North Atlantic Subtropical Gyral Province West (NASW, Fig. 10b), (2) the Atlantic Subarctic Province (SARC, Fig. 10c) and (3) the North Pacific Subtropical Gyre Province (NPTG, Fig. 10d). Overall, the time series of the vertical distribution of [TChl] are consistent with expectations as detailed by Longhurst (2010). For the NASW province (Fig. 10b), the [TChl] is relatively homogeneous from the surface to [...] The [TChl] at the DCM reaches a maximum value in June and July. The dominant phytoplankton groups are the nano- and picophytoplankton, with relative contributions reaching 45-50 % for both size-based groups and slightly opposite temporal evolutions. The contribution of microphytoplankton remains low throughout the year (< 10 %).

The phytoplankton biomass (i.e. chlorophyll a concentration) and phytoplankton community size indices were derived from chlorophyll fluorescence profiles using a dedi[...] Sauzède et al., 2015a). For the first time, in situ chlorophyll fluorescence profiles from various data centers have been collected and synthesized in a global dataset to create unified and interoperable products related to chlorophyll a concentration and phytoplankton communities. This work can thus be considered as a first step towards the development of a 3-D climatological representation of chlorophyll a concentration and phytoplankton community composition. We recall here that this database should not be used on a profile-by-profile basis. Instead, it should be used to derive climatologies from which regional or temporal trends might be extracted. To date, because of the lack of in situ vertical data, the identification of such trends has been based exclusively on remotely sensed surface data (Beaulieu et al., 2013; Boyce et al., 2010; Gregg, 2005; Gregg and Conkright, 2002).
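A monthly climatology of the kind recommended above could be derived from the profiles of one province along these lines. This is only a sketch: the function name, the `(month, depths, tchl)` profile layout and the 10 m depth bins are illustrative assumptions and are not taken from the paper.

```python
from collections import defaultdict
from statistics import mean


def monthly_climatology(profiles, bin_size=10.0):
    """Mean [TChl] per (month, depth bin) for a set of profiles.

    profiles: iterable of (month, depths, tchl), with month in 1..12 and
              depths in metres (hypothetical layout)
    Returns {(month, bin_top_depth): mean value}.
    """
    buckets = defaultdict(list)
    for month, depths, tchl in profiles:
        for z, value in zip(depths, tchl):
            # Top depth of the bin that contains this sample
            bin_top = int(z // bin_size) * bin_size
            buckets[(month, bin_top)].append(value)
    return {key: mean(vals) for key, vals in buckets.items()}
```

All samples falling in a bin are pooled, so finely resolved profiles weigh more than coarse ones. Averaging each profile within a bin before averaging across profiles would remove that weighting, at the cost of a slightly longer function.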
The present dataset offers a potential refinement of open-ocean climatologies of chlorophyll a with respect to the vertical dimension. Finally, this database has to be considered as a reference that has the potential to evolve. It is now clear that numerous fluorescence profiles will be acquired through robotic observations (e.g. Claustre et al., 2010b; Johnson et al., 2009). In fact, about a sixth of the profiles of the present database were sampled by Bio-Argo profiling floats in only two years. Therefore, the database proposed here represents a first step towards a single global reference database reconciling the oldest datasets of chlorophyll fluorescence with future ones, mostly acquired remotely by autonomous platforms.