A global diatom database – abundance, biovolume and biomass in the world ocean

Phytoplankton identification and abundance data are now commonly feeding plankton distribution databases worldwide. This study is a first attempt to compile the largest possible body of data available from different databases as well as from individual published or unpublished datasets regarding diatom distribution in the world ocean. The data obtained originate from time series studies as well as spatial studies. This effort is supported by the Marine Ecosystem [...] > 90 % of global biomass, among which centric species were dominant. Thus, placing significant efforts on cell size measurements, process studies and C quota calculations of these species should considerably improve biomass estimates in the upcoming years. A first-order estimate of the diatom biomass for the global ocean ranges from 444 to 582 Tg C, which converts to 3 to 4 Tmol Si and to an average Si biomass turnover rate of 0.15 to 0.19 d −1. Link to the dataset: doi:10.1594/PANGAEA.777384.


Introduction
Marine ecosystems are characterized by large species diversity, yet the succession and distribution of the main taxa are still poorly understood. Plankton diversity is often narrowed down to the notion of functional group, which can be defined as a group of organisms operating the same biogeochemical process and driving the flux of the main biogenic elements differently from other groups. Functional groups have been further organized into plankton functional types (PFTs) (Le Quéré et al., 2005;Hood et al., 2006) in order to help construct biogeochemical models including diversity in a simplified way. Main PFTs include diatoms, calcifying organisms, nitrogen fixers, pico-autotrophs, pico-heterotrophs and various zooplankton groups. Diatoms are a large component of marine biomass and produce ∼ 25 % of the total C fixed on Earth (Field et al., 1998), producing more organic C than all rainforests combined. Another striking image to consider is that they produce one fifth of the oxygen we breathe. Therefore, they have a major ecological significance and impact on the global elemental Si and C cycles (Ragueneau et al., 2000;Tréguer, 2002;Jin et al., 2006). Diatoms also have a high export/production ratio due to elevated sedimentation rates by forming aggregates and incorporation into fast sinking zooplankton faeces. Diatoms are, along with dinoflagellates, today's most diverse planktonic flora. A current estimate of all living diatoms ranges from 10 000 to 100 000 species, but a smaller fraction, from 1400 to 1800 species, is recognized as marine planktonic (Sournia et al., 1991). Major progress has been made in the last decades on in situ Si dynamics, thereby improving models, but the knowledge of biological factors such as species composition, cell morphology and aggregation processes still needs to be improved (Hood et al., 2006).
Satellite data now allow a closer definition of functional groups from space (Alvain et al., 2005;Uitz et al., 2006), and this effort has been most fruitful on coccolithophores (Brown and Yoder, 1994;Iglesias-Rodriguez et al., 2002) but has also been recently attempted on Trichodesmium (Dupouy et al., 2008) and diatoms (Sathyendranath et al., 2004). However, many challenges remain with this approach, a major bias being the impossibility of capturing subsurface blooms but also assessing variable cellular pigment quotas. Hence, Dynamic Green Ocean Models (DGOM) still need validating with datasets giving C biomass estimates for each PFT. Improving the parameterization for diatoms in various biogeochemical models would thus help improve the global C budget and the subsequent fate of exported particulate matter with respect to depth estimations.
Phytoplankton identification and abundance data are now regularly added to plankton databases worldwide but need to be regrouped so that they can be useful to the biogeochemistry and modeling community. This study is the first attempt to compile the largest possible body of available data from these different databases as well as from individual datasets regarding diatom distribution in the world ocean. This study is supported by the MAREMIP program, which aims at building consistent datasets for the major PFTs in order to provide validation sets for biogeochemical ocean models. This paper is part of the special issue dedicated to providing global databases (named Marine Ecosystem Data, MAREDAT) on the nine main PFTs for their abundance and C biomass.
Diatom cell sizes range from a few micrometers up to 2 mm and their cellular biovolumes span over nine orders of magnitude. Subsequent C conversion estimates are therefore prone to large errors if cell size is not correctly assessed. The challenge posed by compiling a global database on diatom abundance, biovolume and biomass is the large intraspecific variability observed in diverse parts of the world ocean and in the same area depending on environmental conditions and life stages.
Plankton identification and counting is sometimes rewarding, but is most often considered a tedious task, one that cannot be completed "without ruin of the body and mind" as Haeckel (1890) humorously phrased it. Systematic cell size measurements, biovolume and biomass conversion are even more challenging. An additional objective of this study is to provide a tool for taxonomists worldwide to facilitate these measurements and calculations in a standardized way during routine cell counts.
The objective of this study is to promote the construction of an extensive diatom database with standardized methods for collection, counting, data management and conversion to biomass used to assess the global importance of diatoms in marine productivity and provide field data for biogeochemical models including PFTs. An extensive bibliographic search was undertaken to compile all available diatom dimensions for all reported species. This will allow a first estimation of the contribution of diatoms to the global C budgets based on field data. A quantitative and qualitative description of the main features of diatom biomass distribution is presented in the following study. This effort has been initiated in the PANGAEA database, where individual collections are available, but should be the object of supplementary efforts to systematically include cell sizes in a standardized way (see methods section) in future studies.

Data collection
Data were collected through a first round of mail enquiries addressed to an extensive list of taxonomists. A second round of enquiries was sent to the administrators of the main known databases (PANGAEA, BODC, NODC, NMSF-Copepod, etc.) for access to their datasets. Finally, recent oceanographic cruises or research programs or time series that were known to include taxonomic data were identified and permission for use in the present database was acquired from each owner. The entries for each data point included date of collection, sampling depth, latitude, longitude, taxonomic information, abundance with unit, and if possible, sampling, preservation and counting methods. The latter information was most difficult to obtain for old datasets where the contact person could not be identified or had retired.
We collected over 293 000 individual geo-referenced data points with diatom abundances mostly from bottle sampling (Niskin, Hansen or other appropriate bottle sampling device). A very small fraction of the database included net hauls or Continuous Plankton Recorder (CPR) data, which were excluded from the present database as it is quite difficult to reconstruct quantitative cellular concentrations from them and because of their bias towards collecting larger cells. After filtering out zero abundance data, net haul data, erroneous data and after statistical treatment (see Sect. 2.4), 91 704 data points with associated cell abundance remained, 90 648 of which were converted to C biomass. A total of 607 different taxonomic species and 136 different genera were identified after spell checking and taxonomic nomenclatural verification. The entire data treatment process is described in the flow diagram in Fig. 1.

Biomass conversion procedure
Measured cell sizes are rarely or vaguely indicated in phytoplankton databases. Clearly, more effort is needed on building accurate taxonomic databases with associated species size range for each oceanic and coastal region. In order to reconstruct each species cell size, one option is to consider the minimum and maximum dimensions of each species and derive minimum, maximum and average biovolumes and associated C biomass. Such efforts have for instance been successfully undertaken in the Baltic Sea by the HELCOM Phytoplankton Expert Group (PEG), and resulted in a report compiling a complete list of species with their measured dimensions and biovolumes (Olenina et al., 2006). In this study, the authors put an emphasis on the "hidden dimension" of cells, as some algal dimensions are seldom visible in the microscope during routine cell counts and hence are almost never documented. This is typically the case for the pervalvar axis of many diatoms, which most often lie on their valve face after sedimentation on a glass slide. In most cases assumptions are made regarding this hidden dimension (an example for an assumption can be pervalvar axis = 1/3 of the apical axis), but this information is mostly absent from taxonomic guides, which give at best one or two of the cell dimensions. Hence, further attentiveness is required to document consistent ratios between visible and hidden dimensions for the main diatom species.
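The reconstruction described above, from literature minimum and maximum dimensions to minimum, maximum and average biovolumes, can be sketched for the simplest case of a cylindrical cell. This is only an illustration: the function name, the example diameters and the default ratio (pervalvar axis = 1/3 of the diameter) are assumptions made here, and the cylinder formulas are the standard geometric ones rather than a transcription of any particular guide.

```python
import math


def cylinder_biovolume(diameter_um, pervalvar_um=None, hidden_ratio=1.0 / 3.0):
    """Return (volume in um^3, surface area in um^2) for a cylindrical cell.

    If the pervalvar axis (the "hidden dimension") is unknown, it is
    assumed to be hidden_ratio * diameter. This ratio is an assumption.
    """
    if pervalvar_um is None:
        pervalvar_um = hidden_ratio * diameter_um
    volume = math.pi / 4.0 * diameter_um ** 2 * pervalvar_um
    surface = math.pi * diameter_um * (diameter_um / 2.0 + pervalvar_um)
    return volume, surface


# Hypothetical literature size range for one species (diameters in um).
v_min, s_min = cylinder_biovolume(30.0)
v_max, s_max = cylinder_biovolume(90.0)
v_avg = (v_min + v_max) / 2.0  # average of the two extreme biovolumes

print(f"min {v_min:.0f} um^3, max {v_max:.0f} um^3, average {v_avg:.0f} um^3")
```

Because volume scales with the cube of the linear size when the ratio is fixed, a threefold range in diameter becomes a 27-fold range in biovolume, which is why the assumed hidden dimension weighs so heavily on the final estimate.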
In the last decade, a couple of significant studies (Hillebrand et al., 1999;Sun and Liu, 2003) have produced detailed guides of biovolume calculations for phytoplankton species, taking into account the variety and complexity of the numerous diatom shapes by assimilating them into standardized geometric models (19 different shapes were used for this study), which should help harmonize biovolume calculations considerably. As it is not possible to measure every cell's dimensions in one sample, it is usually recommended to measure all dimensions for 25 cells of each species and use the mean value of the obtained cell volume for all occurrences of the same species, although in most cases the standard error in mean biovolume calculation is < 5 % after the measurements of 10 cells (Sun and Liu, 2003). However, Hillebrand et al. (1999) emphasized that seasonal, interannual, spatial and life cycle variations render it inaccurate to use average biovolume data of species throughout the year. Therefore, strict quality standards imply that biovolume should be calculated for each subset of samples, sometimes including different sampling depths of the same water body (Hillebrand et al., 1999).
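The stopping rule mentioned above (standard error of the mean biovolume below 5 % after about 10 cells) can be checked on a set of measurements as follows. The function name and the ten volumes are hypothetical and serve only to show the calculation.

```python
import math
import statistics


def relative_standard_error(volumes_um3):
    """Standard error of the mean, as a percentage of the mean."""
    n = len(volumes_um3)
    mean = statistics.mean(volumes_um3)
    sd = statistics.stdev(volumes_um3)  # sample standard deviation
    return sd / math.sqrt(n) / mean * 100.0


# Hypothetical biovolumes (um^3) of 10 cells of the same species.
measured = [980.0, 1010.0, 1005.0, 990.0, 1020.0,
            995.0, 1000.0, 1015.0, 985.0, 1000.0]
rse = relative_standard_error(measured)
print(f"relative standard error: {rse:.2f} %")
```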

Data file content
The data file consists of an Excel file containing several spreadsheets. A spreadsheet named "dimension-biovolume-biomass" lists all the different name entries, with their corrected names, and associated World Register of Marine Species code (WoRMS, http://www.marinespecies.org). In total, 1364 different taxonomic entries were found, but were reduced to 727 different taxonomic lines after name correction. The original entry and its associated correction following WoRMS are indicated in two different columns. Up to 607 WoRMS species codes were attributed, but 24 entries were not found in the WoRMS register and were labeled "nf1" to "nf24". Entry lines were also tagged with a "C" for centrics, "P" for pennates and "U" for unidentified diatoms (this last group was not converted to C biomass because of the large uncertainty on cell size). In most instances, taxonomic entries were not associated with cell size measurements. On other occasions, biovolume measurements were provided but lacked corresponding cell size data. Hence, it was virtually impossible to reconstruct each individual calculation method employed for estimating biovolume, as this was often not indicated in the datasets. Keeping the original published biovolumes would almost certainly have introduced a bias between different datasets. We therefore chose to exclude such data, and have documented instead, for every distinct species, the minimum, average and maximum known cell dimensions. The dimensions extracted from the literature were then used to convert all the available abundance data into biovolumes and C biomass using a single standardized method. Each species is allocated one of the 19 possible diatom shapes identified in Sun and Liu (2003) in order to derive the biovolume (V) and surface area (S) calculation formulas.
The figures for the different shapes and formulas extracted from Sun and Liu (2003) are shown in another spreadsheet, "diatom shapes", for a quick visual check of the diatom cell shapes. In the spreadsheet "dimension-biovolume-biomass", the known minimum and maximum dimensions for each species are indicated. In the column "other info", the taxonomist's original observations regarding size are indicated, but these most often refer to a unique value, the largest dimension or diameter of the cell. When indications of cell size are given, minimum and maximum dimension columns are amended to fit the observations (indicated by a yellow color). The bibliographical references used to find dimensions for each species are indicated for each entry as a number, which refers to the "reference" spreadsheet, where full references are given. Dimensions written in black correspond to referenced measurements; dimensions written in red refer to a value deduced from illustrations or drawings when a scale bar was present, showing a ratio between two different axes of the cells. Cells labeled in pink indicate that an assumption was made on the ratio between one of the known dimensions and the hidden dimension. The assumption made is always explicitly indicated in another column, for instance, for some Coscinodiscus species, pervalvar axis = 1/3 diameter. Minimum and maximum biovolume, surface area and S/V ratios are calculated for every single entry depending on the given dimensions. The cellular biovolumes ranged from 3 µm³ (Thalassiosira sp.) to 4.71 × 10⁹ µm³ (Ethmodiscus sp.). The total biovolume obtained was then converted to C biomass, similar to the method used in Cornet-Barthaux et al. (2007), using the equation of Eppley et al. (1970) corrected by UNESCO (1974) and Smayda (1978).
The spreadsheet "diatom database" is the actual compiled diatom database with the complete information regarding date, location, depth, methods, and taxonomic information.
Each line starts with a unique primary key indicator, which makes it possible to restore the original order of the data file quickly if database sorting or filter commands are used for further computations. Biovolume, surface area, and cellular C content are automatically retrieved from the previous spreadsheet based on the recognition of the original name entry. Abundance data are standardized to one unit (cells l −1) and multiplied with C content per cell (pg cell −1) to derive total C biomass (converted to µg C l −1). Minimum, maximum and average data of size, biovolume and biomass are indicated in the file; however, in this paper, the averaged biomass estimates are generally used in the discussion.
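The conversion chain described above (biovolume to cellular C content, then abundance to C biomass) can be sketched as follows. The regression equation itself is not reproduced in the text above, so the coefficients used here are an assumption: they are the values commonly attributed to Eppley et al. (1970) for diatoms, log10 C (pg) = 0.76 log10 V (µm³) − 0.352. The function names and example values are hypothetical.

```python
import math

# Assumed coefficients (not given in the text): the log-log regression
# commonly attributed to Eppley et al. (1970) for diatoms.
SLOPE = 0.76
INTERCEPT = -0.352


def carbon_per_cell_pg(biovolume_um3):
    """Cellular carbon content (pg C per cell) from biovolume (um^3)."""
    return 10.0 ** (SLOPE * math.log10(biovolume_um3) + INTERCEPT)


def biomass_ug_c_per_l(abundance_cells_per_l, biovolume_um3):
    """Carbon biomass in ug C per litre (1 ug = 1e6 pg)."""
    return abundance_cells_per_l * carbon_per_cell_pg(biovolume_um3) * 1e-6


# Hypothetical example: 100 000 cells per litre of a 1000 um^3 species.
c_cell = carbon_per_cell_pg(1000.0)
biomass = biomass_ug_c_per_l(1.0e5, 1000.0)
print(f"{c_cell:.1f} pg C per cell, {biomass:.2f} ug C per litre")
```

Since the slope is below 1, carbon density decreases with cell size under this regression, so the biomass of a mixed assemblage cannot be obtained by applying a single carbon-to-volume factor to the total biovolume.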

Quality control
A first run through the database was done to check for all spelling errors and invalid data entries. Suspicious data, for which the abundance values or units were not clear, were systematically discarded. A statistical treatment, using Chauvenet's criterion test, was then applied to the database to filter out potential outliers. Only 151 data points were identified as outliers using this criterion, and they all corresponded to entry lines with "unidentified diatom species" or "diatom spp.". This is not surprising, as the biomass conversion used in this case is the average between the minimum and maximum biomass found for all diatoms, which logically leads to spurious biomass values (usually overestimated, probably because unidentified cells are mostly of small sizes). After correcting the database by excluding these outliers, a few average biomass values remained conspicuously elevated. On investigation, they were found to correspond to "unidentified diatom species" or "diatom spp." lines. Therefore, we chose to discard the biovolume calculations for all these entry lines ("U") because the assumptions made on their biovolume were too imprecise; nevertheless, the abundance data from these locations were kept in order to preserve the 1056 relevant data points.
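For readers unfamiliar with Chauvenet's criterion, a minimal sketch is given here. It flags a value when the expected number of points deviating at least that far from the mean, under a normal distribution, is below 0.5. Whether the criterion was applied to raw or log-transformed biomass in this study is not stated, so the choice of input is left to the caller; the function name and the test values are hypothetical.

```python
import math
import statistics


def chauvenet_outliers(values):
    """Return one boolean per value: True if flagged by Chauvenet's criterion."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    if sd == 0.0:
        return [False] * n
    flags = []
    for x in values:
        z = abs(x - mean) / sd
        # Two-sided probability of a deviation at least this large
        # for a normally distributed variable.
        prob = math.erfc(z / math.sqrt(2.0))
        flags.append(n * prob < 0.5)
    return flags


# Hypothetical values with one suspicious entry at the end.
data = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 10.3, 9.7, 10.1, 25.0]
flags = chauvenet_outliers(data)
kept = [x for x, bad in zip(data, flags) if not bad]
print(kept)
```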

Spatial distribution of data
The database contains 91 704 individual lines (90 648 with converted biomass). There are 9930 unique location, time and depth points (but with multiple species entries) and 2971 unique location and time points (all depths combined). Regarding the spatial distribution of data, the oceanic regions best represented included the North Atlantic, the north Indian, equatorial Atlantic, Arctic, Antarctic and North Pacific areas (Fig. 2). Indonesia, the Gulf of Mexico and Caribbean, the South Pacific, South Atlantic and south Indian are less well covered. This does not mean that samples were not collected and counted, but simply that the data have not been released for public use by their owner or have remained the property of a given government. The largest number of observations was reported in the Northern Hemisphere (NH) between the Equator and 70°N (Fig. 3a). Table 1 shows that the distribution of biomass data, according to latitudinal bands, is clearly skewed towards the mid-Northern Hemisphere with 43.9 % of data between 40° and 60°N.

Temporal distribution of data
Most observations were commenced in the 1970s, but a few datasets date back further (1933-1934 and 1954-1956). As expected, data frequency diminishes after 2000, as newer data need to be published by the relevant Principal Investigators (PIs) before being submitted to databases, a process that usually occurs a few years after the end of a research program. Data were mostly obtained during boreal spring and autumn (37 % in March, April and November), while the boreal winter months were less well covered (11 % in December, January and February).
The highest abundances reported in the database, representing massive blooms (> 10 million cells l −1), were found in Antarctica in the Ross Sea in December 2004 and January 2005, and at the Antarctic Davis station in January 1995. These occurrences are represented by Chaetoceros socialis blooms, Thalassiosira spp. and unidentified pennates. Abundances of up to several million cells l −1 were also reported in a coastal area during the Galicia program off NW Spain (again identified as Chaetoceros socialis). The smallest abundance values were reported for the Indian Ocean and the Mediterranean Sea. The average diatom cell abundance for each time, location and depth was 263 099 cells l −1 and the median value was 7056 cells l −1.

Global biomass characteristics
Diatom C biomass calculated from cell sizes spans over eight orders of magnitude (Fig. 4). The mean diatom biomass for the entire database is 141.19 µg C l −1, while the median value is 11.16 µg C l −1. The mean diatom biomass for the NH is 141.22 µg C l −1 (median 12.60 µg C l −1) and 141.27 µg C l −1 (median 4.67 µg C l −1) for the Southern Hemisphere (SH). For the whole database, 19 % of biomass data are in the range of 0-1 µg C l −1, 29 % in the range of 1-10 µg C l −1, 31 % in the range of 10-100 µg C l −1, 18 % in the range of 100-1000 µg C l −1, and only 3 % > 1000 µg C l −1. The maximum biomass in the NH (12 299 µg C l −1) was reported off the coast of NW Spain (43.42°N-8.43°E) at the surface in July 1990. The biomass maximum was associated with a bloom of Dactyliosolen fragilissimus and Chaetoceros spp. The maximum biomass in the SH (11 174 µg C l −1) was observed in the Peruvian upwelling region in March 1974. Here, the surface water bloom was [...]
The biomass uncertainty was calculated as a percentage of the difference between the maximum biomass and minimum biomass normalized to the mean biomass (Fig. 5b). The biomass uncertainty fell between 100 and 200 % of the average biomass for 96 % of the data, and between 0 and 100 % for the remaining 4 % of data. Uncertainty is strongly sensitive to cell size, and therefore diatom species that span wide size ranges provide the least precise estimates. Only the accurate determination of cell sizes for each species and for each program, location, date and depth will significantly improve this bias.
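The uncertainty measure defined above can be written out explicitly. The sketch below assumes, as in the data file description, that the average biomass is the midpoint of the minimum and maximum estimates; under that assumption the measure is bounded by 200 %, and it exceeds 100 % as soon as the maximum is more than three times the minimum. The function name and example values are hypothetical.

```python
def biomass_uncertainty_percent(b_min, b_max, b_mean=None):
    """(max - min) / mean, in percent.

    If no mean is supplied, the midpoint of min and max is used
    (an assumption about how the average estimate was built).
    """
    if b_mean is None:
        b_mean = (b_min + b_max) / 2.0
    return (b_max - b_min) / b_mean * 100.0


# Hypothetical min/max biomass estimates (ug C per litre).
narrow = biomass_uncertainty_percent(10.0, 20.0)   # maximum is 2x the minimum
wide = biomass_uncertainty_percent(1.0, 100.0)     # maximum is 100x the minimum
print(f"narrow range: {narrow:.1f} %, wide range: {wide:.1f} %")
```

This makes the reported split easier to read: the 96 % of data with an uncertainty above 100 % are simply those entries whose literature size range yields a maximum biomass more than three times the minimum.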

Latitudinal and depth distribution of biomass estimates
The vast majority of biomass estimates were collected in the 0-100 m layer (Fig. 6a), which is well covered in terms of vertical resolution, while deeper estimates are mostly found at fixed depths below 100 m (150, 200 m) and are more scarce.
The largest range of biomass estimates corresponds to the latitudinal bands most often sampled, between 40° and 60°N (Fig. 6b). Estimates are scant in the SH, but all latitudes are reasonably well covered. There is no clear tendency towards lower or higher biomass according to latitude, except potentially in the Arctic where the range of variation seems to be lower than elsewhere.

Seasonal distribution
There are no clear seasonal trends in the monthly distribution of biomass estimates in the NH (Fig. 7a). The largest range of estimates is observed in June and the lowest in November, but a wide amplitude of variation is observed for almost every month. Seasonality seems slightly more marked for the SH, with the lowest range of variations observed between June and September and the highest range between November and March (Fig. 7b). This weak display of seasonality probably originates from the fact that a mix of warm and cold waters and of eutrophic and oligotrophic areas is represented in both hemispheres.

Dominant genera and species
Biomass data for all identical taxonomic entries were summed for the entire database, for either genera (Fig. 8) or for individual species (Fig. 9). Out of the 136 identified genera in the database, 32 genera represent 99 % of the total estimated biomass. A boxplot of estimated averaged biomass for all 32 genera is shown in Fig. 8. The median values for all individual genera roughly range between 0.1 and 10 µg C l −1 . Taking into account the 5th and 95th percentiles, average biomass ranges between 0.002 µg C l −1 and 826 µg C l −1 . The largest range of biomass is found for the genus Thalassiosira and the narrowest for Paralia. The percentage contribution of each genus ranked by decreasing order of importance is reported in Table 2. The dominant genus in the database is Rhizosolenia, representing 17.4 % of the total diatom biomass, followed by Chaetoceros (14.5 %) and Thalassiosira (12.6 %). Unidentified pennate and centric diatoms were included in the calculation, and if determined down to genus would inevitably change the relative order of the dominant genera, as they represent 8.2 and 6.6 % of the total biomass, respectively. The other important genera are Dactyliosolen (7.6 %) and Guinardia (7.3 %). Centric diatoms are by far the largest contributors to total biomass (86 %), and the cylindrical shape is dominant overall.
A second boxplot is presented in Fig. 9 with the same calculations as in the preceding Fig. 8, but using only the taxonomic entries that were identified down to the species level and excluding all other undetermined species (e.g. Chaetoceros spp.). Out of the 552 identified species (which may be reduced to a slightly smaller number after elimination of all synonyms in the database), only 43 species contribute 90 % of the total diatom biomass for identified species (47.5 % of the total biomass in the database including all undifferentiated taxa). The median value for these dominant species ranges roughly from 0.1 to 10 µg C l −1. When extending to the 5th and 95th percentiles, biomass data range from 0.002 to 439 µg C l −1. The largest range of biomass is found for Rhizosolenia imbricata and the narrowest for Coscinodiscus wailesii. The percentage contribution of each species ranked by decreasing order of importance is reported in Table 3. The predominant species, contributing up to 19 % of total biomass (excluding all unidentified species data), were Dactyliosolen fragilissimus (13.6 %), Rhizosolenia imbricata (10.8 %) and Guinardia striata (8.2 %). The Rhizosolenia species in this list (6/43) alone represent 20.8 % of total biomass (identified to the species level). The seven major Chaetoceros species combined represent 6.1 % of biomass. The most dominant Chaetoceros species in terms of average total biomass was found to be Chaetoceros socialis (2.6 %) followed by Chaetoceros compressus (1.6 %). Again the dominant species contributing to the average total biomass overall were principally represented by centric diatom species.
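The ranking used in this section (sum biomass per taxon, sort in decreasing order, and count how many taxa are needed to reach a given share of the total) can be sketched as below. The function name and the toy values are hypothetical and are not taken from the database.

```python
def taxa_reaching_share(biomass_by_taxon, share=0.99):
    """Return the smallest set of top-ranked taxa whose summed biomass
    reaches the requested share of the total."""
    total = sum(biomass_by_taxon.values())
    ranked = sorted(biomass_by_taxon.items(), key=lambda kv: kv[1], reverse=True)
    selected = []
    running = 0.0
    for name, biomass in ranked:
        selected.append(name)
        running += biomass
        if running >= share * total:
            break
    return selected


# Hypothetical summed biomass per taxon (arbitrary units).
toy = {"A": 50.0, "B": 30.0, "C": 15.0, "D": 4.0, "E": 1.0}
dominant = taxa_reaching_share(toy, share=0.9)
print(dominant)
```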

Discussion
Table 2. Diatom genera in ascending order of contribution to total biomass. 32 genera amount to 99 % of global biomass. Note that unidentified pennate and centric diatoms represent a non-negligible 14.8 % of the total biomass. If they were identified down to genera, the order of dominance for the most abundant groups might change.
This study is the first effort to compile robust global biomass estimates for marine diatoms. A summary boxplot diagram (Fig. 10) shows that 78 % of the data (without consideration of taxa) range between 0.01 and 100 µg C l −1 for the average diatom biomass estimates per depth. However, there remain numerous biases in the present database that require resolution before an accurate diatom biomass dataset can be fully realised in the future. We have identified several major biases from this compilation and acknowledge that resolving them at this point in time is beyond the scope of this paper. These biases are as follows:
1. Although the temporal distribution seems to be well covered (Fig. 7), the spatial coverage is still inhomogeneous (Fig. 2) and vast parts of the ocean (in particular the SH) remain undersampled and/or the data remain inaccessible.
2. Blooming/productive areas are often better investigated than oceanic deserts, and when programs do occur in oligotrophic regions, researchers can often refrain from running accurate cell counts when the abundance of a group is very low. Figures 8 and 9 show that for individual genera or species the distribution of data around the median values is mostly skewed towards the higher biomasses. Such a feature indicates cell abundances have been assessed more thoroughly when cells are abundant.
3. Similarly, large cells are more easily identified in light microscopy than smaller cells (typically < 10-20 µm).
4. The biovolume used to convert µm³ into pg C cell −1 is calculated from the frustule outer dimensions, which do not necessarily match that of the cytoplasm. The latter can be, depending on the species, considerably smaller than the frustule itself. This issue can only be resolved by culture work to determine cellular C content on the main identified species. As a consequence, all C biomass estimates must be considered as overestimates and a maximum value per genus or species.
5. Cells change size through their life cycle and with season and depth, and it is therefore inadequate to use average values for cell size, and subsequently for biovolume and carbon biomass calculations. Cell sizes should be measured systematically (for the dominant species) between subsamples and between different areas. This could not be done in the database, where minimum and maximum ranges for each species were considered, and distinction in sizes according to the geographic area could not be taken into account. According to Viličić (1985) the use of literature data from other oceanic regions should be avoided, and measuring cell dimensions for each dataset is the only way to estimate the total cell volume without major error.
6. Regarding the average cell size, Hillebrand et al. (1999) further stated that the biovolume should be calculated from the median of measured linear dimensions, not as a mean (or median) of a set of individually calculated biovolumes. Here, we were not able to calculate median dimensions for lack of data on cell size measurements, so we decided to use the average biovolume calculated from the literature minimum and maximum dimensions, but we acknowledge that this is a rough approximation.
7. In most cases, the hidden dimension of diatoms is not indicated, and cannot be obtained without further manipulation of the cells on glass slides using needles, a task that can be daunting to most people. In this study, assumptions were made on the hidden dimension using ratios between, for instance, the diameter and pervalvar axis for centric diatoms. Clearly, more attention needs to be given to these calculations, and this hidden dimension should be better indicated in taxonomic guides.
8. The cellular carbon content is assumed to be constant and a function of cell volume. However, it is known that depending on growth conditions (irradiance, temperature, nutrients), a degree of plasticity in the cellular C content can be achieved (Finenko et al., 2003). Applying the same conversion factor over a wide size range, as is the case for diatoms, leads to systematic errors and this formulation should also be improved (Menden-Deuer and Lessard, 2000).
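Bias 6 in the list above can be made concrete with a small numerical comparison. The sketch computes, for a cylindrical cell, (a) the biovolume from the median of the measured linear dimensions, as recommended by Hillebrand et al. (1999), (b) the mean of individually calculated biovolumes, and (c) the average of the minimum and maximum biovolumes, which is the approximation used when only literature size ranges are available. The three hypothetical cells and the cylinder shape are assumptions chosen for illustration.

```python
import math
import statistics


def cylinder_volume(diameter_um, height_um):
    """Volume (um^3) of a cylinder from its diameter and height."""
    return math.pi / 4.0 * diameter_um ** 2 * height_um


# Hypothetical measurements of three cells of one species (um).
diameters = [10.0, 20.0, 30.0]
heights = [5.0, 10.0, 15.0]

# (a) Biovolume from the median linear dimensions.
v_from_median_dims = cylinder_volume(statistics.median(diameters),
                                     statistics.median(heights))

# (b) Mean of the individually calculated biovolumes.
individual = [cylinder_volume(d, h) for d, h in zip(diameters, heights)]
v_mean_of_volumes = statistics.mean(individual)

# (c) Average of the smallest and largest biovolumes.
v_minmax_average = (min(individual) + max(individual)) / 2.0

print(round(v_from_median_dims), round(v_mean_of_volumes), round(v_minmax_average))
```

With these values the three approaches differ by up to a factor of about 1.75, with the min-max average being the highest, which is consistent with the statement that the approach adopted here is only a rough approximation.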
These biases are well established and acknowledged in modern treatments of biovolume and biomass estimates (e.g. Cornet-Barthaux et al., 2007), yet nevertheless remain challenging. Substantial progress could be achieved by placing more effort on the globally dominant species. This database allows the first estimate of the relative contribution of the main diatom genera and species to global biomass, and reveals that a small number of them (< 50) represent between 90 and 99 % of the biomass. Improving size and biovolume determinations on these particular species, as well as according to geographical area, season and life cycle, should thus substantially improve diatom biomass estimates. Guillard and Kilham (1978) published an extensive description of the diatom flora for the main biogeographical provinces, which similarly showed that only a few dozen species were dominant in each province. At a coastal site in the Gulf of Lions (northwestern Mediterranean Sea), a bimonthly survey over 11 yr showed that out of the 91 diatom species that were identified, only 16 species represented 97 % of the combined cell abundances. Incidentally, 10 of these 16 species also appear in the top 50 species identified in Fig. 9. We, therefore, advocate the systematic use of regional atlases reporting a full description of cell sizes and biovolume ranges for the dominant species present, which are usually much less numerous than the full extent of diatom diversity. Focusing on improving biomass estimates for the most abundant species identified here should be an achievable task within the next few years, and should considerably improve global diatom biomass estimates. This list of dominant species should of course not be considered as a static unchanging list, as climate change and environmental modifications are likely to change the order of species dominance in the ocean.
However, some species identified here as globally important are seldom the object of laboratory culture work and little is known of their physiology and biogeochemical characteristics. This study, together with the other datasets compiled for the main planktonic functional types, should allow a first comparison of the relative importance of each PFT, as well as an estimation of the global heterotrophic to autotrophic planktonic biomass ratio. Looking at coastal and open ocean data separately should also allow for the validation or otherwise of the trophic chain pyramid models proposed by Gasol et al. (1997). By compiling simultaneous reports for most planktonic groups (phytoplankton, bacteria, mesozooplankton and heterotrophic protists) from the literature and in various environments, Gasol et al. (1997) showed that the heterotrophic : autotrophic biomass ratio was higher in open ocean/less productive systems, indicating an inverted biomass pyramid, while coastal/productive areas were characterized by a smaller contribution of heterotrophs relative to autotrophs. According to the authors, these differences reflect consumer-controlled systems in the first case, and resource-controlled systems in the latter. The different databases compiled in this special issue could be used to run such comparisons (see also Buitenhuis et al., 2012, the introductory paper of this special issue).
Despite the identified biases, the biovolume data compiled in this study are of the same order of magnitude as the literature data. Considering a global integration depth of 100 m as a rough estimate for the euphotic zone depth, diatom biomass data mostly fall between 0.01 and 10 g C m−2, which is of the same order of magnitude as the total autotrophic plankton biomass (diatoms + other groups) reported by Gasol et al. (1997), which ranged between 0.02 and 31.8 g C m−2. However, a more extensive comparison with the literature remains difficult because global estimates derived from satellite products are most often given as chlorophyll a concentrations or as net primary production.
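The unit conversion behind these areal values can be sketched as follows. This is a minimal illustration, not the procedure used to build the database: the trapezoidal rule, the function name `integrate_profile` and the example profile are assumptions. Only the units and the three-depth minimum (required in this study for depth-integrated values) come from the text.

```python
def integrate_profile(depths_m, conc_ug_c_per_l):
    """Depth-integrate a biomass profile and return g C m-2.

    Assumption: trapezoidal integration between sampled depths.
    1 ug C l-1 equals 1 mg C m-3, so the integral is in mg C m-2.
    """
    if len(depths_m) != len(conc_ug_c_per_l):
        raise ValueError("depths and concentrations must have equal length")
    if len(depths_m) < 3:
        raise ValueError("at least three depths are required")

    # Sort samples from shallowest to deepest before integrating.
    profile = sorted(zip(depths_m, conc_ug_c_per_l))

    mg_c_per_m2 = 0.0
    for (z0, c0), (z1, c1) in zip(profile, profile[1:]):
        mg_c_per_m2 += 0.5 * (c0 + c1) * (z1 - z0)

    return mg_c_per_m2 * 1e-3  # mg C m-2 -> g C m-2


# Hypothetical profile: a uniform 10 ug C l-1 between 0 and 100 m.
print(integrate_profile([0.0, 50.0, 100.0], [10.0, 10.0, 10.0]))
```

Under these assumptions, a uniform concentration of 10 µg C l−1 over 100 m corresponds to about 1 g C m−2, which lies within the 0.01 to 10 g C m−2 range quoted above.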
Table 4. Global ocean budget of diatom biomass for the entire dataset expressed in Tg C, Tmol C and Tmol Si, and Si biomass turnover rate estimates in d−1 (see discussion section for calculation details). Columns: all data 0-100 m; all data 0-200 m.
Finally, we present an attempt at a first-order estimate of the global diatom biomass (Tables 4 and 5). Following Luo et al. (2012), depth-integrated biomass values (a minimum of three depths was required for the calculation) were binned onto a 3° × 3° grid to partially smooth out the uneven spatial distribution of data. The total area of each of the five main oceans was multiplied by the geometric or arithmetic mean of diatom biomass for that ocean. The geometric mean is preferred for this calculation because it is the appropriate measure of central tendency for lognormally distributed data. The dataset was further divided into coastal (defined here as bathymetry < 100 m) and open ocean data, representing 552 and 3826 different sites, respectively. The binning procedure is not suitable for coastal data alone (too little spatial coverage), hence the calculations were run on the entire dataset first (Table 4), then on open ocean data alone (Table 5), the difference reflecting the weight of coastal data. Considering either 100 or 200 m as the depth of integration yields diatom biomass values for the global ocean, using all data, of 488-470 Tg C (geometric mean) and 2942-3023 Tg C (arithmetic mean), respectively. These values vary slightly when considering open ocean data alone (Table 5) and amount to 582-444 Tg C (geometric mean) and 3636-3433 Tg C (arithmetic mean), respectively. After conversion to Si biomass using a Si : C ratio of 0.093, the average between Si-stressed diatoms (0.056, DeLaRocha et al., 2010) and Si-replete diatoms (0.130, Brzezinski et al., 2011a), the global Si budget for diatom biomass amounts to 3.6-3.8 Tmol Si for the global ocean (Table 4) and 3.4-4.5 Tmol Si for the open ocean with coastal data excluded (Table 5). Considering the global gross Si production estimate of 240 Tmol Si yr−1 given by Nelson et al. (1995), this converts to a Si biomass turnover rate of between 0.15 and 0.19 d−1 (geometric mean).
The arithmetic means yield a Si turnover rate of 0.02-0.03 d−1, which appears to be a strong underestimate for diatoms. Next, the mean integrated BSi biomass over 0-200 m (in mmol Si m−2) is presented for each basin and compared to literature data for various oceanic provinces (Table 6). In ocean studies, diatom biomass is usually available only indirectly through particulate Si measurements, which allows a comparison between our dataset and actual measurements after conversion from C to Si biomass. Our estimates for open ocean data range between 3.3 and 26.9 mmol Si m−2, which is quite similar to the estimate of 2 to 26 mmol Si m−2 given in Adjou et al. (2011) for High Nutrient Low Chlorophyll (HNLC) and oligotrophic regions. However, the range of variation of integrated BSi data in various hydrological environments can be quite large, and values may locally be one to three orders of magnitude higher than our basin averages, as evidenced in Table 6.
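The arithmetic behind the Si budget and turnover rates quoted above can be reproduced with a short sketch. The Si : C ratios, the 240 Tmol Si yr−1 production estimate and the Tg C values are taken from the text. The molar mass of C (12.011 g mol−1), the 365-day year and all function names are assumptions made for illustration. The `geometric_mean` helper is included only to illustrate why a geometric mean sits well below an arithmetic mean for strongly skewed data.

```python
import math

C_MOLAR_MASS = 12.011             # g C per mol C (assumption)
SI_TO_C = (0.056 + 0.130) / 2     # mol Si per mol C, i.e. 0.093 as in the text
GROSS_SI_PRODUCTION = 240.0       # Tmol Si yr-1 (Nelson et al., 1995)
DAYS_PER_YEAR = 365.0             # assumption


def tg_c_to_tmol_si(biomass_tg_c):
    """Convert a C biomass in Tg C to a Si biomass in Tmol Si."""
    tmol_c = biomass_tg_c / C_MOLAR_MASS  # Tg / (g mol-1) = Tmol
    return tmol_c * SI_TO_C


def si_turnover_per_day(biomass_tg_c):
    """Si biomass turnover rate (d-1): daily production / standing stock."""
    daily_production = GROSS_SI_PRODUCTION / DAYS_PER_YEAR
    return daily_production / tg_c_to_tmol_si(biomass_tg_c)


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Open ocean geometric-mean budgets quoted in the text (Tg C).
for tg_c in (444.0, 582.0):
    print(tg_c, round(tg_c_to_tmol_si(tg_c), 1), round(si_turnover_per_day(tg_c), 2))

# Hypothetical skewed sample: the two means differ by a factor of about five.
sample = [1.0, 100.0]
print(geometric_mean(sample), sum(sample) / len(sample))
```

With these assumptions, 444 and 582 Tg C convert to about 3.4 and 4.5 Tmol Si and to turnover rates of about 0.19 and 0.15 d−1, consistent with the ranges given in the text.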
Unfortunately, we did not find any integrated BSi data for the Arctic Ocean to compare with our data. In this region, the biomass estimate more than doubles when looking at open ocean data alone (9.9 mmol Si m−2, i.e. about 215 % of the entire dataset estimate of 4.6 mmol Si m−2), while the Atlantic, Pacific and Indian Oceans all show a slight decrease (−3 to −7 %) when excluding coastal data, which are generally expected to be skewed towards higher biomasses. This particular feature of the Arctic could be explained by the presence of a broad continental shelf and the impact of large riverine inputs, which could induce large differences between coastal and open ocean biomass. The Atlantic Ocean average estimate (combining data from the Baltic and Mediterranean) is the lowest of all regions (3.3-3.4 mmol m−2) and compares well with literature data for the Mediterranean Sea, the Bermuda Time Series (BATS) and the North Atlantic. Much larger values were found in the Atlantic sector of the ACC (Antarctic Circumpolar Current), which is at the boundary with the Southern Ocean and reflects a very different environment. The Pacific Ocean estimate also compares well with open ocean data (HOT, ALOHA, the central, equatorial and southern Pacific), but is much lower than coastal measurements obtained at Monterey Bay or the Santa Barbara basin, which are highly productive coastal systems. The Southern Ocean is the region where the discrepancy between our estimates and measurements is highest, with much lower values than expected for diatoms, and a global budget close to that of the Arctic and Atlantic Oceans. This may be due to poor sampling coverage in the dataset, which is visible in Fig. 5, where very few sampling sites are actually documented. The Indian Ocean shows the highest estimates (26.9-29.1 mmol Si m−2) in our dataset and is probably skewed by data from the Kerguelen Plateau, which displays a massive diatom bloom every year. The only data available for BSi are found in the Subantarctic region, but unfortunately no other data for the central and northern Indian Ocean could be found for comparison.
Table 6. Mean integrated BSi (over 200 m) in mmol m−2 calculated from the present database, given as geometric and arithmetic means, using a Si : C conversion factor of 0.093 (see discussion section for calculation details). A distinction was made between all available data and open ocean data alone (considering all data points with bathymetry shallower than 100 m as coastal data). These results are compared to other regional data published in various studies, indicated either as minimum and maximum values or by an average ± SD. The areal surfaces considered for each ocean were 14.056, 76.762, 155.557, 68.556, 20.327.
Table 6 sources: Nelson et al. (1995); 5 Brzezinski and Kosman (1996); 6 Queguiner and Brzezinski (2002).
Boxplot of the main diatom genera, contributing to 99 % of the total biomass (log10 µg C l−1) in the database. Red dots represent the 5th and 95th percentiles. Genus contribution to total biomass is arranged in decreasing order of abundance from top to bottom (see Table 2 for relative importance).
Figure 9. Boxplot of the main diatom species, contributing to 90 % of the total biomass (log10 µg C l−1) in the database. Red dots represent the 5th and 95th percentiles. Species contribution to total biomass is arranged in decreasing order of abundance from top to bottom (see Table 3 for relative importance). All undetermined genera (e.g. Chaetoceros spp.) were left out of the calculation to focus on identified species.
Figure 10. Boxplot of the minimum, maximum and mean estimates of diatom biomass (log10 µg C l−1). Red dots represent the 5th and 95th percentiles and black circles the outliers.

Conclusions
This study provides the first attempt to compile global abundance and biomass data for diatoms in a single database, with uniform data treatment. Quantitative and qualitative information is provided, and much more information on species distribution, succession and relative importance between biogeographical provinces and coastal/open ocean systems can be derived from the present database, although such coverage is beyond the scope of this paper. Despite significant identified biases in biovolume calculations and C content conversions, these first estimates may be used in global biogeochemical models implementing diatoms as a model variable. First estimates for the global ocean produce a diatom biomass of 37-49 Tmol C and 3-4 Tmol Si, and an average Si biomass turnover rate of 0.15 to 0.19 d−1. Spatial coverage, species identification and cell size assessments may still be improved, and taxonomists are encouraged to submit future data to data repositories such as PANGAEA so that they may be used to refine future dataset aggregation projects such as this one. We emphasize that fewer than 50 species represent > 90 % of the total biomass, and that resolving the listed biases for these dominant species first (some of which are less well studied) should help to improve the global biomass estimates considerably. Hence, for more complete studies of size, biovolume and cellular C content, the huge diversity of diatom species in the modern ocean may be reduced to a more manageable number of taxa for global modeling efforts. However, we should keep in mind that climate and environmental change may alter this dominance list at any time, and that continued taxonomic identification and counting of the entire plankton flora remain crucial. Another goal was to provide a usable data file for taxonomists worldwide so that they can add further diatom count data and compute their biovolume and C biomass in a similar way.
This file is available in open access through the PANGAEA database center (see Appendix A), and will evolve with new data submissions.
Along with other papers of this special issue, this study also clearly highlights that taxonomic work and phytoplankton identification skills are far from obsolete and are needed more than ever if we are to achieve robust datasets of planktonic biomass.

A1 Data table
A full table containing all biomass/abundance data points can be downloaded from the data archive PANGAEA, doi:10.1594/PANGAEA.777384. See the description of the file in the "Data file content" section (Sect. 2.3). The Excel file allowing for automatic biovolume calculation can be used as a starting tool to create regional diatom databases and is available on request from the first author. New data additions to this database are welcome and will be implemented when available.

A2 Gridded NetCDF biomass product
The biomass data have been gridded onto a 360 × 180° grid, with a vertical resolution of six depth levels: 0-5 m, 5-25 m, 25-50 m, 50-75 m, 75-100 m and > 100 m. The data have been converted to NetCDF format for ease of use in model calculation exercises. The NetCDF file can be downloaded from PANGAEA, doi:10.1594/PANGAEA.777384.
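The gridding described above can be illustrated with a minimal sketch. It is not the code used to produce the NetCDF file. It assumes that the 360 × 180 grid corresponds to 1° × 1° cells, that a sample lying exactly on a depth boundary belongs to the shallower level, and that values falling in the same cell are combined with an arithmetic mean. None of these choices is stated in the text, and the function names are hypothetical.

```python
import math
from collections import defaultdict

# Upper bounds (m) of the first five depth levels; the sixth level is > 100 m.
DEPTH_UPPER_BOUNDS = [5.0, 25.0, 50.0, 75.0, 100.0]


def depth_level(depth_m):
    """Return the depth level index (0-5); boundaries go to the shallower level (assumption)."""
    for level, upper in enumerate(DEPTH_UPPER_BOUNDS):
        if depth_m <= upper:
            return level
    return len(DEPTH_UPPER_BOUNDS)  # level 5: deeper than 100 m


def grid_cell(lat, lon):
    """Return (row, col) of the assumed 1 degree cell; rows run from 90 S, columns from 180 W."""
    row = min(int(math.floor(lat + 90.0)), 179)
    col = min(int(math.floor((lon + 180.0) % 360.0)), 359)
    return row, col


def grid_mean(samples):
    """Average biomass per (level, row, col) from (lat, lon, depth_m, biomass) tuples."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for lat, lon, depth_m, biomass in samples:
        key = (depth_level(depth_m),) + grid_cell(lat, lon)
        sums[key] += biomass
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


# Hypothetical samples: two in the same surface cell, one deep sample elsewhere.
cells = grid_mean([
    (43.2, 5.3, 3.0, 10.0),
    (43.8, 5.9, 4.0, 30.0),
    (-50.5, 70.2, 150.0, 5.0),
])
print(cells)
```

A geometric mean per cell would be an equally plausible choice given the lognormal distribution of the biomass data. The arithmetic mean is used here only to keep the sketch short.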