The National Eutrophication Survey: lake characteristics and historical nutrient concentrations

Historical ecological surveys serve as a baseline and provide context for contemporary research, yet many of these records are not preserved in a way that ensures their long-term usability. The National Eutrophication Survey database is currently only available as scans of the original reports (PDF files) with no embedded character information. This limits its searchability, machine readability, and the ability of current and future scientists to systematically evaluate its contents. These data were collected by the United States Environmental Protection Agency between 1972 and 1975 as part of an effort to investigate eutrophication in freshwater lakes and reservoirs. Although several studies have manually transcribed small portions of the database in support of specific studies, there have been no systematic attempts to transcribe and preserve the database in its entirety. Here we use a combination of automated optical character recognition and manual quality assurance procedures to make these data available for analysis. The performance of the optical character recognition protocol was found to be linked to variation in the quality (clarity) of the original documents. For each of the four archival scanned reports, our quality assurance protocol found an error rate between 5.9 and 17 %. The goal of our approach was to strike a balance between efficiency and data quality by combining hand-entry of data with digital transcription technologies. The finished database contains information on the physical characteristics, hydrology, and water quality of about 800 lakes in the contiguous United States (doi:10.5063/F1KK98R5). Ultimately, this database could be combined with more recent studies to generate metadata analyses of water quality trends and spatial variation across the continental United States.


Introduction
Effective management of inland freshwater lakes requires an understanding of the factors that affect water quality and how these factors change over time. One of these factors, termed eutrophication, occurs when excess nutrient inputs from human activities fuel increases in algal growth, which can cause hypoxia and decreases in water clarity. Eutrophication of surface waters from increased phosphorus and nitrogen loading has been observed in connection with altered land use, especially in areas of rapid urbanization and intensive agriculture (Smith et al., 1999, 2014). As human populations and their impacts continue to grow, eutrophication is expected to become more widespread (Bennett et al., 2001; Taranu and Gregory-Eaves, 2008). Historical datasets are needed in order to track, understand, and manage eutrophication in lakes and reservoirs because they serve as an important baseline for modern studies.
The US Environmental Protection Agency (EPA) designed and implemented the National Eutrophication Survey (NES) in order to investigate the extent of eutrophication in freshwater lakes and reservoirs across the contiguous US. Sampling took place in over 800 lakes and reservoirs from 1972 to 1975 and included a variety of physical, chemical, and biological metrics including data on nutrients and nutrient loading, hydrologic retention time, morphometry, and plankton community diversity. Each lake was sampled on a monthly basis for a period of 1 year. Except for the phytoplankton distribution subset, which we did not transcribe (see Stomp et al., 2011), the NES data are provided as annual averages. Unlike current EPA National Lakes Assessments (NLAs) that select a random sample of lakes across the US, the NES targeted only lakes impacted directly or indirectly by municipal sewage treatment plant discharge (USEPA, 1975, 2009). Until recently, these data were only available in their entirety as four separate scanned reports representing the northeastern and north-central (northeastern), eastern and southeastern (southeastern), central, and western regions of the US (Fig. 1). In the remainder of the present paper we refer to the former two regions as simply the northeastern and southeastern regions.
To our knowledge, there have been no attempts to transcribe the data into a usable, searchable digital database despite its use in previous studies. For example, large portions of the dataset were used to examine large-scale relationships between residence time and phytoplankton abundance (Soballe and Kimmel, 1987) and to predict eutrophication incidence in a Bayesian framework (Lamon and Stow, 2004). Smaller portions of the data were used to explore drivers of nutrient loading (Stomp et al., 2011; Brett and Benjamin, 2008). The only study that we are aware of that used the NES dataset and provided a publicly available data supplement is that of Stomp et al. (2011), and their data supplement was limited to a small subset of the available variables relating to phytoplankton community diversity.
The present study is the first to leverage digital transcription technologies to unlock the full NES dataset. In this paper, we describe the digital transcription of the full NES dataset with the goal of making the dataset openly accessible to the research community. Specifically, our objective was to exactly reproduce the contents of the original dataset rather than to evaluate its scientific integrity. We introduce and publish the data in an open format that requires no proprietary software. It can be easily downloaded, used for analysis, and amended. The provided summary statistics and figures also allow users to quickly assess the utility of the data. Finally, the code and raw data files are provided to facilitate the extraction of fields not represented in our completed dataset (mostly phytoplankton diversity data).

Methods
Data were collected from multiple locations within the water column and included in situ measurements as well as laboratory analyses. Flow estimates and drainage area calculations were provided by the US Geological Survey and were determined from flow gauges when present. More detailed information on sampling methods, units, equipment, and accuracy can be found in the EPA survey methods publication (USEPA, 1975). Due to the historical nature of the dataset, the NES sampling design differs from more modern efforts (USEPA, 2009). For example, the original NES data were collected from four separate regions of the US over the course of 4 years, whereas current assessments complete nationwide sampling in a single summer. As such, NES data values represent the mean of measurements taken in the spring, summer, and fall in either 1972 (northeastern), 1973 (southeastern), 1974 (central), or 1975 (western) rather than summer measurements taken in a single year.
We obtained the NES archival scanned reports from the EPA National Service Center for Environmental Publications (available at: https://www.epa.gov/nscep). The data for the four NES regions are contained in four separate files, one per region. We extracted the data from each file using automated techniques followed by manual quality assurance and checking of each value. To begin, we enhanced (de-noised) each file using the local adaptive thresholding algorithm as provided by the ImageMagick program (v6.8.9-9; available at https://www.imagemagick.org/). Next, we processed the enhanced files using the Tesseract optical character recognition (OCR) program (Ooms, 2017; Smith, 2007). The output of these initial extraction steps was recorded in a set of "raw data" files in which each file contains the raw unprocessed text of one document page. The contents of specific fields in the raw data were extracted to a database using the automated rules provided by the nesR software package (Stachelek, 2017). Finally, all values in the database were manually checked for accuracy against the original scanned reports. Inaccurate OCR outputs were corrected by hand in the final database. Because our goal was to reproduce the data from the original reports and not to verify the technical correctness of the original data, we only changed values if they did not match the original data reports. For example, we did not change data from the five NES lakes that had phosphate (PO4) values exceeding their corresponding total phosphorus (TP) values despite the fact that this is not physically possible (PO4 is a component of TP).
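The rule-based extraction step can be illustrated with a minimal Python sketch. The actual rules are those of the nesR R package; the page layout, the field labels, and the `extract_field` helper below are hypothetical and serve only to show the idea of pulling a labelled numeric value out of raw OCR text while leaving anything that does not look like a clean number empty for manual transcription.

```python
import re
from typing import Optional

# Hypothetical raw OCR text for one report page (the layout is an assumption).
# The second value contains a typical OCR confusion: letter "l" for digit "1".
RAW_PAGE = """\
LAKE NAME: EXAMPLE LAKE
SURFACE AREA (KM2): 12.4
MEAN DEPTH (M): 3.l
"""

# A token is accepted only if it is a plain decimal number.
NUMERIC_TOKEN = re.compile(r"-?\d+(\.\d+)?")


def extract_field(page_text: str, label: str) -> Optional[float]:
    """Return the number that follows `label:` on its line, or None.

    None means "not found or not cleanly numeric", so the value is left
    for hand entry rather than guessed.
    """
    pattern = re.compile(
        r"^" + re.escape(label) + r"[ \t]*:[ \t]*(\S+)[ \t]*$", re.MULTILINE
    )
    match = pattern.search(page_text)
    if match is None:
        return None
    token = match.group(1)
    if NUMERIC_TOKEN.fullmatch(token) is None:
        return None
    return float(token)


surface_area = extract_field(RAW_PAGE, "SURFACE AREA (KM2)")  # parsed as a number
mean_depth = extract_field(RAW_PAGE, "MEAN DEPTH (M)")        # None: needs manual check
```

Returning `None` instead of attempting an automatic repair mirrors the workflow described above, in which every value is checked against the scanned report and inaccurate OCR output is corrected by hand.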
We provide the final dataset in an open nonproprietary format (comma-delimited, *.csv). In addition, we generated metadata descriptions from the contents of the original scanned reports. All calculations, table construction, and figure generation were performed in R and saved as reproducible R scripts (R Core Team, 2017). Table and figure generation was accomplished with the use of the reshape2, plyr, and sp packages (Wickham, 2016; Pebesma and Bivand, 2017).
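Because the data are distributed as comma-delimited text, users can screen for inconsistencies such as PO4 exceeding TP without altering the transcribed values. The following Python sketch shows one way to do this. The column names (`lake_id`, `tp`, `po4`) and the sample rows are hypothetical and do not reflect the actual headers or contents of the published file.

```python
import csv
import io

# Hypothetical excerpt; column names and values are illustrative only.
SAMPLE_CSV = """\
lake_id,tp,po4
A,0.050,0.020
B,0.030,0.045
C,0.010,
"""


def flag_po4_exceeds_tp(csv_text):
    """Return the ids of lakes whose PO4 value exceeds TP.

    The data themselves are not modified; rows with a missing TP or PO4
    value are skipped.
    """
    flagged = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if not row["tp"] or not row["po4"]:
            continue
        if float(row["po4"]) > float(row["tp"]):
            flagged.append(row["lake_id"])
    return flagged


flagged_lakes = flag_po4_exceeds_tp(SAMPLE_CSV)
```

Flagging rather than editing keeps the file a faithful reproduction of the original reports while still letting analysts decide how to treat such records.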

Results
The final NES dataset contains observations from 775 lakes and the distribution of these lakes was spatially variable. Although there were more lakes measured in the northeastern and southeastern US, the number of locations was close to evenly distributed among the remaining regions (Fig. 1, Table 1). Specifically, the number of lakes sampled in each re-
In addition to differences in the total number of lakes measured in each region, there were also differences in the proportion of lakes classified as impoundments rather than as natural lakes. For example, about 60 % of all the lakes studied (462 of 775) were classified as impoundments, yet the northeastern region had only 54 impoundments while the southeastern region had 168 impoundments. Conversely, the number of natural lakes sampled in the northeastern region (146 lakes) was nearly double that of the southeastern region (77 lakes) and more than triple that of the western and central regions (48 and 42 lakes, respectively).
The ability to examine these spatial trends was made possible by our OCR procedure, which had an error rate of 6-17 % depending on region and archival report scan quality. In total, we carried out approximately 5000 corrections to the automated data product by hand as part of our manual quality control review. Approximately 650 lakes had values for at least 80 % of the variables shown in Table 1. On an individual lake basis, the most common "missing" data were nutrient loading estimates for individual point- and nonpoint-source components. In many cases, these data may not actually be missing; rather, the corresponding source may not have been a component of the budget for that particular lake. For example, not all lakes have industrial land use, so no data are expected in these cases.
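The two summary quantities reported here can be expressed as simple ratios. The Python sketch below assumes that the error rate is the fraction of cells whose OCR value had to be changed during manual checking, and that completeness is the fraction of variables with a value for a given lake. The function names, variable names, and sample values are hypothetical.

```python
def ocr_error_rate(raw_cells, checked_cells):
    """Fraction of cells whose raw OCR value differs from the hand-checked value."""
    if len(raw_cells) != len(checked_cells):
        raise ValueError("raw and checked cells must align one-to-one")
    if not raw_cells:
        return 0.0
    n_corrected = sum(raw != checked for raw, checked in zip(raw_cells, checked_cells))
    return n_corrected / len(raw_cells)


def completeness(record, variables):
    """Fraction of `variables` that have a non-missing value in `record`."""
    n_present = sum(record.get(name) is not None for name in variables)
    return n_present / len(variables)


# Ten hypothetical cells, two of which needed a manual correction.
raw = ["12.4", "3.l", "0.8", "45", "7.1", "l9", "0.02", "110", "5.5", "2.0"]
checked = ["12.4", "3.1", "0.8", "45", "7.1", "19", "0.02", "110", "5.5", "2.0"]
error_rate = ocr_error_rate(raw, checked)

# One hypothetical lake with four of five variables present.
variables = ["tp", "po4", "secchi", "retention_time", "industrial_load"]
lake = {"tp": 0.05, "po4": 0.02, "secchi": 1.3, "retention_time": 0.8,
        "industrial_load": None}
meets_80_percent = completeness(lake, variables) >= 0.8
```

As noted above, a missing value such as the industrial load in this example does not necessarily indicate a transcription gap; the source may simply not apply to that lake.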

Code and data availability
Original scanned reports from the EPA are available from the EPA National Service Center for Environmental Publications (https://www.epa.gov/nscep). Our cleaned and usable data are available for download from Stachelek et al. (2017). The data are provided as a zip file, which contains all versions of the data including the raw and quality-checked versions (Stachelek et al., 2017). Moreover, the R package and R code used to scrape and analyze the data are provided by Stachelek (2017) so that the methods can be reproduced and are openly available for (re)use. All figures and summary statistics were generated with R scripts available in the data supplement.

Discussion
We have demonstrated an approach for rescuing historical data from scanned documents. In particular, our approach involved a two-step process of automated data scraping followed by curation by hand and quality assurance. Overall, we found that OCR was an efficient method for reducing the labor associated with transcribing analog text records (e.g., Drinkwater et al., 2014). Unfortunately, OCR technology is not perfectly accurate. In our case, transcription was hampered by poor print and scan quality of the source paper documents. We discovered through our manual validation procedure that the OCR computations produced inaccurate values in approximately 6-17 % of the cells in the complete dataset (n = 4836). We expect that accuracy could be improved by varying the window size of the local adaptive thresholding algorithm relative to the document font size. Our ability to experiment with thresholding window size was limited by the computationally expensive nature of these extractions.
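To make the role of the window size concrete, the following Python sketch implements a generic mean-based local adaptive threshold. It is not ImageMagick's implementation, and the `local_adaptive_threshold` function, its `offset` parameter, and the tiny test image are assumptions made for illustration. Each pixel is compared with the mean of a square window centred on it, so a window that is too small relative to the stroke width compares a pixel mostly with itself and loses contrast.

```python
def local_adaptive_threshold(image, window, offset=0.0):
    """Binarize a grayscale image given as a list of rows (values 0-255).

    A pixel becomes white (255) if it is brighter than the mean of the
    window x window neighbourhood centred on it plus `offset`, else black (0).
    Windows are clipped at the image border.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    height, width = len(image), len(image[0])
    half = window // 2

    # Summed-area table (padded with a zero row and column) so that each
    # window sum costs four lookups regardless of window size.
    sat = [[0.0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0.0
        for x in range(width):
            row_sum += image[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum

    result = []
    for y in range(height):
        y0, y1 = max(0, y - half), min(height, y + half + 1)
        out_row = []
        for x in range(width):
            x0, x1 = max(0, x - half), min(width, x + half + 1)
            total = sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]
            mean = total / ((y1 - y0) * (x1 - x0))
            out_row.append(255 if image[y][x] > mean + offset else 0)
        result.append(out_row)
    return result


# A single dark "ink" pixel on a light background.
page = [
    [200, 200, 200],
    [200, 50, 200],
    [200, 200, 200],
]
binarized_w3 = local_adaptive_threshold(page, window=3)
binarized_w1 = local_adaptive_threshold(page, window=1)
```

With a 3 x 3 window the dark pixel is separated from the background, whereas a 1 x 1 window compares every pixel with itself and turns the whole image black. Scanning a range of window sizes per report would follow the same pattern, at the computational cost noted above.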
The end result of our approach was data from every lake and nearly every variable in the NES survey dataset. The only primary subset of the NES data that is not included in our final product is the phytoplankton distribution data, which have already been digitally transcribed by Stomp et al. (2011). The results of the present study could be used to explore anthropogenic and environmental drivers of lake eutrophication as well as to verify previously documented trends. One example is the 2007 National Lakes Assessment Report, which included a reanalysis of some of the NES study lakes (USEPA, 2009). This reanalysis considered population-level trends in the NES lakes but did not consider trends in individual lakes or potential environmental drivers contributing to observed trends. On a population basis, the NLA reanalysis found that less than 30 % of the NES lakes had increased chlorophyll and phosphorus concentrations. The results of the present study could be used to verify these claims as well as to compare the NES data with more recent work such as the 2012 National Lakes Assessment. Note that NES sampling techniques may differ from current techniques; thus, care should be taken when making comparisons. In addition to its utility in validating historical trends, this dataset has value because it contains data on a number of hydrographic variables that are difficult to estimate, such as water residence (retention) time. Such data are critical to a variety of hydrological and water quality modeling efforts (Brett and Benjamin, 2008).
Although our goal was to digitally transcribe the full NES dataset to facilitate studies on historical nutrient loading, it is worth noting the similarities between the present study and other scientific record digitization initiatives. Such initiatives are common in the climate and ocean sciences but they are just starting to gain momentum in the biological sciences (Allan et al., 2011; Freeman et al., 2017). To our knowledge, the present study is the first large-scale attempt at digitization of historical limnology records. We hope that by making our analysis open and reproducible we will inspire future efforts to recover important records from the pre-digital era.

Figure 3. Map of Secchi depth (m) interpolated using inverse distance weighting.

Table 1. Number of measurements (n) for each variable in each NES region.

Table 2. Mean and standard deviation (SD) for each variable in each NES region.