Biophysics and vegetation cover change : a process-based evaluation framework for confronting land surface models with satellite observations

Land use and land cover change (LULCC) alter the biophysical properties of the Earth’s surface. The associated changes in vegetation cover can perturb the local surface energy balance, which in turn can affect the local climate. The sign and magnitude of this change in climate depends on the specific vegetation transition, its timing and its location, as well as on the background climate. Land surface models (LSMs) can be used to simulate such land–climate interactions and study their impact in past and future climates, but their capacity to model biophysical effects accurately across the globe remain unclear due to the complexity of the phenomena. Here we present a framework to evaluate the performance of such models with respect to a dedicated dataset derived from satellite remote sensing observations. Idealized simulations from four LSMs (JULES, ORCHIDEE, JSBACH and CLM) are combined with satellite observations to analyse the changes in radiative and turbulent fluxes caused by 15 specific vegetation cover transitions across geographic, seasonal and climatic gradients. The seasonal variation in net radiation associated with land cover change is the process that models capture best, whereas LSMs perform poorly when simulating spatial and climatic gradients of variation in latent, sensible and ground heat fluxes induced by land cover transitions. We expect that this analysis will help identify model limitations and prioritize efforts in model development as well as inform where consensus between model and observations is already met, ultimately helping to improve the robustness and consistency of model simulations to better inform land-based mitigation and adaptation policies. The dataset consisting of both harmonized model simulation and remote sensing estimations is available at https://doi.org/10.5281/zenodo.1182145. Published by Copernicus Publications. 1266 G. Duveiller et al.: Biophysics and vegetation cover change


Introduction
Terrestrial vegetation regulates land-climate interactions through both biogeochemical and biogeophysical mechanisms.The role of vegetation from the biogeochemical side relies on its capacity to act either as a carbon sink or a carbon source (Le Quéré et al., 2016).Changes in land cover, which are dominated by forest area loss, have a pronounced effect on climate by reducing the terrestrial carbon stocks (Canadell and Raupach, 2008).However, land cover also controls both radiative and non-radiative biophysical surface properties of vegetation that influence the water, momentum and energy budgets (Bonan, 2008).Land use and land cover change (LULCC) alters these biophysical properties and in turn affects the local climate through changes in the surface energy balance (Anderson et al., 2011;Bonan, 2008;Davin and de Noblet-Ducoudré, 2010;Lee et al., 2011;Mahmood et al., 2014;Pielke et al., 2011).For instance, a vegetation cover transition from forest to grassland typically causes an increase in albedo (because grasses are generally brighter than trees and in cold climates grasses have a more homogeneous snow cover during the cold season; see Betts and Ball, 1997;Jackson et al., 2008;Loranty et al., 2014), but also a decrease in summer evapotranspiration (because grasses have lower aerodynamic conductance, see Bonan, 2008, and they typically have shallower roots and thus cannot access water in deeper soil horizons, e.g.Canadell et al., 1996;Fan et al., 2017;Oliveira et al., 2005).The result of competing biophysical processes on the surface energy balance varies spatially and temporally and can lead to warming or cooling depending on the specific vegetation change and on the background climate (e.g.presence of snow or soil moisture) (Pitman et al., 2011).In some cases, the associated changes in biophysical properties may offset the intended biogeochemical effects of land-based mitigation (Betts, 2000).Yet, policies tackling climate mitigation through land management focus only on biogeochemical mechanisms and neglect their biophysical consequences.
Recent advances have demonstrated that satellite remote sensing observations can provide valuable diagnostics of the effect of vegetation cover change on their biophysical properties (Alkama and Cescatti, 2016;Duveiller et al., 2018b;Forzieri et al., 2017;Li et al., 2015;Zhao and Jackson, 2014).Biophysical surface properties, such as albedo (Schaaf et al., 2002) and land surface temperature (Wan, 2008), are typically more accessible to remote sensing instruments than biogeochemical properties like carbon stocks and fluxes.The latent heat flux of the land can also be estimated from remote sensing observations using data-driven models (Mc-Cabe et al., 2016;Miralles et al., 2011;Mu et al., 2007), while the residual sensible and ground heat fluxes can be obtained from a combination of such datasets by imposing the closure of the surface energy balance (Duveiller et al., 2018b;Forzieri et al., 2017).To exploit such datasets for analysing the biophysical effects of LULCC, two different approaches are typically adopted.The first focuses on places where vegetation cover has changed over a period of time and compares the situation before and after this event, taking care in controlling for the effects of inter-annual climatic variability over a local window (e.g.Silvério et al., 2015;Alkama and Cescatti, 2016).The second approach relies on a space-fortime substitution that isolates the potential impact of a land cover transition by comparing neighbouring areas with similar environmental conditions but contrasting vegetation (e.g.Zhao and Jackson 2014;Li et al., 2015;Peng et al., 2014;Duveiller et al., 2018b).Both appear to yield similar results (Li et al., 2016), but the space-for-time approach allows the exploration of more transitions and over a larger spatial extent since it is not limited to places where actual change has occurred (Duveiller et al., 2018b).
The biophysical consequences of LULCC are known to depend on the background climate (Pitman et al., 2011;Winckler et al., 2017b), which in turn varies with climate change (IPCC, 2014).To better anticipate these changes it is necessary to predict these biophysical effects with robust model frameworks.Land surface models (LSMs) are used to represent terrestrial processes within Earth system models, in which they simulate both the carbon cycle and landatmosphere fluxes of energy, water and momentum.Initiatives like the Land-Use and Climate, Identification of Robust Impacts (LUCID) project (de Noblet-Ducoudré et al., 2012;Pitman et al., 2009) have attempted to evaluate the capacity of LSMs to represent biophysical effects of LULCC by intercomparing several simulations of past LULCC and showing large discrepancies amongst models, especially in separating between turbulent fluxes.Such inter-comparison exercises should also continue within broader initiatives such as the Land Use Model Inter-comparison Project (LUMIP) contribution to the Coupled Model Intercomparison Project Phase 6 (CMIP6) (Lawrence et al., 2016).However, there is a lack of model evaluation against observation-driven datasets, in which the spatial, temporal and climatic patterns can be evaluated at finer scale.Confrontation with observations could considerably contribute towards improving the robustness and consistency of models, but requires special attention to ensure simulations and observations are comparable regarding how vegetation cover is implemented and how biophysical processes are represented.
This study presents a framework for process-oriented model evaluation specifically tailored towards analysing how local biophysical effects of vegetation cover change are represented in LSMs.Simulations from four major LSMs are confronted with satellite remote sensing observations across geographic, seasonal and climatic dimensions for a range of vegetation transitions and for different components of the surface energy balance.The main objectives of this study are to create a harmonized multi-dimensional dataset, to illustrate its content and to demonstrate its utility by evaluating the agreement amongst models and against satellite observations.
Figure 1.Flowchart resuming the processing steps undertaken for the present study.The part in the grey box corresponds to work done in a previous study (Duveiller et al., 2018b).

Material and methods
Isolating the effect of vegetation cover change from both model simulations and observations in order to make them comparable requires a series of dedicated processing steps.To assist the reader in following the methodology developed in this work, Fig. 1 summarizes the main steps in a synthetic flowchart.

Remote sensing estimations
The observation part of the analysis is based on satellite remote sensing observations to assess the effects of vegetation on the surface energy balance for different vegetation cover types (Duveiller et al., 2018a).This remote sensing dataset (RS dataset for short) consists of spatially and seasonally explicit estimates of changes in surface properties following specific vegetation transitions.These surface properties are albedo, land surface temperature (LST) and evapotranspiration (ET), obtained from the respective MODIS products MCD43C3 (Schaaf et al., 2002), MYD11C3 (Wan, 2008) and MOD16A2 (Mu et al., 2011).The changes in these variables are calculated at the original scale of the product 0.05 • , but the dataset is provided at a spatial resolution of 1 • , with each cell representing the mean changes occurring at the finer scale of 0.05 • .This coarser spatial resolution is necessary for a specific step to ingest CERES EBAF surface radiation data (Kato et al., 2013) in the processing chain, but is also ideal to align the dataset with the simulations of LSMs.The data represent a multi-annual average year with a monthly temporal resolution.This synthetic year is constructed from the median values for a given month over the period 2008-2012 for every 0.05 • pixel.The original land cover map used to build this dataset is the ESA CCI land cover map for 2010 (ESA, 2017), but with a simplified reclassification of land cover types into major vegetation classes according to the International Geosphere-Biosphere Programme (IGBP) classification scheme.A total of 45 distinct vegetation transitions are provided in the RS dataset.Although these are referred to as vegetation transitions, the information does not come from observations over transient vegetation changes, but rather from paired observations of distinct vegetation cover types at the same location.As a result, values for a given pair are only available where there is sufficient spatial abundance, or co-occurrence, of both vegetation types.For more details on the dataset and how it was produced, readers may refer to Duveiller et al. (2018a).
The surface property variables from the RS dataset are net radiation (Rn), latent heat flux (LE), and the sum of sensible and ground heat flux (H + G).Sensible and ground heat fluxes have to be considered together because they cannot be www.earth-syst-sci-data.net/10/1265/2018/ Earth Syst.Sci.Data, 10, 1265-1279, 2018 directly retrieved from satellites and are computed as a residual flux from the closure of the surface energy balance.However, it can be considered that H + G is dominated by H since the ground heat values are generally much smaller and can be neglected at annual scale.In this study net radiation is considered positive when the flux goes from the atmosphere to the ground, while latent, sensible and ground heat fluxes are positive when they exit from the surface to the atmosphere.

Land surface model simulations
To simulate the biophysical effects of local vegetation transitions that are comparable to the RS dataset, we need to run LSMs forced by a realistic climate and with idealized vegetation distributions.The four models evaluated here are OR-CHIDEE (Krinner et al., 2005), JULES (Best et al., 2011;Clark et al., 2011), JSBACHv3.1 (Reick et al., 2013) and CLM4.5 (Oleson et al., 2013) Models differ in how they represent the surface energy balance per plant functional type (PFT) at the sub-grid level.Not all models can calculate heat fluxes per PFT within a grid cell, and thus some need to resort to flux aggregation at grid cell level to derive resulting variables such as temperature.
To overcome this problem and isolate the effect of vegetation cover change on the surface energy budget, simulations are made in which the entire grid cell is covered by a single PFT.Separate simulations are run for each PFT of every model, in which the entire surface of the Earth is covered by a single PFT.The effect of a change in PFT can then be retrieved by subtracting values of biophysical fluxes between the two corresponding simulations.Since there is no feedback between the vegetation and the climate in this set-up, having such homogeneous distributions of vegetation across vast geographic extents does not generate climate biases outside of the grid cell.

Harmonizing vegetation classes
The models differ in how they represent vegetation using different PFTs, each with their own parametrization.To facilitate the harmonization with the IGBP vegetation classes in the remote sensing dataset, only 6 broad vegetation classes are considered: evergreen broadleaf trees (EvgTr), deciduous broadleaf trees (DecTr), needleleaf trees (NedTr), shrubs (Shrub), grasses (Grass) and crops (Crops).A total of 15 transitions (from paired comparisons of PFTs) are thus avail-able and can also be used to represent inverse transitions (e.g.LE for DecTr to Crops is equal to − LE for Crops to DecTr).To obtain these broad vegetation classes from the RS dataset, the IGBP classes of evergreen and deciduous needleleaf forest (ENF and DNF, respectively) were merged into NedTr, whereas classes not represented by models or mixed classes such as woody savannas (SAV), mixed forests (MF) and wetlands (WET) have been omitted.The other three classes (Shrub, Grass and Crops) can be directly assigned with their corresponding classes in the scheme used in the RS dataset (SHR, GRA and CRO).
The harmonization of the different modelled PFTs to these 6 broad classes required the use of some decision rules that are summarized in Table 1.PFTs whose differences relate to their climatic regime (such as "Tropical broadleaf deciduous" and "Temperate broadleaf deciduous" in ORCHIDEE) are geographically separated and can be aggregated into a single global PFT.The spatial representation of the climate zones is taken from the revisited Köppen-Geiger classification product (Kottek et al., 2006).A similar approach is adopted to keep all needleleaf tree PFTs in a single layer, as the deciduous needleleaf trees are predominantly located in a well-defined geographic area in Siberia without a strong overlap with evergreen needleleaf trees.For grasses and crops, LSMs typically make a separation between C 3 and C 4 systems for carbon fixation, which is not currently feasible to detect from remote sensing observations (and thus is absent in the RS dataset).For the two classes, Grass and Crops, the decision rule adopted is to assign the dominant photosynthetic pathway (C 3 or C 4 ) within a grid cell to the entire grid cell.Since different models may have a different default PFT distribution map, the PFT distribution of JSBACH (based on work by Knorr and Heimann, 2001) is selected here as reference and used for the harmonization of the other models as well.An exception to this rule is applied for JULES, which represents crops as grasses.In this case, to maximize information content the Crops class is assigned exclusively with the C 3 grass PFT and the Grass class contains only the C 4 grass PFT.There are some further model-specific details in the harmonization procedure.For CLM, the Crops simulation is composed exclusively of the generic C 3 crop.Even if some LSMs can simulate some crop managements options, such as irrigation, this has been switched off to maximize inter-comparability amongst model runs.Regarding trees, JULES does not distinguish PFTs based on phenology, considering only the difference between broadleaf and needleleaf trees.The EvgTr and DecTr simulations will therefore be identical in JULES.Shrubs are simulated as a PFT by all models except for ORCHIDEE, for which the Shrub class remains empty.
As mentioned previously, the RS dataset can only provide reliable estimations where the vegetation types of interest locally co-exist.Although the models could theoretically simulate the vegetation even where is does not naturally occur, Earth Syst.Sci.Data, 10, 1265Data, 10, -1279Data, 10, , 2018 www.earth-syst-sci-data.net/10/1265/2018/  The spatial merging of different PFTs into single simulation layers requires assignment from ancillary maps indicated by the following superscripts: a Köppen-Geiger classification (Kottek et al., 2006) to distinguish arid, tropical, temperate and boreal climate zones; b default PFT distribution of JSBACH including dominant photosynthetic pathways (Knorr and Heimann, 2001).
the harmonized dataset presented here only includes simulated values over the areas where the RS data are available.

Protocol to evaluate agreement
The harmonized dataset is presented and analysed along three different bi-dimensional spaces.The first is geographic space, in which mean annual values per pixel are obtained by averaging all monthly observations.The second is labelled seasonal space, in which averages are made along latitudinal bands for each month, illustrating the seasonal course of the variables, such as in Hovmöller diagrams (Hovmöller, 1949).The third is climatic space, in which variables are analysed along temperature and precipitation gradients.The climatic axes of this space are mean annual temperature and annually cumulated precipitation for the period 2008-2012 based on the CRU TS4.00 climate data (Harris et al., 2014).The rationale behind using these spaces is to encourage more processbased model evaluation by ensuring that the agreement or disagreement between models and observations is coherent spatially, seasonally and climatically.
In this context of quantifying the biophysical effects of vegetation cover change, neither the satellite-derived estimations nor any of the model simulations can pretend to be an absolute reference, as all of them have some level of assumptions and uncertainties.While observation-driven datasets are usually taken as a reference over model simulations, in some cases the latter can also serve to evaluate the quality of the former (Massonnet et al., 2016).In order to evaluate the agreement without setting a single product as a reference, we measure agreement based on a metric that has the property of being symmetric.This means the value of the metric remains numerically unchanged whether it is applied to products X and Y , or if these are inverted.This is not the case when using a coefficient of determination R 2 from a standard regression of Y on X, which differs from that of a regression of X on Y .Beyond being symmetric, the index of agreement λ that we use is also dimensionless, bounded (between 0 for no agreement to 1 for perfect agreement) and easy to compute (Duveiller et al., 2016).Furthermore, its interpretation is relatively intuitive for practitioners since its value is the same as that of the familiar correlation coefficient r when there are no additive or multiplicative bias contributing to the disagreement between X and Y .When there are biases, its value reduces proportionately to this bias.For two sets, X and Y , each containing n values, the index is defined as follows: where µ and σ represent the mean and standard deviation, respectively.κ is a term set to zero if the correlation between X and Y is positive, and otherwise set as follows: Besides this index of agreement, the analysis also uses the correlation coefficient r and the mean absolute bias B as defined by the following formulas:

Results
Due to its high dimensionality, it is challenging to illustrate exhaustively all the facets of information represented in the dataset.Therefore, this section starts by describing a selection of cases of how different products portray changes in geographic, seasonal and climatic space of a given variable following a specific vegetation cover transition.
The first case, shown in Fig. 2, consists of changes in LE resulting from the conversion of evergreen broadleaf trees to crops (EvgTr to Crops) corresponding to a common land cover change associated with tropical deforestation.It is clear from this figure that not all models reproduce the expected effects of tropical deforestation (i.e. a reduction in LE due to crops having shallower roots and thus less access to water for transpiration) that are seen in the RS dataset.JSBACH and CLM see an increase in mean annual LE following deforestation in large parts of the world.ORCHIDEE and JULES generally predict the sign correctly, a reduction of LE, but for JULES the behaviour in the seasonal and climatic space shows contrasting patterns to those of RS, while those of JS-BACH might be more in line despite the constant bias of this model.Please note that in this and similar figures presented in this work, data for a given transition are only available for areas where there is a local co-occurrence of the two vegetation types according to the land cover distribution used in the RS dataset.
The second case, displayed in Fig. 3, concerns changes in H + G following the conversion of broadleaf deciduous to needleleaf trees (DecTr to NedTr).This transition can rep-resent for instance changes in planted forest species, such as what occurred for most of the past 250 years in Europe (Naudts et al., 2016).According to the RS dataset, this change is associated with a general increase in H + G, particularly in the summer period and across all latitudes (at least those in which there is a joint presence of DecTr and NedTr, and thus a higher likelihood that this conversion occurs).ORCHIDEE and JSBACH do not capture the sign of change in H + G dynamic, sometimes showing a reduction in H + G that is quite large for ORCHIDEE in the summertime and in warmer climates.CLM and JULES show more consistent patterns with RS.However, JULES overestimates the magnitude of change, especially in warmer climates, while CLM simulates a higher rise in H + G in spring in northern mid-latitudes to high latitudes that is not present in the observations.
The effect on net radiation caused by vegetation change from grasses to needleleaf trees (Grass to NedTr) is the third case, illustrated in Fig. 4.This transition reflects the effect of a northward expansion of the boreal forests into tundra, which is expected to happen as the temperatures in the higher latitudes increase.It also represents land abandonment and reforestation in the mid-latitudes.The general direction of change is captured by all models, showing how transforming grasslands to forests leads to an increase in net radiation, mostly due to the increase in shortwave radiation absorbed by the darker canopy.However, the geographic, seasonal and climatic patterns differ from observations and vary across the models.Seasonally, the RS dataset illustrates the strong snow effect on albedo in northern latitudes in spring, when the radiation load increases substantially as the days become longer while snow cover remains, thus amplifying the albedo differences between snow-covered grasses and darker evergreen trees.The higher increases in Rn that are present in the RS observations along the mountain ranges in North America are only captured by JULES.The magnitude of the spring albedo effect for this transition is slightly underestimated in OR-CHIDEE and overestimated in JSBACH and CLM, although for CLM it appears to extend more in time and latitude than what is reported in the RS dataset.While these discrepancies might be due more to misrepresentation of snow-related processes within the models than to misrepresentation of vegetation, it remains a good example of a process-based model evaluation, since it focuses on the comprehensive biophysical effect of the process of vegetation cover change.
For any vegetation transition, the similarities and discrepancies, both amongst models and with the RS dataset, can be summarized synthetically in a single diagram for all fluxes and for the three spaces under investigation (geographic, seasonal and climatic).Figure 5 shows such a diagram for the case of the tropical deforestation transition EvgTr to Crops, the same transition which is represented in Fig. 2. Every panel in Fig. 5 provides an inter-comparison of the pair-wise agreement between products either with the index λ, using squares in the lower right triangle of the panel, or with the correlation coefficient, r, using circles in the upper left triangle.Whilst the sizes of the symbols represent the relative value of the metric, the colour provides the value of the mean absolute bias.For two products X and Y , all metrics (λ, r and B) are calculated using the respective equations with the aggregated values over the three reasoning spaces, i.e. the binned values in geographic, seasonal and climatic space, such as those represented in Fig. 2. Generally Rn is better represented across the board, especially the seasonal patterns, and LE tends to suffer larger biases.The agreement amongst the models can be also gauged by comparing the size of the symbols within the triangles of Fig. 5. Analogous plots to those in Fig. 5 are available for all 15 transitions in the Supplement.
To provide a general overview for all transitions, Fig. 6 shows the agreement for each model with the RS dataset for all three fluxes averaged over geographic, seasonal and climatic space.The transitions are ordered according to the magnitude of gross transitions that have occurred in the recent past between 2000 and 2015 according to the annual ESA CCI annual land cover maps (ESA, 2017), with EvgTr to Crops being the largest.Generally, the agreement in geographic space is very poor, followed by occasional agree-Earth Syst.Sci.Data, 10, 1265Data, 10, -1279Data, 10, , 2018 www.earth-syst-sci-data.net/10/1265/2018/ q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q Geographic Seasonal Climatic ment in climatic space, and then more frequent agreement regarding the seasonal cycle.Net radiation is the variable that models simulate best, particularly the seasonal patterns, while H + G come second and LE comes third.Overall, the transition in which there is highest agreement between models and RS across all variables and spaces is DecTr to Crops, while the one with least agreement is Grass to Crops.

LE
No model stands out as having consistently better performance than the others.Models differ in which transitions they simulate best.JULES performs best for NedTr to Grass, CLM for EvgTr to Crops, and both JSBACH and OR-CHIDEE for DecTr to Crops.When looking at each variable across transitions and facets, the models showing highest mean agreement for LE, H + G and Rn are JULES, CLM and JULES respectively.Disregarding the situations when agreement is very poor, rare are the cases when all models similarly agree with RS, i.e. when all colours in a single box of Fig. 6 have similar colours.Those that stand out always involve Rn and are seasonal agreement for Grass to DecTr and both seasonal and climatic agreement for DecTr to Crops.
There are also cases in which a single model stands out as having much higher agreement than all the others, such as CLM for the spatial agreement of Rn for EvgTr to Crops, JULES for the seasonal agreement of Rn for NedTr to DecTr, JSBACH for climatic agreement of H + G for NedTr to Shrub and ORCHIDEE for climatic agreement of H + G for NedTr to Crops.
A final synoptic summary is provided in Fig. 7 that encompasses all transition and all fluxes and separates the total agreement in the recurrent three spaces: geographic, seasonal and climatic.A further division is made by distinguishing between the mean agreement amongst models (analogously to the triangle of values mentioned for Fig. 5) and the mean model agreement with remote sensing.The more salient feature is that models seem to agree most over Europe.Intermodel agreement is also high over northern America, but with the notable exception of the southeast of the United States.Inter-model agreement is also higher in drier and colder areas.However, for many of these areas models do not agree with RS.Western Canada and southern Australia appear as the places where there is the strongest agreement with RS, while the tropics show decisively lower agreement.The higher agreement amongst models in northern latitudes is maintained across the seasons, but the agreement with RS is only high in spring, probably due to the capacity of some models to catch the snow-induced albedo changes when trees are replaced by shrubs or grasses.The overall agreement in climate space indicates how for warm climates, models agree amongst themselves less in the more humid conditions, while there is generally a large disagreement with RS for all conditions.In colder climates, inter-model agreement is high but agreement with RS is higher for wetter conditions.

Discussion
This model-evaluation framework specifically targeting the biophysical effects of LULCC is unique in that it brings model simulations and observation-driven estimates together.By focusing on a model set-up with prescribed homogeneous vegetation types within grid cells, the biophysical impacts of specific LULCC transitions within the models can be recombined to match RS observations without requiring a complex disaggregation of the energy balance fluxes per sub-grid PFT.The resulting harmonized dataset should be of interest for a range of stakeholders.Model developers will find it useful to assess how their model performs with respect to other models and to an observational benchmark, which in turn can serve to identify areas across geographic, seasonal and climate space where model development efforts should be prioritized.people making decisions based on conclusions derived from model outputs, the dataset and the evaluation can provide a welcome overview of model performance across space, time and climate zones, along with an overall an idea of the current level of uncertainty associated with using these tools for estimating the biophysical impacts of LULCC.
The overall picture of the general benchmarking exercise of model performances is not encouraging.For various vegetation transitions, models do not even agree amongst themselves on the magnitude nor the sign of the change.The study confirms with observational data what previous analyses had reported based on model inter-comparisons regarding how models have more difficulties to simulate turbulent fluxes than radiative ones (de Noblet-Ducoudré et al., 2012), despite that the former have been shown to drive the local temperature response to land cover and management (Bright et al., 2017).As can be expected, the seasonal patterns observed in the RS dataset are better simulated than climatic or spatial patterns.Models are especially poor in capturing the spatial patterns, arguably because LSMs typically use the same parameterization for a given PFT across the globe, thereby disregarding the spatial variability of traits that can naturally occur, which in turn is sampled by the observation-driven estimates.In this sense, some model improvement could come by adopting the concept of optical functional types, based on traits detectable by remote sensing (Ustin and Gamon, 2010).
Models also tend to agree more amongst themselves than with observations.This may stem from similarities in the construction of models and their underlying assumptions.Europe may come out as a place of higher inter-model agreement because vegetation models were based heavily on information on temperate ecosystems, resulting in a better representation of temperature deciduous systems than drought-deciduous systems (Morales et al., 2005).Data for other ecosystems have only become available in more recent decades.Inter-model agreement, and to a lesser ex-tent RS-model agreement, is also higher in the areas where flux measurements from eddy-covariance towers, frequently used for carbon cycle calibration, are denser (Schimel et al., 2015).This converges towards an evident conclusion that strengthening the observational base is still essential to ensure the quality of model results.As discussed in Duveiller et al. (2018b), the RS dataset also has shortcomings, and these may partly explain discrepancies with respect to model simulations.A valid criticism is that the RS dataset relies on an evapotranspiration product (Mu et al., 2011) that has been shown to underperform compared to other satellite-driven products both at local (Michel et al., 2016) and at global scales (Miralles et al., 2016).However, this was the only available product at the sufficiently fine spatial resolution of 0.05 • , and despite the underperformance, ancillary analyses in Duveiller et al. (2018a, b) suggest that the product quality is sufficient for the purpose of studying local differences.
Some caveats regarding the specific model set-up need to be highlighted.The first relates to the spatial scale at which processes are represented.The use of homogeneous PFTs across grid cells isolates the effect of a total vegetation cover change within an LSM grid, but this has a natural trade-off: intrinsically heterogeneous ecosystems, such as savanna systems and taiga, could not be evaluated as these are represented by models using a mixture of tree and grass PFTs.A dedicated evaluation could be done in a more sophisticated version of this work using the savanna class transitions in the RS dataset (which was not used here), but would require the use of the same prescribed mixture of trees and grasses for all models.Such an exercise would probably reveal strong changes in biophysical effects linked to canopy roughness.Another option could be to do the entire exercise on LAI, which can integrate this heterogeneity, comparing changes in modelled LAI with LAI estimated from satellite remote sensing.A related caveat in the current set-up linked to mixed systems is that a within-grid cell bias may result from the mismatch between climate and vegetation.In an allforest simulation over a grid cell subjected to the real climate observed over a savanna, the trees may be more stressed than if they were mixed with low-evaporative grasses.Therefore, transitions of forests to grasses reported in this dataset may be overestimating the reduction of processes such as evapotranspiration over savanna regions.Overall, these issues illustrate how increasing the spatial resolution of the model simulations to better match that of observations should improve their inter-comparisons.This would result both from a better characterization of vegetation heterogeneity, but also from enabling LSMs to better resolve local climate variability and the resulting biophysical effects of LULCC.
Another particularity of the model set-up employed here is that only the local first-order biophysical effects of LULCC are explored.Non-local effects related to LULCC occurring elsewhere, as explored by modelling exercises such as Winckler et al. (2017a), are not considered here because they cannot be directly estimated by remote sensing diagnostics.Given that the analysis is also based on uncoupled LSMs runs, there are no possible feedbacks of the vegetation cover change on global climate, nor are there any local atmospheric feedbacks.Considering second-order effects stemming from the bi-directional land-climate interactions would require using LSMs coupled with a general circulation model within an Earth system model in which some cells are affected by LULCC, but this is beyond the scope of the present work.
The inter-comparison exercise presented here can be extended and improved.Doing so could further address other limitations of the current set-up, such as the fact that only a single source of meteorological forcing data is used to run the model simulations.Such datasets, based on climate reanalysis, can be particularly prone to errors and uncertainties in data-poor regions.Initiatives such as LUMIP (Lawrence et al., 2016) could use the present framework to run LSMs with different forcing datasets and evaluate how the simulated biophysical impacts of LULCC are sensitive to the quality of the input data.Another criticism of the inter-comparison is the mismatch between model runs, which do not explicitly include the effects of land management, and the RS dataset, which intrinsically does, simply because these are present in the observations.The biophysical impacts of land management changes have been shown to be as important as the effects of land cover change (Luyssaert et al., 2014), and could thus further account for the discrepancies between models and observations.The exercise could thus be extended with runs that include management for the models which can effectively simulate it, and it could also evaluate the improvement based on the RS benchmark.Finally, the exercise could be improved by extending the observational part to an ensemble of RS datasets.The input biophysical variables used to construct the RS dataset, namely albedo, land surface temperature and evapotranspiration, could be derived from different satellite instruments and based on other, ideally better algorithms.Ultimately, the RS dataset could be based on products from geostationary satellites to be able to study the diurnal patterns of biophysical effects of LULCC and how these are represented in the models.

Conclusions
This paper presents a process-oriented model evaluation framework for biophysical effects of vegetation cover change.A harmonized multi-dimensional dataset has been generated including dedicated simulations from four major LSMs along with observation-driven estimations based on satellite imagery.The analysis of these data along geographic, seasonal and climate dimensions results in an overview of model performance that can serve to highlight hotspots of agreement and disagreement both amongst models and with respect to an observational benchmark.The overall capacity of current LSMs to represent biophysical effects of LULCC is low.The seasonal cycle of radiative fluxes is the process that models capture best, whilst performance drops considerably when considering spatial and climatic gradients for all fluxes.We anticipate that the dataset will serve to identify specific model shortcomings with respect to observations and to other models, but also to highlight where models can be trusted more and where model development should be prioritized.This should in turn contribute to the larger goal to develop and inform land-based mitigation and adaptation policies that account for both biogeochemical and biophysical vegetation impacts on climate.Improving the robustness and consistency of land-surface models is essential to develop and inform land-based mitigation and adaptation policies that account for both biogeochemical and biophysical vegetation impacts on climate.
either C 3 grass or C 4 grass, based on the dominant photosynthetic pathway at grid cell level b .C 4 grass Grid cell assigned either C 3 crop or C 4 crop, based on the dominant photosynthetic pathway at grid cell level b Grid cell assigned either C 3 grass or C 4 grass, based on the dominant photosynthetic pathway at grid cell level b .Crops (Crops) Grid cell assigned either C 3 crop or C 4 crop, based on the dominant photosynthetic pathway at grid cell level b .C 3 grass Standard C 3 crop (representing wheat) Grid cell assigned either C 3 crop or C 4 crop, based on the dominant photosynthetic pathway at grid cell level b .

Figure 2 .
Figure 2. Inter-comparison of how the land surface models and remote sensing (RS) estimate a change in latent heat flux (LE) for the vegetation transition from evergreen broadleaf trees to crops (Crops -EvgTr).We summarize the data within three different comparison spaces to illustrate (a) the spatial variability, representing the annual mean for each pixel; (b) the seasonal variability; and (b) the variability in climate space, represented by mean annual temperature and annually cumulated precipitation.

Figure 3 .
Figure 3. Same as Fig. 2 for a change in the residual flux of the energy balance composed of both sensible heat and ground heat fluxes (H + G) for the vegetation transition from deciduous broadleaf trees to needleleaf trees (DecTr -NedTr).

Figure 4 .
Figure 4. Same as Fig. 2 for a change in the net radiative flux (Rn) for the vegetation transition from grasses to needleleaf trees (Grass -NedTr).

Figure 5 .
Figure5.Summary of the agreement between land surface models amongst themselves and with the remote sensing estimations for the vegetation transition from evergreen broadleaf trees to crops (Crops -EvgTr).The agreement is measured using the index of agreement λ (size of the squares), the Pearson correlation coefficient r (size of the circles, red border indicates negative correlation) and the absolute bias (colour of the symbols).The data used to calculate these metrics are the values previously averaged in bins according to the spatial, seasonal and climatic analysis spaces shown in Fig.2.Hence, these metrics relate only to areas where both vegetation types locally co-exist in reality.The fluxes represented are net radiation (Rn), latent heat flux (LE), and the combination of the sensible and ground heat fluxes (H + G).

Figure 6 .
Figure6.Summary of the agreement between models and remote sensing for all fluxes and all transitions in each of three facets of analysis: spatial, seasonal and climatic.The position of each triangle represents one of the four land surface models as shown in the top corner: ORCHIDEE (ORC), the Community Land Model (CLM), JSBACH (JSB) and JULES (JUL).The fluxes represented are net radiation (Rn), latent heat flux (LE), and the combination of the sensible and ground heat fluxes (H + G).The colour of the triangle represents the value of the index of agreement λ.Below each transition label, the number n provides the total number of original individual spatio-temporal records used to calculate each metric.The order of the transitions (from top to bottom) corresponds to the order of gross changes that have occurred between 2000 and 2015 according the ESA CCI land cover maps, which is provided below each transition label in megahectares (Mha).

Figure 7 .
Figure 7. Agreement amongst models and between models and remote sensing for all transitions and all fluxes together.The agreement amongst models (a) is calculated as the mean of all λ values calculated for each model pairs, while the agreement with remote sensing (b) is the mean values of λ for each model with respect to the remote sensing dataset.

Table 1 .
Summary of how the specific plant functional types (PFTs) of the different land surface models (LSMs) are mapped into the six broad vegetation classes used in this study.Model-specific nomenclature of PFT is used in italics.
Developers of LSMs that are not included in this study can follow the protocol and use the dataset to evaluate the resulting model performance.Model users can use the dataset and the analysis to choose the LSM that performs best over their areas of interest.For Earth Syst.Sci.Data, 10, 1265Data, 10,  -1279Data, 10,  , 2018www.earth-syst-sci-data.net/10/1265/2018/