## Outlines

This package was created for model selection to predict species distributions (biomass, density, or presence/absence). Its final aim is to produce maps of predicted distributions.
In this vignette, we use datasets from the dismo package to show the different possibilities of the library for species distribution modelling, from model selection to predictive mapping. The bradypus dataset contains geographic positions of species occurrences. The different model types available in SDMSelect to predict probability of presence all rely on presence/absence data. As the dataset only contains presence data, we need to create a pseudo-absence set of observations. We do not discuss this method here: we simply assume that we have a dataset of real presence observations and real absence observations, so that we can use presence-absence models.
We create a variable of interest (Obs) with presences (1) and absences (0). This corresponds to our data modelling case datatype = "PA" and its sub-cases. For other types of data, refer to the documentation (in particular, the options in modelselect_opt) to correctly set your modelling parameters.
*Note that most figures of the vignette are saved in “inst” so that model selection is not run during vignette building. However, the code in the vignette can be run on your own computer and should return the same outputs. To do so, open and run this Rmd file: /Users/runner/work/_temp/Library/SDMSelect/SDM_Selection/SDM_Selection.Rmd.*

## Load libraries and other setup

library(SDMSelect)
library(ggplot2)
require(dismo)

set.seed(50)
# Temp directory for saving all outputs
tmpdir <- paste0(tempdir(), "/out_SDM")
dir.create(tmpdir)

## Covariates

Covariates are environmental variables that will be used to predict the distribution of the species. In this vignette, we use rasters already prepared in the dismo package. This means that all rasters have the same projection, extent and resolution, so they can be directly stacked into a multi-layer Raster. If this is not your case, you can use the function Prepare_r_multi to reproject, resample and crop all covariate rasters to the dimensions of a reference raster (if needed). You only need to provide a vector of paths to these rasters. Rasters representing factors / categories (like biome.grd here) should be resampled using the nearest neighbour method (if they need to be resampled); they therefore have to be declared with the is.factor option.
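To illustrate the point about shared geometry, here is a minimal, self-contained sketch (not SDMSelect code; the toy rasters are invented for the example): rasters that share projection, extent and resolution can be stacked directly with raster::stack.

```r
library(raster)

# Two toy rasters with identical geometry (10 x 10 cells, default extent).
# Any set of covariate rasters sharing projection, extent and resolution
# can be combined the same way.
r1 <- raster(nrows = 10, ncols = 10, vals = runif(100))
r2 <- raster(nrows = 10, ncols = 10, vals = runif(100))

covariates <- stack(r1, r2)  # multi-layer Raster
nlayers(covariates)
```

If the geometries differ, stack() fails, which is the situation where a helper such as Prepare_r_multi (reproject, resample, crop to a reference raster) becomes necessary.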

# World map for plots
data(wrld_simpl, package = "maptools")

# Vectors of paths to rasters of covariates
covariates.paths <-
  list.files(file.path(system.file(package = "dismo"), 'ex'),
             pattern = 'grd$', full.names = TRUE)

## Model selection

modelselect_opt$datatype <- "PA"

The procedure selects a combination of covariates at each iteration, from one covariate up to the maximum defined (modelselect_opt("Max_nb_Var") = 3). The procedure is reproduced separately for all model types defined (modelselect_opt("modeltypes") = PAGLM, PAGLMns, PA). The output of the findBestModel function is the path to a zipped file of all outputs saved in the saveWD directory. Messages in the console show the steps of the iterative model selection procedure.

res.file <- findBestModel(x = data.new, datatype = "PA",
                          corSpearman = corSpearman,
                          saveWD = tmpdir, verbose = 1)

## Order models according to quality of prediction

Model selection was run separately for each distribution tested. The exact same k-fold cross-validation datasets were used to keep the best model at each step of the iterative procedure. All goodness-of-fit indices can thus be compared among the tested distributions with paired statistical tests, which allows all tested models to be ranked. The best model and the following ones that are not statistically worse than it (one-sided p-value >= modelselect_opt$lim_pvalue_final) are retained.

# Order models and find the bests
BestModels <- ModelOrder(saveWD = tmpdir, plot = TRUE)

This results in figures showing the distribution of the goodness of fit (here AUC) issued from cross-validation for each model. They show that the predictive power of some models is not very different from that of the best one. Coloured boxplots are those whose AUC distribution is not statistically different from that of the best model. Models are ranked by family and by the number of covariates added during the forward stepwise procedure.

Two tables are available:

• “VeryBestModels_crossV” retains the best model and the ones not significantly worse than the first one.

| Num | model | Mean_AUC_crossV | Mean_RMSE_crossV | AIC | modeltype |
|-----|-------|-----------------|------------------|-----|-----------|
| 42 | dataY ~ s(bio16,k=5) + s(bio5,k=5) + s(bio1,k=5) | 0.88 | 0.32 | 282 | PA |
• “BestForModeltypes” retains the two best models of each “modeltype”.

| Num | model | Mean_AUC_crossV | Mean_RMSE_crossV | AIC | modeltype |
|-----|-------|-----------------|------------------|-----|-----------|
| 42 | dataY ~ s(bio16,k=5) + s(bio5,k=5) + s(bio1,k=5) | 0.88 | 0.32 | 282 | PA |
| 43 | dataY ~ s(bio6,k=5) + s(bio5,k=5) + s(bio12,k=5) | 0.86 | 0.34 | 302 | PA |
| 15 | dataY ~ bio12 + bio6 + poly(bio5,3) | 0.86 | 0.34 | 308 | PAGLM |
| 16 | dataY ~ bio12 + bio6 + poly(bio5,2) | 0.86 | 0.34 | 310 | PAGLM |
| 27 | dataY ~ ns(bio7, df=2) + ns(bio5, df=3) + ns(bio8, df=2) | 0.84 | 0.35 | 320 | PAGLMns |
| 23 | dataY ~ ns(bio7, df=2) + ns(bio5, df=3) | 0.84 | 0.35 | 320 | PAGLMns |

The covariates selected in all models not statistically different from the best one may be explored to choose the best model by expertise. Covariates are ordered by their frequency of occurrence in the selected models. You may want to choose between the best model and the model containing the most frequently selected covariates.

## Predictions of the best model

We can choose one model, here the best one, and create a set of figure outputs to explore what it predicts. Figures are saved in saveWD and in its compressed version (whose path is the output of the function, here stored in res.file).

Num.Best <- BestModels$VeryBestModels_crossV$Num[1]
res.file <- ModelResults(saveWD = tmpdir, plot = TRUE,
                         Num = Num.Best)

In the case of a presence-absence model, the outputs are the following.

• Analysis of variance. Deviance explained by each covariate when added sequentially, in the order selected by the cross-validation procedure. “%Exp.Dev” stands for the percentage of explained deviance. “AUC” is the mean AUC on the validation datasets as issued from the cross-validation procedure, “>Diff” being the difference in cross-validation AUC from the previous step.

• Marginal predictions by covariate. These are simplified marginal effects of each covariate over the range of observations. All covariate values except one are fixed so that their combination gives a prediction close to the mean of the observations. These simplified marginal figures should be read for their relative effect, not for their absolute prediction.

• Comparison of predictions against observations, and proposed threshold values to separate presences from absences. The best threshold value is based on the cross-validation outputs: it is the value closest to the point where sensitivity equals specificity.

• Once the best model has been selected, it is used for prediction. All data are used to fit the model, and predictions are made on the same dataset. For each observation, the model predicts a probability of presence between 0 and 1, although observations were 0 or 1. The figure is a “pirateplot” from package {yarrr}. It shows the distribution of predictions for the best model, separating observed presences from absences.
• The bean plot is a smoothed density (inspired by Kampstra et al. (2008)).
• The bar/line centre shows the mean of the distribution.
• The band represents an inference interval around the mean: a Bayesian Highest Density Interval.
• Other indices have been calculated during the cross-validation procedure.
• The best THD is the best threshold value to separate presences from absences given the predictive quality of the model. Indeed, if the cross-validation built 10 times 10-fold cross-validation models, then there are 100 validations. For each of the 100 models, the best threshold is the value on the ROC curve that balances sensitivity and specificity of the predictions. The best THD on the figure is the average of the 100 cross-validation thresholds. (A best THD close to 0.5 indicates a well-balanced dataset with trustworthy predictions; a lower (resp. higher) value indicates an imbalance towards absences (resp. presences) in the combination of covariates selected.)
• Similarly, MaxZero is the average of the 100 cross-validation thresholds calculated as the threshold below which 95% of absences are correctly classified as absences. (The closer to the best THD, the better.)
• Similarly, MaxUn is the average of the 100 cross-validation thresholds calculated as the threshold above which 95% of presences are correctly classified as presences. (The closer to the best THD, the better.)
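To make the sensitivity-equals-specificity idea concrete, here is a generic sketch (plain R, not SDMSelect code; the function name and data are invented for illustration) of finding such a threshold from 0/1 observations and predicted probabilities:

```r
# Generic illustration: find the threshold where sensitivity
# (true positive rate) is closest to specificity (true negative rate).
best_threshold <- function(obs, pred) {
  thds <- seq(0.01, 0.99, by = 0.01)
  diffs <- sapply(thds, function(t) {
    sens <- mean(pred[obs == 1] >= t)  # presences correctly classified
    spec <- mean(pred[obs == 0] < t)   # absences correctly classified
    abs(sens - spec)
  })
  thds[which.min(diffs)]
}

# Toy data: three absences and three presences
obs  <- c(0, 0, 0, 1, 1, 1)
pred <- c(0.05, 0.20, 0.35, 0.30, 0.60, 0.90)
best_threshold(obs, pred)
```

In the vignette's procedure this computation is repeated for each of the 100 cross-validation models, and the reported best THD is the average of the 100 resulting thresholds.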

## Species distribution mapping

### Probability of presence

The outputs of the best model can be used to predict species distribution. The model selected is coupled with the raster of covariates to predict a probability of presence in each cell of the raster. Function Map_predict returns a RasterStack with these predictions.

pred.r <- Map_predict(object = covariates, saveWD = tmpdir, Num = Num.Best)
# Load the pre-computed predictions shipped with the package
predr <- system.file("SDM_Selection", "predr.tif", package = "SDMSelect")
pred.r <- raster::stack(predr)
predrNames <- system.file("SDM_Selection", "predrNames.rds", package = "SDMSelect")
names(pred.r) <- readr::read_rds(predrNames)

The first output of interest is the first layer of the RasterStack, resp.fit: the predicted probability of presence. SDMSelect provides gplot_data, a modified version of rasterVis::gplot, which prepares (part of) the raster data for ggplot visualisation instead of visualising it directly. This may be useful in some situations.

# Get partial raster data to plot in ggplot
pred.r.gg <- gplot_data(pred.r)
# Plot
ggplot() +
  geom_tile(
    data = dplyr::filter(pred.r.gg, variable == "resp.fit"),
    aes(x = x, y = y, fill = value)) +
  scale_fill_gradient("Probability", low = 'yellow', high = 'blue') +
  coord_equal()

### Uncertainties

The function Map_predict also calculates different indices for assessing the uncertainty of the predictions, for instance the minimum (quantile = 5%) and maximum (quantile = 95%) of possible predictions. (These are based on quantiles of the predictions on the scale of the link function, then transformed back to the scale of the response.)
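As an illustration of that back-transformation (a generic sketch with invented example values, not SDMSelect internals), quantile bounds of a Gaussian interval on the logit link scale can be mapped back to probabilities:

```r
# Generic sketch: quantile bounds on the logit (link) scale,
# back-transformed to the response (probability) scale.
fit.link <- 0.3   # example linear predictor value
se.link  <- 0.5   # example standard error on the link scale

Q5  <- plogis(fit.link + qnorm(0.05) * se.link)  # lower bound
Q95 <- plogis(fit.link + qnorm(0.95) * se.link)  # upper bound
c(Q5 = Q5, Q95 = Q95)
```

Computing the interval on the link scale guarantees that both bounds remain valid probabilities between 0 and 1 after the inverse-link transformation.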

rasterVis::gplot(raster::dropLayer(pred.r, which(!names(pred.r) %in% c("Q5", "Q95")))) +
  geom_tile(aes(fill = value)) +
  facet_grid(~variable) +
  scale_fill_gradient("Probability", low = 'yellow', high = 'blue') +
  coord_equal()

The inter-quartile range (IQR) and the IQR divided by the median (IQR.M) give an idea of the absolute and relative uncertainty of the predictions, respectively. They play the same role as the standard deviation and the coefficient of variation, but are more appropriate here because the predictions are not Gaussian-shaped.

rasterVis::gplot(raster::dropLayer(pred.r, which(!names(pred.r) %in% c("IQR")))) +
  geom_tile(aes(fill = value)) +
  facet_grid(~variable) +
  scale_fill_gradient("Absolute\nDispersion", low = 'white', high = 'red') +
  coord_equal()

rasterVis::gplot(raster::dropLayer(pred.r, which(!names(pred.r) %in% c("IQR.M")))) +
  geom_tile(aes(fill = value)) +
  facet_grid(~variable) +
  scale_fill_gradient("Relative\nDispersion\nto median", low = 'white', high = 'red') +
  coord_equal()

### Separating presence from absences

The best threshold value to separate presences from absences may not be 0.5 when the presence/absence data are not balanced across the covariates selected. The best threshold value (BestThd) is shown above in the plot comparing observations and predictions. We can retrieve this value with the function model_select.

model_selected <- model_select(
  saveWD = tmpdir,
  new.data = data.new,
  Num = Num.Best)
BestThd <- model_selected$Seuil
# Load the pre-computed threshold shipped with the package
BestThdFile <- system.file("SDM_Selection", "BestThd.rds", package = "SDMSelect")
BestThd <- readr::read_rds(BestThdFile)

A better representation of the probabilities of presence is a double colour gradient centred on the best threshold value (thd = 0.24).

rasterVis::gplot(raster::raster(pred.r, "resp.fit")) +
  geom_tile(aes(fill = value)) +
  scale_fill_gradient2(low = 'white', mid = 'yellow', high = 'blue',
                       midpoint = BestThd) +
  coord_equal()

Considering the uncertainty of the predictions, another output is the probability (%) of being over the best threshold value (thd = 0.24). This gives an idea of the risk of predicting a presence, given the uncertainty on the probability of presence and the best threshold value.
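One generic way to obtain such a quantity (a sketch with invented example values, not necessarily how Map_predict computes its ProbaSup layer) is to simulate predictions from the uncertainty on the link scale and count how often they exceed the threshold:

```r
# Generic sketch: probability that the prediction exceeds the threshold,
# given Gaussian uncertainty on the logit (link) scale.
set.seed(42)
fit.link <- 0.3   # example linear predictor value
se.link  <- 0.5   # example standard error on the link scale
BestThd  <- 0.24  # example threshold

sims <- plogis(rnorm(10000, mean = fit.link, sd = se.link))
ProbaSup <- 100 * mean(sims > BestThd)  # % of simulations over threshold
```

A cell with ProbaSup close to 100% is predicted as a presence with little risk; values near 50% flag cells where the uncertainty makes the presence/absence call unreliable.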

rasterVis::gplot(raster::raster(pred.r, "ProbaSup")) +
  geom_tile(aes(fill = value)) +
  coord_equal()

Finally, a model should not extrapolate outside the range of the data. The Map_predict function retrieves the range of each covariate in the dataset used for modelling and creates a raster mask for each covariate wherever values fall outside the range of the observations. In the present example, the masks are not very meaningful because the (pseudo-)absence data cover the entire area.

rasterVis::gplot(raster::dropLayer(pred.r, which(!grepl("mask", names(pred.r))))) +
  geom_tile(aes(fill = value)) +
  facet_grid(~variable) +
  coord_equal()