Prepare a dataset for inclusion in the model selection procedure

Specifies the variable to be predicted and the covariates, and optionally resamples the dataset onto a regular grid to reduce spatial autocorrelation.

Prepare_dataset(
  x,
  var = 1,
  cov = 2:ncol(x),
  coords,
  proj4string = NULL,
  RefRaster = NULL,
  datatype = "Cont",
  na.rm = TRUE
)

Arguments

x	data.frame or SpatialPointsDataFrame of observations with covariates
var	column name or number of the variable to be predicted
cov	column names or numbers of the variables to be used as covariates
coords	names or numbers of the columns holding the x, y coordinates, or a 2-column matrix of x-y coordinates. If missing, the output object is a simple data.frame, unless x is a SpatialPointsDataFrame.
proj4string	projection of the dataset. Defaults to the projection of x if it exists, otherwise to '+proj=longlat +datum=WGS84'.
RefRaster	raster object as in library(raster), used as the smallest grid for the data gridding procedure. The dataset is resampled onto this regular grid to decrease spatial autocorrelation. See Details.
datatype	string. Choose among "Cont" (continuous data, even if only positive), "PA" (presence-absence data) or "Count" (count data). Required when RefRaster is not NULL. For count data, the dataY averages computed during regular grid resampling are rounded.
na.rm	logical. If TRUE, removes every row of the dataset that contains an NA value. NA values are a serious problem for further analysis, in particular for the cross-validation procedure: they bias cross-validation indices, because the indices are not calculated on the same amount of data depending on which covariates with NA are in the model tested. Thus, although this is drastic, rows of data with NA values are removed from the dataset with a warning.
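
The effect described for na.rm can be illustrated with base R alone. This is a minimal sketch under stated assumptions, not the function's actual code: the data frame `obs` and its columns `abundance`, `depth` and `temp` are hypothetical.

```r
# Hypothetical observations; column names are illustrative only
obs <- data.frame(
  abundance = c(3, 0, 5, 2, 7),
  depth     = c(10, NA, 25, 30, 12),
  temp      = c(14.2, 13.8, NA, 15.1, 14.9)
)

# A model using only 'depth' and a model using only 'temp' would each
# drop a different row, so their indices would not be comparable
n_depth <- sum(!is.na(obs$depth))
n_temp  <- sum(!is.na(obs$temp))

# Keep only rows without any NA, so every candidate model sees the same data
keep <- stats::complete.cases(obs)
if (any(!keep)) {
  warning(sum(!keep), " row(s) with NA values removed")
}
obs_complete <- obs[keep, , drop = FALSE]
```

After this step every candidate model, whatever covariates it uses, is fitted and validated on the same rows.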

Value

Returns a dataset in which the variable of interest is named 'dataY' (for compatibility with other functions; to be changed later to allow a user-defined name) and in which only the covariates of interest are kept. The prefix 'factor_' is added to the names of covariates that are factors (for compatibility with other functions).
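
As a rough picture of the returned column layout, the renaming can be mimicked in base R. This sketch is not the package's implementation: the data frame `obs` and its columns are hypothetical, and it assumes 'factor_' is prepended to the original covariate name.

```r
# Hypothetical input; 'observer' is deliberately not selected as a covariate
obs <- data.frame(
  abundance = c(3, 0, 5, 2),
  depth     = c(10, 18, 25, 30),
  sediment  = factor(c("mud", "sand", "sand", "gravel")),
  observer  = c("A", "B", "A", "C")
)
var <- "abundance"
cov <- c("depth", "sediment")

# Keep the response and the selected covariates only
out <- obs[, c(var, cov), drop = FALSE]

# Flag factor covariates before renaming anything
is_fac <- vapply(out[cov], is.factor, logical(1))

# Response becomes 'dataY'; factor covariates get the 'factor_' prefix (assumed)
names(out) <- c("dataY", unname(ifelse(is_fac, paste0("factor_", cov), cov)))
names(out)
```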

Details

RefRaster recommendation: If there is spatial autocorrelation in the data, use the highest-resolution covariate raster as a reference, or use spatialcor_dist to determine the smallest resolution to choose. If the sampling plan is assumed to be balanced with respect to the distribution of covariates, set RefRaster to NULL. Also set RefRaster to NULL if the distributions tested in the models will be KrigeGLM or KrigeGLM.dist (see AIC_indices or crossV_indices).
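
The idea behind gridding can be sketched in base R: observations falling in the same cell are averaged into one, and the average is rounded for count data. This is only an illustration under stated assumptions, not the procedure used by the function: the grid is emulated with floor() on the coordinates rather than with a RefRaster, and `pts`, `res`, `cell_col` and `cell_row` are hypothetical names.

```r
# Hypothetical count observations with x-y coordinates
pts <- data.frame(
  x     = c(0.2, 0.7, 1.4, 1.6, 1.9, 3.1),
  y     = c(0.3, 0.8, 0.2, 0.4, 0.9, 2.5),
  dataY = c(2, 6, 1, 4, 3, 7)
)
res <- 1  # hypothetical cell size, standing in for the RefRaster resolution

# Index of the grid cell containing each point
pts$cell_col <- floor(pts$x / res)
pts$cell_row <- floor(pts$y / res)

# One averaged observation per occupied cell
grid <- aggregate(cbind(x, y, dataY) ~ cell_col + cell_row,
                  data = pts, FUN = mean)

# For datatype "Count", cell means are rounded to stay integer-valued
grid$dataY <- round(grid$dataY)
```

Six points collapse to three gridded observations here, which is how a coarser reference grid reduces the number of closely spaced, autocorrelated samples.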