R/DataPreparation.R
Prepare_dataset.Rd
Specifies the variable to be predicted and its covariates, and optionally resamples the dataset onto a regular grid to reduce spatial autocorrelation.
Prepare_dataset( x, var = 1, cov = 2:ncol(x), coords, proj4string = NULL, RefRaster = NULL, datatype = "Cont", na.rm = TRUE )
x | data.frame or SpatialPointsDataFrame of observations with covariates |
---|---|
var | column name or number of the variable to be predicted |
cov | column names or numbers to be used as covariates |
coords | names or numbers of the columns with the x, y coordinates, or a 2-column matrix of x-y coordinates. If missing, the output object will be a simple data.frame, unless x is a SpatialPointsDataFrame. |
proj4string | projection of the dataset. Defaults to the projection of x if it exists, or to '+proj=longlat +datum=WGS84'. |
RefRaster | raster object, as in library(raster), used as the smallest grid for the data gridding procedure. The dataset is resampled onto this regular grid to decrease spatial autocorrelation. See details. |
datatype | string. One of "Cont" (continuous data, even if only positive), "PA" (presence-absence data) or "Count" (count data). Required when RefRaster is not NULL. For count data, the dataY averages computed during regular-grid resampling are rounded. |
na.rm | logical. If TRUE, rows of the dataset that contain NA values are removed entirely. NA values are a serious problem for further analysis, in particular for the cross-validation procedure: they bias cross-validation indices, because the indices are not calculated on the same amount of data depending on which covariates with NA are in the model tested. Thus, although this is drastic, rows with NA values are removed from the dataset with a warning. |
Returns a dataset in which the variable of interest is called 'dataY' (for compatibility with other functions; to be changed later to allow a user-defined name) and in which only the covariates of interest are kept. 'factor_' is added to the names of covariates that are factors (for compatibility with other functions).
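
The row removal and output naming described above can be illustrated in base R. This is only a sketch of the idea, not the package's implementation: the data frame `obs` and its column names are hypothetical, and adding 'factor_' as a prefix is an assumption about how the name is modified.

```r
# Hypothetical observations: a response, a numeric covariate, a factor covariate
obs <- data.frame(
  biomass  = c(1.2, 3.4, NA, 2.2, 5.1),
  depth    = c(10, 20, 30, NA, 50),
  sediment = factor(c("mud", "sand", "sand", "mud", "rock"))
)

# Keep only rows without any NA, so that every candidate model
# would later be compared on exactly the same observations
complete <- obs[stats::complete.cases(obs), , drop = FALSE]
n_removed <- nrow(obs) - nrow(complete)
if (n_removed > 0) {
  message(n_removed, " rows with NA values were removed")
}

# Rename the response to 'dataY' (assumed to mirror the documented output)
names(complete)[names(complete) == "biomass"] <- "dataY"

# Flag factor covariates in their names (prefix position is an assumption)
is_fac <- vapply(complete, is.factor, logical(1))
names(complete)[is_fac] <- paste0("factor_", names(complete)[is_fac])
```

Here two of the five rows contain an NA, so three rows remain, and the factor column is the only one whose name is changed.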
RefRaster recommendation: if there is spatial autocorrelation in the data, use the covariate raster with the highest resolution as a reference, or use spatialcor_dist to determine the smallest resolution to choose. If the sampling plan is assumed to be balanced with respect to the distribution of covariates, set RefRaster to NULL. Also set it to NULL if the distribution tested in the models will be KrigeGLM or KrigeGLM.dist (see AIC_indices or crossV_indices).
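
The general idea of resampling observations onto a regular grid can be sketched in base R. This is an illustration under stated assumptions, not the function's actual code: `pts`, `cell_size` (a stand-in for the resolution of RefRaster) and the choice of averaging coordinates within each cell are all hypothetical. Only the rounding of averaged count data is taken from the datatype description.

```r
# Hypothetical point observations with coordinates and a count response
pts <- data.frame(
  x     = c(0.1, 0.4, 0.6, 1.2, 1.7, 1.9),
  y     = c(0.2, 0.3, 0.9, 0.1, 0.5, 0.8),
  dataY = c(2, 3, 8, 1, 2, 4)
)
cell_size <- 1       # assumed grid resolution, standing in for RefRaster
datatype  <- "Count"

# Index of the grid cell that contains each point
pts$cell_x <- floor(pts$x / cell_size)
pts$cell_y <- floor(pts$y / cell_size)

# Average coordinates and response over all points falling in the same cell
gridded <- aggregate(cbind(x, y, dataY) ~ cell_x + cell_y,
                     data = pts, FUN = mean)

# Averages of counts are generally not integers, so round them
if (datatype == "Count") {
  gridded$dataY <- round(gridded$dataY)
}
```

With these values the six points collapse to two grid cells, so nearby observations no longer enter a model as separate, spatially correlated data points.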