R/IterCrossV_functions.R
best_distri.Rd
A function to rank distributions (of the same length) and statistically compare them to the best one
best_distri(x, w, test = c("wilcoxon", "vanderWaerden", "median", "KruskalWallis"), na.max = 0.5, p.min = 0.01, silent = TRUE, cl = NULL)
x | Typically a matrix where rows are different distributions of the same length, to be compared with paired tests
w | vector of weights with the same length as ncol(x), for the case where the outputs do not have the same weight. Used for weighted.mean and for the p-value calculation.
test | test used to compare the distributions, as used by svyranktest from the survey package
na.max | maximum proportion of NA values allowed in one distribution. If the proportion of NA is above na.max, the model is ranked last and no p-value is calculated
p.min | minimum p-value under which the order of the distributions is not important because the following distributions will not be kept. If set, when a p-value is lower than p.min, the distribution is considered significantly "worse" than the best one; the remaining distributions are ordered according to their mean and their p-values are not calculated
silent | Logical. Whether to show the percentage remaining or not
cl | a cluster, e.g. as created with parallel::makeCluster
orderModels: numbers of the columns of x, re-ordered from best to worst
p.values: p-values of the difference between each distribution and the best one, in the same order as orderModels
p.min.test: Logical. FALSE if the distribution is ordered after the first distribution with a p-value lower than p.min. Indeed, large distributions with high outliers may not be significantly different from distribution 1.
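A minimal usage sketch (the goodness-of-fit values, model names, and fold weights below are invented for illustration, and it assumes the returned components can be accessed by name):

```r
## 4 models (rows) evaluated on 10 cross-validation folds (columns)
set.seed(1)
gof <- rbind(
  model_A = rnorm(10, mean = 0.70, sd = 0.05),
  model_B = rnorm(10, mean = 0.68, sd = 0.05),
  model_C = rnorm(10, mean = 0.55, sd = 0.05),
  model_D = rnorm(10, mean = 0.50, sd = 0.05)
)
## one weight per fold, e.g. proportional to the validation-set sizes
w <- c(12, 10, 11, 9, 10, 12, 8, 10, 11, 9)

res <- best_distri(gof, w = w, test = "wilcoxon", p.min = 0.01)
res$orderModels  # distributions re-ordered from best to worst
res$p.values     # p-values of the comparison with the best distribution
res$p.min.test   # FALSE after the first p-value below p.min
```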
This function has been developed to compare indices of goodness of fit calculated after a cross-validation procedure. The best distribution is the one that is, on average, the best over all cross-validation sub-samples. The best average, however, may hide extreme values due to particular crossV samples (chosen randomly). Distributions are therefore compared statistically to the best one with a paired test. Because the k-fold procedure may return folds of different lengths, the weight of each fold may be corrected with the w parameter.
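For instance, with folds of unequal size, the fold sizes can serve as the weights and the ranking criterion is a weighted mean per distribution (an illustrative sketch; values and fold sizes are invented, and whether higher is better depends on the goodness-of-fit index used):

```r
## 3 models (rows) x 4 folds (columns) of unequal size
gof <- matrix(runif(12), nrow = 3,
              dimnames = list(paste0("model_", 1:3), paste0("fold_", 1:4)))
fold_sizes <- c(25, 25, 25, 28)   # validation-set sizes used as weights w

## weighted mean per distribution, used to pick the "best" one on average
w_means <- apply(gof, 1, weighted.mean, w = fold_sizes)
sort(w_means, decreasing = TRUE)  # assuming a higher index means a better fit
```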
Because the values compared do not necessarily follow a normal distribution, t.test is not the best mean-comparison test. The Wilcoxon test does not require normality and is thus more appropriate here.
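For example, a paired, unweighted comparison of two such distributions (fold-wise scores invented):

```r
## goodness-of-fit of two models over the same 8 folds
a <- c(0.72, 0.70, 0.69, 0.74, 0.71, 0.68, 0.73, 0.70)
b <- c(0.69, 0.66, 0.70, 0.71, 0.68, 0.64, 0.70, 0.67)

wilcox.test(a, b, paired = TRUE)  # no normality assumption
t.test(a, b, paired = TRUE)       # assumes normally distributed differences
```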
The sizes of the validation sets are not equal, in particular if there are factor covariates. The Wilcoxon test is thus weighted accordingly.
The only weighted Wilcoxon test is provided by library(survey); see svyranktest.
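A hedged sketch of such a weighted rank test with survey::svyranktest (the data and weights below are invented; this only illustrates the call, not the exact way best_distri uses it internally):

```r
library(survey)

## two distributions stacked in long format, one weight per fold observation
set.seed(2)
d <- data.frame(
  value = c(rnorm(10, 0.70, 0.05), rnorm(10, 0.65, 0.05)),
  group = rep(c("best", "other"), each = 10),
  w     = rep(c(12, 10, 11, 9, 10, 12, 8, 10, 11, 9), times = 2)
)

des <- svydesign(ids = ~1, weights = ~w, data = d)
svyranktest(value ~ group, design = des, test = "wilcoxon")
```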