smile.validation

Interface CrossValidation

• public interface CrossValidation
Cross-validation is a technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
• Method Detail

• of

static Bag[] of(int n,
int k)
Creates a k-fold cross validation.
Parameters:
n - the number of samples.
k - the number of rounds of cross validation.
Returns:
k-fold data splits.
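To make the split concrete, here is a minimal plain-Java sketch of one common way to build k disjoint validation folds: fold i holds every index j with j % k == i, so each sample is validated exactly once. The class name `KFoldSketch` is hypothetical and this is not Smile's internal code; the real method returns `Bag[]`, where each bag pairs a round's training and validation indices.

```java
import java.util.Arrays;

// Hypothetical sketch of k-fold index splitting; not Smile's implementation.
public class KFoldSketch {
    /** Returns k arrays of validation indices that cover 0..n-1 exactly once. */
    public static int[][] folds(int n, int k) {
        int[][] validation = new int[k][];
        for (int i = 0; i < k; i++) {
            int size = (n - i + k - 1) / k;        // ceil((n - i) / k)
            validation[i] = new int[size];
            for (int j = 0; j < size; j++) {
                validation[i][j] = i + j * k;      // indices i, i+k, i+2k, ...
            }
        }
        return validation;
    }

    public static void main(String[] args) {
        // 10 samples, 3 folds: [0, 3, 6, 9], [1, 4, 7], [2, 5, 8]
        for (int[] fold : folds(10, 3)) {
            System.out.println(Arrays.toString(fold));
        }
    }
}
```

The training set of round i is simply the complement of the i-th validation array.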
• of

static Bag[] of(int n,
int k,
boolean shuffle)
Creates a k-fold cross validation.
Parameters:
n - the number of samples.
k - the number of rounds of cross validation.
shuffle - whether to shuffle samples before splitting.
Returns:
k-fold data splits.
• of

static Bag[] of(int[] category,
int k)
Cross validation with stratified folds. The folds are made by preserving the percentage of samples for each group.
Parameters:
category - the strata labels.
k - the number of folds.
Returns:
k-fold data splits.
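A hedged sketch of the stratification idea (the class name `StratifiedSketch` is hypothetical; this is not Smile's code): deal the samples of each class round-robin across the k folds, so every fold keeps roughly the class proportions of the whole sample.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of stratified fold assignment; not Smile's implementation.
public class StratifiedSketch {
    /** Returns fold[i] = the fold assigned to sample i, round-robin per class. */
    public static int[] assign(int[] category, int k) {
        int[] fold = new int[category.length];
        Map<Integer, Integer> next = new HashMap<>();  // next fold for each class
        for (int i = 0; i < category.length; i++) {
            int f = next.getOrDefault(category[i], 0);
            fold[i] = f;
            next.put(category[i], (f + 1) % k);
        }
        return fold;
    }

    public static void main(String[] args) {
        int[] y = {0, 0, 0, 0, 1, 1, 1, 1};
        // each of the 2 folds receives two samples of each class
        System.out.println(Arrays.toString(assign(y, 2)));   // [0, 1, 0, 1, 0, 1, 0, 1]
    }
}
```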
• nonoverlap

static Bag[] nonoverlap(int[] group,
int k)
Cross validation with non-overlapping groups. The same group will not appear in two different folds (the number of distinct groups has to be at least equal to the number of folds). The folds are approximately balanced in the sense that the number of distinct groups is approximately the same in each fold.

This is useful when the i.i.d. assumption is known to be broken by the underlying process generating the data. For example, when we have multiple samples by the same user and want to make sure that the model doesn't learn user-specific features that don't generalize to unseen users, this approach could be used.

Parameters:
group - the group labels of the samples.
k - the number of folds.
Returns:
k-fold data splits.
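A minimal sketch of the group constraint (the class name `GroupFoldSketch` is hypothetical; this is not Smile's code): map each distinct group to exactly one fold, so no group can sit on both sides of a split. The user-ID example from the paragraph above is used as the demo data.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of group-aware folds; not Smile's implementation.
public class GroupFoldSketch {
    /** Returns fold[i] = fold of sample i; samples sharing a group share a fold. */
    public static int[] assign(int[] group, int k) {
        Map<Integer, Integer> groupFold = new LinkedHashMap<>();
        int seen = 0;
        for (int g : group) {
            if (!groupFold.containsKey(g)) {
                groupFold.put(g, seen % k);   // round-robin over distinct groups
                seen++;
            }
        }
        int[] fold = new int[group.length];
        for (int i = 0; i < group.length; i++) {
            fold[i] = groupFold.get(group[i]);
        }
        return fold;
    }

    public static void main(String[] args) {
        int[] userId = {10, 10, 20, 20, 30, 30, 40};
        // no user appears in more than one fold
        System.out.println(Arrays.toString(assign(userId, 2)));   // [0, 0, 1, 1, 0, 0, 1]
    }
}
```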
• classification

static <T,M extends Classifier<T>> ClassificationValidations<M> classification(int k,
T[] x,
int[] y,
java.util.function.BiFunction<T[],int[],M> trainer)
Runs classification cross validation.
Parameters:
k - k-fold cross validation.
x - the samples.
y - the sample labels.
trainer - the lambda to train a model.
Returns:
the validation results.
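The loop this method runs can be sketched as follows (the class name `CvLoopSketch` is hypothetical, and the trivial threshold trainer exists only to keep the example self-contained; neither is part of Smile): for each fold, train on the complement via the `BiFunction` trainer and score on the held-out samples.

```java
import java.util.function.BiFunction;

// Hypothetical sketch of the cross-validation loop; not Smile's implementation.
public class CvLoopSketch {
    /** Minimal stand-in for a classifier; Smile's Classifier<T> is richer. */
    interface Model { int predict(double x); }

    /** Overall accuracy of k-fold cross validation with the given trainer. */
    public static double accuracy(int k, double[] x, int[] y,
                                  BiFunction<double[], int[], Model> trainer) {
        int n = x.length;
        int correct = 0;
        for (int fold = 0; fold < k; fold++) {
            int validSize = (n - fold + k - 1) / k;
            double[] tx = new double[n - validSize];
            int[] ty = new int[n - validSize];
            int t = 0;
            for (int j = 0; j < n; j++) {
                if (j % k != fold) { tx[t] = x[j]; ty[t] = y[j]; t++; }
            }
            Model model = trainer.apply(tx, ty);   // train on the complement
            for (int j = fold; j < n; j += k) {    // score the held-out fold
                if (model.predict(x[j]) == y[j]) correct++;
            }
        }
        return (double) correct / n;
    }

    public static void main(String[] args) {
        double[] x = {0, 1, 2, 3, 4, 5};
        int[] y = {0, 0, 0, 1, 1, 1};
        // toy trainer: fixed threshold classifier that ignores its training data
        BiFunction<double[], int[], Model> trainer = (tx, ty) -> v -> v > 2.5 ? 1 : 0;
        System.out.println(accuracy(3, x, y, trainer));   // 1.0 on this toy data
    }
}
```

The real method additionally aggregates per-round metrics into a `ClassificationValidations` object rather than a single number.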
• classification

static <M extends DataFrameClassifier> ClassificationValidations<M> classification(int k,
Formula formula,
DataFrame data,
java.util.function.BiFunction<Formula,DataFrame,M> trainer)
Runs classification cross validation.
Parameters:
k - k-fold cross validation.
formula - the model specification.
data - the training/validation data.
trainer - the lambda to train a model.
Returns:
the validation results.
• regression

static <T,M extends Regression<T>> RegressionValidations<M> regression(int k,
T[] x,
double[] y,
java.util.function.BiFunction<T[],double[],M> trainer)
Runs regression cross validation.
Parameters:
k - k-fold cross validation.
x - the samples.
y - the response variable.
trainer - the lambda to train a model.
Returns:
the validation results.
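To illustrate the averaging over rounds described in the introduction, a hedged sketch (the class name `RegCvSketch` is hypothetical; this is not Smile's code) that reports the mean of the per-fold RMSE across the k rounds:

```java
import java.util.function.BiFunction;

// Hypothetical sketch of regression cross validation; not Smile's implementation.
public class RegCvSketch {
    /** Minimal stand-in for a regression model; Smile's Regression<T> is richer. */
    interface Model { double predict(double x); }

    /** Mean of the per-fold RMSE over k rounds of cross validation. */
    public static double rmse(int k, double[] x, double[] y,
                              BiFunction<double[], double[], Model> trainer) {
        int n = x.length;
        double sum = 0;
        for (int fold = 0; fold < k; fold++) {
            int validSize = (n - fold + k - 1) / k;
            double[] tx = new double[n - validSize];
            double[] ty = new double[n - validSize];
            int t = 0;
            for (int j = 0; j < n; j++) {
                if (j % k != fold) { tx[t] = x[j]; ty[t] = y[j]; t++; }
            }
            Model model = trainer.apply(tx, ty);
            double se = 0;
            for (int j = fold; j < n; j += k) {
                double e = model.predict(x[j]) - y[j];
                se += e * e;
            }
            sum += Math.sqrt(se / validSize);
        }
        return sum / k;   // validation results averaged over the k rounds
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4};
        double[] y = {2, 4, 6, 8};   // y = 2x exactly
        // toy trainer: estimate the slope from the first training pair
        BiFunction<double[], double[], Model> trainer =
            (tx, ty) -> { double slope = ty[0] / tx[0]; return v -> slope * v; };
        System.out.println(rmse(2, x, y, trainer));   // 0.0: the toy trainer recovers y = 2x
    }
}
```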
• regression

static <M extends DataFrameRegression> RegressionValidations<M> regression(int k,
Formula formula,
DataFrame data,
java.util.function.BiFunction<Formula,DataFrame,M> trainer)
Runs regression cross validation.
Parameters:
k - k-fold cross validation.
formula - the model specification.
data - the training/validation data.
trainer - the lambda to train a model.
Returns:
the validation results.