Package smile.validation
Interface CrossValidation
public interface CrossValidation
Cross-validation is a technique for assessing how the results of a
statistical analysis will generalize to an independent data set.
It is mainly used in settings where the goal is prediction, and one
wants to estimate how accurately a predictive model will perform in
practice. One round of cross-validation involves partitioning a sample
of data into complementary subsets, performing the analysis on one subset
(called the training set), and validating the analysis on the other subset
(called the validation set or testing set). To reduce variability, multiple
rounds of cross-validation are performed using different partitions, and the
validation results are averaged over the rounds.
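The procedure described above can be sketched in plain Java. This is a minimal illustration, not Smile's implementation: the class `CrossValidationOverview`, its methods, the constant "mean" model, and the rule that round `f` holds out every sample whose index satisfies `i % k == f` are all assumptions made for the example.

```java
import java.util.stream.IntStream;

/** Illustrative sketch only; names and fold assignment are hypothetical. */
class CrossValidationOverview {
    /** "Performs the analysis" on the training set: here, just the mean response. */
    static double fitMean(double[] y, int[] trainIdx) {
        double sum = 0.0;
        for (int i : trainIdx) sum += y[i];
        return sum / trainIdx.length;
    }

    /** Validates on the held-out set: mean squared error of the constant prediction. */
    static double mse(double[] y, int[] testIdx, double prediction) {
        double sum = 0.0;
        for (int i : testIdx) {
            double e = y[i] - prediction;
            sum += e * e;
        }
        return sum / testIdx.length;
    }

    /** Runs k rounds on complementary subsets and averages the validation results. */
    static double crossValidate(double[] y, int k) {
        int n = y.length;
        double total = 0.0;
        for (int f = 0; f < k; f++) {
            final int fold = f;
            // Complementary subsets: the validation set and the training set.
            int[] test = IntStream.range(0, n).filter(i -> i % k == fold).toArray();
            int[] train = IntStream.range(0, n).filter(i -> i % k != fold).toArray();
            double model = fitMean(y, train);
            total += mse(y, test, model);
        }
        return total / k;
    }
}
```

For example, `CrossValidationOverview.crossValidate(new double[]{1, 2, 3, 4}, 2)` averages the two per-round errors into a single estimate.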
-
Method Summary
Modifier and Type, Method
    Description
static <M extends DataFrameClassifier> ClassificationValidations<M> classification(int round, int k, Formula formula, DataFrame data, BiFunction<Formula, DataFrame, M> trainer)
    Repeated cross validation of classification.
static <T, M extends Classifier<T>> ClassificationValidations<M> classification(int round, int k, T[] x, int[] y, BiFunction<T[], int[], M> trainer)
    Repeated cross validation of classification.
static <M extends DataFrameClassifier> ClassificationValidations<M> classification(int k, Formula formula, DataFrame data, BiFunction<Formula, DataFrame, M> trainer)
    Cross validation of classification.
static <T, M extends Classifier<T>> ClassificationValidations<M> classification(int k, T[] x, int[] y, BiFunction<T[], int[], M> trainer)
    Cross validation of classification.
static Bag[] nonoverlap(int[] group, int k)
    Cross validation with non-overlapping groups.
static Bag[] of(int n, int k)
    Creates a k-fold cross validation.
static <M extends DataFrameRegression> RegressionValidations<M> regression(int round, int k, Formula formula, DataFrame data, BiFunction<Formula, DataFrame, M> trainer)
    Repeated cross validation of regression.
static <T, M extends Regression<T>> RegressionValidations<M> regression(int round, int k, T[] x, double[] y, BiFunction<T[], double[], M> trainer)
    Repeated cross validation of regression.
static <M extends DataFrameRegression> RegressionValidations<M> regression(int k, Formula formula, DataFrame data, BiFunction<Formula, DataFrame, M> trainer)
    Cross validation of regression.
static <T, M extends Regression<T>> RegressionValidations<M> regression(int k, T[] x, double[] y, BiFunction<T[], double[], M> trainer)
    Cross validation of regression.
static Bag[] stratify(int[] category, int k)
    Cross validation with stratified folds.
static <M extends DataFrameClassifier> ClassificationValidations<M> stratify(int round, int k, Formula formula, DataFrame data, BiFunction<Formula, DataFrame, M> trainer)
    Repeated stratified cross validation of classification.
static <T, M extends Classifier<T>> ClassificationValidations<M> stratify(int round, int k, T[] x, int[] y, BiFunction<T[], int[], M> trainer)
    Repeated stratified cross validation of classification.
static <M extends DataFrameClassifier> ClassificationValidations<M> stratify(int k, Formula formula, DataFrame data, BiFunction<Formula, DataFrame, M> trainer)
    Stratified cross validation of classification.
static <T, M extends Classifier<T>> ClassificationValidations<M> stratify(int k, T[] x, int[] y, BiFunction<T[], int[], M> trainer)
    Stratified cross validation of classification.
-
Method Details
-
of
static Bag[] of(int n, int k)
Creates a k-fold cross validation.
Parameters:
n - the number of samples.
k - the number of folds (equivalently, the number of rounds of cross validation).
Returns:
k-fold data splits.
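One common way to produce k-fold splits for `n` samples is to shuffle the indices and cut the permutation into `k` nearly equal chunks. The sketch below shows that idea in plain Java; it is an assumption about the general technique, not a description of how Smile builds its `Bag[]`. The class `KFoldSketch`, the `seed` parameter, and the `int[][]` return type are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative sketch only; not Smile's implementation. */
class KFoldSketch {
    /** Returns k disjoint test-index arrays that together cover 0..n-1. */
    static int[][] testFolds(int n, int k, long seed) {
        if (k < 2 || k > n) throw new IllegalArgumentException("need 2 <= k <= n");
        int[] perm = new int[n];
        for (int i = 0; i < n; i++) perm[i] = i;
        // Fisher-Yates shuffle so that folds do not depend on the sample order.
        Random rng = new Random(seed);
        for (int i = n - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        int[][] folds = new int[k][];
        int start = 0;
        for (int f = 0; f < k; f++) {
            // The first n % k folds receive one extra sample.
            int size = n / k + (f < n % k ? 1 : 0);
            folds[f] = Arrays.copyOfRange(perm, start, start + size);
            start += size;
        }
        return folds;
    }

    /** Training indices for a fold: every index that is not held out. */
    static int[] trainIndices(int n, int[] test) {
        boolean[] held = new boolean[n];
        for (int i : test) held[i] = true;
        int[] train = new int[n - test.length];
        int p = 0;
        for (int i = 0; i < n; i++) if (!held[i]) train[p++] = i;
        return train;
    }
}
```

With `n = 10` and `k = 3`, the folds have sizes 4, 3, and 3, and each fold's training set is the remaining samples.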
-
stratify
static Bag[] stratify(int[] category, int k)
Cross validation with stratified folds. The folds are made by preserving the percentage of samples for each stratum.
Parameters:
category - the strata labels.
k - the number of folds.
Returns:
k-fold data splits.
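To illustrate what "preserving the percentage of samples" means, the sketch below deals the members of each stratum to the folds in round-robin order, so every fold receives a near-equal share of every stratum. This is one simple way to stratify and is not claimed to be Smile's algorithm; the class `StratifiedFoldSketch` and its fold-id output are hypothetical, and a practical version would shuffle within each stratum first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch only; not Smile's implementation. */
class StratifiedFoldSketch {
    /** Returns, for each sample, the id in [0, k) of the fold that holds it out. */
    static int[] assignFolds(int[] category, int k) {
        // Group sample indices by stratum label.
        Map<Integer, List<Integer>> byStratum = new TreeMap<>();
        for (int i = 0; i < category.length; i++) {
            byStratum.computeIfAbsent(category[i], c -> new ArrayList<>()).add(i);
        }
        int[] fold = new int[category.length];
        // The counter carries over between strata, which also keeps fold sizes balanced.
        int next = 0;
        for (List<Integer> members : byStratum.values()) {
            for (int i : members) {
                fold[i] = next;
                next = (next + 1) % k;
            }
        }
        return fold;
    }
}
```

With six samples of one stratum, three of another, and `k = 3`, each fold ends up with two samples of the first stratum and one of the second, matching the overall 2:1 ratio.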
-
nonoverlap
static Bag[] nonoverlap(int[] group, int k)
Cross validation with non-overlapping groups. The same group will not appear in two different folds, so the number of distinct groups must be at least the number of folds. The folds are approximately balanced in the sense that each fold contains approximately the same number of distinct groups.
This is useful when the i.i.d. assumption is known to be broken by the underlying process generating the data. For example, when there are multiple samples from the same user and the model should not learn user-specific features that fail to generalize to unseen users, this approach can be used.
Parameters:
group - the group labels of the samples.
k - the number of folds.
Returns:
k-fold data splits.
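The group constraint can be sketched by assigning whole groups, rather than individual samples, to folds. The example below hands out distinct groups in round-robin order, which keeps the number of groups per fold approximately equal. It is a simplified assumption about the technique, not Smile's implementation; the class `GroupFoldSketch` and its fold-id output are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; not Smile's implementation. */
class GroupFoldSketch {
    /** Returns a fold id per sample such that all samples of a group share one fold. */
    static int[] assignFolds(int[] group, int k) {
        // Map each distinct group, in order of first appearance, to a fold.
        Map<Integer, Integer> foldOfGroup = new LinkedHashMap<>();
        for (int g : group) {
            if (!foldOfGroup.containsKey(g)) {
                foldOfGroup.put(g, foldOfGroup.size() % k);
            }
        }
        if (foldOfGroup.size() < k) {
            throw new IllegalArgumentException("fewer distinct groups than folds");
        }
        int[] fold = new int[group.length];
        for (int i = 0; i < group.length; i++) {
            fold[i] = foldOfGroup.get(group[i]);
        }
        return fold;
    }
}
```

Because a user's samples are never split between training and validation, a model cannot score well merely by memorizing that user.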
-
classification
static <T,M extends Classifier<T>> ClassificationValidations<M> classification(int k, T[] x, int[] y, BiFunction<T[], int[], M> trainer)
Cross validation of classification.
Type Parameters:
T - the data type of samples.
M - the model type.
Parameters:
k - the number of folds.
x - the samples.
y - the sample labels.
trainer - the lambda to train a model.
Returns:
the validation results.
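The role of the `trainer` lambda can be sketched as follows: for each fold, it is called with the training portion of `x` and `y`, and the model it returns is scored on the held-out portion. The code below mirrors the shape of the signature in plain Java but is not Smile's implementation. `ClassificationCvSketch`, the use of `ToIntFunction<T>` as a stand-in for a classifier, the `i % k` fold rule, and the `majorityClass` trainer (which assumes non-negative labels) are all assumptions for illustration.

```java
import java.util.Arrays;
import java.util.function.BiFunction;
import java.util.function.ToIntFunction;

/** Illustrative sketch only; not Smile's implementation. */
class ClassificationCvSketch {
    /** Returns the accuracy of each fold; fold f holds out samples with i % k == f. */
    static <T, M extends ToIntFunction<T>> double[] accuracyPerFold(
            int k, T[] x, int[] y, BiFunction<T[], int[], M> trainer) {
        int n = x.length;
        if (k < 2 || k > n) throw new IllegalArgumentException("need 2 <= k <= n");
        double[] accuracy = new double[k];
        for (int f = 0; f < k; f++) {
            int testSize = 0;
            for (int i = 0; i < n; i++) if (i % k == f) testSize++;
            // copyOf yields an array of the same runtime type as x; its contents are overwritten.
            T[] trainX = Arrays.copyOf(x, n - testSize);
            int[] trainY = new int[n - testSize];
            int p = 0;
            for (int i = 0; i < n; i++) {
                if (i % k != f) { trainX[p] = x[i]; trainY[p] = y[i]; p++; }
            }
            // The trainer only ever sees the training portion.
            M model = trainer.apply(trainX, trainY);
            int correct = 0;
            for (int i = 0; i < n; i++) {
                if (i % k == f && model.applyAsInt(x[i]) == y[i]) correct++;
            }
            accuracy[f] = (double) correct / testSize;
        }
        return accuracy;
    }

    /** Hypothetical trainer: always predicts the most frequent training label. */
    static ToIntFunction<String> majorityClass(String[] x, int[] y) {
        int max = 0;
        for (int label : y) max = Math.max(max, label);
        int[] counts = new int[max + 1];
        for (int label : y) counts[label]++;
        int best = 0;
        for (int c = 1; c < counts.length; c++) if (counts[c] > counts[best]) best = c;
        final int prediction = best;
        return sample -> prediction;
    }
}
```

A call such as `accuracyPerFold(2, x, y, ClassificationCvSketch::majorityClass)` passes the trainer as a method reference, in the same way a model-fitting function would be passed to the method documented here.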
-
classification
static <M extends DataFrameClassifier> ClassificationValidations<M> classification(int k, Formula formula, DataFrame data, BiFunction<Formula, DataFrame, M> trainer)
Cross validation of classification.
Type Parameters:
M - the model type.
Parameters:
k - the number of folds.
formula - the model specification.
data - the training/validation data.
trainer - the lambda to train a model.
Returns:
the validation results.
-
classification
static <T,M extends Classifier<T>> ClassificationValidations<M> classification(int round, int k, T[] x, int[] y, BiFunction<T[], int[], M> trainer)
Repeated cross validation of classification.
Type Parameters:
T - the data type of samples.
M - the model type.
Parameters:
round - the number of rounds of repeated cross validation.
k - the number of folds in each round.
x - the samples.
y - the sample labels.
trainer - the lambda to train a model.
Returns:
the validation results.
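Repeated cross validation runs the whole k-fold procedure several times, each time with a fresh random partition, which yields `round * k` fold results in total. The sketch below shows that structure in plain Java. It is an assumption about the general technique rather than Smile's implementation; `RepeatedCvSketch`, the `seed` parameter, and the `scorer` callback (which receives training and test indices and returns a score) are hypothetical.

```java
import java.util.Random;
import java.util.function.ToDoubleBiFunction;

/** Illustrative sketch only; not Smile's implementation. */
class RepeatedCvSketch {
    /** Returns a random permutation of 0..n-1 (Fisher-Yates shuffle). */
    static int[] shuffled(int n, Random rng) {
        int[] perm = new int[n];
        for (int i = 0; i < n; i++) perm[i] = i;
        for (int i = n - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int t = perm[i]; perm[i] = perm[j]; perm[j] = t;
        }
        return perm;
    }

    /** Runs k-fold cross validation `round` times; returns round * k scores. */
    static double[] repeated(int round, int k, int n, long seed,
                             ToDoubleBiFunction<int[], int[]> scorer) {
        if (k < 2 || k > n) throw new IllegalArgumentException("need 2 <= k <= n");
        Random rng = new Random(seed);
        double[] scores = new double[round * k];
        for (int r = 0; r < round; r++) {
            int[] perm = shuffled(n, rng); // a new partition for every round
            for (int f = 0; f < k; f++) {
                // Position p of the permutation belongs to fold p % k.
                int testSize = n / k + (f < n % k ? 1 : 0);
                int[] test = new int[testSize];
                int[] train = new int[n - testSize];
                int ti = 0, tr = 0;
                for (int p = 0; p < n; p++) {
                    if (p % k == f) test[ti++] = perm[p];
                    else train[tr++] = perm[p];
                }
                scores[r * k + f] = scorer.applyAsDouble(train, test);
            }
        }
        return scores;
    }
}
```

Averaging over all `round * k` scores reduces the dependence of the estimate on any single random partition.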
-
classification
static <M extends DataFrameClassifier> ClassificationValidations<M> classification(int round, int k, Formula formula, DataFrame data, BiFunction<Formula, DataFrame, M> trainer)
Repeated cross validation of classification.
Type Parameters:
M - the model type.
Parameters:
round - the number of rounds of repeated cross validation.
k - the number of folds in each round.
formula - the model specification.
data - the training/validation data.
trainer - the lambda to train a model.
Returns:
the validation results.
-
stratify
static <T,M extends Classifier<T>> ClassificationValidations<M> stratify(int k, T[] x, int[] y, BiFunction<T[], int[], M> trainer)
Stratified cross validation of classification.
Type Parameters:
T - the data type of samples.
M - the model type.
Parameters:
k - the number of folds.
x - the samples.
y - the sample labels.
trainer - the lambda to train a model.
Returns:
the validation results.
-
stratify
static <M extends DataFrameClassifier> ClassificationValidations<M> stratify(int k, Formula formula, DataFrame data, BiFunction<Formula, DataFrame, M> trainer)
Stratified cross validation of classification.
Type Parameters:
M - the model type.
Parameters:
k - the number of folds.
formula - the model specification.
data - the training/validation data.
trainer - the lambda to train a model.
Returns:
the validation results.
-
stratify
static <T,M extends Classifier<T>> ClassificationValidations<M> stratify(int round, int k, T[] x, int[] y, BiFunction<T[], int[], M> trainer)
Repeated stratified cross validation of classification.
Type Parameters:
T - the data type of samples.
M - the model type.
Parameters:
round - the number of rounds of repeated cross validation.
k - the number of folds in each round.
x - the samples.
y - the sample labels.
trainer - the lambda to train a model.
Returns:
the validation results.
-
stratify
static <M extends DataFrameClassifier> ClassificationValidations<M> stratify(int round, int k, Formula formula, DataFrame data, BiFunction<Formula, DataFrame, M> trainer)
Repeated stratified cross validation of classification.
Type Parameters:
M - the model type.
Parameters:
round - the number of rounds of repeated cross validation.
k - the number of folds in each round.
formula - the model specification.
data - the training/validation data.
trainer - the lambda to train a model.
Returns:
the validation results.
-
regression
static <T,M extends Regression<T>> RegressionValidations<M> regression(int k, T[] x, double[] y, BiFunction<T[], double[], M> trainer)
Cross validation of regression.
Type Parameters:
T - the data type of samples.
M - the model type.
Parameters:
k - the number of folds.
x - the samples.
y - the response variable.
trainer - the lambda to train a model.
Returns:
the validation results.
-
regression
static <M extends DataFrameRegression> RegressionValidations<M> regression(int k, Formula formula, DataFrame data, BiFunction<Formula, DataFrame, M> trainer)
Cross validation of regression.
Type Parameters:
M - the model type.
Parameters:
k - the number of folds.
formula - the model specification.
data - the training/validation data.
trainer - the lambda to train a model.
Returns:
the validation results.
-
regression
static <T,M extends Regression<T>> RegressionValidations<M> regression(int round, int k, T[] x, double[] y, BiFunction<T[], double[], M> trainer)
Repeated cross validation of regression.
Type Parameters:
T - the data type of samples.
M - the model type.
Parameters:
round - the number of rounds of repeated cross validation.
k - the number of folds in each round.
x - the samples.
y - the response variable.
trainer - the lambda to train a model.
Returns:
the validation results.
-
regression
static <M extends DataFrameRegression> RegressionValidations<M> regression(int round, int k, Formula formula, DataFrame data, BiFunction<Formula, DataFrame, M> trainer)
Repeated cross validation of regression.
Type Parameters:
M - the model type.
Parameters:
round - the number of rounds of repeated cross validation.
k - the number of folds in each round.
formula - the model specification.
data - the training/validation data.
trainer - the lambda to train a model.
Returns:
the validation results.
-