Interface CrossValidation


public interface CrossValidation
Cross-validation is a technique for assessing how the results of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. One round of cross-validation involves partitioning a sample of data into complementary subsets, performing the analysis on one subset (called the training set), and validating the analysis on the other subset (called the validation set or testing set). To reduce variability, multiple rounds of cross-validation are performed using different partitions, and the validation results are averaged over the rounds.
  • Method Details

    • of

      static Bag[] of(int n, int k)
      Creates the data splits for k-fold cross validation.
      Parameters:
      n - the number of samples.
      k - the number of folds.
      Returns:
      k-fold data splits.
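
      The idea behind a k-fold split can be sketched in plain Java as follows. This is an illustrative sketch only, not this library's implementation: the class KFoldSketch, its methods folds and trainingIndices, and the seed parameter are hypothetical names introduced for the example. It shuffles the indices 0..n-1 and cuts them into k held-out folds of nearly equal size; the training set of a round is everything outside that round's fold.

```java
import java.util.Arrays;
import java.util.Random;

/** Plain-Java sketch of k-fold index splitting (hypothetical helper, not the library's code). */
final class KFoldSketch {
    /** Shuffles 0..n-1 and cuts the permutation into k held-out folds of nearly equal size. */
    static int[][] folds(int n, int k, long seed) {
        if (k < 2 || k > n) throw new IllegalArgumentException("need 2 <= k <= n");
        int[] index = new int[n];
        for (int i = 0; i < n; i++) index[i] = i;
        Random rng = new Random(seed);
        // Fisher-Yates shuffle so that folds do not depend on the sample order.
        for (int i = n - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int t = index[i]; index[i] = index[j]; index[j] = t;
        }
        int[][] folds = new int[k][];
        int start = 0;
        for (int f = 0; f < k; f++) {
            // The first n % k folds receive one extra sample.
            int size = n / k + (f < n % k ? 1 : 0);
            folds[f] = Arrays.copyOfRange(index, start, start + size);
            start += size;
        }
        return folds;
    }

    /** Returns, in ascending order, every index in 0..n-1 that is not in the held-out fold. */
    static int[] trainingIndices(int n, int[] heldOut) {
        boolean[] out = new boolean[n];
        for (int i : heldOut) out[i] = true;
        int[] train = new int[n - heldOut.length];
        int t = 0;
        for (int i = 0; i < n; i++) if (!out[i]) train[t++] = i;
        return train;
    }

    public static void main(String[] args) {
        int n = 10;
        for (int[] heldOut : folds(n, 3, 42L)) {
            System.out.println("train=" + Arrays.toString(trainingIndices(n, heldOut))
                    + " validate=" + Arrays.toString(heldOut));
        }
    }
}
```

      Every sample is held out exactly once across the k rounds, which is what allows the per-round validation results to be averaged.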
    • stratify

      static Bag[] stratify(int[] category, int k)
      Cross validation with stratified folds. Each fold approximately preserves the percentage of samples in each stratum.
      Parameters:
      category - the strata labels.
      k - the number of folds.
      Returns:
      k-fold data splits.
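
      One simple way to obtain stratified folds is to deal the samples of each stratum round-robin across the folds. The sketch below illustrates that idea in plain Java; it is not this library's implementation, and the class StratifiedFoldSketch, the method assign, and the seed parameter are hypothetical names. It returns a fold number for every sample instead of a split object.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

/** Plain-Java sketch of stratified fold assignment (hypothetical helper, not the library's code). */
final class StratifiedFoldSketch {
    /** Returns fold[i] in 0..k-1 for every sample i, dealing each stratum round-robin across folds. */
    static int[] assign(int[] category, int k, long seed) {
        if (k < 2) throw new IllegalArgumentException("k must be at least 2");
        // Group sample indices by stratum label; TreeMap keeps the iteration order deterministic.
        Map<Integer, List<Integer>> strata = new TreeMap<>();
        for (int i = 0; i < category.length; i++) {
            strata.computeIfAbsent(category[i], c -> new ArrayList<>()).add(i);
        }
        Random rng = new Random(seed);
        int[] fold = new int[category.length];
        int next = 0; // carried over between strata so that overall fold sizes stay balanced
        for (List<Integer> members : strata.values()) {
            Collections.shuffle(members, rng);
            for (int i : members) {
                fold[i] = next;
                next = (next + 1) % k;
            }
        }
        return fold;
    }

    public static void main(String[] args) {
        int[] category = {0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1};
        int[] fold = assign(category, 4, 7L);
        for (int i = 0; i < fold.length; i++) {
            System.out.println("sample " + i + " (stratum " + category[i] + ") -> fold " + fold[i]);
        }
    }
}
```

      With this scheme the number of samples of any one stratum differs by at most one between folds.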
    • nonoverlap

      static Bag[] nonoverlap(int[] group, int k)
      Cross validation with non-overlapping groups. No group appears in two different folds, so the number of distinct groups must be at least the number of folds. The folds are approximately balanced in the sense that each fold contains approximately the same number of distinct groups.

      This is useful when the underlying data-generating process is known to violate the i.i.d. assumption. For example, when there are multiple samples from the same user, this approach helps check that the model does not rely on user-specific features that fail to generalize to unseen users.

      Parameters:
      group - the group labels of the samples.
      k - the number of folds.
      Returns:
      k-fold data splits.
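
      The group constraint can be illustrated with a small plain-Java sketch. It is not this library's implementation: the class GroupFoldSketch and the method assign are hypothetical names, and the sketch assigns groups to folds in order of first appearance, whereas a real implementation would likely shuffle the groups first.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

/** Plain-Java sketch of group-wise fold assignment (hypothetical helper, not the library's code). */
final class GroupFoldSketch {
    /** Returns fold[i] in 0..k-1 such that all samples of one group share the same fold. */
    static int[] assign(int[] group, int k) {
        // Deal distinct groups round-robin across folds, in order of first appearance.
        Map<Integer, Integer> foldOfGroup = new LinkedHashMap<>();
        for (int g : group) {
            if (!foldOfGroup.containsKey(g)) {
                foldOfGroup.put(g, foldOfGroup.size() % k);
            }
        }
        if (foldOfGroup.size() < k) {
            throw new IllegalArgumentException("fewer distinct groups than folds");
        }
        int[] fold = new int[group.length];
        for (int i = 0; i < group.length; i++) fold[i] = foldOfGroup.get(group[i]);
        return fold;
    }

    public static void main(String[] args) {
        int[] user = {7, 7, 3, 3, 3, 9, 5, 5}; // e.g. one label per user
        System.out.println(Arrays.toString(assign(user, 2)));
    }
}
```

      The folds are balanced by the number of distinct groups, not by the number of samples, so fold sizes can differ when groups have different sizes.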
    • classification

      static <T, M extends Classifier<T>> ClassificationValidations<M> classification(int k, T[] x, int[] y, BiFunction<T[],int[],M> trainer)
      Cross validation of classification.
      Type Parameters:
      T - the data type of samples.
      M - the model type.
      Parameters:
      k - the number of folds.
      x - the samples.
      y - the sample labels.
      trainer - the lambda to train a model.
      Returns:
      the validation results.
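
      The role of the trainer lambda can be illustrated with a self-contained sketch. It is not this library's implementation: the Model interface stands in for a classifier type, and the class ClassificationCvSketch and the methods accuracy and nearestMean are hypothetical names. For brevity the sketch uses a fixed split (sample i is held out in fold i % k) instead of a random one, and it reports only the accuracy of each fold.

```java
import java.util.Arrays;
import java.util.function.BiFunction;

/** Plain-Java sketch of cross-validating a classifier (hypothetical helper, not the library's code). */
final class ClassificationCvSketch {
    /** Hypothetical stand-in for a classifier type. */
    interface Model<T> { int predict(T x); }

    /** Trains on k-1 folds, validates on the remaining one, and returns the accuracy of each fold. */
    static <T, M extends Model<T>> double[] accuracy(int k, T[] x, int[] y,
                                                     BiFunction<T[], int[], M> trainer) {
        int n = x.length;
        if (k < 2 || k > n) throw new IllegalArgumentException("need 2 <= k <= n");
        double[] acc = new double[k];
        for (int f = 0; f < k; f++) {
            int testSize = (n - f + k - 1) / k; // number of i in [0, n) with i % k == f
            T[] trainX = Arrays.copyOf(x, n - testSize); // same runtime type as x; overwritten below
            int[] trainY = new int[n - testSize];
            int t = 0;
            for (int i = 0; i < n; i++) {
                if (i % k != f) { trainX[t] = x[i]; trainY[t] = y[i]; t++; }
            }
            M model = trainer.apply(trainX, trainY); // the lambda sees training data only
            int correct = 0;
            for (int i = f; i < n; i += k) {
                if (model.predict(x[i]) == y[i]) correct++;
            }
            acc[f] = (double) correct / testSize;
        }
        return acc;
    }

    /** Toy trainer: nearest class mean on one-dimensional inputs with labels 0 and 1. */
    static Model<Double> nearestMean(Double[] x, int[] y) {
        double[] sum = new double[2];
        int[] count = new int[2];
        for (int i = 0; i < x.length; i++) { sum[y[i]] += x[i]; count[y[i]]++; }
        double m0 = sum[0] / count[0];
        double m1 = sum[1] / count[1];
        return v -> Math.abs(v - m0) <= Math.abs(v - m1) ? 0 : 1;
    }

    public static void main(String[] args) {
        Double[] x = {1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1};
        int[] y = {0, 0, 0, 0, 1, 1, 1, 1};
        System.out.println(Arrays.toString(accuracy(2, x, y, ClassificationCvSketch::nearestMean)));
    }
}
```

      Passing the trainer as a function keeps the validation loop independent of the learning algorithm: a fresh model is fitted for every fold and never sees the samples it is validated on.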
    • classification

      static <M extends DataFrameClassifier> ClassificationValidations<M> classification(int k, Formula formula, DataFrame data, BiFunction<Formula,DataFrame,M> trainer)
      Cross validation of classification.
      Type Parameters:
      M - the model type.
      Parameters:
      k - the number of folds.
      formula - the model specification.
      data - the training/validation data.
      trainer - the lambda to train a model.
      Returns:
      the validation results.
    • classification

      static <T, M extends Classifier<T>> ClassificationValidations<M> classification(int round, int k, T[] x, int[] y, BiFunction<T[],int[],M> trainer)
      Repeated cross validation of classification.
      Type Parameters:
      T - the data type of samples.
      M - the model type.
      Parameters:
      round - the number of rounds of repeated cross validation.
      k - the number of folds.
      x - the samples.
      y - the sample labels.
      trainer - the lambda to train a model.
      Returns:
      the validation results.
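
      Repeated cross validation runs the whole k-fold procedure several times with a different random partition in each round, then aggregates over all rounds and folds. The sketch below shows only that outer loop; it is not this library's implementation, and the class RepeatedCvSketch, the methods shuffledFolds and repeated, the seed parameter, and the scoreFold callback are hypothetical names.

```java
import java.util.Random;
import java.util.function.ToDoubleBiFunction;

/** Plain-Java sketch of repeated k-fold cross validation (hypothetical helper, not the library's code). */
final class RepeatedCvSketch {
    /** Randomly assigns n samples to k folds of nearly equal size; returns fold[i] in 0..k-1. */
    static int[] shuffledFolds(int n, int k, Random rng) {
        int[] fold = new int[n];
        for (int i = 0; i < n; i++) fold[i] = i % k;
        // Fisher-Yates shuffle of the fold labels.
        for (int i = n - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int t = fold[i]; fold[i] = fold[j]; fold[j] = t;
        }
        return fold;
    }

    /**
     * Runs `round` independent k-fold passes and averages the score over all round * k folds.
     * scoreFold receives the fold assignment and the number of the held-out fold,
     * and is expected to train on the other folds and return the validation score.
     */
    static double repeated(int round, int k, int n, long seed,
                           ToDoubleBiFunction<int[], Integer> scoreFold) {
        if (round < 1 || k < 2 || k > n) throw new IllegalArgumentException("invalid round, k or n");
        Random rng = new Random(seed);
        double total = 0.0;
        for (int r = 0; r < round; r++) {
            int[] fold = shuffledFolds(n, k, rng); // a new partition in every round
            for (int f = 0; f < k; f++) {
                total += scoreFold.applyAsDouble(fold, f);
            }
        }
        return total / (round * k);
    }

    public static void main(String[] args) {
        // Dummy score: the fraction of samples that fall into the held-out fold.
        double avg = repeated(3, 5, 10, 1L, (fold, f) -> {
            int c = 0;
            for (int v : fold) if (v == f) c++;
            return c / 10.0;
        });
        System.out.println(avg);
    }
}
```

      Averaging over several rounds reduces the dependence of the estimate on any single random partition, at the cost of training round * k models.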
    • classification

      static <M extends DataFrameClassifier> ClassificationValidations<M> classification(int round, int k, Formula formula, DataFrame data, BiFunction<Formula,DataFrame,M> trainer)
      Repeated cross validation of classification.
      Type Parameters:
      M - the model type.
      Parameters:
      round - the number of rounds of repeated cross validation.
      k - the number of folds.
      formula - the model specification.
      data - the training/validation data.
      trainer - the lambda to train a model.
      Returns:
      the validation results.
    • stratify

      static <T, M extends Classifier<T>> ClassificationValidations<M> stratify(int k, T[] x, int[] y, BiFunction<T[],int[],M> trainer)
      Stratified cross validation of classification.
      Type Parameters:
      T - the data type of samples.
      M - the model type.
      Parameters:
      k - the number of folds.
      x - the samples.
      y - the sample labels.
      trainer - the lambda to train a model.
      Returns:
      the validation results.
    • stratify

      static <M extends DataFrameClassifier> ClassificationValidations<M> stratify(int k, Formula formula, DataFrame data, BiFunction<Formula,DataFrame,M> trainer)
      Stratified cross validation of classification.
      Type Parameters:
      M - the model type.
      Parameters:
      k - the number of folds.
      formula - the model specification.
      data - the training/validation data.
      trainer - the lambda to train a model.
      Returns:
      the validation results.
    • stratify

      static <T, M extends Classifier<T>> ClassificationValidations<M> stratify(int round, int k, T[] x, int[] y, BiFunction<T[],int[],M> trainer)
      Repeated stratified cross validation of classification.
      Type Parameters:
      T - the data type of samples.
      M - the model type.
      Parameters:
      round - the number of rounds of repeated cross validation.
      k - the number of folds.
      x - the samples.
      y - the sample labels.
      trainer - the lambda to train a model.
      Returns:
      the validation results.
    • stratify

      static <M extends DataFrameClassifier> ClassificationValidations<M> stratify(int round, int k, Formula formula, DataFrame data, BiFunction<Formula,DataFrame,M> trainer)
      Repeated stratified cross validation of classification.
      Type Parameters:
      M - the model type.
      Parameters:
      round - the number of rounds of repeated cross validation.
      k - the number of folds.
      formula - the model specification.
      data - the training/validation data.
      trainer - the lambda to train a model.
      Returns:
      the validation results.
    • regression

      static <T, M extends Regression<T>> RegressionValidations<M> regression(int k, T[] x, double[] y, BiFunction<T[],double[],M> trainer)
      Cross validation of regression.
      Type Parameters:
      T - the data type of samples.
      M - the model type.
      Parameters:
      k - the number of folds.
      x - the samples.
      y - the response variable.
      trainer - the lambda to train a model.
      Returns:
      the validation results.
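
      For regression the loop is the same, but each fold is scored with an error measure instead of accuracy. The following self-contained sketch uses the root mean squared error (RMSE); it is not this library's implementation, and the Model interface, the class RegressionCvSketch, and the methods rmse and leastSquares are hypothetical names. As a simplification, sample i is held out in fold i % k instead of using a random partition.

```java
import java.util.Arrays;
import java.util.function.BiFunction;

/** Plain-Java sketch of cross-validating a regression model (hypothetical helper, not the library's code). */
final class RegressionCvSketch {
    /** Hypothetical stand-in for a regression model type. */
    interface Model<T> { double predict(T x); }

    /** Trains on k-1 folds and returns the RMSE on each held-out fold. */
    static <T, M extends Model<T>> double[] rmse(int k, T[] x, double[] y,
                                                 BiFunction<T[], double[], M> trainer) {
        int n = x.length;
        if (k < 2 || k > n) throw new IllegalArgumentException("need 2 <= k <= n");
        double[] out = new double[k];
        for (int f = 0; f < k; f++) {
            int testSize = (n - f + k - 1) / k; // number of i in [0, n) with i % k == f
            T[] trainX = Arrays.copyOf(x, n - testSize);
            double[] trainY = new double[n - testSize];
            int t = 0;
            for (int i = 0; i < n; i++) {
                if (i % k != f) { trainX[t] = x[i]; trainY[t] = y[i]; t++; }
            }
            M model = trainer.apply(trainX, trainY);
            double sse = 0.0; // sum of squared errors on the held-out fold
            for (int i = f; i < n; i += k) {
                double e = y[i] - model.predict(x[i]);
                sse += e * e;
            }
            out[f] = Math.sqrt(sse / testSize);
        }
        return out;
    }

    /** Toy trainer: ordinary least squares for a line through one-dimensional inputs. */
    static Model<Double> leastSquares(Double[] x, double[] y) {
        int n = x.length;
        double mx = 0.0, my = 0.0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n;
        my /= n;
        double sxy = 0.0, sxx = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
        }
        double slope = sxy / sxx;
        double intercept = my - slope * mx;
        return v -> intercept + slope * v;
    }

    public static void main(String[] args) {
        Double[] x = {0.0, 1.0, 2.0, 3.0, 4.0, 5.0};
        double[] y = {1.1, 2.9, 5.2, 6.8, 9.1, 11.0};
        System.out.println(Arrays.toString(rmse(3, x, y, RegressionCvSketch::leastSquares)));
    }
}
```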
    • regression

      static <M extends DataFrameRegression> RegressionValidations<M> regression(int k, Formula formula, DataFrame data, BiFunction<Formula,DataFrame,M> trainer)
      Cross validation of regression.
      Type Parameters:
      M - the model type.
      Parameters:
      k - the number of folds.
      formula - the model specification.
      data - the training/validation data.
      trainer - the lambda to train a model.
      Returns:
      the validation results.
    • regression

      static <T, M extends Regression<T>> RegressionValidations<M> regression(int round, int k, T[] x, double[] y, BiFunction<T[],double[],M> trainer)
      Repeated cross validation of regression.
      Type Parameters:
      T - the data type of samples.
      M - the model type.
      Parameters:
      round - the number of rounds of repeated cross validation.
      k - the number of folds.
      x - the samples.
      y - the response variable.
      trainer - the lambda to train a model.
      Returns:
      the validation results.
    • regression

      static <M extends DataFrameRegression> RegressionValidations<M> regression(int round, int k, Formula formula, DataFrame data, BiFunction<Formula,DataFrame,M> trainer)
      Repeated cross validation of regression.
      Type Parameters:
      M - the model type.
      Parameters:
      round - the number of rounds of repeated cross validation.
      k - the number of folds.
      formula - the model specification.
      data - the training/validation data.
      trainer - the lambda to train a model.
      Returns:
      the validation results.