Model Validation

When training a supervised model, we should always evaluate its goodness of fit. This helps with both model selection and hyperparameter tuning. First of all, note that the error measured on the training data is likely to be lower than the actual generalization error.

Evaluation Metrics

Although most supervised learning algorithms try to minimize the empirical error (regularized or not), we should not use error rate or accuracy as the only objective measure. For example, if a highly unbalanced dataset contains 99% positive samples, a naive algorithm that classifies everything as positive will have 99% accuracy, yet it is useless.

For classification, Smile has the following measures:

  • The accuracy is the proportion of true results (both true positives and true negatives) in the population.
  • The sensitivity or true positive rate (TPR), also called hit rate or recall, is a statistical measure of the performance of a binary classification test. Sensitivity is the proportion of actual positives that are correctly identified as such.
    
        TPR = TP / P = TP / (TP + FN)
        
  • The specificity (SPC) or true negative rate is a statistical measure of the performance of a binary classification test. Specificity is the proportion of actual negatives that are correctly identified as such.
    
        SPC = TN / N = TN / (FP + TN) = 1 - FPR
        
  • The precision or positive predictive value (PPV) is the ratio of true positives to all predicted positives (true and false). Unlike sensitivity, its denominator counts predictions rather than actual positives.
    
        PPV = TP / (TP + FP)
        
  • The false discovery rate (FDR) is the ratio of false positives to all predicted positives (true and false), which equals 1 - precision.
    
        FDR = FP / (TP + FP)
        
  • The fall-out, false alarm rate, or false positive rate (FPR) is the proportion of actual negatives that are incorrectly classified as positive.
    
        FPR = FP / N = FP / (FP + TN)
        
    Fall-out is the type I error rate and equals 1 - specificity.
  • The F-score (or F-measure) considers both the precision and the recall of the test to compute the score. The traditional or balanced F-score (F1 score) is the harmonic mean of precision and recall, where an F1 score reaches its best value at 1 and worst at 0.

    The general formula involves a positive real β, so that the F-score measures the effectiveness of retrieval for a user who attaches β times as much importance to recall as to precision.

        Fβ = (1 + β²) * PPV * TPR / (β² * PPV + TPR)

In Smile, the class label 1 is regarded as positive and 0 as negative. Note that not all measures apply to multi-class data. If one applies a binary measure (e.g. specificity or sensitivity) to multi-class data anyway, only label 1 is regarded as positive and every other value is treated as the negative class, so the results may not make sense.
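The definitions above can be checked by hand from the four confusion counts. The following is a minimal sketch in plain Java; `BinaryMetrics` is a hypothetical helper written for illustration and is not part of Smile. The counts in `main` are read off the logistic regression confusion matrix shown in the hold-out section of this page.

```java
// Hypothetical helper (not part of Smile): binary metrics from confusion counts.
class BinaryMetrics {
    final double tp, fp, tn, fn;

    BinaryMetrics(double tp, double fp, double tn, double fn) {
        this.tp = tp; this.fp = fp; this.tn = tn; this.fn = fn;
    }

    /** Counts the four outcomes; label 1 is positive, any other label is negative. */
    static BinaryMetrics of(int[] truth, int[] prediction) {
        double tp = 0, fp = 0, tn = 0, fn = 0;
        for (int i = 0; i < truth.length; i++) {
            boolean actual = truth[i] == 1;
            boolean predicted = prediction[i] == 1;
            if (actual && predicted) tp++;
            else if (actual) fn++;
            else if (predicted) fp++;
            else tn++;
        }
        return new BinaryMetrics(tp, fp, tn, fn);
    }

    double accuracy()    { return (tp + tn) / (tp + fp + tn + fn); }
    double sensitivity() { return tp / (tp + fn); }   // TPR, recall
    double specificity() { return tn / (fp + tn); }   // SPC
    double precision()   { return tp / (tp + fp); }   // PPV
    double fdr()         { return fp / (tp + fp); }   // 1 - precision
    double fallout()     { return fp / (fp + tn); }   // FPR = 1 - specificity

    /** General F-score; beta = 1 gives the balanced F1 score. */
    double fscore(double beta) {
        double b2 = beta * beta;
        double p = precision(), r = sensitivity();
        return (1 + b2) * p * r / (b2 * p + r);
    }

    public static void main(String[] args) {
        // TP, FP, TN, FN taken from the logistic regression confusion matrix on the toy test set.
        BinaryMetrics m = new BinaryMetrics(7828, 1541, 8459, 2172);
        System.out.printf("accuracy = %.5f, recall = %.4f, specificity = %.4f, F1 = %.4f%n",
                m.accuracy(), m.sensitivity(), m.specificity(), m.fscore(1.0));
    }
}
```

Because every other label is folded into the negative class, calling `BinaryMetrics.of` on multi-class labels silently measures "class 1 versus the rest", which is the behavior the paragraph above warns about.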

The example below shows how to calculate the accuracy of a multi-class model.


    val segTrain = read.arff("data/weka/segment-challenge.arff")
    val segTest = read.arff("data/weka/segment-test.arff")

    val model = randomForest("class" ~, segTrain)
    val pred = model.predict(segTest)

    smile> accuracy(segTest("class").toIntArray, pred)
    res5: Double = 0.9728395061728395
    

    var segTrain = Read.arff("data/weka/segment-challenge.arff");
    var segTest = Read.arff("data/weka/segment-test.arff");

    var model = RandomForest.fit(Formula.lhs("class"), segTrain);
    var pred = model.predict(segTest);

    jshell> Accuracy.of(segTest.column("class").toIntArray(), pred)
    $161 ==> 0.9617283950617284
          

Sensitivity and specificity are closely related to the concepts of type I and type II errors. For any test, there is usually a trade-off between the measures. This trade-off can be represented graphically using an ROC curve. When using normalized units, the area under the ROC curve is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative').
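The probabilistic reading of the area under the ROC curve can be made concrete by counting pairs directly. The sketch below is plain Java and does not use Smile; `PairwiseAuc` is a hypothetical name, and counting ties as one half is a common convention that is assumed here. It compares every positive with every negative, so it is quadratic in the sample size and only meant for illustration.

```java
// Hypothetical helper (not part of Smile): AUC as a ranking probability.
class PairwiseAuc {
    /**
     * Fraction of (positive, negative) pairs in which the positive instance
     * receives the higher score. Ties count as half a correct pair.
     * Label 1 is positive, any other label is negative.
     */
    static double auc(int[] truth, double[] score) {
        double correct = 0.0;
        long pairs = 0;
        for (int i = 0; i < truth.length; i++) {
            if (truth[i] != 1) continue;          // i runs over positives
            for (int j = 0; j < truth.length; j++) {
                if (truth[j] == 1) continue;      // j runs over negatives
                pairs++;
                if (score[i] > score[j]) correct += 1.0;
                else if (score[i] == score[j]) correct += 0.5;
            }
        }
        return correct / pairs;
    }

    public static void main(String[] args) {
        int[] truth = {0, 0, 1, 1};
        double[] score = {0.1, 0.4, 0.35, 0.8};
        // Three of the four positive/negative pairs are ranked correctly.
        System.out.println(auc(truth, score)); // prints 0.75
    }
}
```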

The following example calculates various metrics for a binary classification problem.


    val toyTrain = read.csv("data/classification/toy/toy-train.txt", delimiter='\t', header=false)
    val toyTest = read.csv("data/classification/toy/toy-test.txt", delimiter='\t', header=false)

    val x = toyTrain.select(1, 2).toArray
    val y = toyTrain.column(0).toIntArray
    val model = logit(x, y, 0.1, 0.001)

    val testx = toyTest.select(1, 2).toArray
    val testy = toyTest.column(0).toIntArray
    val pred = testx.map(model.predict(_))

    smile> accuracy(testy, pred)
    res7: Double = 0.81435

    smile> recall(testy, pred)
    res8: Double = 0.7828

    smile> sensitivity(testy, pred)
    res9: Double = 0.7828

    smile> specificity(testy, pred)
    res10: Double = 0.8459

    smile> fallout(testy, pred)
    res11: Double = 0.15410000000000001

    smile> fdr(testy, pred)
    res12: Double = 0.16447859963710107

    smile> f1(testy, pred)
    res13: Double = 0.808301925757654

    // Calculate posteriori probability for AUC computation.
    val posteriori = new Array[Double](2)
    val prob = testx.map { x =>
            model.predict(x, posteriori)
            posteriori(1)
        }

    smile> auc(testy, prob)
    res17: Double = 0.8650958
    

    var toyTrain = Read.csv("data/classification/toy/toy-train.txt", CSVFormat.DEFAULT.withDelimiter('\t'));
    var toyTest = Read.csv("data/classification/toy/toy-test.txt", CSVFormat.DEFAULT.withDelimiter('\t'));

    var x = toyTrain.select(1, 2).toArray();
    var y = toyTrain.column(0).toIntArray();
    var model = LogisticRegression.fit(x, y, 0.1, 0.001, 100);

    var testx = toyTest.select(1, 2).toArray();
    var testy = toyTest.column(0).toIntArray();
    var pred = Arrays.stream(testx).mapToInt(xi -> model.predict(xi)).toArray();

    jshell> Accuracy.of(testy, pred)
    $171 ==> 0.81435

    jshell> Recall.of(testy, pred)
    $172 ==> 0.7828

    jshell> Sensitivity.of(testy, pred)
    $173 ==> 0.7828

    jshell> Specificity.of(testy, pred)
    $174 ==> 0.8459

    jshell> Fallout.of(testy, pred)
    $175 ==> 0.15410000000000001

    jshell> FDR.of(testy, pred)
    $176 ==> 0.16447859963710107

    jshell> FMeasure.of(testy, pred)
    $177 ==> 0.808301925757654

    // Calculate posteriori probability for AUC computation.
    var posteriori = new double[2];
    var prob = Arrays.stream(testx).mapToDouble(xi -> {
            model.predict(xi, posteriori);
            return posteriori[1];
        }).toArray();

    jshell> AUC.of(testy, prob)
    $180 ==> 0.8650958
          

For regression, Smile has the following measures:

  • MSE (mean squared error) and RMSE (root mean squared error).
  • MAD (mean absolute deviation error).
  • RSS (residual sum of squares).
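For reference, the regression measures listed above reduce to a few lines of arithmetic. This is a minimal sketch in plain Java; `RegressionMetrics` is a hypothetical helper rather than Smile's implementation, and the numbers in `main` are made up for illustration.

```java
// Hypothetical helper (not part of Smile): basic regression error measures.
class RegressionMetrics {
    /** Residual sum of squares. */
    static double rss(double[] truth, double[] prediction) {
        double sum = 0.0;
        for (int i = 0; i < truth.length; i++) {
            double r = truth[i] - prediction[i];
            sum += r * r;
        }
        return sum;
    }

    /** Mean squared error: RSS divided by the number of samples. */
    static double mse(double[] truth, double[] prediction) {
        return rss(truth, prediction) / truth.length;
    }

    /** Root mean squared error, in the same unit as the response. */
    static double rmse(double[] truth, double[] prediction) {
        return Math.sqrt(mse(truth, prediction));
    }

    /** Mean absolute deviation between truth and prediction. */
    static double mad(double[] truth, double[] prediction) {
        double sum = 0.0;
        for (int i = 0; i < truth.length; i++) {
            sum += Math.abs(truth[i] - prediction[i]);
        }
        return sum / truth.length;
    }

    public static void main(String[] args) {
        double[] truth = {3.0, -0.5, 2.0, 7.0};       // illustrative values
        double[] prediction = {2.5, 0.0, 2.0, 8.0};
        System.out.printf("RSS = %.3f, MSE = %.3f, RMSE = %.3f, MAD = %.3f%n",
                rss(truth, prediction), mse(truth, prediction),
                rmse(truth, prediction), mad(truth, prediction));
    }
}
```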

Out-of-sample Evaluation

The generalization error (also known as the out-of-sample error) is a measure of how accurately an algorithm is able to predict outcome values for previously unseen data. Ideally, test data should be statistically independent from training data. But in practice, we usually have only one historical dataset and the evaluation of a learning algorithm may be sensitive to sampling error. In what follows, we discuss various testing mechanisms.

We provide both Java and Scala helper functions for testing. The Java helper functions are static methods of the class smile.validation.Validation. The Scala ones are in the package object of smile.validation and can be accessed directly in the Shell.

Hold-out Testing

Hold-out testing assumes that all data samples are independently and identically distributed (this is also the basic assumption of most learning algorithms). A part of the data is held out for testing. Many benchmark datasets come with a separate test set.


    // Test a generic classifier. Only accuracy is calculated.
    def test[T](x: Array[T], y: Array[Int], testx: Array[T], testy: Array[Int])(trainer: => (Array[T], Array[Int]) => Classifier[T]): Classifier[T]

    // Test a binary classifier. Report accuracy, sensitivity/recall, specificity, precision, and F-Score.
    def test2[T](x: Array[T], y: Array[Int], testx: Array[T], testy: Array[Int])(trainer: => (Array[T], Array[Int]) => Classifier[T]): Classifier[T]

    // Test a binary soft classifier. In addition to the measures of test2, AUC is calculated.
    def test2soft[T](x: Array[T], y: Array[Int], testx: Array[T], testy: Array[Int])(trainer: => (Array[T], Array[Int]) => SoftClassifier[T]): SoftClassifier[T]
    

The Scala methods above take a code block that trains the model and then apply the model to the test data. These methods return the trained model and print out various measures.


    val segTrain = read.arff("data/weka/segment-challenge.arff")
    val segTest = read.arff("data/weka/segment-test.arff")

    smile> test("class" ~, segTrain, segTest) { case (formula, data) => smile.classification.randomForest(formula, data) }
    [main] INFO smile.util.package$ - testing runtime: 0:00:00.103314
    Accuracy = 97.65%
    Confusion Matrix: ROW=truth and COL=predicted
    class  0 |     124 |       0 |       0 |       0 |       1 |       0 |       0 |
    class  1 |       0 |     110 |       0 |       0 |       0 |       0 |       0 |
    class  2 |       3 |       0 |     117 |       1 |       1 |       0 |       0 |
    class  3 |       1 |       0 |       0 |     109 |       0 |       0 |       0 |
    class  4 |       1 |       0 |       6 |       2 |     117 |       0 |       0 |
    class  5 |       0 |       0 |       0 |       0 |       0 |      94 |       0 |
    class  6 |       0 |       0 |       1 |       2 |       0 |       0 |     120 |
    res21: RandomForest = smile.classification.RandomForest@77f95e19
    

    var segTrain = Read.arff("data/weka/segment-challenge.arff");
    var segTest = Read.arff("data/weka/segment-test.arff");
    var formula = Formula.lhs("class");
    var model = RandomForest.fit(formula, segTrain);
    var pred = model.predict(segTest);

    jshell> ConfusionMatrix.of(formula.y(segTest).toIntArray(), pred)
    $187 ==> ROW=truth and COL=predicted
    class  0 |     124 |       0 |       0 |       0 |       1 |       0 |       0 |
    class  1 |       0 |     110 |       0 |       0 |       0 |       0 |       0 |
    class  2 |       3 |       0 |     115 |       1 |       3 |       0 |       0 |
    class  3 |       2 |       0 |       0 |     106 |       2 |       0 |       0 |
    class  4 |       2 |       0 |      10 |       6 |     108 |       0 |       0 |
    class  5 |       0 |       0 |       0 |       0 |       0 |      94 |       0 |
    class  6 |       2 |       0 |       1 |       0 |       0 |       0 |     120 |
          

    val toyTrain = read.csv("data/classification/toy/toy-train.txt", delimiter='\t', header=false)
    val toyTest = read.csv("data/classification/toy/toy-test.txt", delimiter='\t', header=false)

    val x = toyTrain.select(1, 2).toArray
    val y = toyTrain.column(0).toIntArray

    val testx = toyTest.select(1, 2).toArray
    val testy = toyTest.column(0).toIntArray

    smile> test2(x, y, testx, testy) { case (x, y) => lda(x, y) }
    training...
    testing...
    [main] INFO smile.util.package$ - runtime: 78.653061 ms
    Accuracy = 81.23%
    Sensitivity/Recall = 78.28%
    Specificity = 84.17%
    Precision = 83.18%
    F1-Score = 80.66%
    F2-Score = 79.21%
    F0.5-Score = 82.15%
    Confusion Matrix: ROW=truth and COL=predicted
    class 0	: 8417	| 1583	|
    class 1	: 2172	| 7828	|
    res5: LDA = smile.classification.LDA@5a524a19

    smile> test2(x, y, testx, testy) { case (x, y) => logit(x, y, 0.1, 0.001) }
    training...
    testing...
    Accuracy = 81.44%
    Sensitivity/Recall = 78.28%
    Specificity = 84.59%
    Precision = 83.55%
    F1-Score = 80.83%
    F2-Score = 79.28%
    F0.5-Score = 82.44%
    Confusion Matrix: ROW=truth and COL=predicted
    class  0 |    8459 |    1541 |
    class  1 |    2172 |    7828 |
    res29: LogisticRegression = smile.classification.LogisticRegression@6b0bcea5

    // AUC will be reported in binary classification
    test2soft(x, y, testx, testy) { case (x, y) => lda(x, y) }
    test2soft(x, y, testx, testy) { case (x, y) => logit(x, y, 0.1, 0.001) }
    

Out-of-bag Error

Out-of-bag (OOB) error, also called the out-of-bag estimate, is a method of measuring the prediction error of random forests, boosted decision trees, and other machine learning models that use bootstrap aggregating (bagging) to sub-sample the training data. The OOB error is the mean prediction error over the training samples xi, where each xi is predicted using only the trees that did not have xi in their bootstrap sample.


    val iris = read.arff("data/weka/iris.arff")
    val rf = smile.classification.randomForest("class" ~, iris)
    println(s"OOB error = ${rf.error}")
    

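    var iris = Read.arff("data/weka/iris.arff");
    var rf = smile.classification.RandomForest.fit(Formula.lhs("class"), iris);
    // error() is a rate in [0, 1]; scale it before printing as a percentage.
    System.out.format("OOB error = %.2f%%%n", 100.0 * rf.error());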
          

In boosting, subsampling likewise allows one to define an out-of-bag estimate of the improvement in prediction performance by evaluating predictions on the observations that were not used to build the next base learner. Out-of-bag estimates avoid the need for an independent validation dataset, but they often underestimate the actual performance improvement and the optimal number of iterations.
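The definition of the OOB error for a bagged ensemble can be sketched independently of any particular base learner. The code below is plain Java and is not how Smile computes `error()`; `OobError`, the `inBag` and `pred` layouts, the majority vote, and the tie-breaking toward the lower label are all assumptions made for illustration.

```java
// Hypothetical helper (not part of Smile): OOB error of a bagged classifier.
class OobError {
    /**
     * @param inBag      inBag[t][i] is true if tree t had sample i in its bootstrap sample
     * @param pred       pred[t][i] is the label that tree t predicts for sample i
     * @param y          true labels in 0..numClasses-1
     * @param numClasses number of classes
     * @return fraction of samples misclassified by the majority vote of the
     *         trees that did not see them; NaN if no sample is out of bag
     */
    static double of(boolean[][] inBag, int[][] pred, int[] y, int numClasses) {
        int errors = 0, counted = 0;
        for (int i = 0; i < y.length; i++) {
            int[] votes = new int[numClasses];
            int voters = 0;
            for (int t = 0; t < inBag.length; t++) {
                if (!inBag[t][i]) {               // only trees that never saw sample i
                    votes[pred[t][i]]++;
                    voters++;
                }
            }
            if (voters == 0) continue;            // sample i was in every bag
            int best = 0;
            for (int c = 1; c < numClasses; c++) {
                if (votes[c] > votes[best]) best = c;   // ties go to the lower label
            }
            counted++;
            if (best != y[i]) errors++;
        }
        return counted == 0 ? Double.NaN : (double) errors / counted;
    }

    public static void main(String[] args) {
        // Three trees, four samples; sample 3 is in every bag and is skipped.
        boolean[][] inBag = {
            {true, false, false, true},
            {false, true, false, true},
            {false, false, true, true}
        };
        int[][] pred = {
            {0, 1, 0, 0},
            {0, 1, 0, 0},
            {0, 1, 1, 0}
        };
        int[] y = {0, 1, 1, 0};
        System.out.printf("OOB error = %.2f%%%n", 100.0 * of(inBag, pred, y, 2));
    }
}
```

In the toy data of `main`, samples 0 and 1 are predicted correctly by their out-of-bag trees and sample 2 is not, so one of the three counted samples is an error.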

Cross Validation

In k-fold cross validation, the dataset is divided into k random partitions. We treat each of the k partitions as a hold-out set, train a model on the rest of the data, and measure the quality of the model on the held-out partition. The overall performance is taken to be the average of the performance on all k partitions.
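The partitioning step can be sketched as follows. This is plain Java, not Smile's implementation; `KFold.split` and its `seed` parameter are hypothetical names introduced for illustration. It shuffles the sample indices once and cuts them into k nearly equal, disjoint test folds.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper (not part of Smile): random k-fold partition of sample indices.
class KFold {
    /** Returns k disjoint arrays of test indices that together cover 0..n-1. */
    static int[][] split(int n, int k, long seed) {
        List<Integer> index = new ArrayList<>();
        for (int i = 0; i < n; i++) index.add(i);
        Collections.shuffle(index, new Random(seed));   // random partitions

        int[][] folds = new int[k][];
        for (int f = 0; f < k; f++) {
            int start = f * n / k;          // fold boundaries differ by at most one sample
            int end = (f + 1) * n / k;
            folds[f] = new int[end - start];
            for (int i = start; i < end; i++) {
                folds[f][i - start] = index.get(i);
            }
        }
        return folds;
    }

    public static void main(String[] args) {
        int[][] folds = split(150, 10, 42L);
        for (int f = 0; f < folds.length; f++) {
            // The model for fold f is trained on all indices outside folds[f].
            System.out.println("fold " + f + ": " + folds[f].length + " test samples");
        }
    }
}
```

For each fold, the training set consists of the indices in all other folds. Setting k equal to n yields one test sample per fold.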


    object cv {
        def classification[T](x: Array[T], y: Array[Int], k: Int, measures: ClassificationMeasure*)(trainer: => (Array[T], Array[Int]) => Classifier[T]): Array[Double]

        def regression[T](x: Array[T], y: Array[Double], k: Int, measures: RegressionMeasure*)(trainer: => (Array[T], Array[Double]) => Regression[T]): Array[Double]
    }
    

    public class CrossValidation {
        public static int[] classification(int k, T[] x, int[] y, BiFunction<T[], int[], Classifier<T>> trainer);
        public static int[] classification(int k, Formula formula, DataFrame data, BiFunction<Formula, DataFrame, DataFrameClassifier> trainer);
        public static double[] regression(int k, T[] x, double[] y, BiFunction<T[], double[], Regression<T>> trainer);
        public static double[] regression(int k, Formula formula, DataFrame data, BiFunction<Formula, DataFrame, DataFrameRegression> trainer);
    }
          

When no measures are provided, the methods use accuracy or RMSE by default for classification or regression, respectively.


    smile> val iris = read.arff("data/weka/iris.arff")
    smile> cv.classification(10, "class" ~, iris) { case (formula, data) => smile.classification.cart(formula, data) }
    [main] INFO smile.util.package$ - Decision Tree runtime: 0:00:00.4392
    [main] INFO smile.util.package$ - Decision Tree runtime: 0:00:00.1187
    [main] INFO smile.util.package$ - Decision Tree runtime: 0:00:00.1340
    [main] INFO smile.util.package$ - Decision Tree runtime: 0:00:00.1120
    [main] INFO smile.util.package$ - Decision Tree runtime: 0:00:00.876
    [main] INFO smile.util.package$ - Decision Tree runtime: 0:00:00.1105
    [main] INFO smile.util.package$ - Decision Tree runtime: 0:00:00.1570
    [main] INFO smile.util.package$ - Decision Tree runtime: 0:00:00.818
    [main] INFO smile.util.package$ - Decision Tree runtime: 0:00:00.1013
    [main] INFO smile.util.package$ - Decision Tree runtime: 0:00:00.929
    Confusion Matrix: ROW=truth and COL=predicted
    class  0 |      50 |       0 |       0 |
    class  1 |       0 |      45 |       5 |
    class  2 |       0 |       5 |      45 |
    Accuracy: 93.33%
    res35: Array[Double] = Array(0.9333333333333333)
    

    jshell> var iris = Read.arff("data/weka/iris.arff");
    [main] INFO smile.io.Arff - Read ARFF relation iris
    iris ==> [sepallength: float, sepalwidth: float, petalleng ... -------+
    140 more rows...

    jshell> var pred = CrossValidation.classification(10, Formula.lhs("class"), iris, (formula, data) -> DecisionTree.fit(formula, data));
    pred ==> int[150] { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 }

    jshell> var y = iris.column("class").toIntArray()
    y ==> int[150] { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... , 2, 2, 2, 2, 2, 2, 2, 2 }

    jshell> Accuracy.of(y, pred)
    $193 ==> 0.9266666666666666

    jshell> ConfusionMatrix.of(y, pred)
    $194 ==> ROW=truth and COL=predicted
    class  0 |      50 |       0 |       0 |
    class  1 |       0 |      45 |       5 |
    class  2 |       0 |       6 |      44 |
          

On the Iris data, the decision tree in the runs above reaches an accuracy of about 93% under 10-fold cross validation. You may get a different number because the partitions are random.

A special case is the leave-one-out cross validation that uses a single observation from the original sample as the validation data, and the remaining observations as the training data. This is repeated such that each observation in the sample is used once as the validation data. Leave-one-out cross-validation is usually very expensive from a computational point of view because of the large number of times the training process is repeated.


    object loocv {
        def classification[T](x: Array[T], y: Array[Int], measures: ClassificationMeasure*)(trainer: => (Array[T], Array[Int]) => Classifier[T]): Array[Double]

        def regression[T](x: Array[T], y: Array[Double], measures: RegressionMeasure*)(trainer: => (Array[T], Array[Double]) => Regression[T]): Array[Double]
    }
    

    public class LOOCV {
        public static int[] classification(T[] x, int[] y, BiFunction<T[], int[], Classifier<T>> trainer);
        public static int[] classification(Formula formula, DataFrame data, BiFunction<Formula, DataFrame, DataFrameClassifier> trainer);
        public static double[] regression(T[] x, double[] y, BiFunction<T[], double[], Regression<T>> trainer);
        public static double[] regression(Formula formula, DataFrame data, BiFunction<Formula, DataFrame, DataFrameRegression> trainer);
    }
          

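On the Iris data, the LOOCV accuracy estimate of LDA is 85.33% (see the Java example below). Each LOOCV round trains on all but one observation, so more data is used for training and less for testing than in 10-fold cross validation. Note that the Scala example below runs on a two-class dataset, so its numbers are not comparable.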


    smile> loocv.classification(x, y) { case (x, y) => lda(x, y) }
    Confusion Matrix: ROW=truth and COL=predicted
    class  0 |      80 |      20 |
    class  1 |      19 |      81 |
    Accuracy: 80.50%
    res41: Array[Double] = Array(0.805)
    

    jshell> var x = iris.drop("class").toArray();
    x ==> double[150][] { double[4] { 5.099999904632568, 3. ... 68, 1.7999999523162842 } }

    jshell> var pred = LOOCV.classification(x, y, (x, y) -> LDA.fit(x, y));
    Mar 11, 2020 10:14:52 AM com.github.fommil.jni.JniLoader load
    INFO: already loaded netlib-native_system-osx-x86_64.jnilib
    pred ==> int[150] { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... , 2, 2, 2, 2, 1, 2, 2, 2 }

    jshell> Accuracy.of(y, pred)
    $197 ==> 0.8533333333333334

    jshell> ConfusionMatrix.of(y, pred)
    $198 ==> ROW=truth and COL=predicted
    class  0 |      49 |       1 |       0 |
    class  1 |       0 |      41 |       9 |
    class  2 |       0 |      12 |      38 |
          

Bootstrap

Bootstrap is a general tool for assessing statistical accuracy. The basic idea is to randomly draw data with replacement from the training data, so that each bootstrap sample has the same size as the original training set. In a bootstrap sample, the expected ratio of unique instances is approximately 1 − 1/e ≈ 63.2%. This process is repeated many times (say k = 100), producing k bootstrap datasets. We then fit the model to each of the bootstrap datasets and examine the behavior of the fits over the k replications.
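The 63.2% figure is easy to check empirically. The sketch below is plain Java and does not use Smile; `BootstrapSample` is a hypothetical helper. It draws one bootstrap sample of indices and measures which fraction of the original instances it contains.

```java
import java.util.Random;

// Hypothetical helper (not part of Smile): one bootstrap sample of indices.
class BootstrapSample {
    /** Draws n indices from 0..n-1 uniformly at random with replacement. */
    static int[] draw(int n, Random rng) {
        int[] sample = new int[n];
        for (int i = 0; i < n; i++) {
            sample[i] = rng.nextInt(n);
        }
        return sample;
    }

    /** Fraction of the original instances that appear at least once in the sample. */
    static double uniqueRatio(int[] sample) {
        boolean[] hit = new boolean[sample.length];
        int unique = 0;
        for (int i : sample) {
            if (!hit[i]) {
                hit[i] = true;
                unique++;
            }
        }
        return (double) unique / sample.length;
    }

    public static void main(String[] args) {
        int[] sample = draw(100_000, new Random(7));
        // Should be close to 1 - 1/e, i.e. roughly 0.632, for large n.
        System.out.printf("unique ratio = %.4f%n", uniqueRatio(sample));
    }
}
```

The instances that are never drawn (roughly 36.8% of the data) can serve as the test set of that bootstrap round.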


    object bootstrap {
        def classification[T](k: Int, x: Array[T], y: Array[Int], measures: ClassificationMeasure*)(trainer: => (Array[T], Array[Int]) => Classifier[T]): Array[Double]

        def regression[T](k: Int, x: Array[T], y: Array[Double], measures: RegressionMeasure*)(trainer: => (Array[T], Array[Double]) => Regression[T]): Array[Double]
    }
    

    public class Bootstrap {
        public static double[] classification(int k, T[] x, int[] y, BiFunction<T[], int[], Classifier<T>> trainer);
        public static double[] classification(int k, Formula formula, DataFrame data, BiFunction<Formula, DataFrame, DataFrameClassifier> trainer);
        public static double[] regression(int k, T[] x, double[] y, BiFunction<T[], double[], Regression<T>> trainer);
        public static double[] regression(int k, Formula formula, DataFrame data, BiFunction<Formula, DataFrame, DataFrameRegression> trainer);
    }
          

On the Iris data, the accuracy estimate of LDA from 100 bootstrap rounds is about 83.7%, slightly lower than the LOOCV estimate of 85.33%. Note that the arrays printed below hold the error rate of each round, not the accuracy.


    smile> bootstrap.classification(100, x, y) { case (x, y) => lda(x, y) }
    res40: Array[Double] = Array(
      0.21212121212121215,
      0.22499999999999998,
      0.16901408450704225,
      0.16666666666666663,
      0.25,
      0.19480519480519476,
      0.19999999999999996,
      0.273972602739726,
      0.125,
      0.1842105263157895,
      0.16129032258064513,
      0.17808219178082196,
      0.18461538461538463,
      0.23750000000000004,
      0.22972972972972971,
      0.14864864864864868,
      0.17808219178082196,
      0.17333333333333334,
      0.2777777777777778,
      0.16666666666666663,
      0.18666666666666665,
      0.22388059701492535,
    ...
    

    jshell> Bootstrap.classification(100, x, y, (x, y) -> LDA.fit(x, y))
    $199 ==> double[100] { 0.11111111111111116, 0.18867924528301883, 0.09090909090909094, 0.2068965517241379, 0.1428571428571429, 0.19999999999999996, 0.16981132075471694, 0.21153846153846156, 0.1785714285714286, 0.109375, 0.16666666666666663, 0.2142857142857143, 0.1071428571428571, 0.11764705882352944, 0.2545454545454545, 0.21568627450980393, 0.25806451612903225, 0.06382978723404253, 0.14814814814814814, 0.2222222222222222, 0.1578947368421053, 0.15517241379310343, 0.25, 0.18965517241379315, 0.17543859649122806, 0.18333333333333335, 0.12765957446808507, 0.0892857142857143, 0.17307692307692313, 0.16666666666666663, 0.17647058823529416, 0.2142857142857143, 0.12, 0.1818 ... 615, 0.1724137931034483, 0.11111111111111116, 0.1071428571428571, 0.1228070175438597, 0.2142857142857143, 0.23076923076923073, 0.07843137254901966, 0.13793103448275867, 0.06896551724137934, 0.17021276595744683, 0.1578947368421053, 0.2075471698113207, 0.1568627450980392, 0.1636363636363637, 0.18518518518518523, 0.15384615384615385 }
          

The bootstrap distribution of a parameter estimator has been used to calculate confidence intervals for the corresponding population parameter. If the bootstrap distribution of an estimator is symmetric, percentile confidence intervals are often used; such intervals are especially appropriate for median-unbiased estimators of minimum risk (with respect to an absolute loss function). If the bootstrap distribution is asymmetric, percentile confidence intervals are often inappropriate.

The bootstrap distribution and the sample may disagree systematically, in which case bias may occur. Bias in the bootstrap distribution will lead to bias in the confidence-interval.
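A percentile confidence interval is read directly off the sorted bootstrap estimates. The following is a minimal sketch in plain Java; `PercentileInterval` is a hypothetical helper, and the linear interpolation between order statistics is one of several common quantile conventions, chosen here as an assumption. The values in `main` are the first ten error rates of the jshell output above, rounded to four digits.

```java
import java.util.Arrays;

// Hypothetical helper (not part of Smile): percentile interval of bootstrap estimates.
class PercentileInterval {
    /** Returns {lower, upper}, the empirical quantiles enclosing the given confidence level. */
    static double[] of(double[] estimates, double level) {
        double[] sorted = estimates.clone();   // leave the caller's array untouched
        Arrays.sort(sorted);
        double alpha = (1.0 - level) / 2.0;
        return new double[] { quantile(sorted, alpha), quantile(sorted, 1.0 - alpha) };
    }

    /** Quantile by linear interpolation between neighboring order statistics. */
    static double quantile(double[] sorted, double q) {
        double pos = q * (sorted.length - 1);
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        return sorted[lo] + (pos - lo) * (sorted[hi] - sorted[lo]);
    }

    public static void main(String[] args) {
        double[] errors = {0.1111, 0.1887, 0.0909, 0.2069, 0.1429,
                           0.2000, 0.1698, 0.2115, 0.1786, 0.1094};
        double[] ci = of(errors, 0.90);
        System.out.printf("90%% percentile interval: [%.4f, %.4f]%n", ci[0], ci[1]);
    }
}
```

In practice one would pass all k bootstrap estimates rather than ten values. As the text notes, the interval is only trustworthy when the bootstrap distribution is roughly symmetric.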

Hyperparameter Tuning

A hyperparameter is a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training. Hyperparameters can be classified as model hyperparameters, which cannot be inferred while fitting the model to the training set because they refer to the model selection task, and algorithm hyperparameters, which in principle have no influence on the performance of the model but affect the speed and quality of the learning process. For example, the topology and size of a neural network are model hyperparameters, while learning rate and mini-batch size are algorithm hyperparameters.

In Smile, the Hyperparameters class provides two generic approaches to sampling search candidates. With the add() methods, the user can define a parameter space with a specified distribution (a fixed value, an array of values, or a range). The method grid() exhaustively considers all parameter combinations, while random() generates a stream of random candidates.


    import smile.io.*;
    import smile.data.formula.Formula;
    import smile.validation.*;
    import smile.classification.RandomForest;

    var hp = new Hyperparameters()
        .add("smile.random.forest.trees", 100) // a fixed value
        .add("smile.random.forest.mtry", new int[] {2, 3, 4}) // an array of values to choose
        .add("smile.random.forest.max.nodes", 100, 500, 50); // range [100, 500] with step 50


    var train = Read.arff("data/weka/segment-challenge.arff");
    var test = Read.arff("data/weka/segment-test.arff");
    var formula = Formula.lhs("class");
    var testy = formula.y(test).toIntArray();

    hp.grid().forEach(prop -> {
        var model = RandomForest.fit(formula, train, prop);
        var pred = model.predict(test);
        System.out.println(prop);
        System.out.format("Accuracy = %.2f%%%n", (100.0 * Accuracy.of(testy, pred)));
        System.out.println(ConfusionMatrix.of(testy, pred));
    });
    

While grid search is popular, random search lets you choose a budget that is independent of the number of parameters and their possible values. Note that random() returns a stream that never ends. Therefore, one should use the limit() method to decide how many configurations to test.


    hp.random().limit(20).forEach(prop -> {
        var model = RandomForest.fit(formula, train, prop);
        var pred = model.predict(test);
        System.out.println(prop);
        System.out.format("Accuracy = %.2f%%%n", (100.0 * Accuracy.of(testy, pred)));
        System.out.println(ConfusionMatrix.of(testy, pred));
    });
    

In the lambda of hyperparameter tuning, the user is free to train any model (or even multiple algorithms) and to evaluate it with one or more metrics. Besides a separate test set as in the examples above, the evaluation can also use cross validation or bootstrap.

Both grid search and random search evaluate each parameter setting independently. Therefore, the computations may be run in parallel with a parallel stream (enabled with parallel()). Note that some algorithms already run in parallel (e.g. random forest, logistic regression, etc.). In those cases, we should NOT use a parallel stream, to avoid potential deadlock.

Model Selection Criteria

Model selection is the task of selecting a statistical model from a set of candidate models, given data. In the simplest cases, a pre-existing set of data is considered. Given candidate models of similar predictive or explanatory power, the simplest model is most likely to be the best choice (Occam's razor).

A good model selection technique will balance goodness of fit with simplicity. More complex models will be better able to adapt their shape to fit the data, but the additional parameters may not represent anything useful. Goodness of fit is generally determined using a likelihood ratio approach, or an approximation of this, leading to a chi-squared test. The complexity is generally measured by counting the number of parameters in the model.

The most commonly used criteria are the Akaike information criterion (AIC) and the Bayesian information criterion (BIC), which are implemented in ModelSelection. The formula for BIC is similar to the formula for AIC, but with a different penalty for the number of parameters. With AIC the penalty is 2k, whereas with BIC the penalty is log(n) * k, where k is the number of parameters and n is the number of observations.
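Written out, the standard definitions are AIC = 2k − 2 ln(L) and BIC = k ln(n) − 2 ln(L), where L is the maximized likelihood. The sketch below implements them in plain Java so that the two penalties can be compared; `InformationCriteria` is a hypothetical helper and does not call Smile's ModelSelection, and the log-likelihood values in `main` are made up for illustration.

```java
// Hypothetical helper (not Smile's ModelSelection): AIC and BIC from a log-likelihood.
class InformationCriteria {
    /** AIC = 2k - 2 ln(L), where logLikelihood = ln(L) and k is the number of parameters. */
    static double aic(double logLikelihood, int k) {
        return 2.0 * k - 2.0 * logLikelihood;
    }

    /** BIC = k ln(n) - 2 ln(L), where n is the number of observations. */
    static double bic(double logLikelihood, int k, int n) {
        return k * Math.log(n) - 2.0 * logLikelihood;
    }

    public static void main(String[] args) {
        int n = 100;
        // Illustrative values: the larger model fits slightly better but uses more parameters.
        double small = -100.0, large = -98.5;
        System.out.printf("AIC: small = %.2f, large = %.2f%n", aic(small, 3), aic(large, 6));
        System.out.printf("BIC: small = %.2f, large = %.2f%n", bic(small, 3, n), bic(large, 6, n));
    }
}
```

For both criteria, lower values are preferred. Since ln(n) exceeds 2 once n is at least 8, BIC penalizes additional parameters more heavily than AIC on all but the smallest datasets.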

AIC and BIC are both approximately correct according to a different goal and a different set of asymptotic assumptions. Both sets of assumptions have been criticized as unrealistic.

AIC is better in situations when a false negative finding would be considered more misleading than a false positive, and BIC is better in situations where a false positive is as misleading as, or more misleading than, a false negative.
