# Model Validation

When training a supervised model, we should always evaluate its goodness of fit. This helps with model selection and hyperparameter tuning. Note first that the error measured on the training data is likely to be lower than the actual generalization error.

## Evaluation Metrics

Although most supervised learning algorithms try to minimize the empirical error (regularized or not), error rate or accuracy should not be the only objective measure. For example, if a highly unbalanced dataset contains 99% positive samples, a naive algorithm that classifies everything as positive will have 99% accuracy. However, it is useless.
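To make this concrete, the following sketch scores a classifier that always predicts the positive class on a hypothetical sample of 990 positives and 10 negatives. The class and its helper methods are illustrative only and are not part of Smile.

```java
import java.util.Arrays;

final class ImbalancedAccuracyDemo {
    /** Fraction of predictions that match the truth. */
    static double accuracy(int[] truth, int[] pred) {
        int correct = 0;
        for (int i = 0; i < truth.length; i++) {
            if (truth[i] == pred[i]) correct++;
        }
        return (double) correct / truth.length;
    }

    /** Fraction of actual negatives (label 0) that are predicted as negative. */
    static double specificity(int[] truth, int[] pred) {
        int tn = 0, negatives = 0;
        for (int i = 0; i < truth.length; i++) {
            if (truth[i] == 0) {
                negatives++;
                if (pred[i] == 0) tn++;
            }
        }
        return (double) tn / negatives;
    }

    /** Hypothetical labels: the first `positives` entries are 1, the rest are 0. */
    static int[] labels(int positives, int negatives) {
        int[] y = new int[positives + negatives];
        Arrays.fill(y, 0, positives, 1);
        return y;
    }

    /** A naive "model" that predicts the positive class for every sample. */
    static int[] alwaysPositive(int n) {
        int[] pred = new int[n];
        Arrays.fill(pred, 1);
        return pred;
    }

    public static void main(String[] args) {
        int[] truth = labels(990, 10);
        int[] pred = alwaysPositive(truth.length);
        System.out.println("accuracy    = " + accuracy(truth, pred));
        System.out.println("specificity = " + specificity(truth, pred));
    }
}
```

The accuracy is 0.99 while the specificity is 0: not a single negative is recognized, which is why accuracy alone is a poor summary on unbalanced data.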

For classification, Smile has the following measures:

- The **accuracy** is the proportion of true results (both true positives and true negatives) in the population.

- The **sensitivity** or **true positive rate** (TPR), also called **hit rate** or **recall**, is a statistical measure of the performance of a binary classification test. Sensitivity is the proportion of actual positives that are correctly identified as such. `TPR = TP / P = TP / (TP + FN)`

- The **specificity** (SPC) or **true negative rate** is a statistical measure of the performance of a binary classification test. Specificity is the proportion of actual negatives that are correctly identified. `SPC = TN / N = TN / (FP + TN) = 1 - FPR`

- The **precision** or **positive predictive value** (PPV) is the ratio of true positives to combined true and false positives, which is different from sensitivity. `PPV = TP / (TP + FP)`

- The **false discovery rate** (FDR) is the ratio of false positives to combined true and false positives, which is 1 - precision. `FDR = FP / (TP + FP)`

- The **fall-out**, **false alarm rate**, or **false positive rate** (FPR) is the proportion of actual negatives that are incorrectly identified as positive. Fall-out is the Type I error rate and equals 1 - specificity. `FPR = FP / N = FP / (FP + TN)`

- The **F-score** (or **F-measure**) considers both the precision and the recall of the test to compute the score. The traditional or balanced F-score (F1 score) is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0. The general formula involves a positive real β so that the F-score measures the effectiveness of retrieval with respect to a user who attaches β times as much importance to recall as to precision.

In Smile, the class label 1 is regarded as positive and 0 as negative. Note that not all measures can be applied to multi-class data. If one applies such a measure (e.g. specificity or sensitivity) to multi-class data anyway, only label 1 is regarded as positive and all other values are treated as the negative class, so the results may not make sense.
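As a cross-check of the definitions above, here is a minimal sketch that derives each measure from the four confusion-matrix counts. `BinaryMetrics` is a hypothetical helper, not a Smile class, and the F-score uses the standard general form `(1 + β²) · precision · recall / (β² · precision + recall)`. The counts in `main` are taken from the logistic regression confusion matrix reported in the hold-out example on this page.

```java
final class BinaryMetrics {
    // Counts are stored as doubles so that the ratios below use floating-point division.
    final double tp, fp, tn, fn;

    BinaryMetrics(long tp, long fp, long tn, long fn) {
        this.tp = tp;
        this.fp = fp;
        this.tn = tn;
        this.fn = fn;
    }

    // No guard against empty denominators: this sketch assumes both classes
    // occur in the truth and in the predictions.
    double accuracy()    { return (tp + tn) / (tp + fp + tn + fn); }
    double sensitivity() { return tp / (tp + fn); } // TPR, recall
    double specificity() { return tn / (fp + tn); } // 1 - FPR
    double precision()   { return tp / (tp + fp); } // PPV
    double fdr()         { return fp / (tp + fp); } // 1 - precision
    double fallout()     { return fp / (fp + tn); } // FPR

    /** General F-score; beta = 1 gives the balanced F1 score. */
    double fScore(double beta) {
        double b2 = beta * beta;
        double p = precision();
        double r = sensitivity();
        return (1 + b2) * p * r / (b2 * p + r);
    }

    public static void main(String[] args) {
        // Label 1 is positive: TP = 7828, FP = 1541, TN = 8459, FN = 2172.
        BinaryMetrics m = new BinaryMetrics(7828, 1541, 8459, 2172);
        System.out.println("accuracy    = " + m.accuracy());
        System.out.println("sensitivity = " + m.sensitivity());
        System.out.println("specificity = " + m.specificity());
        System.out.println("precision   = " + m.precision());
        System.out.println("FDR         = " + m.fdr());
        System.out.println("fall-out    = " + m.fallout());
        System.out.println("F1          = " + m.fScore(1.0));
        System.out.println("F2          = " + m.fScore(2.0));
    }
}
```

With these counts the sketch agrees with the figures printed elsewhere on this page for the same model, for example an accuracy of 0.81435, a specificity of 0.8459, and an F1 score of about 0.8083.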

The example below shows how to calculate the accuracy of a multi-class model.

```
val segTrain = read.arff("data/weka/segment-challenge.arff")
val segTest = read.arff("data/weka/segment-test.arff")
val model = randomForest("class" ~, segTrain)
val pred = model.predict(segTest)
smile> accuracy(segTest("class").toIntArray, pred)
res5: Double = 0.9728395061728395
```

```
var segTrain = Read.arff("data/weka/segment-challenge.arff");
var segTest = Read.arff("data/weka/segment-test.arff");
var model = RandomForest.fit(Formula.lhs("class"), segTrain);
var pred = model.predict(segTest);
jshell> Accuracy.of(segTest.column("class").toIntArray(), pred)
$161 ==> 0.9617283950617284
```

Sensitivity and specificity are closely related to the concepts of type I and type II errors. For any test, there is usually a trade-off between the measures. This trade-off can be represented graphically using an ROC curve. When using normalized units, the area under the ROC curve is equal to the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative').
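The ranking interpretation of the area under the ROC curve can be checked directly on a small sample. The sketch below counts, over all (positive, negative) pairs, how often the positive instance receives the higher score, with ties counted as one half. It is a quadratic-time illustration with made-up scores, not a description of how Smile computes AUC.

```java
final class PairwiseAuc {
    /**
     * AUC as the probability that a randomly chosen positive (label 1)
     * is scored higher than a randomly chosen negative; ties count 1/2.
     * Assumes the sample contains at least one positive and one negative.
     */
    static double auc(int[] truth, double[] score) {
        double wins = 0.0;
        long pairs = 0;
        for (int i = 0; i < truth.length; i++) {
            if (truth[i] != 1) continue;        // i ranges over positives
            for (int j = 0; j < truth.length; j++) {
                if (truth[j] == 1) continue;    // j ranges over negatives
                pairs++;
                if (score[i] > score[j]) {
                    wins += 1.0;
                } else if (score[i] == score[j]) {
                    wins += 0.5;
                }
            }
        }
        return wins / pairs;
    }

    public static void main(String[] args) {
        // Hypothetical labels and posterior probabilities of the positive class.
        int[] truth = {1, 1, 1, 0, 0, 0};
        double[] score = {0.9, 0.8, 0.4, 0.7, 0.3, 0.2};
        System.out.println("AUC = " + auc(truth, score));
    }
}
```

In this sample, 8 of the 9 positive/negative pairs are ranked correctly (only the positive scored 0.4 falls below the negative scored 0.7), so the estimate is 8/9.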

The following example calculates various metrics for a binary classification problem.

```
val toyTrain = read.csv("data/classification/toy/toy-train.txt", delimiter='\t', header=false)
val toyTest = read.csv("data/classification/toy/toy-test.txt", delimiter='\t', header=false)
val x = toyTrain.select(1, 2).toArray
val y = toyTrain.column(0).toIntArray
val model = logit(x, y, 0.1, 0.001)
val testx = toyTest.select(1, 2).toArray
val testy = toyTest.column(0).toIntArray
val pred = testx.map(model.predict(_))
smile> accuracy(testy, pred)
res7: Double = 0.81435
smile> recall(testy, pred)
res8: Double = 0.7828
smile> sensitivity(testy, pred)
res9: Double = 0.7828
smile> specificity(testy, pred)
res10: Double = 0.8459
smile> fallout(testy, pred)
res11: Double = 0.15410000000000001
smile> fdr(testy, pred)
res12: Double = 0.16447859963710107
smile> f1(testy, pred)
res13: Double = 0.808301925757654
// Calculate posteriori probability for AUC computation.
val posteriori = new Array[Double](2)
val prob = testx.map { x =>
model.predict(x, posteriori)
posteriori(1)
}
smile> auc(testy, prob)
res17: Double = 0.8650958
```

```
var toyTrain = Read.csv("data/classification/toy/toy-train.txt", CSVFormat.DEFAULT.withDelimiter('\t'));
var toyTest = Read.csv("data/classification/toy/toy-test.txt", CSVFormat.DEFAULT.withDelimiter('\t'));
var x = toyTrain.select(1, 2).toArray();
var y = toyTrain.column(0).toIntArray();
var model = LogisticRegression.fit(x, y, 0.1, 0.001, 100);
var testx = toyTest.select(1, 2).toArray();
var testy = toyTest.column(0).toIntArray();
var pred = Arrays.stream(testx).mapToInt(xi -> model.predict(xi)).toArray();
jshell> Accuracy.of(testy, pred)
$171 ==> 0.81435
jshell> Recall.of(testy, pred)
$172 ==> 0.7828
jshell> Sensitivity.of(testy, pred)
$173 ==> 0.7828
jshell> Specificity.of(testy, pred)
$174 ==> 0.8459
jshell> Fallout.of(testy, pred)
$175 ==> 0.15410000000000001
jshell> FDR.of(testy, pred)
$176 ==> 0.16447859963710107
jshell> FMeasure.of(testy, pred)
$177 ==> 0.808301925757654
// Calculate posteriori probability for AUC computation.
var posteriori = new double[2];
var prob = Arrays.stream(testx).mapToDouble(xi -> {
model.predict(xi, posteriori);
return posteriori[1];
}).toArray();
jshell> AUC.of(testy, prob)
$180 ==> 0.8650958
```

For regression, Smile has the following measures:

- MSE (mean squared error) and RMSE (root mean squared error).
- MAD (mean absolute deviation error).
- RSS (residual sum of squares).
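For reference, the regression measures listed above can be computed as in the following sketch. It assumes the usual definitions, with MAD taken as the mean absolute difference between truth and prediction. `RegressionMetrics` is a hypothetical helper rather than Smile's own implementation, and the data in `main` is made up.

```java
final class RegressionMetrics {
    /** Residual sum of squares: sum of (truth - prediction)^2. */
    static double rss(double[] truth, double[] pred) {
        double sum = 0.0;
        for (int i = 0; i < truth.length; i++) {
            double r = truth[i] - pred[i];
            sum += r * r;
        }
        return sum;
    }

    /** Mean squared error: RSS divided by the number of samples. */
    static double mse(double[] truth, double[] pred) {
        return rss(truth, pred) / truth.length;
    }

    /** Root mean squared error, in the same units as the response. */
    static double rmse(double[] truth, double[] pred) {
        return Math.sqrt(mse(truth, pred));
    }

    /** Mean absolute deviation between truth and prediction. */
    static double mad(double[] truth, double[] pred) {
        double sum = 0.0;
        for (int i = 0; i < truth.length; i++) {
            sum += Math.abs(truth[i] - pred[i]);
        }
        return sum / truth.length;
    }

    public static void main(String[] args) {
        double[] truth = {3.0, -0.5, 2.0, 7.0};
        double[] pred = {2.5, 0.0, 2.0, 8.0};
        System.out.println("RSS  = " + rss(truth, pred));
        System.out.println("MSE  = " + mse(truth, pred));
        System.out.println("RMSE = " + rmse(truth, pred));
        System.out.println("MAD  = " + mad(truth, pred));
    }
}
```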

## Out-of-sample Evaluation

The generalization error (also known as the out-of-sample error) is a measure of how accurately an algorithm is able to predict outcome values for previously unseen data. Ideally, test data should be statistically independent from training data. But in practice, we usually have only one historical dataset and the evaluation of a learning algorithm may be sensitive to sampling error. In what follows, we discuss various testing mechanisms.

We provide both Java and Scala helper functions for testing. The Java helper
functions are the static methods of the class `smile.validation.Validation`.
The Scala ones are in the package object of `smile.validation` and
can be accessed directly in the Shell.

### Hold-out Testing

Hold-out testing assumes that all data samples are independently and identically distributed (this is also the basic assumption of most learning algorithms). A part of the data is held out for testing. Many benchmark datasets include a separate test set.

```
// Test a generic classifier. Only accuracy is calculated.
def test[T](x: Array[T], y: Array[Int], testx: Array[T], testy: Array[Int])(trainer: => (Array[T], Array[Int]) => Classifier[T]): Classifier[T]
// Test a binary classifier. Report accuracy, sensitivity/recall, specificity, precision, and F-Score.
def test2[T](x: Array[T], y: Array[Int], testx: Array[T], testy: Array[Int])(trainer: => (Array[T], Array[Int]) => Classifier[T]): Classifier[T]
// Test a binary soft classifier. In addition to test2, AUC is calculated.
def test2soft[T](x: Array[T], y: Array[Int], testx: Array[T], testy: Array[Int])(trainer: => (Array[T], Array[Int]) => SoftClassifier[T]): SoftClassifier[T]
```

The above Scala methods take a code block to train the model and apply it to the test data. These methods return the trained model and print out various measures.

```
val segTrain = read.arff("data/weka/segment-challenge.arff")
val segTest = read.arff("data/weka/segment-test.arff")
smile> test("class" ~, segTrain, segTest) { case (formula, data) => smile.classification.randomForest(formula, data) }
[main] INFO smile.util.package$ - testing runtime: 0:00:00.103314
Accuracy = 97.65%
Confusion Matrix: ROW=truth and COL=predicted
class 0 | 124 | 0 | 0 | 0 | 1 | 0 | 0 |
class 1 | 0 | 110 | 0 | 0 | 0 | 0 | 0 |
class 2 | 3 | 0 | 117 | 1 | 1 | 0 | 0 |
class 3 | 1 | 0 | 0 | 109 | 0 | 0 | 0 |
class 4 | 1 | 0 | 6 | 2 | 117 | 0 | 0 |
class 5 | 0 | 0 | 0 | 0 | 0 | 94 | 0 |
class 6 | 0 | 0 | 1 | 2 | 0 | 0 | 120 |
res21: RandomForest = smile.classification.RandomForest@77f95e19
```

```
var segTrain = Read.arff("data/weka/segment-challenge.arff");
var segTest = Read.arff("data/weka/segment-test.arff");
var formula = Formula.lhs("class");
var model = RandomForest.fit(formula, segTrain);
var pred = model.predict(segTest);
jshell> ConfusionMatrix.of(formula.y(segTest).toIntArray(), pred)
$187 ==> ROW=truth and COL=predicted
class 0 | 124 | 0 | 0 | 0 | 1 | 0 | 0 |
class 1 | 0 | 110 | 0 | 0 | 0 | 0 | 0 |
class 2 | 3 | 0 | 115 | 1 | 3 | 0 | 0 |
class 3 | 2 | 0 | 0 | 106 | 2 | 0 | 0 |
class 4 | 2 | 0 | 10 | 6 | 108 | 0 | 0 |
class 5 | 0 | 0 | 0 | 0 | 0 | 94 | 0 |
class 6 | 2 | 0 | 1 | 0 | 0 | 0 | 120 |
```

```
val toyTrain = read.csv("data/classification/toy/toy-train.txt", delimiter='\t', header=false)
val toyTest = read.csv("data/classification/toy/toy-test.txt", delimiter='\t', header=false)
val x = toyTrain.select(1, 2).toArray
val y = toyTrain.column(0).toIntArray
val testx = toyTest.select(1, 2).toArray
val testy = toyTest.column(0).toIntArray
smile> test2(x, y, testx, testy) { case (x, y) => lda(x, y) }
training...
testing...
[main] INFO smile.util.package$ - runtime: 78.653061 ms
Accuracy = 81.23%
Sensitivity/Recall = 78.28%
Specificity = 84.17%
Precision = 83.18%
F1-Score = 80.66%
F2-Score = 79.21%
F0.5-Score = 82.15%
Confusion Matrix: ROW=truth and COL=predicted
class 0 : 8417 | 1583 |
class 1 : 2172 | 7828 |
res5: LDA = smile.classification.LDA@5a524a19
smile> test2(x, y, testx, testy) { case (x, y) => logit(x, y, 0.1, 0.001) }
training...
testing...
Accuracy = 81.44%
Sensitivity/Recall = 78.28%
Specificity = 84.59%
Precision = 83.55%
F1-Score = 80.83%
F2-Score = 79.28%
F0.5-Score = 82.44%
Confusion Matrix: ROW=truth and COL=predicted
class 0 | 8459 | 1541 |
class 1 | 2172 | 7828 |
res29: LogisticRegression = smile.classification.LogisticRegression@6b0bcea5
// AUC will be reported in binary classification
test2soft(x, y, testx, testy) { case (x, y) => lda(x, y) }
test2soft(x, y, testx, testy) { case (x, y) => logit(x, y, 0.1, 0.001) }
```

### Out-of-bag Error

Out-of-bag (OOB) error, also called out-of-bag estimate, is a method of measuring
the prediction error of random forests, boosted decision trees, and other machine
learning models that use bootstrap aggregating to sub-sample the data used
for training. OOB is the mean prediction error on each training sample `x_i`,
using only the trees that did not have `x_i` in their bootstrap sample.

```
val rf = smile.classification.randomForest("class" ~, iris)
println(s"OOB error = ${rf.error}")
```

```
var rf = smile.classification.RandomForest.fit(Formula.lhs("class"), iris);
System.out.format("OOB error = %.2f%%%n", 100.0 * rf.error());
```

Subsampling allows one to define an out-of-bag estimate of the prediction performance improvement by evaluating predictions on those observations which were not used in the building of the next base learner. Out-of-bag estimates help avoid the need for an independent validation dataset, but often underestimate actual performance improvement and the optimal number of iterations.

## Cross Validation

In `k`-fold cross validation, the dataset is divided into `k` random partitions.
We treat each of the `k` partitions as a hold-out set, train a model on
the rest of the data, and measure the quality of the model on the held-out partition.
The overall performance is taken to be the average of the performance
on all `k` partitions.
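The partitioning step described above can be sketched as follows. `KFold` is a hypothetical helper written for illustration and does not reflect Smile's internal implementation: it shuffles the indices `0..n-1` with a fixed seed and cuts them into `k` folds whose sizes differ by at most one.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.stream.IntStream;

final class KFold {
    /** Returns k folds; fold[i] holds the held-out indices of round i. */
    static int[][] folds(int n, int k, long seed) {
        int[] index = IntStream.range(0, n).toArray();
        Random rng = new Random(seed);
        // Fisher-Yates shuffle so that the partitions are random.
        for (int i = n - 1; i > 0; i--) {
            int j = rng.nextInt(i + 1);
            int tmp = index[i];
            index[i] = index[j];
            index[j] = tmp;
        }
        int[][] fold = new int[k][];
        int start = 0;
        for (int i = 0; i < k; i++) {
            // The first n % k folds get one extra element.
            int size = n / k + (i < n % k ? 1 : 0);
            fold[i] = Arrays.copyOfRange(index, start, start + size);
            start += size;
        }
        return fold;
    }

    /** Training indices of a round: every index outside the held-out fold. */
    static int[] trainIndices(int[][] fold, int round) {
        return IntStream.range(0, fold.length)
                .filter(i -> i != round)
                .flatMap(i -> Arrays.stream(fold[i]))
                .toArray();
    }

    public static void main(String[] args) {
        int[][] fold = folds(150, 10, 1L);
        for (int round = 0; round < fold.length; round++) {
            System.out.println("round " + round
                    + ": train on " + trainIndices(fold, round).length
                    + " samples, test on " + fold[round].length);
        }
    }
}
```

Each round would train on `trainIndices(fold, round)` and evaluate on `fold[round]`; averaging the `k` scores gives the cross-validation estimate.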

```
object cv {
def classification[T](x: Array[T], y: Array[Int], k: Int, measures: ClassificationMeasure*)(trainer: => (Array[T], Array[Int]) => Classifier[T]): Array[Double]
def regression[T](x: Array[T], y: Array[Double], k: Int, measures: RegressionMeasure*)(trainer: => (Array[T], Array[Double]) => Regression[T]): Array[Double]
}
```

```
public class CrossValidation {
public static int[] classification(int k, T[] x, int[] y, BiFunction<T[], int[], Classifier<T>> trainer);
public static int[] classification(int k, Formula formula, DataFrame data, BiFunction<Formula, DataFrame, DataFrameClassifier> trainer);
public static double[] regression(int k, T[] x, double[] y, BiFunction<T[], double[], Regression<T>> trainer);
public static double[] regression(int k, Formula formula, DataFrame data, BiFunction<Formula, DataFrame, DataFrameRegression> trainer);
}
```

When no measures are provided, the methods use accuracy or RMSE by default for classification or regression, respectively.

```
smile> val iris = read.arff("data/weka/iris.arff")
smile> cv.classification(10, "class" ~, iris) { case (formula, data) => smile.classification.cart(formula, data) }
[main] INFO smile.util.package$ - Decision Tree runtime: 0:00:00.4392
[main] INFO smile.util.package$ - Decision Tree runtime: 0:00:00.1187
[main] INFO smile.util.package$ - Decision Tree runtime: 0:00:00.1340
[main] INFO smile.util.package$ - Decision Tree runtime: 0:00:00.1120
[main] INFO smile.util.package$ - Decision Tree runtime: 0:00:00.876
[main] INFO smile.util.package$ - Decision Tree runtime: 0:00:00.1105
[main] INFO smile.util.package$ - Decision Tree runtime: 0:00:00.1570
[main] INFO smile.util.package$ - Decision Tree runtime: 0:00:00.818
[main] INFO smile.util.package$ - Decision Tree runtime: 0:00:00.1013
[main] INFO smile.util.package$ - Decision Tree runtime: 0:00:00.929
Confusion Matrix: ROW=truth and COL=predicted
class 0 | 50 | 0 | 0 |
class 1 | 0 | 45 | 5 |
class 2 | 0 | 5 | 45 |
Accuracy: 93.33%
res35: Array[Double] = Array(0.9333333333333333)
```

```
jshell> var iris = Read.arff("data/weka/iris.arff");
[main] INFO smile.io.Arff - Read ARFF relation iris
iris ==> [sepallength: float, sepalwidth: float, petalleng ... -------+
140 more rows...
jshell> var pred = CrossValidation.classification(10, Formula.lhs("class"), iris, (formula, data) -> DecisionTree.fit(formula, data));
pred ==> int[150] { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2 }
jshell> var y = iris.column("class").toIntArray()
y ==> int[150] { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... , 2, 2, 2, 2, 2, 2, 2, 2 }
jshell> Accuracy.of(y, pred)
$193 ==> 0.9266666666666666
jshell> ConfusionMatrix.of(y, pred)
$194 ==> ROW=truth and COL=predicted
class 0 | 50 | 0 | 0 |
class 1 | 0 | 45 | 5 |
class 2 | 0 | 6 | 44 |
```

On the Iris data, the accuracy estimate of 10-fold cross validation is about 84.7%. You may get a different number because the partitions are random.

A special case is the leave-one-out cross validation that uses a single observation from the original sample as the validation data, and the remaining observations as the training data. This is repeated such that each observation in the sample is used once as the validation data. Leave-one-out cross-validation is usually very expensive from a computational point of view because of the large number of times the training process is repeated.
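The loop structure of leave-one-out cross validation is easy to see in a small sketch. The example below uses a one-dimensional 1-nearest-neighbor rule as a stand-in classifier, chosen only because it needs no explicit training step. The class, the data, and the labels are all hypothetical.

```java
final class LeaveOneOut {
    /** Predicts x[test] with the label of its nearest neighbor among all other points. */
    static int predictWithout(double[] x, int[] y, int test) {
        int best = -1;
        for (int i = 0; i < x.length; i++) {
            if (i == test) continue; // leave the validation observation out
            if (best < 0 || Math.abs(x[i] - x[test]) < Math.abs(x[best] - x[test])) {
                best = i;
            }
        }
        return y[best];
    }

    /** Repeats the hold-out of a single observation n times and averages the hits. */
    static double accuracy(double[] x, int[] y) {
        int correct = 0;
        for (int i = 0; i < x.length; i++) {
            if (predictWithout(x, y, i) == y[i]) correct++;
        }
        return (double) correct / x.length;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 1.2, 1.4, 3.0, 3.1, 3.3, 2.5};
        int[] y = {0, 0, 0, 1, 1, 1, 0};
        System.out.println("LOOCV accuracy = " + accuracy(x, y));
    }
}
```

Only the last point (2.5, labeled 0) is misclassified, because its nearest neighbor is 3.0, so the estimate is 6/7. With a model that has a real training step, the body of the loop would refit the model `n` times, which is where the computational cost comes from.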

```
object loocv {
def classification[T](x: Array[T], y: Array[Int], measures: ClassificationMeasure*)(trainer: => (Array[T], Array[Int]) => Classifier[T]): Array[Double]
def regression[T](x: Array[T], y: Array[Double], measures: RegressionMeasure*)(trainer: => (Array[T], Array[Double]) => Regression[T]): Array[Double]
}
```

```
public class LOOCV {
public static int[] classification(T[] x, int[] y, BiFunction<T[], int[], Classifier<T>> trainer);
public static int[] classification(Formula formula, DataFrame data, BiFunction<Formula, DataFrame, DataFrameClassifier> trainer);
public static double[] regression(T[] x, double[] y, BiFunction<T[], double[], Regression<T>> trainer);
public static double[] regression(Formula formula, DataFrame data, BiFunction<Formula, DataFrame, DataFrameRegression> trainer);
}
```

On the Iris data, the accuracy estimation of LOOCV is 85.33%, which is higher than that of 10-fold cross validation. This is because more data is used for training and less for testing.

```
smile> loocv.classification(x, y) { case (x, y) => lda(x, y) }
Confusion Matrix: ROW=truth and COL=predicted
class 0 | 80 | 20 |
class 1 | 19 | 81 |
Accuracy: 80.50%
res41: Array[Double] = Array(0.805)
```

```
jshell> var x = iris.drop("class").toArray();
x ==> double[150][] { double[4] { 5.099999904632568, 3. ... 68, 1.7999999523162842 } }
jshell> var pred = LOOCV.classification(x, y, (x, y) -> LDA.fit(x, y));
Mar 11, 2020 10:14:52 AM com.github.fommil.jni.JniLoader load
INFO: already loaded netlib-native_system-osx-x86_64.jnilib
pred ==> int[150] { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... , 2, 2, 2, 2, 1, 2, 2, 2 }
jshell> Accuracy.of(y, pred)
$197 ==> 0.8533333333333334
jshell> ConfusionMatrix.of(y, pred)
$198 ==> ROW=truth and COL=predicted
class 0 | 49 | 1 | 0 |
class 1 | 0 | 41 | 9 |
class 2 | 0 | 12 | 38 |
```

## Bootstrap

Bootstrap is a general tool for assessing statistical accuracy. The basic
idea is to randomly draw data with replacement from the training data,
with each bootstrap sample set having the same size as the original training set.
In a bootstrap set, the expected ratio of unique instances is
approximately `1 − 1/e ≈ 63.2%`. This process is repeated many
times (say `k = 100`), producing `k` bootstrap datasets.
Then we fit the model to each of the bootstrap datasets and examine
the behavior of the fits over the `k` replications.
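The 63.2% figure can be verified empirically. The sketch below, which uses hypothetical helper names and a fixed seed, draws one bootstrap sample of indices with replacement and measures the fraction of original instances that appear at least once. The instances that never appear are the out-of-bag ones.

```java
import java.util.Random;

final class BootstrapSample {
    /** Draws n indices from 0..n-1 uniformly at random, with replacement. */
    static int[] sample(int n, Random rng) {
        int[] index = new int[n];
        for (int i = 0; i < n; i++) {
            index[i] = rng.nextInt(n);
        }
        return index;
    }

    /** Fraction of the original instances that occur at least once in the sample. */
    static double uniqueFraction(int[] sample) {
        boolean[] seen = new boolean[sample.length];
        int unique = 0;
        for (int i : sample) {
            if (!seen[i]) {
                seen[i] = true;
                unique++;
            }
        }
        return (double) unique / sample.length;
    }

    public static void main(String[] args) {
        int[] index = sample(100_000, new Random(42));
        System.out.println("unique fraction = " + uniqueFraction(index));
        System.out.println("1 - 1/e         = " + (1.0 - 1.0 / Math.E));
    }
}
```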

```
object bootstrap {
def classification[T](k: Int, x: Array[T], y: Array[Int], measures: ClassificationMeasure*)(trainer: => (Array[T], Array[Int]) => Classifier[T]): Array[Double]
def regression[T](k: Int, x: Array[T], y: Array[Double], measures: RegressionMeasure*)(trainer: => (Array[T], Array[Double]) => Regression[T]): Array[Double]
}
```

```
public class Bootstrap {
public static double[] classification(int k, T[] x, int[] y, BiFunction<T[], int[], Classifier<T>> trainer);
public static double[] classification(int k, Formula formula, DataFrame data, BiFunction<Formula, DataFrame, DataFrameClassifier> trainer);
public static double[] regression(int k, T[] x, double[] y, BiFunction<T[], double[], Regression<T>> trainer);
public static double[] regression(int k, Formula formula, DataFrame data, BiFunction<Formula, DataFrame, DataFrameRegression> trainer);
}
```

On the Iris data, the accuracy estimation of 100 bootstraps is about 83.7%, which is slightly lower than that of 10-fold cross validation.

```
smile> bootstrap.classification(100, x, y) { case (x, y) => lda(x, y) }
res40: Array[Double] = Array(
0.21212121212121215,
0.22499999999999998,
0.16901408450704225,
0.16666666666666663,
0.25,
0.19480519480519476,
0.19999999999999996,
0.273972602739726,
0.125,
0.1842105263157895,
0.16129032258064513,
0.17808219178082196,
0.18461538461538463,
0.23750000000000004,
0.22972972972972971,
0.14864864864864868,
0.17808219178082196,
0.17333333333333334,
0.2777777777777778,
0.16666666666666663,
0.18666666666666665,
0.22388059701492535,
...
```

```
jshell> Bootstrap.classification(100, x, y, (x, y) -> LDA.fit(x, y))
$199 ==> double[100] { 0.11111111111111116, 0.18867924528301883, 0.09090909090909094, 0.2068965517241379, 0.1428571428571429, 0.19999999999999996, 0.16981132075471694, 0.21153846153846156, 0.1785714285714286, 0.109375, 0.16666666666666663, 0.2142857142857143, 0.1071428571428571, 0.11764705882352944, 0.2545454545454545, 0.21568627450980393, 0.25806451612903225, 0.06382978723404253, 0.14814814814814814, 0.2222222222222222, 0.1578947368421053, 0.15517241379310343, 0.25, 0.18965517241379315, 0.17543859649122806, 0.18333333333333335, 0.12765957446808507, 0.0892857142857143, 0.17307692307692313, 0.16666666666666663, 0.17647058823529416, 0.2142857142857143, 0.12, 0.1818 ... 615, 0.1724137931034483, 0.11111111111111116, 0.1071428571428571, 0.1228070175438597, 0.2142857142857143, 0.23076923076923073, 0.07843137254901966, 0.13793103448275867, 0.06896551724137934, 0.17021276595744683, 0.1578947368421053, 0.2075471698113207, 0.1568627450980392, 0.1636363636363637, 0.18518518518518523, 0.15384615384615385 }
```

The bootstrap distribution of a parameter estimator has been used to calculate confidence intervals for the corresponding population parameter. If the bootstrap distribution of an estimator is symmetric, then percentile confidence intervals are often used; such intervals are especially appropriate for median-unbiased estimators of minimum risk (with respect to an absolute loss function). Otherwise, if the bootstrap distribution is non-symmetric, then percentile confidence intervals are often inappropriate.

The bootstrap distribution and the sample may disagree systematically, in which case bias may occur. Bias in the bootstrap distribution will lead to bias in the confidence-interval.
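As an illustration of a percentile interval, the following sketch sorts the `k` bootstrap estimates (for instance, the error rates returned by a bootstrap run) and reads off the lower and upper quantiles. `PercentileInterval` and its index convention, which rounds outward, are assumptions made for this example. Other quantile conventions exist, and the values in `main` are made up.

```java
import java.util.Arrays;

final class PercentileInterval {
    /** Returns {lower, upper} of the central (1 - alpha) percentile interval. */
    static double[] of(double[] estimates, double alpha) {
        double[] sorted = estimates.clone();
        Arrays.sort(sorted);
        int n = sorted.length;
        // Round the lower index down and the upper index up (a conservative choice).
        int lo = (int) Math.floor((alpha / 2) * (n - 1));
        int hi = (int) Math.ceil((1 - alpha / 2) * (n - 1));
        return new double[] {sorted[lo], sorted[hi]};
    }

    public static void main(String[] args) {
        // Hypothetical bootstrap error rates.
        double[] error = {0.11, 0.19, 0.09, 0.21, 0.14, 0.20, 0.17, 0.21, 0.18, 0.11};
        double[] ci = of(error, 0.10);
        System.out.println("90% percentile interval: [" + ci[0] + ", " + ci[1] + "]");
    }
}
```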

## Hyperparameter Tuning

A hyperparameter is a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training. Hyperparameters can be classified as model hyperparameters, that cannot be inferred while fitting the machine to the training set because they refer to the model selection task, or algorithm hyperparameters, that in principle have no influence on the performance of the model but affect the speed and quality of the learning process. For example, the topology and size of a neural network are model hyperparameters, while learning rate and mini-batch size are algorithm hyperparameters.

In Smile, the `Hyperparameters` class provides two generic
approaches to sampling search candidates. With the `add()`
methods, the user can define a parameter space with a specified
distribution (a fixed value, an array of values, or a range).
The method `grid()` exhaustively considers all parameter
combinations, while `random()` generates a stream of
random candidates.

```
import smile.io.*;
import smile.data.formula.Formula;
import smile.validation.*;
import smile.classification.RandomForest;
var hp = new Hyperparameters()
.add("smile.random.forest.trees", 100) // a fixed value
.add("smile.random.forest.mtry", new int[] {2, 3, 4}) // an array of values to choose
.add("smile.random.forest.max.nodes", 100, 500, 50); // range [100, 500] with step 50
var train = Read.arff("data/weka/segment-challenge.arff");
var test = Read.arff("data/weka/segment-test.arff");
var formula = Formula.lhs("class");
var testy = formula.y(test).toIntArray();
hp.grid().forEach(prop -> {
var model = RandomForest.fit(formula, train, prop);
var pred = model.predict(test);
System.out.println(prop);
System.out.format("Accuracy = %.2f%%%n", (100.0 * Accuracy.of(testy, pred)));
System.out.println(ConfusionMatrix.of(testy, pred));
});
```

While grid search is popular, random search has the benefit that its
budget can be chosen independently of the number of parameters and possible values.
Note that `random()` returns a stream that never ends.
Therefore, one should use the `limit()` method to decide
how many configurations to test.

```
hp.random().limit(20).forEach(prop -> {
var model = RandomForest.fit(formula, train, prop);
var pred = model.predict(test);
System.out.println(prop);
System.out.format("Accuracy = %.2f%%%n", (100.0 * Accuracy.of(testy, pred)));
System.out.println(ConfusionMatrix.of(testy, pred));
});
```

In the lambda of hyperparameter tuning, the user is free to train any model (or even multiple algorithms) and to evaluate it with one or more metrics. The evaluation approach can also be cross validation or bootstrap, rather than a separate test set as in the examples above.

Both grid search and random search evaluate each parameter setting
independently. Therefore, computations may be run in parallel with a
parallel stream (enabled with `parallel()`). Note that
some algorithms already run in parallel (e.g. random forest, logistic
regression, etc.). In those cases, we should NOT use a parallel stream,
to avoid potential deadlock.

## Model Selection Criteria

Model selection is the task of selecting a statistical model from a set of candidate models, given data. In the simplest cases, a pre-existing set of data is considered. Given candidate models of similar predictive or explanatory power, the simplest model is most likely to be the best choice (Occam's razor).

A good model selection technique will balance goodness of fit with simplicity. More complex models will be better able to adapt their shape to fit the data, but the additional parameters may not represent anything useful. Goodness of fit is generally determined using a likelihood ratio approach, or an approximation of this, leading to a chi-squared test. The complexity is generally measured by counting the number of parameters in the model.

The most commonly used criteria are the Akaike information criterion (AIC)
and the Bayesian information criterion (BIC), which are implemented in
`ModelSelection`. The formula for BIC is similar
to the formula for AIC, but with a different penalty for the number of
parameters. With AIC the penalty is `2k`, whereas with BIC
the penalty is `log(n) * k`.
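To show how the two penalties play out, here is a minimal sketch using the standard definitions `AIC = 2k − 2 ln(L)` and `BIC = ln(n) · k − 2 ln(L)`, where `L` is the maximized likelihood, `k` the number of parameters, and `n` the sample size. `InformationCriteria` is a hypothetical helper whose signatures are not those of Smile's `ModelSelection`, and the log-likelihoods in `main` are made up.

```java
final class InformationCriteria {
    /** AIC = 2k - 2 ln(L); lower is better. */
    static double aic(double logLikelihood, int k) {
        return 2.0 * k - 2.0 * logLikelihood;
    }

    /** BIC = ln(n) * k - 2 ln(L); lower is better. */
    static double bic(double logLikelihood, int k, int n) {
        return Math.log(n) * k - 2.0 * logLikelihood;
    }

    public static void main(String[] args) {
        int n = 200;
        // Hypothetical candidates: B fits slightly better but uses twice as many parameters.
        double logLikA = -120.0;
        int kA = 3;
        double logLikB = -118.5;
        int kB = 6;
        System.out.println("AIC: A = " + aic(logLikA, kA) + ", B = " + aic(logLikB, kB));
        System.out.println("BIC: A = " + bic(logLikA, kA, n) + ", B = " + bic(logLikB, kB, n));
    }
}
```

Because `log(n)` exceeds 2 once `n` is 8 or more, BIC penalizes the extra parameters of model B more heavily than AIC does.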

AIC and BIC are both approximately correct according to a different goal and a different set of asymptotic assumptions. Both sets of assumptions have been criticized as unrealistic.

AIC is better in situations when a false negative finding would be considered more misleading than a false positive, and BIC is better in situations where a false positive is as misleading as, or more misleading than, a false negative.