smile.classification.AbstractClassifier<Tuple>

smile.classification.RandomForest

All Implemented Interfaces: Serializable, ToDoubleFunction<Tuple>, ToIntFunction<Tuple>, Classifier<Tuple>, DataFrameClassifier, SHAP<Tuple>, TreeSHAP

public class RandomForest extends AbstractClassifier<Tuple> implements DataFrameClassifier, TreeSHAP

Random forest for classification. A random forest is an ensemble classifier that consists of many decision trees and outputs the majority vote of the individual trees. The method combines bagging with the random selection of features.

Each tree is constructed using the following algorithm:

If the number of cases in the training set is N, randomly sample N cases with replacement from the original data. This sample will be the training set for growing the tree.
If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
Each tree is grown to the largest extent possible. There is no pruning.
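
The two random ingredients of the per-tree procedure above can be sketched in plain Java. This is a minimal illustration, not Smile's implementation: the class `TreeSampling` and its method names are hypothetical, and it only draws the bootstrap row indices and the m candidate features for one node.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical helper for illustration; not part of Smile. */
final class TreeSampling {
    /** Draws n row indices uniformly with replacement from 0..n-1. */
    static int[] bootstrap(int n, Random rng) {
        int[] sample = new int[n];
        for (int i = 0; i < n; i++) {
            sample[i] = rng.nextInt(n);
        }
        return sample;
    }

    /** Picks m distinct feature indices out of M with a partial Fisher-Yates shuffle. */
    static int[] randomFeatures(int M, int m, Random rng) {
        if (m < 1 || m > M) {
            throw new IllegalArgumentException("need 1 <= m <= M");
        }
        int[] features = new int[M];
        for (int j = 0; j < M; j++) {
            features[j] = j;
        }
        // Only the first m positions need to be shuffled.
        for (int j = 0; j < m; j++) {
            int r = j + rng.nextInt(M - j);
            int tmp = features[j];
            features[j] = features[r];
            features[r] = tmp;
        }
        return Arrays.copyOf(features, m);
    }
}
```

A tree would call `bootstrap(N, rng)` once and `randomFeatures(M, m, rng)` at every node it tries to split. A common heuristic for classification is m = floor(sqrt(M)); that value is an assumption here, since the description above only requires m << M.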

The advantages of random forest are:

For many data sets, it produces a highly accurate classifier.
It runs efficiently on large data sets.
It can handle thousands of input variables without variable deletion.
It gives estimates of what variables are important in the classification.
It generates an internal unbiased estimate of the generalization error as the forest building progresses.
It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.

The disadvantages are:

Random forests are prone to over-fitting for some datasets. This is even more pronounced on noisy data.
For data that include categorical variables with different numbers of levels, random forests are biased in favor of the attributes with more levels. Therefore, the variable importance scores from a random forest are not reliable for this type of data.

Nested Class Summary

Nested Classes

Modifier and Type

Class

Description

static final record

RandomForest.Model

The base model.

static final record

RandomForest.Options

Random forest hyperparameters.

static final record

RandomForest.TrainingStatus

Training status per tree.

Nested classes/interfaces inherited from interface Classifier
Classifier.Trainer<T,M>

Nested classes/interfaces inherited from interface DataFrameClassifier
DataFrameClassifier.Trainer<M>
Field Summary

Fields inherited from class AbstractClassifier
classes
Constructor Summary

Constructors

Constructor

Description

RandomForest(Formula formula, int k, RandomForest.Model[] models, ClassificationMetrics metrics, double[] importance)

Constructor.

RandomForest(Formula formula, int k, RandomForest.Model[] models, ClassificationMetrics metrics, double[] importance, IntSet labels)

Constructor.
Method Summary

Modifier and Type

Method

Description

static RandomForest

fit(Formula formula, DataFrame data)

Fits a random forest for classification.

static RandomForest

fit(Formula formula, DataFrame data, RandomForest.Options options)

Fits a random forest for classification.

Formula

formula()

Returns the formula associated with the model.

double[]

importance()

Returns the variable importance.

RandomForest

merge(RandomForest other)

Merges two random forests.

ClassificationMetrics

metrics()

Returns the overall out-of-bag metric estimations.

RandomForest.Model[]

models()

Returns the base models.

int

predict(Tuple x)

Predicts the class label of an instance.

int

predict(Tuple x, double[] posteriori)

Predicts the class label of an instance and also calculates the a posteriori probabilities.

RandomForest

prune(DataFrame test)

Returns a new random forest by reduced error pruning.

StructType

schema()

Returns the predictor schema.

int

size()

Returns the number of trees in the model.

boolean

soft()

Returns true if this is a soft classifier that can estimate the posteriori probabilities of classification.

int[][]

test(DataFrame data)

Tests the model on a validation dataset.

DecisionTree[]

trees()

Returns the decision trees.

RandomForest

trim(int ntrees)

Trims the tree model set to a smaller size in case of over-fitting.

int

vote(Tuple x, double[] posteriori)

Predicts the class label and estimates the probabilities by voting.

Methods inherited from class AbstractClassifier
classes, numClasses

Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Methods inherited from interface Classifier
applyAsDouble, applyAsInt, classes, numClasses, online, predict, predict, predict, predict, predict, predict, score, update, update, update

Methods inherited from interface DataFrameClassifier
predict, predict

Methods inherited from interface SHAP
shap

Methods inherited from interface TreeSHAP
shap, shap

Constructor Details
- RandomForest
  
  public RandomForest(Formula formula, int k, RandomForest.Model[] models, ClassificationMetrics metrics, double[] importance)
  
  Constructor.
  
  Parameters:
  
  formula - a symbolic description of the model to be fitted.
  
  k - the number of classes.
  
  models - forest of decision trees.
  
  metrics - the overall out-of-bag metric estimation.
  
  importance - the feature importance.
- RandomForest
  
  public RandomForest(Formula formula, int k, RandomForest.Model[] models, ClassificationMetrics metrics, double[] importance, IntSet labels)
  
  Constructor.
  
  Parameters:
  
  formula - a symbolic description of the model to be fitted.
  
  k - the number of classes.
  
  models - the base models.
  
  metrics - the overall out-of-bag metric estimation.
  
  importance - the feature importance.
  
  labels - the class label encoder.
Method Details
- fit
  
  public static RandomForest fit(Formula formula, DataFrame data)
  
  Fits a random forest for classification.
  
  Parameters:
  
  formula - a symbolic description of the model to be fitted.
  
  data - the data frame of the explanatory and response variables.
  
  Returns:
  
  the model.
- fit
  
  public static RandomForest fit(Formula formula, DataFrame data, RandomForest.Options options)
  
  Fits a random forest for classification.
  
  Parameters:
  
  formula - a symbolic description of the model to be fitted.
  
  data - the data frame of the explanatory and response variables.
  
  options - the hyperparameters.
  
  Returns:
  
  the model.
- formula
  
  public Formula formula()
  
  Description copied from interface: DataFrameClassifier
  
  Returns the formula associated with the model.
  
  Specified by:
  
  formula in interface DataFrameClassifier
  
  Specified by:
  
  formula in interface TreeSHAP
  
  Returns:
  
  the formula associated with the model.
- schema
  
  public StructType schema()
  
  Description copied from interface: DataFrameClassifier
  
  Returns the predictor schema.
  
  Specified by:
  
  schema in interface DataFrameClassifier
  
  Returns:
  
  the predictor schema.
- metrics
  
  public ClassificationMetrics metrics()
  
  Returns the overall out-of-bag metric estimations. The OOB estimate is quite accurate provided that enough trees have been grown. Otherwise, the OOB error estimate can be biased upward.
  
  Returns:
  
  the out-of-bag metric estimations.
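  
  As a rough illustration of how an out-of-bag estimate can be formed, the sketch below lets each sample be voted on only by the trees whose bootstrap sample did not contain it. The class `OobEstimate` and the array layouts are hypothetical and chosen for this example; they are not how Smile stores its training state.
  
  ```java
  /** Hypothetical helper for illustration; not part of Smile. */
  final class OobEstimate {
      /**
       * @param inBag inBag[t][i] is true if sample i was drawn into tree t's bootstrap sample
       * @param pred  pred[t][i] is tree t's predicted class (0..k-1) for sample i
       * @param y     true class labels in 0..k-1
       * @param k     number of classes
       * @return error rate over samples that are out-of-bag for at least one tree,
       *         or NaN if no sample is out-of-bag for any tree
       */
      static double error(boolean[][] inBag, int[][] pred, int[] y, int k) {
          int evaluated = 0;
          int wrong = 0;
          for (int i = 0; i < y.length; i++) {
              int[] votes = new int[k];
              int voters = 0;
              for (int t = 0; t < pred.length; t++) {
                  if (!inBag[t][i]) {        // only trees that never saw sample i may vote
                      votes[pred[t][i]]++;
                      voters++;
                  }
              }
              if (voters == 0) {
                  continue;                  // sample i was in every bootstrap sample
              }
              int best = 0;
              for (int c = 1; c < k; c++) {
                  if (votes[c] > votes[best]) {
                      best = c;              // ties go to the lowest class index
                  }
              }
              evaluated++;
              if (best != y[i]) {
                  wrong++;
              }
          }
          return evaluated == 0 ? Double.NaN : (double) wrong / evaluated;
      }
  }
  ```
  
  With only a few trees, each sample has few out-of-bag voters, so its vote is noisy. This is one way to see why the estimate improves as more trees are grown.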
- importance
  
  public double[] importance()
  
  Returns the variable importance. Every time a node is split on a variable, the impurity criterion (Gini, information gain, etc.) for the two descendant nodes is less than that of the parent node. Adding up the decreases for each individual variable over all trees in the forest gives a fast measure of variable importance that is often very consistent with the permutation importance measure.
  
  Returns:
  
  the variable importance.
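  
  The impurity decrease of a single split can be sketched as follows, using Gini impurity. The class `ImpurityDecrease` is hypothetical, and weighting each node's impurity by its sample count is an assumption of this sketch; the description above only says that the decreases are added up per variable.
  
  ```java
  /** Hypothetical helper for illustration; not part of Smile. */
  final class ImpurityDecrease {
      /** Gini impurity 1 - sum_c p_c^2 of a node with the given class counts. */
      static double gini(int[] counts) {
          long n = 0;
          for (int c : counts) {
              n += c;
          }
          if (n == 0) {
              return 0.0;
          }
          double sumSq = 0.0;
          for (int c : counts) {
              double p = (double) c / n;
              sumSq += p * p;
          }
          return 1.0 - sumSq;
      }
  
      /** Size-weighted Gini decrease when a parent node is split into left and right. */
      static double decrease(int[] left, int[] right) {
          int[] parent = new int[left.length];
          int nl = 0;
          int nr = 0;
          for (int c = 0; c < left.length; c++) {
              parent[c] = left[c] + right[c];
              nl += left[c];
              nr += right[c];
          }
          return (nl + nr) * gini(parent) - nl * gini(left) - nr * gini(right);
      }
  }
  ```
  
  An importance vector would then be accumulated with `importance[feature] += ImpurityDecrease.decrease(left, right)` for every split on `feature` in every tree.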
- size
  
  public int size()
  
  Returns the number of trees in the model.
  
  Returns:
  
  the number of trees in the model.
- models
  
  public RandomForest.Model[] models()
  
  Returns the base models.
  
  Returns:
  
  the base models.
- trees
  
  public DecisionTree[] trees()
  
  Description copied from interface: TreeSHAP
  
  Returns the decision trees.
  
  Specified by:
  
  trees in interface TreeSHAP
  
  Returns:
  
  the decision trees.
- trim
  
  public RandomForest trim(int ntrees)
  
  Trims the tree model set to a smaller size in case of over-fitting. Alternatively, if extra decision trees in the model don't improve performance, we may remove them to reduce the model size and speed up prediction.
  
  Parameters:
  
  ntrees - the new (smaller) size of tree model set.
  
  Returns:
  
  a new trimmed forest.
- merge
  
  public RandomForest merge(RandomForest other)
  
  Merges two random forests.
  
  Parameters:
  
  other - the other forest to merge with.
  
  Returns:
  
  the merged forest.
- predict
  
  public int predict(Tuple x)
  
  Description copied from interface: Classifier
  
  Predicts the class label of an instance.
  
  Specified by:
  
  predict in interface Classifier<Tuple>
  
  Parameters:
  
  x - the instance to be classified.
  
  Returns:
  
  the predicted class label.
- soft
  
  public boolean soft()
  
  Description copied from interface: Classifier
  
  Returns true if this is a soft classifier that can estimate the posteriori probabilities of classification.
  
  Specified by:
  
  soft in interface Classifier<Tuple>
  
  Returns:
  
  true if soft classifier.
- predict
  
  public int predict(Tuple x, double[] posteriori)
  
  Description copied from interface: Classifier
  
  Predicts the class label of an instance and also calculates the a posteriori probabilities. Classifiers may NOT support this method, since not all classification algorithms are able to calculate a posteriori probabilities.
  
  Specified by:
  
  predict in interface Classifier<Tuple>
  
  Parameters:
  
  x - an instance to be classified.
  
  posteriori - a posteriori probabilities on output.
  
  Returns:
  
  the predicted class label.
- vote
  
  public int vote(Tuple x, double[] posteriori)
  
  Predicts the class label and estimates the probabilities by voting.
  
  Parameters:
  
  x - the instance to be classified.
  
  posteriori - a posteriori probabilities on output.
  
  Returns:
  
  the predicted class label.
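  
  A minimal sketch of hard voting, assuming each tree has already produced a class index: the fraction of votes per class is written into the output array and the most-voted class is returned. The class `MajorityVote` is hypothetical and only mirrors the shape of the signature above; it is not Smile's code.
  
  ```java
  import java.util.Arrays;
  
  /** Hypothetical helper for illustration; not part of Smile. */
  final class MajorityVote {
      /**
       * @param treeLabels class index predicted by each tree, each in 0..k-1
       * @param posteriori output array of length k; receives the fraction of votes per class
       * @return the class with the most votes (ties go to the lowest index)
       */
      static int vote(int[] treeLabels, double[] posteriori) {
          if (treeLabels.length == 0) {
              throw new IllegalArgumentException("at least one tree is required");
          }
          Arrays.fill(posteriori, 0.0);
          for (int label : treeLabels) {
              posteriori[label] += 1.0;
          }
          int best = 0;
          for (int c = 0; c < posteriori.length; c++) {
              posteriori[c] /= treeLabels.length;   // turn the count into a vote fraction
              if (posteriori[c] > posteriori[best]) {
                  best = c;
              }
          }
          return best;
      }
  }
  ```
  
  The vote fractions sum to one, so they can be read as a crude probability estimate, although they are coarser than averaging per-tree class probabilities.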
- test
  
  public int[][] test(DataFrame data)
  
  Tests the model on a validation dataset.
  
  Parameters:
  
  data - the test data set.
  
  Returns:
  
  the predictions made with the first 1, 2, ..., decision trees.
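  
  One possible use of such a matrix is to plot accuracy against forest size. The sketch below assumes a layout in which `prediction[t][i]` is the label given to sample `i` by the first `t + 1` trees; that layout, the class `ForestSizeCurve`, and the tolerance rule are assumptions made for illustration.
  
  ```java
  /** Hypothetical helper for illustration; not part of Smile. */
  final class ForestSizeCurve {
      /** accuracy[t] is the fraction of samples classified correctly by the first t + 1 trees. */
      static double[] accuracy(int[][] prediction, int[] truth) {
          double[] acc = new double[prediction.length];
          for (int t = 0; t < prediction.length; t++) {
              int correct = 0;
              for (int i = 0; i < truth.length; i++) {
                  if (prediction[t][i] == truth[i]) {
                      correct++;
                  }
              }
              acc[t] = (double) correct / truth.length;
          }
          return acc;
      }
  
      /** Smallest number of trees whose accuracy is within tol of the best value on the curve. */
      static int smallestAdequateSize(double[] acc, double tol) {
          double best = 0.0;
          for (double a : acc) {
              best = Math.max(best, a);
          }
          for (int t = 0; t < acc.length; t++) {
              if (acc[t] >= best - tol) {
                  return t + 1;
              }
          }
          return acc.length;
      }
  }
  ```
  
  A flat tail on this curve suggests that the later trees add little. Whether a smaller forest obtained elsewhere keeps exactly the first trees is not stated in this page, so the curve is best treated as a diagnostic.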
- prune
  
  public RandomForest prune(DataFrame test)
  
  Returns a new random forest by reduced error pruning.
  
  Parameters:
  
  test - the test data set to evaluate the errors of nodes.
  
  Returns:
  
  a new pruned random forest.
