randomForest

fun randomForest(formula: Formula, data: DataFrame, ntrees: Int = 500, mtry: Int = 0, splitRule: SplitRule = SplitRule.GINI, maxDepth: Int = 20, maxNodes: Int = 500, nodeSize: Int = 1, subsample: Double = 1.0, classWeight: IntArray? = null, seeds: LongStream? = null): RandomForest

Random forest for classification. Random forest is an ensemble classifier that consists of many decision trees and outputs the majority vote of the individual trees. The method combines the bagging idea with random selection of features.
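The majority vote described above can be sketched as follows. This is only an illustration, not the library's implementation: majorityVote is a hypothetical helper, class labels are assumed to be integers in 0 until numClasses, and breaking ties in favor of the smallest label is an assumption of this sketch.

```kotlin
// Hypothetical helper (not part of the documented API): returns the class
// label predicted by the largest number of trees.
// Assumptions: labels lie in 0 until numClasses; ties go to the smallest label.
fun majorityVote(treePredictions: IntArray, numClasses: Int): Int {
    require(treePredictions.isNotEmpty()) { "need at least one tree prediction" }
    val counts = IntArray(numClasses)
    for (label in treePredictions) counts[label]++

    // Scan for the label with the highest count; strict '>' keeps the
    // smallest label when two counts are equal.
    var best = 0
    for (k in 1 until numClasses) {
        if (counts[k] > counts[best]) best = k
    }
    return best
}
```

For example, five trees voting 0, 1, 1, 2, 1 would give class 1 under this scheme.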

Each tree is constructed using the following algorithm:

  1. If the number of cases in the training set is N, randomly sample N cases with replacement from the original data. This sample will be the training set for growing the tree.

  2. If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M, and the best split on these m is used to split the node. The value of m is held constant while the forest is grown.

  3. Each tree is grown to the largest extent possible. There is no pruning.
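Steps 1 and 2 above can be sketched in a few lines. The sketch below is an illustration under stated assumptions rather than the library's code: bootstrapIndices and randomFeatureSubset are hypothetical names, and cases and variables are represented only by their integer indices.

```kotlin
import kotlin.random.Random

// Step 1 (hypothetical helper): draw n case indices with replacement from
// 0 until n. Some indices may repeat and others may be absent.
fun bootstrapIndices(n: Int, rng: Random): IntArray =
    IntArray(n) { rng.nextInt(n) }

// Step 2 (hypothetical helper): pick m distinct variable indices out of
// numFeatures, as would be done at each node before searching for a split.
fun randomFeatureSubset(numFeatures: Int, m: Int, rng: Random): List<Int> {
    require(m in 1..numFeatures) { "m must be between 1 and numFeatures" }
    return (0 until numFeatures).shuffled(rng).take(m)
}
```

Passing a seeded Random, such as Random(42), makes both draws reproducible from run to run.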

The advantages of random forest are:

  • For many data sets, it produces a highly accurate classifier.

  • It runs efficiently on large data sets.

  • It can handle thousands of input variables without variable deletion.

  • It gives estimates of what variables are important in the classification.

  • It generates an internal unbiased estimate of the generalization error as the forest building progresses.

  • It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.

The disadvantages are:

  • Random forests are prone to over-fitting for some datasets. This is even more pronounced on noisy data.

  • For data that include categorical variables with different numbers of levels, random forests are biased in favor of the attributes with more levels. Therefore, the variable importance scores from random forest are not reliable for this type of data.

Return

Random forest classification model.

Parameters

formula

a symbolic description of the model to be fitted.

data

the data frame of the explanatory and response variables.

ntrees

the number of trees.

mtry

the number of randomly selected features used to determine the decision at a node of the tree. floor(sqrt(dim)) generally seems to give good performance, where dim is the number of variables.

maxDepth

the maximum depth of the tree.

maxNodes

the maximum number of leaf nodes in the tree.

nodeSize

the minimum size of leaf nodes.

subsample

the sampling rate for training each tree. A value of 1.0 means sampling with replacement; a value below 1.0 means sampling without replacement.

splitRule

the decision tree node split rule.
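The mtry heuristic and the subsample semantics described above can be illustrated with a short sketch. Both functions are hypothetical and are not part of the documented API. Truncating n * subsample to an integer sample size is an assumption of this sketch; the library may round differently.

```kotlin
import kotlin.math.floor
import kotlin.math.sqrt
import kotlin.random.Random

// Hypothetical helper: the floor(sqrt(dim)) heuristic mentioned for mtry.
fun defaultMtry(dim: Int): Int {
    require(dim > 0) { "dim must be positive" }
    return floor(sqrt(dim.toDouble())).toInt()
}

// Hypothetical helper: choose training case indices for one tree.
// subsample == 1.0 -> n draws with replacement (indices may repeat).
// subsample <  1.0 -> a fraction of the cases without replacement.
fun sampleIndices(n: Int, subsample: Double, rng: Random): IntArray {
    require(subsample > 0.0 && subsample <= 1.0) { "subsample must be in (0, 1]" }
    return if (subsample == 1.0) {
        IntArray(n) { rng.nextInt(n) }
    } else {
        // Assumption: the sample size is n * subsample, truncated.
        val size = (n * subsample).toInt()
        (0 until n).shuffled(rng).take(size).toIntArray()
    }
}
```

With 16 variables the heuristic gives mtry = 4. With 10 cases and subsample = 0.5, this sketch would pick 5 distinct cases per tree.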