smile.classification.DecisionTree

All Implemented Interfaces: Serializable, ToDoubleFunction<Tuple>, ToIntFunction<Tuple>, Classifier<Tuple>, DataFrameClassifier, SHAP<Tuple>

public class DecisionTree extends CART implements Classifier<Tuple>, DataFrameClassifier

Decision tree. A classification or regression tree is learned by splitting the training set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion stops when all samples in the subset at a node have the same value of the target variable, or when splitting no longer adds value to the predictions.

The algorithms used to construct decision trees usually work top-down, choosing at each step the variable that best splits the set of items. "Best" is defined by how well the variable splits the set into homogeneous subsets that have the same value of the target variable. Different algorithms use different formulae for measuring "best".

Gini impurity, used by the CART algorithm, measures how often a randomly chosen element from the set would be incorrectly labeled if it were labeled randomly according to the distribution of labels in the subset. It can be computed by summing, over all classes, the probability of an item of that class being chosen times the probability of miscategorizing it. It reaches its minimum (zero) when all cases in the node fall into a single target category.

Information gain is another popular measure, used by the ID3, C4.5 and C5.0 algorithms. It is based on the concept of entropy from information theory. For categorical variables with different numbers of levels, however, information gain is biased in favor of attributes with more levels. One may instead employ the information gain ratio, which corrects for this bias.
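The measures described above can be sketched in a few lines of plain Java. This is an illustration only, not Smile's implementation: the class name `ImpuritySketch` and its methods are hypothetical, and each node is assumed to be summarized by an array of per-class sample counts. The gain ratio follows the usual C4.5 definition, information gain divided by the entropy of the child sizes.

```java
import java.util.Arrays;

/** Illustrative impurity measures over per-class sample counts (hypothetical helper, not Smile code). */
final class ImpuritySketch {
    /** Gini impurity: sum over classes of p * (1 - p), which equals 1 - sum of p^2. */
    static double gini(int[] counts) {
        int n = Arrays.stream(counts).sum();
        if (n == 0) return 0.0;
        double impurity = 0.0;
        for (int c : counts) {
            double p = (double) c / n;
            impurity += p * (1.0 - p);
        }
        return impurity;
    }

    /** Shannon entropy in bits; empty classes contribute nothing. */
    static double entropy(int[] counts) {
        int n = Arrays.stream(counts).sum();
        if (n == 0) return 0.0;
        double h = 0.0;
        for (int c : counts) {
            if (c > 0) {
                double p = (double) c / n;
                h -= p * Math.log(p) / Math.log(2);
            }
        }
        return h;
    }

    /** Parent entropy minus the size-weighted entropy of the children. */
    static double informationGain(int[] parent, int[][] children) {
        int n = Arrays.stream(parent).sum();
        double weighted = 0.0;
        for (int[] child : children) {
            weighted += (double) Arrays.stream(child).sum() / n * entropy(child);
        }
        return entropy(parent) - weighted;
    }

    /** Information gain divided by the entropy of the child sizes (split information). */
    static double gainRatio(int[] parent, int[][] children) {
        int[] sizes = new int[children.length];
        for (int i = 0; i < children.length; i++) {
            sizes[i] = Arrays.stream(children[i]).sum();
        }
        double splitInfo = entropy(sizes);
        return splitInfo == 0.0 ? 0.0 : informationGain(parent, children) / splitInfo;
    }

    public static void main(String[] args) {
        int[] parent = {5, 5};
        int[][] twoWay = {{5, 0}, {0, 5}};
        System.out.println("gini(parent) = " + gini(parent));
        System.out.println("gain ratio (2 levels) = " + gainRatio(parent, twoWay));
    }
}
```

A two-level attribute and a ten-level attribute that both separate the classes perfectly have the same information gain, but the ten-level split has a larger split information and therefore a lower gain ratio, which is how the ratio penalizes attributes with many levels.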

Classification and regression tree techniques have a number of advantages over many alternative techniques.

- Simple to understand and interpret: In most cases, the interpretation of results summarized in a tree is very simple. This simplicity is useful not only for rapid classification of new observations, but can also often yield a much simpler "model" for explaining why observations are classified or predicted in a particular manner.
- Able to handle both numerical and categorical data: Other techniques are usually specialized in analyzing datasets that have only one type of variable.
- Tree methods are nonparametric and nonlinear: The final results of using tree methods for classification or regression can be summarized in a series of (usually few) logical if-then conditions (tree nodes). Therefore, there is no implicit assumption that the underlying relationships between the predictor variables and the dependent variable are linear, follow some specific non-linear link function, or are even monotonic in nature. Thus, tree methods are particularly well suited for data mining tasks, where there is often little a priori knowledge and no coherent set of theories or predictions regarding which variables are related and how. In those types of data analytics, tree methods can often reveal simple relationships between just a few variables that could easily have gone unnoticed using other analytic techniques.

One major problem with classification and regression trees is their high variance. Often a small change in the data can result in a very different series of splits, making interpretation somewhat precarious. In addition, decision-tree learners can create overly complex trees that overfit the training data. Mechanisms such as pruning are necessary to avoid this problem. Another limitation of trees is the lack of smoothness of the prediction surface.

Some techniques such as bagging, boosting, and random forest use more than one decision tree for their analysis.


Nested Class Summary

Nested Classes

| Modifier and Type | Class | Description |
| --- | --- | --- |
| static final record | DecisionTree.Options | Decision tree hyperparameters. |

Nested classes/interfaces inherited from interface Classifier
Classifier.Trainer<T,M>

Nested classes/interfaces inherited from interface DataFrameClassifier
DataFrameClassifier.Trainer<M>

Field Summary

Fields inherited from class CART
formula, importance, index, maxDepth, maxNodes, mtry, nodeSize, order, response, root, samples, schema, x

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| DecisionTree(DataFrame x, int[] y, StructField response, int k, SplitRule rule, int maxDepth, int maxNodes, int nodeSize, int mtry, int[] samples, int[][] order) | Constructor. Fits a classification tree for AdaBoost and Random Forest. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| int[] | classes() | Returns the class labels. |
| protected Optional<Split> | findBestSplit(LeafNode leaf, int j, double impurity, int lo, int hi) | Finds the best split for the given column. |
| static DecisionTree | fit(Formula formula, DataFrame data) | Fits a classification tree. |
| static DecisionTree | fit(Formula formula, DataFrame data, DecisionTree.Options options) | Fits a classification tree. |
| Formula | formula() | Returns the formula associated with the model, or null if the tree is part of an ensemble algorithm. |
| protected double | impurity(LeafNode node) | Returns the impurity of node. |
| boolean | isSoft() | Returns true if this is a soft classifier that can estimate the a posteriori probabilities of classification. |
| protected LeafNode | newNode(int[] nodeSamples) | Creates a new leaf node. |
| int | numClasses() | Returns the number of classes. |
| int | predict(Tuple x) | Predicts the class label of an instance. |
| int | predict(Tuple x, double[] posteriori) | Predicts the class label of an instance and also calculates the a posteriori probabilities. |
| DecisionTree | prune(DataFrame test) | Returns a new decision tree by reduced error pruning. |
| StructType | schema() | Returns the predictor schema. |

Methods inherited from class CART
clear, dot, findBestSplit, importance, order, predictors, root, shap, shap, size, split, toString

Methods inherited from class Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Methods inherited from interface Classifier
applyAsDouble, applyAsInt, isOnline, predict, predict, predict, predict, predict, predict, score, update, update, update

Methods inherited from interface DataFrameClassifier
predict, predict

Methods inherited from interface SHAP
shap

Constructor Details
- DecisionTree
  
  public DecisionTree(DataFrame x, int[] y, StructField response, int k, SplitRule rule, int maxDepth, int maxNodes, int nodeSize, int mtry, int[] samples, int[][] order)
  
  Constructor. Fits a classification tree for AdaBoost and Random Forest.
  
  Parameters:
  
  x - the data frame of the explanatory variables.
  
  y - the response variable.
  
  response - the metadata of the response variable.
  
  k - the number of classes.
  
  rule - the splitting rule.
  
  maxDepth - the maximum depth of the tree.
  
  maxNodes - the maximum number of leaf nodes in the tree.
  
  nodeSize - the minimum size of leaf nodes.
  
  mtry - the number of input variables to pick to split on at each node. sqrt(p) generally gives good performance, where p is the number of variables.
  
  samples - the sample set of instances for stochastic learning. samples[i] is the number of times instance i is sampled.
  
  order - the index of training values in ascending order. Note that only numeric attributes need to be sorted.
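To make the `samples` and `order` parameters more concrete, here is a minimal sketch of how such arrays could be produced. It assumes that `samples` comes from bootstrap sampling (one common form of stochastic learning) and that each row of `order` holds the row indices of one numeric column sorted by ascending value; both are readings of the parameter descriptions above, not confirmed details of how Smile builds these arrays. The class `SamplingSketch` and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

/** Hypothetical helpers illustrating the shape of the samples and order arrays. */
final class SamplingSketch {
    /** Draws n instances with replacement; samples[i] counts how often instance i was drawn. */
    static int[] bootstrapCounts(int n, long seed) {
        Random rng = new Random(seed);
        int[] samples = new int[n];
        for (int t = 0; t < n; t++) {
            samples[rng.nextInt(n)]++;
        }
        return samples;
    }

    /** Returns the row indices that sort one numeric column in ascending order. */
    static int[] argsort(double[] column) {
        Integer[] idx = new Integer[column.length];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = i;
        }
        Arrays.sort(idx, Comparator.comparingDouble(i -> column[i]));
        return Arrays.stream(idx).mapToInt(Integer::intValue).toArray();
    }

    public static void main(String[] args) {
        double[] column = {3.0, 1.0, 2.0};
        System.out.println("order for column = " + Arrays.toString(argsort(column)));
        System.out.println("samples = " + Arrays.toString(bootstrapCounts(5, 42L)));
    }
}
```

Presorting each numeric column once lets a tree learner scan candidate thresholds in sorted order at every node without sorting again, which is the usual motivation for passing such an index in.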
Method Details
- impurity
  
  protected double impurity(LeafNode node)
  
  Description copied from class: CART
  
  Returns the impurity of node.
  
  Specified by:
  
  impurity in class CART
  
  Parameters:
  
  node - the node to calculate the impurity.
  
  Returns:
  
  the impurity of node.
- newNode
  
  protected LeafNode newNode(int[] nodeSamples)
  
  Description copied from class: CART
  
  Creates a new leaf node.
  
  Specified by:
  
  newNode in class CART
  
  Parameters:
  
  nodeSamples - the samples belonging to this node.
  
  Returns:
  
  the new leaf node.
- findBestSplit
  
  protected Optional<Split> findBestSplit(LeafNode leaf, int j, double impurity, int lo, int hi)
  
  Description copied from class: CART
  
  Finds the best split for the given column.
  
  Specified by:
  
  findBestSplit in class CART
  
  Parameters:
  
  leaf - the node to split.
  
  j - the column to split on.
  
  impurity - the impurity of node.
  
  lo - the lower bound of sample index in the node.
  
  hi - the upper bound of sample index in the node.
  
  Returns:
  
  the best split.
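As a rough illustration of what finding the best split on one numeric column involves, the sketch below scans the sorted values of a column and keeps the threshold with the largest reduction in Gini impurity. It is a simplified stand-in under stated assumptions, not the code behind `findBestSplit`: the class `SplitSketch`, the method `bestThreshold`, and the choice of the midpoint between adjacent values as threshold are all hypothetical, and constraints such as `nodeSize` are ignored.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.OptionalDouble;

/** Hypothetical single-column split search by Gini reduction (not Smile code). */
final class SplitSketch {
    /** Gini impurity of a node holding n samples with the given class counts. */
    static double gini(int[] counts, int n) {
        double sumSquares = 0.0;
        for (int c : counts) {
            double p = (double) c / n;
            sumSquares += p * p;
        }
        return 1.0 - sumSquares;
    }

    /** Returns the threshold with the largest impurity reduction, or empty if no split helps. */
    static OptionalDouble bestThreshold(double[] x, int[] y, int k) {
        int n = x.length;
        if (n < 2) return OptionalDouble.empty();

        // Visit samples in ascending order of the column value.
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(i -> x[i]));

        int[] left = new int[k];
        int[] right = new int[k];
        for (int label : y) right[label]++;
        double parent = gini(right, n);

        double bestGain = 0.0;
        double best = Double.NaN;
        for (int i = 0; i < n - 1; i++) {
            // Move one sample from the right child to the left child.
            int s = order[i];
            left[y[s]]++;
            right[y[s]]--;
            // Equal neighboring values cannot be separated by a threshold.
            if (x[order[i]] == x[order[i + 1]]) continue;

            int nl = i + 1;
            int nr = n - nl;
            double gain = parent - (nl * gini(left, nl) + nr * gini(right, nr)) / n;
            if (gain > bestGain) {
                bestGain = gain;
                best = (x[order[i]] + x[order[i + 1]]) / 2.0;
            }
        }
        return Double.isNaN(best) ? OptionalDouble.empty() : OptionalDouble.of(best);
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 10, 11, 12};
        int[] y = {0, 0, 0, 1, 1, 1};
        System.out.println("best threshold = " + bestThreshold(x, y, 2));
    }
}
```

Returning an empty optional when no threshold reduces the impurity mirrors the `Optional<Split>` return type above: a pure node has nothing to gain from any split.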
- fit
  
  public static DecisionTree fit(Formula formula, DataFrame data)
  
  Fits a classification tree.
  
  Parameters:
  
  formula - a symbolic description of the model to be fitted.
  
  data - the data frame of the explanatory and response variables.
  
  Returns:
  
  the model.
- fit
  
  public static DecisionTree fit(Formula formula, DataFrame data, DecisionTree.Options options)
  
  Fits a classification tree.
  
  Parameters:
  
  formula - a symbolic description of the model to be fitted.
  
  data - the data frame of the explanatory and response variables.
  
  options - the hyperparameters.
  
  Returns:
  
  the model.
- numClasses
  
  public int numClasses()
  
  Description copied from interface: Classifier
  
  Returns the number of classes.
  
  Specified by:
  
  numClasses in interface Classifier<Tuple>
  
  Returns:
  
  the number of classes.
- classes
  
  public int[] classes()
  
  Description copied from interface: Classifier
  
  Returns the class labels.
  
  Specified by:
  
  classes in interface Classifier<Tuple>
  
  Returns:
  
  the class labels.
- predict
  
  public int predict(Tuple x)
  
  Description copied from interface: Classifier
  
  Predicts the class label of an instance.
  
  Specified by:
  
  predict in interface Classifier<Tuple>
  
  Parameters:
  
  x - the instance to be classified.
  
  Returns:
  
  the predicted class label.
- isSoft
  
  public boolean isSoft()
  
  Description copied from interface: Classifier
  
  Returns true if this is a soft classifier that can estimate the a posteriori probabilities of classification.
  
  Specified by:
  
  isSoft in interface Classifier<Tuple>
  
  Returns:
  
  true if soft classifier.
- predict
  
  public int predict(Tuple x, double[] posteriori)
  
  Predicts the class label of an instance and also calculates the a posteriori probabilities. The posteriori estimation is based on the sample distribution in the leaf node. It is not accurate when used with a single tree. It is mainly used by RandomForest in an ensemble.
  
  Specified by:
  
  predict in interface Classifier<Tuple>
  
  Parameters:
  
  x - an instance to be classified.
  
  posteriori - on output, the a posteriori probabilities.
  
  Returns:
  
  the predicted class label.
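The remark that the posteriori estimate is based on the sample distribution in the leaf node can be illustrated with a small sketch. It assumes the simplest possible estimate, the relative class frequencies of the training samples in the leaf, with no smoothing; whether Smile applies any smoothing is not stated above. The class `LeafPosteriorSketch` and its method are hypothetical.

```java
/** Hypothetical leaf-level prediction from class counts (not Smile code). */
final class LeafPosteriorSketch {
    /**
     * Fills posteriori with the relative class frequencies in the leaf
     * and returns the index of the most frequent class.
     */
    static int predict(int[] leafCounts, double[] posteriori) {
        int n = 0;
        for (int c : leafCounts) n += c;
        if (n == 0) throw new IllegalArgumentException("leaf has no samples");

        int best = 0;
        for (int k = 0; k < leafCounts.length; k++) {
            posteriori[k] = (double) leafCounts[k] / n;
            if (leafCounts[k] > leafCounts[best]) best = k;
        }
        return best;
    }

    public static void main(String[] args) {
        double[] posteriori = new double[2];
        int label = predict(new int[]{1, 3}, posteriori);
        System.out.println("label = " + label + ", posteriori = " + java.util.Arrays.toString(posteriori));
    }
}
```

Because a leaf often holds only a handful of samples, such frequencies take few distinct values and are frequently exactly 0 or 1, which is why the estimate is of limited use for a single tree and becomes more informative when averaged over many trees.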
- formula
  
  public Formula formula()
  
  Returns the formula associated with the model, or null if the tree is part of an ensemble algorithm.
  
  Specified by:
  
  formula in interface DataFrameClassifier
  
  Returns:
  
  the formula associated with the model.
- schema
  
  public StructType schema()
  
  Description copied from interface: DataFrameClassifier
  
  Returns the predictor schema.
  
  Specified by:
  
  schema in interface DataFrameClassifier
  
  Returns:
  
  the predictor schema.
- prune
  
  public DecisionTree prune(DataFrame test)
  
  Returns a new decision tree by reduced error pruning.
  
  Parameters:
  
  test - the test data set to evaluate the errors of nodes.
  
  Returns:
  
  a new pruned tree.
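For readers unfamiliar with reduced error pruning, the following self-contained sketch shows the general technique on a toy binary tree. It is not Smile's implementation: the `Node` class, the `<=` split convention, and the tie-breaking rule (prefer the leaf when errors are equal) are assumptions made for illustration. Each internal node stores the majority training label it would predict if collapsed into a leaf.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reduced error pruning on a toy binary tree (not Smile code). */
final class PruneSketch {
    /** A node is a leaf when it has no children; label is its majority training class. */
    static final class Node {
        final int feature;
        final double threshold;
        final int label;
        final Node left, right;

        Node(int label) {
            this(-1, Double.NaN, label, null, null);
        }

        Node(int feature, double threshold, int label, Node left, Node right) {
            this.feature = feature;
            this.threshold = threshold;
            this.label = label;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null;
        }
    }

    static int predict(Node node, double[] x) {
        while (!node.isLeaf()) {
            node = x[node.feature] <= node.threshold ? node.left : node.right;
        }
        return node.label;
    }

    /** Returns a pruned copy; the input tree is left untouched. */
    static Node prune(Node node, List<double[]> xs, List<Integer> ys) {
        if (node.isLeaf()) return new Node(node.label);

        // Route the test samples that reach this node to its children.
        List<double[]> lx = new ArrayList<>(), rx = new ArrayList<>();
        List<Integer> ly = new ArrayList<>(), ry = new ArrayList<>();
        for (int i = 0; i < xs.size(); i++) {
            boolean goLeft = xs.get(i)[node.feature] <= node.threshold;
            (goLeft ? lx : rx).add(xs.get(i));
            (goLeft ? ly : ry).add(ys.get(i));
        }

        // Prune bottom-up, then compare the pruned subtree against a single leaf.
        Node subtree = new Node(node.feature, node.threshold, node.label,
                prune(node.left, lx, ly), prune(node.right, rx, ry));
        int subtreeErrors = 0, leafErrors = 0;
        for (int i = 0; i < xs.size(); i++) {
            if (predict(subtree, xs.get(i)) != ys.get(i)) subtreeErrors++;
            if (node.label != ys.get(i)) leafErrors++;
        }
        return leafErrors <= subtreeErrors ? new Node(node.label) : subtree;
    }
}
```

The test set passed to the pruning step should be held out from training; pruning against the training data itself would rarely collapse anything, since the tree was grown to fit that data.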
