public class GradientTreeBoost extends java.lang.Object implements SoftClassifier<Tuple>, DataFrameClassifier, SHAP<Tuple>
Generic gradient boosting at the t-th step fits a regression tree to pseudo-residuals. Let J be the number of its leaves. The tree partitions the input space into J disjoint regions and predicts a constant value in each region. The parameter J controls the maximum allowed level of interaction between variables in the model. With J = 2 (decision stumps), no interaction between variables is allowed. With J = 3 the model may include effects of the interaction between up to two variables, and so on. Hastie et al. comment that typically 4 ≤ J ≤ 8 works well for boosting and that results are fairly insensitive to the choice of J in this range; J = 2 is insufficient for many applications, and J > 10 is unlikely to be required.
Fitting the training set too closely can degrade the model's generalization ability. Several so-called regularization techniques reduce this overfitting effect by constraining the fitting procedure. One natural regularization parameter is the number of gradient boosting iterations T (i.e. the number of trees in the model when the base learner is a decision tree). Increasing T reduces the error on the training set, but setting it too high may lead to overfitting. An optimal value of T is often selected by monitoring prediction error on a separate validation data set.
Another regularization approach is shrinkage, which scales the update term by a parameter η (called the "learning rate"). Empirically, small learning rates (such as η < 0.1) have been found to yield dramatic improvements in the model's generalization ability over gradient boosting without shrinkage (η = 1). However, this comes at the price of increased computational time during both training and prediction: a lower learning rate requires more iterations.
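The effect of shrinkage can be illustrated with the simplest possible base learner. The sketch below (illustrative only; the class and method names are made up, and each "tree" is just the mean pseudo-residual under squared loss) applies the update F_m = F_{m-1} + η·h_m and shows why a small η needs many more iterations:

```java
public class ShrinkageDemo {
    // Boost a constant model for squared loss: at each iteration the base
    // learner h_m is simply the mean pseudo-residual y_i - F(x_i), and the
    // update is shrunk by eta (the learning rate).
    public static double boost(double[] y, double eta, int iterations) {
        double f = 0.0; // current (constant) model prediction
        for (int m = 0; m < iterations; m++) {
            double residualMean = 0.0;
            for (double yi : y) residualMean += (yi - f);
            residualMean /= y.length;
            f += eta * residualMean; // shrinkage: scale the update by eta
        }
        return f;
    }

    public static void main(String[] args) {
        double[] y = {1.0, 2.0, 3.0};
        // With eta = 1 the constant learner reaches the target mean in one
        // step; with eta = 0.1 it needs many iterations to get close.
        System.out.println(boost(y, 1.0, 1));    // prints 2.0
        System.out.println(boost(y, 0.1, 1));    // prints 0.2
        System.out.println(boost(y, 0.1, 100));  // close to 2.0
    }
}
```

The same trade-off carries over to real regression-tree base learners: a smaller η generalizes better but requires a larger number of iterations T.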
Soon after the introduction of gradient boosting, Friedman proposed a minor modification to the algorithm, motivated by Breiman's bagging method. Specifically, he proposed that at each iteration of the algorithm, a base learner should be fit on a subsample of the training set drawn at random without replacement. Friedman observed a substantial improvement in gradient boosting's accuracy with this modification.
The subsample size is some constant fraction f of the size of the training set. When f = 1, the algorithm is deterministic and identical to the one described above. Smaller values of f introduce randomness into the algorithm and help prevent overfitting, acting as a kind of regularization. The algorithm also becomes faster, because regression trees have to be fit to smaller datasets at each iteration. Typically, f is set to 0.5, meaning that half of the training set is used to build each base learner.
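The subsampling step itself is straightforward. A minimal sketch (not Smile's implementation; the names here are ours) of drawing a fraction f of the training indices without replacement:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SubsampleDemo {
    // Draw floor(n * f) distinct indices without replacement, as stochastic
    // gradient boosting does before fitting each base learner.
    public static int[] subsample(int n, double f, Random rng) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        Collections.shuffle(indices, rng); // random permutation of 0..n-1
        int m = (int) (n * f);
        int[] out = new int[m];
        for (int i = 0; i < m; i++) out[i] = indices.get(i);
        return out;
    }

    public static void main(String[] args) {
        // f = 0.5 on 10 instances yields 5 distinct indices.
        int[] sample = subsample(10, 0.5, new Random(42));
        System.out.println(sample.length); // prints 5
    }
}
```

The indices left out of a given draw are exactly the observations available for the out-of-bag estimate discussed below.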
Also, as in bagging, subsampling allows one to define an out-of-bag estimate of the prediction performance improvement by evaluating predictions on those observations that were not used in building the next base learner. Out-of-bag estimates help avoid the need for an independent validation dataset, but they often underestimate the actual performance improvement and the optimal number of iterations.
Gradient tree boosting implementations often also regularize by limiting the minimum number of observations in the trees' terminal nodes: the tree-building process ignores any split that would lead to nodes containing fewer than this number of training set instances. Imposing this limit helps to reduce the variance of predictions at the leaves.
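The regularization knobs above map directly onto the parameters of the fit method documented below. A hypothetical end-to-end call might look like the following; the data loading, file name, and "class" column name are assumptions for illustration, not part of the documented API:

```java
import smile.data.DataFrame;
import smile.data.formula.Formula;
import smile.classification.GradientTreeBoost;
import smile.io.Read;

public class GbmExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical training data; the file and label column are made up.
        DataFrame train = Read.csv("train.csv");

        GradientTreeBoost model = GradientTreeBoost.fit(
                Formula.lhs("class"), // response variable on the left-hand side
                train,
                300,   // ntrees: the number of boosting iterations T
                20,    // maxDepth: maximum depth of each tree
                6,     // maxNodes: J leaves, bounding variable interactions
                5,     // nodeSize: minimum instances per terminal node
                0.05,  // shrinkage: learning rate eta in (0, 1]
                0.7);  // subsample: fraction f for stochastic boosting

        double[] importance = model.importance();

        // If error on a validation set suggests overfitting past some
        // iteration, the ensemble can be cut back without refitting:
        model.trim(200);
    }
}
```

This is a sketch against the signatures listed in this page; consult the fit overloads below for the exact parameter semantics.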
Constructor and Description 

GradientTreeBoost(Formula formula,
RegressionTree[][] forest,
double shrinkage,
double[] importance)
Constructor of a multiclass model.

GradientTreeBoost(Formula formula,
RegressionTree[][] forest,
double shrinkage,
double[] importance,
IntSet labels)
Constructor of a multiclass model.

GradientTreeBoost(Formula formula,
RegressionTree[] trees,
double b,
double shrinkage,
double[] importance)
Constructor of a binary classification model.

GradientTreeBoost(Formula formula,
RegressionTree[] trees,
double b,
double shrinkage,
double[] importance,
IntSet labels)
Constructor of a binary classification model.

Modifier and Type  Method and Description 

static GradientTreeBoost 
fit(Formula formula,
DataFrame data)
Fits a gradient tree boosting for classification.

static GradientTreeBoost 
fit(Formula formula,
DataFrame data,
int ntrees,
int maxDepth,
int maxNodes,
int nodeSize,
double shrinkage,
double subsample)
Fits a gradient tree boosting for classification.

static GradientTreeBoost 
fit(Formula formula,
DataFrame data,
java.util.Properties prop)
Fits a gradient tree boosting for classification.

Formula 
formula()
Returns the formula associated with the model.

double[] 
importance()
Returns the variable importance.

int 
predict(Tuple x)
Predicts the class label of an instance.

int 
predict(Tuple x,
double[] posteriori)
Predicts the class label of an instance and also calculates the a posteriori
probabilities.

StructType 
schema()
Returns the design matrix schema.

double[] 
shap(DataFrame data)
Returns the average of absolute SHAP values over a data frame.

double[] 
shap(Tuple x)
Returns the SHAP values.

int 
size()
Returns the number of trees in the model.

int[][] 
test(DataFrame data)
Tests the model on a validation dataset.

RegressionTree[] 
trees()
Returns the regression trees.

void 
trim(int ntrees)
Trims the tree model set to a smaller size in case of overfitting.

Methods inherited from class java.lang.Object: clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from implemented interfaces: applyAsDouble, applyAsInt, f, predict, predict
public GradientTreeBoost(Formula formula, RegressionTree[] trees, double b, double shrinkage, double[] importance)
Parameters:
formula - a symbolic description of the model to be fitted.
trees - forest of regression trees.
b - the intercept.
importance - variable importance.

public GradientTreeBoost(Formula formula, RegressionTree[] trees, double b, double shrinkage, double[] importance, IntSet labels)
Parameters:
formula - a symbolic description of the model to be fitted.
trees - forest of regression trees.
b - the intercept.
importance - variable importance.
labels - class labels.

public GradientTreeBoost(Formula formula, RegressionTree[][] forest, double shrinkage, double[] importance)
Parameters:
formula - a symbolic description of the model to be fitted.
forest - forest of regression trees.
importance - variable importance.

public GradientTreeBoost(Formula formula, RegressionTree[][] forest, double shrinkage, double[] importance, IntSet labels)
Parameters:
formula - a symbolic description of the model to be fitted.
forest - forest of regression trees.
importance - variable importance.
labels - class labels.

public static GradientTreeBoost fit(Formula formula, DataFrame data)
Parameters:
formula - a symbolic description of the model to be fitted.
data - the data frame of the explanatory and response variables.

public static GradientTreeBoost fit(Formula formula, DataFrame data, java.util.Properties prop)
Parameters:
formula - a symbolic description of the model to be fitted.
data - the data frame of the explanatory and response variables.

public static GradientTreeBoost fit(Formula formula, DataFrame data, int ntrees, int maxDepth, int maxNodes, int nodeSize, double shrinkage, double subsample)
Parameters:
formula - a symbolic description of the model to be fitted.
data - the data frame of the explanatory and response variables.
ntrees - the number of iterations (trees).
maxDepth - the maximum depth of the tree.
maxNodes - the maximum number of leaf nodes in the tree.
nodeSize - the number of instances in a node below which the tree will not split; nodeSize = 5 generally gives good results.
shrinkage - the shrinkage parameter in (0, 1] that controls the learning rate of the procedure.
subsample - the sampling fraction for stochastic tree boosting.

public Formula formula()
Specified by: formula in interface DataFrameClassifier

public StructType schema()
Specified by: schema in interface DataFrameClassifier
public double[] importance()

public int size()

public RegressionTree[] trees()

public void trim(int ntrees)
Parameters:
ntrees - the new (smaller) size of the tree model set.

public int predict(Tuple x)
Specified by: predict in interface Classifier<Tuple>
Specified by: predict in interface DataFrameClassifier
Parameters:
x - the instance to be classified.

public int predict(Tuple x, double[] posteriori)
Specified by: predict in interface SoftClassifier<Tuple>
Parameters:
x - an instance to be classified.
posteriori - the array to store a posteriori probabilities on output.

public int[][] test(DataFrame data)
Parameters:
data - the test data set.

public double[] shap(DataFrame data)

public double[] shap(Tuple x)
Specified by: shap in interface SHAP<Tuple>
Returns a vector of length p x k, where p is the number of features and k is the number of classes. The first k elements are the SHAP values of the first feature over the k classes, respectively; the remaining features follow accordingly.
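The flattened p x k layout can be indexed with a simple formula. This self-contained sketch (the helper name and the numbers are made up for illustration) shows where the SHAP value for feature j and class i lives:

```java
public class ShapLayout {
    // The flattened array groups k class values per feature, so the SHAP
    // value for feature j and class i sits at index j * k + i.
    public static double shapValue(double[] shap, int k, int feature, int clazz) {
        return shap[feature * k + clazz];
    }

    public static void main(String[] args) {
        int k = 3; // 3 classes, 2 features (made-up example)
        double[] shap = {0.10, 0.20, 0.30,   // feature 0 over classes 0..2
                         0.40, 0.50, 0.60};  // feature 1 over classes 0..2
        // Feature 1, class 2 -> index 1 * 3 + 2 = 5.
        System.out.println(shapValue(shap, k, 1, 2)); // prints 0.6
    }
}
```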