trait Operators extends AnyRef
High level classification operators.
Value Members

final
def
!=(arg0: Any): Boolean
 Definition Classes
 AnyRef → Any

final
def
##(): Int
 Definition Classes
 AnyRef → Any
 def +(other: String): String
 def ->[B](y: B): (Operators, B)

final
def
==(arg0: Any): Boolean
 Definition Classes
 AnyRef → Any

def adaboost(x: Array[Array[Double]], y: Array[Int], attributes: Array[Attribute] = null, ntrees: Int = 500, maxNodes: Int = 2): AdaBoost
AdaBoost (Adaptive Boosting) classifier with decision trees. In principle, AdaBoost is a meta-algorithm and can be used in conjunction with many other learning algorithms to improve their performance. In practice, AdaBoost with decision trees is probably the most popular combination. AdaBoost is adaptive in the sense that subsequent classifiers are tweaked in favor of those instances misclassified by previous classifiers. AdaBoost is sensitive to noisy data and outliers. However, in some problems it can be less susceptible to overfitting than most learning algorithms.
AdaBoost calls a weak classifier repeatedly in a series of T rounds, producing T classifiers in total. On each call, a distribution of weights is updated that indicates the importance of each example in the data set for the classification. On each round, the weights of incorrectly classified examples are increased (or, alternatively, the weights of correctly classified examples are decreased), so that the new classifier focuses more on those examples.
The basic AdaBoost algorithm handles only binary classification problems. For multi-class classification, a common approach is to reduce the problem to multiple two-class problems. This implementation is a multi-class AdaBoost without such reductions.
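The reweighting step described above can be sketched for the basic binary case. This is an illustrative sketch of the textbook update, not the multi-class implementation behind `adaboost`; the object name `AdaBoostWeightSketch` and the `correct` flags are assumptions made for the example.

```scala
object AdaBoostWeightSketch {
  /** One reweighting round for binary AdaBoost (hypothetical helper).
    * `correct(i)` says whether the current weak classifier got example i right.
    * Returns the new normalized weights and the classifier's vote alpha. */
  def reweight(w: Array[Double], correct: Array[Boolean]): (Array[Double], Double) = {
    require(w.length == correct.length, "one flag per weight")
    // Weighted error of the weak classifier under the current distribution.
    val err = w.indices.filterNot(i => correct(i)).map(i => w(i)).sum / w.sum
    require(err > 0.0 && err < 0.5, "weak learner must beat chance and must not be perfect")
    // Weight of this classifier in the final vote.
    val alpha = 0.5 * math.log((1.0 - err) / err)
    // Raise the weights of mistakes, lower the weights of hits, then renormalize.
    val raw = w.indices.map(i => w(i) * math.exp(if (correct(i)) -alpha else alpha)).toArray
    val z = raw.sum
    (raw.map(_ / z), alpha)
  }
}
```

After one round the misclassified examples carry half of the total weight, which is what forces the next weak classifier to focus on them.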
References:
 Yoav Freund, Robert E. Schapire. A Decision-Theoretic Generalization of on-Line Learning and an Application to Boosting, 1995.
 Ji Zhu, Hui Zou, Saharon Rosset and Trevor Hastie. Multi-class AdaBoost, 2009.
 x
the training instances.
 y
the response variable.
 attributes
the attribute properties. If not provided, all attributes are treated as numeric values.
 ntrees
the number of trees.
 maxNodes
the maximum number of leaf nodes in the trees.
 returns
AdaBoost model.

final
def
asInstanceOf[T0]: T0
 Definition Classes
 Any

def cart(x: Array[Array[Double]], y: Array[Int], maxNodes: Int, attributes: Array[Attribute] = null, splitRule: SplitRule = DecisionTree.SplitRule.GINI): DecisionTree
Decision tree. A decision tree can be learned by splitting the training set into subsets based on an attribute value test. This process is repeated on each derived subset in a recursive manner called recursive partitioning. The recursion is completed when all samples in the subset at a node have the same value of the target variable, or when splitting no longer adds value to the predictions.
The algorithms that are used for constructing decision trees usually work top-down by choosing a variable at each step that is the next best variable to use in splitting the set of items. "Best" is defined by how well the variable splits the set into homogeneous subsets that have the same value of the target variable. Different algorithms use different formulae for measuring "best". Used by the CART algorithm, Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset. Gini impurity can be computed by summing the probability of each item being chosen times the probability of a mistake in categorizing that item. It reaches its minimum (zero) when all cases in the node fall into a single target category. Information gain is another popular measure, used by the ID3, C4.5 and C5.0 algorithms. Information gain is based on the concept of entropy used in information theory. For categorical variables with different numbers of levels, however, information gain is biased in favor of those attributes with more levels. Instead, one may employ the information gain ratio, which addresses this drawback of information gain.
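The Gini impurity described above can be computed directly from the labels in a node. The following is a minimal sketch; `GiniSketch` and its method names are hypothetical and are not part of the library.

```scala
object GiniSketch {
  /** Gini impurity of the class labels that fall into one node. */
  def gini(labels: Seq[Int]): Double = {
    require(labels.nonEmpty, "node must contain at least one sample")
    val n = labels.size.toDouble
    // With p_k the fraction of class k: sum_k p_k * (1 - p_k) = 1 - sum_k p_k^2.
    val sumSq = labels.groupBy(identity).values.map { group =>
      val p = group.size / n
      p * p
    }.sum
    1.0 - sumSq
  }

  /** Size-weighted impurity of the two children produced by a candidate split. */
  def splitImpurity(left: Seq[Int], right: Seq[Int]): Double = {
    val n = (left.size + right.size).toDouble
    left.size / n * gini(left) + right.size / n * gini(right)
  }
}
```

A split is better the lower its weighted child impurity is compared with the impurity of the parent node.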
Classification and regression tree techniques have a number of advantages over many alternative techniques:
 Simple to understand and interpret: In most cases, the interpretation of results summarized in a tree is very simple. This simplicity is useful not only for purposes of rapid classification of new observations, but can also often yield a much simpler "model" for explaining why observations are classified or predicted in a particular manner.
 Able to handle both numerical and categorical data: Other techniques are usually specialized in analyzing datasets that have only one type of variable.
 Nonparametric and nonlinear: The final results of using tree methods for classification or regression can be summarized in a series of (usually few) logical if-then conditions (tree nodes). Therefore, there is no implicit assumption that the underlying relationships between the predictor variables and the dependent variable are linear, follow some specific nonlinear link function, or that they are even monotonic in nature. Thus, tree methods are particularly well suited for data mining tasks, where there is often little a priori knowledge and no coherent set of theories or predictions regarding which variables are related and how. In those types of data analytics, tree methods can often reveal simple relationships between just a few variables that could have easily gone unnoticed using other analytic techniques.
One major problem with classification and regression trees is their high variance. Often a small change in the data can result in a very different series of splits, making interpretation somewhat precarious. Besides, decision-tree learners can create over-complex trees that cause overfitting. Mechanisms such as pruning are necessary to avoid this problem. Another limitation of trees is the lack of smoothness of the prediction surface.
Some techniques such as bagging, boosting, and random forest use more than one decision tree for their analysis.
 x
the training instances.
 y
the response variable.
 maxNodes
the maximum number of leaf nodes in the tree.
 attributes
the attribute properties.
 splitRule
the splitting rule.
 returns
Decision tree model.

def
clone(): AnyRef
 Attributes
 protected[java.lang]
 Definition Classes
 AnyRef
 Annotations
 @throws( ... )
 def ensuring(cond: (Operators) ⇒ Boolean, msg: ⇒ Any): Operators
 def ensuring(cond: (Operators) ⇒ Boolean): Operators
 def ensuring(cond: Boolean, msg: ⇒ Any): Operators
 def ensuring(cond: Boolean): Operators

final
def
eq(arg0: AnyRef): Boolean
 Definition Classes
 AnyRef

def
equals(arg0: Any): Boolean
 Definition Classes
 AnyRef → Any

def
finalize(): Unit
 Attributes
 protected[java.lang]
 Definition Classes
 AnyRef
 Annotations
 @throws( classOf[java.lang.Throwable] )

def fisher(x: Array[Array[Double]], y: Array[Int], L: Int = -1, tol: Double = 0.0001): FLD
Fisher's linear discriminant. Fisher defined the separation between two distributions to be the ratio of the variance between the classes to the variance within the classes, which is, in some sense, a measure of the signal-to-noise ratio for the class labeling. FLD finds a linear combination of features which maximizes the separation after the projection. The resulting combination may be used for dimensionality reduction before later classification.
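For two classes already projected onto a single axis, the separation ratio above can be sketched as follows. This is an illustration under simplifying assumptions (one dimension, two classes, population variances); the object `FisherCriterionSketch` is hypothetical and does not reflect how `fisher` is implemented.

```scala
object FisherCriterionSketch {
  private def mean(v: Seq[Double]): Double = v.sum / v.size

  private def variance(v: Seq[Double]): Double = {
    val m = mean(v)
    v.map(x => (x - m) * (x - m)).sum / v.size
  }

  /** Squared distance between the class means divided by the summed within-class variances. */
  def separation(a: Seq[Double], b: Seq[Double]): Double = {
    val between = math.pow(mean(a) - mean(b), 2)
    val within = variance(a) + variance(b)
    between / within
  }
}
```

FLD searches for the projection direction that makes this ratio as large as possible; well separated, tight classes give a large value and overlapping classes a small one.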
The terms Fisher's linear discriminant and LDA are often used interchangeably, although FLD actually describes a slightly different discriminant, which does not make some of the assumptions of LDA such as normally distributed classes or equal class covariances. When the assumptions of LDA are satisfied, FLD is equivalent to LDA.
FLD is also closely related to principal component analysis (PCA), which also looks for linear combinations of variables which best explain the data. As a supervised method, FLD explicitly attempts to model the difference between the classes of data. On the other hand, PCA is an unsupervised method and does not take into account any difference in class.
One complication in applying FLD (and LDA) to real data occurs when the number of variables/features exceeds the number of samples. In this case, the covariance estimates do not have full rank, and so cannot be inverted. This is known as the small sample size problem.
 x
training instances.
 y
training labels in [0, k), where k is the number of classes.
 L
the dimensionality of mapped space. The default value is the number of classes - 1.
 tol
a tolerance to decide if a covariance matrix is singular; it will reject variables whose variance is less than tol^{2}.
 returns
fisher discriminant analysis model.
 def formatted(fmtstr: String): String

def gbm(x: Array[Array[Double]], y: Array[Int], attributes: Array[Attribute] = null, ntrees: Int = 500, maxNodes: Int = 6, shrinkage: Double = 0.05, subsample: Double = 0.7): GradientTreeBoost
Gradient boosted classification trees.
Generic gradient boosting at the t-th step would fit a regression tree to pseudo-residuals. Let J be the number of its leaves. The tree partitions the input space into J disjoint regions and predicts a constant value in each region. The parameter J controls the maximum allowed level of interaction between variables in the model. With J = 2 (decision stumps), no interaction between variables is allowed. With J = 3 the model may include effects of the interaction between up to two variables, and so on. Hastie et al. comment that typically 4 ≤ J ≤ 8 works well for boosting and results are fairly insensitive to the choice of J in this range, J = 2 is insufficient for many applications, and J > 10 is unlikely to be required.
Fitting the training set too closely can lead to degradation of the model's generalization ability. Several so-called regularization techniques reduce this overfitting effect by constraining the fitting procedure. One natural regularization parameter is the number of gradient boosting iterations T (i.e. the number of trees in the model when the base learner is a decision tree). Increasing T reduces the error on the training set, but setting it too high may lead to overfitting. An optimal value of T is often selected by monitoring prediction error on a separate validation data set.
Another regularization approach is shrinkage, which multiplies the update term by a parameter η (called the "learning rate"). Empirically it has been found that using small learning rates (such as η < 0.1) yields dramatic improvements in the model's generalization ability over gradient boosting without shrinking (η = 1). However, it comes at the price of increased computational time during both training and prediction: a lower learning rate requires more iterations.
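The trade-off between the learning rate and the number of iterations can be illustrated with a toy sketch. It assumes an idealized base learner that returns the exact residual at every step, which real regression trees do not; `ShrinkageSketch` and its methods are hypothetical names.

```scala
object ShrinkageSketch {
  /** One boosting step per instance: F_t(x) = F_{t-1}(x) + eta * h_t(x). */
  def update(current: Array[Double], treeOutput: Array[Double], eta: Double): Array[Double] = {
    require(eta > 0.0 && eta <= 1.0, "shrinkage must be in (0, 1]")
    require(current.length == treeOutput.length, "one tree output per instance")
    current.zip(treeOutput).map { case (f, h) => f + eta * h }
  }

  /** Rounds needed until |target - F| < tol when each idealized step fits the exact residual. */
  def roundsUntil(target: Double, eta: Double, tol: Double): Int = {
    require(tol > 0.0, "tolerance must be positive")
    var f = Array(0.0)
    var rounds = 0
    while (math.abs(target - f(0)) >= tol) {
      f = update(f, Array(target - f(0)), eta)
      rounds += 1
    }
    rounds
  }
}
```

With η = 1 the idealized learner fits the target in one round, while η = 0.05 shrinks the residual by only 5% per round and therefore needs many more iterations.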
Soon after the introduction of gradient boosting Friedman proposed a minor modification to the algorithm, motivated by Breiman's bagging method. Specifically, he proposed that at each iteration of the algorithm, a base learner should be fit on a subsample of the training set drawn at random without replacement. Friedman observed a substantial improvement in gradient boosting's accuracy with this modification.
Subsample size is some constant fraction f of the size of the training set. When f = 1, the algorithm is deterministic and identical to the one described above. Smaller values of f introduce randomness into the algorithm and help prevent overfitting, acting as a kind of regularization. The algorithm also becomes faster, because regression trees have to be fit to smaller datasets at each iteration. Typically, f is set to 0.5, meaning that one half of the training set is used to build each base learner.
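Drawing the per-iteration subsample without replacement might look like the sketch below. The helper `SubsampleSketch.subsample` is hypothetical, and rounding f * n to the nearest integer is an assumption made for the example rather than documented behavior.

```scala
import scala.util.Random

object SubsampleSketch {
  /** Indices of a random subsample of about f * n training rows, drawn without replacement. */
  def subsample(n: Int, f: Double, rng: Random): Array[Int] = {
    require(n > 0, "training set must not be empty")
    require(f > 0.0 && f <= 1.0, "sampling fraction must be in (0, 1]")
    val size = math.round(f * n).toInt
    // Shuffling and taking a prefix guarantees that no index appears twice.
    rng.shuffle((0 until n).toVector).take(size).toArray
  }
}
```

Each boosting iteration would call this with a fresh draw, so different trees see different halves of the data when f = 0.5.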
Also, like in bagging, subsampling allows one to define an out-of-bag estimate of the prediction performance improvement by evaluating predictions on those observations which were not used in the building of the next base learner. Out-of-bag estimates help avoid the need for an independent validation dataset, but often underestimate actual performance improvement and the optimal number of iterations.
Gradient tree boosting implementations often also use regularization by limiting the minimum number of observations in trees' terminal nodes. It's used in the tree building process by ignoring any splits that lead to nodes containing fewer than this number of training set instances. Imposing this limit helps to reduce variance in predictions at leaves.
References:
 J. H. Friedman. Greedy Function Approximation: A Gradient Boosting Machine, 1999.
 J. H. Friedman. Stochastic Gradient Boosting, 1999.
 x
the training instances.
 y
the class labels.
 attributes
the attribute properties. If not provided, all attributes are treated as numeric values.
 ntrees
the number of iterations (trees).
 maxNodes
the number of leaves in each tree.
 shrinkage
the shrinkage parameter in (0, 1], which controls the learning rate of the procedure.
 subsample
the sampling fraction for stochastic tree boosting.
 returns
Gradient boosted trees.

final
def
getClass(): Class[_]
 Definition Classes
 AnyRef → Any

def
hashCode(): Int
 Definition Classes
 AnyRef → Any

final
def
isInstanceOf[T0]: Boolean
 Definition Classes
 Any

def knn(x: Array[Array[Double]], y: Array[Int], k: Int): KNN[Array[Double]]
K-nearest neighbor classifier with Euclidean distance as the similarity measure.
 x
training samples.
 y
training labels in [0, c), where c is the number of classes.
 k
the number of neighbors for classification.

def knn[T <: AnyRef](x: Array[T], y: Array[Int], distance: Distance[T], k: Int): KNN[T]
K-nearest neighbor classifier.
 x
training samples.
 y
training labels in [0, c), where c is the number of classes.
 distance
the distance measure for finding nearest neighbors.
 k
the number of neighbors for classification.

def knn[T <: AnyRef](x: KNNSearch[T, T], y: Array[Int], k: Int): KNN[T]
K-nearest neighbor classifier. The k-nearest neighbor algorithm (k-NN) is a method for classifying objects by a majority vote of their neighbors, with the object being assigned to the class most common amongst its k nearest neighbors (k is a positive integer, typically small). k-NN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification.
The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise on the classification, but make boundaries between classes less distinct. A good k can be selected by various heuristic techniques, e.g. cross-validation. In binary problems, it is helpful to choose k to be an odd number as this avoids tied votes.
A drawback of basic majority voting is that classes with more frequent instances tend to dominate the prediction for a new object, as they tend to come up among the k nearest neighbors simply because of their large number. One way to overcome this problem is to weight the classification, taking into account the distance from the test point to each of its k nearest neighbors.
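Both the plain majority vote and the distance-weighted variant can be sketched with a brute-force search. This is an illustration only: `KnnVoteSketch` is a hypothetical object, the 1 / (distance + eps) weighting is one common choice assumed here, and the library's `knn` is not claimed to offer a weighted vote.

```scala
object KnnVoteSketch {
  private def euclidean(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (u, v) => (u - v) * (u - v) }.sum)

  /** Predicts the label of `q` by a vote among its k nearest training points.
    * With `weighted = true`, each neighbor votes with weight 1 / (distance + eps). */
  def predict(x: Array[Array[Double]], y: Array[Int], q: Array[Double],
              k: Int, weighted: Boolean = false): Int = {
    require(x.length == y.length, "one label per training sample")
    require(k >= 1 && k <= x.length, "k must be between 1 and the number of samples")
    val eps = 1e-12 // avoids division by zero when q coincides with a training point
    // Brute force: compute all distances, sort, and keep the k closest.
    val nearest = x.indices.map(i => (euclidean(x(i), q), y(i))).sortBy(_._1).take(k)
    val votes = nearest.groupBy(_._2).map { case (label, ns) =>
      label -> ns.map { case (d, _) => if (weighted) 1.0 / (d + eps) else 1.0 }.sum
    }
    votes.maxBy(_._2)._1
  }
}
```

When a query sits next to a single sample of a rare class but two more distant samples of a frequent class fall among its three neighbors, the plain vote picks the frequent class while the weighted vote picks the nearby rare one.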
Often, the classification accuracy of kNN can be improved significantly if the distance metric is learned with specialized algorithms such as Large Margin Nearest Neighbor or Neighborhood Components Analysis.
Nearest neighbor rules in effect compute the decision boundary in an implicit manner. It is also possible to compute the decision boundary itself explicitly, and to do so in an efficient manner so that the computational complexity is a function of the boundary complexity.
The nearest neighbor algorithm has some strong consistency results. As the amount of data approaches infinity, the algorithm is guaranteed to yield an error rate no worse than twice the Bayes error rate (the minimum achievable error rate given the distribution of the data). kNN is guaranteed to approach the Bayes error rate, for some value of k (where k increases as a function of the number of data points).
 x
k-nearest neighbor search data structure of training instances.
 y
training labels in [0, c), where c is the number of classes.
 k
the number of neighbors for classification.

def lda(x: Array[Array[Double]], y: Array[Int], priori: Array[Double] = null, tol: Double = 0.0001): LDA
Linear discriminant analysis. LDA is based on the Bayes decision theory and assumes that the conditional probability density functions are normally distributed. LDA also makes the simplifying homoscedastic assumption (i.e. that the class covariances are identical) and assumes that the covariances have full rank. With these assumptions, the discriminant function of an input being in a class is purely a function of a linear combination of the independent variables.
LDA is closely related to ANOVA (analysis of variance) and linear regression analysis, which also attempt to express one dependent variable as a linear combination of other features or measurements. In the other two methods, however, the dependent variable is a numerical quantity, while for LDA it is a categorical variable (i.e. the class label). Logistic regression and probit regression are more similar to LDA, as they also explain a categorical variable. These other methods are preferable in applications where it is not reasonable to assume that the independent variables are normally distributed, which is a fundamental assumption of the LDA method.
One complication in applying LDA (and Fisher's discriminant) to real data occurs when the number of variables/features exceeds the number of samples. In this case, the covariance estimates do not have full rank, and so cannot be inverted. This is known as the small sample size problem.
 x
training samples.
 y
training labels in [0, k), where k is the number of classes.
 priori
the priori probability of each class. If null, it will be estimated from the training data.
 tol
a tolerance to decide if a covariance matrix is singular; it will reject variables whose variance is less than tol^{2}.
 returns
linear discriminant analysis model.

def logit(x: Array[Array[Double]], y: Array[Int], lambda: Double = 0.0, tol: Double = 1E-5, maxIter: Int = 500): LogisticRegression
Logistic regression. Logistic regression (logit model) is a generalized linear model used for binomial regression. Logistic regression applies maximum likelihood estimation after transforming the dependent variable into a logit variable. A logit is the natural log of the odds of the dependent variable equaling a certain value or not (usually 1 in binary logistic models, the highest value in multinomial models). In this way, logistic regression estimates the odds of a certain event (value) occurring.
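The relationship between the logit, the odds, and the predicted probability can be sketched as follows. `LogitSketch` and the weight vector `w` and bias `b` are hypothetical names for the example; fitting the weights by maximum likelihood is not shown.

```scala
object LogitSketch {
  /** Log-odds of a probability p in (0, 1). */
  def logit(p: Double): Double = {
    require(p > 0.0 && p < 1.0, "p must be strictly between 0 and 1")
    math.log(p / (1.0 - p))
  }

  /** Inverse of the logit: maps a linear score back to a probability. */
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  /** P(y = 1 | x) for a binary model with weights w and bias b. */
  def probability(w: Array[Double], b: Double, x: Array[Double]): Double = {
    require(w.length == x.length, "one weight per feature")
    sigmoid(b + w.zip(x).map { case (wi, xi) => wi * xi }.sum)
  }
}
```

A linear score of zero corresponds to even odds, that is, a probability of 0.5.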
Goodness-of-fit tests such as the likelihood ratio test are available as indicators of model appropriateness, as is the Wald statistic to test the significance of individual independent variables.
Logistic regression has many analogies to ordinary least squares (OLS) regression. Unlike OLS regression, however, logistic regression does not assume linearity of relationship between the raw values of the independent variables and the dependent, does not require normally distributed variables, does not assume homoscedasticity, and in general has less stringent requirements.
Compared with linear discriminant analysis, logistic regression has several advantages:
 It is more robust: the independent variables don't have to be normally distributed or have equal variance in each group.
 It does not assume a linear relationship between the independent variables and dependent variable.
 It may handle nonlinear effects since one can add explicit interaction and power terms.
However, it requires much more data to achieve stable, meaningful results.
Logistic regression also has strong connections with neural networks and maximum entropy modeling. For example, binary logistic regression is equivalent to a one-layer, single-output neural network with a logistic activation function trained under log loss. Similarly, multinomial logistic regression is equivalent to a one-layer, softmax-output neural network.
Logistic regression estimation also obeys the maximum entropy principle, and thus logistic regression is sometimes called "maximum entropy modeling", and the resulting classifier the "maximum entropy classifier".
 x
training samples.
 y
training labels in [0, k), where k is the number of classes.
 lambda
λ > 0 gives a "regularized" estimate of linear weights which often has superior generalization performance, especially when the dimensionality is high.
 tol
the tolerance for stopping iterations.
 maxIter
the maximum number of iterations.
 returns
Logistic regression model.

def maxent(x: Array[Array[Int]], y: Array[Int], p: Int, lambda: Double = 0.1, tol: Double = 1E-5, maxIter: Int = 500): Maxent
Maximum Entropy Classifier. Maximum entropy is a technique for learning probability distributions from data. In maximum entropy models, the observed data itself is assumed to be the testable information. Maximum entropy models don't assume anything about the probability distribution other than what has been observed and always choose the most uniform distribution subject to the observed constraints.
Basically, maximum entropy classifier is another name for multinomial logistic regression applied to categorical independent variables, which are converted to binary dummy variables. Maximum entropy models are widely used in natural language processing. Here, we provide an implementation which assumes that binary features are stored in a sparse array whose entries are the indices of nonzero features.
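The sparse representation expected for `x` can be illustrated by converting a dense 0/1 vector into its list of active indices. `SparseFeatureSketch` and its methods are hypothetical helpers written for this example, and the per-class weight vector is an assumption about how such a model scores an input.

```scala
object SparseFeatureSketch {
  /** Converts a dense 0/1 feature vector to the indices of its nonzero entries. */
  def toSparse(dense: Array[Int]): Array[Int] =
    dense.indices.filter(i => dense(i) != 0).toArray

  /** Linear score for one class: with binary features the dot product
    * reduces to summing the weights at the active indices. */
  def score(weights: Array[Double], active: Array[Int]): Double = {
    require(active.forall(i => i >= 0 && i < weights.length), "index outside the feature space")
    active.map(i => weights(i)).sum
  }
}
```

Here the length of the dense vector plays the role of `p`, the dimension of the feature space, and each row of `x` would be one such index array.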
References:
 A. L. Berger, S. D. Pietra, and V. J. D. Pietra. A maximum entropy approach to natural language processing. Computational Linguistics 22(1):39-71, 1996.
 x
training samples. Each sample is represented by a set of sparse binary features. The features are stored in an integer array whose entries are the indices of nonzero features.
 y
training labels in [0, k), where k is the number of classes.
 p
the dimension of feature space.
 lambda
λ > 0 gives a "regularized" estimate of linear weights which often has superior generalization performance, especially when the dimensionality is high.
 tol
tolerance for stopping iterations.
 maxIter
maximum number of iterations.
 returns
Maximum entropy model.

def mlp(x: Array[Array[Double]], y: Array[Int], numUnits: Array[Int], error: ErrorFunction, activation: ActivationFunction, epochs: Int = 25, eta: Double = 0.1, alpha: Double = 0.0, lambda: Double = 0.0): NeuralNetwork
Multilayer perceptron neural network. An MLP consists of several layers of nodes, interconnected through weighted acyclic arcs from each preceding layer to the following, without lateral or feedback connections. Each node calculates a transformed weighted linear combination of its inputs (output activations from the preceding layer), with one of the weights acting as a trainable bias connected to a constant input. The transformation, called the activation function, is a bounded non-decreasing (nonlinear) function, such as the sigmoid function (which ranges from 0 to 1). Another popular activation function is the hyperbolic tangent, which has the same shape as the sigmoid function but ranges from -1 to 1. More specialized activation functions include radial basis functions, which are used in RBF networks.
The representational capabilities of an MLP are determined by the range of mappings it may implement through weight variation. Single-layer perceptrons are capable of solving only linearly separable problems. With the sigmoid function as the activation function, the single-layer network is identical to the logistic regression model.
The universal approximation theorem for neural networks states that every continuous function that maps intervals of real numbers to some output interval of real numbers can be approximated arbitrarily closely by a multilayer perceptron with just one hidden layer. This result holds only for restricted classes of activation functions, which are extremely complex and NOT smooth for subtle mathematical reasons. On the other hand, smoothness is important for gradient descent learning. Besides, the proof is not constructive regarding the number of neurons required or the settings of the weights. Therefore, complex systems will have more layers of neurons with some having increased layers of input neurons and output neurons in practice.
The most popular algorithm to train MLPs is backpropagation, which is a gradient descent method. Based on the chain rule, the algorithm propagates the error back through the network and adjusts the weights of each connection in order to reduce the value of the error function by some small amount. For this reason, backpropagation can only be applied to networks with differentiable activation functions.
During error backpropagation, the gradient is usually multiplied by a small number η, called the learning rate, which is carefully selected to ensure that the network converges to a local minimum of the error function fast enough without producing oscillations. One way to avoid oscillation at large η is to make the change in weight dependent on the past weight change by adding a momentum term.
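A gradient step with a momentum term can be sketched on a single weight. The update form below (new change = -η · gradient + α · previous change) is the classical momentum rule and is assumed here for illustration; `MomentumSketch` is a hypothetical object, and the quadratic objective stands in for a network's error function.

```scala
object MomentumSketch {
  /** One update: delta_t = -eta * gradient + alpha * delta_{t-1}.
    * Returns the new weight and delta_t. */
  def step(w: Double, grad: Double, prevDelta: Double, eta: Double, alpha: Double): (Double, Double) = {
    val delta = -eta * grad + alpha * prevDelta
    (w + delta, delta)
  }

  /** Minimizes the toy error f(w) = w * w / 2 (whose gradient is w) starting from w0. */
  def minimize(w0: Double, eta: Double, alpha: Double, steps: Int): Double = {
    var w = w0
    var delta = 0.0
    for (_ <- 0 until steps) {
      val (nextW, nextDelta) = step(w, w, delta, eta, alpha)
      w = nextW
      delta = nextDelta
    }
    w
  }
}
```

In terms of the `mlp` parameters, η corresponds to `eta` and α to `alpha`; setting α = 0 recovers plain gradient descent.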
Although the backpropagation algorithm may perform gradient descent on the total error of all instances in a batch way, the learning rule is often applied to each instance separately in an online or stochastic way. There is empirical indication that the stochastic way results in faster convergence.
In practice, the problem of overfitting has emerged. This arises in convoluted or over-specified systems when the capacity of the network significantly exceeds the needed free parameters. There are two general approaches for avoiding this problem. The first is to use cross-validation and similar techniques to check for the presence of overfitting and to select hyperparameters that minimize the generalization error. The second is to use some form of regularization, which emerges naturally in a Bayesian framework, where the regularization can be performed by selecting a larger prior probability over simpler models; but also in statistical learning theory, where the goal is to minimize over the "empirical risk" and the "structural risk".
For neural networks, the input patterns usually should be scaled/standardized. Commonly, each input variable is scaled into interval [0, 1] or to have mean 0 and standard deviation 1.
For penalty functions and output units, the following natural pairings are recommended:
 linear output units and a least squares penalty function.
 a two-class cross-entropy penalty function and a logistic activation function.
 a multi-class cross-entropy penalty function and a softmax activation function.
By assigning a softmax activation function on the output layer of the neural network for categorical target variables, the outputs can be interpreted as posterior probabilities, which are very useful.
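The softmax activation that turns output-layer scores into posterior probabilities can be sketched as below. `SoftmaxSketch` is a hypothetical object; subtracting the maximum score before exponentiating is a standard numerical-stability trick assumed here, not something the text specifies.

```scala
object SoftmaxSketch {
  /** Maps raw output scores to probabilities that are positive and sum to one. */
  def softmax(z: Array[Double]): Array[Double] = {
    require(z.nonEmpty, "at least one output unit is required")
    // Shifting by the maximum leaves the result unchanged but prevents overflow in exp.
    val m = z.max
    val e = z.map(v => math.exp(v - m))
    val s = e.sum
    e.map(_ / s)
  }
}
```

The class with the largest score keeps the largest probability, so taking the arg max of the softmax output gives the same prediction as taking the arg max of the raw scores.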
 x
training samples.
 y
training labels in [0, k), where k is the number of classes.
 numUnits
the number of units in each layer.
 error
the error function.
 activation
the activation function of output layer.
 epochs
the number of epochs of stochastic learning.
 eta
the learning rate.
 alpha
the momentum factor.
 lambda
the weight decay for regularization.

def naiveBayes(priori: Array[Double], condprob: Array[Array[Distribution]]): NaiveBayes
Creates a general naive Bayes classifier.
 priori
the priori probability of each class.
 condprob
the conditional distribution of each variable in each class. In particular, condprob[i][j] is the conditional distribution P(x_{j} | class i).

def naiveBayes(x: Array[Array[Double]], y: Array[Int], model: Model, priori: Array[Double] = null, sigma: Double = 1.0): NaiveBayes
Creates a naive Bayes classifier for document classification with add-k smoothing.
 x
training samples.
 y
training labels in [0, k), where k is the number of classes.
 model
the generation model of naive Bayes classifier.
 priori
the priori probability of each class. If null, equal probability is assumed for each class.
 sigma
the prior count of add-k smoothing of evidence.
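Add-k smoothing of term counts can be sketched as follows, assuming a multinomial-style model in which the conditional probability of a term is its smoothed relative frequency within a class. `AddKSmoothingSketch` is a hypothetical helper; how the library applies `sigma` for each `model` is not shown in this documentation.

```scala
object AddKSmoothingSketch {
  /** Smoothed estimate of P(term j | class) from the term counts observed in that class.
    * `sigma` is the prior count added to every term. */
  def smoothed(counts: Array[Int], sigma: Double): Array[Double] = {
    require(sigma >= 0.0, "prior count must be non-negative")
    val total = counts.sum + sigma * counts.length
    require(total > 0.0, "need at least one observation or a positive prior count")
    counts.map(c => (c + sigma) / total)
  }
}
```

With `sigma = 1.0` a term that never occurred in a class still receives a small nonzero probability, so a single unseen word cannot drive the class likelihood to zero.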

final
def
ne(arg0: AnyRef): Boolean
 Definition Classes
 AnyRef

final
def
notify(): Unit
 Definition Classes
 AnyRef

final
def
notifyAll(): Unit
 Definition Classes
 AnyRef

def nrbfnet[T <: AnyRef, RBF <: RadialBasisFunction](x: Array[T], y: Array[Int], distance: Metric[T], rbf: Array[RBF], centers: Array[T]): RBFNetwork[T]
Normalized radial basis function networks.

def nrbfnet[T <: AnyRef](x: Array[T], y: Array[Int], distance: Metric[T], rbf: RadialBasisFunction, centers: Array[T]): RBFNetwork[T]
Normalized radial basis function networks.

def qda(x: Array[Array[Double]], y: Array[Int], priori: Array[Double] = null, tol: Double = 0.0001): QDA
Quadratic discriminant analysis. QDA is closely related to linear discriminant analysis (LDA). Like LDA, QDA models the conditional probability density functions as Gaussian distributions, then uses the posterior distributions to estimate the class for a given test sample. Unlike LDA, however, QDA does not assume that the covariance of each of the classes is identical. Therefore, the resulting separating surface between the classes is quadratic.
The Gaussian parameters for each class can be estimated from training data with maximum likelihood (ML) estimation. However, when the number of training instances is small compared to the dimension of the input space, the ML covariance estimation can be ill-posed. One approach to resolve the ill-posed estimation is to regularize the covariance estimation. One of these regularization methods is regularized discriminant analysis.
 x
training samples.
 y
training labels in [0, k), where k is the number of classes.
 priori
the priori probability of each class. If null, it will be estimated from the training data.
 tol
a tolerance to decide if a covariance matrix is singular; it will reject variables whose variance is less than tol^{2}.
 returns
Quadratic discriminant analysis model.

def randomForest(x: Array[Array[Double]], y: Array[Int], attributes: Array[Attribute] = null, ntrees: Int = 500, maxNodes: Int = -1, nodeSize: Int = 1, mtry: Int = -1, subsample: Double = 1.0, splitRule: SplitRule = DecisionTree.SplitRule.GINI, classWeight: Array[Int] = null): RandomForest
Random forest for classification. Random forest is an ensemble classifier that consists of many decision trees and outputs the majority vote of individual trees. The method combines the bagging idea with the random selection of features.
Each tree is constructed using the following algorithm:
 If the number of cases in the training set is N, randomly sample N cases with replacement from the original data. This sample will be the training set for growing the tree.
 If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during the forest growing.
 Each tree is grown to the largest extent possible. There is no pruning.
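The two sources of randomness in the steps above, the bootstrap sample per tree and the random feature subset per node, can be sketched as follows. `ForestSamplingSketch` and its methods are hypothetical; `defaultMtry` encodes the floor(sqrt(dim)) heuristic mentioned in the `mtry` parameter description, and clamping it to at least 1 is an assumption of the example.

```scala
import scala.util.Random

object ForestSamplingSketch {
  /** N row indices drawn with replacement from 0 until n (a bootstrap sample). */
  def bootstrap(n: Int, rng: Random): Array[Int] = {
    require(n > 0, "training set must not be empty")
    Array.fill(n)(rng.nextInt(n))
  }

  /** m distinct feature indices out of 0 until dim, to be redrawn at every node. */
  def featureSubset(dim: Int, m: Int, rng: Random): Array[Int] = {
    require(m >= 1 && m <= dim, "m must be between 1 and the number of variables")
    rng.shuffle((0 until dim).toVector).take(m).toArray
  }

  /** The floor(sqrt(dim)) heuristic for m, never smaller than 1. */
  def defaultMtry(dim: Int): Int =
    math.max(1, math.floor(math.sqrt(dim.toDouble)).toInt)
}
```

Rows that a bootstrap sample happens to leave out form that tree's out-of-bag set, which is what makes the internal error estimate possible.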
The advantages of random forest are:
 For many data sets, it produces a highly accurate classifier.
 It runs efficiently on large data sets.
 It can handle thousands of input variables without variable deletion.
 It gives estimates of what variables are important in the classification.
 It generates an internal unbiased estimate of the generalization error as the forest building progresses.
 It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.
The disadvantages are:
 Random forests are prone to overfitting for some datasets. This is even more pronounced on noisy data.
 For data including categorical variables with different numbers of levels, random forests are biased in favor of the attributes with more levels. Therefore, the variable importance scores from random forest are not reliable for this type of data.
 x
the training instances.
 y
the response variable.
 attributes
the attribute properties. If not provided, all attributes are treated as numeric values.
 ntrees
the number of trees.
 maxNodes
maximum number of leaf nodes.
 nodeSize
number of instances in a node below which the tree will not split.
 mtry
the number of randomly selected features used to determine the decision at a node of the tree. floor(sqrt(dim)) seems to give generally good performance, where dim is the number of variables.
 subsample
the sampling rate for training each tree. A value of 1.0 means sampling with replacement; a value below 1.0 means sampling without replacement.
 splitRule
Decision tree node split rule.
 classWeight
Priors of the classes.
 returns
Random forest classification model.

def
rbfnet[T <: AnyRef, RBF <: RadialBasisFunction](x: Array[T], y: Array[Int], distance: Metric[T], rbf: Array[RBF], centers: Array[T]): RBFNetwork[T]
Radial basis function networks.
A radial basis function network is an artificial neural network that uses radial basis functions as activation functions. Its output is a linear combination of radial basis functions. Such networks are used in function approximation, time series prediction, and control.
In its basic form, radial basis function network is in the form
y(x) = Σ_{i=1}^{N} w_{i} φ(||x - c_{i}||)
where the approximating function y(x) is represented as a sum of N radial basis functions φ, each associated with a different center c_{i}, and weighted by an appropriate coefficient w_{i}. For distance, one usually chooses Euclidean distance. The weights w_{i} can be estimated using the matrix methods of linear least squares, because the approximating function is linear in the weights.
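As an illustration of the sum above, the following sketch evaluates a network output for given centers and weights. It is not the library's code: the object `RbfSketch` and its members are hypothetical, and the Gaussian form exp(-r^2 / (2 r0^2)) is one assumed parameterization of the scale r0 among several in use.

```scala
object RbfSketch {
  /** Euclidean distance ||a - b|| between two points of equal dimension. */
  def euclidean(a: Array[Double], b: Array[Double]): Double =
    math.sqrt(a.zip(b).map { case (p, q) => (p - q) * (p - q) }.sum)

  /** Gaussian basis function; the form exp(-r^2 / (2 r0^2)) is an assumed convention. */
  def gaussian(r0: Double): Double => Double =
    r => math.exp(-r * r / (2 * r0 * r0))

  /** y(x) = sum over i of w_i * phi(||x - c_i||). */
  def output(x: Array[Double], centers: Array[Array[Double]],
             weights: Array[Double], phi: Double => Double): Double =
    centers.zip(weights).map { case (c, w) => w * phi(euclidean(x, c)) }.sum
}
```

Because `output` is linear in `weights`, fitting the weights for fixed centers reduces to a linear least squares problem, as the text notes.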
The centers c_{i} can be randomly selected from the training data, learned by some clustering method (e.g. k-means), or learned together with the weight parameters through a supervised learning process (e.g. error-correction learning).
The popular choices for φ comprise the Gaussian function and the so-called thin plate splines. The advantage of the thin plate splines is that their conditioning is invariant under scalings. Gaussian, multiquadric and inverse multiquadric functions are infinitely smooth and involve a scale or shape parameter, r_{0} > 0. Decreasing r_{0} tends to flatten the basis function. For a given function, the quality of approximation may strongly depend on this parameter. In particular, increasing r_{0} has the effect of better conditioning (the separation distance of the scaled points increases).
A variant on RBF networks is normalized radial basis function (NRBF) networks, in which we require the sum of the basis functions to be unity. NRBF arises more naturally from a Bayesian statistical perspective. However, there is no evidence that either the NRBF method is consistently superior to the RBF method, or vice versa.
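The normalization can be sketched as dividing each basis activation by the sum of all activations before applying the weights. The names below are hypothetical, and the sketch assumes the activations are precomputed and that their sum is strictly positive.

```scala
object NormalizedRbfSketch {
  /** Rescales the basis activations so that they sum to one. */
  def normalize(phi: Array[Double]): Array[Double] = {
    val total = phi.sum // assumed to be strictly positive
    phi.map(_ / total)
  }

  /** NRBF output: sum_i w_i * phi_i / sum_j phi_j. */
  def output(phi: Array[Double], weights: Array[Double]): Double =
    normalize(phi).zip(weights).map { case (p, w) => p * w }.sum
}
```

With this normalization the output is a weighted average of the weights, so it always lies between the smallest and largest w_{i}.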
SVMs with a Gaussian kernel have a similar structure to RBF networks with Gaussian radial basis functions. However, the SVM approach "automatically" solves the network complexity problem since the size of the hidden layer is obtained as the result of the QP procedure. Hidden neurons and support vectors correspond to each other, so the center problem of the RBF network is also solved, as the support vectors serve as the basis function centers. It was reported that with a similar number of support vectors/centers, SVM shows better generalization performance than RBF network when the training data size is relatively small. On the other hand, RBF network gives better generalization performance than SVM on large training data.
References:
 Simon Haykin. Neural Networks: A Comprehensive Foundation (2nd edition). 1999.
 T. Poggio and F. Girosi. Networks for approximation and learning. Proc. IEEE 78(9):1484-1487, 1990.
 Nabil Benoudjit and Michel Verleysen. On the kernel widths in radial-basis function networks. Neural Process, 2003.
 x
training samples.
 y
training labels in [0, k), where k is the number of classes.
 distance
the distance metric functor.
 rbf
the radial basis functions at each center.
 centers
the centers of RBF functions.

def
rbfnet[T <: AnyRef](x: Array[T], y: Array[Int], distance: Metric[T], rbf: RadialBasisFunction, centers: Array[T]): RBFNetwork[T]
Radial basis function networks.
A radial basis function network is an artificial neural network that uses radial basis functions as activation functions. Its output is a linear combination of radial basis functions. Such networks are used in function approximation, time series prediction, and control.
In its basic form, radial basis function network is in the form
y(x) = Σ_{i=1}^{N} w_{i} φ(||x - c_{i}||)
where the approximating function y(x) is represented as a sum of N radial basis functions φ, each associated with a different center c_{i}, and weighted by an appropriate coefficient w_{i}. For distance, one usually chooses Euclidean distance. The weights w_{i} can be estimated using the matrix methods of linear least squares, because the approximating function is linear in the weights.
The centers c_{i} can be randomly selected from the training data, learned by some clustering method (e.g. k-means), or learned together with the weight parameters through a supervised learning process (e.g. error-correction learning).
The popular choices for φ comprise the Gaussian function and the so-called thin plate splines. The advantage of the thin plate splines is that their conditioning is invariant under scalings. Gaussian, multiquadric and inverse multiquadric functions are infinitely smooth and involve a scale or shape parameter, r_{0} > 0. Decreasing r_{0} tends to flatten the basis function. For a given function, the quality of approximation may strongly depend on this parameter. In particular, increasing r_{0} has the effect of better conditioning (the separation distance of the scaled points increases).
A variant on RBF networks is normalized radial basis function (NRBF) networks, in which we require the sum of the basis functions to be unity. NRBF arises more naturally from a Bayesian statistical perspective. However, there is no evidence that either the NRBF method is consistently superior to the RBF method, or vice versa.
SVMs with a Gaussian kernel have a similar structure to RBF networks with Gaussian radial basis functions. However, the SVM approach "automatically" solves the network complexity problem since the size of the hidden layer is obtained as the result of the QP procedure. Hidden neurons and support vectors correspond to each other, so the center problem of the RBF network is also solved, as the support vectors serve as the basis function centers. It was reported that with a similar number of support vectors/centers, SVM shows better generalization performance than RBF network when the training data size is relatively small. On the other hand, RBF network gives better generalization performance than SVM on large training data.
References:
 Simon Haykin. Neural Networks: A Comprehensive Foundation (2nd edition). 1999.
 T. Poggio and F. Girosi. Networks for approximation and learning. Proc. IEEE 78(9):1484-1487, 1990.
 Nabil Benoudjit and Michel Verleysen. On the kernel widths in radial-basis function networks. Neural Process, 2003.
 x
training samples.
 y
training labels in [0, k), where k is the number of classes.
 distance
the distance metric functor.
 rbf
the radial basis function.
 centers
the centers of RBF functions.

def
rda(x: Array[Array[Double]], y: Array[Int], alpha: Double, priori: Array[Double] = null, tol: Double = 0.0001): RDA
Regularized discriminant analysis.
RDA is a compromise between LDA and QDA, which allows one to shrink the separate covariances of QDA toward a common covariance as in LDA. This method is very similar in flavor to ridge regression. The regularized covariance matrix of each class is Σ_{k}(α) = α Σ_{k} + (1 - α) Σ. The quadratic discriminant function is defined using the shrunken covariance matrices Σ_{k}(α). The parameter α in [0, 1] controls the complexity of the model. When α is one, RDA becomes QDA. When α is zero, RDA is equivalent to LDA. Therefore, the regularization factor α allows a continuum of models between LDA and QDA.
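The shrinkage formula can be sketched as an element-wise blend of a class covariance matrix with the pooled covariance matrix. This is only an illustration of the formula; the object `RdaShrinkage` and its `shrink` helper are hypothetical names, not part of the documented API.

```scala
object RdaShrinkage {
  type Matrix = Array[Array[Double]]

  /** Computes alpha * classCov + (1 - alpha) * pooledCov element by element. */
  def shrink(classCov: Matrix, pooledCov: Matrix, alpha: Double): Matrix = {
    require(alpha >= 0.0 && alpha <= 1.0, "alpha must lie in [0, 1]")
    classCov.zip(pooledCov).map { case (rowK, rowP) =>
      rowK.zip(rowP).map { case (sk, s) => alpha * sk + (1 - alpha) * s }
    }
  }
}
```

Passing `alpha = 1.0` returns the class covariance unchanged (the QDA end of the continuum), while `alpha = 0.0` returns the pooled covariance (the LDA end).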
 x
training samples.
 y
training labels in [0, k), where k is the number of classes.
 alpha
regularization factor in [0, 1], which allows a continuum of models between LDA and QDA.
 priori
the priori probability of each class.
 tol
tolerance to decide if a covariance matrix is singular; it will reject variables whose variance is less than tol^{2}.
 returns
Regularized discriminant analysis model.

def
svm[T <: AnyRef](x: Array[T], y: Array[Int], kernel: MercerKernel[T], C: Double, strategy: Multiclass = SVM.Multiclass.ONE_VS_ONE, epoch: Int = 1): SVM[T]
Support vector machines for classification.
The basic support vector machine is a binary linear classifier which chooses the hyperplane that represents the largest separation, or margin, between the two classes. If such a hyperplane exists, it is known as the maximum-margin hyperplane and the linear classifier it defines is known as a maximum margin classifier.
If there exists no hyperplane that can perfectly split the positive and negative instances, the soft margin method will choose a hyperplane that splits the instances as cleanly as possible, while still maximizing the distance to the nearest cleanly split instances.
Nonlinear SVMs are created by applying the kernel trick to maximum-margin hyperplanes. The resulting algorithm is formally similar, except that every dot product is replaced by a nonlinear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. The transformation may be nonlinear and the transformed space may be high-dimensional. For example, the feature space corresponding to the Gaussian kernel is a Hilbert space of infinite dimension. Thus, although the classifier is a hyperplane in the high-dimensional feature space, it may be nonlinear in the original input space. Maximum margin classifiers are well regularized, so the infinite dimension does not spoil the results.
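The substitution of a kernel for the dot product can be sketched with the usual decision function f(x) = Σ α_{i} y_{i} k(x_{i}, x) + b. All names below are hypothetical, the support vectors and coefficients are assumed to be given, and the Gaussian kernel is written in the common form exp(-||a - b||^2 / (2 σ^2)).

```scala
object KernelSketch {
  type Kernel = (Array[Double], Array[Double]) => Double

  /** Linear kernel: the ordinary dot product. */
  val linear: Kernel = (a, b) => a.zip(b).map { case (p, q) => p * q }.sum

  /** Gaussian kernel exp(-||a - b||^2 / (2 sigma^2)). */
  def gaussian(sigma: Double): Kernel = (a, b) => {
    val sq = a.zip(b).map { case (p, q) => (p - q) * (p - q) }.sum
    math.exp(-sq / (2 * sigma * sigma))
  }

  /** f(x) = sum_i alphaY_i * k(sv_i, x) + b, where alphaY_i stands for alpha_i * y_i. */
  def decision(x: Array[Double], supportVectors: Array[Array[Double]],
               alphaY: Array[Double], b: Double, k: Kernel): Double =
    supportVectors.zip(alphaY).map { case (sv, ay) => ay * k(sv, x) }.sum + b
}
```

Swapping `KernelSketch.linear` for `KernelSketch.gaussian(sigma)` changes the decision boundary from a hyperplane in the input space to a nonlinear one, without changing the form of `decision`.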
The effectiveness of SVM depends on the selection of the kernel, the kernel's parameters, and the soft margin parameter C. Given a kernel, the best combination of C and the kernel's parameters is often selected by a grid search with cross validation.
The dominant approach for creating multiclass SVMs is to reduce the single multiclass problem to multiple binary classification problems. Common reductions build binary classifiers that distinguish (i) one of the labels from the rest (one-versus-all) or (ii) every pair of classes (one-versus-one). Classification of new instances in the one-versus-all case is done by a winner-takes-all strategy, in which the classifier with the highest output function assigns the class. For the one-versus-one approach, classification is done by a max-wins voting strategy, in which every classifier assigns the instance to one of its two classes, the vote for the assigned class is increased by one, and finally the class with the most votes determines the instance classification.
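The max-wins voting rule for the one-versus-one case can be sketched as below. The object `OneVsOneVoting` and the `pairwise` callback are hypothetical stand-ins for the trained binary classifiers, and breaking ties in favor of the lowest class index is an assumption of this sketch.

```scala
object OneVsOneVoting {
  /**
   * Max-wins voting over all class pairs (i, j) with i < j.
   * pairwise(i, j, x) must return either i or j, the winner for instance x.
   */
  def predict[T](x: T, k: Int, pairwise: (Int, Int, T) => Int): Int = {
    val votes = new Array[Int](k)
    for (i <- 0 until k; j <- i + 1 until k) votes(pairwise(i, j, x)) += 1
    // Ties are resolved in favor of the lowest class index (an assumption).
    votes.indexOf(votes.max)
  }
}
```

For k classes this evaluates k(k - 1)/2 binary classifiers per instance, compared with k classifiers for the one-versus-all reduction.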
 T
the data type
 x
training data
 y
training labels
 kernel
Mercer kernel
 C
Regularization parameter
 strategy
Multiclass classification strategy, one vs all or one vs one. Ignored for binary classification.
 epoch
the number of training epochs
 returns
SVM model.

final
def
synchronized[T0](arg0: ⇒ T0): T0
 Definition Classes
 AnyRef

def
toString(): String
 Definition Classes
 AnyRef → Any

final
def
wait(): Unit
 Definition Classes
 AnyRef
 Annotations
 @throws( ... )

final
def
wait(arg0: Long, arg1: Int): Unit
 Definition Classes
 AnyRef
 Annotations
 @throws( ... )

final
def
wait(arg0: Long): Unit
 Definition Classes
 AnyRef
 Annotations
 @throws( ... )
 def →[B](y: B): (Operators, B)
High level Smile operators in Scala.