Jupyter Notebook is a web-based interactive computational environment. A Jupyter Notebook document is a JSON document, following a versioned schema, and containing an ordered list of input/output cells which can contain code, text (using Markdown), mathematics, plots and rich media, usually ending with the ".ipynb" extension. BeakerX is a collection of kernels and extensions to the Jupyter interactive computing environment. It provides JVM support, Spark cluster support, polyglot programming, interactive plots, tables, forms, publishing, and more.

BeakerX is available via conda, pip, and docker. Or try it live online with Binder. You can install with conda with just:

`conda install -c conda-forge ipywidgets beakerx`

Docker is a great way to run and manage servers including Jupyter with BeakerX built-in. BeakerX has a container distributed by Docker Hub so you can download and run it with just one command:

`docker run -p 8888:8888 beakerx/beakerx`

Smile (Statistical Machine Intelligence and Learning Engine) is a fast and comprehensive machine learning, NLP, linear algebra, graph, interpolation, and visualization system in Java and Scala. With advanced data structures and algorithms, Smile delivers state-of-art performance. Smile covers every aspect of machine learning, including classification, regression, clustering, association rule mining, feature selection, manifold learning, multidimensional scaling, genetic algorithms, missing value imputation, efficient nearest neighbor search, etc. See the project website for programming guides and more information.

In [1]:

```
%%classpath add mvn
com.github.haifengl smile-scala_2.11 1.5.2
```

In [2]:

```
import smile._
import smile.util._
import smile.math._
import java.lang.Math._
import smile.math.Math.{log2, logistic, factorial, choose, random, randomInt, permutate, c, cbind, rbind, sum, mean, median, q1, q3, `var` => variance, sd, mad, min, max, whichMin, whichMax, unique, dot, distance, pdist, KullbackLeiblerDivergence => kld, JensenShannonDivergence => jsd, cov, cor, spearman, kendall, norm, norm1, norm2, normInf, standardize, normalize, scale, unitize, unitize1, unitize2, root}
import smile.math.distance._
import smile.math.kernel._
import smile.math.matrix._
import smile.math.matrix.Matrix._
import smile.stat.distribution._
import smile.data._
import java.awt.Color, smile.plot._
import smile.interpolation._
import smile.validation._
import smile.association._
import smile.classification._
import smile.regression.{ols, ridge, lasso, svr, gpr}
import smile.feature._
import smile.clustering._
import smile.vq._
import smile.manifold._
import smile.mds._
import smile.sequence._
import smile.projection._
import smile.nlp._
import smile.wavelet._
```

Out[2]:

Most Smile algorithms take simple `double[]`

as input. But we often use the encapsulation class AttributeDataset. In fact, the output of most Smile data parsers is an `AttributeDataset`

object. `AttributeDataset`

contains a fixed number of attributes. All attribute values are stored as double even if the attribute may be nominal, ordinal, string, or date. The dataset is stored row-wise internally, which is fast for frequently accessing instances of dataset.

A data object may have an associated class label (for classification) or real-valued response value (for regression). Optionally, a data object or attribute may have a (positive) weight value, whose meaning depends on applications. However, most machine learning methods are not able to utilize this extra weight information.

`AttributeDataset`

may also contains meta data such as data name and descriptions.

Suppose we have an `AttributeDataset`

object, which may be created by a parser. To feed the data to a learning algorithm, we need to unwrap the data by the function `unzip`

function. If the data also have labels, the function `unzipInt`

can be used to return a tuple `(x, y)`

, of which `x`

is an array of feature vectors and `y`

is an array of labels. If the response variable is real valued in case of regression, `unzipDouble`

can be used.

In [3]:

```
val iris = read.arff("iris.arff", 4)
```

Out[3]:

In [4]:

```
iris.summary
```

Out[4]:

In [5]:

```
iris.tail(10)
```

Out[5]:

In [6]:

```
val (x, y) = iris.unzipInt
```

Out[6]:

In Scala, we can also wrap `AttributeDataset`

into a `DataFrame`

, which provides advanced data manipulation functions. For example, we can get a row with the array syntax or refer a column by its name.

In [7]:

```
val df = DataFrame(iris)
```

Out[7]:

In [8]:

```
df.sepallength
```

Out[8]:

It is also easy to create a subset of data frame.

In [9]:

```
df(10,100)
```

Out[9]:

Similarly, we can select a few columns to create a new data frame.

In [10]:

```
df("sepallength", "sepalwidth")
```

Out[10]:

Advanced operations such as `exists`

, `forall`

, `find`

, `filter`

are also supported. The predicate of these functions expect a Row, which has fields x for attribute vector and y for responsible variable. If the response variable is a class label, the user may use label to access the response variable in string label.

In [11]:

```
df.filter { row => row.x(1) > 3 && row.y != 0 }
```

Out[11]:

In machine learning and pattern recognition, classification refers to an algorithmic procedure for assigning a given input object into one of a given number of categories. The input object is formally termed an instance, and the categories are termed classes.

Classification normally refers to a supervised procedure, i.e. a procedure that produces an inferred function to predict the output value of new instances based on a training set of pairs consisting of an input object and a desired output value. The inferred function is called a classifier if the output is discrete or a regression function if the output is continuous.

The inferred function should predict the correct output value for any valid input object. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way.

A wide range of supervised learning algorithms is available, each with its strengths and weaknesses. There is no single learning algorithm that works best on all supervised learning problems. The most widely used learning algorithms are AdaBoost and gradient boosting, support vector machines, linear regression, linear discriminant analysis, logistic regression, naive Bayes, decision trees, k-nearest neighbor algorithm, and neural networks (multilayer perceptron).

If the feature vectors include features of many different kinds (discrete, discrete ordered, counts, continuous values), some algorithms cannot be easily applied. Many algorithms, including linear regression, logistic regression, neural networks, and nearest neighbor methods, require that the input features be numerical and scaled to similar ranges (e.g., to the `[-1,1]`

interval). Methods that employ a distance function, such as nearest neighbor methods and support vector machines with Gaussian kernels, are particularly sensitive to this. An advantage of decision trees (and boosting algorithms based on decision trees) is that they easily handle heterogeneous data.

If the input features contain redundant information (e.g., highly correlated features), some learning algorithms (e.g., linear regression, logistic regression, and distance based methods) will perform poorly because of numerical instabilities. These problems can often be solved by imposing some form of regularization.

If each of the features makes an independent contribution to the output, then algorithms based on linear functions (e.g., linear regression, logistic regression, linear support vector machines, naive Bayes) generally perform well. However, if there are complex interactions among features, then algorithms such as nonlinear support vector machines, decision trees and neural networks work better. Linear methods can also be applied, but the engineer must manually specify the interactions when using them.

Smile's classification algorithms are in the package `smile.classification`

and all algorithms implement the interface `Classifier`

that has a method `predict`

to predict the class label of an instance. An overloaded version in `SoftClassifier`

can also calculate the a posteriori probabilities besides the class label. For all algorithms, the model can be trained through the constructor. Meanwhile, each algorithm has a `Trainer`

companion class that can hold model hyper-parameters and be applied to multiple training datasets.

Some algorithms with online learning capability also implement the interface `OnlineClassifier`

. Online learning is a model of induction that learns one instance at a time. The method learn updates the model with a new instance.

The k-nearest neighbor algorithm (k-NN) is a method for classifying objects by a majority vote of its neighbors, with the object being assigned to the class most common amongst its `k`

nearest neighbors (`k`

is typically small). k-NN is a type of instance-based learning, or lazy learning where the function is only approximated locally and all computation is deferred until classification.

In [12]:

```
cv(x, y, 10) { case (x, y) => knn(x, y, 3) }
```

Out[12]:

Random forest is an ensemble classifier that consists of many decision trees and outputs the majority vote of individual trees. The method combines bagging idea and the random selection of features.

Each tree is constructed using the following algorithm:

- If the number of cases in the training set is
`N`

, randomly sample`N`

cases with replacement from the original data. This sample will be the training set for growing the tree. - If there are
`M`

input variables, a number`m`

<<`M`

is specified such that at each node,`m`

variables are selected at random out of the`M`

and the best split on these m is used to split the node. The value of`m`

is held constant during the forest growing. - Each tree is grown to the largest extent possible. There is no pruning.

where ntrees is the number of trees, and mtry is the number of attributed randomly selected at each node to choose the best split. Although the original random forest algorithm trains each tree fully without pruning, it is useful to control the tree size some times, which can be achieved by the parameter maxNodes. The tree can also be regularized by limiting the minimum number of observations in trees' terminal nodes with the parameter nodeSize. When subsample = 1.0, we use the sampling with replacement to train each tree as described above. If subsample < 1.0, we instead select a subset of samples (without replacement) to train each tree. If the classes are not balanced, the user should provide the classWeight (proportional to the class priori) so that the sampling is done in stratified way. Otherwise, small classes may be not sampled sufficiently.

In [13]:

```
cv(x, y, 10) { case (x, y) => randomForest(x, y, iris.attributes) }
```

Out[13]:

The basic support vector machine (SVM) is a binary linear classifier which chooses the hyperplane that represents the largest separation, or margin, between the two classes. If such a hyperplane exists, it is known as the maximum-margin hyperplane and the linear classifier it defines is known as a maximum margin classifier.

In [14]:

```
cv(x, y, 10) { case (x, y) => svm(x, y, new LinearKernel, C = 1, epoch = 2) }
```

Out[14]:

If there exists no hyperplane that can perfectly split the positive and negative instances, the soft margin method will choose a hyperplane that splits the instances as cleanly as possible, while still maximizing the distance to the nearest cleanly split instances.

The nonlinear SVMs are created by applying the kernel trick to maximum-margin hyperplanes. The resulting algorithm is formally similar, except that every dot product is replaced by a nonlinear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in a transformed feature space. The transformation may be nonlinear and the transformed space be high dimensional. For example, the feature space corresponding Gaussian kernel is a Hilbert space of infinite dimension. Thus though the classifier is a hyperplane in the high-dimensional feature space, it may be nonlinear in the original input space. Maximum margin classifiers are well regularized, so the infinite dimension does not spoil the results.

The effectiveness of SVM depends on the selection of kernel, the kernel's parameters, and soft margin parameter `C`

. Given a kernel, best combination of C and kernel's parameters is often selected by a grid-search with cross validation.

A multilayer perceptron neural network consists of several layers of nodes, interconnected through weighted acyclic arcs from each preceding layer to the following, without lateral or feedback connections. Each node calculates a transformed weighted linear combination of its inputs (output activations from the preceding layer), with one of the weights acting as a trainable bias connected to a constant input. The transformation, called activation function, is a bounded non-decreasing (non-linear) function, such as the sigmoid functions (ranges from `0`

to `1`

). Another popular activation function is hyperbolic tangent which is actually equivalent to the sigmoid function in shape but ranges from `-1`

to `1`

. More specialized activation functions include radial basis functions which are used in RBF networks.

For neural networks, the input patterns usually should be scaled/standardized. Commonly, each input variable is scaled into interval `[0, 1]`

or to have mean `0`

and standard deviation `1`

.

In [15]:

```
cv(x, y, 10) { case (x, y) => mlp(x, y, numUnits = c(4, 20, 3), error = NeuralNetwork.ErrorFunction.CROSS_ENTROPY, activation = NeuralNetwork.ActivationFunction.SOFTMAX) }
```

Out[15]:

Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. Clustering is a method of unsupervised learning, and a common technique for statistical data analysis used in many fields.

Hierarchical algorithms find successive clusters using previously established clusters. These algorithms usually are either agglomerative ("bottom-up") or divisive ("top-down"). Agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.

Partitional algorithms typically determine all clusters at once, but can also be used as divisive algorithms in the hierarchical clustering. Many partitional clustering algorithms require the specification of the number of clusters to produce in the input data set, prior to execution of the algorithm. Barring knowledge of the proper value beforehand, the appropriate value must be determined, a problem on its own for which a number of techniques have been developed.

Density-based clustering algorithms are devised to discover arbitrary-shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold.

Subspace clustering methods look for clusters that can only be seen in a particular projection (subspace, manifold) of the data. These methods thus can ignore irrelevant attributes. The general problem is also known as Correlation clustering while the special case of axis-parallel subspaces is also known as two-way clustering, co-clustering or biclustering in bioinformatics: in these methods not only the objects are clustered but also the features of the objects, i.e., if the data is represented in a data matrix, the rows and columns are clustered simultaneously. They usually do not however work with arbitrary feature combinations as in general subspace methods.

Agglomerative hierarchical clustering seeks to build a hierarchy of clusters in a bottom up approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. The results of hierarchical clustering are usually presented in a dendrogram.

In general, the merges are determined in a greedy manner. In order to decide which clusters should be combined, a measure of dissimilarity between sets of observations is required. In most methods of hierarchical clustering, this is achieved by use of an appropriate metric, and a linkage criteria which specifies the dissimilarity of sets as a function of the pairwise distances of observations in the sets.

Hierarchical clustering has the distinct advantage that any valid measure of distance can be used. In fact, the observations themselves are not required: all that is used is a matrix of distances.

In [18]:

```
val clusters = hclust(pdist(x), "complete")
clusters.getTree
```

Out[18]:

K-Means clustering partitions n observations into k clusters in which each observation belongs to the cluster with the nearest mean. Although finding an exact solution to the K-Means problem for arbitrary input is NP-hard, the standard approach to finding an approximate solution (often called Lloyd's algorithm or the K-Means algorithm) is used widely and frequently finds reasonable solutions quickly.

K-Means is a hard clustering method, i.e. each sample is assigned to a specific cluster. In contrast, soft clustering, e.g. the Expectation-Maximization algorithm for Gaussian mixtures, assign samples to different clusters with different probabilities.

The K-Means algorithm has at least two major theoretic shortcomings:

First, it has been shown that the worst case running time of the algorithm is super-polynomial in the input size.
Second, the approximation found can be arbitrarily bad with respect to the objective function compared to the optimal learn.
In Smile, we use K-Means++ which addresses the second of these obstacles by specifying a procedure to initialize the cluster centers before proceeding with the standard K-Means optimization iterations. With the K-Means++ initialization, the algorithm is guaranteed to find a solution that is `O(log k)`

competitive to the optimal K-Means solution.

We also use K-D trees to speed up each K-Means step as described in the filter algorithm by Kanungo, et al.

In [19]:

```
val clusters = kmeans(x, 6, runs = 20)
val y = clusters.getClusterLabel
```

Out[19]: