Deep Learning

Deep learning is based on artificial neural networks (ANNs) with representation learning. The adjective "deep" refers to the use of multiple layers in the network. Fundamentally, deep learning algorithms such as convolutional neural networks and transformers leverage a hierarchy of layers, each of which transforms its input into a slightly more abstract and composite representation.

Importantly, a deep learning process can learn on its own which features to place at which level. Prior to deep learning, machine learning techniques often involved hand-crafted feature engineering to transform the data into a more suitable representation for a classification algorithm to operate upon. In the deep learning approach, features are not hand-crafted; the model discovers useful feature representations from the data automatically. This does not eliminate the need for hand-tuning, however; for example, varying the number of layers and the layer sizes provides different degrees of abstraction.

While the smile-core module provides an MLP (multi-layer perceptron) for classification and regression tasks on tabular data, the smile-deep module provides advanced algorithms for computer vision and large language models (LLMs). Furthermore, smile-deep supports GPU devices.
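
As a quick illustration of the hand-tuning mentioned above, the sketch below contrasts two architectures for the same 784-dimensional input, using the smile-deep building blocks that appear in the example of the next section. The layer counts and widths here are arbitrary choices for illustration, not recommendations.

    import smile.deep.layer.*;

    // A shallower network: a single hidden layer of 32 units.
    var shallow = new SequentialBlock(
            Layer.relu(784, 32),
            Layer.logSoftmax(32, 10));

    // A deeper, wider network: three hidden layers that can build up
    // more levels of abstraction from the same input.
    var deeper = new SequentialBlock(
            Layer.relu(784, 256),
            Layer.relu(256, 64),
            Layer.relu(64, 32),
            Layer.logSoftmax(32, 10));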

A Gentle Example

In the code snippet below, we show how to train a model on the MNIST dataset. On line 5, we call Device.preferredDevice(), which returns a GPU device if one exists and the default CPU device otherwise. You can also create a Device object by calling a factory method such as Device.GPU(0), Device.MPS(), or Device.CPU(). On line 6, we set the returned device as the default compute device. Lines 5 and 6 are optional; without them, the CPU is used as the default compute device.
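
For reference, a device can also be chosen explicitly with the factory methods mentioned above. Which ones are available depends on your hardware, and the GPU index 0 below is just an example.

    Device gpu = Device.GPU(0);   // the first CUDA device, if present
    Device mps = Device.MPS();    // Apple Metal Performance Shaders
    Device cpu = Device.CPU();
    gpu.setDefaultDevice();       // make the chosen device the default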

On line 8, we define a deep learning model with a sequential block of layers. For complicated models, it is helpful to print out the model structure for verification, as we do on line 14. On line 15, we move the model to the preferred compute device.

    import smile.deep.layer.*;
    import smile.deep.metric.*;
    import smile.deep.tensor.*;

    Device device = Device.preferredDevice();
    device.setDefaultDevice();

    Model net = new Model(new SequentialBlock(
            Layer.relu(784, 64, 0.5),
            Layer.relu(64, 32),
            Layer.logSoftmax(32, 10))
    );

    System.out.println(net);
    net.to(device);

    CSVFormat format = CSVFormat.Builder.create().setDelimiter(' ').build();
    double[][] x = Read.csv("data/mnist/mnist2500_X.txt", format).toArray();
    int[] y = Read.csv("data/mnist/mnist2500_labels.txt", format).column(0).toIntArray();
    Dataset dataset = Dataset.of(x, y, 64);

    Optimizer optimizer = Optimizer.SGD(net, 0.01);
    Loss loss = Loss.nll();
    net.train(100, optimizer, loss, dataset);

    try (var guard = Tensor.noGradGuard()) {
        Map<String, Double> metrics = net.eval(dataset,
                new Accuracy(),
                new Precision(Averaging.Micro),
                new Precision(Averaging.Macro),
                new Precision(Averaging.Weighted),
                new Recall(Averaging.Micro),
                new Recall(Averaging.Macro),
                new Recall(Averaging.Weighted));
        for (var entry : metrics.entrySet()) {
            System.out.format("Training %s = %.2f%%\n", entry.getKey(), 100 * entry.getValue());
        }
    }

From lines 17 to 19, we load a sample of the MNIST data, in the same way we would with smile-core. The data are read in as a plain double[][]. Then on line 20, we create a Dataset object that wraps the data and the target labels. The Dataset object implements the Iterable interface, so that when we loop through it, it emits mini-batches of size 64, as specified by the third parameter.
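
As a small sanity check of that behavior, the loop below simply counts the mini-batches produced in one pass over the data; it relies only on the Iterable contract described above.

    // Each iteration yields one mini-batch wrapping up to 64 samples and their labels.
    int batches = 0;
    for (var batch : dataset) {
        batches++;
    }
    System.out.println(batches + " mini-batches per epoch");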

From lines 22 to 24, we create an SGD (stochastic gradient descent) optimizer and the negative log-likelihood (NLL) loss function, and train the model for 100 epochs. The whole process should finish very quickly (e.g., about 15 seconds on a CPU). Finally, we evaluate the model with a variety of metrics from lines 26 to 38. Note that the evaluation is on the training data only for demonstration purposes; in practice, it is better to evaluate on a hold-out test dataset, as sketched below. On line 26, we create a no-grad guard in a try-with-resources statement to disable gradient computation. The inference code should be inside this block. This is very helpful for inference as it minimizes memory usage and avoids a lot of unnecessary computation. The guard is automatically released when the block finishes.
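
For example, a hold-out evaluation could look like the following sketch, which reuses the API from the snippet above. The test file names are placeholders for your own held-out split, not files shipped with Smile.

    // Hypothetical hold-out evaluation: the test file names below are placeholders.
    double[][] testX = Read.csv("data/mnist/mnist_test_X.txt", format).toArray();
    int[] testY = Read.csv("data/mnist/mnist_test_labels.txt", format).column(0).toIntArray();
    Dataset testData = Dataset.of(testX, testY, 64);

    try (var guard = Tensor.noGradGuard()) {
        Map<String, Double> testMetrics = net.eval(testData, new Accuracy());
        for (var entry : testMetrics.entrySet()) {
            System.out.format("Test %s = %.2f%%\n", entry.getKey(), 100 * entry.getValue());
        }
    }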

EfficientNet

In the previous section, we trained a model from scratch. In this section, we demonstrate image classification with pretrained EfficientNetV2 models. EfficientNetV2 is a family of convolutional networks with faster training speed and better parameter efficiency than previous models.

On line 1, we create an instance of the EfficientNet V2_S (small) model, which loads the pretrained weights from model/EfficientNet/efficientnet_v2_s.pt relative to the working directory. You may download the weights from smile-ai.org. On line 2, we move the model to the compute device, and on line 3 we switch it to evaluation mode.

    var model = EfficientNet.V2S();
    model.to(device);
    model.eval();

    var lenna = ImageIO.read(new File("data/image/Lenna.png"));
    var panda = ImageIO.read(new File("data/image/panda.jpg"));

    try (var guard = Tensor.noGradGuard()) {
        long startTime = System.nanoTime();
        var output = model.forward(panda);
        long endTime = System.nanoTime();
        long duration = (endTime - startTime) / 1000000;  //divide by 1000000 to get milliseconds.
        System.out.println("1st run elapsed time: " + duration + "ms");

        startTime = System.nanoTime();
        output = model.forward(lenna, panda);
        endTime = System.nanoTime();
        duration = (endTime - startTime) / 1000000;
        System.out.println("2nd run elapsed time: " + duration + "ms");

        var topk = output.topk(5);
        topk._2().to(Device.CPU());
        String[] images = {"Lenna", "Panda"};
        for (int i = 0; i < 2; i++) {
            System.out.println("======== " + images[i] + " ========");
            for (int j = 0; j < 5; j++) {
                System.out.println(ImageNet.labels[topk._2().getInt(i, j)]);
            }
        }
    }

Note that we run the inference twice for benchmarking. The first inference is typically slow for several reasons. The very first CUDA call (it could be a tensor creation, for example) creates the CUDA context, which loads the driver. The first inference also needs to allocate new memory, which is then reused through the CUDACachingAllocator. However, the initial cudaMalloc calls are expensive compared to simply reusing already allocated memory, so iteration times remain slow until the workload reaches its peak memory usage and can reuse the GPU memory. Note that new cudaMalloc calls can of course still happen later, e.g. if the input size increases.
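
If steady-state latency is what matters, a common pattern is to run an untimed warm-up pass and then average over several timed runs, as in the rough sketch below (the run count of 10 is arbitrary).

    // Warm-up sketch: pay the CUDA context creation and memory allocation cost
    // up front, then time only steady-state forward passes and report the average.
    try (var guard = Tensor.noGradGuard()) {
        model.forward(panda);                      // warm-up run, not timed

        int runs = 10;                             // arbitrary number of timed runs
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            model.forward(lenna, panda);
        }
        long avgMillis = (System.nanoTime() - start) / runs / 1000000;
        System.out.println("Average elapsed time: " + avgMillis + "ms");
    }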
