FAQ

How should I cite Smile?

Please cite Smile in your publications if it helps your research. Here is an example BibTeX entry:


    @misc{Li2014Smile,
      title={Smile},
      author={Haifeng Li},
      year={2014},
      howpublished={\url{https://haifengl.github.io}},
    }
    

Smile artifacts are hosted in Sonatype Nexus. You can add the following dependency to your pom.xml:


    <dependency>
      <groupId>com.github.haifengl</groupId>
      <artifactId>smile-core</artifactId>
      <version>2.5.3</version>
    </dependency>
    

If you're using Gradle, add the following line to the dependencies section of your build file:


    implementation("com.github.haifengl:smile-core:2.5.3")
    

If you're using SBT, add the following line to your build file:


    libraryDependencies += "com.github.haifengl" % "smile-core" % "2.5.3"
    

For the Scala API, use


    libraryDependencies += "com.github.haifengl" %% "smile-scala" % "2.5.3"
    

Some algorithms rely on BLAS and LAPACK (e.g. manifold learning, some clustering algorithms, Gaussian process regression, and MLP). To use these algorithms, you should include OpenBLAS for optimized matrix computation:


    libraryDependencies ++= Seq(
      "org.bytedeco" % "javacpp"   % "1.5.3"       classifier "macosx-x86_64" classifier "windows-x86_64" classifier "linux-x86_64" classifier "linux-arm64" classifier "linux-ppc64le" classifier "android-arm64" classifier "ios-arm64",
      "org.bytedeco" % "openblas"  % "0.3.9-1.5.3" classifier "macosx-x86_64" classifier "windows-x86_64" classifier "linux-x86_64" classifier "linux-arm64" classifier "linux-ppc64le" classifier "android-arm64" classifier "ios-arm64",
      "org.bytedeco" % "arpack-ng" % "3.7.0-1.5.3" classifier "macosx-x86_64" classifier "windows-x86_64" classifier "linux-x86_64" classifier "linux-arm64" classifier "linux-ppc64le" classifier ""
    )
    

This example includes all supported 64-bit platforms and leaves out the 32-bit ones. To save space, include only the platforms you need.

If you prefer other BLAS implementations, you can use any library found on the "java.library.path" or on the class path, by specifying it with the "org.bytedeco.openblas.load" system property. For example, to use the BLAS library from the Accelerate framework on Mac OS X, we can pass options such as `-Djava.library.path=/usr/lib/ -Dorg.bytedeco.openblas.load=blas`.

For a default installation of MKL, that would be `-Dorg.bytedeco.openblas.load=mkl_rt`. Alternatively, you may simply include the `smile-mkl` module in your project, which bundles the MKL binaries. With the `smile-mkl` module on the class path, Smile automatically switches to MKL.


    libraryDependencies += "com.github.haifengl" %% "smile-mkl" % "2.5.3"
    

Model serialization

To serialize a model, you may use


    write(model, file)
    

This method is in the smile.write object of the Scala API and serializes the model in Java serialization format. This is handy if you want to use the model in Spark.

Alternatively, you can also use


    write.xstream(model, file)
    

which uses the XStream library to serialize the model (or, in fact, any object) to an XML file.

To read the model back, use read(file) or read.xstream(file), respectively.
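
To make "Java serialization format" concrete, here is a minimal sketch in plain Java that round-trips an object through ObjectOutputStream and ObjectInputStream. It is not Smile's implementation: the class SerializationSketch and the stand-in ToyModel are hypothetical names used only for illustration, and the sketch assumes only that the object being saved implements java.io.Serializable.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;

public class SerializationSketch {
    /** Hypothetical stand-in for a trained model; any Serializable object works. */
    public static class ToyModel implements Serializable {
        private static final long serialVersionUID = 1L;
        public final double[] weights;

        public ToyModel(double[] weights) {
            this.weights = weights;
        }
    }

    /** Writes an object to a file in Java serialization format. */
    public static void write(Serializable model, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(model);
        }
    }

    /** Reads an object back; the caller casts it to the expected type. */
    public static Object read(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("model", ".ser");
        write(new ToyModel(new double[] {0.5, -1.25}), file);
        ToyModel restored = (ToyModel) read(file);
        System.out.println(java.util.Arrays.toString(restored.weights));
        Files.delete(file);
    }
}
```

Because the format is the standard JVM one, a file written this way can be read by any JVM process that has the same classes on its class path.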

Data Format

Most Smile algorithms take plain double[] arrays as input, so you can use your favorite method or library to import data as long as the samples end up in double arrays. Smile also provides parsers for popular data formats, such as Weka's ARFF files, LibSVM's file format, delimited text files, and binary sparse data. These classes are in the package smile.data.parser, and smile.io provides high-level operators on top of them. The package smile.data.parser.microarray provides several parsers for microarray gene expression datasets, including GCT, PCL, RES, and TXT files.
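
If you would rather bring your own loader, the sketch below shows one way to turn delimited text into a double[][] using only the Java standard library. The class DoubleArrayLoader and its parse method are hypothetical names, not part of Smile, and the sketch assumes every non-empty line holds only numeric fields.

```java
import java.util.ArrayList;
import java.util.List;

public class DoubleArrayLoader {
    /**
     * Parses delimited text into one double[] per non-empty line.
     * The delimiter is a regular expression, e.g. "\t" or ",".
     */
    public static double[][] parse(String text, String delimiterRegex) {
        List<double[]> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.split(delimiterRegex);
            double[] row = new double[fields.length];
            for (int i = 0; i < fields.length; i++) {
                // Throws NumberFormatException on a non-numeric field
                row[i] = Double.parseDouble(fields[i].trim());
            }
            rows.add(row);
        }
        return rows.toArray(new double[0][]);
    }

    public static void main(String[] args) {
        double[][] x = parse("1.0\t2.5\n3.0\t4.0\n", "\t");
        System.out.println(x.length + " samples, " + x[0].length + " features each");
    }
}
```

The resulting array can be passed to any algorithm that accepts double[][] samples.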

Cannot build Smile with Maven

We have moved to SBT to build packages. The Maven pom.xml files were deprecated and then removed in v1.2.0.

Headless Plot

If your environment does not have a display, or you need to generate and save many plots without showing them on the screen, you may run Smile in headless mode.


    bin/smile -Djava.awt.headless=true
    

    bin/jshell.sh -R-Djava.awt.headless=true
    

The following example shows how to save a plot in headless mode.


    val toy = read.csv("data/classification/toy/toy-train.txt", delimiter='\t', header=false)
    val canvas = plot(toy, "V2", "V3", "V1", '.')
    val image = canvas.toBufferedImage(400, 400)
    javax.imageio.ImageIO.write(image, "png", new java.io.File("headless.png"))
    

    import java.awt.Color;
    import smile.io.*;
    import smile.plot.swing.*;
    import org.apache.commons.csv.CSVFormat;

    var toy = Read.csv("data/classification/toy/toy-train.txt", CSVFormat.DEFAULT.withDelimiter('\t'));
    var canvas = ScatterPlot.of(toy, "V2", "V3", "V1", '.').canvas();
    var image = canvas.toBufferedImage(400, 400);
    javax.imageio.ImageIO.write(image, "png", new java.io.File("headless.png"));
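
The last two lines of the examples above work without a display because rendering to a BufferedImage happens entirely off screen. As a rough illustration that uses only the JDK (no Smile classes), the following sketch sets the headless property programmatically, draws into an image, and writes it as a PNG. The class name HeadlessImageSketch and the drawing itself are made up for the example.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.GraphicsEnvironment;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

public class HeadlessImageSketch {
    /** Renders a white canvas with one diagonal line into an off-screen image. */
    public static BufferedImage render(int width, int height) {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = image.createGraphics();
        try {
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, width, height);
            g.setColor(Color.BLUE);
            g.drawLine(0, height - 1, width - 1, 0);
        } finally {
            g.dispose(); // release the graphics context
        }
        return image;
    }

    public static void main(String[] args) throws IOException {
        // Equivalent to passing -Djava.awt.headless=true; set it before AWT is first used.
        System.setProperty("java.awt.headless", "true");
        System.out.println("headless: " + GraphicsEnvironment.isHeadless());

        File out = File.createTempFile("headless-sketch", ".png");
        ImageIO.write(render(400, 400), "png", out);
        System.out.println("wrote " + out.getAbsolutePath());
    }
}
```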
          

How can I set the random number generator seed for random forest?

This is a common question for stochastic algorithms like random forest. In general, setting the seed is discouraged because people often choose a bad seed for lack of sufficient knowledge of random number generation. However, one may want repeatable results for testing purposes. In that case, call smile.math.MathEx.setSeed before training the model.

Note that we don't provide a method to set the seed for a particular algorithm. Many algorithms are multithreaded, and each thread has its own random number generator. We chose this design because each random number generator maintains internal state and is therefore not thread-safe. If multiple threads shared a random number generator, we would have to use locks, which would significantly reduce performance.

A setSeed() method on the algorithm would also be troublesome. For algorithms like random forest, it is not right to initialize every thread with the same seed; otherwise, identical decision trees would be created and we would lose the randomness of the "random" forest. It is also complicated to pass a sequence of random numbers because, for many algorithms, it is not clear how many random number generators are needed. Even worse, it breaks encapsulation, as the caller has to know the details of the algorithm.
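
The same-seed problem is easy to see with java.util.Random. The sketch below is a generic illustration, not how Smile seeds its threads: SeedSketch, draw, and workerSeeds are hypothetical names. It shows that two generators created with the same seed produce identical streams, and that deriving one seed per worker from a single master seed is one way to get streams that are repeatable yet distinct.

```java
import java.util.Arrays;
import java.util.Random;

public class SeedSketch {
    /** Draws n values in [0, 100) from a generator created with the given seed. */
    public static int[] draw(long seed, int n) {
        Random rng = new Random(seed);
        int[] out = new int[n];
        for (int i = 0; i < n; i++) {
            out[i] = rng.nextInt(100);
        }
        return out;
    }

    /** Derives one seed per worker from a master seed; the same master seed gives the same seeds. */
    public static long[] workerSeeds(long masterSeed, int workers) {
        Random master = new Random(masterSeed);
        long[] seeds = new long[workers];
        for (int i = 0; i < workers; i++) {
            seeds[i] = master.nextLong();
        }
        return seeds;
    }

    public static void main(String[] args) {
        // Two "threads" seeded identically would each build the same tree.
        System.out.println("same seed, same stream: "
                + Arrays.equals(draw(42L, 20), draw(42L, 20)));

        // Per-worker seeds derived from one master seed.
        long[] seeds = workerSeeds(42L, 2);
        System.out.println("derived seeds, same stream: "
                + Arrays.equals(draw(seeds[0], 20), draw(seeds[1], 20)));
    }
}
```

Note that the caller of workerSeeds still has to know how many workers there are, which is the encapsulation problem described above.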
