Data Processing

Machine learning is all about building models from data. However, data scientists frequently talk about models and algorithms first, which is likely to produce suboptimal results. The other approach is to play with the data first. Even simple statistics and plots can give us a feel for the data and the problem, which is more likely to lead to better modelling.

Features

A feature is an individual measurable property of a phenomenon being observed. Features are also called explanatory variables, independent variables, predictors, regressors, etc. Any attribute could be a feature, but choosing informative, discriminating and independent features is a crucial step for effective algorithms in machine learning. Features are usually numeric and a set of numeric features can be conveniently described by a feature vector. Structural features such as strings, sequences and graphs are also used in areas such as natural language processing, computational biology, etc.

Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Feature engineering is fundamental to the application of machine learning, and is both difficult and expensive. It requires experimenting with multiple possibilities and combining automated techniques with the intuition and knowledge of the domain expert.

Data Type

Generally speaking, there are two major types of attributes:

Qualitative variables:

The data values are non-numeric categories. Examples: Blood type, Gender.

Quantitative variables:

The data values are counts or numerical measurements. A quantitative variable can be either discrete such as the number of students receiving an 'A' in a class, or continuous such as GPA, salary and so on.

Another way of classifying data is by measurement scale. In statistics, there are four commonly used measurement scales:

Nominal data:

Data values are non-numeric group labels. For example, a Gender variable can be encoded as male = 0 and female = 1.

Ordinal data:

Data values are categorical and may be ranked in some numerically meaningful way. For example, responses from strongly disagree to strongly agree may be coded as 1 to 5.

Continuous data:
  • Interval data: Data values range over a real interval, which can be as large as from negative infinity to positive infinity. The difference between two values is meaningful, but the ratio of two values is not. For example, temperature (in Celsius or Fahrenheit), IQ.
  • Ratio data: Both difference and ratio of two values are meaningful. For example, salary, weight.
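
As a rough illustration of why the scale matters, the sketch below encodes an ordinal Likert response as a rank and then contrasts an interval scale (Celsius) with a ratio scale (Kelvin). The class MeasurementScales, its method names, and the three intermediate Likert labels are assumptions made for this example; none of them come from Smile.

```java
import java.util.List;

// Illustrative only: these names are hypothetical and not part of Smile.
class MeasurementScales {
    // Ordinal scale: the labels have a natural order, so ranks 1..5 are meaningful.
    // The three middle labels are assumed for the example.
    static final List<String> LIKERT = List.of(
        "strongly disagree", "disagree", "neutral", "agree", "strongly agree");

    // Returns the 1-based rank of a Likert label, or -1 if the label is unknown.
    static int likertRank(String label) {
        int index = LIKERT.indexOf(label);
        return index < 0 ? -1 : index + 1;
    }

    // Celsius is an interval scale (arbitrary zero); Kelvin is a ratio scale.
    static double celsiusToKelvin(double celsius) {
        return celsius + 273.15;
    }

    public static void main(String[] args) {
        System.out.println("rank of 'agree': " + likertRank("agree"));

        double a = 10.0, b = 20.0; // degrees Celsius
        // Differences are preserved when the zero point moves.
        System.out.println("difference (C): " + (b - a));
        System.out.println("difference (K): " + (celsiusToKelvin(b) - celsiusToKelvin(a)));
        // Ratios are not: the Celsius ratio has no physical meaning.
        System.out.println("ratio (C): " + (b / a));
        System.out.println("ratio (K): " + (celsiusToKelvin(b) / celsiusToKelvin(a)));
    }
}
```

The difference between the two temperatures is the same on both scales, while the ratio changes with the zero point, which is why ratios of interval data should not be interpreted.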

Many machine learning algorithms can only handle numeric attributes, while a few, such as decision trees, can process nominal attributes directly. Date attributes are useful in plotting. With some feature engineering, values such as the day of the week can be used as nominal attributes. String attributes can be used in text mining and natural language processing.
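
For instance, a day-of-week attribute could be derived from a timestamp with the standard java.time API, as in the following minimal sketch. The class DayOfWeekFeature and its helper methods are hypothetical names for illustration, and the timestamp format is assumed to be yyyy-MM-dd HH:mm:ss.

```java
import java.time.DayOfWeek;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Illustrative only: these names are hypothetical and not part of Smile.
class DayOfWeekFeature {
    // Assumed timestamp layout, e.g. "2001-04-03 12:12:12".
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss");

    // Parses a timestamp and returns its day of week, a nominal value with seven levels.
    static DayOfWeek dayOfWeek(String timestamp) {
        return LocalDateTime.parse(timestamp, FORMAT).getDayOfWeek();
    }

    // Encodes the day as an integer code in 0..6 (Monday = 0).
    // The code is only a label; arithmetic on it is not meaningful.
    static int dayCode(String timestamp) {
        return dayOfWeek(timestamp).ordinal();
    }

    public static void main(String[] args) {
        String[] timestamps = {"2001-04-03 12:12:12", "2001-05-03 12:59:55"};
        for (String t : timestamps) {
            System.out.println(t + " -> " + dayOfWeek(t) + " (code " + dayCode(t) + ")");
        }
    }
}
```

Treating the resulting code as nominal rather than numeric avoids implying that, say, Sunday is "six times" Tuesday.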

DataFrame

Many Smile algorithms take a simple double[] as input, but Smile also provides the encapsulation class DataFrame. In fact, the output of most Smile data parsers is a DataFrame object that contains a number of named columns.


    smile> val iris = read.arff("data/weka/iris.arff")
    [main] INFO smile.io.Arff - Read ARFF relation iris
    iris: DataFrame =
    +-----------+----------+-----------+----------+-----------+
    |sepallength|sepalwidth|petallength|petalwidth|      class|
    +-----------+----------+-----------+----------+-----------+
    |        5.1|       3.5|        1.4|       0.2|Iris-setosa|
    |        4.9|         3|        1.4|       0.2|Iris-setosa|
    |        4.7|       3.2|        1.3|       0.2|Iris-setosa|
    |        4.6|       3.1|        1.5|       0.2|Iris-setosa|
    |          5|       3.6|        1.4|       0.2|Iris-setosa|
    |        5.4|       3.9|        1.7|       0.4|Iris-setosa|
    |        4.6|       3.4|        1.4|       0.3|Iris-setosa|
    |          5|       3.4|        1.5|       0.2|Iris-setosa|
    |        4.4|       2.9|        1.4|       0.2|Iris-setosa|
    |        4.9|       3.1|        1.5|       0.1|Iris-setosa|
    +-----------+----------+-----------+----------+-----------+
    140 more rows...

    smile> iris.summary
    res1: DataFrame =
    +-----------+-----+---+--------+---+
    |     column|count|min|     avg|max|
    +-----------+-----+---+--------+---+
    |sepallength|  150|4.3|5.843333|7.9|
    | sepalwidth|  150|  2|   3.054|4.4|
    |petallength|  150|  1|3.758667|6.9|
    | petalwidth|  150|0.1|1.198667|2.5|
    +-----------+-----+---+--------+---+
    

    smile> import smile.data.*

    smile> import smile.io.*

    smile> var iris = Read.arff("data/weka/iris.arff")
    [main] INFO smile.io.Arff - Read ARFF relation iris
    $3 ==> [sepallength: float, sepalwidth: float, petallength: float, petalwidth: float, class: byte nominal[Iris-setosa, Iris-versicolor, Iris-virginica]]
    +-----------+----------+-----------+----------+-----------+
    |sepallength|sepalwidth|petallength|petalwidth|      class|
    +-----------+----------+-----------+----------+-----------+
    |        5.1|       3.5|        1.4|       0.2|Iris-setosa|
    |        4.9|         3|        1.4|       0.2|Iris-setosa|
    |        4.7|       3.2|        1.3|       0.2|Iris-setosa|
    |        4.6|       3.1|        1.5|       0.2|Iris-setosa|
    |          5|       3.6|        1.4|       0.2|Iris-setosa|
    |        5.4|       3.9|        1.7|       0.4|Iris-setosa|
    |        4.6|       3.4|        1.4|       0.3|Iris-setosa|
    |          5|       3.4|        1.5|       0.2|Iris-setosa|
    |        4.4|       2.9|        1.4|       0.2|Iris-setosa|
    |        4.9|       3.1|        1.5|       0.1|Iris-setosa|
    +-----------+----------+-----------+----------+-----------+
    140 more rows...

    smile> iris.summary()
    $5 ==> [column: String, count: long, min: double, avg: double, max: double]
    +-----------+-----+---+--------+---+
    |     column|count|min|     avg|max|
    +-----------+-----+---+--------+---+
    |sepallength|  150|4.3|5.843333|7.9|
    | sepalwidth|  150|  2|   3.054|4.4|
    |petallength|  150|  1|3.758667|6.9|
    | petalwidth|  150|0.1|1.198667|2.5|
    +-----------+-----+---+--------+---+
    

    >>> import smile.*
    >>> import smile.data.*
    >>> import smile.io.*
    >>> val iris = Read.arff("data/weka/iris.arff")
    [main] INFO smile.io.Arff - Read ARFF relation iris
    >>> iris
    res3: smile.data.DataFrame! = [sepallength: float, sepalwidth: float, petallength: float, petalwidth: float, class: byte nominal[Iris-setosa, Iris-versicolor, Iris-virginica]]
    +-----------+----------+-----------+----------+-----------+
    |sepallength|sepalwidth|petallength|petalwidth|      class|
    +-----------+----------+-----------+----------+-----------+
    |        5.1|       3.5|        1.4|       0.2|Iris-setosa|
    |        4.9|         3|        1.4|       0.2|Iris-setosa|
    |        4.7|       3.2|        1.3|       0.2|Iris-setosa|
    |        4.6|       3.1|        1.5|       0.2|Iris-setosa|
    |          5|       3.6|        1.4|       0.2|Iris-setosa|
    |        5.4|       3.9|        1.7|       0.4|Iris-setosa|
    |        4.6|       3.4|        1.4|       0.3|Iris-setosa|
    |          5|       3.4|        1.5|       0.2|Iris-setosa|
    |        4.4|       2.9|        1.4|       0.2|Iris-setosa|
    |        4.9|       3.1|        1.5|       0.1|Iris-setosa|
    +-----------+----------+-----------+----------+-----------+
    140 more rows...

    >>> iris.summary()
    res4: smile.data.DataFrame! = [column: String, count: long, min: double, avg: double, max: double]
    +-----------+-----+---+--------+---+
    |     column|count|min|     avg|max|
    +-----------+-----+---+--------+---+
    |sepallength|  150|4.3|5.843333|7.9|
    | sepalwidth|  150|  2|   3.054|4.4|
    |petallength|  150|  1|3.758667|6.9|
    | petalwidth|  150|0.1|1.198667|2.5|
    +-----------+-----+---+--------+---+
    

We can get a row with the array syntax or refer to a column by its name.


    smile> iris(0)
    res5: Tuple = {
      sepallength: 5.1,
      sepalwidth: 3.5,
      petallength: 1.4,
      petalwidth: 0.2,
      class: Iris-setosa
    }

    smile> iris("sepallength")
    res6: vector.BaseVector[T, TS, S] = [5.099999904632568, 4.900000095367432, 4.699999809265137, 4.599999904632568, 5.0, 5.400000095367432, 4.599999904632568, 5.0, 4.400000095367432, 4.900000095367432, ... 140 more]
    

    smile> iris.get(0)
    $7 ==> {
      sepallength: 5.1,
      sepalwidth: 3.5,
      petallength: 1.4,
      petalwidth: 0.2,
      class: Iris-setosa
    }

    smile> iris.column("sepallength")
    $8 ==> [5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6, 5, 4.4, 4.9, ... 140 more]
          

    >>> iris[0]
    res6: smile.data.Tuple! = {
      sepallength: 5.1,
      sepalwidth: 3.5,
      petallength: 1.4,
      petalwidth: 0.2,
      class: Iris-setosa
    }
    >>> iris.column("sepallength")
    res7: smile.data.vector.BaseVector<(raw) kotlin.Any!, (raw) kotlin.Any!, (raw) java.util.stream.BaseStream<*, *>!>! = [5.099999904632568, 4.900000095367432, 4.699999809265137, 4.599999904632568, 5.0, 5.400000095367432, 4.599999904632568, 5.0, 4.400000095367432, 4.900000095367432, ... 140 more]
    

Similarly, we can select a few columns to create a new data frame.


    smile> iris.select("sepallength", "sepalwidth")
    res8: DataFrame =
    +-----------+----------+
    |sepallength|sepalwidth|
    +-----------+----------+
    |        5.1|       3.5|
    |        4.9|         3|
    |        4.7|       3.2|
    |        4.6|       3.1|
    |          5|       3.6|
    |        5.4|       3.9|
    |        4.6|       3.4|
    |          5|       3.4|
    |        4.4|       2.9|
    |        4.9|       3.1|
    +-----------+----------+
    140 more rows...
    

    smile> iris.select("sepallength", "sepalwidth")
    $9 ==> [sepallength: float, sepalwidth: float]
    +-----------+----------+
    |sepallength|sepalwidth|
    +-----------+----------+
    |        5.1|       3.5|
    |        4.9|         3|
    |        4.7|       3.2|
    |        4.6|       3.1|
    |          5|       3.6|
    |        5.4|       3.9|
    |        4.6|       3.4|
    |          5|       3.4|
    |        4.4|       2.9|
    |        4.9|       3.1|
    +-----------+----------+
    140 more rows...
          

    >>> iris.select("sepallength", "sepalwidth")
    res8: smile.data.DataFrame! = [sepallength: float, sepalwidth: float]
    +-----------+----------+
    |sepallength|sepalwidth|
    +-----------+----------+
    |        5.1|       3.5|
    |        4.9|         3|
    |        4.7|       3.2|
    |        4.6|       3.1|
    |          5|       3.6|
    |        5.4|       3.9|
    |        4.6|       3.4|
    |          5|       3.4|
    |        4.4|       2.9|
    |        4.9|       3.1|
    +-----------+----------+
    140 more rows...
    

Advanced operations such as exists, forall, find, and filter are also supported. In the Java API, all these operations are on Stream. The corresponding methods are anyMatch, allMatch, findAny, and filter. The predicates of these functions expect a Tuple.


    smile> iris.exists(_.getDouble(0) > 4.5)
    res16: Boolean = true

    smile> iris.forall(_.getDouble(0) < 10)
    res17: Boolean = true

    smile> iris.find(_("class") == 1)
    res18: java.util.Optional[Tuple] = Optional[{
      sepallength: 6.2,
      sepalwidth: 2.9,
      petallength: 4.3,
      petalwidth: 1.3,
      class: Iris-versicolor
    }]

    smile> iris.find(_.getString("class").equals("Iris-versicolor"))
    res19: java.util.Optional[Tuple] = Optional[{
      sepallength: 6.2,
      sepalwidth: 2.9,
      petallength: 4.3,
      petalwidth: 1.3,
      class: Iris-versicolor
    }]

    smile> iris.filter { row => row.getDouble(1) > 3 && row("class") != 0 }
    res20: DataFrame =
    +-----------+----------+-----------+----------+---------------+
    |sepallength|sepalwidth|petallength|petalwidth|          class|
    +-----------+----------+-----------+----------+---------------+
    |          7|       3.2|        4.7|       1.4|Iris-versicolor|
    |        6.4|       3.2|        4.5|       1.5|Iris-versicolor|
    |        6.9|       3.1|        4.9|       1.5|Iris-versicolor|
    |        6.3|       3.3|        4.7|       1.6|Iris-versicolor|
    |        6.7|       3.1|        4.4|       1.4|Iris-versicolor|
    |        5.9|       3.2|        4.8|       1.8|Iris-versicolor|
    |          6|       3.4|        4.5|       1.6|Iris-versicolor|
    |        6.7|       3.1|        4.7|       1.5|Iris-versicolor|
    |        6.3|       3.3|          6|       2.5| Iris-virginica|
    |        7.2|       3.6|        6.1|       2.5| Iris-virginica|
    +-----------+----------+-----------+----------+---------------+
    15 more rows...
    

    smile> iris.stream().anyMatch(row -> row.getDouble(0) > 4.5)
    $14 ==> true

    smile> iris.stream().allMatch(row -> row.getDouble(0) < 10)
    $15 ==> true

    smile> iris.stream().filter(row -> row.getByte("class") == 1).findAny()
    $17 ==> Optional[{
      sepallength: 6.2,
      sepalwidth: 2.9,
      petallength: 4.3,
      petalwidth: 1.3,
      class: Iris-versicolor
    }]

    smile> iris.stream().filter(row -> row.getString("class").equals("Iris-versicolor")).findAny()
    $18 ==> Optional[{
      sepallength: 6.2,
      sepalwidth: 2.9,
      petallength: 4.3,
      petalwidth: 1.3,
      class: Iris-versicolor
    }]

    smile> DataFrame.of(iris.stream().filter(row -> row.getDouble(1) > 3 && row.getByte("class") != 0))
    $20 ==> [sepallength: float, sepalwidth: float, petallength: float, petalwidth: float, class: byte nominal[Iris-setosa, Iris-versicolor, Iris-virginica]]
    +-----------+----------+-----------+----------+---------------+
    |sepallength|sepalwidth|petallength|petalwidth|          class|
    +-----------+----------+-----------+----------+---------------+
    |          7|       3.2|        4.7|       1.4|Iris-versicolor|
    |        6.4|       3.2|        4.5|       1.5|Iris-versicolor|
    |        6.9|       3.1|        4.9|       1.5|Iris-versicolor|
    |        6.3|       3.3|        4.7|       1.6|Iris-versicolor|
    |        6.7|       3.1|        4.4|       1.4|Iris-versicolor|
    |          6|       3.4|        4.5|       1.6|Iris-versicolor|
    |        6.7|       3.1|        4.7|       1.5|Iris-versicolor|
    |        6.3|       3.3|          6|       2.5| Iris-virginica|
    |        7.2|       3.6|        6.1|       2.5| Iris-virginica|
    +-----------+----------+-----------+----------+---------------+
    15 more rows...
          

    >>> iris.stream().anyMatch({row -> row.getDouble(0) > 4.5})
    res10: kotlin.Boolean = true
    >>> iris.stream().allMatch({row -> row.getDouble(0) < 10})
    res11: kotlin.Boolean = true
    >>> iris.stream().filter({row -> row.getByte("class") == 1.toByte()}).findAny()
    res14: java.util.Optional<smile.data.Tuple!>! = Optional[{
      sepallength: 6.2,
      sepalwidth: 2.9,
      petallength: 4.3,
      petalwidth: 1.3,
      class: Iris-versicolor
    }]
    >>> iris.stream().filter({row -> row.getString("class").equals("Iris-versicolor")}).findAny()
    res15: java.util.Optional<smile.data.Tuple!>! = Optional[{
      sepallength: 5.4,
      sepalwidth: 3,
      petallength: 4.5,
      petalwidth: 1.5,
      class: Iris-versicolor
    }]
    >>> DataFrame.of(iris.stream().filter({row -> row.getDouble(1) > 3 && row.getByte("class") != 0.toByte()}))
    res22: smile.data.DataFrame! = [sepallength: float, sepalwidth: float, petallength: float, petalwidth: float, class: byte nominal[Iris-setosa, Iris-versicolor, Iris-virginica]]
    +-----------+----------+-----------+----------+---------------+
    |sepallength|sepalwidth|petallength|petalwidth|          class|
    +-----------+----------+-----------+----------+---------------+
    |          7|       3.2|        4.7|       1.4|Iris-versicolor|
    |        6.4|       3.2|        4.5|       1.5|Iris-versicolor|
    |        6.9|       3.1|        4.9|       1.5|Iris-versicolor|
    |        6.3|       3.3|        4.7|       1.6|Iris-versicolor|
    |        6.7|       3.1|        4.4|       1.4|Iris-versicolor|
    |        5.9|       3.2|        4.8|       1.8|Iris-versicolor|
    |          6|       3.4|        4.5|       1.6|Iris-versicolor|
    |        6.7|       3.1|        4.7|       1.5|Iris-versicolor|
    |        6.3|       3.3|          6|       2.5| Iris-virginica|
    |        7.2|       3.6|        6.1|       2.5| Iris-virginica|
    +-----------+----------+-----------+----------+---------------+
    15 more rows...
    

Besides numeric and nominal values, many other data types are also supported.


    smile> val strings = read.arff("data/weka/string.arff")
    [main] INFO smile.io.Arff - Read ARFF relation LCCvsLCSH
    strings: DataFrame =
    +-----+--------------------------------------+
    |  LCC|                                  LCSH|
    +-----+--------------------------------------+
    |  AG5|Encyclopedias and dictionaries.;Twe...|
    |AS262|   Science -- Soviet Union -- History.|
    |  AE5|       Encyclopedias and dictionaries.|
    |AS281|Astronomy, Assyro-Babylonian.;Moon ...|
    |AS281|Astronomy, Assyro-Babylonian.;Moon ...|
    +-----+--------------------------------------+

    smile> strings.filter(_.getString(0).startsWith("AS"))
    res21: DataFrame =
    +-----+--------------------------------------+
    |  LCC|                                  LCSH|
    +-----+--------------------------------------+
    |AS262|   Science -- Soviet Union -- History.|
    |AS281|Astronomy, Assyro-Babylonian.;Moon ...|
    |AS281|Astronomy, Assyro-Babylonian.;Moon ...|
    +-----+--------------------------------------+

    smile> val dates = read.arff("data/weka/date.arff")
    [main] INFO smile.io.Arff - Read ARFF relation Timestamps
    dates: DataFrame =
    +-------------------+
    |          timestamp|
    +-------------------+
    |2001-04-03 12:12:12|
    |2001-05-03 12:59:55|
    +-------------------+
    

    smile> var strings = Read.arff("data/weka/string.arff")
    [main] INFO smile.io.Arff - Read ARFF relation LCCvsLCSH
    strings ==> [LCC: String, LCSH: String]
    +-----+--------------------------------------+
    |  LCC|                                  LCSH|
    +-----+--------------------------------------+
    |  AG5|Encyclopedias and dictionaries.;Twe...|
    |AS262|   Science -- Soviet Union -- History.|
    |  AE5|       Encyclopedias and dictionaries.|
    |AS281|Astronomy, Assyro-Babylonian.;Moon ...|
    |AS281|Astronomy, Assyro-Babylonian.;Moon ...|
    +-----+--------------------------------------+

    smile> var dates = Read.arff("data/weka/date.arff")
    [main] INFO smile.io.Arff - Read ARFF relation Timestamps
    dates ==> [timestamp: DateTime]
    +-------------------+
    |          timestamp|
    +-------------------+
    |2001-04-03 12:12:12|
    |2001-05-03 12:59:55|
    +-------------------+
          

    >>> val strings = read.arff("data/weka/string.arff")
    [main] INFO smile.io.Arff - Read ARFF relation LCCvsLCSH
    >>> strings
    res26: smile.data.DataFrame = [LCC: String, LCSH: String]
    +-----+--------------------------------------+
    |  LCC|                                  LCSH|
    +-----+--------------------------------------+
    |  AG5|Encyclopedias and dictionaries.;Twe...|
    |AS262|   Science -- Soviet Union -- History.|
    |  AE5|       Encyclopedias and dictionaries.|
    |AS281|Astronomy, Assyro-Babylonian.;Moon ...|
    |AS281|Astronomy, Assyro-Babylonian.;Moon ...|
    +-----+--------------------------------------+

    >>> val dates = read.arff("data/weka/date.arff") 
    [main] INFO smile.io.Arff - Read ARFF relation Timestamps
    >>> dates
    res28: smile.data.DataFrame = [timestamp: DateTime]
    +-------------------+
    |          timestamp|
    +-------------------+
    |2001-04-03 12:12:12|
    |2001-05-03 12:59:55|
    +-------------------+
    

For data wrangling, the most important functions of DataFrame are map and groupBy.


    smile> iris.map { row =>
                    val x = new Array[Double](6)
                    for (i <- 0 until 4) x(i) = row.getDouble(i)
                    x(4) = x(0) * x(1)
                    x(5) = x(2) * x(3)
                    x
                  }
    res22: Iterable[Array[Double]] = ArrayBuffer(
      Array(5.1, 3.5, 1.4, 0.2, 17.849999999999998, 0.27999999999999997),
      Array(4.9, 3.0, 1.4, 0.2, 14.700000000000001, 0.27999999999999997),
      Array(4.7, 3.2, 1.3, 0.2, 15.040000000000001, 0.26),
      Array(4.6, 3.1, 1.5, 0.2, 14.26, 0.30000000000000004),
      Array(5.0, 3.6, 1.4, 0.2, 18.0, 0.27999999999999997),
      Array(5.4, 3.9, 1.7, 0.4, 21.060000000000002, 0.68),
      Array(4.6, 3.4, 1.4, 0.3, 15.639999999999999, 0.42),
      Array(5.0, 3.4, 1.5, 0.2, 17.0, 0.30000000000000004),
      Array(4.4, 2.9, 1.4, 0.2, 12.76, 0.27999999999999997),
      Array(4.9, 3.1, 1.5, 0.1, 15.190000000000001, 0.15000000000000002),
      Array(5.4, 3.7, 1.5, 0.2, 19.980000000000004, 0.30000000000000004),
      Array(4.8, 3.4, 1.6, 0.2, 16.32, 0.32000000000000006),
      Array(4.8, 3.0, 1.4, 0.1, 14.399999999999999, 0.13999999999999999),
      Array(4.3, 3.0, 1.1, 0.1, 12.899999999999999, 0.11000000000000001),
      Array(5.8, 4.0, 1.2, 0.2, 23.2, 0.24),
      Array(5.7, 4.4, 1.5, 0.4, 25.080000000000002, 0.6000000000000001),
      Array(5.4, 3.9, 1.3, 0.4, 21.060000000000002, 0.52),
      Array(5.1, 3.5, 1.4, 0.3, 17.849999999999998, 0.42),
      Array(5.7, 3.8, 1.7, 0.3, 21.66, 0.51),
      Array(5.1, 3.8, 1.5, 0.3, 19.38, 0.44999999999999996),
      Array(5.4, 3.4, 1.7, 0.2, 18.36, 0.34),
      Array(5.1, 3.7, 1.5, 0.4, 18.87, 0.6000000000000001),
      Array(4.6, 3.6, 1.0, 0.2, 16.56, 0.2),
      Array(5.1, 3.3, 1.7, 0.5, 16.83, 0.85),
    ...
    

    smile> var x6 = iris.stream().map(row -> {
       ...>     var x = new double[6];
       ...>     for (int i = 0; i < 4; i++) x[i] = row.getDouble(i);
       ...>     x[4] = x[0] * x[1];
       ...>     x[5] = x[2] * x[3];
       ...>     return x;
       ...> })
    x6 ==> java.util.stream.ReferencePipeline$3@32eff876

    smile> x6.forEach(xi -> System.out.println(Arrays.toString(xi)))
    [6.199999809265137, 2.9000000953674316, 4.300000190734863, 1.2999999523162842, 17.980000038146954, 5.590000042915335]
    [7.300000190734863, 2.9000000953674316, 6.300000190734863, 1.7999999523162842, 21.170001249313373, 11.340000042915335]
    [7.699999809265137, 3.0, 6.099999904632568, 2.299999952316284, 23.09999942779541, 14.029999489784245]
    [6.699999809265137, 2.5, 5.800000190734863, 1.7999999523162842, 16.749999523162842, 10.440000066757193]
    [7.199999809265137, 3.5999999046325684, 6.099999904632568, 2.5, 25.919998626709003, 15.249999761581421]
    [6.5, 3.200000047683716, 5.099999904632568, 2.0, 20.800000309944153, 10.199999809265137]
    [6.400000095367432, 2.700000047683716, 5.300000190734863, 1.899999976158142, 17.28000056266785, 10.070000236034389]
    [5.699999809265137, 2.5999999046325684, 3.5, 1.0, 14.819998960495013, 3.5]
    [4.599999904632568, 3.5999999046325684, 1.0, 0.20000000298023224, 16.55999921798707, 0.20000000298023224]
    [5.400000095367432, 3.0, 4.5, 1.5, 16.200000286102295, 6.75]
    [6.699999809265137, 3.0999999046325684, 4.400000095367432, 1.399999976158142, 20.76999876976015, 6.160000028610227]
    [5.099999904632568, 3.799999952316284, 1.600000023841858, 0.20000000298023224, 19.379999394416814, 0.32000000953674324]
    [5.599999904632568, 3.0, 4.5, 1.5, 16.799999713897705, 6.75]
    [6.0, 3.4000000953674316, 4.5, 1.600000023841858, 20.40000057220459, 7.200000107288361]
    [5.099999904632568, 3.299999952316284, 1.7000000476837158, 0.5, 16.82999944210053, 0.8500000238418579]
    [5.5, 2.4000000953674316, 3.799999952316284, 1.100000023841858, 13.200000524520874, 4.1800000381469715]
    [7.099999904632568, 3.0, 5.900000095367432, 2.0999999046325684, 21.299999713897705, 12.38999963760375]
    [6.300000190734863, 3.4000000953674316, 5.599999904632568, 2.4000000953674316, 21.420001249313373, 13.440000305175772]
    [5.099999904632568, 2.5, 3.0, 1.100000023841858, 12.749999761581421, 3.3000000715255737]
    [6.400000095367432, 3.0999999046325684, 5.5, 1.7999999523162842, 19.839999685287466, 9.899999737739563]
    [6.300000190734863, 2.9000000953674316, 5.599999904632568, 1.7999999523162842, 18.27000115394594, 10.079999561309819]
    [5.5, 2.4000000953674316, 3.700000047683716, 1.0, 13.200000524520874, 3.700000047683716]
    [6.5, 3.0, 5.800000190734863, 2.200000047683716, 19.5, 12.76000069618226]
    [7.599999904632568, 3.0, 6.599999904632568, 2.0999999046325684, 22.799999713897705, 13.859999170303354]
    [4.900000095367432, 2.5, 4.5, 1.7000000476837158, 12.250000238418579, 7.650000214576721]
    [5.0, 2.299999952316284, 3.299999952316284, 1.0, 11.499999761581421, 3.299999952316284]
    [5.599999904632568, 2.700000047683716, 4.199999809265137, 1.2999999523162842, 15.120000009536739, 5.45999955177308]
    ...
          

    >>>  val x6 = iris.stream().map({row ->
    ...            val x = DoubleArray(6)
    ...            for (i in 0..3) x[i] = row.getDouble(i)
    ...            x[4] = x[0] * x[1]
    ...            x[5] = x[2] * x[3]
    ...            x
    ...        })
    >>> x6.forEach({xi: DoubleArray -> println(java.util.Arrays.toString(xi))})
    [5.699999809265137, 2.5999999046325684, 3.5, 1.0, 14.819998960495013, 3.5]
    [6.699999809265137, 3.0999999046325684, 4.400000095367432, 1.399999976158142, 20.76999876976015, 6.160000028610227]
    [5.400000095367432, 3.0, 4.5, 1.5, 16.200000286102295, 6.75]
    [5.5, 2.4000000953674316, 3.799999952316284, 1.100000023841858, 13.200000524520874, 4.1800000381469715]
    [5.599999904632568, 3.0, 4.5, 1.5, 16.799999713897705, 6.75]
    [4.900000095367432, 3.0999999046325684, 1.5, 0.10000000149011612, 15.189999828338614, 0.15000000223517418]
    [4.599999904632568, 3.5999999046325684, 1.0, 0.20000000298023224, 16.55999921798707, 0.20000000298023224]
    [7.699999809265137, 3.0, 6.099999904632568, 2.299999952316284, 23.09999942779541, 14.029999489784245]
    [5.400000095367432, 3.700000047683716, 1.5, 0.20000000298023224, 19.980000610351567, 0.30000000447034836]
    [5.800000190734863, 2.700000047683716, 4.099999904632568, 1.0, 15.660000791549692, 4.099999904632568]
    [6.300000190734863, 3.4000000953674316, 5.599999904632568, 2.4000000953674316, 21.420001249313373, 13.440000305175772]
    [6.0, 3.4000000953674316, 4.5, 1.600000023841858, 20.40000057220459, 7.200000107288361]
    [6.199999809265137, 2.200000047683716, 4.5, 1.5, 13.63999987602233, 6.75]
    [6.400000095367432, 3.0999999046325684, 5.5, 1.7999999523162842, 19.839999685287466, 9.899999737739563]
    [6.699999809265137, 3.0999999046325684, 4.699999809265137, 1.5, 20.76999876976015, 7.049999713897705]
    [5.5, 2.4000000953674316, 3.700000047683716, 1.0, 13.200000524520874, 3.700000047683716]
    [5.099999904632568, 3.799999952316284, 1.600000023841858, 0.20000000298023224, 19.379999394416814, 0.32000000953674324]
    [6.199999809265137, 2.9000000953674316, 4.300000190734863, 1.2999999523162842, 17.980000038146954, 5.590000042915335]
    [6.300000190734863, 2.299999952316284, 4.400000095367432, 1.2999999523162842, 14.490000138282767, 5.719999914169307]
    [5.800000190734863, 2.700000047683716, 3.9000000953674316, 1.2000000476837158, 15.660000791549692, 4.680000300407414]
    [6.0, 3.0, 4.800000190734863, 1.7999999523162842, 18.0, 8.640000114440909]
    [5.599999904632568, 2.5, 3.9000000953674316, 1.100000023841858, 13.999999761581421, 4.290000197887423]
    [4.800000190734863, 3.4000000953674316, 1.600000023841858, 0.20000000298023224, 16.320001106262225, 0.32000000953674324]
    [6.900000095367432, 3.0999999046325684, 5.400000095367432, 2.0999999046325684, 21.38999963760375, 11.339999685287466]
    [5.900000095367432, 3.200000047683716, 4.800000190734863, 1.7999999523162842, 18.88000058650971, 8.640000114440909]
    [4.800000190734863, 3.0, 1.399999976158142, 0.10000000149011612, 14.40000057220459, 0.13999999970197674]
    [5.099999904632568, 3.299999952316284, 1.7000000476837158, 0.5, 16.82999944210053, 0.8500000238418579]
    [6.099999904632568, 2.799999952316284, 4.0, 1.2999999523162842, 17.07999944210053, 5.199999809265137]
    [7.900000095367432, 3.799999952316284, 6.400000095367432, 2.0, 30.01999998569488, 12.800000190734863]
    [6.0, 2.700000047683716, 5.099999904632568, 1.600000023841858, 16.200000286102295, 8.159999969005582]
    [6.400000095367432, 2.799999952316284, 5.599999904632568, 2.200000047683716, 17.919999961853023, 12.320000057220454]
    [6.599999904632568, 3.0, 4.400000095367432, 1.399999976158142, 19.799999713897705, 6.160000028610227]
    ...
    

The groupBy operation groups elements according to a classification function and returns the results in a Map. The classification function maps elements to some key type K. The collector produces a map whose keys are the values resulting from applying the classification function to the input elements, and whose corresponding values are Lists containing the input elements which map to the associated key under the classification function.


    smile> iris.groupBy(row => row.getString("class"))
    res23: Map[String, DataFrame] = Map(
      "Iris-virginica" ->
    +-----------+----------+-----------+----------+--------------+
    |sepallength|sepalwidth|petallength|petalwidth|         class|
    +-----------+----------+-----------+----------+--------------+
    |        6.3|       3.3|          6|       2.5|Iris-virginica|
    |        5.8|       2.7|        5.1|       1.9|Iris-virginica|
    |        7.1|         3|        5.9|       2.1|Iris-virginica|
    |        6.3|       2.9|        5.6|       1.8|Iris-virginica|
    |        6.5|         3|        5.8|       2.2|Iris-virginica|
    |        7.6|         3|        6.6|       2.1|Iris-virginica|
    |        4.9|       2.5|        4.5|       1.7|Iris-virginica|
    |        7.3|       2.9|        6.3|       1.8|Iris-virginica|
    |        6.7|       2.5|        5.8|       1.8|Iris-virginica|
    |        7.2|       3.6|        6.1|       2.5|Iris-virginica|
    +-----------+----------+-----------+----------+--------------+
    40 more rows...
    ,
      "Iris-versicolor" ->
    +-----------+----------+-----------+----------+---------------+
    |sepallength|sepalwidth|petallength|petalwidth|          class|
    +-----------+----------+-----------+----------+---------------+
    |          7|       3.2|        4.7|       1.4|Iris-versicolor|
    |        6.4|       3.2|        4.5|       1.5|Iris-versicolor|
    |        6.9|       3.1|        4.9|       1.5|Iris-versicolor|
    ...
    

    smile> iris.stream().collect(java.util.stream.Collectors.groupingBy(row -> row.getString("class")))
    $24 ==> {Iris-versicolor=[{
      sepallength: 7,
      sepalwidth: 3.2,
      petallength: 4.7,
      petalwidth: 1.4,
      class: Iris-versicolor
    }, {
      sepallength: 6.4,
      sepalwidth: 3.2,
      petallength: 4.5,
      petalwidth: 1.5,
      class: Iris-versicolor
    }, {
      sepallength: 6.9,
      sepalwidth: 3.1,
      petallength: 4.9,
      petalwidth: 1.5,
      class: Iris-versicolor
    }, {
      sepallength: 5.5,
      sepalwidth: 2.3,
      petallength: 4,
      petalwidth: 1.3,
      class: Iris-versicolor
    }, {
      sepallength: 6.5,
      sepalwidth: 2.8,
      petallength: 4.6,
      petalwidth: 1.5,
      class: Iris-versicolor
    }, {
      sepallength: 5.7,
      sepalwidth: 2.8,
      petallength: 4.5,
      petalwidth: 1.3,
      class: Iris-versicolor
    },  ...  class: Iris-setosa
    }, {
      sepallength: 4.6,
      sepalwidth: 3.2,
      petallength: 1.4,
      petalwidth: 0.2,
      class: Iris-setosa
    }, {
      sepallength: 5.3,
      sepalwidth: 3.7,
      petallength: 1.5,
      petalwidth: 0.2,
      class: Iris-setosa
    }, {
      sepallength: 5,
      sepalwidth: 3.3,
      petallength: 1.4,
      petalwidth: 0.2,
      class: Iris-setosa
    }]}
          

    >>> iris.stream().collect(java.util.stream.Collectors.groupingBy({row: Tuple -> row.getString("class")}))
    res98: kotlin.collections.(Mutable)Map<kotlin.String!, kotlin.collections.(Mutable)List<smile.data.Tuple!>!>! = {Iris-versicolor=[{
      sepallength: 7,
      sepalwidth: 3.2,
      petallength: 4.7,
      petalwidth: 1.4,
      class: Iris-versicolor
    }, {
      sepallength: 6.4,
      sepalwidth: 3.2,
      petallength: 4.5,
      petalwidth: 1.5,
      class: Iris-versicolor
    }, {
      sepallength: 6.9,
      sepalwidth: 3.1,
      petallength: 4.9,
      petalwidth: 1.5,
      class: Iris-versicolor
    }, {
      sepallength: 5.5,
      sepalwidth: 2.3,
      petallength: 4,
      petalwidth: 1.3,
      class: Iris-versicolor
    }, {
      sepallength: 6.5,
      sepalwidth: 2.8,
      petallength: 4.6,
      petalwidth: 1.5,
      class: Iris-versicolor
    }, {
      sepallength: 5.7,
      sepalwidth: 2.8,
      petallength: 4.5,
      petalwidth: 1.3,
      class: Iris-versicolor
    }, {
      sepallength: 6.3,
      sepalwidth: 3.3,
      petallength: 4.7,
      petalwidth: 1.6,
      class: Iris-versicolor
    }, {
      sepallength: 4.9,
      sepalwidth: 2.4,
      petallength: 3.3,
      petalwidth: 1,
      class: Iris-versicolor
    }, {
    ...
    

SQL

While Smile provides many imperative ways to manipulate DataFrames, as shown above, it is often easier to do so with SQL.


    smile> SQL sql = new SQL();
      ...> sql.parquet("user", "data/kylo/userdata1.parquet");
      ...> sql.json("books", "data/kylo/books_array.json");
      ...> sql.csv("gdp", "data/regression/gdp.csv");
      ...> sql.csv("diabetes", "data/regression/diabetes.csv");

    smile> var tables = sql.tables();
    tables ==> [TABLE_NAME: String, REMARKS: String]
    +----------+-------+
    |TABLE_NAME|REMARKS|
    +----------+-------+
    |     books|   null|
    |  diabetes|   null|
    |       gdp|   null|
    |      user|   null|
    +----------+-------+

    smile> var columns = sql.describe("user");
    columns ==> [COLUMN_NAME: String, TYPE_NAME: String, IS_NULLABLE: String]
    +-----------------+---------+-----------+
    |      COLUMN_NAME|TYPE_NAME|IS_NULLABLE|
    +-----------------+---------+-----------+
    |registration_dttm|TIMESTAMP|        YES|
    |               id|  INTEGER|        YES|
    |       first_name|  VARCHAR|        YES|
    |        last_name|  VARCHAR|        YES|
    |            email|  VARCHAR|        YES|
    |           gender|  VARCHAR|        YES|
    |       ip_address|  VARCHAR|        YES|
    |               cc|  VARCHAR|        YES|
    |          country|  VARCHAR|        YES|
    |        birthdate|  VARCHAR|        YES|
    +-----------------+---------+-----------+
    3 more rows...
    

In the above, we create a database and register four tables by loading Parquet, JSON, and CSV files. We also use the describe function to obtain the schema of the table user. With SQL, it is easy to filter data, and the result is a DataFrame.


    smile> var user = sql.query("SELECT * FROM user WHERE country = 'China'");
    [main] INFO smile.data.SQL - SELECT * FROM user WHERE country = 'China'
    user ==> [registration_dttm: DateTime, id: Integer, first_name: String, last_name: String, email: String, gender: String, ip_address: String, cc: String, country: String, birthdate: String, salary: Double, title: String, comments: String]
    +-------------------+---+----------+----------+--------------------+------+---------------+-------------------+-------+---------+---------+--------------------+--------+
    |  registration_dttm| id|first_name| last_name|               email|gender|     ip_address|                 cc|country|birthdate|   salary|               title|comments|
    +-------------------+---+----------+----------+--------------------+------+---------------+-------------------+-------+---------+---------+--------------------+--------+
    |2016-02-03T00:36:21|  4|    Denise|     Riley|    driley3@gmpg.org|Female|  140.35.109.83|   3576031598965625|  China| 4/8/1997| 90263.05|Senior Cost Accou...|        |
    |2016-02-03T18:04:34| 12|     Alice|     Berry|aberryb@wikipedia...|Female| 246.225.12.189|   4917830851454417|  China|8/12/1968| 22944.53|    Quality Engineer|        |
    |2016-02-03T10:30:36| 20|   Rebecca|      Bell| rbellj@bandcamp.com|Female|172.215.104.127|                   |  China|         |137251.19|                    |        |
    |2016-02-03T08:41:26| 27|     Henry|     Henry| hhenryq@godaddy.com|  Male| 191.88.236.116|4905730021217853521|  China|9/22/1995|284300.15|Nuclear Power Eng...|        |
    |2016-02-03T20:46:39| 37|   Dorothy|     Gomez|dgomez10@jiathis.com|Female| 65.111.200.146| 493684876859391834|  China|         | 57194.86|                    |        |
    |2016-02-03T08:34:26| 43|    Amanda|      Gray|  agray16@cdbaby.com|Female| 252.20.193.145|   3561501596653859|  China|8/28/1967|213410.26|Senior Quality En...|        |
    |2016-02-03T00:05:52| 53|     Ralph|     Price|  rprice1g@tmall.com|  Male|   152.6.235.33|   4844227560658222|  China|8/26/1986| 168208.4|             Teacher|        |
    |2016-02-03T16:03:13| 55|      Anna|Montgomery|amontgomery1i@goo...|Female|  80.111.141.47|   3586860392406446|  China| 9/6/1957|  92837.5|Software Test Eng...|     1E2|
    |2016-02-03T00:33:25| 57|    Willie|    Palmer|wpalmer1k@t-onlin...|  Male| 164.107.46.161|   4026614769857244|  China|8/23/1986|184978.64|Environmental Spe...|        |
    |2016-02-03T05:55:57| 58|    Arthur|     Berry|    aberry1l@unc.edu|  Male|    52.42.24.55|   3542761473624274|  China|         |144164.88|                    |        |
    +-------------------+---+----------+----------+--------------------+------+---------------+-------------------+-------+---------+---------+--------------------+--------+
    179 more rows...
    

Of course, joins are very useful for preparing data from multiple sources. The resulting DataFrame may be fed to downstream machine learning algorithms.


    smile> var gdp = sql.query("SELECT * FROM user LEFT JOIN gdp ON user.country = gdp.Country");
    [main] INFO smile.data.SQL - SELECT * FROM user LEFT JOIN gdp ON user.country = gdp.Country
    gdp ==> [registration_dttm: DateTime, id: Integer, first_name: String, last_name: String, email: String, gender: String, ip_address: String, cc: String, country: String, birthdate: String, salary: Double, title: String, comments: String, Country: String, GDP Growth: Double, Debt: Double, Interest: Double]
    +-------------------+---+----------+---------+--------------------+------+---------------+------------------+---------+----------+---------+--------------------+--------------------+---------+----------+-----+--------+
    |  registration_dttm| id|first_name|last_name|               email|gender|     ip_address|                cc|  country| birthdate|   salary|               title|            comments|  Country|GDP Growth| Debt|Interest|
    +-------------------+---+----------+---------+--------------------+------+---------------+------------------+---------+----------+---------+--------------------+--------------------+---------+----------+-----+--------+
    |2016-02-03T07:55:29|  1|    Amanda|   Jordan|    ajordan0@com.com|Female|    1.197.201.2|  6759521864920116|Indonesia|  3/8/1971| 49756.53|    Internal Auditor|               1E+02|Indonesia|       6.5| 26.2|     7.7|
    |2016-02-03T17:04:03|  2|    Albert|  Freeman|     afreeman1@is.gd|  Male| 218.111.175.34|                  |   Canada| 1/16/1968|150280.17|       Accountant IV|                    |   Canada|       2.5| 52.5|     9.5|
    |2016-02-03T07:22:34|  6|   Kathryn|    White|  kwhite5@google.com|Female| 195.131.81.179|  3583136326049310|Indonesia| 2/25/1983| 69227.11|   Account Executive|                    |Indonesia|       6.5| 26.2|     7.7|
    |2016-02-03T08:33:08|  7|    Samuel|   Holmes|sholmes6@foxnews.com|  Male| 232.234.81.197|  3582641366974690| Portugal|12/18/1987| 14247.62|Senior Financial ...|                    | Portugal|      -1.6| 92.5|     9.7|
    |2016-02-03T18:29:47| 10|     Emily|  Stewart|estewart9@opensou...|Female| 143.28.251.245|  3574254110301671|  Nigeria| 1/28/1997| 27234.28|     Health Coach IV|                    |  Nigeria|       7.4|    3|     6.6|
    |2016-02-03T08:53:23| 15|   Dorothy|   Hudson|dhudsone@blogger.com|Female|       8.59.7.0|  3542586858224170|    Japan|12/20/1989|157099.71|  Nurse Practicioner|        alert('hi...|    Japan|      -0.6|174.8|    15.7|
    |2016-02-03T00:44:01| 16|     Bruce|   Willis|bwillisf@bluehost...|  Male|239.182.219.189|  3573030625927601|   Brazil|          |239100.65|                    |                    |   Brazil|       2.7| 52.8|    24.1|
    |2016-02-03T16:44:24| 18|   Stephen|  Wallace|swallaceh@netvibe...|  Male|  152.49.213.62|  5433943468526428|  Ukraine| 1/15/1978|248877.99|Account Represent...|                    |  Ukraine|       5.2| 27.4|     5.2|
    |2016-02-03T18:50:55| 23|   Gregory|   Barnes|  gbarnesm@google.ru|  Male| 220.22.114.145|  3538432455620641|  Tunisia| 1/23/1971|182233.49|Senior Sales Asso...|         사회과학원 어학연구소|  Tunisia|        -2|   44|     5.8|
    |2016-02-03T08:02:34| 26|   Anthony| Lawrence|alawrencep@miitbe...|  Male| 121.211.242.99|564182969714151470|    Japan|12/10/1979|170085.81| Electrical Engineer|                    |    Japan|      -0.6|174.8|    15.7|
    +-------------------+---+----------+---------+--------------------+------+---------------+------------------+---------+----------+---------+--------------------+--------------------+---------+----------+-----+--------+
    990 more rows...
    

Sparse Dataset

The feature vectors could be very sparse. To save space, SparseDataset stores data in a list of lists (LIL) sparse matrix format. SparseDataset stores one list per row, where each entry stores a column index and value. Typically, these entries are kept sorted by column index for faster lookup.
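
The list of lists layout can be illustrated with a minimal Java sketch. This is not Smile's SparseDataset implementation; the class LilMatrix and its methods are hypothetical names used only to show one list per row, with entries kept sorted by column index so that lookup can use binary search.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative list-of-lists (LIL) sparse matrix; not Smile's SparseDataset. */
final class LilMatrix {
    /** One stored entry: a column index and its value. */
    static final class Entry {
        final int col;
        final double value;
        Entry(int col, double value) { this.col = col; this.value = value; }
    }

    /** rows.get(i) holds the entries of row i, sorted by column index. */
    private final List<List<Entry>> rows = new ArrayList<>();

    /** Stores v at (i, j), keeping row i sorted by column index. */
    void set(int i, int j, double v) {
        while (rows.size() <= i) rows.add(new ArrayList<>());
        List<Entry> row = rows.get(i);
        int pos = find(row, j);
        if (pos >= 0) row.set(pos, new Entry(j, v));   // overwrite existing entry
        else row.add(-(pos + 1), new Entry(j, v));     // insert at sorted position
    }

    /** Returns the value at (i, j), or 0.0 if no entry is stored there. */
    double get(int i, int j) {
        if (i >= rows.size()) return 0.0;
        List<Entry> row = rows.get(i);
        int pos = find(row, j);
        return pos >= 0 ? row.get(pos).value : 0.0;
    }

    /** Returns the total number of stored entries. */
    int size() {
        int n = 0;
        for (List<Entry> row : rows) n += row.size();
        return n;
    }

    /** Binary search by column; returns the index, or -(insertion point) - 1. */
    private static int find(List<Entry> row, int col) {
        int lo = 0, hi = row.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = row.get(mid).col;
            if (c < col) lo = mid + 1;
            else if (c > col) hi = mid - 1;
            else return mid;
        }
        return -(lo + 1);
    }
}
```

Only the cells that were set occupy memory, so a row with a handful of nonzero features costs a handful of entries regardless of the number of columns.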

SparseDataset is often used to construct the data matrix. Once the matrix is constructed, it is typically converted to a format, such as Harwell-Boeing column-compressed sparse matrix format, which is more efficient for matrix operations.
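
As a rough sketch of that conversion step, the Java code below turns row-wise sparse data into a column-compressed layout. It is an illustration under assumptions, not Smile's code: the class CscMatrix, the method fromRows, and the plain-array input (sorted column indices and matching values per row) are hypothetical.

```java
import java.util.Arrays;

/** Illustrative column-compressed sparse matrix; not Smile's implementation. */
final class CscMatrix {
    final int[] colPtr;     // entries of column j live in [colPtr[j], colPtr[j + 1])
    final int[] rowIdx;     // row index of each stored entry
    final double[] values;  // value of each stored entry

    private CscMatrix(int[] colPtr, int[] rowIdx, double[] values) {
        this.colPtr = colPtr;
        this.rowIdx = rowIdx;
        this.values = values;
    }

    /**
     * Builds the column-compressed form from row-wise data, where cols[i]
     * holds the column indices of row i and vals[i] the matching values.
     */
    static CscMatrix fromRows(int[][] cols, double[][] vals, int ncols) {
        int[] colPtr = new int[ncols + 1];
        // Count the entries in each column.
        for (int[] row : cols) {
            for (int j : row) colPtr[j + 1]++;
        }
        // Prefix sums turn the counts into column start offsets.
        for (int j = 0; j < ncols; j++) colPtr[j + 1] += colPtr[j];

        int nnz = colPtr[ncols];
        int[] rowIdx = new int[nnz];
        double[] values = new double[nnz];
        int[] next = colPtr.clone(); // next free slot in each column
        // Rows are visited in ascending order, so row indices end up
        // sorted within every column.
        for (int i = 0; i < cols.length; i++) {
            for (int k = 0; k < cols[i].length; k++) {
                int dst = next[cols[i][k]]++;
                rowIdx[dst] = i;
                values[dst] = vals[i][k];
            }
        }
        return new CscMatrix(colPtr, rowIdx, values);
    }

    /** Returns the value at (i, j), or 0.0 if no entry is stored there. */
    double get(int i, int j) {
        int pos = Arrays.binarySearch(rowIdx, colPtr[j], colPtr[j + 1], i);
        return pos >= 0 ? values[pos] : 0.0;
    }
}
```

The three flat arrays are contiguous in memory, which is what makes column-oriented operations such as matrix-vector products cheaper than walking per-row lists.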

The class BinarySparseDataset is more efficient for binary sparse data. In BinarySparseDataset, each item is stored as an integer array that holds the indices of its nonzero elements in ascending order.
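
To show why the sorted index array is convenient, here is a small Java sketch. It does not use the BinarySparseDataset API; the class BinarySparseItems and its methods are hypothetical helpers that only mimic the storage idea described above.

```java
import java.util.Arrays;

/** Illustrative helpers for binary sparse items stored as sorted index arrays. */
final class BinarySparseItems {
    private BinarySparseItems() {}

    /** Builds the ascending, duplicate-free index array for one item. */
    static int[] item(int... indices) {
        return Arrays.stream(indices).distinct().sorted().toArray();
    }

    /** Tests whether the given feature index is nonzero in the item. */
    static boolean contains(int[] item, int index) {
        return Arrays.binarySearch(item, index) >= 0;
    }

    /**
     * Counts the indices shared by two items with a single merge pass.
     * For binary vectors this equals their dot product.
     */
    static int intersect(int[] a, int[] b) {
        int i = 0, j = 0, count = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { count++; i++; j++; }
        }
        return count;
    }
}
```

Because the values are implicitly 1, no value array is needed, and operations such as the intersection count run in time linear in the number of nonzero elements.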

Parsers

Smile provides a number of parsers for popular data formats, such as Parquet, Avro, Arrow, SAS7BDAT, Weka's ARFF files, LibSVM's file format, delimited text files, JSON, and binary sparse data. We will demonstrate these parsers with the sample data in the data directory. In the Scala API, the parsing functions are in the smile.read object.

Apache Parquet

Apache Parquet is a columnar storage format that supports nested data structures. It uses the record shredding and assembly algorithm described in the Dremel paper.


    smile> val df = read.parquet("data/kylo/userdata1.parquet")
    df: DataFrame = [registration_dttm: DateTime, id: Integer, first_name: String, last_name: String, email: String, gender: String, ip_address: String, cc: String, country: String, birthdate: String, salary: Double, title: String, comments: String]
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |  registration_dttm| id|first_name|last_name|               email|gender|    ip_address|              cc|             country| birthdate|   salary|               title|comments|
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |2016-02-03T07:55:29|  1|    Amanda|   Jordan|    ajordan0@com.com|Female|   1.197.201.2|6759521864920116|           Indonesia|  3/8/1971| 49756.53|    Internal Auditor|   1E+02|
    |2016-02-03T17:04:03|  2|    Albert|  Freeman|     afreeman1@is.gd|  Male|218.111.175.34|                |              Canada| 1/16/1968|150280.17|       Accountant IV|        |
    |2016-02-03T01:09:31|  3|    Evelyn|   Morgan|emorgan2@altervis...|Female|  7.161.136.94|6767119071901597|              Russia|  2/1/1960|144972.51| Structural Engineer|        |
    |2016-02-03T00:36:21|  4|    Denise|    Riley|    driley3@gmpg.org|Female| 140.35.109.83|3576031598965625|               China|  4/8/1997| 90263.05|Senior Cost Accou...|        |
    |2016-02-03T05:05:31|  5|    Carlos|    Burns|cburns4@miitbeian...|      |169.113.235.40|5602256255204850|        South Africa|          |     null|                    |        |
    |2016-02-03T07:22:34|  6|   Kathryn|    White|  kwhite5@google.com|Female|195.131.81.179|3583136326049310|           Indonesia| 2/25/1983| 69227.11|   Account Executive|        |
    |2016-02-03T08:33:08|  7|    Samuel|   Holmes|sholmes6@foxnews.com|  Male|232.234.81.197|3582641366974690|            Portugal|12/18/1987| 14247.62|Senior Financial ...|        |
    |2016-02-03T06:47:06|  8|     Harry|   Howell| hhowell7@eepurl.com|  Male|  91.235.51.73|                |Bosnia and Herzeg...|  3/1/1962|186469.43|    Web Developer IV|        |
    ...
    

    smile> var df = Read.parquet("data/kylo/userdata1.parquet")
    df ==> [registration_dttm: DateTime, id: Integer, first_name: String, last_name: String, email: String, gender: String, ip_address: String, cc: String, country: String, birthdate: String, salary: Double, title: String, comments: String]
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |  registration_dttm| id|first_name|last_name|               email|gender|    ip_address|              cc|             country| birthdate|   salary|               title|comments|
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |2016-02-03T07:55:29|  1|    Amanda|   Jordan|    ajordan0@com.com|Female|   1.197.201.2|6759521864920116|           Indonesia|  3/8/1971| 49756.53|    Internal Auditor|   1E+02|
    |2016-02-03T17:04:03|  2|    Albert|  Freeman|     afreeman1@is.gd|  Male|218.111.175.34|                |              Canada| 1/16/1968|150280.17|       Accountant IV|        |
    |2016-02-03T01:09:31|  3|    Evelyn|   Morgan|emorgan2@altervis...|Female|  7.161.136.94|6767119071901597|              Russia|  2/1/1960|144972.51| Structural Engineer|        |
    |2016-02-03T00:36:21|  4|    Denise|    Riley|    driley3@gmpg.org|Female| 140.35.109.83|3576031598965625|               China|  4/8/1997| 90263.05|Senior Cost Accou...|        |
    |2016-02-03T05:05:31|  5|    Carlos|    Burns|cburns4@miitbeian...|      |169.113.235.40|5602256255204850|        South Africa|          |     null|                    |        |
    |2016-02-03T07:22:34|  6|   Kathryn|    White|  kwhite5@google.com|Female|195.131.81.179|3583136326049310|           Indonesia| 2/25/1983| 69227.11|   Account Executive|        |
    |2016-02-03T08:33:08|  7|    Samuel|   Holmes|sholmes6@foxnews.com|  Male|232.234.81.197|3582641366974690|            Portugal|12/18/1987| 14247.62|Senior Financial ...|        |
    |2016-02-03T06:47:06|  8|     Harry|   Howell| hhowell7@eepurl.com|  Male|  91.235.51.73|                |Bosnia and Herzeg...|  3/1/1962|186469.43|    Web Developer IV|        |
    ...
          

    >>> val df = read.parquet("data/kylo/userdata1.parquet")
    >>> df
    res100: smile.data.DataFrame = [registration_dttm: DateTime, id: Integer, first_name: String, last_name: String, email: String, gender: String, ip_address: String, cc: String, country: String, birthdate: String, salary: Double, title: String, comments: String]
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |  registration_dttm| id|first_name|last_name|               email|gender|    ip_address|              cc|             country| birthdate|   salary|               title|comments|
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |2016-02-03T07:55:29|  1|    Amanda|   Jordan|    ajordan0@com.com|Female|   1.197.201.2|6759521864920116|           Indonesia|  3/8/1971| 49756.53|    Internal Auditor|   1E+02|
    |2016-02-03T17:04:03|  2|    Albert|  Freeman|     afreeman1@is.gd|  Male|218.111.175.34|                |              Canada| 1/16/1968|150280.17|       Accountant IV|        |
    |2016-02-03T01:09:31|  3|    Evelyn|   Morgan|emorgan2@altervis...|Female|  7.161.136.94|6767119071901597|              Russia|  2/1/1960|144972.51| Structural Engineer|        |
    |2016-02-03T00:36:21|  4|    Denise|    Riley|    driley3@gmpg.org|Female| 140.35.109.83|3576031598965625|               China|  4/8/1997| 90263.05|Senior Cost Accou...|        |
    |2016-02-03T05:05:31|  5|    Carlos|    Burns|cburns4@miitbeian...|      |169.113.235.40|5602256255204850|        South Africa|          |     null|                    |        |
    |2016-02-03T07:22:34|  6|   Kathryn|    White|  kwhite5@google.com|Female|195.131.81.179|3583136326049310|           Indonesia| 2/25/1983| 69227.11|   Account Executive|        |
    |2016-02-03T08:33:08|  7|    Samuel|   Holmes|sholmes6@foxnews.com|  Male|232.234.81.197|3582641366974690|            Portugal|12/18/1987| 14247.62|Senior Financial ...|        |
    |2016-02-03T06:47:06|  8|     Harry|   Howell| hhowell7@eepurl.com|  Male|  91.235.51.73|                |Bosnia and Herzeg...|  3/1/1962|186469.43|    Web Developer IV|        |
    |2016-02-03T03:52:53|  9|      Jose|   Foster|   jfoster8@yelp.com|  Male|  132.31.53.61|                |         South Korea| 3/27/1992|231067.84|Software Test Eng...|   1E+02|
    |2016-02-03T18:29:47| 10|     Emily|  Stewart|estewart9@opensou...|Female|143.28.251.245|3574254110301671|             Nigeria| 1/28/1997| 27234.28|     Health Coach IV|        |
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    990 more rows...

    

Apache Avro

Apache Avro is a data serialization system. Avro provides rich data structures, a compact and fast binary data format, a container file to store persistent data, and remote procedure call (RPC). Avro relies on schemas. When Avro data is stored in a file, its schema is stored with it. Avro schemas are defined with JSON.


    smile> val df = read.avro(Paths.getTestData("kylo/userdata1.avro"), Paths.getTestData("avro/userdata.avsc"))
    df: DataFrame = [registration_dttm: DateTime, id: Integer, first_name: String, last_name: String, email: String, gender: String, ip_address: String, cc: String, country: String, birthdate: String, salary: Double, title: String, comments: String]
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |  registration_dttm| id|first_name|last_name|               email|gender|    ip_address|              cc|             country| birthdate|   salary|               title|comments|
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |2016-02-03T07:55:29|  1|    Amanda|   Jordan|    ajordan0@com.com|Female|   1.197.201.2|6759521864920116|           Indonesia|  3/8/1971| 49756.53|    Internal Auditor|   1E+02|
    |2016-02-03T17:04:03|  2|    Albert|  Freeman|     afreeman1@is.gd|  Male|218.111.175.34|                |              Canada| 1/16/1968|150280.17|       Accountant IV|        |
    |2016-02-03T01:09:31|  3|    Evelyn|   Morgan|emorgan2@altervis...|Female|  7.161.136.94|6767119071901597|              Russia|  2/1/1960|144972.51| Structural Engineer|        |
    |2016-02-03T00:36:21|  4|    Denise|    Riley|    driley3@gmpg.org|Female| 140.35.109.83|3576031598965625|               China|  4/8/1997| 90263.05|Senior Cost Accou...|        |
    |2016-02-03T05:05:31|  5|    Carlos|    Burns|cburns4@miitbeian...|      |169.113.235.40|5602256255204850|        South Africa|          |     null|                    |        |
    |2016-02-03T07:22:34|  6|   Kathryn|    White|  kwhite5@google.com|Female|195.131.81.179|3583136326049310|           Indonesia| 2/25/1983| 69227.11|   Account Executive|        |
    |2016-02-03T08:33:08|  7|    Samuel|   Holmes|sholmes6@foxnews.com|  Male|232.234.81.197|3582641366974690|            Portugal|12/18/1987| 14247.62|Senior Financial ...|        |
    |2016-02-03T06:47:06|  8|     Harry|   Howell| hhowell7@eepurl.com|  Male|  91.235.51.73|                |Bosnia and Herzeg...|  3/1/1962|186469.43|    Web Developer IV|        |
    ...
    

    smile> var avrodf = Read.avro(smile.util.Paths.getTestData("kylo/userdata1.avro"), smile.util.Paths.getTestData("kylo/userdata.avsc"))
    avrodf ==> [registration_dttm: DateTime, id: Integer, first_name: String, last_name: String, email: String, gender: String, ip_address: String, cc: String, country: String, birthdate: String, salary: Double, title: String, comments: String]
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |  registration_dttm| id|first_name|last_name|               email|gender|    ip_address|              cc|             country| birthdate|   salary|               title|comments|
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |2016-02-03T07:55:29|  1|    Amanda|   Jordan|    ajordan0@com.com|Female|   1.197.201.2|6759521864920116|           Indonesia|  3/8/1971| 49756.53|    Internal Auditor|   1E+02|
    |2016-02-03T17:04:03|  2|    Albert|  Freeman|     afreeman1@is.gd|  Male|218.111.175.34|                |              Canada| 1/16/1968|150280.17|       Accountant IV|        |
    |2016-02-03T01:09:31|  3|    Evelyn|   Morgan|emorgan2@altervis...|Female|  7.161.136.94|6767119071901597|              Russia|  2/1/1960|144972.51| Structural Engineer|        |
    |2016-02-03T00:36:21|  4|    Denise|    Riley|    driley3@gmpg.org|Female| 140.35.109.83|3576031598965625|               China|  4/8/1997| 90263.05|Senior Cost Accou...|        |
    |2016-02-03T05:05:31|  5|    Carlos|    Burns|cburns4@miitbeian...|      |169.113.235.40|5602256255204850|        South Africa|          |     null|                    |        |
    |2016-02-03T07:22:34|  6|   Kathryn|    White|  kwhite5@google.com|Female|195.131.81.179|3583136326049310|           Indonesia| 2/25/1983| 69227.11|   Account Executive|        |
    |2016-02-03T08:33:08|  7|    Samuel|   Holmes|sholmes6@foxnews.com|  Male|232.234.81.197|3582641366974690|            Portugal|12/18/1987| 14247.62|Senior Financial ...|        |
    |2016-02-03T06:47:06|  8|     Harry|   Howell| hhowell7@eepurl.com|  Male|  91.235.51.73|                |Bosnia and Herzeg...|  3/1/1962|186469.43|    Web Developer IV|        |
    ...
          

    >>> val avrodf = read.avro(smile.util.Paths.getTestData("kylo/userdata1.avro"), smile.util.Paths.getTestData("kylo/userdata.avsc"))
    >>> avrodf
    res104: smile.data.DataFrame = [registration_dttm: String, id: long, first_name: String, last_name: String, email: String, gender: String, ip_address: String, cc: Long, country: String, birthdate: String, salary: Double, title: String, comments: String]
    +--------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |   registration_dttm| id|first_name|last_name|               email|gender|    ip_address|              cc|             country| birthdate|   salary|               title|comments|
    +--------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |2016-02-03T07:55:29Z|  1|    Amanda|   Jordan|    ajordan0@com.com|Female|   1.197.201.2|6759521864920116|           Indonesia|  3/8/1971| 49756.53|    Internal Auditor|   1E+02|
    |2016-02-03T17:04:03Z|  2|    Albert|  Freeman|     afreeman1@is.gd|  Male|218.111.175.34|            null|              Canada| 1/16/1968|150280.17|       Accountant IV|        |
    |2016-02-03T01:09:31Z|  3|    Evelyn|   Morgan|emorgan2@altervis...|Female|  7.161.136.94|6767119071901597|              Russia|  2/1/1960|144972.51| Structural Engineer|        |
    |2016-02-03T12:36:21Z|  4|    Denise|    Riley|    driley3@gmpg.org|Female| 140.35.109.83|3576031598965625|               China|  4/8/1997| 90263.05|Senior Cost Accou...|        |
    |2016-02-03T05:05:31Z|  5|    Carlos|    Burns|cburns4@miitbeian...|      |169.113.235.40|5602256255204850|        South Africa|          |     null|                    |        |
    |2016-02-03T07:22:34Z|  6|   Kathryn|    White|  kwhite5@google.com|Female|195.131.81.179|3583136326049310|           Indonesia| 2/25/1983| 69227.11|   Account Executive|        |
    |2016-02-03T08:33:08Z|  7|    Samuel|   Holmes|sholmes6@foxnews.com|  Male|232.234.81.197|3582641366974690|            Portugal|12/18/1987| 14247.62|Senior Financial ...|        |
    |2016-02-03T06:47:06Z|  8|     Harry|   Howell| hhowell7@eepurl.com|  Male|  91.235.51.73|            null|Bosnia and Herzeg...|  3/1/1962|186469.43|    Web Developer IV|        |
    |2016-02-03T03:52:53Z|  9|      Jose|   Foster|   jfoster8@yelp.com|  Male|  132.31.53.61|            null|         South Korea| 3/27/1992|231067.84|Software Test Eng...|   1E+02|
    |2016-02-03T18:29:47Z| 10|     Emily|  Stewart|estewart9@opensou...|Female|143.28.251.245|3574254110301671|             Nigeria| 1/28/1997| 27234.28|     Health Coach IV|        |
    +--------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    990 more rows...
    

Apache Arrow

Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.

Feather uses the Apache Arrow columnar memory specification to represent binary data on disk. This makes read and write operations very fast. This is particularly important for encoding null/NA values and variable-length types like UTF8 strings. Feather is a part of the broader Apache Arrow project. Feather defines its own simplified schemas and metadata for on-disk representation.

In the example below, we write a DataFrame to a Feather file and then read it back.


    smile> val temp = java.io.File.createTempFile("chinook", "arrow")
    temp: java.io.File = /var/folders/cb/577dvd4n2db0ghdn3gn7ss0h0000gn/T/chinook5413820941564790310arrow

    smile> val path = temp.toPath()
    path: java.nio.file.Path = /var/folders/cb/577dvd4n2db0ghdn3gn7ss0h0000gn/T/chinook5413820941564790310arrow

    smile> write.arrow(df, path)
    [main] INFO smile.io.Arrow - write 1000 rows

    smile> val df = read.arrow(path)
    df: DataFrame = [registration_dttm: DateTime, id: Integer, first_name: String, last_name: String, email: String, gender: String, ip_address: String, cc: String, country: String, birthdate: String, salary: Double, title: String, comments: String]
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |  registration_dttm| id|first_name|last_name|               email|gender|    ip_address|              cc|             country| birthdate|   salary|               title|comments|
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |2016-02-03T07:55:29|  1|    Amanda|   Jordan|    ajordan0@com.com|Female|   1.197.201.2|6759521864920116|           Indonesia|  3/8/1971| 49756.53|    Internal Auditor|   1E+02|
    |2016-02-03T17:04:03|  2|    Albert|  Freeman|     afreeman1@is.gd|  Male|218.111.175.34|                |              Canada| 1/16/1968|150280.17|       Accountant IV|        |
    |2016-02-03T01:09:31|  3|    Evelyn|   Morgan|emorgan2@altervis...|Female|  7.161.136.94|6767119071901597|              Russia|  2/1/1960|144972.51| Structural Engineer|        |
    |2016-02-03T00:36:21|  4|    Denise|    Riley|    driley3@gmpg.org|Female| 140.35.109.83|3576031598965625|               China|  4/8/1997| 90263.05|Senior Cost Accou...|        |
    |2016-02-03T05:05:31|  5|    Carlos|    Burns|cburns4@miitbeian...|      |169.113.235.40|5602256255204850|        South Africa|          |     null|                    |        |
    |2016-02-03T07:22:34|  6|   Kathryn|    White|  kwhite5@google.com|Female|195.131.81.179|3583136326049310|           Indonesia| 2/25/1983| 69227.11|   Account Executive|        |
    |2016-02-03T08:33:08|  7|    Samuel|   Holmes|sholmes6@foxnews.com|  Male|232.234.81.197|3582641366974690|            Portugal|12/18/1987| 14247.62|Senior Financial ...|        |
    |2016-02-03T06:47:06|  8|     Harry|   Howell| hhowell7@eepurl.com|  Male|  91.235.51.73|                |Bosnia and Herzeg...|  3/1/1962|186469.43|    Web Developer IV|        |
    ...
    

    smile> var temp = java.io.File.createTempFile("chinook", "arrow")
    temp ==> /var/folders/cb/577dvd4n2db0ghdn3gn7ss0h0000gn/T/chinook5430879887643149276arrow

    smile> var path = temp.toPath()
    path ==> /var/folders/cb/577dvd4n2db0ghdn3gn7ss0h0000gn/T/chinook5430879887643149276arrow

    smile> Write.arrow(df, path)
    [main] INFO smile.io.Arrow - write 1000 rows

    smile> var arrowdf = Read.arrow(path)
    [main] INFO smile.io.Arrow - read 1000 rows and 13 columns
    arrowdf ==> [registration_dttm: DateTime, id: Integer, first_name: String, last_name: String, email: String, gender: String, ip_address: String, cc: String, country: String, birthdate: String, salary: Double, title: String, comments: String]
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |  registration_dttm| id|first_name|last_name|               email|gender|    ip_address|              cc|             country| birthdate|   salary|               title|comments|
    +-------------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |2016-02-03T07:55:29|  1|    Amanda|   Jordan|    ajordan0@com.com|Female|   1.197.201.2|6759521864920116|           Indonesia|  3/8/1971| 49756.53|    Internal Auditor|   1E+02|
    |2016-02-03T17:04:03|  2|    Albert|  Freeman|     afreeman1@is.gd|  Male|218.111.175.34|                |              Canada| 1/16/1968|150280.17|       Accountant IV|        |
    |2016-02-03T01:09:31|  3|    Evelyn|   Morgan|emorgan2@altervis...|Female|  7.161.136.94|6767119071901597|              Russia|  2/1/1960|144972.51| Structural Engineer|        |
    |2016-02-03T00:36:21|  4|    Denise|    Riley|    driley3@gmpg.org|Female| 140.35.109.83|3576031598965625|               China|  4/8/1997| 90263.05|Senior Cost Accou...|        |
    |2016-02-03T05:05:31|  5|    Carlos|    Burns|cburns4@miitbeian...|      |169.113.235.40|5602256255204850|        South Africa|          |     null|                    |        |
    |2016-02-03T07:22:34|  6|   Kathryn|    White|  kwhite5@google.com|Female|195.131.81.179|3583136326049310|           Indonesia| 2/25/1983| 69227.11|   Account Executive|        |
    |2016-02-03T08:33:08|  7|    Samuel|   Holmes|sholmes6@foxnews.com|  Male|232.234.81.197|3582641366974690|            Portugal|12/18/1987| 14247.62|Senior Financial ...|        |
    |2016-02-03T06:47:06|  8|     Harry|   Howell| hhowell7@eepurl.com|  Male|  91.235.51.73|                |Bosnia and Herzeg...|  3/1/1962|186469.43|    Web Developer IV|        |
    ...
          

    >>> val temp = java.io.File.createTempFile("chinook", "arrow")
    >>> val path = temp.toPath()
    >>> write.arrow(df, path)
    [main] INFO smile.io.Arrow - write 1000 rows
    >>> val df = read.arrow(path)
    [main] INFO smile.io.Arrow - read 1000 rows and 13 columns
    >>> df
    res109: smile.data.DataFrame = [registration_dttm: DateTime, id: Integer, first_name: String, last_name: String, email: String, gender: String, ip_address: String, cc: String, country: String, birthdate: String, salary: Double, title: String, comments: String]
    +-----------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |registration_dttm| id|first_name|last_name|               email|gender|    ip_address|              cc|             country| birthdate|   salary|               title|comments|
    +-----------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    |             null|  1|    Amanda|   Jordan|    ajordan0@com.com|Female|   1.197.201.2|6759521864920116|           Indonesia|  3/8/1971| 49756.53|    Internal Auditor|   1E+02|
    |             null|  2|    Albert|  Freeman|     afreeman1@is.gd|  Male|218.111.175.34|                |              Canada| 1/16/1968|150280.17|       Accountant IV|        |
    |             null|  3|    Evelyn|   Morgan|emorgan2@altervis...|Female|  7.161.136.94|6767119071901597|              Russia|  2/1/1960|144972.51| Structural Engineer|        |
    |             null|  4|    Denise|    Riley|    driley3@gmpg.org|Female| 140.35.109.83|3576031598965625|               China|  4/8/1997| 90263.05|Senior Cost Accou...|        |
    |             null|  5|    Carlos|    Burns|cburns4@miitbeian...|      |169.113.235.40|5602256255204850|        South Africa|          |     null|                    |        |
    |             null|  6|   Kathryn|    White|  kwhite5@google.com|Female|195.131.81.179|3583136326049310|           Indonesia| 2/25/1983| 69227.11|   Account Executive|        |
    |             null|  7|    Samuel|   Holmes|sholmes6@foxnews.com|  Male|232.234.81.197|3582641366974690|            Portugal|12/18/1987| 14247.62|Senior Financial ...|        |
    |             null|  8|     Harry|   Howell| hhowell7@eepurl.com|  Male|  91.235.51.73|                |Bosnia and Herzeg...|  3/1/1962|186469.43|    Web Developer IV|        |
    |             null|  9|      Jose|   Foster|   jfoster8@yelp.com|  Male|  132.31.53.61|                |         South Korea| 3/27/1992|231067.84|Software Test Eng...|   1E+02|
    |             null| 10|     Emily|  Stewart|estewart9@opensou...|Female|143.28.251.245|3574254110301671|             Nigeria| 1/28/1997| 27234.28|     Health Coach IV|        |
    +-----------------+---+----------+---------+--------------------+------+--------------+----------------+--------------------+----------+---------+--------------------+--------+
    990 more rows...
    

SAS7BDAT

SAS7BDAT is currently the main format used for storing SAS datasets across all platforms.


    smile> val df = read.sas(Paths.getTestData("sas/airline.sas7bdat"))
    df: DataFrame = [YEAR: double, Y: double, W: double, R: double, L: double, K: double]
    +----+-----+-----+------+-----+-----+
    |YEAR|    Y|    W|     R|    L|    K|
    +----+-----+-----+------+-----+-----+
    |1948|1.214|0.243|0.1454|1.415|0.612|
    |1949|1.354| 0.26|0.2181|1.384|0.559|
    |1950|1.569|0.278|0.3157|1.388|0.573|
    |1951|1.948|0.297| 0.394| 1.55|0.564|
    |1952|2.265| 0.31|0.3559|1.802|0.574|
    |1953|2.731|0.322|0.3593|1.926|0.711|
    |1954|3.025|0.335|0.4025|1.964|0.776|
    |1955|3.562| 0.35|0.3961|2.116|0.827|
    |1956|3.979|0.361|0.3822|2.435|  0.8|
    |1957| 4.42|0.379|0.3045|2.707|0.921|
    +----+-----+-----+------+-----+-----+
    22 more rows...
    

    smile> var sasdf = Read.sas("data/sas/airline.sas7bdat")
    sasdf ==> [YEAR: double, Y: double, W: double, R: double, L: double, K: double]
    +----+-----+-----+------+-----+-----+
    |YEAR|    Y|    W|     R|    L|    K|
    +----+-----+-----+------+-----+-----+
    |1948|1.214|0.243|0.1454|1.415|0.612|
    |1949|1.354| 0.26|0.2181|1.384|0.559|
    |1950|1.569|0.278|0.3157|1.388|0.573|
    |1951|1.948|0.297| 0.394| 1.55|0.564|
    |1952|2.265| 0.31|0.3559|1.802|0.574|
    |1953|2.731|0.322|0.3593|1.926|0.711|
    |1954|3.025|0.335|0.4025|1.964|0.776|
    |1955|3.562| 0.35|0.3961|2.116|0.827|
    |1956|3.979|0.361|0.3822|2.435|  0.8|
    |1957| 4.42|0.379|0.3045|2.707|0.921|
    +----+-----+-----+------+-----+-----+
    22 more rows...
          

    >>> val df = read.sas("data/sas/airline.sas7bdat")
    >>> df
    res112: smile.data.DataFrame = [YEAR: double, Y: double, W: double, R: double, L: double, K: double]
    +----+-----+-----+------+-----+-----+
    |YEAR|    Y|    W|     R|    L|    K|
    +----+-----+-----+------+-----+-----+
    |1948|1.214|0.243|0.1454|1.415|0.612|
    |1949|1.354| 0.26|0.2181|1.384|0.559|
    |1950|1.569|0.278|0.3157|1.388|0.573|
    |1951|1.948|0.297| 0.394| 1.55|0.564|
    |1952|2.265| 0.31|0.3559|1.802|0.574|
    |1953|2.731|0.322|0.3593|1.926|0.711|
    |1954|3.025|0.335|0.4025|1.964|0.776|
    |1955|3.562| 0.35|0.3961|2.116|0.827|
    |1956|3.979|0.361|0.3822|2.435|  0.8|
    |1957| 4.42|0.379|0.3045|2.707|0.921|
    +----+-----+-----+------+-----+-----+
    22 more rows...
    

Relational Database

It is also easy to load data from relational databases through JDBC.


    smile> import $ivy.`org.xerial:sqlite-jdbc:3.28.0`
    import $ivy.$

    smile> Class.forName("org.sqlite.JDBC")
    res23: Class[?0] = class org.sqlite.JDBC

    smile> val url = String.format("jdbc:sqlite:%s", Paths.getTestData("sqlite/chinook.db").toAbsolutePath())
    url: String = "jdbc:sqlite:data/sqlite/chinook.db"
    smile> val sql = """select e.firstname as 'Employee First', e.lastname as 'Employee Last', c.firstname as 'Customer First', c.lastname as 'Customer Last', c.country, i.total
                     from employees as e
                     join customers as c on e.employeeid = c.supportrepid
                     join invoices as i on c.customerid = i.customerid
                    """
    sql: String = """select e.firstname as 'Employee First', e.lastname as 'Employee Last', c.firstname as 'Customer First', c.lastname as 'Customer Last', c.country, i.total
                     from employees as e
                     join customers as c on e.employeeid = c.supportrepid
                     join invoices as i on c.customerid = i.customerid
                    """

    smile> val conn = java.sql.DriverManager.getConnection(url)
    conn: java.sql.Connection = org.sqlite.jdbc4.JDBC4Connection@782cd00

    smile> val stmt = conn.createStatement()
    stmt: java.sql.Statement = org.sqlite.jdbc4.JDBC4Statement@40df1311

    smile> val rs = stmt.executeQuery(sql)
    rs: java.sql.ResultSet = org.sqlite.jdbc4.JDBC4ResultSet@5a524a19

    smile> val df = DataFrame.of(rs)
    df: DataFrame = [Employee First: String, Employee Last: String, Customer First: String, Customer Last: String, Country: String, Total: Double]
    +--------------+-------------+--------------+-------------+-------+-----+
    |Employee First|Employee Last|Customer First|Customer Last|Country|Total|
    +--------------+-------------+--------------+-------------+-------+-----+
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil| 3.98|
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil| 3.96|
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil| 5.94|
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil| 0.99|
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil| 1.98|
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil|13.86|
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil| 8.91|
    |         Steve|      Johnson|        Leonie|       Köhler|Germany| 1.98|
    |         Steve|      Johnson|        Leonie|       Köhler|Germany|13.86|
    |         Steve|      Johnson|        Leonie|       Köhler|Germany| 8.91|
    +--------------+-------------+--------------+-------------+-------+-----+
    402 more rows...
    

    smile> Class.forName("org.sqlite.JDBC")
    $1 ==> class org.sqlite.JDBC

    smile> var url = String.format("jdbc:sqlite:%s", smile.util.Paths.getTestData("sqlite/chinook.db").toAbsolutePath())
    url ==> "jdbc:sqlite:/Users/hli/github/smile/shell/target ... ../data/sqlite/chinook.db"

    smile> var sql = "select e.firstname as 'Employee First', e.lastname as 'Employee Last', c.firstname as 'Customer First', c.lastname as 'Customer Last', c.country, i.total " +
                     "from employees as e " +
                     "join customers as c on e.employeeid = c.supportrepid " +
                     "join invoices as i on c.customerid = i.customerid "
    sql ==> "select e.firstname as 'Employee First', e.lastna ... ustomerid = i.customerid "

    smile> var conn = java.sql.DriverManager.getConnection(url)
    conn ==> org.sqlite.jdbc4.JDBC4Connection@1df82230

    smile> var stmt = conn.createStatement()
    stmt ==> org.sqlite.jdbc4.JDBC4Statement@75329a49

    smile> var rs = stmt.executeQuery(sql)
    rs ==> org.sqlite.jdbc4.JDBC4ResultSet@48aaecc3

    smile> var sqldf = DataFrame.of(rs)
    sqldf ==> [Employee First: String, Employee Last: String, Customer First: String, Customer Last: String, Country: String, Total: Double]
    +--------------+-------------+--------------+-------------+-------+-----+
    |Employee First|Employee Last|Customer First|Customer Last|Country|Total|
    +--------------+-------------+--------------+-------------+-------+-----+
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil| 3.98|
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil| 3.96|
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil| 5.94|
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil| 0.99|
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil| 1.98|
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil|13.86|
    |          Jane|      Peacock|          Luís|    Gonçalves| Brazil| 8.91|
    |         Steve|      Johnson|        Leonie|       Köhler|Germany| 1.98|
    |         Steve|      Johnson|        Leonie|       Köhler|Germany|13.86|
    |         Steve|      Johnson|        Leonie|       Köhler|Germany| 8.91|
    +--------------+-------------+--------------+-------------+-------+-----+
    402 more rows...
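
The sessions above leave the connection, statement, and result set open, which is harmless in a REPL. In application code they should be closed once the rows have been consumed. The following is a minimal sketch of that pattern using only `java.sql`. The class `JdbcSketch` and its helpers `sqliteUrl` and `query` are hypothetical names introduced for illustration, and the rows are copied into plain maps rather than a Smile DataFrame; with Smile on the classpath, the `DataFrame.of(rs)` call shown above would take the place of the loop.

```java
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class JdbcSketch {
    /** Builds a SQLite JDBC URL for a database file (hypothetical helper). */
    static String sqliteUrl(Path db) {
        return "jdbc:sqlite:" + db.toAbsolutePath();
    }

    /** Runs a query and copies every row into a column-label to value map. */
    static List<Map<String, Object>> query(String url, String sql) throws SQLException {
        // try-with-resources closes the result set, statement, and connection
        // in reverse order of declaration, even if the query throws.
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            ResultSetMetaData meta = rs.getMetaData();
            List<Map<String, Object>> rows = new ArrayList<>();
            while (rs.next()) {
                Map<String, Object> row = new LinkedHashMap<>();
                // JDBC column indices are 1-based.
                for (int i = 1; i <= meta.getColumnCount(); i++) {
                    row.put(meta.getColumnLabel(i), rs.getObject(i));
                }
                rows.add(row);
            }
            return rows;
        }
    }
}
```

The result set must be fully read before the block exits, because closing the connection invalidates it. A JDBC driver for the target database still has to be on the classpath; without one, `DriverManager.getConnection` throws an `SQLException`.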
          

Weka ARFF

Weka ARFF (Attribute-Relation File Format) is an ASCII text file format that is essentially a CSV file with a header that describes the metadata. ARFF was developed for use in the Weka machine learning software.

A dataset is first described, beginning with its name (the relation, in ARFF terminology). Each variable (an attribute, in ARFF terminology) used to describe the observations is then identified together with its data type, one definition per line. The actual observations are then listed, one per line, with fields separated by commas, much like a CSV file.

Missing values in an ARFF dataset are identified using the question mark '?'. Comments can be included in the file, introduced at the beginning of a line with a '%', whereby the remainder of the line is ignored.

A significant advantage of the ARFF data file over the CSV data file is the metadata information. In addition, the ability to include comments ensures that we can record extra information about the data set, including how it was derived, where it came from, and how it might be cited.
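
To make the layout concrete, here is a toy reader written with the JDK only. It is a sketch, not the parser used by Smile or Weka: it recognizes just the `@relation`, `@attribute`, and `@data` keywords, '%' comment lines, and '?' for missing values, and it ignores quoting, sparse rows, and attribute types. The class name `ArffSketch` and the sample relation are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class ArffSketch {
    String relation;                                     // name of the dataset
    final List<String> attributes = new ArrayList<>();   // attribute names, in order
    final List<String[]> rows = new ArrayList<>();       // raw fields; null marks a missing value

    static ArffSketch parse(String text) {
        ArffSketch arff = new ArffSketch();
        boolean inData = false;
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            // Skip blank lines and comment lines.
            if (line.isEmpty() || line.startsWith("%")) continue;
            String lower = line.toLowerCase(Locale.ROOT);   // keywords are case-insensitive
            if (inData) {
                String[] fields = line.split(",");
                for (int i = 0; i < fields.length; i++) {
                    String f = fields[i].trim();
                    fields[i] = f.equals("?") ? null : f;
                }
                arff.rows.add(fields);
            } else if (lower.startsWith("@relation")) {
                arff.relation = line.substring("@relation".length()).trim();
            } else if (lower.startsWith("@attribute")) {
                // "@attribute <name> <type>": this sketch keeps only the name.
                arff.attributes.add(line.split("\\s+")[1]);
            } else if (lower.startsWith("@data")) {
                inData = true;
            }
        }
        return arff;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
            "% toy data set, made up for illustration",
            "@relation weather",
            "@attribute temperature numeric",
            "@attribute outlook {sunny,rainy}",
            "@data",
            "21.5,sunny",
            "?,rainy");
        ArffSketch arff = parse(sample);
        System.out.println(arff.relation + " " + arff.attributes + " " + arff.rows.size());
    }
}
```

In the sample, the header carries what a bare CSV file cannot: the relation name, the attribute names, and their declared types, while the second data row shows a missing temperature.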

In the directory data/weka, we have many sample ARFF files. We can also read data from remote servers by HTTP, FTP, etc.


    smile> val df = read.arff("https://github.com/haifengl/smile/blob/master/shell/src/universal/data/weka/cpu.arff?raw=true")
    [main] INFO smile.io.Arff - Read ARFF relation cpu
    df: DataFrame = [MYCT: float, MMIN: float, MMAX: float, CACH: float, CHMIN: float, CHMAX: float, class: float]
    +----+-----+-----+----+-----+-----+-----+
    |MYCT| MMIN| MMAX|CACH|CHMIN|CHMAX|class|
    +----+-----+-----+----+-----+-----+-----+
    | 125|  256| 6000| 256|   16|  128|  199|
    |  29| 8000|32000|  32|    8|   32|  253|
    |  29| 8000|32000|  32|    8|   32|  253|
    |  29| 8000|32000|  32|    8|   32|  253|
    |  29| 8000|16000|  32|    8|   16|  132|
    |  26| 8000|32000|  64|    8|   32|  290|
    |  23|16000|32000|  64|   16|   32|  381|
    |  23|16000|32000|  64|   16|   32|  381|
    |  23|16000|64000|  64|   16|   32|  749|
    |  23|32000|64000| 128|   32|   64| 1238|
    +----+-----+-----+----+-----+-----+-----+
    199 more rows...
    

    smile> var cpu = Read.arff("https://github.com/haifengl/smile/blob/master/shell/src/universal/data/weka/cpu.arff?raw=true")
    [main] INFO smile.io.Arff - Read ARFF relation cpu
    cpu ==> [MYCT: float, MMIN: float, MMAX: float, CACH: float, CHMIN: float, CHMAX: float, class: float]
    +----+-----+-----+----+-----+-----+-----+
    |MYCT| MMIN| MMAX|CACH|CHMIN|CHMAX|class|
    +----+-----+-----+----+-----+-----+-----+
    | 125|  256| 6000| 256|   16|  128|  199|
    |  29| 8000|32000|  32|    8|   32|  253|
    |  29| 8000|32000|  32|    8|   32|  253|
    |  29| 8000|32000|  32|    8|   32|  253|
    |  29| 8000|16000|  32|    8|   16|  132|
    |  26| 8000|32000|  64|    8|   32|  290|
    |  23|16000|32000|  64|   16|   32|  381|
    |  23|16000|32000|  64|   16|   32|  381|
    |  23|16000|64000|  64|   16|   32|  749|
    |  23|32000|64000| 128|   32|   64| 1238|
    +----+-----+-----+----+-----+-----+-----+
    199 more rows...
          

    >>> val df = read.arff("https://github.com/haifengl/smile/blob/master/shell/src/universal/data/weka/cpu.arff?raw=true")
    [main] INFO smile.io.Arff - Read ARFF relation cpu
    >>> df
    res114: smile.data.DataFrame = [MYCT: float, MMIN: float, MMAX: float, CACH: float, CHMIN: float, CHMAX: float, class: float]
    +----+-----+-----+----+-----+-----+-----+
    |MYCT| MMIN| MMAX|CACH|CHMIN|CHMAX|class|
    +----+-----+-----+----+-----+-----+-----+
    | 125|  256| 6000| 256|   16|  128|  199|
    |  29| 8000|32000|  32|    8|   32|  253|
    |  29| 8000|32000|  32|    8|   32|  253|
    |  29| 8000|32000|  32|    8|   32|  253|
    |  29| 8000|16000|  32|    8|   16|  132|
    |  26| 8000|32000|  64|    8|   32|  290|
    |  23|16000|32000|  64|   16|   32|  381|
    |  23|16000|32000|  64|   16|   32|  381|
    |  23|16000|64000|  64|   16|   32|  749|
    |  23|32000|64000| 128|   32|   64| 1238|
    +----+-----+-----+----+-----+-----+-----+
    199 more rows...
    

Delimited Text and CSV

Delimited text files are widely used in the machine learning research community. The comma-separated values (CSV) file is a special case. Smile provides a flexible parser for them based on the Apache Commons CSV library.


    def csv(file: String, delimiter: Char = ',', header: Boolean = true, quote: Char = '"', escape: Char = '\\', schema: StructType = null): DataFrame
    

In the Java API, the user may provide a CSVFormat argument to specify the format of a CSV file.


    public interface Read {
        /** Reads a CSV file. */
        static DataFrame csv(String path) throws IOException, URISyntaxException

        /** Reads a CSV file. */
        static DataFrame csv(String path, CSVFormat format) throws IOException, URISyntaxException

        /** Reads a CSV file. */
        static DataFrame csv(String path, CSVFormat format, StructType schema) throws IOException, URISyntaxException

        /** Reads a CSV file. */
        static DataFrame csv(Path path) throws IOException

        /** Reads a CSV file. */
        static DataFrame csv(Path path, CSVFormat format) throws IOException

        /** Reads a CSV file. */
        static DataFrame csv(Path path, CSVFormat format, StructType schema) throws IOException
    }
          

    fun csv(file: String, delimiter: Char = ',', header: Boolean = true, quote: Char = '"', escape: Char = '\\', schema: StructType? = null): DataFrame
    

The parser tries its best to infer the schema of the data from the top rows.


    smile> val zip = read.csv("data/usps/zip.train", delimiter = ' ', header = false)
    zip: DataFrame = [V1: int, V2: double, V3: double, V4: double, V5: double, V6: double, V7: double, V8: double, V9: double, V10: double, V11: double, V12: double, V13: double, V14: double, V15: double, V16: double, V17: double, V18: double, V19: double, V20: double, V21: double, V22: double, V23: double, V24: double, V25: double, V26: double, V27: double, V28: double, V29: double, V30: double, V31: double, V32: double, V33: double, V34: double, V35: double, V36: double, V37: double, V38: double, V39: double, V40: double, V41: double, V42: double, V43: double, V44: double, V45: double, V46: double, V47: double, V48: double, V49: double, V50: double, V51: double, V52: double, V53: double, V54: double, V55: double, V56: double, V57: double, V58: double, V59: double, V60: double, V61: double, V62: double, V63: double, V64: double, V65: double, V66: double, V67: double, V68: double, V69: double, V70: double, V71: double, V72: double, V73: double, V74: double, V75: double, V76: double, V77: double, V78: double, V79: double, V80: double, V81: double, V82: double, V83: double, V84: double, V85: double, V86: double, V87: double, V88: double, V89: double, V90: double, V91: double, V92: double, V93: double, V94: double, V95: double, V96: double, V97: double, V98: double, V99: double, V100: double, V101: double, V102: double, V103: double, V104: double, V105: double, V106: double, V107: double, V108: double, V109: double, V110: double, V111: double, V112: double, V113: double, V114: double, V115: double, V116: double, V117: double, V118: double, V119: double, V120: double, V121: double, V122: double, V123: double, V124: double, V125: double, V126: double, V127: double, V128: double, V129: double, V130: double, V131: double, V132: double, V133: double, V134: double, V135: double, V136: double, V137: double, V138: double, V139: double, V140: double, V141: double, V142: double, V143: double, V144: double, V145: double, V146: double, V147: double, V148: double, V149: double, 
V150: double, V151: double, V152: double, V153: double, V154: double, V155: double, V156: double, V157: double, V158: double, V159: double, V160: double, V161: double, V162: double, V163: double, V164: double, V165: double, V166: double, V167: double, V168: double, V169: double, V170: double, V171: double, V172: double, V173: double, V174: double, V175: double, V176: double, V177: double, V178: double, V179: double, V180: double, V181: double, V182: double, V183: double, V184: double, V185: double, V186: double, V187: double, V188: double, V189: double, V190: double, V191: double, V192: double, V193: double, V194: double, V195: double, V196: double, V197: double, V198: double, V199: double, V200: double, V201: double, V202: double, V203: double, V204: double, V205: double, V206: double, V207: double, V208: double, V209: double, V210: double, V211: double, V212: double, V213: double, V214: double, V215: double, V216: double, V217: double, V218: double, V219: double, V220: double, V221: double, V222: double, V223: double, V224: double, V225: double, V226: double, V227: double, V228: double, V229: double, V230: double, V231: double, V232: double, V233: double, V234: double, V235: double, V236: double, V237: double, V238: double, V239: double, V240: double, V241: double, V242: double, V243: double, V244: double, V245: double, V246: double, V247: double, V248: double, V249: double, V250: double, V251: double, V252: double, V253: double, V254: double, V255: double, V256: double, V257: double]    

    smile> import org.apache.commons.csv.CSVFormat

    smile> var format = CSVFormat.DEFAULT.withDelimiter(' ')
    format ==> Delimiter=< > QuoteChar=<"> RecordSeparator=<
    >  ... red SkipHeaderRecord:false

    smile> var zip = Read.csv("data/usps/zip.train", format)
    zip ==> [V1: int, V2: double, V3: double, V4: double, V5: ...   -1|-0.454| 0.879|-0.745|
          

    >>> val zip = read.csv("data/usps/zip.train", delimiter = ' ', header = false)
    >>> zip
    res116: smile.data.DataFrame = [V1: int, V2: double, V3: double, V4: double, V5: double, V6: double, V7: double, V8: double, V9: double, V10: double, V11: double, V12: double, V13: double, V14: double, V15: double, V16: double, V17: double, V18: double, V19: double, V20: double, V21: double, V22: double, V23: double, V24: double, V25: double, V26: double, V27: double, V28: double, V29: double, V30: double, V31: double, V32: double, V33: double, V34: double, V35: double, V36: double, V37: double, V38: double, V39: double, V40: double, V41: double, V42: double, V43: double, V44: double, V45: double, V46: double, V47: double, V48: double, V49: double, V50: double, V51: double, V52: double, V53: double, V54: double, V55: double, V56: double, V57: double, V58: double, V59: double, V60: double, V61: double, V62: double, V63: double, V64: double, V65: double, V66: double, V67: double, V68: double, V69: double, V70: double, V71: double, V72: double, V73: double, V74: double, V75: double, V76: double, V77: double, V78: double, V79: double, V80: double, V81: double, V82: double, V83: double, V84: double, V85: double, V86: double, V87: double, V88: double, V89: double, V90: double, V91: double, V92: double, V93: double, V94: double, V95: double, V96: double, V97: double, V98: double, V99: double, V100: double, V101: double, V102: double, V103: double, V104: double, V105: double, V106: double, V107: double, V108: double, V109: double, V110: double, V111: double, V112: double, V113: double, V114: double, V115: double, V116: double, V117: double, V118: double, V119: double, V120: double, V121: double, V122: double, V123: double, V124: double, V125: double, V126: double, V127: double, V128: double, V129: double, V130: double, V131: double, V132: double, V133: double, V134: double, V135: double, V136: double, V137: double, V138: double, V139: double, V140: double, V141: double, V142: double, V143: double, V144: double, V145: double, V146: double, V147: double, V148: double, 
V149: double, V150: double, V151: double, V152: double, V153: double, V154: double, V155: double, V156: double, V157: double, V158: double, V159: double, V160: double, V161: double, V162: double, V163: double, V164: double, V165: double, V166: double, V167: double, V168: double, V169: double, V170: double, V171: double, V172: double, V173: double, V174: double, V175: double, V176: double, V177: double, V178: double, V179: double, V180: double, V181: double, V182: double, V183: double, V184: double, V185: double, V186: double, V187: double, V188: double, V189: double, V190: double, V191: double, V192: double, V193: double, V194: double, V195: double, V196: double, V197: double, V198: double, V199: double, V200: double, V201: double, V202: double, V203: double, V204: double, V205: double, V206: double, V207: double, V208: double, V209: double, V210: double, V211: double, V212: double, V213: double, V214: double, V215: double, V216: double, V217: double, V218: double, V219: double, V220: double, V221: double, V222: double, V223: double, V224: double, V225: double, V226: double, V227: double, V228: double, V229: double, V230: double, V231: double, V232: double, V233: double, V234: double, V235: double, V236: double, V237: double, V238: double, V239: double, V240: double, V241: double, V242: double, V243: double, V244: double, V245: double, V246: double, V247: double, V248: double, V249: double, V250: double, V251: double, V252: double, V253: double, V254: double, V255: double, V256: double, V257: double]
    ...
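
One simple way to picture inference from the top rows is sketched below with the JDK only. This is not Smile's actual algorithm. It assumes a deliberately small rule set (try int, then double, otherwise fall back to string), assumes every row has the same number of fields, and widens a column whenever a sampled row needs a wider type. The class `InferSketch` and its members are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

class InferSketch {
    enum Kind { INT, DOUBLE, STRING }   // ordered from narrowest to widest

    /** Returns the narrowest kind that can represent a single field. */
    static Kind kindOf(String field) {
        try {
            Integer.parseInt(field);
            return Kind.INT;
        } catch (NumberFormatException e) {
            // not an int, try the next kind
        }
        try {
            Double.parseDouble(field);
            return Kind.DOUBLE;
        } catch (NumberFormatException e) {
            // not a number at all
        }
        return Kind.STRING;
    }

    /** Infers one kind per column from at most the first `limit` rows. */
    static Kind[] infer(List<String[]> rows, int limit) {
        if (rows.isEmpty()) return new Kind[0];
        Kind[] schema = new Kind[rows.get(0).length];
        Arrays.fill(schema, Kind.INT);   // start narrow, widen as needed
        for (String[] row : rows.subList(0, Math.min(limit, rows.size()))) {
            for (int j = 0; j < schema.length; j++) {
                Kind k = kindOf(row[j].trim());
                if (k.compareTo(schema[j]) > 0) schema[j] = k;
            }
        }
        return schema;
    }
}
```

Any scheme of this kind only sees the sampled rows. A column that looks integral near the top of the file but contains decimals or text further down would be typed too narrowly, which is one situation where supplying the schema explicitly is the safer choice.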
    

If the parser fails to infer the schema, the user may provide a predefined schema.


smile> val airport = new NominalScale("ABE", "ABI", "ABQ", "ABY", "ACK", "ACT",
         "ACV", "ACY", "ADK", "ADQ", "AEX", "AGS", "AKN", "ALB", "ALO", "AMA", "ANC",
         "APF", "ASE", "ATL", "ATW", "AUS", "AVL", "AVP", "AZO", "BDL", "BET", "BFL",
         "BGM", "BGR", "BHM", "BIL", "BIS", "BJI", "BLI", "BMI", "BNA", "BOI", "BOS",
         "BPT", "BQK", "BQN", "BRO", "BRW", "BTM", "BTR", "BTV", "BUF", "BUR", "BWI",
         "BZN", "CAE", "CAK", "CDC", "CDV", "CEC", "CHA", "CHO", "CHS", "CIC", "CID",
         "CKB", "CLD", "CLE", "CLL", "CLT", "CMH", "CMI", "CMX", "COD", "COS", "CPR",
         "CRP", "CRW", "CSG", "CVG", "CWA", "CYS", "DAB", "DAL", "DAY", "DBQ", "DCA",
         "DEN", "DFW", "DHN", "DLG", "DLH", "DRO", "DSM", "DTW", "EAU", "EGE", "EKO",
         "ELM", "ELP", "ERI", "EUG", "EVV", "EWN", "EWR", "EYW", "FAI", "FAR", "FAT",
         "FAY", "FCA", "FLG", "FLL", "FLO", "FMN", "FNT", "FSD", "FSM", "FWA", "GEG",
         "GFK", "GGG", "GJT", "GNV", "GPT", "GRB", "GRK", "GRR", "GSO", "GSP", "GST",
         "GTF", "GTR", "GUC", "HDN", "HHH", "HKY", "HLN", "HNL", "HOU", "HPN", "HRL",
         "HSV", "HTS", "HVN", "IAD", "IAH", "ICT", "IDA", "ILG", "ILM", "IND", "INL",
         "IPL", "ISO", "ISP", "ITO", "IYK", "JAC", "JAN", "JAX", "JFK", "JNU", "KOA",
         "KTN", "LAN", "LAR", "LAS", "LAW", "LAX", "LBB", "LBF", "LCH", "LEX", "LFT",
         "LGA", "LGB", "LIH", "LIT", "LNK", "LRD", "LSE", "LWB", "LWS", "LYH", "MAF",
         "MBS", "MCI", "MCN", "MCO", "MDT", "MDW", "MEI", "MEM", "MFE", "MFR", "MGM",
         "MHT", "MIA", "MKE", "MLB", "MLI", "MLU", "MOB", "MOD", "MOT", "MQT", "MRY",
         "MSN", "MSO", "MSP", "MSY", "MTH", "MTJ", "MYR", "OAJ", "OAK", "OGD", "OGG",
         "OKC", "OMA", "OME", "ONT", "ORD", "ORF", "OTZ", "OXR", "PBI", "PDX", "PFN",
         "PHF", "PHL", "PHX", "PIA", "PIE", "PIH", "PIT", "PLN", "PMD", "PNS", "PSC",
         "PSE", "PSG", "PSP", "PUB", "PVD", "PVU", "PWM", "RAP", "RCA", "RDD", "RDM",
         "RDU", "RFD", "RHI", "RIC", "RNO", "ROA", "ROC", "ROW", "RST", "RSW", "SAN",
         "SAT", "SAV", "SBA", "SBN", "SBP", "SCC", "SCE", "SDF", "SEA", "SFO", "SGF",
         "SGU", "SHV", "SIT", "SJC", "SJT", "SJU", "SLC", "SLE", "SMF", "SMX", "SNA",
         "SOP", "SPI", "SPS", "SRQ", "STL", "STT", "STX", "SUN", "SUX", "SWF", "SYR",
         "TEX", "TLH", "TOL", "TPA", "TRI", "TTN", "TUL", "TUP", "TUS", "TVC", "TWF",
         "TXK", "TYR", "TYS", "VCT", "VIS", "VLD", "VPS", "WRG", "WYS", "XNA", "YAK",
         "YKM", "YUM")
airport: NominalScale = nominal[ABE, ABI, ABQ, ABY, ACK, ACT, ACV, ACY, ADK, ADQ, AEX, AGS, AKN, ALB, ALO, AMA, ANC, APF, ASE, ATL, ATW, AUS, AVL, AVP, AZO, BDL, BET, BFL, BGM, BGR, BHM, BIL, BIS, BJI, BLI, BMI, BNA, BOI, BOS, BPT, BQK, BQN, BRO, BRW, BTM, BTR, BTV, BUF, BUR, BWI, BZN, CAE, CAK, CDC, CDV, CEC, CHA, CHO, CHS, CIC, CID, CKB, CLD, CLE, CLL, CLT, CMH, CMI, CMX, COD, COS, CPR, CRP, CRW, CSG, CVG, CWA, CYS, DAB, DAL, DAY, DBQ, DCA, DEN, DFW, DHN, DLG, DLH, DRO, DSM, DTW, EAU, EGE, EKO, ELM, ELP, ERI, EUG, EVV, EWN, EWR, EYW, FAI, FAR, FAT, FAY, FCA, FLG, FLL, FLO, FMN, FNT, FSD, FSM, FWA, GEG, GFK, GGG, GJT, GNV, GPT, GRB, GRK, GRR, GSO, GSP, GST, GTF, GTR, GUC, HDN, HHH, HKY, HLN, HNL, HOU, HPN, HRL, HSV, HTS, HVN, IAD, IAH, ICT, IDA, ILG, ILM, IND, INL, IPL, ISO, ISP, ITO, IYK, JAC, JAN, JAX, JFK, JNU, KOA, KTN, LAN, LAR, LAS, LAW, LAX, LBB, LBF, LCH, LEX, LFT, LGA, LGB, LIH, LIT, LNK, LRD, LSE, LWB, LWS, LYH, MAF, MBS, MCI, MCN, MCO, MDT, MDW, MEI, MEM, MFE, MFR, MGM, MHT, MIA, MKE, MLB, MLI, MLU, MOB, MOD, MOT, MQT, MRY, MSN, MSO, MSP, MSY, MTH, MTJ, MYR, OAJ, OAK, OGD, OGG, OKC, OMA, OME, ONT, ORD, ORF, OTZ, OXR, PBI, PDX, PFN, PHF, PHL, PHX, PIA, PIE, PIH, PIT, PLN, PMD, PNS, PSC, PSE, PSG, PSP, PUB, PVD, PVU, PWM, RAP, RCA, RDD, RDM, RDU, RFD, RHI, RIC, RNO, ROA, ROC, ROW, RST, RSW, SAN, SAT, SAV, SBA, SBN, SBP, SCC, SCE, SDF, SEA, SFO, SGF, SGU, SHV, SIT, SJC, SJT, SJU, SLC, SLE, SMF, SMX, SNA, SOP, SPI, SPS, SRQ, STL, STT, STX, SUN, SUX, SWF, SYR, TEX, TLH, TOL, TPA, TRI, TTN, TUL, TUP, TUS, TVC, TWF, TXK, TYR, TYS, VCT, VIS, VLD, VPS, WRG, WYS, XNA, YAK, YKM, YUM]

smile> val schema = DataTypes.struct(
         new StructField("Month", DataTypes.ByteType, new NominalScale("c-1", "c-2", "c-3", "c-4",
           "c-5", "c-6", "c-7", "c-8", "c-9", "c-10", "c-11", "c-12")),
         new StructField("DayofMonth", DataTypes.ByteType, new NominalScale("c-1", "c-2", "c-3", "c-4",
           "c-5", "c-6", "c-7", "c-8", "c-9", "c-10", "c-11", "c-12", "c-13", "c-14", "c-15", "c-16", "c-17", "c-18",
           "c-19", "c-20", "c-21", "c-22", "c-23", "c-24", "c-25", "c-26", "c-27", "c-28", "c-29", "c-30", "c-31")),
         new StructField("DayOfWeek", DataTypes.ByteType, new NominalScale("c-1", "c-2", "c-3", "c-4",
           "c-5", "c-6", "c-7")),
         new StructField("DepTime", DataTypes.IntegerType),
         new StructField("UniqueCarrier", DataTypes.ByteType, new NominalScale("9E", "AA", "AQ", "AS",
           "B6", "CO", "DH", "DL", "EV", "F9", "FL", "HA", "HP", "MQ", "NW", "OH", "OO", "TZ", "UA", "US", "WN", "XE", "YV")),
         new StructField("Origin", DataTypes.ShortType, airport),
         new StructField("Dest", DataTypes.ShortType, airport),
         new StructField("Distance", DataTypes.IntegerType),
         new StructField("dep_delayed_15min", DataTypes.ByteType, new NominalScale("N", "Y"))
       )
schema: StructType = [Month: byte nominal[c-1, c-2, c-3, c-4, c-5, c-6, c-7, c-8, c-9, c-10, c-11, c-12], DayofMonth: byte nominal[c-1, c-2, c-3, c-4, c-5, c-6, c-7, c-8, c-9, c-10, c-11, c-12, c-13, c-14, c-15, c-16, c-17, c-18, c-19, c-20, c-21, c-22, c-23, c-24, c-25, c-26, c-27, c-28, c-29, c-30, c-31], DayOfWeek: byte nominal[c-1, c-2, c-3, c-4, c-5, c-6, c-7], DepTime: int, UniqueCarrier: byte nominal[9E, AA, AQ, AS, B6, CO, DH, DL, EV, F9, FL, HA, HP, MQ, NW, OH, OO, TZ, UA, US, WN, XE, YV], Origin: short nominal[ABE, ABI, ABQ, ABY, ACK, ACT, ACV, ACY, ADK, ADQ, AEX, AGS, AKN, ALB, ALO, AMA, ANC, APF, ASE, ATL, ATW, AUS, AVL, AVP, AZO, BDL, BET, BFL, BGM, BGR, BHM, BIL, BIS, BJI, BLI, BMI, BNA, BOI, BOS, BPT, BQK, BQN, BRO, BRW, BTM, BTR, BTV, BUF, BUR, BWI, BZN, CAE, CAK, CDC, CDV, CEC, CHA, CHO, CHS, CIC, CID, CKB, CLD, CLE, CLL, CLT, CMH, CMI, CMX, COD, COS, CPR, CRP, CRW, CSG, CVG, CWA, CYS, DAB, DAL, DAY, DBQ, DCA, DEN, DFW, DHN, DLG, DLH, DRO, DSM, DTW, EAU, EGE, EKO, ELM, ELP, ERI, EUG, EVV, EWN, EWR, EYW, FAI, FAR, FAT, FAY, FCA, FLG, FLL, FLO, FMN, FNT, FSD, FSM, FWA, GEG, GFK, GGG, GJT, GNV, GPT, GRB, GRK, GRR, GSO, GSP, GST, GTF, GTR, GUC, HDN, HHH, HKY, HLN, HNL, HOU, HPN, HRL, HSV, HTS, HVN, IAD, IAH, ICT, IDA, ILG, ILM, IND, INL, IPL, ISO, ISP, ITO, IYK, JAC, JAN, JAX, JFK, JNU, KOA, KTN, LAN, LAR, LAS, LAW, LAX, LBB, LBF, LCH, LEX, LFT, LGA, LGB, LIH, LIT, LNK, LRD, LSE, LWB, LWS, LYH, MAF, MBS, MCI, MCN, MCO, MDT, MDW, MEI, MEM, MFE, MFR, MGM, MHT, MIA, MKE, MLB, MLI, MLU, MOB, MOD, MOT, MQT, MRY, MSN, MSO, MSP, MSY, MTH, MTJ, MYR, OAJ, OAK, OGD, OGG, OKC, OMA, OME, ONT, ORD, ORF, OTZ, OXR, PBI, PDX, PFN, PHF, PHL, PHX, PIA, PIE, PIH, PIT, PLN, PMD, PNS, PSC, PSE, PSG, PSP, PUB, PVD, PVU, PWM, RAP, RCA, RDD, RDM, RDU, RFD, RHI, RIC, RNO, ROA, ROC, ROW, RST, RSW, SAN, SAT, SAV, SBA, SBN, SBP, SCC, SCE, SDF, SEA, SFO, SGF, SGU, SHV, SIT, SJC, SJT, SJU, SLC, SLE, SMF, SMX, SNA, SOP, SPI, SPS, SRQ, STL, STT, STX, SUN, SUX, SWF, SYR, TEX, TLH, 
TOL, TPA, TRI, TTN, TUL, TUP, TUS, TVC, TWF, TXK, TYR, TYS, VCT, VIS, VLD, VPS, WRG, WYS, XNA, YAK, YKM, YUM], Dest: short nominal[ABE, ABI, ABQ, ABY, ACK, ACT, ACV, ACY, ADK, ADQ, AEX, AGS, AKN, ALB, ALO, AMA, ANC, APF, ASE, ATL, ATW, AUS, AVL, AVP, AZO, BDL, BET, BFL, BGM, BGR, BHM, BIL, BIS, BJI, BLI, BMI, BNA, BOI, BOS, BPT, BQK, BQN, BRO, BRW, BTM, BTR, BTV, BUF, BUR, BWI, BZN, CAE, CAK, CDC, CDV, CEC, CHA, CHO, CHS, CIC, CID, CKB, CLD, CLE, CLL, CLT, CMH, CMI, CMX, COD, COS, CPR, CRP, CRW, CSG, CVG, CWA, CYS, DAB, DAL, DAY, DBQ, DCA, DEN, DFW, DHN, DLG, DLH, DRO, DSM, DTW, EAU, EGE, EKO, ELM, ELP, ERI, EUG, EVV, EWN, EWR, EYW, FAI, FAR, FAT, FAY, FCA, FLG, FLL, FLO, FMN, FNT, FSD, FSM, FWA, GEG, GFK, GGG, GJT, GNV, GPT, GRB, GRK, GRR, GSO, GSP, GST, GTF, GTR, GUC, HDN, HHH, HKY, HLN, HNL, HOU, HPN, HRL, HSV, HTS, HVN, IAD, IAH, ICT, IDA, ILG, ILM, IND, INL, IPL, ISO, ISP, ITO, IYK, JAC, JAN, JAX, JFK, JNU, KOA, KTN, LAN, LAR, LAS, LAW, LAX, LBB, LBF, LCH, LEX, LFT, LGA, LGB, LIH, LIT, LNK, LRD, LSE, LWB, LWS, LYH, MAF, MBS, MCI, MCN, MCO, MDT, MDW, MEI, MEM, MFE, MFR, MGM, MHT, MIA, MKE, MLB, MLI, MLU, MOB, MOD, MOT, MQT, MRY, MSN, MSO, MSP, MSY, MTH, MTJ, MYR, OAJ, OAK, OGD, OGG, OKC, OMA, OME, ONT, ORD, ORF, OTZ, OXR, PBI, PDX, PFN, PHF, PHL, PHX, PIA, PIE, PIH, PIT, PLN, PMD, PNS, PSC, PSE, PSG, PSP, PUB, PVD, PVU, PWM, RAP, RCA, RDD, RDM, RDU, RFD, RHI, RIC, RNO, ROA, ROC, ROW, RST, RSW, SAN, SAT, SAV, SBA, SBN, SBP, SCC, SCE, SDF, SEA, SFO, SGF, SGU, SHV, SIT, SJC, SJT, SJU, SLC, SLE, SMF, SMX, SNA, SOP, SPI, SPS, SRQ, STL, STT, STX, SUN, SUX, SWF, SYR, TEX, TLH, TOL, TPA, TRI, TTN, TUL, TUP, TUS, TVC, TWF, TXK, TYR, TYS, VCT, VIS, VLD, VPS, WRG, WYS, XNA, YAK, YKM, YUM], Distance: int, dep_delayed_15min: byte nominal[N, Y]]

smile> val airline = read.csv("shell/src/universal/data/airline/train-1m.csv", schema = schema)
airline: DataFrame = [Month: byte nominal[c-1, c-2, c-3, c-4, c-5, c-6, c-7, c-8, c-9, c-10, c-11, c-12], DayofMonth: byte nominal[c-1, c-2, c-3, c-4, c-5, c-6, c-7, c-8, c-9, c-10, c-11, c-12, c-13, c-14, c-15, c-16, c-17, c-18, c-19, c-20, c-21, c-22, c-23, c-24, c-25, c-26, c-27, c-28, c-29, c-30, c-31], DayOfWeek: byte nominal[c-1, c-2, c-3, c-4, c-5, c-6, c-7], DepTime: int, UniqueCarrier: byte nominal[9E, AA, AQ, AS, B6, CO, DH, DL, EV, F9, FL, HA, HP, MQ, NW, OH, OO, TZ, UA, US, WN, XE, YV], Origin: short nominal[ABE, ABI, ABQ, ABY, ACK, ACT, ACV, ACY, ADK, ADQ, AEX, AGS, AKN, ALB, ALO, AMA, ANC, APF, ASE, ATL, ATW, AUS, AVL, AVP, AZO, BDL, BET, BFL, BGM, BGR, BHM, BIL, BIS, BJI, BLI, BMI, BNA, BOI, BOS, BPT, BQK, BQN, BRO, BRW, BTM, BTR, BTV, BUF, BUR, BWI, BZN, CAE, CAK, CDC, CDV, CEC, CHA, CHO, CHS, CIC, CID, CKB, CLD, CLE, CLL, CLT, CMH, CMI, CMX, COD, COS, CPR, CRP, CRW, CSG, CVG, CWA, CYS, DAB, DAL, DAY, DBQ, DCA, DEN, DFW, DHN, DLG, DLH, DRO, DSM, DTW, EAU, EGE, EKO, ELM, ELP, ERI, EUG, EVV, EWN, EWR, EYW, FAI, FAR, FAT, FAY, FCA, FLG, FLL, FLO, FMN, FNT, FSD, FSM, FWA, GEG, GFK, GGG, GJT, GNV, GPT, GRB, GRK, GRR, GSO, GSP, GST, GTF, GTR, GUC, HDN, HHH, HKY, HLN, HNL, HOU, HPN, HRL, HSV, HTS, HVN, IAD, IAH, ICT, IDA, ILG, ILM, IND, INL, IPL, ISO, ISP, ITO, IYK, JAC, JAN, JAX, JFK, JNU, KOA, KTN, LAN, LAR, LAS, LAW, LAX, LBB, LBF, LCH, LEX, LFT, LGA, LGB, LIH, LIT, LNK, LRD, LSE, LWB, LWS, LYH, MAF, MBS, MCI, MCN, MCO, MDT, MDW, MEI, MEM, MFE, MFR, MGM, MHT, MIA, MKE, MLB, MLI, MLU, MOB, MOD, MOT, MQT, MRY, MSN, MSO, MSP, MSY, MTH, MTJ, MYR, OAJ, OAK, OGD, OGG, OKC, OMA, OME, ONT, ORD, ORF, OTZ, OXR, PBI, PDX, PFN, PHF, PHL, PHX, PIA, PIE, PIH, PIT, PLN, PMD, PNS, PSC, PSE, PSG, PSP, PUB, PVD, PVU, PWM, RAP, RCA, RDD, RDM, RDU, RFD, RHI, RIC, RNO, ROA, ROC, ROW, RST, RSW, SAN, SAT, SAV, SBA, SBN, SBP, SCC, SCE, SDF, SEA, SFO, SGF, SGU, SHV, SIT, SJC, SJT, SJU, SLC, SLE, SMF, SMX, SNA, SOP, SPI, SPS, SRQ, STL, STT, STX, SUN, SUX, SWF, SYR, TEX, TLH, 
TOL, TPA, TRI, TTN, TUL, TUP, TUS, TVC, TWF, TXK, TYR, TYS, VCT, VIS, VLD, VPS, WRG, WYS, XNA, YAK, YKM, YUM], Dest: short nominal[ABE, ABI, ABQ, ABY, ACK, ACT, ACV, ACY, ADK, ADQ, AEX, AGS, AKN, ALB, ALO, AMA, ANC, APF, ASE, ATL, ATW, AUS, AVL, AVP, AZO, BDL, BET, BFL, BGM, BGR, BHM, BIL, BIS, BJI, BLI, BMI, BNA, BOI, BOS, BPT, BQK, BQN, BRO, BRW, BTM, BTR, BTV, BUF, BUR, BWI, BZN, CAE, CAK, CDC, CDV, CEC, CHA, CHO, CHS, CIC, CID, CKB, CLD, CLE, CLL, CLT, CMH, CMI, CMX, COD, COS, CPR, CRP, CRW, CSG, CVG, CWA, CYS, DAB, DAL, DAY, DBQ, DCA, DEN, DFW, DHN, DLG, DLH, DRO, DSM, DTW, EAU, EGE, EKO, ELM, ELP, ERI, EUG, EVV, EWN, EWR, EYW, FAI, FAR, FAT, FAY, FCA, FLG, FLL, FLO, FMN, FNT, FSD, FSM, FWA, GEG, GFK, GGG, GJT, GNV, GPT, GRB, GRK, GRR, GSO, GSP, GST, GTF, GTR, GUC, HDN, HHH, HKY, HLN, HNL, HOU, HPN, HRL, HSV, HTS, HVN, IAD, IAH, ICT, IDA, ILG, ILM, IND, INL, IPL, ISO, ISP, ITO, IYK, JAC, JAN, JAX, JFK, JNU, KOA, KTN, LAN, LAR, LAS, LAW, LAX, LBB, LBF, LCH, LEX, LFT, LGA, LGB, LIH, LIT, LNK, LRD, LSE, LWB, LWS, LYH, MAF, MBS, MCI, MCN, MCO, MDT, MDW, MEI, MEM, MFE, MFR, MGM, MHT, MIA, MKE, MLB, MLI, MLU, MOB, MOD, MOT, MQT, MRY, MSN, MSO, MSP, MSY, MTH, MTJ, MYR, OAJ, OAK, OGD, OGG, OKC, OMA, OME, ONT, ORD, ORF, OTZ, OXR, PBI, PDX, PFN, PHF, PHL, PHX, PIA, PIE, PIH, PIT, PLN, PMD, PNS, PSC, PSE, PSG, PSP, PUB, PVD, PVU, PWM, RAP, RCA, RDD, RDM, RDU, RFD, RHI, RIC, RNO, ROA, ROC, ROW, RST, RSW, SAN, SAT, SAV, SBA, SBN, SBP, SCC, SCE, SDF, SEA, SFO, SGF, SGU, SHV, SIT, SJC, SJT, SJU, SLC, SLE, SMF, SMX, SNA, SOP, SPI, SPS, SRQ, STL, STT, STX, SUN, SUX, SWF, SYR, TEX, TLH, TOL, TPA, TRI, TTN, TUL, TUP, TUS, TVC, TWF, TXK, TYR, TYS, VCT, VIS, VLD, VPS, WRG, WYS, XNA, YAK, YKM, YUM], Distance: int, dep_delayed_15min: byte nominal[N, Y]]
+-----+----------+---------+-------+-------------+------+----+--------+-----------------+
|Month|DayofMonth|DayOfWeek|DepTime|UniqueCarrier|Origin|Dest|Distance|dep_delayed_15min|
...
                

smile> import smile.data.type.*

smile> import smile.data.measure.*

smile> var airport = new NominalScale("ABE", "ABI", "ABQ", "ABY", "ACK", "ACT",
   ...>   "ACV", "ACY", "ADK", "ADQ", "AEX", "AGS", "AKN", "ALB", "ALO", "AMA", "ANC",
   ...>   "APF", "ASE", "ATL", "ATW", "AUS", "AVL", "AVP", "AZO", "BDL", "BET", "BFL",
   ...>   "BGM", "BGR", "BHM", "BIL", "BIS", "BJI", "BLI", "BMI", "BNA", "BOI", "BOS",
   ...>   "BPT", "BQK", "BQN", "BRO", "BRW", "BTM", "BTR", "BTV", "BUF", "BUR", "BWI",
   ...>   "BZN", "CAE", "CAK", "CDC", "CDV", "CEC", "CHA", "CHO", "CHS", "CIC", "CID",
   ...>   "CKB", "CLD", "CLE", "CLL", "CLT", "CMH", "CMI", "CMX", "COD", "COS", "CPR",
   ...>   "CRP", "CRW", "CSG", "CVG", "CWA", "CYS", "DAB", "DAL", "DAY", "DBQ", "DCA",
   ...>   "DEN", "DFW", "DHN", "DLG", "DLH", "DRO", "DSM", "DTW", "EAU", "EGE", "EKO",
   ...>   "ELM", "ELP", "ERI", "EUG", "EVV", "EWN", "EWR", "EYW", "FAI", "FAR", "FAT",
   ...>   "FAY", "FCA", "FLG", "FLL", "FLO", "FMN", "FNT", "FSD", "FSM", "FWA", "GEG",
   ...>   "GFK", "GGG", "GJT", "GNV", "GPT", "GRB", "GRK", "GRR", "GSO", "GSP", "GST",
   ...>   "GTF", "GTR", "GUC", "HDN", "HHH", "HKY", "HLN", "HNL", "HOU", "HPN", "HRL",
   ...>   "HSV", "HTS", "HVN", "IAD", "IAH", "ICT", "IDA", "ILG", "ILM", "IND", "INL",
   ...>   "IPL", "ISO", "ISP", "ITO", "IYK", "JAC", "JAN", "JAX", "JFK", "JNU", "KOA",
   ...>   "KTN", "LAN", "LAR", "LAS", "LAW", "LAX", "LBB", "LBF", "LCH", "LEX", "LFT",
   ...>   "LGA", "LGB", "LIH", "LIT", "LNK", "LRD", "LSE", "LWB", "LWS", "LYH", "MAF",
   ...>   "MBS", "MCI", "MCN", "MCO", "MDT", "MDW", "MEI", "MEM", "MFE", "MFR", "MGM",
   ...>   "MHT", "MIA", "MKE", "MLB", "MLI", "MLU", "MOB", "MOD", "MOT", "MQT", "MRY",
   ...>   "MSN", "MSO", "MSP", "MSY", "MTH", "MTJ", "MYR", "OAJ", "OAK", "OGD", "OGG",
   ...>   "OKC", "OMA", "OME", "ONT", "ORD", "ORF", "OTZ", "OXR", "PBI", "PDX", "PFN",
   ...>   "PHF", "PHL", "PHX", "PIA", "PIE", "PIH", "PIT", "PLN", "PMD", "PNS", "PSC",
   ...>   "PSE", "PSG", "PSP", "PUB", "PVD", "PVU", "PWM", "RAP", "RCA", "RDD", "RDM",
   ...>   "RDU", "RFD", "RHI", "RIC", "RNO", "ROA", "ROC", "ROW", "RST", "RSW", "SAN",
   ...>   "SAT", "SAV", "SBA", "SBN", "SBP", "SCC", "SCE", "SDF", "SEA", "SFO", "SGF",
   ...>   "SGU", "SHV", "SIT", "SJC", "SJT", "SJU", "SLC", "SLE", "SMF", "SMX", "SNA",
   ...>   "SOP", "SPI", "SPS", "SRQ", "STL", "STT", "STX", "SUN", "SUX", "SWF", "SYR",
   ...>   "TEX", "TLH", "TOL", "TPA", "TRI", "TTN", "TUL", "TUP", "TUS", "TVC", "TWF",
   ...>   "TXK", "TYR", "TYS", "VCT", "VIS", "VLD", "VPS", "WRG", "WYS", "XNA", "YAK",
   ...>   "YKM", "YUM")
airport ==> nominal[ABE, ABI, ABQ, ABY, ACK, ACT, ACV, ACY, A ... , WYS, XNA, YAK, YKM, YUM]

smile> var schema = DataTypes.struct(
   ...>   new StructField("Month", DataTypes.ByteType, new NominalScale("c-1", "c-2", "c-3", "c-4",
   ...>     "c-5", "c-6", "c-7", "c-8", "c-9", "c-10", "c-11", "c-12")),
   ...>   new StructField("DayofMonth", DataTypes.ByteType, new NominalScale("c-1", "c-2", "c-3", "c-4",
   ...>     "c-5", "c-6", "c-7", "c-8", "c-9", "c-10", "c-11", "c-12", "c-13", "c-14", "c-15", "c-16", "c-17", "c-18",
   ...>     "c-19", "c-20", "c-21", "c-22", "c-23", "c-24", "c-25", "c-26", "c-27", "c-28", "c-29", "c-30", "c-31")),
   ...>   new StructField("DayOfWeek", DataTypes.ByteType, new NominalScale("c-1", "c-2", "c-3", "c-4",
   ...>     "c-5", "c-6", "c-7")),
   ...>   new StructField("DepTime", DataTypes.IntegerType),
   ...>   new StructField("UniqueCarrier", DataTypes.ByteType, new NominalScale("9E", "AA", "AQ", "AS",
   ...>     "B6", "CO", "DH", "DL", "EV", "F9", "FL", "HA", "HP", "MQ", "NW", "OH", "OO", "TZ", "UA", "US", "WN", "XE", "YV")),
   ...>   new StructField("Origin", DataTypes.ShortType, airport),
   ...>   new StructField("Dest", DataTypes.ShortType, airport),
   ...>   new StructField("Distance", DataTypes.IntegerType),
   ...>   new StructField("dep_delayed_15min", DataTypes.ByteType, new NominalScale("N", "Y"))
   ...> )
schema ==> [Month: byte nominal[c-1, c-2, c-3, c-4, c-5, c-6 ... 15min: byte nominal[N, Y]]

smile> var format = CSVFormat.DEFAULT.withFirstRecordAsHeader();
format ==> Delimiter=<,> QuoteChar=<"> RecordSeparator=<
>  ... eaderRecord:true Header:[]

smile> var airline = Read.csv("data/airline/train-1m.csv", format, schema);
airline ==> [Month: byte nominal[c-1, c-2, c-3, c-4, c-5, c-6 ... ----+
999990 more rows...
          

LibSVM

LibSVM is a very fast and popular library for support vector machines. LibSVM uses a sparse format where zero values do not need to be stored. Each line of a libsvm file is in the format:


    <label> <index1>:<value1> <index2>:<value2> ...
    

where <label> is the target value of the training data. For classification, it should be an integer that identifies a class (multi-class classification is supported). For regression, it is any real number. For one-class SVM, it is not used, so it can be any number. <index> is an integer starting from 1, and <value> is a real number. The indices must be in ascending order. The labels in the testing data file are only used to calculate accuracy or error. If they are unknown, just fill this column with any number.
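To make the line format concrete, the following is a minimal sketch in plain Java that parses a single libsvm line into a sparse map and expands it to a dense array. It is not Smile's reader: the class LibsvmLine and its methods parse and toDense are hypothetical names introduced only for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for illustration only; this is not Smile's libsvm reader. */
class LibsvmLine {
    /** Target value: a class id for classification, a real number for regression. */
    final double label;
    /** Maps each 1-based feature index to its value; missing indices are zero. */
    final TreeMap<Integer, Double> features = new TreeMap<>();

    LibsvmLine(double label) {
        this.label = label;
    }

    /** Parses one line of the form "<label> <index1>:<value1> <index2>:<value2> ...". */
    static LibsvmLine parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        LibsvmLine row = new LibsvmLine(Double.parseDouble(tokens[0]));
        int previous = 0; // indices start from 1 and must be ascending
        for (int k = 1; k < tokens.length; k++) {
            String[] pair = tokens[k].split(":");
            if (pair.length != 2) {
                throw new IllegalArgumentException("Expected index:value but got " + tokens[k]);
            }
            int index = Integer.parseInt(pair[0]);
            if (index <= previous) {
                throw new IllegalArgumentException("Indices must start from 1 and ascend: " + tokens[k]);
            }
            row.features.put(index, Double.parseDouble(pair[1]));
            previous = index;
        }
        return row;
    }

    /** Expands the sparse row into a dense array; index i lands at position i - 1. */
    double[] toDense(int width) {
        double[] x = new double[width];
        for (Map.Entry<Integer, Double> entry : features.entrySet()) {
            x[entry.getKey() - 1] = entry.getValue();
        }
        return x;
    }
}
```

For example, LibsvmLine.parse("1 1:0.5 3:-2").toDense(3) gives the array {0.5, 0.0, -2.0}: the second feature is absent from the line and is therefore zero.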

To read a libsvm file, smile.io provides the function read.libsvm (Read.libsvm in the Java API), which takes the path of the file.

Although libsvm employs a sparse format, most libsvm files contain dense data. Therefore, Smile also provides helper functions to convert it to dense arrays.


    smile> val glass = read.libsvm("data/libsvm/glass.txt")
    glass: Dataset[Instance[SparseArray]] = smile.data.DatasetImpl@5611bba
    

    smile> var glass = Read.libsvm("data/libsvm/glass.txt")
    glass ==> smile.data.DatasetImpl@524f3b3a
          

    >>> read.libsvm("data/libsvm/glass.txt")
    res118: smile.data.Dataset<smile.data.SampleInstance<smile.util.SparseArray>> = smile.data.DatasetImpl@50d667c3
    

In case of truly sparse libsvm data, we can convert it to SparseMatrix for more efficient matrix computation.


    smile> SparseDataset.of(glass).toMatrix
    res2: SparseMatrix = smile.math.matrix.SparseMatrix@290807e5
    

    smile> var glass = Read.libsvm("data/libsvm/glass.txt")
    glass ==> smile.data.DatasetImpl@17baae6e

    smile> SparseDataset.of(glass).toMatrix()
    $4 ==> smile.math.matrix.SparseMatrix@6b53e23f
          

    >>> SparseDataset.of(glass).toMatrix()
    res120: smile.math.matrix.SparseMatrix! = smile.math.matrix.SparseMatrix@45db84b0
    

Note that read.libsvm returns a Dataset[Instance[SparseArray]] object. The Instance class holds both the sample object and its label. To convert the sample set to a sparse matrix, we first convert the Dataset object to a SparseDataset, which doesn't have the label. We discuss the details of SparseDataset in the next section.

Coordinate Triple Tuple List

The function SparseDataset.from(Path path, int arrayIndexOrigin) can read sparse data in coordinate triple tuple list format. The parameter arrayIndexOrigin is the starting index of the array. By default, it is 0, as in C/C++ and Java, but it can be set to 1 to parse data produced by other programming languages such as Fortran.

The coordinate file stores a list of (row, column, value) tuples:

    instanceID attributeID value
    instanceID attributeID value
    instanceID attributeID value
    instanceID attributeID value
    ...
    instanceID attributeID value
    instanceID attributeID value
    instanceID attributeID value
    

Ideally, the entries are sorted (by row index, then column index) to improve random access times. This format is good for incremental matrix construction.

Optionally, there may be 2 header lines

    D    // The number of instances
    W    // The number of attributes
    

or 3 header lines

    D    // The number of instances
    W    // The number of attributes
    N    // The total number of nonzero items in the dataset.
    

These header lines will be ignored.
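As a rough illustration of the format, the sketch below parses coordinate triples in plain Java and shifts the indices by arrayIndexOrigin so that they become 0-based. It is not Smile's implementation: the class CoordinateTriples and its nested Entry type are hypothetical, and skipping every line that does not have exactly three fields is an assumption made here to handle the optional header lines.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical reader for illustration only; in Smile, use SparseDataset.from. */
class CoordinateTriples {
    /** One nonzero entry, with row and column converted to 0-based indices. */
    static final class Entry {
        final int row;
        final int col;
        final double value;

        Entry(int row, int col, double value) {
            this.row = row;
            this.col = col;
            this.value = value;
        }
    }

    /** Parses "instanceID attributeID value" lines into a list of entries. */
    static List<Entry> parse(List<String> lines, int arrayIndexOrigin) {
        List<Entry> entries = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length != 3) {
                continue; // assumption: header and blank lines never have three fields
            }
            // Subtract the origin so 1-based files (e.g. from Fortran) become 0-based.
            int row = Integer.parseInt(fields[0]) - arrayIndexOrigin;
            int col = Integer.parseInt(fields[1]) - arrayIndexOrigin;
            entries.add(new Entry(row, col, Double.parseDouble(fields[2])));
        }
        return entries;
    }
}
```

With arrayIndexOrigin = 1, the line "1 4 1.0" becomes the entry (row 0, column 3, value 1.0). With arrayIndexOrigin = 0, the indices are kept as they are.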

The sample data data/sparse/kos.txt is in the coordinate format.


    smile> val kos = SparseDataset.from(java.nio.file.Paths.get("data/sparse/kos.txt"), 1)
    kos: SparseDataset = smile.data.SparseDatasetImpl@4da602fc
    

    smile> var kos = SparseDataset.from(java.nio.file.Paths.get("data/sparse/kos.txt"), 1)
    kos ==> smile.data.SparseDatasetImpl@4d826d77
          

    >>> SparseDataset.from(java.nio.file.Paths.get("data/sparse/kos.txt"), 1)
    res123: smile.data.SparseDataset! = smile.data.SparseDatasetImpl@485b4fd0
    

Harwell-Boeing Column-Compressed Sparse Matrix

In a Harwell-Boeing column-compressed sparse matrix file, nonzero values are stored in an array (top-to-bottom, then left-to-right). The row indices corresponding to the values are also stored. In addition, a list of pointers gives the index at which each column starts. The class SparseMatrix supports two formats for Harwell-Boeing files. The simple one is organized as follows:

The first line contains three integers, which are the number of rows, the number of columns, and the number of nonzero entries in the matrix.

Following the first line, there are m + 1 integers that are the indices at which the columns start, where m is the number of columns. Then there are n integers that are the row indices of the nonzero entries, where n is the number of nonzero entries. Finally, there are n floating-point numbers that are the values of the nonzero entries.
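The layout described above can be sketched in plain Java as follows. This is only an illustration, not Smile's SparseMatrix: the class ColumnCompressed is hypothetical, and because the text does not say whether the stored indices start at 0 or 1, the sketch takes the origin as an explicit indexOrigin parameter.

```java
import java.util.Scanner;

/** Hypothetical reader for illustration only; in Smile, use SparseMatrix.text. */
class ColumnCompressed {
    final int nrow;
    final int ncol;
    final int[] colIndex;  // ncol + 1 pointers; column j occupies [colIndex[j], colIndex[j + 1])
    final int[] rowIndex;  // row of each nonzero entry
    final double[] values; // value of each nonzero entry, stored column by column

    private ColumnCompressed(int nrow, int ncol, int[] colIndex, int[] rowIndex, double[] values) {
        this.nrow = nrow;
        this.ncol = ncol;
        this.colIndex = colIndex;
        this.rowIndex = rowIndex;
        this.values = values;
    }

    /** Parses the simple text layout; indexOrigin is 0 or 1 depending on the file. */
    static ColumnCompressed parse(String text, int indexOrigin) {
        Scanner in = new Scanner(text);
        int nrow = Integer.parseInt(in.next());
        int ncol = Integer.parseInt(in.next());
        int nnz = Integer.parseInt(in.next());
        int[] colIndex = new int[ncol + 1];
        for (int j = 0; j <= ncol; j++) {
            colIndex[j] = Integer.parseInt(in.next()) - indexOrigin;
        }
        int[] rowIndex = new int[nnz];
        for (int k = 0; k < nnz; k++) {
            rowIndex[k] = Integer.parseInt(in.next()) - indexOrigin;
        }
        double[] values = new double[nnz];
        for (int k = 0; k < nnz; k++) {
            values[k] = Double.parseDouble(in.next());
        }
        return new ColumnCompressed(nrow, ncol, colIndex, rowIndex, values);
    }

    /** Returns the entry at 0-based (i, j), scanning only the nonzeros of column j. */
    double get(int i, int j) {
        for (int k = colIndex[j]; k < colIndex[j + 1]; k++) {
            if (rowIndex[k] == i) {
                return values[k];
            }
        }
        return 0.0;
    }
}
```

For instance, the 3 x 3 matrix with rows (1, 0, 2), (0, 3, 0) and (4, 0, 5) would be written with 1-based indices as "3 3 5", then the column pointers "1 3 4 6", the row indices "1 3 2 1 3", and the values "1 4 3 2 5".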

The function SparseMatrix.text(Path path) can read this simple format. In the directory data/matrix, there are several sample files in the Harwell-Boeing format.


    smile> val blocks = SparseMatrix.text(java.nio.file.Paths.get("data/matrix/08blocks.txt"))
    blocks: SparseMatrix = smile.math.matrix.SparseMatrix@4263b080
    

    smile> import smile.math.matrix.*;

    smile> var blocks = SparseMatrix.text(java.nio.file.Paths.get("data/matrix/08blocks.txt"))
    blocks ==> smile.math.matrix.SparseMatrix@7ff95560
          

    >>> import smile.math.matrix.*
    >>> SparseMatrix.text(java.nio.file.Paths.get("data/matrix/08blocks.txt"))
    res126: smile.math.matrix.SparseMatrix! = smile.math.matrix.SparseMatrix@1a479168
    

The second format, called the Harwell-Boeing Exchange Format, is more complicated and powerful. For details, see https://people.sc.fsu.edu/~jburkardt/data/hb/hb.html. Note that our implementation supports only real-valued matrices and ignores the optional right-hand side vectors. This format is supported by the function SparseMatrix.harwell(Path path).


smile> val five = SparseMatrix.harwell(java.nio.file.Paths.get("data/matrix/5by5_rua.hb"))
[main] INFO smile.math.matrix.SparseMatrix - Reads sparse matrix file '/Users/hli/github/smile/shell/target/universal/stage/data/matrix/5by5_rua.hb'
[main] INFO smile.math.matrix.SparseMatrix - Title                                                                   Key
[main] INFO smile.math.matrix.SparseMatrix - 5             1             1             3             0
[main] INFO smile.math.matrix.SparseMatrix - RUA                        5             5            13             0
[main] INFO smile.math.matrix.SparseMatrix - (6I3)           (13I3)          (5E15.8)            (5E15.8)
five: SparseMatrix = smile.math.matrix.SparseMatrix@1761de10
    

smile> var five = SparseMatrix.harwell(java.nio.file.Paths.get("data/matrix/5by5_rua.hb"))
[main] INFO smile.math.matrix.SparseMatrix - Reads sparse matrix file '/Users/hli/github/smile/shell/target/universal/stage/data/matrix/5by5_rua.hb'
[main] INFO smile.math.matrix.SparseMatrix - Title                                                                   Key
[main] INFO smile.math.matrix.SparseMatrix - 5             1             1             3             0
[main] INFO smile.math.matrix.SparseMatrix - RUA                        5             5            13             0
[main] INFO smile.math.matrix.SparseMatrix - (6I3)           (13I3)          (5E15.8)            (5E15.8)
five ==> smile.math.matrix.SparseMatrix@6b4a4e18
          

>>> SparseMatrix.harwell(java.nio.file.Paths.get("data/matrix/5by5_rua.hb"))
[main] INFO smile.math.matrix.SparseMatrix - Reads sparse matrix file '/Users/hli/github/smile/shell/target/universal/stage/data/matrix/5by5_rua.hb'
[main] INFO smile.math.matrix.SparseMatrix - Title                                                                   Key
[main] INFO smile.math.matrix.SparseMatrix - 5             1             1             3             0
[main] INFO smile.math.matrix.SparseMatrix - RUA                        5             5            13             0
[main] INFO smile.math.matrix.SparseMatrix - (6I3)           (13I3)          (5E15.8)            (5E15.8)
res127: smile.math.matrix.SparseMatrix! = smile.math.matrix.SparseMatrix@37672764
    

Wireframe

Smile can parse 3D wireframe models in Wavefront OBJ files.


    def read.wavefront(file: String): (Array[Array[Double]], Array[Array[Int]])
    

In the directory data/wavefront, there is a teapot wireframe model. In the next section, we will learn how to visualize 3D wireframe models.


    smile> val (vertices, edges) = read.wavefront("data/wavefront/teapot.obj")
    vertices: Array[Array[Double]] = Array(
      Array(40.6266, 28.3457, -1.10804),
      Array(40.0714, 30.4443, -1.10804),
      Array(40.7155, 31.1438, -1.10804),
      Array(42.0257, 30.4443, -1.10804),
      Array(43.4692, 28.3457, -1.10804),
      Array(37.5425, 28.3457, 14.5117),
      Array(37.0303, 30.4443, 14.2938),
      Array(37.6244, 31.1438, 14.5466),
      Array(38.8331, 30.4443, 15.0609),
      Array(40.1647, 28.3457, 15.6274),
      Array(29.0859, 28.3457, 27.1468),
      Array(28.6917, 30.4443, 26.7527),
      Array(29.149, 31.1438, 27.2099),
      Array(30.0792, 30.4443, 28.1402),
      Array(31.1041, 28.3457, 29.165),
      Array(16.4508, 28.3457, 35.6034),
      Array(16.2329, 30.4443, 35.0912),
      Array(16.4857, 31.1438, 35.6853),
      Array(16.9999, 30.4443, 36.894),
      Array(17.5665, 28.3457, 38.2256),
      Array(0.831025, 28.3457, 38.6876),
      Array(0.831025, 30.4443, 38.1324),
      Array(0.831025, 31.1438, 38.7764),
      Array(0.831025, 30.4443, 40.0866),
    ...
    edges: Array[Array[Int]] = Array(
      Array(6, 5),
      Array(5, 0),
      Array(6, 0),
      Array(0, 1),
      Array(1, 6),
      Array(0, 6),
      Array(7, 6),
      Array(6, 1),
      Array(7, 1),
      Array(1, 2),
      Array(2, 7),
      Array(1, 7),
      Array(8, 7),
      Array(7, 2),
      Array(8, 2),
      Array(2, 3),
      Array(3, 8),
      Array(2, 8),
      Array(9, 8),
      Array(8, 3),
      Array(9, 3),
      Array(3, 4),
      Array(4, 9),
      Array(3, 9),
    ...
    

Export Data and Models

To serialize a model, you may use


    import smile._
    write(model, file)
    

    import smile.io.Write;
    Write.object(model, file)
    

This method serializes the model in Java serialization format. This is handy if you want to use a model in Spark.
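For readers unfamiliar with Java serialization, the sketch below shows the general mechanism of writing an object to a file and reading it back with the standard library. It does not reproduce the internals of Smile's write function: the class SerializationSketch, its save and load helpers, and the LinearModel stand-in are all hypothetical and exist only for illustration.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helpers for illustration only; in Smile, use the write function. */
class SerializationSketch {
    /** Hypothetical stand-in for a trained model; it must implement Serializable. */
    static class LinearModel implements Serializable {
        private static final long serialVersionUID = 1L;
        final double[] weights;
        final double bias;

        LinearModel(double[] weights, double bias) {
            this.weights = weights;
            this.bias = bias;
        }

        double predict(double[] x) {
            double y = bias;
            for (int i = 0; i < weights.length; i++) {
                y += weights[i] * x[i];
            }
            return y;
        }
    }

    /** Writes the object to the file in Java serialization format. */
    static void save(Serializable model, Path file) {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(model);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads back an object; its class must be on the classpath of the reader. */
    static Object load(Path file) {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException("Model class is not on the classpath", e);
        }
    }
}
```

The loaded object has to be cast back to its class, for example (SerializationSketch.LinearModel) SerializationSketch.load(file). Any JVM process that reads the file needs the same model classes available.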

You can also save a DataFrame to an ARFF file with the method write.arff(data, file). The ARFF file keeps the data type information. If you prefer a plain CSV text file, you may use the methods write.csv(data, file) or write.table(data, file, "delimiter"), which save a generic two-dimensional array with a comma or a custom delimiter. To save a one-dimensional array, simply call write(array, file).
