public interface Read

Reads data from external storage systems.

Method Summary

Static Methods

| Modifier and Type | Method | Description |
| --- | --- | --- |
| static DataFrame | arff(String path) | Reads an ARFF file. |
| static DataFrame | arff(Path path) | Reads an ARFF file. |
| static DataFrame | arrow(String path) | Reads an Apache Arrow file. |
| static DataFrame | arrow(Path path) | Reads an Apache Arrow file. |
| static DataFrame | avro(String path, InputStream schema) | Reads an Apache Avro file. |
| static DataFrame | avro(String path, String schema) | Reads an Apache Avro file. |
| static DataFrame | avro(Path path, InputStream schema) | Reads an Apache Avro file. |
| static DataFrame | avro(Path path, Path schema) | Reads an Apache Avro file. |
| static DataFrame | csv(String path) | Reads a CSV file. |
| static DataFrame | csv(String path, String format) | Reads a CSV file. |
| static DataFrame | csv(String path, org.apache.commons.csv.CSVFormat format) | Reads a CSV file. |
| static DataFrame | csv(String path, org.apache.commons.csv.CSVFormat format, StructType schema) | Reads a CSV file. |
| static DataFrame | csv(Path path) | Reads a CSV file. |
| static DataFrame | csv(Path path, org.apache.commons.csv.CSVFormat format) | Reads a CSV file. |
| static DataFrame | csv(Path path, org.apache.commons.csv.CSVFormat format, StructType schema) | Reads a CSV file. |
| static DataFrame | data(String path) | Reads a data file. |
| static DataFrame | data(String path, String format) | Reads a data file. |
| static DataFrame | json(String path) | Reads a JSON file. |
| static DataFrame | json(String path, JSON.Mode mode, StructType schema) | Reads a JSON file. |
| static DataFrame | json(Path path) | Reads a JSON file. |
| static DataFrame | json(Path path, JSON.Mode mode, StructType schema) | Reads a JSON file. |
| static SparseDataset<Integer> | libsvm(BufferedReader reader) | Reads a libsvm sparse dataset. |
| static SparseDataset<Integer> | libsvm(String path) | Reads a libsvm sparse dataset. |
| static SparseDataset<Integer> | libsvm(Path path) | Reads a libsvm sparse dataset. |
| static Object | object(Path path) | Reads a serialized object from a file. |
| static DataFrame | parquet(String path) | Reads an Apache Parquet file. |
| static DataFrame | parquet(Path path) | Reads an Apache Parquet file. |
| static DataFrame | sas(String path) | Reads a SAS7BDAT file. |
| static DataFrame | sas(Path path) | Reads a SAS7BDAT file. |

Method Details
- object
  
  static Object object(Path path) throws IOException, ClassNotFoundException
  
  Reads a serialized object from a file.
  
  Parameters:
  
  path - the file path.
  
  Returns:
  
  the deserialized object.
  
  Throws:
  
  IOException - if the stream cannot be read.
  
  ClassNotFoundException - if the class cannot be loaded.
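The exceptions declared by object(Path) match those of standard Java deserialization, so the file is presumably expected to contain an object written with java.io.ObjectOutputStream. The following is a minimal sketch under that assumption, using only the JDK. The class ObjectRoundTrip and its write and read helpers are hypothetical names for this example and are not part of this interface.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class: a JDK-only round trip, not the library's code.
final class ObjectRoundTrip {
    /** Writes a Serializable object to a file with standard Java serialization. */
    static void write(Object object, Path path) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(path))) {
            out.writeObject(object);
        }
    }

    /** Reads the object back; declares the same checked exceptions as object(Path). */
    static Object read(Path path) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(path))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("object", ".ser");
        try {
            List<String> original = new ArrayList<>(Arrays.asList("a", "b"));
            write(original, file);
            Object restored = read(file);
            // The caller has to cast the result, since only Object is returned.
            System.out.println(original.equals(restored));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```

Because the return type is Object, callers typically cast the result to the type they expect.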
- data
  
  static DataFrame data(String path) throws IOException, URISyntaxException, ParseException
  
  Reads a data file. The data format is inferred from the file name extension.
  
  Parameters:
  
  path - the input file path.
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
  
  ParseException - if the file cannot be parsed.
  
  URISyntaxException - when the file path syntax is wrong.
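To illustrate what inferring the format from the file name extension can look like, the sketch below maps an extension to the name of one of the readers listed in this interface. The class FormatByExtension, its helpers, and the exact set of recognized extensions are assumptions made for this example; the library's actual rules may differ.

```java
import java.util.Locale;

// Hypothetical helper class: illustrates extension-based dispatch only.
final class FormatByExtension {
    /** Returns the lower-cased extension of the file name, or "" if there is none. */
    static String extension(String path) {
        String name = path.substring(path.lastIndexOf('/') + 1);
        int dot = name.lastIndexOf('.');
        // dot <= 0 covers both "no dot" and hidden files such as ".profile".
        return dot <= 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Maps a path to a reader name; the extension list is an assumption. */
    static String reader(String path) {
        switch (extension(path)) {
            case "csv":      return "csv";
            case "json":     return "json";
            case "arff":     return "arff";
            case "sas7bdat": return "sas";
            case "arrow":    return "arrow";
            case "avro":     return "avro";
            case "parquet":  return "parquet";
            default:
                throw new IllegalArgumentException("Unsupported file extension: " + path);
        }
    }

    public static void main(String[] args) {
        System.out.println(reader("data/weather.arff"));
    }
}
```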
- data
  
  static DataFrame data(String path, String format) throws IOException, URISyntaxException, ParseException
  
  Reads a data file. The data format is inferred from the file name extension.
  
  Parameters:
  
  path - the input file path.
  
  format - the optional file format specification. For CSV files, it is a list of key-value pairs such as delimiter=\t,header=true,comment=#,escape=\,quote=". For JSON files, it is the file mode (single-line or multi-line). For Avro files, it is the path to the schema file.
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
  
  ParseException - if the file cannot be parsed.
  
  URISyntaxException - when the file path syntax is wrong.
- csv
  
  static DataFrame csv(String path) throws IOException, URISyntaxException
  
  Reads a CSV file.
  
  Parameters:
  
  path - the input file path.
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
  
  URISyntaxException - when the file path syntax is wrong.
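The String overloads declare URISyntaxException while the Path overloads do not, which suggests that a String path is interpreted as a URI before the file is opened. The sketch below shows one way that could work, assuming only plain paths and file: URIs. PathFromString and resolve are hypothetical names, and the library's real resolution logic may support other schemes or behave differently.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class: an assumption about how a String path might be resolved.
final class PathFromString {
    static Path resolve(String path) throws URISyntaxException {
        // Throws URISyntaxException for strings that are not valid URIs,
        // for example ones containing an unescaped space.
        URI uri = new URI(path);
        if (uri.getScheme() == null) {
            return Paths.get(path);   // plain local path such as "data/iris.csv"
        }
        if ("file".equalsIgnoreCase(uri.getScheme())) {
            return Paths.get(uri);    // absolute file URI
        }
        throw new IllegalArgumentException("Unsupported scheme: " + uri.getScheme());
    }

    public static void main(String[] args) throws URISyntaxException {
        System.out.println(resolve("data/iris.csv"));
    }
}
```

When a java.nio.file.Path is already available, the Path overloads avoid this step and the extra checked exception.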
- csv
  
  static DataFrame csv(String path, String format) throws IOException, URISyntaxException
  
  Reads a CSV file.
  
  Parameters:
  
  path - the input file path.
  
  format - the format specification in key-value pairs such as delimiter=\t,header=true,comment=#,escape=\,quote=".
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
  
  URISyntaxException - when the file path syntax is wrong.
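The key-value format specification accepted by csv(String path, String format) can be pictured as a comma-separated list of key=value tokens. The following sketch parses the example string from the parameter description into a map. CsvFormatSpec and parse are hypothetical names, and treating the two characters \t as a tab is an assumption; the library's own parser may accept different keys or escapes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class: a naive parser for "key=value,key=value" specifications.
final class CsvFormatSpec {
    static Map<String, String> parse(String spec) {
        Map<String, String> options = new LinkedHashMap<>();
        for (String token : spec.split(",")) {
            int eq = token.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected key=value but got: " + token);
            }
            String key = token.substring(0, eq).trim();
            String value = token.substring(eq + 1);
            // Assumption: a literal backslash followed by 't' stands for a tab.
            if (value.equals("\\t")) {
                value = "\t";
            }
            options.put(key, value);
        }
        return options;
    }

    public static void main(String[] args) {
        // The example from the parameter description, written as a Java literal.
        Map<String, String> options =
                parse("delimiter=\\t,header=true,comment=#,escape=\\,quote=\"");
        System.out.println(options.keySet());
    }
}
```

Splitting on commas keeps the sketch short, but it means a comma itself cannot be given as a value here.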
- csv
  
  static DataFrame csv(String path, org.apache.commons.csv.CSVFormat format) throws IOException, URISyntaxException
  
  Reads a CSV file.
  
  Parameters:
  
  path - the input file path.
  
  format - the CSV file format.
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
  
  URISyntaxException - when the file path syntax is wrong.
- csv
  
  static DataFrame csv(String path, org.apache.commons.csv.CSVFormat format, StructType schema) throws IOException, URISyntaxException
  
  Reads a CSV file.
  
  Parameters:
  
  path - the input file path.
  
  format - the CSV file format.
  
  schema - the data schema.
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
  
  URISyntaxException - when the file path syntax is wrong.
- csv
  
  static DataFrame csv(Path path) throws IOException
  
  Reads a CSV file.
  
  Parameters:
  
  path - the input file path.
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
- csv
  
  static DataFrame csv(Path path, org.apache.commons.csv.CSVFormat format) throws IOException
  
  Reads a CSV file.
  
  Parameters:
  
  path - the input file path.
  
  format - the CSV file format.
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
- csv
  
  static DataFrame csv(Path path, org.apache.commons.csv.CSVFormat format, StructType schema) throws IOException
  
  Reads a CSV file.
  
  Parameters:
  
  path - the input file path.
  
  format - the CSV file format.
  
  schema - the data schema.
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
- json
  
  static DataFrame json(String path) throws IOException, URISyntaxException
  
  Reads a JSON file.
  
  Parameters:
  
  path - the input file path.
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
  
  URISyntaxException - when the file path syntax is wrong.
- json
  
  static DataFrame json(String path, JSON.Mode mode, StructType schema) throws IOException, URISyntaxException
  
  Reads a JSON file.
  
  Parameters:
  
  path - the input file path.
  
  mode - the file mode (single-line or multi-line).
  
  schema - the data schema.
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
  
  URISyntaxException - when the file path syntax is wrong.
- json
  
  static DataFrame json(Path path) throws IOException
  
  Reads a JSON file.
  
  Parameters:
  
  path - the input file path.
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
- json
  
  static DataFrame json(Path path, JSON.Mode mode, StructType schema) throws IOException
  
  Reads a JSON file.
  
  Parameters:
  
  path - the input file path.
  
  mode - the file mode (single-line or multi-line).
  
  schema - the data schema.
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
- arff
  
  static DataFrame arff(String path) throws IOException, ParseException, URISyntaxException
  
  Reads an ARFF file. Weka ARFF (Attribute-Relation File Format) is an ASCII text file format that is essentially a CSV file with a header describing the metadata. ARFF was developed for use in the Weka machine learning software.
  The header comes first, beginning with the name of the dataset (the relation, in ARFF terminology). Each variable (attribute, in ARFF terminology) used to describe the observations is then declared together with its data type, one definition per line. The actual observations follow, one per line, with fields separated by commas, much like a CSV file.
  Missing values in an ARFF dataset are marked with a question mark '?'.
  Comments can be included in the file: a '%' at the beginning of a line causes the remainder of that line to be ignored.
  A significant advantage of ARFF over CSV is this metadata.
  The ability to include comments also makes it possible to record extra information about the dataset, including how it was derived, where it came from, and how it might be cited.
  
  Parameters:
  
  path - the input file path.
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
  
  ParseException - if the file cannot be parsed.
  
  URISyntaxException - when the file path syntax is wrong.
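The ARFF layout described for arff can be made concrete with a small reader that handles the header, '%' comments, and '?' missing values. This is a deliberately simplified sketch using only the JDK, not the library's parser: ArffSketch is a hypothetical class, attribute declarations are kept as raw strings, and quoted values and the sparse ARFF syntax are not handled. The @relation, @attribute, and @data keywords are those of the ARFF format itself.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical class: a simplified ARFF reader for illustration only.
final class ArffSketch {
    String relation;                                       // dataset name
    final List<String> attributes = new ArrayList<>();     // "name type" declarations
    final List<String[]> rows = new ArrayList<>();         // observations, null = missing

    static ArffSketch parse(BufferedReader reader) throws IOException {
        ArffSketch arff = new ArffSketch();
        boolean inData = false;
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("%")) {
                continue;                                  // skip blanks and comments
            }
            String lower = line.toLowerCase(Locale.ROOT);  // keywords are case-insensitive
            if (inData) {
                String[] fields = line.split(",");
                for (int i = 0; i < fields.length; i++) {
                    fields[i] = fields[i].trim();
                    if (fields[i].equals("?")) {
                        fields[i] = null;                  // '?' marks a missing value
                    }
                }
                arff.rows.add(fields);
            } else if (lower.startsWith("@relation")) {
                arff.relation = line.substring("@relation".length()).trim();
            } else if (lower.startsWith("@attribute")) {
                arff.attributes.add(line.substring("@attribute".length()).trim());
            } else if (lower.startsWith("@data")) {
                inData = true;                             // observations follow
            }
        }
        return arff;
    }

    public static void main(String[] args) throws IOException {
        String text = "% toy example\n"
                + "@relation weather\n"
                + "@attribute temperature numeric\n"
                + "@attribute play {yes,no}\n"
                + "@data\n"
                + "85,no\n"
                + "?,yes\n";
        ArffSketch arff = parse(new BufferedReader(new StringReader(text)));
        System.out.println(arff.relation + ": " + arff.attributes.size()
                + " attributes, " + arff.rows.size() + " rows");
    }
}
```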
- arff
  
  static DataFrame arff(Path path) throws IOException, ParseException
  
  Reads an ARFF file. Weka ARFF (Attribute-Relation File Format) is an ASCII text file format that is essentially a CSV file with a header describing the metadata. ARFF was developed for use in the Weka machine learning software.
  The header comes first, beginning with the name of the dataset (the relation, in ARFF terminology). Each variable (attribute, in ARFF terminology) used to describe the observations is then declared together with its data type, one definition per line. The actual observations follow, one per line, with fields separated by commas, much like a CSV file.
  Missing values in an ARFF dataset are marked with a question mark '?'.
  Comments can be included in the file: a '%' at the beginning of a line causes the remainder of that line to be ignored.
  A significant advantage of ARFF over CSV is this metadata.
  The ability to include comments also makes it possible to record extra information about the dataset, including how it was derived, where it came from, and how it might be cited.
  
  Parameters:
  
  path - the input file path.
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
  
  ParseException - if the file cannot be parsed.
- sas
  
  static DataFrame sas(String path) throws IOException, URISyntaxException
  
  Reads a SAS7BDAT file.
  
  Parameters:
  
  path - the input file path.
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
  
  URISyntaxException - when the file path syntax is wrong.
- sas
  
  static DataFrame sas(Path path) throws IOException
  
  Reads a SAS7BDAT file.
  
  Parameters:
  
  path - the input file path.
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
- arrow
  
  static DataFrame arrow(String path) throws IOException, URISyntaxException
  
  Reads an Apache Arrow file. Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.
  
  Parameters:
  
  path - the input file path.
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
  
  URISyntaxException - when the file path syntax is wrong.
- arrow
  
  static DataFrame arrow(Path path) throws IOException
  
  Reads an Apache Arrow file. Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware.
  
  Parameters:
  
  path - the input file path.
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
- avro
  
  static DataFrame avro(String path, InputStream schema) throws IOException, URISyntaxException
  
  Reads an Apache Avro file.
  
  Parameters:
  
  path - the input file path.
  
  schema - the input stream of the data schema.
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
  
  URISyntaxException - when the file path syntax is wrong.
- avro
  
  static DataFrame avro(String path, String schema) throws IOException, URISyntaxException
  
  Reads an Apache Avro file.
  
  Parameters:
  
  path - the input file path.
  
  schema - the data schema file path.
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
  
  URISyntaxException - when the file path syntax is wrong.
- avro
  
  static DataFrame avro(Path path, InputStream schema) throws IOException
  
  Reads an Apache Avro file.
  
  Parameters:
  
  path - the input file path.
  
  schema - the input stream of the data schema.
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
- avro
  
  static DataFrame avro(Path path, Path schema) throws IOException
  
  Reads an Apache Avro file.
  
  Parameters:
  
  path - the input file path.
  
  schema - the data schema file path.
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
- parquet
  
  static DataFrame parquet(String path) throws IOException, URISyntaxException
  
  Reads an Apache Parquet file.
  
  Parameters:
  
  path - the input file path.
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
  
  URISyntaxException - when the file path syntax is wrong.
- parquet
  
  static DataFrame parquet(Path path) throws IOException
  
  Reads an Apache Parquet file.
  
  Parameters:
  
  path - the input file path.
  
  Returns:
  
  the data frame.
  
  Throws:
  
  IOException - if the file cannot be read.
- libsvm
  
  static SparseDataset<Integer> libsvm(String path) throws IOException, URISyntaxException
  
  Reads a libsvm sparse dataset. Each line of a libsvm file has the format:
  
  <label> <index1>:<value1> <index2>:<value2> ...
  
  where label is the target value of the training data. For classification, it should be an integer that identifies a class (multi-class classification is supported). For regression, it is any real number. For one-class SVM, it is not used, so it can be any number. index is an integer starting from 1, and value is a real number. The indices must be in ascending order. The labels in a testing data file are only used to calculate accuracy or error. If they are unknown, fill this column with any number.
  
  Parameters:
  
  path - the input file path.
  
  Returns:
  
  the sparse dataset.
  
  Throws:
  
  IOException - if the file cannot be read.
  
  URISyntaxException - when the file path syntax is wrong.
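The libsvm line format described for libsvm can be illustrated with a parser for a single line. This is a sketch using only the JDK, not the library's reader: LibsvmLine and parse are hypothetical names, and the label is parsed as an integer only because the methods here return SparseDataset<Integer>. The check on ascending, 1-based indices follows the description above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical class: parses one "<label> <index>:<value> ..." line.
final class LibsvmLine {
    final int label;
    final Map<Integer, Double> features = new LinkedHashMap<>();  // index -> value

    LibsvmLine(int label) {
        this.label = label;
    }

    static LibsvmLine parse(String line) {
        String[] tokens = line.trim().split("\\s+");
        LibsvmLine row = new LibsvmLine(Integer.parseInt(tokens[0]));
        int previous = 0;  // indices start from 1 and must be ascending
        for (int i = 1; i < tokens.length; i++) {
            String[] pair = tokens[i].split(":");
            if (pair.length != 2) {
                throw new IllegalArgumentException("Expected index:value but got: " + tokens[i]);
            }
            int index = Integer.parseInt(pair[0]);
            if (index <= previous) {
                throw new IllegalArgumentException("Indices must start from 1 and ascend: " + line);
            }
            row.features.put(index, Double.parseDouble(pair[1]));
            previous = index;
        }
        return row;
    }

    public static void main(String[] args) {
        LibsvmLine row = parse("+1 1:0.5 3:2.0");
        System.out.println(row.label + " " + row.features);
    }
}
```

Only the nonzero entries appear on a line, which is why the result is kept as a sparse index-to-value map rather than a dense array.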
- libsvm
  
  static SparseDataset<Integer> libsvm(Path path) throws IOException
  
  Reads a libsvm sparse dataset. Each line of a libsvm file has the format:
  
  <label> <index1>:<value1> <index2>:<value2> ...
  
  where label is the target value of the training data. For classification, it should be an integer that identifies a class (multi-class classification is supported). For regression, it is any real number. For one-class SVM, it is not used, so it can be any number. index is an integer starting from 1, and value is a real number. The indices must be in ascending order. The labels in a testing data file are only used to calculate accuracy or error. If they are unknown, fill this column with any number.
  
  Parameters:
  
  path - the input file path.
  
  Returns:
  
  the sparse dataset.
  
  Throws:
  
  IOException - if the file cannot be read.
- libsvm
  
  static SparseDataset<Integer> libsvm(BufferedReader reader) throws IOException
  
  Reads a libsvm sparse dataset. Each line of a libsvm file has the format:
  
  <label> <index1>:<value1> <index2>:<value2> ...
  
  where label is the target value of the training data. For classification, it should be an integer that identifies a class (multi-class classification is supported). For regression, it is any real number. For one-class SVM, it is not used, so it can be any number. index is an integer starting from 1, and value is a real number. The indices must be in ascending order. The labels in a testing data file are only used to calculate accuracy or error. If they are unknown, fill this column with any number.
  
  Parameters:
  
  reader - the file reader.
  
  Returns:
  
  the sparse dataset.
  
  Throws:
  
  IOException - if the file cannot be read.
