Class Word2Vec

java.lang.Object
smile.nlp.embedding.Word2Vec

public class Word2Vec extends Object
Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located close to one another in the space.
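
The notion of words being "close to one another" is commonly measured with cosine similarity between their vectors. The following is a minimal, self-contained sketch of that measure. The class name `CosineSketch` and the three-dimensional vectors are made up for illustration and are not part of this class's API; real embeddings would come from `get(String)`.

```java
// Illustrative sketch only; CosineSketch is a hypothetical helper, not a Smile class.
final class CosineSketch {
    /** Cosine similarity of two equal-length vectors; returns 0 if either has zero norm. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            normA += (double) a[i] * a[i];
            normB += (double) b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Made-up vectors; a trained model typically has hundreds of dimensions.
        float[] king = {0.9f, 0.1f, 0.4f};
        float[] queen = {0.8f, 0.2f, 0.5f};
        float[] apple = {-0.3f, 0.9f, 0.0f};
        // The first similarity is larger than the second for these values.
        System.out.println(cosine(king, queen));
        System.out.println(cosine(king, apple));
    }
}
```

A value near 1 indicates vectors pointing in nearly the same direction, a value near 0 indicates unrelated directions.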

Word2vec can utilize either of two model architectures to produce a distributed representation of words: continuous bag-of-words (CBOW) or continuous skip-gram. In the continuous bag-of-words architecture, the model predicts the current word from a window of surrounding context words. The order of context words does not influence prediction (bag-of-words assumption). In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. The skip-gram architecture weighs nearby context words more heavily than more distant context words. According to the authors' note, CBOW is faster while skip-gram is slower but does a better job for infrequent words.
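
The difference between the two architectures is easiest to see in the training examples each one derives from a sentence. The sketch below only enumerates those examples for a fixed window. It does not train anything, it omits the distance-based weighting mentioned above, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; not how this class or any particular trainer is implemented.
final class ContextWindowSketch {
    /** Skip-gram view: one (center, context) pair for every neighbor within the window. */
    static List<String[]> skipGramPairs(String[] tokens, int window) {
        List<String[]> pairs = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(tokens.length - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (j != i) {
                    pairs.add(new String[] {tokens[i], tokens[j]});
                }
            }
        }
        return pairs;
    }

    /** CBOW view: the context words used jointly to predict the target tokens[i]. */
    static List<String> cbowContext(String[] tokens, int i, int window) {
        List<String> context = new ArrayList<>();
        int lo = Math.max(0, i - window);
        int hi = Math.min(tokens.length - 1, i + window);
        for (int j = lo; j <= hi; j++) {
            if (j != i) {
                context.add(tokens[j]);
            }
        }
        return context;
    }

    public static void main(String[] args) {
        String[] tokens = {"the", "quick", "brown", "fox"};
        for (String[] pair : skipGramPairs(tokens, 1)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
        System.out.println(cbowContext(tokens, 1, 1) + " -> " + tokens[1]);
    }
}
```

Skip-gram produces one example per (center, neighbor) pair, whereas CBOW produces one example per position with the whole context as input.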

  • Field Details

    • words

      public final String[] words
      The vocabulary.
    • vectors

      public final DataFrame vectors
      The vector space.
  • Constructor Details

    • Word2Vec

      public Word2Vec(String[] words, float[][] vectors)
      Creates a model from a vocabulary and its embedding vectors.
      Parameters:
      words - the vocabulary.
      vectors - the embedding vectors as a d x n array, where d is the dimension and n is the size of the vocabulary.
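
The d x n layout means that each row of the array holds one dimension across all words, so a single word's vector is a column. The sketch below illustrates that layout with plain arrays. It assumes that column j corresponds to `words[j]`, which the description above suggests but does not state explicitly, and `LayoutSketch` is a hypothetical helper.

```java
// Illustrative sketch only; shows the d x n array layout, not the class internals.
final class LayoutSketch {
    /** Copies column j of a d x n array, i.e. the d components of the j-th word (assumed). */
    static float[] column(float[][] vectors, int j) {
        float[] v = new float[vectors.length];
        for (int i = 0; i < vectors.length; i++) {
            v[i] = vectors[i][j];
        }
        return v;
    }
}
```

For a vocabulary of three words and d = 2, the array has two rows of three floats each, and `column(vectors, 1)` collects the two components of the second word.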
  • Method Details

    • dimension

      public int dimension()
      Returns the dimension of the embedding vector space.
      Returns:
      the dimension of the embedding vector space.
    • get

      public float[] get(String word)
      Returns the embedding vector of a word.
      Parameters:
      word - the word.
      Returns:
      the embedding vector.
    • apply

      public float[] apply(String word)
      Returns the embedding vector of a word. For Scala convenience.
      Parameters:
      word - the word.
      Returns:
      the embedding vector.
    • of

      public static Word2Vec of(Path file) throws IOException
      Loads a pre-trained word2vec model from a binary file in little-endian byte order (ByteOrder.LITTLE_ENDIAN).
      Parameters:
      file - the path to the model file.
      Returns:
      the word2vec model.
      Throws:
      IOException - if the file cannot be read.
    • of

      public static Word2Vec of(Path file, ByteOrder order) throws IOException
      Loads a pre-trained word2vec model from a binary file with the given byte order.
      Parameters:
      file - the path to the model file.
      order - the byte order of the model file.
      Returns:
      the word2vec model.
      Throws:
      IOException - if the file cannot be read.
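
The byte order matters because the float components in a binary model file are stored as raw 4-byte values. The sketch below shows, using only `java.nio`, how the same bytes decode differently under the two orders. It is not the loader used by this class, and `ByteOrderSketch` is a hypothetical helper.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative sketch only; demonstrates byte-order-dependent float decoding.
final class ByteOrderSketch {
    /** Decodes consecutive 4-byte floats from the array using the given byte order. */
    static float[] decodeFloats(byte[] bytes, ByteOrder order) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes).order(order);
        float[] values = new float[bytes.length / Float.BYTES];
        for (int i = 0; i < values.length; i++) {
            values[i] = buffer.getFloat();
        }
        return values;
    }

    public static void main(String[] args) {
        // 1.5f has the bit pattern 0x3FC00000; these are its bytes in little-endian order.
        byte[] littleEndian = {0x00, 0x00, (byte) 0xC0, 0x3F};
        System.out.println(decodeFloats(littleEndian, ByteOrder.LITTLE_ENDIAN)[0]);
        System.out.println(decodeFloats(littleEndian, ByteOrder.BIG_ENDIAN)[0]);
    }
}
```

If a model was written on a big-endian system, the two-argument `of(Path, ByteOrder)` overload with `ByteOrder.BIG_ENDIAN` is the one to use. A wrong byte order would typically show up as nonsensical vector values rather than as an exception.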