Class Word2Vec

java.lang.Object
smile.nlp.Word2Vec

public class Word2Vec extends Object
Word embedding. Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension.

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located close to one another in the space.

Word2vec can utilize either of two model architectures to produce a distributed representation of words: continuous bag-of-words (CBOW) or continuous skip-gram. In the continuous bag-of-words architecture, the model predicts the current word from a window of surrounding context words. The order of context words does not influence prediction (bag-of-words assumption). In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. The skip-gram architecture weighs nearby context words more heavily than more distant context words. According to the authors' note, CBOW is faster while skip-gram is slower but does a better job for infrequent words.
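The skip-gram architecture's (center, context) training pairs can be illustrated with a minimal standalone sketch. The `pairs` helper below is hypothetical, for illustration only, and is not part of the smile.nlp API; it simply enumerates every context word within a symmetric window around each center word.

```java
import java.util.ArrayList;
import java.util.List;

public class SkipGramPairs {
    // Generate (center, context) training pairs for a skip-gram model.
    // Each word is paired with every neighbor within `window` positions.
    static List<String[]> pairs(String[] tokens, int window) {
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(tokens.length - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (j != i) out.add(new String[]{tokens[i], tokens[j]});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] sentence = {"the", "quick", "brown", "fox"};
        for (String[] p : pairs(sentence, 1)) {
            System.out.println(p[0] + " -> " + p[1]);
        }
    }
}
```

In CBOW the roles are reversed: the context words within the window jointly predict the center word, which is why word order inside the window does not matter.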

GloVe (Global Vectors for Word Representation) is another popular unsupervised learning algorithm for obtaining vector representations for words.

GloVe is essentially a log-bilinear model with a weighted least-squares objective. The main intuition underlying the model is the simple observation that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning.

Training is performed on aggregated global word-word co-occurrence statistics from a corpus. The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words' probability of co-occurrence. Because the logarithm of a ratio equals the difference of logarithms, this objective associates (the logarithm of) ratios of co-occurrence probabilities with vector differences in the word vector space. Because these ratios can encode some form of meaning, this information gets encoded as vector differences as well.
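A single term of GloVe's weighted least-squares objective can be sketched as follows. This is an illustrative standalone computation, not part of smile.nlp; the cutoff xmax = 100 and exponent alpha = 0.75 are the defaults suggested in the GloVe paper.

```java
public class GloveObjective {
    // GloVe weighting function f(x): down-weights rare co-occurrences and
    // caps the influence of very frequent ones at 1.
    static double weight(double x, double xmax, double alpha) {
        return x < xmax ? Math.pow(x / xmax, alpha) : 1.0;
    }

    // One term of the objective: f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2,
    // where X_ij is the co-occurrence count of words i and j, w_i and w~_j are
    // their word and context vectors, and b_i, b~_j are scalar biases.
    static double lossTerm(double[] wi, double[] wj, double bi, double bj, double xij) {
        double dot = 0.0;
        for (int k = 0; k < wi.length; k++) dot += wi[k] * wj[k];
        double diff = dot + bi + bj - Math.log(xij);
        return weight(xij, 100.0, 0.75) * diff * diff;
    }

    public static void main(String[] args) {
        double[] wi = {0.1, 0.2};
        double[] wj = {0.3, -0.1};
        System.out.println(lossTerm(wi, wj, 0.0, 0.0, 10.0));
    }
}
```

The full objective sums this term over all word pairs with nonzero co-occurrence counts.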

  • Field Details

    • words

      public final String[] words
      The vocabulary.
    • vectors

      public final DataFrame vectors
      The vector space.
  • Constructor Details

    • Word2Vec

      public Word2Vec(String[] words, float[][] vectors)
      Constructor.
      Parameters:
      words - the vocabulary.
      vectors - the d x n matrix of embedding vectors, where d is the embedding dimension and n is the vocabulary size.
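Since the matrix is d x n, each word's embedding is a column, not a row. The standalone sketch below (hypothetical, not part of the Smile API) extracts the j-th word's vector from such a matrix, which is the lookup the class performs internally.

```java
public class ColumnLayout {
    // In a d x n matrix (d = dimension, n = vocabulary size), the embedding
    // of the j-th word is the j-th column.
    static float[] column(float[][] vectors, int j) {
        int d = vectors.length;
        float[] v = new float[d];
        for (int i = 0; i < d; i++) v[i] = vectors[i][j];
        return v;
    }

    public static void main(String[] args) {
        // 2-dimensional embeddings for a 3-word vocabulary.
        float[][] vectors = {
            {1.0f, 2.0f, 3.0f},   // first component of each word
            {4.0f, 5.0f, 6.0f}    // second component of each word
        };
        float[] second = column(vectors, 1); // the vector {2.0f, 5.0f}
        System.out.println(second[0] + " " + second[1]);
    }
}
```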
  • Method Details

    • dimension

      public int dimension()
      Returns the dimension of the embedding vector space.
      Returns:
      the dimension of the embedding vector space.
    • apply

      public float[] apply(String word)
      Returns the embedding vector of a word, or null if the word is not in the vocabulary. Prefer lookup(String) for null-safe access.
      Parameters:
      word - the word.
      Returns:
      the embedding vector, or null if not found.
    • lookup

      public Optional<float[]> lookup(String word)
      Returns the embedding vector of a word, or empty if the word is not in the vocabulary.
      Parameters:
      word - the word.
      Returns:
      the embedding vector, or empty if not found.
    • contains

      public boolean contains(String word)
      Returns true if the word is in the vocabulary.
      Parameters:
      word - the word.
      Returns:
      true if the vocabulary contains the word.
    • size

      public int size()
      Returns the size of the vocabulary.
      Returns:
      the number of words in the vocabulary.
    • similarity

      public OptionalDouble similarity(String w1, String w2)
      Returns the cosine similarity between the embedding vectors of two words. Cosine similarity is the dot product of the unit-normalized vectors, ranging from -1 (opposite) to +1 (identical direction).
      Parameters:
      w1 - the first word.
      w2 - the second word.
      Returns:
      the cosine similarity, or OptionalDouble.empty() if either word is not in the vocabulary.
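The quantity this method returns for two in-vocabulary words can be computed directly from their raw vectors: the dot product divided by the product of the Euclidean norms. A minimal standalone sketch:

```java
public class Cosine {
    // Cosine similarity of two equal-length vectors: dot product divided by
    // the product of their Euclidean norms. Ranges from -1 to +1.
    static double cosine(float[] a, float[] b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        float[] u = {1.0f, 0.0f};
        float[] v = {0.0f, 1.0f};
        System.out.println(cosine(u, u)); // 1.0 (identical direction)
        System.out.println(cosine(u, v)); // 0.0 (orthogonal)
    }
}
```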
    • of

      public static Word2Vec of(Path file) throws IOException
      Loads a pre-trained word2vec model from a binary file in ByteOrder.LITTLE_ENDIAN byte order.
      Parameters:
      file - the path to the model file.
      Returns:
      the word2vec model.
      Throws:
      IOException - if the file cannot be read.
    • of

      public static Word2Vec of(Path file, ByteOrder order) throws IOException
      Loads a pre-trained word2vec model from a binary file.
      Parameters:
      file - the path to the model file.
      order - the byte order of the model file.
      Returns:
      the word2vec model.
      Throws:
      IOException - if the file cannot be read.
    • glove

      public static Word2Vec glove(Path file) throws IOException
      Loads a GloVe model from a text file. Each line must have the form: word f1 f2 ... fd, where d is the embedding dimension. All lines must have the same number of dimensions.
      Parameters:
      file - the path to the model file.
      Returns:
      the word embedding model.
      Throws:
      IOException - if the file cannot be read.
      IllegalArgumentException - if the file is empty or lines have inconsistent dimensions.
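The per-line parsing that the text format implies can be sketched as follows. This is a hypothetical standalone illustration of the line format, not the actual glove(Path) implementation; the first whitespace-separated field is the word and the remaining d fields are its vector components.

```java
import java.util.Arrays;

public class GloveLine {
    // Parse one line of a GloVe text model: "word f1 f2 ... fd".
    // Returns a two-element array: the word (String) and its vector (float[]).
    static Object[] parse(String line) {
        String[] fields = line.split(" ");
        float[] vector = new float[fields.length - 1];
        for (int i = 1; i < fields.length; i++) {
            vector[i - 1] = Float.parseFloat(fields[i]);
        }
        return new Object[]{fields[0], vector};
    }

    public static void main(String[] args) {
        Object[] entry = parse("king 0.5 -0.1 0.7");
        System.out.println(entry[0]); // king
        System.out.println(Arrays.toString((float[]) entry[1]));
    }
}
```

A full loader would apply this to every line, verify that all vectors share the same length, and reject an empty file, matching the IllegalArgumentException behavior documented above.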