Class Word2Vec

java.lang.Object
smile.nlp.Word2Vec

public class Word2Vec extends Object
Word embedding. Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing where words or phrases from the vocabulary are mapped to vectors of real numbers. Conceptually it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension.

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located close to one another in the space.

Word2vec can utilize either of two model architectures to produce a distributed representation of words: continuous bag-of-words (CBOW) or continuous skip-gram. In the continuous bag-of-words architecture, the model predicts the current word from a window of surrounding context words. The order of context words does not influence prediction (bag-of-words assumption). In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. The skip-gram architecture weighs nearby context words more heavily than more distant context words. According to the authors' note, CBOW is faster while skip-gram is slower but does a better job for infrequent words.
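The skip-gram architecture's (center, context) training pairs can be illustrated with a minimal standalone sketch. The `pairs` helper below is hypothetical, for illustration only, and is not part of the smile.nlp API; it simply enumerates every context word within a symmetric window around each center word.

```java
import java.util.ArrayList;
import java.util.List;

public class SkipGramPairs {
    // Generate (center, context) training pairs for a skip-gram model.
    // Each word is paired with every neighbor within `window` positions.
    static List<String[]> pairs(String[] tokens, int window) {
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(tokens.length - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (j != i) out.add(new String[]{tokens[i], tokens[j]});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        String[] sentence = {"the", "quick", "brown", "fox"};
        for (String[] p : pairs(sentence, 1)) {
            System.out.println(p[0] + " -> " + p[1]);
        }
    }
}
```

In CBOW the roles are reversed: the context words within the window jointly predict the center word, which is why word order inside the window does not matter.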

GloVe (Global Vectors for Word Representation) is another popular unsupervised learning algorithm for obtaining vector representations for words.

GloVe is essentially a log-bilinear model with a weighted least-squares objective. The main intuition underlying the model is the simple observation that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning.

Training is performed on aggregated global word-word co-occurrence statistics from a corpus. The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words' probability of co-occurrence. Because the logarithm of a ratio equals the difference of logarithms, this objective associates (the logarithm of) ratios of co-occurrence probabilities with vector differences in the word vector space. Because these ratios can encode some form of meaning, this information gets encoded as vector differences as well.
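A single term of GloVe's weighted least-squares objective can be sketched as follows. This is an illustrative standalone computation, not part of smile.nlp; the cutoff xmax = 100 and exponent alpha = 0.75 are the defaults suggested in the GloVe paper.

```java
public class GloveObjective {
    // GloVe weighting function f(x): down-weights rare co-occurrences and
    // caps the influence of very frequent ones at 1.
    static double weight(double x, double xmax, double alpha) {
        return x < xmax ? Math.pow(x / xmax, alpha) : 1.0;
    }

    // One term of the objective: f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2,
    // where X_ij is the co-occurrence count of words i and j, w_i and w~_j are
    // their word and context vectors, and b_i, b~_j are scalar biases.
    static double lossTerm(double[] wi, double[] wj, double bi, double bj, double xij) {
        double dot = 0.0;
        for (int k = 0; k < wi.length; k++) dot += wi[k] * wj[k];
        double diff = dot + bi + bj - Math.log(xij);
        return weight(xij, 100.0, 0.75) * diff * diff;
    }

    public static void main(String[] args) {
        double[] wi = {0.1, 0.2};
        double[] wj = {0.3, -0.1};
        System.out.println(lossTerm(wi, wj, 0.0, 0.0, 10.0));
    }
}
```

The full objective sums this term over all word pairs with nonzero co-occurrence counts.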

  • Field Details

    • words

      public final String[] words
      The vocabulary.
    • vectors

      public final DataFrame vectors
      The vector space.
  • Constructor Details

    • Word2Vec

      public Word2Vec(String[] words, float[][] vectors)
      Constructor.
      Parameters:
      words - the vocabulary.
      vectors - the d x n matrix of embedding vectors, where d is the embedding dimension and n is the vocabulary size.
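Since the matrix is d x n, each word's embedding is a column, not a row. The standalone sketch below (hypothetical, not part of the Smile API) extracts the j-th word's vector from such a matrix, which is the lookup the class performs internally.

```java
public class ColumnLayout {
    // In a d x n matrix (d = dimension, n = vocabulary size), the embedding
    // of the j-th word is the j-th column.
    static float[] column(float[][] vectors, int j) {
        int d = vectors.length;
        float[] v = new float[d];
        for (int i = 0; i < d; i++) v[i] = vectors[i][j];
        return v;
    }

    public static void main(String[] args) {
        // 2-dimensional embeddings for a 3-word vocabulary.
        float[][] vectors = {
            {1.0f, 2.0f, 3.0f},   // first component of each word
            {4.0f, 5.0f, 6.0f}    // second component of each word
        };
        float[] second = column(vectors, 1); // the vector {2.0f, 5.0f}
        System.out.println(second[0] + " " + second[1]);
    }
}
```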
  • Method Details

    • dimension

      public int dimension()
      Returns the dimension of the embedding vector space.
      Returns:
      the dimension of the embedding vector space.
    • apply

      public float[] apply(String word)
      Returns the embedding vector of a word, or null if the word is not in the vocabulary. Prefer lookup(String) for null-safe access.
      Parameters:
      word - the word.
      Returns:
      the embedding vector, or null if not found.
    • lookup

      public Optional<float[]> lookup(String word)
      Returns the embedding vector of a word, or empty if the word is not in the vocabulary.
      Parameters:
      word - the word.
      Returns:
      the embedding vector, or empty if not found.
    • contains

      public boolean contains(String word)
      Returns true if the word is in the vocabulary.
      Parameters:
      word - the word.
      Returns:
      true if the vocabulary contains the word.
    • size

      public int size()
      Returns the size of the vocabulary.
      Returns:
      the number of words in the vocabulary.
    • similarity

      public OptionalDouble similarity(String w1, String w2)
      Returns the cosine similarity between the embedding vectors of two words. Cosine similarity is the dot product of the unit-normalized vectors, ranging from -1 (opposite) to +1 (identical direction).
      Parameters:
      w1 - the first word.
      w2 - the second word.
      Returns:
      the cosine similarity, or OptionalDouble.empty() if either word is not in the vocabulary.
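The quantity this method returns for two in-vocabulary words can be computed directly from their raw vectors: the dot product divided by the product of the Euclidean norms. A minimal standalone sketch:

```java
public class Cosine {
    // Cosine similarity of two equal-length vectors: dot product divided by
    // the product of their Euclidean norms. Ranges from -1 to +1.
    static double cosine(float[] a, float[] b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        float[] u = {1.0f, 0.0f};
        float[] v = {0.0f, 1.0f};
        System.out.println(cosine(u, u)); // 1.0 (identical direction)
        System.out.println(cosine(u, v)); // 0.0 (orthogonal)
    }
}
```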
    • of

      public static Word2Vec of(Path file) throws IOException
      Loads a pre-trained word2vec model from a binary file in ByteOrder.LITTLE_ENDIAN byte order.
      Parameters:
      file - the path to the model file.
      Returns:
      the word2vec model.
      Throws:
      IOException - if the file cannot be read.
    • of

      public static Word2Vec of(Path file, ByteOrder order) throws IOException
      Loads a pre-trained word2vec model from a binary file.
      Parameters:
      file - the path to the model file.
      order - the byte order of the model file.
      Returns:
      the word2vec model.
      Throws:
      IOException - if the file cannot be read.
    • glove

      public static Word2Vec glove(Path file) throws IOException
      Loads a GloVe model from a text file. Each line must have the form: word f1 f2 ... fd, where d is the embedding dimension. All lines must have the same number of dimensions.
      Parameters:
      file - the path to the model file.
      Returns:
      the word embedding model.
      Throws:
      IOException - if the file cannot be read.
      IllegalArgumentException - if the file is empty or lines have inconsistent dimensions.
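The per-line parsing that the text format implies can be sketched as follows. This is a hypothetical standalone illustration of the line format, not the actual glove(Path) implementation; the first whitespace-separated field is the word and the remaining d fields are its vector components.

```java
import java.util.Arrays;

public class GloveLine {
    // Parse one line of a GloVe text model: "word f1 f2 ... fd".
    // Returns a two-element array: the word (String) and its vector (float[]).
    static Object[] parse(String line) {
        String[] fields = line.split(" ");
        float[] vector = new float[fields.length - 1];
        for (int i = 1; i < fields.length; i++) {
            vector[i - 1] = Float.parseFloat(fields[i]);
        }
        return new Object[]{fields[0], vector};
    }

    public static void main(String[] args) {
        Object[] entry = parse("king 0.5 -0.1 0.7");
        System.out.println(entry[0]); // king
        System.out.println(Arrays.toString((float[]) entry[1]));
    }
}
```

A full loader would apply this to every line, verify that all vectors share the same length, and reject an empty file, matching the IllegalArgumentException behavior documented above.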