Class Word2Vec
Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located close to one another in the space.
Word2vec can utilize either of two model architectures to produce a distributed representation of words: continuous bag-of-words (CBOW) or continuous skip-gram. In the continuous bag-of-words architecture, the model predicts the current word from a window of surrounding context words. The order of context words does not influence prediction (bag-of-words assumption). In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. The skip-gram architecture weighs nearby context words more heavily than more distant context words. According to the authors' note, CBOW is faster while skip-gram is slower but does a better job for infrequent words.
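As an illustrative sketch (not part of this class's API), the (center, context) training pairs that skip-gram trains on can be enumerated from a token sequence with a sliding window as follows; the class and method names here are hypothetical. CBOW uses the same windows but predicts the center word from the context words jointly.

```java
import java.util.ArrayList;
import java.util.List;

public class SkipGramPairs {
    // Enumerates (center, context) pairs within a symmetric window.
    // Skip-gram predicts each context word from the center word.
    static List<String[]> pairs(String[] tokens, int window) {
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(tokens.length - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (j != i) {
                    result.add(new String[] {tokens[i], tokens[j]});
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String[] p : pairs(new String[] {"the", "quick", "brown", "fox"}, 2)) {
            System.out.println(p[0] + " -> " + p[1]);
        }
    }
}
```
In practice the trainer also down-weights distant context words, e.g. by sampling a window size uniformly from 1 to `window` per position, which is how nearby words come to be weighed more heavily.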
GloVe (Global Vectors for Word Representation) is another popular unsupervised learning algorithm for obtaining vector representations for words.
GloVe is essentially a log-bilinear model with a weighted least-squares objective. The main intuition underlying the model is the simple observation that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning.
Training is performed on aggregated global word-word co-occurrence statistics from a corpus. The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words' probability of co-occurrence. Since the logarithm of a ratio equals the difference of logarithms, this objective associates (the logarithm of) ratios of co-occurrence probabilities with vector differences in the word vector space. Because these ratios can encode some form of meaning, this information gets encoded as vector differences as well.
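A minimal sketch of this weighted least-squares objective for a single cell of the co-occurrence matrix, assuming the weighting function f(x) = (x/x_max)^alpha capped at 1, with x_max = 100 and alpha = 0.75 as in the GloVe paper; the class and method names are hypothetical:

```java
public class GloveObjective {
    // Weighting function f(x): discounts rare co-occurrences and
    // caps the influence of very frequent ones at 1.
    static double weight(double x, double xMax, double alpha) {
        return x < xMax ? Math.pow(x / xMax, alpha) : 1.0;
    }

    // Cost for one (i, j) cell of the co-occurrence matrix X:
    // f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
    // The full objective sums this term over all nonzero cells.
    static double cost(double[] wi, double[] wj, double bi, double bj, double xij) {
        double dot = 0.0;
        for (int k = 0; k < wi.length; k++) {
            dot += wi[k] * wj[k];
        }
        double diff = dot + bi + bj - Math.log(xij);
        return weight(xij, 100.0, 0.75) * diff * diff;
    }
}
```
The cost vanishes exactly when the dot product (plus biases) matches the log co-occurrence count, which is the property stated above.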
Field Summary
Fields
Constructor Summary
Constructors
Method Summary
Modifier and Type | Method | Description
float[] | apply(String word) | Returns the embedding vector of a word, or null if the word is not in the vocabulary.
boolean | contains(String word) | Returns true if the word is in the vocabulary.
int | dimension() | Returns the dimension of the embedding vector space.
static Word2Vec | glove(file) | Loads a GloVe model from a text file.
Optional<float[]> | lookup(String word) | Returns the embedding vector of a word, or empty if the word is not in the vocabulary.
static Word2Vec | of(file) | Loads a pre-trained word2vec model from a binary file of ByteOrder.LITTLE_ENDIAN.
static Word2Vec | of(file, order) | Loads a pre-trained word2vec model from a binary file.
OptionalDouble | similarity(String w1, String w2) | Returns the cosine similarity between the embedding vectors of two words.
int | size() | Returns the size of the vocabulary.
Field Details
words
The vocabulary.
vectors
The vector space.
Constructor Details
Word2Vec
Constructor.
Parameters:
words - the vocabulary.
vectors - the vectors of d x n, where d is the dimension and n is the size of the vocabulary.

Method Details
dimension
public int dimension()
Returns the dimension of the embedding vector space.
Returns:
the dimension of the embedding vector space.
apply
Returns the embedding vector of a word, or null if the word is not in the vocabulary. Prefer lookup(String) for null-safe access.
Parameters:
word - the word.
Returns:
the embedding vector, or null if not found.
lookup
Returns the embedding vector of a word, or empty if the word is not in the vocabulary.
Parameters:
word - the word.
Returns:
the embedding vector, or Optional.empty() if the word is not in the vocabulary.
contains
Returns true if the word is in the vocabulary.
Parameters:
word - the word.
Returns:
true if the vocabulary contains the word.
size
public int size()
Returns the size of the vocabulary.
Returns:
the number of words in the vocabulary.
similarity
Returns the cosine similarity between the embedding vectors of two words. Cosine similarity is the dot product of the unit-normalized vectors, ranging from -1 (opposite) to +1 (identical direction).
Parameters:
w1 - the first word.
w2 - the second word.
Returns:
the cosine similarity, or OptionalDouble.empty() if either word is not in the vocabulary.
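The quantity this method returns can be sketched on raw embedding vectors as follows; the helper below is illustrative, not part of this class:

```java
public class Cosine {
    // Cosine similarity: dot(a, b) / (||a|| * ||b||), i.e. the dot
    // product of the two vectors after unit normalization.
    static double cosine(float[] a, float[] b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```
Identical directions give +1, orthogonal vectors give 0, and opposite directions give -1, matching the range documented above.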
of
Loads a pre-trained word2vec model from a binary file of ByteOrder.LITTLE_ENDIAN byte order.
Parameters:
file - the path to the model file.
Returns:
the word2vec model.
Throws:
IOException - when it fails to read the file.
of
Loads a pre-trained word2vec model from a binary file.
Parameters:
file - the path to the model file.
order - the byte order of the model file.
Returns:
the word2vec model.
Throws:
IOException - when it fails to read the file.
glove
Loads a GloVe model from a text file. Each line must have the form: word f1 f2 ... fd, where d is the embedding dimension. All lines must have the same number of dimensions.
Parameters:
file - the path to the model file.
Returns:
the word embedding model.
Throws:
IOException - when it fails to read the file.
IllegalArgumentException - if the file is empty or lines have inconsistent dimensions.
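A sketch of parsing one line of the text format documented above (word f1 f2 ... fd); the parser class and method are hypothetical, not part of this API:

```java
import java.util.AbstractMap;
import java.util.Map;

public class GloveLine {
    // Parses one line of a GloVe text model: a word followed by
    // d floating-point components, separated by whitespace.
    static Map.Entry<String, float[]> parse(String line) {
        String[] fields = line.trim().split("\\s+");
        float[] vector = new float[fields.length - 1];
        for (int i = 1; i < fields.length; i++) {
            vector[i - 1] = Float.parseFloat(fields[i]);
        }
        return new AbstractMap.SimpleEntry<>(fields[0], vector);
    }
}
```
A loader would apply this to every line, checking that each vector has the same length d and raising IllegalArgumentException otherwise, consistent with the behavior documented for this method.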