smile.nlp.embedding.Word2Vec

public class Word2Vec extends Object

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located close to one another in the space.
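
"Close to one another" is commonly measured with cosine similarity between two embedding vectors. The following is a minimal sketch of that measure on plain `float[]` arrays, such as those returned by `get(String)`. The class name `CosineSketch` and the three-dimensional vectors are made up for illustration; they are not part of Smile and do not come from a trained model.

```java
// Toy illustration only: CosineSketch is a hypothetical helper, not a Smile class.
final class CosineSketch {
    /** Cosine similarity of two equal-length vectors; 1.0 means same direction. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Made-up embeddings, chosen so the first two point in similar directions.
        float[] king = {0.9f, 0.1f, 0.4f};
        float[] queen = {0.8f, 0.2f, 0.5f};
        float[] apple = {-0.3f, 0.9f, 0.0f};
        System.out.println(cosine(king, queen) > cosine(king, apple)); // prints true
    }
}
```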

Word2vec can utilize either of two model architectures to produce a distributed representation of words: continuous bag-of-words (CBOW) or continuous skip-gram. In the continuous bag-of-words architecture, the model predicts the current word from a window of surrounding context words. The order of context words does not influence prediction (bag-of-words assumption). In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. The skip-gram architecture weighs nearby context words more heavily than more distant context words. According to the authors' note, CBOW is faster while skip-gram is slower but does a better job for infrequent words.
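
The difference between the two architectures can be seen in the training examples each one derives from a sentence. The sketch below only enumerates those examples as strings. It assumes a fixed symmetric window, and it does not model the distance weighting or the network training. `ContextWindowSketch` and its methods are hypothetical names, not part of Smile.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; it trains nothing.
final class ContextWindowSketch {
    /** Skip-gram: the current word predicts each neighbour inside the window. */
    static List<String> skipGramPairs(String[] tokens, int window) {
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(tokens.length - 1, i + window);
            for (int j = lo; j <= hi; j++) {
                if (j != i) {
                    pairs.add(tokens[i] + " -> " + tokens[j]);
                }
            }
        }
        return pairs;
    }

    /** CBOW: the whole window of neighbours predicts the current word. */
    static List<String> cbowExamples(String[] tokens, int window) {
        List<String> examples = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            int lo = Math.max(0, i - window);
            int hi = Math.min(tokens.length - 1, i + window);
            List<String> context = new ArrayList<>();
            for (int j = lo; j <= hi; j++) {
                if (j != i) {
                    context.add(tokens[j]);
                }
            }
            examples.add(context + " -> " + tokens[i]);
        }
        return examples;
    }
}
```

For the tokens `the cat sat` and a window of 1, skip-gram yields four pairs (`the -> cat`, `cat -> the`, `cat -> sat`, `sat -> cat`), while CBOW yields three examples, one per position, such as `[the, sat] -> cat`.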

Field Summary

Fields

| Modifier and Type | Field | Description |
| --- | --- | --- |
| final DataFrame | vectors | The vector space. |
| final String[] | words | The vocabulary. |

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| Word2Vec(String[] words, float[][] vectors) | Creates a model from a vocabulary and its embedding vectors. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| float[] | apply(String word) | Returns the embedding vector of a word. |
| int | dimension() | Returns the dimension of the embedding vector space. |
| float[] | get(String word) | Returns the embedding vector of a word. |
| static Word2Vec | of(Path file) | Loads a pre-trained word2vec model from a binary file in ByteOrder.LITTLE_ENDIAN byte order. |
| static Word2Vec | of(Path file, ByteOrder order) | Loads a pre-trained word2vec model from a binary file. |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- words
  
  public final String[] words
  
  The vocabulary.
- vectors
  
  public final DataFrame vectors
  
  The vector space.
Constructor Details
- Word2Vec
  
  public Word2Vec(String[] words, float[][] vectors)
  
  Creates a model from a vocabulary and its embedding vectors.
  
  Parameters:
  
  words - the vocabulary.
  
  vectors - the vectors as a d x n array, where d is the dimension and n is the size of the vocabulary.
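
The d x n layout described for `vectors` means one row per dimension rather than one row per word. If embeddings are held word by word (n x d), they would need transposing first. The following is a minimal sketch under that reading of the parameter description; `LayoutSketch.toDimensionMajor` is a hypothetical helper, not part of Smile.

```java
// Hypothetical helper for illustration; not a Smile class.
final class LayoutSketch {
    /**
     * Transposes word-major rows (n x d) into dimension-major rows (d x n).
     * Assumes at least one word and that every row has the same length.
     */
    static float[][] toDimensionMajor(float[][] byWord) {
        int n = byWord.length;      // vocabulary size
        int d = byWord[0].length;   // embedding dimension
        float[][] byDimension = new float[d][n];
        for (int w = 0; w < n; w++) {
            for (int k = 0; k < d; k++) {
                byDimension[k][w] = byWord[w][k];
            }
        }
        return byDimension;
    }
}
```

The result has the shape the `vectors` parameter describes and could then be passed, together with the matching `words` array, to `Word2Vec(String[] words, float[][] vectors)`.
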
Method Details
- dimension
  
  public int dimension()
  
  Returns the dimension of the embedding vector space.
  
  Returns:
  
  the dimension of the embedding vector space.
- get
  
  public float[] get(String word)
  
  Returns the embedding vector of a word.
  
  Parameters:
  
  word - the word.
  
  Returns:
  
  the embedding vector.
- apply
  
  public float[] apply(String word)
  
  Returns the embedding vector of a word. Provided for convenience in Scala.
  
  Parameters:
  
  word - the word.
  
  Returns:
  
  the embedding vector.
- of
  
  public static Word2Vec of(Path file) throws IOException
  
  Loads a pre-trained word2vec model from a binary file in ByteOrder.LITTLE_ENDIAN byte order.
  
  Parameters:
  
  file - the path to the model file.
  
  Returns:
  
  the word2vec model.
  
  Throws:
  
  IOException - if the file cannot be read.
- of
  
  public static Word2Vec of(Path file, ByteOrder order) throws IOException
  
  Loads a pre-trained word2vec model from a binary file.
  
  Parameters:
  
  file - the path to the model file.
  
  order - the byte order of the model file.
  
  Returns:
  
  the word2vec model.
  
  Throws:
  
  IOException - if the file cannot be read.
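
The `order` parameter matters because the same four bytes decode to different floats depending on byte order. The sketch below shows this with the standard `java.nio` classes only. It does not reproduce how Smile parses a model file, and `ByteOrderSketch` is a hypothetical name.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper for illustration; not a Smile class.
final class ByteOrderSketch {
    /** Decodes the first four bytes as an IEEE 754 float in the given byte order. */
    static float decode(byte[] fourBytes, ByteOrder order) {
        return ByteBuffer.wrap(fourBytes).order(order).getFloat();
    }

    public static void main(String[] args) {
        // 1.0f has the bit pattern 0x3F800000; stored little-endian it is 00 00 80 3F.
        byte[] raw = {0x00, 0x00, (byte) 0x80, 0x3F};
        System.out.println(decode(raw, ByteOrder.LITTLE_ENDIAN)); // prints 1.0
        // Read with the wrong order, the same bytes give an unrelated value.
        System.out.println(decode(raw, ByteOrder.BIG_ENDIAN));
    }
}
```

If a model file was written on a big-endian system, the two-argument overload would be called as `Word2Vec.of(file, ByteOrder.BIG_ENDIAN)`; the single-argument overload covers the little-endian case.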