Training new models

This page describes how to train vectors with the models that are currently implemented in VSMlib.

Word2vec

Word2vec is arguably the most popular word embedding model. We provide an implementation of an extended word2vec model which can be trained on linear and dependency-based contexts, with bound and unbound context representations.

Additionally, we provide an implementation which treats characters rather than words as the minimal units. This enables it to take advantage of morphological information: as far as word-level models such as word2vec are concerned, “walk” and “walking” are completely unrelated, except through similarities in their distributions.

To train word2vec embeddings, vsmlib can be invoked via the command line interface:

>>> python3 -m vsmlib.embeddings.train_word2vec

The command line parameters are as follows (an example invocation is given after the list):

--dimensions              size of embeddings
--context_type            context type ['linear' or 'deps']; the 'deps' context type requires an annotated corpus
--context_representation  context representation ['bound' or 'unbound']
--window                  window size
--model                   base model type ['skipgram' or 'cbow']
--negative-size           number of negative samples
--out_type                output model type ['hsm': hierarchical softmax, 'ns': negative sampling, 'original': no approximation]
--subword                 specify if a subword-level approach should be used ['none' or 'rnn']
--batchsize               learning minibatch size
--gpu                     GPU ID (negative value indicates CPU)
--epochs                  number of epochs to learn
--maxWordLength           max word length (only used for character-level subword models)
--path_vocab              path to the vocabulary
--path_corpus             path to the corpus
--path_out                path to save embeddings
--test                    run in test mode
--verbose                 verbose mode
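
For example, 100-dimensional skip-gram embeddings with linear contexts could be trained on CPU with a command along these lines (the paths are placeholders; omitted options fall back to the implementation's defaults):

>>> python3 -m vsmlib.embeddings.train_word2vec --path_corpus /path/to/corpus --path_out /path/to/embeddings --dimensions 100 --model skipgram --context_type linear --window 5 --epochs 5 --gpu -1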

Alternatively, word2vec training can be done through the vsmlib Python API.

>>> vsmlib.embeddings.train_word2vec.train(args)

The arguments are an argparse.Namespace object identical to the command-line arguments. An instance of ModelDense is returned.
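
A minimal sketch of calling the API is shown below. It assumes the attribute names mirror the command-line options above (with hyphens replaced by underscores, as argparse does); the paths and values are placeholders, and options omitted here are left to the implementation's defaults.

>>> import argparse
>>> import vsmlib.embeddings.train_word2vec as tw2v
>>> # build a namespace mirroring the command-line arguments (placeholder paths)
>>> args = argparse.Namespace(
...     path_corpus="/path/to/corpus",
...     path_vocab="/path/to/vocab",
...     path_out="/path/to/embeddings",
...     dimensions=100,
...     model="skipgram",
...     context_type="linear",
...     context_representation="unbound",
...     window=5,
...     epochs=5,
...     batchsize=1000,
...     negative_size=5,
...     out_type="ns",
...     subword="none",
...     gpu=-1,
...     test=False,
...     verbose=True)
>>> model = tw2v.train(args)  # returns a ModelDense instance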

Related papers: original w2v, Bofang, Mnih, subword.

@inproceedings{MikolovChenEtAl_2013_Efficient_estimation_of_word_representations_in_vector_space,
  title = {Efficient Estimation of Word Representations in Vector Space},
  author = {Mikolov, Tomas and Chen, Kai and Corrado, Greg and Dean, Jeffrey},
  booktitle = {Proceedings of International Conference on Learning Representations (ICLR)},
  year = {2013}
}

@inproceedings{Li2017InvestigatingDS,
  title = {Investigating Different Syntactic Context Types and Context Representations for Learning Word Embeddings},
  author = {Bofang Li and Tao Liu and Zhe Zhao and Buzhou Tang and Aleksandr Drozd and Anna Rogers and Xiaoyong Du},
  booktitle = {EMNLP},
  year = {2017}
}