Where to get data?¶
This page lists some source corpora and pre-trained word vectors you can download.
Source corpora¶
English Wikipedia, August 2013 dump, pre-processed
- One sentence per line, cleaned of punctuation (a streaming sketch for this format follows the list)
- One word per line, parser tokenization (this is the version used for the non-dependency-parsed embeddings below, so use it if you want directly comparable embeddings)
- Dependency-parsed version (Stanford CoreNLP parser)
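If you plan to feed one of these files to a trainer such as gensim, a streaming iterator avoids loading the whole dump into memory. A minimal sketch in Python for the one-sentence-per-line version; the file name is a hypothetical placeholder, not the actual download name:

```python
class SentenceStream:
    """Stream a one-sentence-per-line corpus as lists of tokens."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                tokens = line.split()  # whitespace split; punctuation is already removed
                if tokens:             # skip blank lines
                    yield tokens

# Hypothetical file name; substitute the corpus file you downloaded.
sentences = SentenceStream("enwiki_aug2013_sentences.txt")
```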
Pre-trained VSMs¶
English
Wikipedia vectors (dump of August 2013)
Here you can download pre-trained vectors for the popular CBOW, Skip-Gram and GloVe VSMs, each in 4 kinds of context (listed below).
These embeddings were generated for the following paper. Please cite it if you use them in your research:
@inproceedings{LiLiuEtAl_2017_Investigating_Different_Syntactic_Context_Types_and_Context_Representations_for_Learning_Word_Embeddings,
  title = {Investigating Different Syntactic Context Types and Context Representations for Learning Word Embeddings},
  url = {http://www.aclweb.org/anthology/D17-1256},
  booktitle = {Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing},
  author = {Li, Bofang and Liu, Tao and Zhao, Zhe and Tang, Buzhou and Drozd, Aleksandr and Rogers, Anna and Du, Xiaoyong},
  year = {2017},
  pages = {2411--2421}
}
You can also download the source corpus (one-word-per-line format) and use it to train other VSMs for a fair comparison.
Each of the 3 models (CBOW, Skip-Gram and GloVe) is available in 5 sizes (25, 50, 100, 250, and 500 dimensions) and in 4 types of context: the traditional unbound linear context (word linear, the most commonly used), the bound dependency context (structured dependency), and the less common bound linear (structured linear) and unbound dependency (word dependency) contexts. A loading sketch follows the list.

Unbound linear context (aka word linear context)
500 dimensions: word_linear_cbow_500d, word_linear_sg_500d, word_linear_glove_500d
250 dimensions: word_linear_cbow_250d, word_linear_sg_250d, word_linear_glove_250d
100 dimensions: word_linear_cbow_100d, word_linear_sg_100d, word_linear_glove_100d
50 dimensions: word_linear_cbow_50d, word_linear_sg_50d, word_linear_glove_50d
25 dimensions: word_linear_cbow_25d, word_linear_sg_25d, word_linear_glove_25d
Unbound dependency context (aka word dependency context)
500 dimensions: word_deps_cbow_500d, word_deps_sg_500d, word_deps_glove_500d
250 dimensions: word_deps_cbow_250d, word_deps_sg_250d, word_deps_glove_250d
100 dimensions: word_deps_cbow_100d, word_deps_sg_100d, word_deps_glove_100d
50 dimensions: word_deps_cbow_50d, word_deps_sg_50d, word_deps_glove_50d
25 dimensions: word_deps_cbow_25d, word_deps_sg_25d, word_deps_glove_25d
Bound linear context (aka structured linear context)
500 dimensions: structured_linear_cbow_500d, structured_linear_sg_500d, structured_linear_glove_500d
250 dimensions: structured_linear_cbow_250d, structured_linear_sg_250d, structured_linear_glove_250d
100 dimensions: structured_linear_cbow_100d, structured_linear_sg_100d, structured_linear_glove_100d
50 dimensions: structured_linear_cbow_50d, structured_linear_sg_50d, structured_linear_glove_50d
25 dimensions: structured_linear_cbow_25d, structured_linear_sg_25d, structured_linear_glove_25d
Bound dependency context (aka structured dependency context)
500 dimensions: structured_deps_cbow_500d, structured_deps_sg_500d, structured_deps_glove_500d
250 dimensions: structured_deps_cbow_250d, structured_deps_sg_250d, structured_deps_glove_250d
100 dimensions: structured_deps_cbow_100d, structured_deps_sg_100d, structured_deps_glove_100d
50 dimensions: structured_deps_cbow_50d, structured_deps_sg_50d, structured_deps_glove_50d
25 dimensions: structured_deps_cbow_25d, structured_deps_sg_25d, structured_deps_glove_25d
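Assuming the downloaded archives contain vectors in the standard word2vec text format (an assumption; check the archive contents), they can be loaded with gensim, as in this sketch:

```python
from gensim.models import KeyedVectors

# Hypothetical file name; use the actual file from the archive you downloaded.
vectors = KeyedVectors.load_word2vec_format("word_linear_sg_500d.txt", binary=False)

print(vectors.most_similar("dog", topn=5))  # nearest neighbours by cosine
print(vectors.similarity("dog", "cat"))     # cosine similarity of two words
```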
Training parameters: window size 2; negative sampling size 5 for SG and 2 for CBOW; context distribution smoothing 0.75; no dynamic context window and no “dirty” sub-sampling; 2, 5 and 30 iterations for SG, CBOW and GloVe respectively.
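For a rough idea of how these settings map onto gensim's Word2Vec parameters, here is a sketch of comparable hyperparameters (not the authors' actual training script; `sentences` can be any iterable of token lists, e.g. the SentenceStream sketch above):

```python
from gensim.models import Word2Vec

sg_model = Word2Vec(
    sentences=sentences,
    vector_size=500,       # one of the released sizes: 25/50/100/250/500
    window=2,              # window 2, as stated above
    sg=1,                  # Skip-Gram
    negative=5,            # negative sampling size 5 for SG
    ns_exponent=0.75,      # context distribution smoothing 0.75
    sample=0,              # no "dirty" sub-sampling
    shrink_windows=False,  # no dynamic context window (gensim >= 4.1)
    epochs=2,              # 2 iterations for SG
)

cbow_model = Word2Vec(
    sentences=sentences,
    vector_size=500,
    window=2,
    sg=0,                  # CBOW
    negative=2,            # negative sampling size 2 for CBOW
    ns_exponent=0.75,
    sample=0,
    shrink_windows=False,
    epochs=5,              # 5 iterations for CBOW
)
```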
SVD vectors:

BNC, 100M words:
- window 2, 500 dims, PMI, SVD C=0.6 (318 Mb, mirror)
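Here C most likely denotes the singular-value weighting exponent in the PMI + SVD pipeline, i.e. the embeddings are taken as U · Σ^C after truncated SVD of the positive PMI matrix. This reading is an assumption based on common practice, not a specification of the exact pipeline used for these vectors. A minimal sketch:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def pmi_svd_embeddings(counts, dim=500, c=0.6):
    """Sketch of PPMI + truncated SVD.

    `counts` is a dense (vocab x vocab) co-occurrence matrix collected
    with a window of 2; `dim` must be smaller than the vocabulary size.
    """
    counts = counts.astype(np.float64)
    total = counts.sum()
    expected = np.outer(counts.sum(axis=1), counts.sum(axis=0)) / total

    # Positive PMI: log(p(w,c) / (p(w) p(c))), negative values clipped to 0.
    ppmi = np.maximum(np.log(np.maximum(counts / expected, 1e-12)), 0.0)

    # Truncated SVD; C is the weighting exponent on the singular values.
    u, s, _ = svds(csr_matrix(ppmi), k=dim)
    return u * (s ** c)
```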
Russian
Araneum+Wiki+Proza.ru, 6B words:
- window 2, 500 dims, PMI, SVD C=0.6 (2.3 Gb, mirror); paper to cite:
@inproceedings{7396482,
  author = {Drozd, Aleksandr and Gladkova, Anna and Matsuoka, Satoshi},
  booktitle = {2015 IEEE International Conference on Data Science and Data Intensive Systems},
  title = {Discovering Aspectual Classes of Russian Verbs in Untagged Large Corpora},
  year = {2015},
  month = dec,
  pages = {61--68},
  doi = {10.1109/DSDIS.2015.30}
}