The metadata¶
Vecto attempts to record and track as much information as possible about each embedding and each experiment you run. All the information about VSMs is stored in a metadata.json file in the same folder as the VSM itself.
Vecto can be used to work with VSMs that were trained elsewhere and may not come with any metadata. However, even in this case, we encourage the users to try and find out and record as much of the metadata as possible, as soon as possible. We have all been in the situation where, long after you have published a paper and forgotten all about that project, you need to reuse some code or repeat an experiment - and that it’s nigh impossible, because the code is unreadable, filenames are cryptic, and filepaths are long gone.
Moreover, keeping track of the metadata is also something that would force the researchers to be more aware of all these different hidden variables in their experiments. That would (1) prevent them from misinterpreting the properties of their models, and (2) provide some ideas about what could be tweaked.
The corpus metadata¶
It all starts with the corpus. Actually, as many corpora as you like, since it is common practice to combine corpora to train a model (to increase the volume of data, to diversify it, or in fancy curriculum learning). Here is a sample metadata file you can use as a template to describe your corpus.
Vecto records the following metadata:
todo: | a page about domains |
---|
- id
- An identifier of the corpus, unique in the collection.
- size
- The size of the corpus (in tokens).
- name
- The (preferably short) name of the corpus, often used to identify the models built from it.
- description
- The freeform description of the corpus, such as the domains it covers.
- source
- Where the corpus was obtained.
- domain
- The list of the domains of the texts, such as news, encyclopedia, fiction, medical, spoken, or web. If the corpus covers only one domain, the list only contains one item; otherwise several can be listed. We suggest using general only for balanced, representative corpora such as BNC that make a conscious effort to represent different registers.
- language
- A list containing the language codes for the corpus. There will be just one entry in case of monolingual corpora (e.g. [“en”]), and for parallel or multilingual corpora there will be several ([“en”, “de”]).
- encoding
- The encoding of the corpus files.
- format
- The format of the corpus. Some frequent options include: one-corpus-per-line, one-sentence-per-line, one-paragraph-per-line, one-word-per-line, vertical-format
- date
- The date when the corpus (or its text source) was published. It can be the date of a Wikipedia dump (e.g. 2018-07), or the year when the paper presenting the resource came out (e.g. 2017).
- path
- The path to the local copy of the corpus files.
- cite
- The bibtex entry for the paper presenting the resource, that should be referenced in subsequent work building on or using the resource. It should be bibtex rather than biblatex, as most NLP publishers have not made the switch yet.
- pre-processing
- The pre-processing steps used in preparing this resource, described in freeform text.
- cleanup
- Markup removal, format conversion, encoding, de-duplication (freeform description, URL or path to the pre-processing script)
- lowercasing
- True if the corpus was lowercased, False otherwise.
- tokenization
- The tokenizer that was used, if any (URL or path to the script, name, version).
- lemmatization
- The lemmatizer that was used, if any (URL or path to the script, name, version).
- stemming
- The stemmer that was used, if any (URL or path to the script, name, version).
- POS_tagging
- The POS-tagger that was used, if any (URL or path to the script, name, version).
- syntactic_parsing
- The syntactic parser that was used, if any (URL or path to the script, name, version).
- semantic_parsing
- The semantic parser that was used, if any (URL or path to the script, name, version).
- other_preprocessing
- Any other pre-processing that was performed, if any (URL or path to the script, name, version).
todo: | the format section should link to the input of embedding models |
---|
{
"class": "corpus",
"corpus_01": {
"id": "",
"size": ,
"name": "",
"description": "",
"source": "",
"domain": "",
"language": ["english"],
"encoding": "",
"format": "",
"date": "",
"path": "",
"pre-processing": {
"cleanup": "",
"lowercasing": ,
"tokenization": "",
"lemmatization": "",
"stemming": "",
"POS_tagging": "",
"syntactic_parsing": "",
"semantic_parsing": "",
"other_preprocessing": "",
}
}
"corpus_02": {
"id": "",
"size": ,
"name": "",
"description": "",
"source": "",
"domain": "",
"language": ["english"],
"encoding": "",
"format": "",
"date": "",
"path": "",
"pre-processing": {
"cleanup": "",
"lowercasing": ,
"tokenization": "",
"lemmatization": "",
"stemming": "",
"POS_tagging": "",
"syntactic_parsing": "",
"semantic_parsing": "",
"other_preprocessing": ""
}
}
}
The vocab metadata¶
There are two types of vocab files in Vecto. One is basically lists of the vocabulary of word embeddings. Sometimes they are stored separately from the numerical data as plain-text, one-word-per-line files (e.g. when the numerical data itself is stored in .npy format). Vecto expects such files to have a “.vocab” extension.
The other type of vocab is a tab-separated file structured as [WORD FREQUENCY].
The vocab files can have associated metadata as follows.
- size
- The number of token types.
- min_frequency
- The minimum frequency cut-off point.
- timestamp
- When the vocab file was produced
- filtering
- A freeform description of any filtering applied to the vocabulary, if any.
- lib_version
- The version of Vecto with which a given vocab file was produced (generated automatically by Vecto).
- system_info
- The system in which the vocab file was produced (generated automatically by Vecto).
- source
- Includes the metadata of the source corpus from which the vocab file was produced, as described in The corpus metadata section.
todo: | link to the vocab filtering section, if any |
---|---|
todo: | filtered_by - text file, wordlist, dict with metadata |
{
"class": "vocabulary",
"size": ,
"min_frequency": ,
"lowercasing": "",
"execution_time": "",
"timestamp": "",
"lib_version": "",
"system_info": "",
"filtered_by": {
},
"source": {
}
}
The embeddings metadata¶
The metadata collected in training of embeddings is hard to standartize, because essentially it needs to describe all the parameters of a given model, and they differ across models. Therefore this section only provides a sample, and the full list of parameters (which correspond to metadata) can be found in descriptions of the implementations of different models in the library.
todo: | link to the library of embeddings |
---|
Some of the frequent parameters applicable to most-if-not-all models include:
- model
- The name of the model, such as CBOW or GloVe.
- window
- The window size
- dimensionality
- The number of vector dimensions.
- context
- The type of context, as described by Li et al. Four common combinations are linear_unbound (the bag-of-words symmetrical context, the most commonly used), linear_bound (linear context that takes word order into account), deps_unbound (the dependency-based context which takes into account all words in a syntactic relation to the target word), and deps_boun (a version of the latter which differentiates between different syntactic relations). See the paper for mor details.
- epochs
- The number of epochs for which the model was trained.
- cite
- The bibtex entry for the paper presenting the resource, that should be referenced in subsequent work building on or using the resource. It should be bibtex rather than biblatex, as most NLP publishers have not made the switch yet.
- vocabulary
- The vocabulary metadata as described in The vocab metadata, which also includes the corpus metadata.
{
"class": "embeddings",
"model": "",
"window": ,
"dimensionality": ,
"context": "",
"epochs": ,
"cite": "",
"vocabulary": {
}
"lib_version": "",
"system_info": "",
}
The datasets metadata¶
The task datasets should be accompanied by the following metadata:
- task
- The task for which the dataset is applicable, such as word_analogy or word_relatedness.
- language
- A list containing the language codes for the corpus. There will be just one entry in case of monolingual corpora (e.g. [“en”]), and for parallel or multilingual corpora there will be several ([“en”, “de”]).
- name
- The (preferably short) name of the dataset, such as WordSim353.
- description
- The freeform brief description of the dataset, preferably including anything special about this dataset that distinguishes it from other datasets for the same task.
- domain
- The domain of the dataset, such as news, encyclopedia, fiction, medical, spoken, or web. We suggest using general only for datasets that do not target any particular domain.
- date
- The date the resource was published.
- source
- The source of the resource (e.g. a modification of another dataset, or something created by the authors from scratch or on the basis of some data that was not previously used for the same task).
- project_page
- The URL of the page describing the dataset (if any).
- version
- The version of the dataset (useful when you are developing one).
- size
- The size of the dataset. The units depend on the task: it can be e.g. 353 pairs for a similarity or analogy dataset.
- cite
- The bibtex entry for the paper presenting the resource, that should be referenced in subsequent work building on or using the resource. It should be bibtex rather than biblatex, as most NLP publishers have not made the switch yet.
{
"class": "dataset",
"task": "",
"language": ["english"],
"name": "",
"description": "",
"domain": "",
"date": "",
"source": "",
"project_page": "",
"version": "",
"size": "",
"cite": ""
}
The experiment metadata¶
As with the training of embeddings, different experiments involve different sets of metadata. The parameters of each model included in the Vecto library is described in the corresponding library page. In addition to that, the metadata for each experiment will automatically include the metadata for the dataset and embeddings (which also includes the corpus metadata).
Some of the generic metadata fields that are applicable to all experiments include:
- name
- The (hopefully descriptive) name of the model, such as LogisticRegression.
- task
- The type of the task that this model is applicable to (e.g. word_analogy or text_classification).
- description
- A brief description of the implementation, preferably including its use case (e.g. a sample implementation in a some framework, a standard baseline for some task, a state-of-the-art model.)
- The author of the code (for unpublished models).
- version
- The version of the implementation, if any.
- date
- The date when the code was published or contributed.
- source
- If the code is reimplementation of something else, this is the field to indicate it.
- cite
- The bibtex entry for the paper presenting the code, that should be referenced in subsequent work building on or comparing with this implementation. It should be bibtex rather than biblatex, as most NLP publishers have not made the switch yet.
{
"class": "experiment",
"name": "",
"task": "",
"description": "",
"author": "",
"version": "",
"date": "",
"source": "",
"cite": ""
}
Accessing the metadata in Vecto¶
All metadata is accessible from vsmlib while you are experimenting on dozens of VSMs you have built, facilitating both parameter search for a particular task and observations on what properties of VSMs result in what aspects of their performance.
You can access the VSM metadata as follows:
The name of the model, which is the name directory in which it is stored. For models generated with VSMlib, interpretable folder names with parameters are generated automatically.
>>> print(my_vsm.name)
w2v_comb2_w8_n25_i6_d300_skip_300
You can also access the metadata as a Python dictionary:
>>> print(my_vsm.metadata)
{'size_dimensions': 300, 'dimensions': 300, 'size_window': '8'}