- Parse the original file into CoNLL-U format, which contains the dependency relations between words and other information (an illustrative fragment of the format follows the command):
$ ./corenlp.sh -annotators tokenize,ssplit,pos,lemma,ner,depparse -file wikipedia.txt -outputFormat conllu
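For reference, CoNLL-U is a tab-separated format with ten columns per token (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC); the word form is the second column, which is why the vocabulary step below uses cut -f 2. A hand-made illustrative fragment (not actual CoreNLP output; the exact tag and relation inventory depends on the models used):

1    Australian    australian    ADJ     JJ     _    2    amod     _    _
2    scientist     scientist     NOUN    NN     _    3    nsubj    _    _
3    discovers     discover      VERB    VBZ    _    0    root     _    _
4    stars         star          NOUN    NNS    _    3    obj      _    _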
- Create the input data, which is in the form of (word, context) pairs: the input data is a file in which each line has two space-separated items, the first being the word and the second the context. For example, to create syntactic contexts from the dependency-parsed data in CoNLL-U format:
$ cut -f 2 conll_file | python scripts/vocab.py 50 > counted_vocabulary
$ cat conll_file | python scripts/extract_deps.py counted_vocabulary 100 > dep.contexts
The first command counts how many times each word appears in conll_file, keeping all words with count >= 50. The second command extracts dependency contexts from the parsed file, skipping words or contexts whose words appear < 100 times in the vocabulary. (Note: currently, the extract_deps.py script lowercases the input.)
If you want to perform sub-sampling, or prune away some examples, now is a good time to do so. A hand-made sample of the resulting file format appears below.
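For illustration only (the exact context-label notation is whatever extract_deps.py emits; these lines are hand-made, not real output), a dependency-based dep.contexts file could look like:

discovers scientist/nsubj
scientist discovers/nsubj-1
discovers stars/obj
stars discovers/obj-1

Each line pairs one word with one syntactic context, so a word that occurs with several contexts contributes several lines.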
- Create word and context vocabularies:
$ cd <word2vecf directory>
$ make
$ ./count_and_filter -train dep.contexts -cvocab cv -wvocab wv -min-count 100
This will count the words and contexts in dep.contexts, discard words or contexts appearing < 100 times, and write the counted words to wv and the counted contexts to cv (a sketch of reading these files follows).
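A minimal sketch of reading the two vocabularies in Python, assuming the usual word2vec vocabulary layout of one "token count" pair per line (an assumption, not checked against the word2vecf source):

# Hypothetical helper: load a vocabulary file into a {token: count} dict.
def load_vocab(path):
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                vocab[parts[0]] = int(parts[1])
    return vocab

wv = load_vocab("wv")  # word vocabulary with counts
cv = load_vocab("cv")  # context vocabulary with counts
print(len(wv), "words,", len(cv), "contexts")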
- Train the embeddings:
$ ./word2vecf -train dep.contexts -wvocab wv -cvocab cv -output dim200vecs -size 200 -negative 15 -threads 10
This will train 200-dim embeddings based on dep.contexts, wv and cv (lines in dep.contexts whose word is not in wv or whose context is not in cv are ignored). The -dumpcv flag can be used to dump the trained context vectors as well; a sketch of using them follows the command:
$ ./word2vecf -train dep.contexts -wvocab wv -cvocab cv -output dim200vecs -size 200 -negative 15 -threads 10 -dumpcv dim200context-vecs
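Why keep the context vectors? word2vecf trains skip-gram with negative sampling, where the plausibility of a (word, context) pair is scored as the sigmoid of the dot product between the word vector and the context vector. A minimal sketch, assuming -output and -dumpcv write the word2vec text format (a "count dim" header line followed by one "token v1 ... v200" line per vector; treat this layout as an assumption):

import numpy as np

def load_text_vecs(path):
    # Assumed word2vec text layout: header "count dim", then token + floats.
    with open(path, encoding="utf-8") as f:
        n, dim = map(int, f.readline().split())
        vocab, vecs = [], np.zeros((n, dim))
        for i, line in enumerate(f):
            parts = line.split()
            vocab.append(parts[0])
            vecs[i] = np.array(parts[1:dim + 1], dtype=float)
    return vocab, vecs

words, W = load_text_vecs("dim200vecs")
ctxs, C = load_text_vecs("dim200context-vecs")

def pair_score(word, context):
    # SGNS plausibility of a (word, context) pair: sigmoid(w . c).
    w = W[words.index(word)]
    c = C[ctxs.index(context)]
    return 1.0 / (1.0 + np.exp(-np.dot(w, c)))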
- Convert the embeddings to numpy-readable format:
$ ./scripts/vecs2nps.py dim200vecs vecs
This will create vecs.npy and vecs.vocab, which can be read by the infer.py script; they can also be loaded directly, as sketched below.
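A minimal sketch of loading the pair directly, assuming row i of vecs.npy corresponds to token i in vecs.vocab (an assumption about vecs2nps.py's output, not verified here):

import numpy as np

# Load the matrix and the vocabulary written by vecs2nps.py.
mat = np.load("vecs.npy")
with open("vecs.vocab", encoding="utf-8") as f:
    vocab = f.read().split()
index = {w: i for i, w in enumerate(vocab)}

# L2-normalize rows so dot products become cosine similarities.
norm = mat / np.linalg.norm(mat, axis=1, keepdims=True)

def most_similar(word, topn=10):
    sims = norm @ norm[index[word]]
    best = np.argsort(-sims)[1:topn + 1]  # skip the query word itself
    return [(vocab[i], float(sims[i])) for i in best]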