- Parse the original file into CoNLL-U format, which contains the dependency relations between words and other information (an illustrative fragment of the format follows the command):
$ ./corenlp.sh -annotators tokenize,ssplit,pos,lemma,ner,depparse -file wikipedia.txt -outputFormat conllu
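For reference, CoNLL-U is a tab-separated format with ten columns per token (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC); the word form is the second column, which is why the vocabulary step below uses cut -f 2. A hand-made illustrative fragment (not actual CoreNLP output; the exact tag and relation inventory depends on the models used):

1    Australian    australian    ADJ     JJ     _    2    amod     _    _
2    scientist     scientist     NOUN    NN     _    3    nsubj    _    _
3    discovers     discover      VERB    VBZ    _    0    root     _    _
4    stars         star          NOUN    NNS    _    3    obj      _    _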
- Create the input data, which is in the form of (word, context) pairs: the input data is a file in which each line has two space-separated items, the first being the word and the second the context. For example, to create syntactic contexts from the dependency-parsed data in CoNLL-U format:
$ cut -f 2 conll_file | python scripts/vocab.py 50 > counted_vocabulary
$ cat conll_file | python scripts/extract_deps.py counted_vocabulary 100 > dep.contexts
The first command counts how many times each word appears in conll_file, keeping all words with count >= 50. The second command extracts dependency contexts from the parsed file, skipping words or contexts whose words appear < 100 times in the vocabulary. (Note: currently, the extract_deps.py script lowercases the input.)
If you want to perform sub-sampling, or prune away some examples, now is a good time to do so. A hand-made sample of the resulting file format appears below.
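For illustration only (the exact context-label notation is whatever extract_deps.py emits; these lines are hand-made, not real output), a dependency-based dep.contexts file could look like:

discovers scientist/nsubj
scientist discovers/nsubj-1
discovers stars/obj
stars discovers/obj-1

Each line pairs one word with one syntactic context, so a word that occurs with several contexts contributes several lines.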
- Create word and context vocabularies:
$ cd <word2vecf directory>
$ make
$ ./count_and_filter -train dep.contexts -cvocab cv -wvocab wv -min-count 100
This will count the words and contexts in dep.contexts, discard words or contexts appearing < 100 times, and write the counted words to wv and the counted contexts to cv (a sketch of reading these files follows).
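A minimal sketch of reading the two vocabularies in Python, assuming the usual word2vec vocabulary layout of one "token count" pair per line (an assumption, not checked against the word2vecf source):

# Hypothetical helper: load a vocabulary file into a {token: count} dict.
def load_vocab(path):
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                vocab[parts[0]] = int(parts[1])
    return vocab

wv = load_vocab("wv")  # word vocabulary with counts
cv = load_vocab("cv")  # context vocabulary with counts
print(len(wv), "words,", len(cv), "contexts")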
- Train the embeddings:
$ ./word2vecf -train dep.contexts -wvocab wv -cvocab cv -output dim200vecs -size 200 -negative 15 -threads 10
This will train 200-dim embeddings based on dep.contexts, wv and cv (lines in dep.contexts whose word is not in wv or whose context is not in cv are ignored). The -dumpcv flag can be used to dump the trained context vectors as well; a sketch of using them follows the command:
$ ./word2vecf -train dep.contexts -wvocab wv -cvocab cv -output dim200vecs -size 200 -negative 15 -threads 10 -dumpcv dim200context-vecs
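Why keep the context vectors? word2vecf trains skip-gram with negative sampling, where the plausibility of a (word, context) pair is scored as the sigmoid of the dot product between the word vector and the context vector. A minimal sketch, assuming -output and -dumpcv write the word2vec text format (a "count dim" header line followed by one "token v1 ... v200" line per vector; treat this layout as an assumption):

import numpy as np

def load_text_vecs(path):
    # Assumed word2vec text layout: header "count dim", then token + floats.
    with open(path, encoding="utf-8") as f:
        n, dim = map(int, f.readline().split())
        vocab, vecs = [], np.zeros((n, dim))
        for i, line in enumerate(f):
            parts = line.split()
            vocab.append(parts[0])
            vecs[i] = np.array(parts[1:dim + 1], dtype=float)
    return vocab, vecs

words, W = load_text_vecs("dim200vecs")
ctxs, C = load_text_vecs("dim200context-vecs")

def pair_score(word, context):
    # SGNS plausibility of a (word, context) pair: sigmoid(w . c).
    w = W[words.index(word)]
    c = C[ctxs.index(context)]
    return 1.0 / (1.0 + np.exp(-np.dot(w, c)))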
- Convert the embeddings to numpy-readable format:
$ ./scripts/vecs2nps.py dim200vecs vecs
This will create vecs.npy and vecs.vocab, which can be read by the infer.py script; they can also be loaded directly, as sketched below.
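A minimal sketch of loading the pair directly, assuming row i of vecs.npy corresponds to token i in vecs.vocab (an assumption about vecs2nps.py's output, not verified here):

import numpy as np

# Load the matrix and the vocabulary written by vecs2nps.py.
mat = np.load("vecs.npy")
with open("vecs.vocab", encoding="utf-8") as f:
    vocab = f.read().split()
index = {w: i for i, w in enumerate(vocab)}

# L2-normalize rows so dot products become cosine similarities.
norm = mat / np.linalg.norm(mat, axis=1, keepdims=True)

def most_similar(word, topn=10):
    sims = norm @ norm[index[word]]
    best = np.argsort(-sims)[1:topn + 1]  # skip the query word itself
    return [(vocab[i], float(sims[i])) for i in best]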