
Word2Vecf C Code Usage

Dependency Parses (CoNLL-U Format) Generated by Stanford CoreNLP

Parse the original text file into a CoNLL-U format file, which contains the dependency relations between words along with other token-level information.


$ ./corenlp.sh -annotators tokenize,ssplit,pos,lemma,ner,depparse -file wikipedia.txt -outputFormat conllu
  
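Each sentence in the resulting CoNLL-U file is a block of tab-separated lines, one token per line, with columns for (among others) the token index, word form, lemma, part of speech, head index, and dependency relation. The following is an illustrative sketch of the layout, not verbatim CoreNLP output:

1   Australian   australian   JJ    _   _   2   amod    _   _
2   scientist    scientist    NN    _   _   3   nsubj   _   _
3   discovers    discover     VBZ   _   _   0   root    _   _
4   star         star         NN    _   _   3   dobj    _   _

The cut -f 2 step in the next section relies on the word form being the second column.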


Run Word2Vecf to Compute Embeddings

  • Create the input data, which is in the form of (word, context) pairs:
the input data is a file in which each line has two space-separated items, the first being the word and the second its context. For example, to create syntactic contexts from dependency-parsed data in CoNLL-U format:

$ cut -f 2 conll_file | python scripts/vocab.py 50 > counted_vocabulary

$ cat conll_file | python scripts/extract_deps.py counted_vocabulary 100 > dep.contexts


The first command counts how many times each word appears in conll_file, keeping all words with counts >= 50. The second command extracts dependency contexts from the parsed file, skipping pairs whose word or context word
appears fewer than 100 times in the vocabulary. (Note: the extract_deps.py script currently lowercases the input.)
If you want to perform sub-sampling, or prune away some examples, now would be a good time to do so; a sketch is given below.
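For example, the following Python sketch sub-samples frequent words from the pairs file using the keep-probability formula from the word2vec paper; the threshold t and the output file name are illustrative assumptions, not part of the word2vecf tooling:

import random
from collections import Counter

# Count how often each word occurs across the (word, context) pairs.
pairs = [line.split() for line in open("dep.contexts") if line.strip()]
counts = Counter(word for word, _ in pairs)
total = sum(counts.values())

t = 1e-4  # sub-sampling threshold; tune for your corpus (assumption)
with open("dep.contexts.sub", "w") as out:
    for word, context in pairs:
        freq = counts[word] / total
        # Frequent words are kept with probability sqrt(t / freq).
        if random.random() < min(1.0, (t / freq) ** 0.5):
            out.write(word + " " + context + "\n")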


  • Create word and context vocabularies:

$ cd <word2vecf directory>
$ make
$ ./count_and_filter -train dep.contexts -cvocab cv -wvocab wv -min-count 100


This will count the words and contexts in dep.contexts, discard words or contexts appearing fewer than 100 times,
and write the counted words to wv and the counted contexts to cv.
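To make the filtering concrete, here is a rough Python sketch of equivalent logic; it assumes the word2vec-style vocabulary format of one "token count" pair per line, and is meant to illustrate the step rather than replace the count_and_filter binary:

from collections import Counter

word_counts, context_counts = Counter(), Counter()
with open("dep.contexts") as f:
    for line in f:
        parts = line.split()
        if len(parts) == 2:
            word_counts[parts[0]] += 1
            context_counts[parts[1]] += 1

# Keep only tokens seen at least min_count times, mirroring -min-count 100.
min_count = 100
for path, counts in (("wv", word_counts), ("cv", context_counts)):
    with open(path, "w") as out:
        for token, count in counts.most_common():
            if count >= min_count:
                out.write("%s %d\n" % (token, count))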


  • Train the embeddings:

$ ./word2vecf -train dep.contexts -wvocab wv -cvocab cv -output dim200vecs -size 200 -negative 15 -threads 10


This will train 200-dimensional embeddings based on dep.contexts, wv, and cv (lines in dep.contexts whose word is not
in wv or whose context is not in cv are ignored). The -dumpcv flag can be used in order to dump the trained context vectors
as well:


$ ./word2vecf -train dep.contexts -wvocab wv -cvocab cv -output dim200vecs -size 200 -negative 15 -threads 10 -dumpcv dim200context-vecs

  • Convert the embeddings to numpy-readable format:

$ ./scripts/vecs2nps.py dim200vecs vecs


This will create vecs.npy and vecs.vocab, which can be read by the infer.py script.
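As a quick check of the converted files, the embeddings can be queried directly from numpy. This is a minimal sketch, assuming vecs.vocab lists one word per line in the same order as the rows of vecs.npy; see infer.py for the repository's own usage:

import numpy as np

# Load the embedding matrix and its row-aligned vocabulary.
vecs = np.load("vecs.npy")
vocab = open("vecs.vocab").read().split()
w2i = {w: i for i, w in enumerate(vocab)}

# L2-normalize rows so dot products become cosine similarities.
vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def nearest(word, k=10):
    sims = vecs.dot(vecs[w2i[word]])
    best = np.argsort(-sims)[1:k + 1]  # skip the query word itself
    return [(vocab[i], float(sims[i])) for i in best]

print(nearest("scientist"))  # "scientist" is an illustrative query word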


Resources