We downloaded the StackOverflow data dump, which is hosted at this link. The data are in XML format. We parsed the XML files and loaded the data into a database using the DumpSO.py script.
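A minimal sketch of this kind of XML-to-database step is shown below. It is not the actual DumpSO.py: the SQLite backend, the table layout, and the choice of Posts.xml attributes are assumptions; it only illustrates streaming the dump's `<row .../>` elements into a database.

```python
# Sketch only (not DumpSO.py): stream Posts.xml from the StackOverflow dump
# into a SQLite table. Table layout and attribute choice are assumptions.
import sqlite3
import xml.etree.ElementTree as ET

def dump_posts(xml_path="Posts.xml", db_path="so.db"):
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts "
        "(id INTEGER PRIMARY KEY, post_type INTEGER, title TEXT, body TEXT)"
    )
    # iterparse streams the file, so the full dump never has to fit in memory.
    for _, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag == "row":
            conn.execute(
                "INSERT OR REPLACE INTO posts VALUES (?, ?, ?, ?)",
                (
                    int(elem.get("Id")),
                    int(elem.get("PostTypeId", 0)),
                    elem.get("Title", ""),
                    elem.get("Body", ""),
                ),
            )
            elem.clear()  # free the element once it has been stored
    conn.commit()
    conn.close()

if __name__ == "__main__":
    dump_posts()
```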
Each data set consists of three folders: Alignment, Corpora, and Usages. The contents of each folder are as follows:
Folder | Content |
---|---|
Alignment | A CSV file containing the per-word entropy values plotted in Figure 2 of the paper. |
Corpora | Two text files, eng.txt and code.txt, containing the English and code corpora. |
Usages | Two files with the .dict extension, containing the usage frequency of every English and code token respectively. Each file has three columns, which are explained in the following section. |
Processing Data with OpenNMT
- From the command line, run: cd ~/OpenNMT
- For tokenization, run the following from the command line:
th tools/tokenize.lua -mode space < ~/data/eng.txt > ~/data/eng.tok
th tools/tokenize.lua -mode space < ~/data/code.txt > ~/data/code.tok
- For creating the dictionaries, run from the command line:
th preprocess.lua -train_src ~/data/eng.tok -train_tgt ~/data/code.tok -keep_frequency true -save_data ~/data/dictionary
For each corpus, there will be a dictionary with three tab-separated columns.
Example:
Token | 1-indexed token ID | Occurrence frequency |
---|---|---|
Drawable.Drawable | 5 | 658 |
The first four tokens of each dictionary are:
`<blank>`
`<unk>`
`<s>`
`</s>`
So, we remove them.
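For reference, here is a small sketch of loading one of these .dict files while skipping the four special tokens, assuming the three-column layout described above (the path in the usage comment is a placeholder):

```python
# Sketch: load a .dict file into {token: occurrence frequency},
# dropping the four special tokens listed above.
SPECIAL_TOKENS = {"<blank>", "<unk>", "<s>", "</s>"}

def load_usage_frequencies(dict_path):
    """Return {token: occurrence frequency}, excluding the special tokens."""
    freqs = {}
    with open(dict_path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()      # handles tab- or space-separated columns
            if len(parts) != 3:
                continue              # skip malformed lines
            token, _token_id, freq = parts
            if token in SPECIAL_TOKENS:
                continue
            freqs[token] = int(freq)
    return freqs

# Example (placeholder path):
# eng_freqs = load_usage_frequencies("eng.dict")
```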
Besides creating the dictionaries (the .dict files), the previous command creates a binary file with a .t7 extension. For our purposes, we do not need that file.
rm ~/data/*.t7
- From the command line, run: cd ~/SOParallelCorpusReplication/BerkeleyAligner
- Split your corpus into two parts: 80% for training and 20% for testing (a splitting sketch is given at the end of this section)
- Run the following commands one after another:
mkdir -p ~/SOParallelCorpusReplication/BerkeleyAligner/data/train
mkdir ~/SOParallelCorpusReplication/BerkeleyAligner/data/test
- Populate the BerkeleyAligner/data/train folder with train.en and train.cd
- Populate the BerkeleyAligner/data/test folder with test.en and test.cd
- Run: java -Xms2g -Xmx4g -jar berkeleyaligner.jar ++configuration.conf
The last command will create the folder ~/SOParallelCorpusReplication/BerkeleyAligner/output and fill it with many files. Of these, we need only stage2.1.params.txt, which is the first argument for AlignmentEntropyStat.py.
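The 80/20 split mentioned above can be done in any way that keeps the two corpora line-aligned. Below is a minimal sketch, assuming the tokenized files and the BerkeleyAligner data folders created earlier:

```python
# Sketch of the 80/20 split. It cuts both corpora at the same line index so the
# English and code sides stay aligned, and writes train.en/train.cd and
# test.en/test.cd into the BerkeleyAligner data folders created above.
import os

def split_corpus(eng_path, code_path, data_dir, train_ratio=0.8):
    with open(eng_path, encoding="utf-8") as f:
        eng = f.readlines()
    with open(code_path, encoding="utf-8") as f:
        code = f.readlines()
    assert len(eng) == len(code), "parallel corpora must have the same number of lines"

    cut = int(len(eng) * train_ratio)  # first 80% for training, the rest for testing
    parts = {
        "train": (eng[:cut], code[:cut]),
        "test": (eng[cut:], code[cut:]),
    }
    for name, (en_lines, cd_lines) in parts.items():
        folder = os.path.join(data_dir, name)
        os.makedirs(folder, exist_ok=True)
        with open(os.path.join(folder, name + ".en"), "w", encoding="utf-8") as f:
            f.writelines(en_lines)
        with open(os.path.join(folder, name + ".cd"), "w", encoding="utf-8") as f:
            f.writelines(cd_lines)

# Example:
# split_corpus(os.path.expanduser("~/data/eng.tok"),
#              os.path.expanduser("~/data/code.tok"),
#              os.path.expanduser("~/SOParallelCorpusReplication/BerkeleyAligner/data"))
```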
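For orientation only (AlignmentEntropyStat.py is the script to use), the per-word entropy is the entropy of each source token's translation distribution. A minimal sketch follows, assuming the probabilities from stage2.1.params.txt have already been parsed into a nested dictionary; we do not reproduce that file's exact format here.

```python
# Illustration only; use AlignmentEntropyStat.py for the actual replication.
# Assumes the translation parameters from stage2.1.params.txt have already been
# parsed into {source token: {target token: p(target | source)}}.
import math

def per_word_entropy(translation_probs):
    """Return {source token: entropy of its translation distribution, in bits}."""
    entropies = {}
    for source, targets in translation_probs.items():
        total = sum(targets.values())
        if total <= 0:
            continue
        # Normalize defensively in case the dumped probabilities do not sum to 1.
        entropies[source] = -sum(
            (p / total) * math.log2(p / total) for p in targets.values() if p > 0
        )
    return entropies

# Toy example:
# per_word_entropy({"string": {"String": 0.7, "toString": 0.2, "str": 0.1}})
# -> {"string": 1.1567...}  (about 1.16 bits)
```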