Implementation of character-level deep neural networks for text classification. Three models (CNN, VDCNN, and GRU) are evaluated on four binary text classification datasets (Blog Authorship Corpus, PAN13, PAN14, and the Enron Email Dataset). Results:
Model | Blogs | PAN13 | PAN14 | Enron |
---|---|---|---|---|
CNN | 65% | 55% | 69% | 57% |
VDCNN | 66% | 74% | 67% | 64% |
GRU | 62% | 60% | 63% | 62% |
Overall, the VDCNN model is the most accurate, scoring highest on three of the four datasets, while the GRU model is the most consistent, staying between 60% and 63% on all of them.
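The comparison can be checked directly from the table. The following is a minimal sketch that computes the mean accuracy and the spread (highest minus lowest) for each model. It assumes the four values in each row correspond to Blogs, PAN13, PAN14 and Enron in that order. The names `ACCURACY`, `summarize` and `SUMMARY` are illustrative and not part of this repository.

```python
from statistics import mean

# Accuracies in percent, copied from the results table.
# Assumption: columns are Blogs, PAN13, PAN14, Enron, in that order.
ACCURACY = {
    "CNN":   {"Blogs": 65, "PAN13": 55, "PAN14": 69, "Enron": 57},
    "VDCNN": {"Blogs": 66, "PAN13": 74, "PAN14": 67, "Enron": 64},
    "GRU":   {"Blogs": 62, "PAN13": 60, "PAN14": 63, "Enron": 62},
}


def summarize(scores):
    """Return (mean accuracy, spread), where spread is max minus min."""
    values = list(scores.values())
    return mean(values), max(values) - min(values)


# Hypothetical helper table: model name -> (mean, spread)
SUMMARY = {model: summarize(scores) for model, scores in ACCURACY.items()}

for model, (avg, spread) in SUMMARY.items():
    print(f"{model:<6} mean={avg:.2f}%  spread={spread} points")
```

With these numbers, VDCNN has the highest mean (67.75%), while GRU has the smallest spread (3 points, against 14 for CNN and 10 for VDCNN).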
A working Python 3 installation is assumed. Install the required packages using:
pip install -r requirements.txt
Note that `requirements.txt` references the `tensorflow-gpu` package. It is recommended to use a GPU to train the models. If no GPU is used, install the `tensorflow` package instead.
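One possible way to switch to the CPU package without editing the file by hand is sketched below. This is an illustration, not a script shipped with this repository: the helper names and the output file `requirements-cpu.txt` are hypothetical, and it assumes `tensorflow-gpu` appears at the start of its own line in `requirements.txt`.

```python
import re

# Matches the package name at the start of a line, followed by a version
# specifier, whitespace, or end of line, so that a different package such as
# "tensorflow-gpu-foo" is left untouched.
_GPU_PACKAGE = re.compile(r"(?m)^tensorflow-gpu(?=[=<>~!\s]|$)")


def to_cpu_requirements(text: str) -> str:
    """Replace tensorflow-gpu with tensorflow, keeping any version pin."""
    return _GPU_PACKAGE.sub("tensorflow", text)


def write_cpu_requirements(src="requirements.txt", dst="requirements-cpu.txt"):
    """Write a CPU-only copy of src to dst (dst is a hypothetical file name)."""
    with open(src, encoding="utf-8") as f:
        converted = to_cpu_requirements(f.read())
    with open(dst, "w", encoding="utf-8") as f:
        f.write(converted)
```

After calling `write_cpu_requirements()`, running `pip install -r requirements-cpu.txt` would install the CPU variant. Any version pin is carried over unchanged, so this only works if the same version is also published for the `tensorflow` package.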
Download the training data using:
./download.sh
Run the preprocessing steps using:
./process.sh
Now, you can train a model using:
./train.py -a vdcnn -d blogs pan13_tr_en
Use `train.py -h` for more information.