The Large Movie Review Dataset is available at http://ai.stanford.edu/~amaas/data/sentiment/. If you use it, please cite the accompanying paper:
@InProceedings{maas-EtAl:2011:ACL-HLT2011,
author = {Maas, Andrew L. and Daly, Raymond E. and Pham, Peter T. and Huang, Dan and Ng, Andrew Y. and Potts, Christopher},
title = {Learning Word Vectors for Sentiment Analysis},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
month = {June},
year = {2011},
address = {Portland, Oregon, USA},
publisher = {Association for Computational Linguistics},
pages = {142--150},
url = {http://www.aclweb.org/anthology/P11-1015}
}
Get the dataset and extract it:
wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
gunzip -c aclImdb_v1.tar.gz | tar xvf -
Prefix each review with its label and collect the positive and negative reviews into two files, one review per line, then concatenate them into a single training file:
awk '{print "negative " $0}' ./aclImdb/train/neg/*.txt > negative.txt
awk '{print "positive " $0}' ./aclImdb/train/pos/*.txt > positive.txt
cat negative.txt positive.txt > training.txt
The training.txt file should contain 25,000 lines (one per review).
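To see what format DoccatTrainer expects (a category label, a space, then the whole review on one line), here is a toy version of the same awk step; the demo/ directory and its files are made up for illustration and are not part of the dataset:

```shell
# Build two tiny stand-in "reviews" (hypothetical files, not the real dataset)
# and label them exactly as the commands above do.
mkdir -p demo/neg demo/pos
echo "worst movie ever" > demo/neg/0_1.txt
echo "a true masterpiece" > demo/pos/0_9.txt
awk '{print "negative " $0}' demo/neg/*.txt > demo-negative.txt
awk '{print "positive " $0}' demo/pos/*.txt > demo-positive.txt
cat demo-negative.txt demo-positive.txt
# prints:
# negative worst movie ever
# positive a true masterpiece
```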
Now do the same for the test data:
awk '{print "negative " $0}' ./aclImdb/test/neg/*.txt > negative.txt
awk '{print "positive " $0}' ./aclImdb/test/pos/*.txt > positive.txt
cat negative.txt positive.txt > test.txt
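Both splits are balanced (12,500 positive and 12,500 negative reviews each), so a quick line count is a useful sanity check; this snippet simply reruns wc on the files built above, skipping any that are missing:

```shell
# Each combined file should contain 12,500 + 12,500 = 25,000 lines.
for f in training.txt test.txt; do
  if [ -f "$f" ]; then wc -l "$f"; fi
done
```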
Download and extract Apache OpenNLP:
wget https://dlcdn.apache.org/opennlp/opennlp-2.0.0/apache-opennlp-2.0.0-bin.tar.gz
gunzip -c apache-opennlp-2.0.0-bin.tar.gz | tar xvf -
Train the classification model using the default parameters:
apache-opennlp-2.0.0/bin/opennlp DoccatTrainer -lang eng -data training.txt -model reviews.bin
The output will be similar to (trimmed for brevity):
Indexing events with TwoPass using cutoff of 5
Computing event counts... done. 25000 events
Indexing... done.
Sorting and merging events... done. Reduced 25000 events to 24904.
Done indexing in 3.73 s.
Incorporating indexed data for training...
done.
Number of Event Tokens: 24904
Number of Outcomes: 2
Number of Predicates: 49338
...done.
Computing model parameters ...
Performing 100 iterations.
1: ... loglikelihood=-17328.679513999836 0.5
2: ... loglikelihood=-17211.385062834383 0.895
3: ... loglikelihood=-17096.643006176993 0.89608
4: ... loglikelihood=-16984.327243758315 0.89756
...
96: ... loglikelihood=-11482.127956384167 0.9362
97: ... loglikelihood=-11447.924532664598 0.93644
98: ... loglikelihood=-11414.001924383781 0.93664
99: ... loglikelihood=-11380.356063246394 0.937
100: ... loglikelihood=-11346.982967722744 0.93744
Writing document categorizer model ... done (0.299s)
Wrote document categorizer model to
path: /tmp/reviews.bin
Execution time: 15.709 seconds
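The log above reflects the defaults (maximum entropy, 100 iterations, feature cutoff 5). These can be overridden with a parameters file passed to DoccatTrainer via -params; the key names below follow OpenNLP's TrainingParameters conventions, and the specific values are just an illustration, not a tuning recommendation:

```shell
# Hypothetical params.txt; train with:
#   apache-opennlp-2.0.0/bin/opennlp DoccatTrainer -lang eng \
#     -params params.txt -data training.txt -model reviews.bin
cat > params.txt <<'EOF'
Algorithm=MAXENT
Iterations=300
Cutoff=3
EOF
```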
Evaluate the model against the test set:
apache-opennlp-2.0.0/bin/opennlp DoccatEvaluator -model reviews.bin -data test.txt
The output will be similar to:
Loading Document Categorizer model ... done (0.088s)
current: 19192.4 doc/s avg: 19192.4 doc/s total: 19532 doc
Average: 20560.0 doc/s
Total: 25001 doc
Runtime: 1.216s
Accuracy: 0.85592
Number of documents: 25000
Execution time: 1.356 seconds
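You can also try the model on ad-hoc text: OpenNLP's Doccat CLI tool reads one document per line from stdin and prints the predicted category. This sketch assumes reviews.bin and the OpenNLP install from the earlier steps, and skips the call if the binary is not present:

```shell
# Classify a single made-up review with the trained model.
OPENNLP=apache-opennlp-2.0.0/bin/opennlp
if [ -x "$OPENNLP" ]; then
  echo "I absolutely loved this film" | "$OPENNLP" Doccat reviews.bin
else
  echo "OpenNLP not installed yet; see the download step above" >&2
fi
```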