
Classification

DUE: by Monday, March 25 by 1pm

Preamble

This assignment involves using logistic regression for text classification tasks. The main tasks are using existing classification software and extracting features for it to use in learning and making predictions.

Notes:

  • As usual, the description of the problems for this homework is somewhat long, but much of the text is there to provide documentation and help you know what to do.
  • You should read all of the problems before starting. Then read through each problem carefully, in its entirety, before answering questions and doing the implementation.
  • It will be best to solve the problems in order since each one builds on the previous problem.
  • Your solutions must not use any var variables.
  • If you run into what you think might be a bug with Nak or applied-nlp, please let me know right away so I can fix it if it indeed is a bug.
  • Feel free to look at the naive Bayes homework on which this is based. It won't necessarily help you, but if you want to dig deeper, that homework actually involves implementing naive Bayes yourself rather than just using an existing implementation.
  • If you have any questions or problems with any of the materials, don't hesitate to ask!

If you don’t absolutely need to have words on paper, don’t print this homework description out -- the example output makes it lengthy, and it will be easier to cut and paste directly from the web page.

You may work on this homework in teams. Make sure to include the team members in the write-up. Only one submission is required per team. Teams are responsible for self-policing to ensure that everyone is helping get the work done. However, please get in touch with me if you feel someone is not working and you need help resolving the situation.

Code set up and submitting your solutions

Fork the applied-nlp repository. (If you've already forked it, you'll need to fetch and merge upstream from the master repo---see the instructions for forking.) Make sure you've completed the steps in the applied-nlp README such that you can run the anlp command on the commandline.

Clone the Nak repository and follow the instructions so that you'll be able to run the nak command on the commandline.

Note for anyone who isn't a student in the class: If you are trying this homework out after it is due, you may find that the contents of the repository have changed. In that case, you can download version 0.3.0. (Also available as tag v0.3.0 in the repo.) The version of Nak this homework requires is v1.1.1.

Your implemented solutions will be done in the following two files (currently stubs):

  • applied-nlp/src/main/scala/appliednlp/classify/PrepAttach.scala
  • applied-nlp/src/main/scala/appliednlp/classify/ConfidenceScorer.scala

You will work with classification datasets that are located in $NAK_DIR/data/classify. Go to that directory and have a look at it. Note that most of the commands suggested in the problems assume that you are running them in the $NAK_DIR/data/classify directory.

Submission: Submit your written answers on Blackboard as <lastname>_<firstname>_hw4_answers.pdf. Make sure to include the link to your fork of applied-nlp at the top of the file.

Problem 1 - A good day to play tennis? [5 pts]

Let’s start with a simple example problem from Tom Mitchell’s book Machine Learning (you don't need the book for this problem). The problem is to predict whether it is a good day to play tennis given various factors and some initial data that provides information about whether previous days were good or bad days for tennis. The factors include (in the format "Attribute: <List of Possible Values>"):

Outlook: Sunny, Rain, Overcast
Temperature: Hot, Mild, Cool
Humidity: High, Normal
Wind: Strong, Weak

Table 3.2 on page 59 of Mitchell’s book contains information for fourteen days; this data is encoded in the file $NAK_DIR/data/classify/tennis/train. There is another related file called test in the same directory. As you might expect, you will learn model parameters using train and evaluate the resulting model on the examples in test.

Each row in the data files corresponds to a single classification instance. For example, here is the training set.

data/classify/tennis/train
Outlook=Sunny,Temperature=Hot,Humidity=High,Wind=Weak,No
Outlook=Sunny,Temperature=Hot,Humidity=High,Wind=Strong,No
Outlook=Overcast,Temperature=Hot,Humidity=High,Wind=Weak,Yes
Outlook=Rain,Temperature=Mild,Humidity=High,Wind=Weak,Yes
Outlook=Rain,Temperature=Cool,Humidity=Normal,Wind=Weak,Yes
Outlook=Rain,Temperature=Cool,Humidity=Normal,Wind=Strong,No
Outlook=Overcast,Temperature=Cool,Humidity=Normal,Wind=Strong,Yes
Outlook=Sunny,Temperature=Mild,Humidity=High,Wind=Weak,No
Outlook=Sunny,Temperature=Cool,Humidity=Normal,Wind=Weak,Yes
Outlook=Rain,Temperature=Mild,Humidity=Normal,Wind=Weak,Yes
Outlook=Sunny,Temperature=Mild,Humidity=Normal,Wind=Strong,Yes
Outlook=Overcast,Temperature=Mild,Humidity=High,Wind=Strong,Yes
Outlook=Overcast,Temperature=Hot,Humidity=Normal,Wind=Weak,Yes
Outlook=Rain,Temperature=Mild,Humidity=High,Wind=Strong,No

Each instance consists of a list of attribute values, separated by commas, and the last element is the classification value. The value is "Yes" if it is a good day to play tennis based on the conditions, and "No" if it is not. For this toy example, we are interested in determining whether the probability of playing tennis is higher than the probability of not playing tennis.
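To make the format concrete, here is a minimal Scala sketch (hypothetical, just for illustration) of pulling one instance apart:

// Split an instance into its attribute=value features and its label.
val line = "Outlook=Sunny,Temperature=Hot,Humidity=High,Wind=Weak,No"
val fields = line.split(",")
val features = fields.dropRight(1)  // the attribute=value pairs
val label = fields.last             // the classification value, here "No"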

The data includes a test set for the tennis task as well, provided in full here:

data/classify/tennis/test
Outlook=Sunny,Temperature=Cool,Humidity=High,Wind=Strong,No
Outlook=Overcast,Temperature=Cool,Humidity=Normal,Wind=Weak,Yes
Outlook=Sunny,Temperature=Hot,Humidity=Normal,Wind=Weak,Yes
Outlook=Rain,Temperature=Hot,Humidity=High,Wind=Strong,No
Outlook=Sunny,Temperature=Cool,Humidity=Normal,Wind=Weak,Yes
Outlook=Overcast,Temperature=Hot,Humidity=High,Wind=Strong,No
Outlook=Sunny,Temperature=Mild,Humidity=High,Wind=Weak,Yes
Outlook=Overcast,Temperature=Mild,Humidity=Normal,Wind=Strong,Yes
Outlook=Rain,Temperature=Cool,Humidity=Normal,Wind=Strong,No
Outlook=Overcast,Temperature=Cool,Humidity=Normal,Wind=Strong,Yes
Outlook=Rain,Temperature=Hot,Humidity=Normal,Wind=Weak,Yes
Outlook=Sunny,Temperature=Cool,Humidity=High,Wind=Weak,Yes
Outlook=Rain,Temperature=Hot,Humidity=Normal,Wind=Strong,No

Like the training set, this provides a list of instances, and for each instance, the values for the attributes and the classification outcome. We'll train classifiers using logistic regression, use them to predict the most probable label for each instance, and then compare those predictions with the gold standard labels provided in the test set.

The only thing you need to do for this problem is run the nak command line tool so that it trains a classifier on the train file and evaluates it on the test file. Run the following commands:

$ cd $NAK_DIR/data/classify
$ nak classify --train tennis/train --eval tennis/test

You should see the following output:

Accuracy: 69.23076923076923

Written answer: Train a classifier on the "test" file and evaluate it on the "train" file. Report the accuracy.

Problem 2 - Prepositional Phrase Attachment [10 pts]

Prepositional phrase attachment is the well-known task of resolving a common ambiguity in English syntax regarding whether a prepositional phrase attaches to the verb or the noun in sentences with the pattern Verb Noun_Phrase Prepositional_Phrase. An example is I saw the man with the telescope. If the prepositional phrase attaches to the verb, then the seeing was done with the telescope; if it attaches to the noun, it indicates that the man had the telescope in his possession. A clear difference can be seen with the following related examples.

  • Attach to the noun: He ate the spaghetti with meatballs.
  • Attach to the verb: He ate the spaghetti with chopsticks.

We can treat this decision just like any simple labeling problem: each sentence receives a label V or N indicating the attachment decision, and we ignore any potential gains from considering other syntactic configurations in the same sentence.

For this problem, you will use a conveniently formatted data set for prepositional phrase attachment which has been made available by Adwait Ratnaparkhi. You can find it in the directory $NAK_DIR/data/classify/ppa. Go to that directory and list the contents. There are three files which you will use for this problem: training, devset, and test. Look at the contents of training:

data/classify/ppa/training
0 join board as director V
1 is chairman of N.V. N
2 named director of conglomerate N
3 caused percentage of deaths N
5 using crocidolite in filters V
...

Each row lists an abbreviated form of a prepositional phrase attachment. The first item is just a source sentence identifier that we can ignore. The four words correspond to the head verb, head noun, preposition, and head noun object of the preposition, in that order. The final element indicates whether the attachment was to the verb (V) or to the noun (N). For example, for the two spaghetti-eating sentences given above, the abbreviated forms would be:

4242 ate spaghetti with meatballs N
4243 ate spaghetti with chopsticks V

For this exercise, you will build a classifier that learns a model from the data in training and use it to classify new instances. You will develop your model using the material in devset. You must not personally inspect the contents of test -- you will run your classifier on test only once (for problem 3), when you are done developing.

The first thing you must do is produce features from the ppa data in the format used previously for the tennis problem. Each of the four columns in the ppa data implicitly indicates a separate attribute -- let’s call them verb, noun, prep, and prep_obj, respectively. To begin, we'll just create features that are based directly on the attributes and their values. So, a line like,

0 join board as director V

becomes the following:

verb=join,noun=board,prep=as,prep_obj=director,V
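The conversion is purely mechanical. As a rough sketch of what it amounts to (hypothetical code, not the actual PpaFeatures implementation):

// Convert one PPA line into the comma-separated feature format.
def basicFeatures(ppaLine: String): String = {
  val Array(_, verb, noun, prep, prepObj, label) = ppaLine.split(" ")
  "verb=" + verb + ",noun=" + noun + ",prep=" + prep +
    ",prep_obj=" + prepObj + "," + label
}
// basicFeatures("0 join board as director V")
// yields "verb=join,noun=board,prep=as,prep_obj=director,V"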

The main method of appliednlp.classify.PpaFeatures reads in a ppa file and produces the features in the above format. Using this program, produce feature files for training, development, and test as follows:

$ cd $NAK_DIR/data/classify
$ mkdir out
$ anlp run appliednlp.classify.PpaFeatures ppa/training > out/ppa.basic.training
$ anlp run appliednlp.classify.PpaFeatures ppa/devset > out/ppa.basic.devset
$ anlp run appliednlp.classify.PpaFeatures ppa/test > out/ppa.basic.test

You can run nak classify on the feature files that are produced; it reports the accuracy of the resulting predictions:

$ nak classify --train out/ppa.basic.training --eval out/ppa.basic.devset 
Accuracy: 82.00049517207229

Ratnaparkhi et al. (1994) obtain accuracies of around 81-82% on the test set, so we're already doing well with just this much.

Nak uses liblinear's L2-regularized logistic regression solver by default to estimate the parameters of the model. With this solver (and all of liblinear's other solvers), there is a parameter C which controls the trade-off between having a simple model on the one hand and fitting the training data closely on the other. Larger values of C give more importance to the latter. One way of thinking about it is that we don't trust the training data fully, so we don't force the model parameters to fit it perfectly.
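Concretely, this trade-off is the standard L2-regularized logistic regression objective (writing x_i for the feature vector of training instance i and y_i in {-1, +1} for its label):

\min_{w} \; \frac{1}{2}\, w^\top w \; + \; C \sum_{i=1}^{n} \log\left(1 + e^{-y_i\, w^\top x_i}\right)

The first term favors small (simple) weights; C scales how heavily errors on the training instances count against that.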

The default for C is 1.0, but you can set it to other values by using the --cost (or -c) option. For example:

$ nak classify --train out/ppa.basic.training --eval out/ppa.basic.devset -c 10
Accuracy: 80.93587521663778

Here the accuracy has decreased, so this has put too much importance on ensuring that we get low error on the training instances; this means the model has overfit the training data somewhat and thus did not generalize to the unseen devset as well as the model that used a C value of 1.0.

Find a good value for C that improves the accuracy on the development set.

Written answer. Report the best C you found and what accuracy you obtained.

Tip. You can easily check many possible values for C using the Unix command line. For example:

$ for i in {1..5}; do echo $i; nak classify --train out/ppa.basic.training --eval out/ppa.basic.devset -c $i; done

You should see multiple accuracy values go by, each one associated with the C value printed above it. (If this doesn't work, then you might not be using the bash shell.) You can also specify lists of specific values such as for i in .5 1 1.5 2 2.5 5 for drilling down further.

Remember: don't do anything with ppa/test yet!

Problem 3 - Extending the feature set [30 pts]

The simple set of features used for the ppa data above can definitely be improved upon. For example, you could have features that (see the sketch after this list):

  • are combinations of the head noun and verb, e.g. verb+noun=join+board
  • are a stemmed version of the verb, e.g. verb_stem=caus from cause, causes, caused, and causing
  • identify all numbers, e.g. noun_form=number for 42, 1.5, 100,000, etc.
  • identify capitalized forms, e.g. noun_form=Xx for Texas and noun_form=XX for USA.
  • use word clusters derived from a large corpus (see the discussion of bitstrings in Ratnaparkhi et al. (1994))
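To make a few of the suggestions above concrete, here is a hypothetical sketch of computing such feature strings (made-up variable names, not the stub's actual API), assuming verb and noun hold the relevant fields and stemmer is the Porter stemmer object mentioned in the tips below:

// A few of the suggested features, as plain strings.
val comboFeature = "verb+noun=" + verb + "+" + noun  // attribute combination
val stemFeature = "verb_stem=" + stemmer(verb)       // stemmed verb
val nounForm =
  if (noun.matches("[0-9.,]+")) "noun_form=number"   // 42, 1.5, 100,000
  else if (noun.matches("[A-Z]+")) "noun_form=XX"    // USA
  else if (noun.head.isUpper) "noun_form=Xx"         // Texas
  else "noun_form=x"                                 // everything else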

Part (a). Written answer. Read Ratnaparkhi et al. (1994) and, in 1-3 paragraphs, describe how the clusters were generated and how the bitstrings encode that clustering.

Part (b). Implementation. Create extended ppa features by modifying ExtendedFeatureExtractor, following instructions in the code. Look at the suggestions above and also look at the training file to see if you can think of other features. You should come up with at least five new (types of) features, but are encouraged to do much more. Be creative! Look at the directions in ExtendedFeatureExtractor so that you can define features that will be output when the --extended option is given to PpaFeatures.

Tip: The bitstrings are really useful.

Part (c). Written answer. Describe five of the features you added, and why you think they would be useful.

Tip. Some resources have been imported for you:

  • The Porter stemmer is available via the stemmer object -- to get the stem of a word, you can call, for example, stemmer("walked").
  • The bitstrings supplied with the data set capture word clusters. They can be imported by using the --bitstrings option, which gives you a dictionary from words to BitVector objects. There is an example in the code of how you can use them, and the code for BitVectors (which is very rudimentary, not a serious implementation at all, but fine for the needs of this problem) is included in PrepAttach.scala.

As you enable new features, you can try them out by generating new feature files and running the classifier.

$ anlp run appliednlp.classify.PpaFeatures -e -b ppa/bitstrings ppa/training > out/ppa.extended.training
$ anlp run appliednlp.classify.PpaFeatures -e -b ppa/bitstrings ppa/devset > out/ppa.extended.devset
$ nak classify --train out/ppa.extended.training --eval out/ppa.extended.devset 

Since you'll be adding and modifying features while you develop, you'll need to continually regenerate the feature files and then test again. This bash line will do that for you.

$ for f in training devset; do echo $f; anlp run appliednlp.classify.PpaFeatures -e -b ppa/bitstrings ppa/$f  > out/ppa.extended.$f; done; echo "Evaluating"; nak classify --train out/ppa.extended.training --eval out/ppa.extended.devset

As before with the basic features, find an optimal C on the dev set with the extended features. (Note that this may change as you add more features.) When you are satisfied that you cannot improve the performance any further, you can finally try out the test set! (Once you've done this, there is no going back to change the features any more.)

Part (d). Written answer. Plot the performance on the devset of the basic and extended features for different values of C from .1 to 3 in .1 increments. You can do this with R, a spreadsheet program, or anything else that allows you to plot and then include the image in your PDF hand-in. You should use a line graph, and the values of C should be on the x-axis and the performance should be on the y-axis. Plot the lines for both basic and extended features together. You'll probably want the y-axis to go from 75 to 85.

Part (e). Written answer. You've now found the best C for the dev set for each feature set, so you can use those values to see performance on the test set. What is the performance using the basic features and the extended features, each with their best C?

Note: Don't forget that you need to generate basic and extended features for the test set and use the appropriate one for the corresponding training features.
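For the extended features, that means mirroring the earlier commands, for example:

$ anlp run appliednlp.classify.PpaFeatures -e -b ppa/bitstrings ppa/test > out/ppa.extended.test
$ nak classify --train out/ppa.extended.training --eval out/ppa.extended.test -c <your best C>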

Problem 4 - Classifier confidence and performance [20 pts]

The classify command has an option --predictfile that outputs a probability distribution for each instance, for example:

$ nak classify --train tennis/train --eval tennis/test --predictfile out/tennis.predictions.txt
Accuracy: 69.23076923076923
$ cat out/tennis.predictions.txt 
No 0.7452833719203171 Yes 0.25471662807968287
Yes 0.9128038194961268 No 0.08719618050387325
Yes 0.6685672020354926 No 0.3314327979645074
No 0.7448945443961581 Yes 0.2551054556038419
Yes 0.7250866867660491 No 0.27491331323395085
Yes 0.5091961312957145 No 0.4908038687042855
Yes 0.5247183453103467 No 0.47528165468965333
Yes 0.860951048891003 No 0.139048951108997
Yes 0.5965590903796518 No 0.4034409096203481
Yes 0.8175002769378088 No 0.18249972306219123
Yes 0.7254945793089875 No 0.2745054206910125
No 0.5559534029104584 Yes 0.44404659708954164
Yes 0.530717448134348 No 0.46928255186565204

The most probable label is the leftmost on each line. Notice that the probabilities sum to one (as they should). So, if we take the most probable label as the prediction (which is the usual thing to do), this means that the model predicts "No" for the first instance in the test file:

Outlook=Sunny,Temperature=Cool,Humidity=High,Wind=Strong

This does match the actual label given in the file for that instance. (Some of the other predictions, of course, do not match the test labels.)

Some examples have a higher probability for their most probable label than others. As it turns out, classifier confidence is a good indicator of when the prediction is likely to be correct: the higher the probability, the higher the confidence, and -- typically -- the more likely it is that the prediction is correct.

Notice that the second line is the most confident, with a probability of .9128 for Yes, while the sixth line is the least confident, with a probability of .5092 for Yes. For this problem, you will sort all the predictions based on the second-column probability (the probability of the best label), and then score the most confident examples, the least confident examples, and the band of examples in between.

Part (a). Implementation. Write code to calculate how accuracy changes based on confidence. Implement this in the ConfidenceScorer object in the file ConfidenceScorer.scala. You must output the accuracy for the top-third most confident examples, the middle third, and the bottom-third least confident examples.
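One possible shape for this computation is sketched below (hypothetical code -- your solution goes in the ConfidenceScorer stub). It assumes the instance file uses the comma-separated format shown earlier, with the gold label as the last field:

import scala.io.Source

object ConfidenceScorerSketch {
  def main(args: Array[String]): Unit = {
    val Array(instanceFile, predictFile) = args
    // Gold label is the last comma-separated field of each instance.
    val gold = Source.fromFile(instanceFile).getLines.toList.map(_.split(",").last)
    // The best label and its probability are the first two tokens per line.
    val preds = Source.fromFile(predictFile).getLines.toList.map { line =>
      val tokens = line.split(" ")
      (tokens(0), tokens(1).toDouble)
    }
    // Sort from most to least confident, then score each third.
    val sorted = gold.zip(preds).sortBy { case (_, (_, prob)) => -prob }
    val third = sorted.length / 3
    def accuracy(band: List[(String, (String, Double))]) =
      100.0 * band.count { case (g, (p, _)) => g == p } / band.length
    println(f"High confidence accuracy: ${accuracy(sorted.take(third))}%.2f")
    println(f"Mid confidence accuracy: ${accuracy(sorted.slice(third, 2 * third))}%.2f")
    println(f"Low confidence accuracy: ${accuracy(sorted.drop(2 * third))}%.2f")
  }
}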

As an example, when it is done, you should be able to run it like this on the tennis data set:

$ nak classify --train tennis/train --eval tennis/test --predictfile out/tennis.predictions.txt
Accuracy: 69.23076923076923
$ anlp run appliednlp.classify.ConfidenceScorer tennis/test out/tennis.predictions.txt 
High confidence accuracy: 100.00
Mid confidence accuracy: 75.00
Low confidence accuracy:  25.00

Of course, you'll do this for the PPA dataset, which will give a more interesting result.

Part (b). Written answer. Use your scoring implementation to calculate the accuracies of the three confidence bands for the PPA dataset for both basic and extended features on the PPA development data. Include the output in your answer. In 1-3 paragraphs, discuss how you might take advantage of this property for learning or use of such models.

Problem 5 - Another text classification data set [35 pts]

Obtain another labeled dataset and do classification on it. Most of your work will be converting the raw data format into that needed by the classifiers (including doing feature extraction). You can choose any suitable data set you like, but here are a couple of suggestions:

If you have another one you'd like to consider, send me an email and I'll let you know if it is okay.

Part (a). Implementation. Write the code for your implementation in the file TextClassify.scala in the appliednlp/classify source code directory.
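As a starting point, emitting basic unigram features in the same comma-separated format used above might look something like this (a minimal hypothetical sketch; tokenization and feature design are up to you, and a real version must handle tokens containing commas):

// Turn a labeled document into one line of unigram features.
def unigramFeatures(text: String, label: String): String = {
  val tokens = text.toLowerCase.split("\\s+").filter(_.nonEmpty)
  (tokens.map(t => "word=" + t) :+ label).mkString(",")
}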

Part (b). Written answer. Describe:

  • The dataset you chose and why. (1-2 paragraphs)
  • The features you used. Minimally, you need to have unigram word features, but you can really get quite creative with this, e.g. you could compute a topic model for the data set and then use the document topics and/or the topics themselves to generate features.
  • Summarize your results. Minimally, this means stating the performance you obtained, but could also include graphs and tables.
  • Anything else interesting, including special challenges you encountered (and perhaps overcame) and anything unexpected about what you did.
  • If there is relevant literature that you looked at, mention it in your write-up.

If you go above and beyond the basic requirement for this, you'll get extra credit.