jasonbaldridge edited this page Jan 26, 2013 · 5 revisions

Course project, Phase 1

DUE: February 1, 2013, 1pm CST

Preamble

This project phase contains a series of exercises that get you working with the Twitter streaming API and doing some simple processing of tweets and their contents.

If there is anything unclear in the instructions, get in touch with the instructor right away.

Setting up the code

Clone the course repository or pull the latest changes and look in the applied-nlp/project/phase1 directory. You'll find the code stubs and the blank answer file there. There isn't much in the way of helper code, so ask the instructor right away if you are having trouble getting started.

Note: Even though you are creating Scala scripts here, don't be shy to define and use functions that make things work more easily.

Submitting your solutions

For each activity in this phase, fill in the information requested in answers_p1.txt and include any output files requested.

Submit your solutions as <lastname>_<firstname>_phase1.tgz. The following commands will do the right thing (substituting your own name, of course):

$ cp -r phase1 /tmp/baldridge_jason_phase1
$ cd /tmp
$ tar czf baldridge_jason_phase1.tgz baldridge_jason_phase1

Note: Please make sure there aren't any files other than those requested in the instructions below. (Your submission shouldn't be over 1M.)


1. Using the Twitter streaming API

Create account

Go to the Twitter signup page and get an account. Your account name should end in "_anlp", e.g. "jbaldrid_anlp" or "a1b2c3_anlp", etc. If you already have an account, you need to sign up with an email address that is different from the one you used for that account. Here's a handy trick if you use Gmail: just add +foo to your email address and you have a unique email for Twitter that still goes to your usual Gmail inbox. E.g. if your Gmail address is foobarbaz@gmail.com, then sign up on Twitter with foobarbaz+twitter1@gmail.com. When we create Twitter accounts for bots in later project phases, you can use "twitter2", "twitter3", etc. (Or any naming scheme you like that establishes a unique email for each Twitter account.)

As part of setting up the account, follow the @appliednlp account.

Written: Add your account name to the answer sheet.

Read the blog post

I've posted a simple walk-through for using the Twitter streaming API.

Read it and run the commands for yourself using the account you created above.

Searching

Use curl and the API to obtain 100 tweets that contain "austin", "chicago", "new york", or "san francisco". Save these in the file search100.json and include it in your submission.

Tip: Obtain more than 100 by directing the output to a file, and then use the Unix head or tail command to pull out just 100.

Pull tweets from your own account

Use curl and the API to follow the account you created above. While you are following your account in this way, tweet "I'm working on project phase one for @appliednlp." Confirm that it appears in the shell where you invoked curl. (Feel free to tweet more if you like.)

Written: Add the unique id of your tweet to the answer sheet.

2. Extract information from tweets

Modify the stub Scala program ExtractTweetInfo.scala so that it takes a file containing JSON tweets on the command line and outputs the user, follower count, and tweet text for each tweet. Here is some example output:

$ scala ExtractTweetInfo.scala search100.json | head -4
cmdorsey 5799 Oath Keepers \u00bb Blog Archive \u00bb Gun Owners Refuse To Register Under New York Law: http:\/\/t.co\/fjOi7R2P &lt; GOVT IS A CRIMINAL ORGANIZATION NOW
FredYonnet 20810 RT @TalibKweli: RT @DrPostALot: San Francisco: TOMORROW @FredYonnet, @TalibKweli at @yoshisSF_OAK.  Also appearing @MartianLuther. http: ...
twstdcncptsinc 218 Tonight for Indiana &amp; Chicago \nINDIANA\nWoodhollow After Dark - LADIES FREE TILL 11PM\n\nCHICAGO\nNikki - PW:... http:\/\/t.co\/6LUkOzgS
TheYankeesTrap 655 New York Yankees Creating Their Own Payroll Problems http:\/\/t.co\/1yYJxXaD #nyy #yankees
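If you're unsure how to get at the fields, here is a minimal sketch of one possible approach, pulling fields out with regexes rather than a real JSON parser. The field names "text", "screen_name", and "followers_count" are the ones Twitter actually uses, but everything else here (the sample line, the helper names) is illustrative, and the regex approach will break on tricky input:

```scala
// Hypothetical helpers for pulling a named field out of one line of tweet JSON.
// A JSON parsing library is the more robust route; this is only a sketch.
def stringField(name: String, json: String): Option[String] =
  ("\"" + name + "\":\"((?:\\\\.|[^\"\\\\])*)\"").r.findFirstMatchIn(json).map(_.group(1))

def numberField(name: String, json: String): Option[String] =
  ("\"" + name + "\":(\\d+)").r.findFirstMatchIn(json).map(_.group(1))

// A made-up, heavily simplified tweet; the real script would instead loop over
// io.Source.fromFile(args(0)).getLines and print one line per tweet.
val sample = """{"text":"hello austin","user":{"screen_name":"jbaldrid_anlp","followers_count":42}}"""
val user = stringField("screen_name", sample).getOrElse("?")
val followers = numberField("followers_count", sample).getOrElse("0")
val text = stringField("text", sample).getOrElse("")
println(user + " " + followers + " " + text)  // prints: jbaldrid_anlp 42 hello austin
```

Note that a real tweet object contains several nested objects (e.g. a retweeted status has its own user), so regexes that just grab the first match can pick up the wrong field; handle that however you see fit.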

Use ExtractTweetInfo.scala on search100.json to create a file extracted100.txt, as follows.

$ scala ExtractTweetInfo.scala search100.json > extracted100.txt

Written: Paste the last five lines of extracted100.txt into the answers.

3. Token counting

Modify the stub Scala file TokenCounter.scala so that it reads in the output from ExtractTweetInfo.scala and obtains counts of all the tokens. Each token is simply any sequence of non-whitespace, so in a sentence like the following:

Do you see the reflection of the #tower?   @ McCombs School of Business (CBA) http:\/\/t.co\/aXufYsxv

The tokens include Do, you, ... #tower?, @, ..., (CBA), http:\/\/t.co\/aXufYsxv.

Your program should go through just the text portions of the tweets (ignoring the user name and follower count at the start of each line). Having counted all the tokens, your program should output the top ten tokens, the top ten hashtags (the subset of tokens that start with #), and the top ten mentions (the subset of tokens that start with @).
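As a rough illustration of the counting step, here is a sketch on made-up input lines, assuming the line format ExtractTweetInfo.scala produces (user, follower count, then the tweet text, whitespace-separated):

```scala
// Made-up stand-ins for lines read from extracted100.txt.
val lines = List(
  "cmdorsey 5799 Gun owners in New York say #news @cmdorsey",
  "FredYonnet 20810 San Francisco show tonight #music @cmdorsey"
)

// Drop the first two fields (user and follower count); keep only text tokens.
val tokens = lines.flatMap(_.split("\\s+").drop(2))
val counts = tokens.groupBy(identity).map { case (t, ts) => (t, ts.length) }

// Top ten entries among the tokens satisfying a predicate, by descending count.
def topTen(pred: String => Boolean) =
  counts.toList.filter(p => pred(p._1)).sortBy(-_._2).take(10)

val topTokens = topTen(_ => true)
val topHashtags = topTen(_.startsWith("#"))
val topMentions = topTen(_.startsWith("@"))
println(topMentions)  // prints: List((@cmdorsey,2))
```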

Written: Provide the output when TokenCounter.scala is run on extracted100.txt.

4. Counting using streaming tweets

Modify the stub file StreamingCounter.scala so that it processes standard input that is piped to it and prints the top hashtags for batches of tweets as well as the top ten hashtags it has ever seen up to that point. Some explanation is in order to make this task clear.

First, if you are not sure what it means to process input piped into your program, note that problems 2 and 3 involve taking a file that is sitting on disk as a command line argument, and then processing it. Instead, we want to be able to take the output from another process, consume it as it is passed to us, and perform some computation on it. This will likely be new to some students, so here's a simple example to see what is going on.

In Unix systems, many programs are set up such that they accept standard input and they write to standard output. We've seen some of this already, but here's a self-contained example. Let's start with echo, which takes command line arguments and outputs them to standard output.

$ echo "Hello, world, how y'all doing?"
Hello, world, how y'all doing?

Instead of letting that be output to the terminal, you can pipe the output to another program. Here's an example of piping it to tr, which "translates" characters of one set to another; here we turn all uppercase characters to lowercase.

$ echo "Hello, world, how y'all doing?" | tr 'A-Z' 'a-z'
hello, world, how y'all doing?

And again, rather than letting that output go to the screen, we can pipe it again through tr and this time ask it to turn all non-lowercase characters into newlines (the "c" option means to take the complement of the argument, and "s" means to squeeze multiple occurrences into one).

$ echo "Hello, world, how y'all doing?" | tr 'A-Z' 'a-z' | tr -cs 'a-z' '\n'
hello
world
how
y
all
doing

Now, save the following as the file Reverser.scala.

// Read lines from standard input, reverse each one, and print the result.
io.Source.fromInputStream(System.in).getLines.map(_.reverse).foreach(println)

This is a Scala program that accepts standard input (via the stream System.in), reverses each line and then prints it to standard output. We can now pipe the output of what we did above through this.

$ echo "Hello, world, how y'all doing?" | tr 'A-Z' 'a-z' | tr -cs 'a-z' '\n' | scala Reverser.scala 
olleh
dlrow
woh
y
lla
gniod

Given this, and given that the use of curl to access the Twitter streaming API outputs raw tweets to standard output, we can pipe that into StreamingCounter.scala (assuming it is set up to accept System.in input as shown above). Here's what your output should look like.

$ curl https://stream.twitter.com/1/statuses/sample.json -u$TWUSER:$TWPWD | scala StreamingCounter.scala 1000
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 2671k    0 2671k    0     0   126k      0 --:--:--  0:00:21 --:--:--  134k------------------------------
Number of tweets seen: 1000

[Batch] #FF:7 #RT:5 #Hispanos:3 #fail:2 #\u00daLTIMAHORA:2 #RakutenIchiba:2 #weridos:1 #ugh:1 #twisters:1 #tweet:1

[Global] #FF:7 #RT:5 #Hispanos:3 #fail:2 #\u00daLTIMAHORA:2 #RakutenIchiba:2 #weridos:1 #ugh:1 #twisters:1 #tweet:1

100 5399k    0 5399k    0     0   125k      0 --:--:--  0:00:43 --:--:--  144k------------------------------
Number of tweets seen: 2000

[Batch] #FF:6 #hispanos:3 #UkrainianDirectionersWantTMHtour:3 #RT:3 #EsTodoRisasHastaQue:3 #summer:2 #happy:2 #\u0632\u062f_\u0631\u0635\u064a\u062f\u06433:2 #TeamFollowBack:2 #MinhaNota:2

[Global] #FF:13 #RT:8 #UkrainianDirectionersWantTMHtour:4 #Hispanos:4 #hispanos:3 #TeamFollowBack:3 #EsTodoRisasHastaQue:3 #summer:2 #rt:2 #jobs:2

100 8273k    0 8273k    0     0   126k      0 --:--:--  0:01:05 --:--:--  133k------------------------------
Number of tweets seen: 3000

[Batch] #FF:6 #ff:3 #rageofbahamut:2 #lfc:2 #TFBJP:2 #yolomoment:1 #yeseB1:1 #worthit:1 #wis10:1 #weddingbusines:1

[Global] #FF:19 #RT:8 #ff:5 #hispanos:4 #UkrainianDirectionersWantTMHtour:4 #TeamFollowBack:4 #Hispanos:4 #TFBJP:3 #EsTodoRisasHastaQue:3 #summer:2

100 10.7M    0 10.7M    0     0   128k      0 --:--:--  0:01:25 --:--:--  126k------------------------------
Number of tweets seen: 4000

[Batch] #FF:5 #BrazilIsExcitedForDatesOfBelieveTour:3 #rt:2 #gameinsight:2 #TeamFollowBack:2 #RT:2 #EsTodoRisasHastaQue:2 #Afcon2013:2 #:2 #winter:1

[Global] #FF:24 #RT:10 #ff:6 #TeamFollowBack:6 #EsTodoRisasHastaQue:5 #BrazilIsExcitedForDatesOfBelieveTour:5 #rt:4 #hispanos:4 #UkrainianDirectionersWantTMHtour:4 #TFBJP:4

100 13.4M    0 13.4M    0     0   128k      0 --:--:--  0:01:47 --:--:--  125k------------------------------
Number of tweets seen: 5000

[Batch] #FF:8 #followall:2 #TuitUtil:2 #TeamFollowBack:2 #RT:2 #IWishICould:2 #GoodLuckOnTourLittleMix:2 #EsTodoRisasHastaQue:2 #BELIEVEacoustic:2 #4DAYS:2

[Global] #FF:32 #RT:12 #TeamFollowBack:8 #EsTodoRisasHastaQue:7 #ff:6 #rt:5 #hispanos:5 #BrazilIsExcitedForDatesOfBelieveTour:5 #:5 #WaysToGetSlapped:4

It will continue to process tweets until you kill the process (CTRL-C) or Twitter drops your connection.

Looking at the above example, you can see that StreamingCounter.scala accepts input from standard input, and it also has a command line option that specifies how big each batch is. After every batch of tweets, it provides the counts of the total number of tweets seen and then the top ten hashtags for that batch and the global top ten based on all tweets processed thus far.
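The batching logic described above can be sketched roughly as follows. This is a toy stand-in with made-up tweets and a batch size of 2; the real script would read io.Source.fromInputStream(System.in).getLines, take the batch size from args(0), and extract hashtags from the JSON rather than from bare text:

```scala
import scala.collection.mutable

val batchSize = 2  // would come from args(0) in the real script
val global = mutable.Map[String, Int]().withDefaultValue(0)
var batch = mutable.Map[String, Int]().withDefaultValue(0)
var seen = 0

// Format the top ten entries of a count map, highest count first.
def topTen(counts: scala.collection.Map[String, Int]): String =
  counts.toList.sortBy(-_._2).take(10).map { case (t, c) => t + ":" + c }.mkString(" ")

// Made-up stand-ins for tweet texts arriving on standard input.
val fakeTweets = List("fun #scala day", "more #scala and #nlp", "just #nlp", "hello")
for (text <- fakeTweets) {
  for (tok <- text.split("\\s+"); if tok.startsWith("#")) {
    batch(tok) += 1
    global(tok) += 1
  }
  seen += 1
  if (seen % batchSize == 0) {
    println("Number of tweets seen: " + seen)
    println("[Batch] " + topTen(batch))
    println("[Global] " + topTen(global))
    batch = mutable.Map[String, Int]().withDefaultValue(0)  // reset per-batch counts
  }
}
```

The key design point is keeping two count maps: one that is reset after every batch, and one that accumulates forever.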

We can also see that I ran this on a Friday (evidenced by #FF, or Follow Friday), and that Brazil has beliebers...

Note that you can run a file on disk through this version as well, using cat (which takes the contents of a file and spills it to standard output). Here's the command (using a batch size of 10 since this is a small file).

$ cat search100.json | scala StreamingCounter.scala 10

You'll want to work with your program in this way while writing and debugging StreamingCounter.scala.

Written: Provide the output for batches of 1000 up to 5 batches (similar to the above).

Written: Look at the trending topics on Twitter while you are doing this. Are you finding any of the same top hashtags? What were they?

Extra

There are a number of ways you could go further with this if you would like. Note that this isn't for a lot of points (it could bring you from a 95 to 100), so it is mainly for students who are interested in trying out a bit more.

  • Add an English-language filter so that the counts you get exclude many non-English tweets. A simple strategy is to grab an English stopword list, such as this one, and then check that at least one word in the tweet is a stopword. (You'll probably want to remove one-letter stopwords like 'a' from the list, though.)

  • Track some user metrics with problem 4, such as most mentioned users, and see how well that correlates with the follower counts for those users. (There are multiple ways you could get the follower counts, but a reasonable strategy would be to write a file with usernames and the number of mentions, and then write another program that reads that file, makes HTTP requests to get the user info, parses the info, etc.)

  • Implement problem 4 using Storm or Akka. (This is significantly more effort.)
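For the first extra, the stopword-based English filter can be sketched like this. The tiny stopword set below is only illustrative; use a real English stopword list:

```scala
// Toy stopword set; a real filter would load a full English stopword list
// (minus one-letter entries like "a", which match too promiscuously).
val stopwords = Set("the", "and", "is", "you", "for", "with")

// Keep a tweet only if at least one of its tokens is an English stopword.
def probablyEnglish(text: String): Boolean =
  text.toLowerCase.split("\\s+").exists(stopwords)

println(probablyEnglish("working on the course project"))  // prints: true
println(probablyEnglish("trabajando en el proyecto"))      // prints: false
```

This is a crude heuristic, but stopwords are so frequent in English that it filters out a surprising share of non-English tweets.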