\documentclass[letterpaper, 12pt]{article}
\usepackage{hyperref,fancyhdr,lastpage,graphicx,newclude,microtype}
\usepackage[justification=raggedright,labelfont=bf,format=plain,labelsep=period,textformat=period,singlelinecheck=false]{caption}
\usepackage[margin=1in]{geometry}
\usepackage[T1]{fontenc}
\usepackage[scaled]{helvet}
\renewcommand*\familydefault{\sfdefault}
\setlength{\headheight}{15.2pt}
\pagestyle{fancy}
\rhead[\small{Page \thepage\ of \pageref{LastPage}}]{\small{Page \thepage\ of \pageref{LastPage}}}
\cfoot[\small{Under a Creative Commons license.\\ See \url{https://github.com/woodrad/Twitter-Sentiment-Mining} for more information.}]{\small{Under a Creative Commons license.\\ See \url{https://github.com/woodrad/Twitter-Sentiment-Mining} for more information.}}
\def\contentsname{}
\usepackage[noae]{Sweave}
\begin{document}
\begin{center}
\huge{\bf{Mining Twitter Using R}}\\[0.5ex]
\normalsize{By Mathew Woodyard}\\
\normalsize{Last Modified \today}
\end{center}
\hrule
\section{Motivation}
We endeavor to answer the questions: Why aren't people buying my widgets? Are people saying bad things about me? Why are the students of Southern Illinois University Edwardsville (\href{http://www.siue.edu/}{SIUE}) unhappy?
These are classic questions that concern product marketers and statisticians alike: what convinces people to tell us what they think, what they own, what they will buy, where they are, and who they are? How do we get honest answers? What can we infer from a large number of these answers?
There is some good news, because...
\subsection{300 Million People Are Here To Chat!}
Twitter is a ``microblogging'' service that facilitates small bits of communication. Each message, a ``tweet'', contains no more than 140 characters. \href{http://www.wolframalpha.com/input/?i=number%20of%20Twitter%20users}{300 million} people use \href{http://Twitter.com}{Twitter}. But how do they use it? Why? How does this help us?
Twitter has some use cases and features that result in a nice corpus. It is often used for consuming news stories and participating in current events. Many people also use Twitter to express emotional states, often complaining about or praising places, products, and services. Since tweets contain no more than 140 characters, finding relevant tweets is normally a fairly easy task. Twitter's automatic grouping and ``hashtags'', tags that group tweets by topic and start with a hash mark (\#), enable us to find what we want with relative ease.
In my example, I will use tweets about SIUE. Tweets were selected from the stream based on the occurrence of the terms ``SIUE'' and ``\#onlyatsiue'', the latter having spent some time as a major regional trending topic.
\begin{figure}[htbp]
\centering
\includegraphics{Images/tweet.png}
\caption{A tweet in its natural habitat}
\end{figure}
Now that we know people are generating data we can use, how can we provide analysis that makes these tweets useful?
\section{Harnessing Tweets}
Our strategy will consist of the following steps.
\begin{enumerate}
\item Sip from the stream of tweets using \href{http://cran.r-project.org/web/packages/twitteR/index.html}{twitteR} and \href{http://cran.r-project.org/web/packages/ROAuth/index.html}{ROAuth} to build a corpus.
\item Explore our new corpus.
\item Select a sentiment lexicon. Download it and load it into R.
\item Score tweets using the function provided or using your own classification scheme.
\item Summarize the results.
\end{enumerate}
\subsection{Building a Corpus}
First, we sip from the stream to build a corpus. Be aware that we can only sip, and not guzzle, due to Twitter's \href{https://dev.twitter.com/docs/rate-limiting}{rate limiting} (note that the limit applies to reading, not to the number of tweets you post using twitteR). The code below loads some previously downloaded tweets. Code to append new tweets from the stream is commented out and can be run when you want to expand the corpus. Note that by registering with Twitter's developer program, you can get API keys that increase the rate limit for your IP/key combination.
<<socialsipper>>=
# Load needed libraries to allow R to interface with twitter and
# interface with the API using OAuth.
require(twitteR)
require(ROAuth)
# The code below does not work due to a current bug in twitteR.
# This means that the rate of downloading tweets will be limited.
# The following code should NEVER be shared. It includes private keys
# that authenticate against the twitter API. This code was taken from the
# twitteR documentation. Run the code from this block to the next comment.
#
# reqURL <- "https://api.twitter.com/oauth/request_token"
# accessURL <- "https://api.twitter.com/oauth/access_token"
# authURL <- "https://api.twitter.com/oauth/authorize"
# consumerKey <- "keygoeshere"
# consumerSecret <- "keygoeshere"
# twitCred <- OAuthFactory$new(consumerKey=consumerKey,
#                              consumerSecret=consumerSecret,
#                              requestURL=reqURL,
#                              accessURL=accessURL,
#                              authURL=authURL)
# twitCred$handshake()
#
# After entering the PIN provided by twitter, run the following command.
#
# registerTwitterOAuth(twitCred)
# Load saved tweets.
load(file="Data/siue.tweets.RData")
# Uncomment to add hot new tweets.
# siue.tweets <- append(siue.tweets, searchTwitter('siue', n=1500))
# siue.tweets <- append(siue.tweets, searchTwitter('#onlyatsiue', n=1500))
@
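If you do append fresh tweets, it is worth saving the expanded corpus back to disk so that the \verb+load()+ call above picks it up on the next run. A minimal sketch, left unevaluated here so the saved data is not overwritten when this document is compiled:
<<corpussave,eval=FALSE>>=
# Persist the expanded corpus for future sessions.
save(siue.tweets, file="Data/siue.tweets.RData")
@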
\subsection{Exploring Your Hot New Corpus}
Now that we have the corpus loaded, let's explore what people from SIUE are saying about their university.
<<corpusexplore>>=
tail(siue.tweets, 2)
@
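Beyond reading individual messages, it helps to know how large the sample is and when it was collected. The sketch below assumes, as above, that \verb+siue.tweets+ is a list of \verb+twitteR+ status objects, which expose a \verb+getCreated()+ accessor alongside \verb+getText()+:
<<corpussize>>=
# How many tweets are in the corpus, and over what period were they posted?
length(siue.tweets)
range(sapply(siue.tweets, function(x) as.character(x$getCreated())))
@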
So far we have some anecdotes and can probably infer that their university had a power outage, is a goose sanctuary, and has a bit of a wasp problem. But what kind of sentiments are associated with SIUE in the aggregate?
\subsection{Loading Lexicons; Tweaking Tweets}
Before we can perform more analysis on the aggregate sentiments within these tweets, we first need to download a sentiment lexicon and load it into R. Then we need to modify the corpus so that we can analyze it.
The sentiment lexicon for this project is taken from Hu and Liu's opinion lexicon. See\\ \url{http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html} for source information. You can download their opinion lexicon from\\ \url{http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar}. The raw text is also included in the \href{https://github.com/woodrad/Twitter-Sentiment-Mining}{GitHub repository} for this project.
<<lexload>>=
# Import sentiment lexicons. If the situation demands,
# add in domain-specific words.
positiveWords <-
scan('Hu and Liu Sentiment Lexicon/positive-words.txt',
what='character', comment.char=';')
negativeWords <-
scan('Hu and Liu Sentiment Lexicon/negative-words.txt',
what='character', comment.char=';')
@
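As the comment above suggests, the lexicons can be extended with domain-specific terms. The words below are purely illustrative (the additions are hypothetical and the chunk is not evaluated, so the stock lexicon is used for the analysis that follows); in practice you would pick terms by reading the corpus first:
<<lexextend,eval=FALSE>>=
# Hypothetical, campus-specific additions; choose your own by inspecting the corpus.
positiveWords <- c(positiveWords, 'cougars')
negativeWords <- c(negativeWords, 'wasps', 'outage')
@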
Now we need to restructure the corpus so we can perform some analysis.
<<tweetply>>=
# Extract the text of each tweet for mining.
tweetText <- lapply(siue.tweets, function(x) x$getText())
# Strip non-ASCII bytes and mark the text as UTF-8.
tweetText <- enc2utf8(gsub("[\U80-\UFF]", "", tweetText, useBytes=TRUE))
@
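It is worth glancing at a cleaned tweet to confirm that the encoding fix did not mangle anything important:
<<tweetcheck>>=
# Inspect the last cleaned tweet as a quick sanity check.
tail(tweetText, 1)
@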
Great. The tweets are ready to be scored.
\subsection{Scoring Tweets}
The function we are using to score tweets is mainly taken from \href{http://www.slideshare.net/jeffreybreen/r-by-example-mining-twitter-for}{``R by example: mining Twitter for consumer
attitudes towards airlines''} by Jeffrey Breen. If you want to see the function, simply look at the R code in the GitHub project or in the Rnw file that comes with this PDF.
<<functionload,echo=FALSE>>=
score.sentiment = function(sentences, positive.words, negative.words, .progress='none')
{
  require(plyr)
  require(stringr)
  # we got a vector of sentences. plyr will handle a list or a vector as an "l" for us
  # we want a simple array of scores back, so we use "l" + "a" + "ply" = laply:
  scores = laply(sentences, function(sentence, positive.words, negative.words) {
    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    # and convert to lower case:
    sentence = tolower(sentence)
    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    # compare our words to the dictionaries of positive & negative terms
    positive.matches = match(words, positive.words)
    negative.matches = match(words, negative.words)
    # match() returns the position of the matched term or NA
    # we just want a TRUE/FALSE:
    positive.matches = !is.na(positive.matches)
    negative.matches = !is.na(negative.matches)
    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(positive.matches) - sum(negative.matches)
    return(score)
  }, positive.words, negative.words, .progress=.progress)
  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}
@
<<scoretweets>>=
tweetScores <- score.sentiment(tweetText, positiveWords, negativeWords)
@
The scoring is complete.
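Before summarizing, it can be instructive to look at the extremes of the distribution, since the most negative and most positive tweets show what the scores are actually picking up. A minimal sketch:
<<scorepeek>>=
# Show the lowest- and highest-scoring tweets in the sample.
tweetScores[which.min(tweetScores$score), ]
tweetScores[which.max(tweetScores$score), ]
@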
\subsection{Summarizing Your Results}
Now we summarize the results of our mining with a humble histogram. I have selected \verb+ggplot2+ for this task, mainly because it is elegant by default and has a function (\verb+qplot+) that is intuitive to native R graphics users.
\newpage
\begin{figure}[htbp]
\centering
<<fig=TRUE>>=
require(ggplot2)
qplot(score, data=tweetScores, geom="histogram", binwidth=1)
@
\caption{Histogram of tweet scores}
\end{figure}
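A few summary statistics complement the histogram:
<<scoresummary>>=
# Numeric summary of the sentiment scores plotted above.
summary(tweetScores$score)
# Proportion of tweets with a strictly positive score.
mean(tweetScores$score > 0)
@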
From exploring the data, we know that SIUE has tense animal-human relations and had a recent power outage and fire alarm. Our summary shows that this sample of tweets about SIUE is (very) slightly positive. Combined with input from subject matter experts, further analysis, and insights from other data, we can start to get a picture of what life is like at SIUE.
\section{Limitations}
Limitations exist in the methodology I have presented. One class of limitations relates to the over-simplified model we implicitly used; the other relates to R itself.
\subsection{Limitations of Methodology}
We used a bag-of-words model when we analyzed the sentiment of the tweets. In the process we disregarded grammar and word order and ranked the tweets strictly based on the words the tweet contained. There are other ways to process the tweets that give more sophistication to the analysis.
Additionally, the function used to score the tweets is completely deaf to irony. For example, the (fabricated) tweet
\begin{quote}
Impressed and amazed by the peerless leadership of SIUE in the domain of mediocrity.
\end{quote}
actually receives a score of +2. While all of these problems have solutions or mitigation techniques, they are beyond the scope of this presentation.
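If you want to verify this yourself, you can score the fabricated tweet directly with the function and lexicons defined earlier. The chunk is not evaluated here because the exact score depends on the lexicon version you downloaded:
<<ironycheck,eval=FALSE>>=
# Score the fabricated example tweet with the same function and lexicons.
ironic <- "Impressed and amazed by the peerless leadership of SIUE in the domain of mediocrity."
score.sentiment(ironic, positiveWords, negativeWords)$score
@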
\subsection{P\emph{R}oblems}
Another source of trouble is R itself. When using R for text mining and text processing, one often feels as if one is reinventing the wheel. Other languages like Python benefit from existing frameworks like \href{http://www.nltk.org/}{NLTK} and \href{http://www.crummy.com/software/BeautifulSoup/}{Beautiful Soup}. That being said, there are some packages that make text mining easier in R; \href{http://rss.acs.unt.edu/Rdoc/library/tm/html/00Index.html}{\verb+tm+} is a good example.
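For instance, \verb+tm+ can turn the cleaned tweet text from above into a corpus and document-term matrix in a few lines. A minimal sketch, assuming the \verb+tm+ package is installed (it is not otherwise used in this document):
<<tmsketch,eval=FALSE>>=
# Build a tm corpus from the cleaned tweets and list frequently occurring terms.
require(tm)
tweetCorpus <- Corpus(VectorSource(tweetText))
tweetDtm <- DocumentTermMatrix(tweetCorpus,
                               control=list(removePunctuation=TRUE, tolower=TRUE))
findFreqTerms(tweetDtm, lowfreq=10)
@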
\section{Credits}
To complete this project, I stood on the shoulders of giants (as is always the case when using R). We have received aid and comfort from the packages \verb+plyr+, \verb+stringr+, \verb+twitteR+, \verb+ggplot2+, and \verb+ROAuth+. For more information on these packages, see their associated help pages and vignettes.
We also used the sentiment lexicons of Minqing Hu and Bing Liu in order to do the scoring for our analysis. In the process of scoring tweets, much inspiration was taken from Jeffrey Breen's presentation. We have all been helped more than we can express by the members of the R community.
\section{License}
``Mining Twitter Using R'' by \href{http://5lbs.org}{Mathew Woodyard} is licensed under a \href{http://creativecommons.org/licenses/by-nc-sa/3.0/us/}{Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States License}. My GitHub project found at\\ \url{https://github.com/woodrad/Twitter-Sentiment-Mining} is under the same license, with the exception of the documentation written by others, which is under the (\href{https://en.wikipedia.org/wiki/Artistic_License}{Artistic}) license listed in the documentation. Permissions beyond the scope of this license may be available at \url{http://5lbs.org}.
\end{document}