This will be the most difficult part of the whole workshop... 😥 Just kidding, obviously it won't. It's as easy as copying and pasting the following code into your notebook:
# General:
import tweepy # To consume Twitter's API
import pandas as pd # To handle data
import numpy as np # For number computing
# For plotting and visualization:
from IPython.display import display
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Excellent! We can now run this cell of code and move on to the next subsection.
In order to extract tweets for later analysis, we need access to a Twitter account and an app. The website to do this is https://apps.twitter.com/. (If you don't know how to do this, you can follow this tutorial video to create an account and an application.)
From the app we create, we will save the following information in a script called credentials.py:
- Consumer Key (API Key)
- Consumer Secret (API Secret)
- Access Token
- Access Token Secret
An example of this script is the following:
# Twitter App access keys for @user
# Consume:
CONSUMER_KEY = ''
CONSUMER_SECRET = ''
# Access:
ACCESS_TOKEN = ''
ACCESS_SECRET = ''
The reason for creating this extra file is that we want to export only the values of these variables while keeping them out of our main code (our notebook). We are now able to consume Twitter's API. To do this, we will create a function that authenticates with our keys. We add this function in another code cell and run it:
# We import our access keys:
from credentials import * # This will allow us to use the keys as variables
# API's setup:
def twitter_setup():
    """
    Utility function to set up the Twitter API
    with our access keys.
    """
    # Authentication and access using keys:
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    # Return API with authentication:
    api = tweepy.API(auth)
    return api
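If you prefer not to keep a credentials.py file in your project, a common alternative is to read the keys from environment variables instead. The sketch below assumes hypothetical variable names such as TWITTER_CONSUMER_KEY; adjust them to whatever you actually export in your shell:

```python
import os

def load_twitter_keys():
    """Read the four Twitter keys from environment variables.

    The variable names below are an assumption for this sketch;
    use whatever names you exported in your own environment.
    """
    return {
        'CONSUMER_KEY': os.environ.get('TWITTER_CONSUMER_KEY', ''),
        'CONSUMER_SECRET': os.environ.get('TWITTER_CONSUMER_SECRET', ''),
        'ACCESS_TOKEN': os.environ.get('TWITTER_ACCESS_TOKEN', ''),
        'ACCESS_SECRET': os.environ.get('TWITTER_ACCESS_SECRET', ''),
    }
```

Either approach keeps the secrets out of the notebook itself, which matters if you share or publish it.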
So far, so easy, right? We're ready to extract tweets in the next section.
Now that we've created a function to set up the Twitter API, we can use it to create an "extractor" object. We will then use Tweepy's extractor.user_timeline(screen_name, count) method to extract the last count tweets from the user screen_name.
As mentioned in the title, I've chosen @realDonaldTrump as the user whose data we'll extract for later analysis. Yeah, we wanna keep it interesting, LOL.
The way to extract Twitter's data is as follows:
# We create an extractor object:
extractor = twitter_setup()
# We create a tweet list as follows:
tweets = extractor.user_timeline(screen_name="realDonaldTrump", count=200)
print("Number of tweets extracted: {}.\n".format(len(tweets)))
# We print the most recent 5 tweets:
print("5 recent tweets:\n")
for tweet in tweets[:5]:
    print(tweet.text)
    print()
With this we will get output similar to the following, and we can compare it against the Twitter account itself (to check that we're being consistent):
Number of tweets extracted: 200.
5 recent tweets:
On behalf of @FLOTUS Melania & myself, THANK YOU for today's update & GREAT WORK! #SouthernBaptist @SendRelief,… https://t.co/4yZCeXCt6n
I will be going to Texas and Louisiana tomorrow with First Lady. Great progress being made! Spending weekend working at White House.
Stock Market up 5 months in a row!
'President Donald J. Trump Proclaims September 3, 2017, as a National Day of Prayer' #HurricaneHarvey #PrayForTexas… https://t.co/tOMfFWwEsN
Texas is healing fast thanks to all of the great men & women who have been working so hard. But still so much to do. Will be back tomorrow!
We now have an extractor and extracted data, stored in the tweets variable. I must mention at this point that each element of that list is a tweet object from Tweepy, and we will learn how to handle this data in the next subsection.
We now have the initial information to construct a pandas DataFrame, which lets us manipulate the data very easily.
IPython's display function renders output in a friendly way, and a dataframe's head method shows its first 5 rows (or the first n rows when n is passed as an argument).
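To see display and head in isolation before we apply them to real tweets, here is a minimal sketch on a toy dataframe (the toy data is made up for illustration):

```python
import pandas as pd

# Toy dataframe standing in for the tweet data we build below:
toy = pd.DataFrame(data=['first', 'second', 'third'], columns=['Tweets'])

# head() returns the first 5 rows by default,
# or the first n rows when an argument is passed:
print(toy.head(2))
```

The same call pattern works unchanged on the real tweets dataframe.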
So, using Python's list comprehension:
# We create a pandas dataframe as follows:
data = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=['Tweets'])
# We display the first 10 elements of the dataframe:
display(data.head(10))
This will create an output similar to this:
Tweets | |
---|---|
0 | On behalf of @FLOTUS Melania & myself, THA... |
1 | I will be going to Texas and Louisiana tomorro... |
2 | Stock Market up 5 months in a row! |
3 | 'President Donald J. Trump Proclaims September... |
4 | Texas is healing fast thanks to all of the gre... |
5 | ...get things done at a record clip. Many big ... |
6 | General John Kelly is doing a great job as Chi... |
7 | Wow, looks like James Comey exonerated Hillary... |
8 | THANK YOU to all of the incredible HEROES in T... |
9 | RT @FoxNews: .@KellyannePolls on Harvey recove... |
So we now have a nice table with ordered data.
An interesting thing is the number of internal methods and attributes that the tweet structure has in Tweepy:
# Internal methods of a single tweet object:
print(dir(tweets[0]))
This outputs the following list of elements:
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_api', '_json', 'author', 'contributors', 'coordinates', 'created_at', 'destroy', 'entities', 'favorite', 'favorite_count', 'favorited', 'geo', 'id', 'id_str', 'in_reply_to_screen_name', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'is_quote_status', 'lang', 'parse', 'parse_list', 'place', 'possibly_sensitive', 'retweet', 'retweet_count', 'retweeted', 'retweets', 'source', 'source_url', 'text', 'truncated', 'user']
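Many of the names above are Python's own dunder methods rather than tweet data. To list only the public attributes, you can filter out anything starting with an underscore. Since this is plain Python and needs no API access, the sketch below uses a made-up stand-in object instead of a real Tweepy tweet:

```python
from collections import namedtuple

# Hypothetical stand-in for a Tweepy tweet object:
FakeTweet = namedtuple('FakeTweet', ['text', 'id', 'created_at'])
fake_tweet = FakeTweet(text='hello', id=1, created_at=None)

# Keep only public names (no leading underscore):
public = [name for name in dir(fake_tweet) if not name.startswith('_')]
print(public)
```

Applying the same comprehension to a real tweets[0] gives the shorter, data-only portion of the list above.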
The interesting part here is the quantity of metadata contained in a single tweet. If we want data such as the creation date or the source of the tweet, we can access it through these attributes. An example is the following:
# We print info from the first tweet:
print(tweets[0].id)
print(tweets[0].created_at)
print(tweets[0].source)
print(tweets[0].favorite_count)
print(tweets[0].retweet_count)
print(tweets[0].geo)
print(tweets[0].coordinates)
print(tweets[0].entities)
Obtaining an output like this:
903778130850131970
2017-09-02 00:34:32
Twitter for iPhone
24572
5585
None
None
{'hashtags': [{'text': 'SouthernBaptist', 'indices': [90, 106]}], 'symbols': [], 'user_mentions': [{'screen_name': 'FLOTUS', 'name': 'Melania Trump', 'id': 818876014390603776, 'id_str': '818876014390603776', 'indices': [13, 20]}, {'screen_name': 'sendrelief', 'name': 'Send Relief', 'id': 3228928584, 'id_str': '3228928584', 'indices': [107, 118]}], 'urls': [{'url': 'https://t.co/4yZCeXCt6n', 'expanded_url': 'https://twitter.com/i/web/status/903778130850131970', 'display_url': 'twitter.com/i/web/status/9…', 'indices': [121, 144]}]}
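The entities field is a plain Python dictionary, so hashtags and mentions can be pulled out with ordinary indexing and comprehensions. Here is a sketch on a trimmed-down copy of the dictionary shown above:

```python
# Trimmed sample of a tweet's entities dictionary:
entities = {
    'hashtags': [{'text': 'SouthernBaptist', 'indices': [90, 106]}],
    'user_mentions': [{'screen_name': 'FLOTUS'},
                      {'screen_name': 'sendrelief'}],
}

# Extract just the hashtag texts and mentioned screen names:
hashtags = [h['text'] for h in entities['hashtags']]
mentions = [m['screen_name'] for m in entities['user_mentions']]
print(hashtags)
print(mentions)
```

On a real tweet you would read tweets[0].entities instead of the hard-coded sample.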
We're now able to order the relevant data and add it to our dataframe.
As we can see, we can obtain a lot of data from a single tweet, but not all of it is useful for every purpose. In our case we will just add some of it to our dataframe. For this we use Python's list comprehensions: a new column is added to the dataframe simply by naming it between square brackets and assigning the content. The code goes as follows:
# We add relevant data:
data['len'] = np.array([len(tweet.text) for tweet in tweets])
data['ID'] = np.array([tweet.id for tweet in tweets])
data['Date'] = np.array([tweet.created_at for tweet in tweets])
data['Source'] = np.array([tweet.source for tweet in tweets])
data['Likes'] = np.array([tweet.favorite_count for tweet in tweets])
data['RTs'] = np.array([tweet.retweet_count for tweet in tweets])
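The same column-adding pattern can be tried offline without API access. The sketch below uses made-up mock tweet objects (just two of the attributes, for brevity) to show how each comprehension becomes a column:

```python
from collections import namedtuple

import numpy as np
import pandas as pd

# Hypothetical mock tweets standing in for Tweepy objects:
MockTweet = namedtuple('MockTweet', ['text', 'favorite_count'])
mock_tweets = [MockTweet('Hello world', 10),
               MockTweet('Second tweet here', 25)]

demo = pd.DataFrame(data=[t.text for t in mock_tweets], columns=['Tweets'])

# Each new column is assigned by name between square brackets:
demo['len'] = np.array([len(t.text) for t in mock_tweets])
demo['Likes'] = np.array([t.favorite_count for t in mock_tweets])
print(demo)
```

Swapping the mock list for the real tweets variable gives exactly the dataframe built above.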
And to display the dataframe again and see the changes, we just run:
# Display of first 10 elements from dataframe:
display(data.head(10))
Tweets | len | ID | Date | Source | Likes | RTs | |
---|---|---|---|---|---|---|---|
0 | On behalf of @FLOTUS Melania & myself, THA... | 144 | 903778130850131970 | 2017-09-02 00:34:32 | Twitter for iPhone | 24572 | 5585 |
1 | I will be going to Texas and Louisiana tomorro... | 132 | 903770196388831233 | 2017-09-02 00:03:00 | Twitter for iPhone | 44748 | 8825 |
2 | Stock Market up 5 months in a row! | 34 | 903766326631698432 | 2017-09-01 23:47:38 | Twitter for iPhone | 44518 | 9134 |
3 | 'President Donald J. Trump Proclaims September... | 140 | 903705867891204096 | 2017-09-01 19:47:23 | Media Studio | 47009 | 15127 |
4 | Texas is healing fast thanks to all of the gre... | 143 | 903603043714957312 | 2017-09-01 12:58:48 | Twitter for iPhone | 77680 | 15398 |
5 | ...get things done at a record clip. Many big ... | 113 | 903600265420578819 | 2017-09-01 12:47:46 | Twitter for iPhone | 54664 | 11424 |
6 | General John Kelly is doing a great job as Chi... | 140 | 903597166249246720 | 2017-09-01 12:35:27 | Twitter for iPhone | 59840 | 11678 |
7 | Wow, looks like James Comey exonerated Hillary... | 130 | 903587428488839170 | 2017-09-01 11:56:45 | Twitter for iPhone | 110667 | 35936 |
8 | THANK YOU to all of the incredible HEROES in T... | 110 | 903348312421670912 | 2017-08-31 20:06:35 | Twitter for iPhone | 112012 | 29064 |
9 | RT @FoxNews: .@KellyannePolls on Harvey recove... | 140 | 903234878124249090 | 2017-08-31 12:35:50 | Twitter for iPhone | 0 | 6638 |
Now that we have extracted the data and have it in an easy-to-handle, ordered form, we're ready to do a bit more manipulation to visualize some plots and gather statistical data. The first part of the workshop is done.
Go back to 0. Prerequisite: What will we need?
Go next to 2. Visualization and basic statistics