01-extracting-data.md


1. Extracting Twitter data (tweepy + pandas)

1.1. Importing our libraries

This will be the most difficult part of the whole workshop... 😥 Just kidding, obviously it won't be. It's as easy as copying and pasting the following code into your notebook:

# General:
import tweepy           # To consume Twitter's API
import pandas as pd     # To handle data
import numpy as np      # For number computing

# For plotting and visualization:
from IPython.display import display
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Excellent! We can now just run this cell of code and go to the next subsection.

1.2. Creating a Twitter App

In order to extract tweets for later analysis, we need to log in to our Twitter account and create an app. The website to do this is https://apps.twitter.com/. (If you don't know how to do this, you can follow this tutorial video to create an account and an application.)

From the app we just created, we will save the following information in a script called credentials.py:

  • Consumer Key (API Key)
  • Consumer Secret (API Secret)
  • Access Token
  • Access Token Secret

An example of this script is the following:

# Twitter App access keys for @user

# Consume:
CONSUMER_KEY    = ''
CONSUMER_SECRET = ''

# Access:
ACCESS_TOKEN  = ''
ACCESS_SECRET = ''
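As an aside (not part of the original workshop), a common alternative is to avoid a keys file entirely and read the keys from environment variables instead. A minimal sketch using only the standard library; the `TWITTER_*` variable names are an assumption, so match them to whatever you export in your shell:

```python
import os

# Read keys from the environment; falls back to an empty string if unset.
# The TWITTER_* names below are illustrative, not required by Tweepy:
CONSUMER_KEY    = os.environ.get("TWITTER_CONSUMER_KEY", "")
CONSUMER_SECRET = os.environ.get("TWITTER_CONSUMER_SECRET", "")
ACCESS_TOKEN    = os.environ.get("TWITTER_ACCESS_TOKEN", "")
ACCESS_SECRET   = os.environ.get("TWITTER_ACCESS_SECRET", "")
```

Either way, remember to keep credentials.py (or your shell profile) out of version control.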

The reason for creating this extra file is that we want to export only the values of these variables while keeping them out of our main code (our notebook). We are now able to consume Twitter's API. To do this, we will create a function that handles our key authentication. We will add this function in another code cell and run it:

# We import our access keys:
from credentials import *    # This will allow us to use the keys as variables

# API's setup:
def twitter_setup():
    """
    Utility function to set up the Twitter API
    with our access keys.
    """
    # Authentication and access using keys:
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)

    # Return API with authentication:
    api = tweepy.API(auth)
    return api

So far, so easy, right? We're ready to extract tweets in the next section.

1.3. Tweets extraction

Now that we've created a function to set up the Twitter API, we can use it to create an "extractor" object. After this, we will use Tweepy's method extractor.user_timeline(screen_name, count) to extract the count most recent tweets from the user screen_name.

As mentioned in the title, I've chosen @realDonaldTrump as the user whose data we'll extract for later analysis. Yeah, we wanna keep it interesting, LOL.

The way to extract Twitter's data is as follows:

# We create an extractor object:
extractor = twitter_setup()

# We create a tweet list as follows:
tweets = extractor.user_timeline(screen_name="realDonaldTrump", count=200)
print("Number of tweets extracted: {}.\n".format(len(tweets)))

# We print the most recent 5 tweets:
print("5 recent tweets:\n")
for tweet in tweets[:5]:
    print(tweet.text)
    print()

With this we will have an output similar to the following, which we can compare against the Twitter account itself (to check that we're being consistent):

Number of tweets extracted: 200.

5 recent tweets:

On behalf of @FLOTUS Melania & myself, THANK YOU for today's update & GREAT WORK! #SouthernBaptist @SendRelief,… https://t.co/4yZCeXCt6n

I will be going to Texas and Louisiana tomorrow with First Lady. Great progress being made! Spending weekend working at White House.

Stock Market up 5 months in a row!

'President Donald J. Trump Proclaims September 3, 2017, as a National Day of Prayer' #HurricaneHarvey #PrayForTexas… https://t.co/tOMfFWwEsN

Texas is healing fast thanks to all of the great men & women who have been working so hard. But still so much to do. Will be back tomorrow!

We now have an extractor and extracted data, stored in the tweets variable. I should mention at this point that each element of that list is a tweet object from Tweepy; we will learn how to handle this data in the next subsection.

1.4. Creating a (pandas) DataFrame

We now have the initial information to construct a pandas DataFrame, which will let us manipulate the data in a very easy way.

IPython's display function renders an output in a friendly way, and the head method of a dataframe allows us to visualize the first 5 elements of the dataframe (or the first n elements, where n is passed as an argument).

So, using Python's list comprehension:

# We create a pandas dataframe as follows:
data = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=['Tweets'])

# We display the first 10 elements of the dataframe:
display(data.head(10))

This will create an output similar to this:

|   | Tweets |
|---|--------|
| 0 | On behalf of @FLOTUS Melania & myself, THA... |
| 1 | I will be going to Texas and Louisiana tomorro... |
| 2 | Stock Market up 5 months in a row! |
| 3 | 'President Donald J. Trump Proclaims September... |
| 4 | Texas is healing fast thanks to all of the gre... |
| 5 | ...get things done at a record clip. Many big ... |
| 6 | General John Kelly is doing a great job as Chi... |
| 7 | Wow, looks like James Comey exonerated Hillary... |
| 8 | THANK YOU to all of the incredible HEROES in T... |
| 9 | RT @FoxNews: .@KellyannePolls on Harvey recove... |

So we now have a nice table with ordered data.
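The list-comprehension-to-DataFrame pattern itself doesn't depend on Tweepy at all. If you want to try it without API access, here is a self-contained sketch using a stand-in class and made-up tweet texts (everything below is placeholder data, not real tweets):

```python
import pandas as pd

class FakeTweet:
    """Minimal stand-in for a Tweepy tweet object: only the .text attribute."""
    def __init__(self, text):
        self.text = text

tweets = [FakeTweet("First tweet"), FakeTweet("Second tweet"), FakeTweet("Third tweet")]

# Same construction as in the notebook:
data = pd.DataFrame(data=[tweet.text for tweet in tweets], columns=['Tweets'])
print(data.head(10))
```

Any object exposing a .text attribute works, which is exactly why the real Tweepy tweet objects slot into the same comprehension.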

An interesting thing is the number of internal attributes and methods that the tweet structure has in Tweepy:

# Internal methods of a single tweet object:
print(dir(tweets[0]))

This outputs the following list of elements: ['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_api', '_json', 'author', 'contributors', 'coordinates', 'created_at', 'destroy', 'entities', 'favorite', 'favorite_count', 'favorited', 'geo', 'id', 'id_str', 'in_reply_to_screen_name', 'in_reply_to_status_id', 'in_reply_to_status_id_str', 'in_reply_to_user_id', 'in_reply_to_user_id_str', 'is_quote_status', 'lang', 'parse', 'parse_list', 'place', 'possibly_sensitive', 'retweet', 'retweet_count', 'retweeted', 'retweets', 'source', 'source_url', 'text', 'truncated', 'user']

The interesting part here is the quantity of metadata contained in a single tweet. If we want to obtain data such as the creation date or the source device, we can access that info with these attributes. An example is the following:

# We print info from the first tweet:
print(tweets[0].id)
print(tweets[0].created_at)
print(tweets[0].source)
print(tweets[0].favorite_count)
print(tweets[0].retweet_count)
print(tweets[0].geo)
print(tweets[0].coordinates)
print(tweets[0].entities)

Obtaining an output like this:

903778130850131970
2017-09-02 00:34:32
Twitter for iPhone
24572
5585
None
None
{'hashtags': [{'text': 'SouthernBaptist', 'indices': [90, 106]}], 'symbols': [], 'user_mentions': [{'screen_name': 'FLOTUS', 'name': 'Melania Trump', 'id': 818876014390603776, 'id_str': '818876014390603776', 'indices': [13, 20]}, {'screen_name': 'sendrelief', 'name': 'Send Relief', 'id': 3228928584, 'id_str': '3228928584', 'indices': [107, 118]}], 'urls': [{'url': 'https://t.co/4yZCeXCt6n', 'expanded_url': 'https://twitter.com/i/web/status/903778130850131970', 'display_url': 'twitter.com/i/web/status/9…', 'indices': [121, 144]}]}

We're now able to order the relevant data and add it to our dataframe.
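Since entities is just a nested dictionary, hashtags and mentions can be pulled out with plain Python. A small sketch over a dictionary shaped like the output above (the values here are illustrative, copied from that sample, not fetched live):

```python
# An entities dict with the same shape Tweepy returns (illustrative values):
entities = {
    'hashtags': [{'text': 'SouthernBaptist', 'indices': [90, 106]}],
    'symbols': [],
    'user_mentions': [{'screen_name': 'FLOTUS', 'name': 'Melania Trump'},
                      {'screen_name': 'sendrelief', 'name': 'Send Relief'}],
    'urls': [],
}

# Each sub-list holds dicts, so list comprehensions extract the fields we want:
hashtags = [h['text'] for h in entities['hashtags']]
mentions = [m['screen_name'] for m in entities['user_mentions']]

print(hashtags)  # ['SouthernBaptist']
print(mentions)  # ['FLOTUS', 'sendrelief']
```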

1.5. Adding relevant info to our dataframe

As we can see, we can obtain a lot of data from a single tweet, but not all of it is useful for every purpose. In our case, we will just add some of it to our dataframe. For this we will use Python's list comprehensions; a new column is added to the dataframe by simply writing the column name between square brackets and assigning the content. The code goes as follows:

# We add relevant data:
data['len']  = np.array([len(tweet.text) for tweet in tweets])
data['ID']   = np.array([tweet.id for tweet in tweets])
data['Date'] = np.array([tweet.created_at for tweet in tweets])
data['Source'] = np.array([tweet.source for tweet in tweets])
data['Likes']  = np.array([tweet.favorite_count for tweet in tweets])
data['RTs']    = np.array([tweet.retweet_count for tweet in tweets])

And to display the dataframe again and see the changes, we just...:

# Display of first 10 elements from dataframe:
display(data.head(10))

|   | Tweets | len | ID | Date | Source | Likes | RTs |
|---|--------|-----|----|------|--------|-------|-----|
| 0 | On behalf of @FLOTUS Melania & myself, THA... | 144 | 903778130850131970 | 2017-09-02 00:34:32 | Twitter for iPhone | 24572 | 5585 |
| 1 | I will be going to Texas and Louisiana tomorro... | 132 | 903770196388831233 | 2017-09-02 00:03:00 | Twitter for iPhone | 44748 | 8825 |
| 2 | Stock Market up 5 months in a row! | 34 | 903766326631698432 | 2017-09-01 23:47:38 | Twitter for iPhone | 44518 | 9134 |
| 3 | 'President Donald J. Trump Proclaims September... | 140 | 903705867891204096 | 2017-09-01 19:47:23 | Media Studio | 47009 | 15127 |
| 4 | Texas is healing fast thanks to all of the gre... | 143 | 903603043714957312 | 2017-09-01 12:58:48 | Twitter for iPhone | 77680 | 15398 |
| 5 | ...get things done at a record clip. Many big ... | 113 | 903600265420578819 | 2017-09-01 12:47:46 | Twitter for iPhone | 54664 | 11424 |
| 6 | General John Kelly is doing a great job as Chi... | 140 | 903597166249246720 | 2017-09-01 12:35:27 | Twitter for iPhone | 59840 | 11678 |
| 7 | Wow, looks like James Comey exonerated Hillary... | 130 | 903587428488839170 | 2017-09-01 11:56:45 | Twitter for iPhone | 110667 | 35936 |
| 8 | THANK YOU to all of the incredible HEROES in T... | 110 | 903348312421670912 | 2017-08-31 20:06:35 | Twitter for iPhone | 112012 | 29064 |
| 9 | RT @FoxNews: .@KellyannePolls on Harvey recove... | 140 | 903234878124249090 | 2017-08-31 12:35:50 | Twitter for iPhone | 0 | 6638 |
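If you want to verify the column-adding pattern without hitting the API, the same code runs unchanged on stand-in objects. A self-contained sketch where the class, texts, dates, and counts are all mock data invented for illustration:

```python
import datetime
import numpy as np
import pandas as pd

class FakeTweet:
    """Stand-in exposing the same attributes we read from Tweepy tweets."""
    def __init__(self, text, tweet_id, created_at, source, favorite_count, retweet_count):
        self.text = text
        self.id = tweet_id
        self.created_at = created_at
        self.source = source
        self.favorite_count = favorite_count
        self.retweet_count = retweet_count

tweets = [
    FakeTweet("Hello!", 1, datetime.datetime(2017, 9, 1), "Twitter for iPhone", 10, 2),
    FakeTweet("Another tweet.", 2, datetime.datetime(2017, 9, 2), "Media Studio", 25, 7),
]

# Same construction and column assignments as in the notebook:
data = pd.DataFrame(data=[t.text for t in tweets], columns=['Tweets'])
data['len']  = np.array([len(t.text) for t in tweets])
data['ID']   = np.array([t.id for t in tweets])
data['Date'] = np.array([t.created_at for t in tweets])
data['Source'] = np.array([t.source for t in tweets])
data['Likes']  = np.array([t.favorite_count for t in tweets])
data['RTs']    = np.array([t.retweet_count for t in tweets])
```

Each bracket assignment appends one column, so the resulting dataframe has the same seven columns as the table above.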

Now that we have the extracted data in an easy-to-handle, ordered form, we're ready to do a bit more manipulation to visualize some plots and gather some statistics. The first part of the workshop is done.

Go back to 0. Prerequisite: What will we need?
Go next to 2. Visualization and basic statistics