Timeline of Mr. Narendra Modi, Prime Minister of India
Timeline of Mr. Donald Trump, President of the USA
Note: Please use at your own discretion. I used this code to pull data for two Twitter handles for research purposes. Twitter provides an API to download tweets.
Download User tweets
This code can be used to download a user's tweets from Twitter.com. It can help to bypass the 3200-tweet limit imposed by the Twitter API. The code is provided as a Jupyter notebook and as a Python file.
Requirements
- requests
- tweepy
- selenium
- pandas
- Install the necessary dependencies
- Create a folder named after the Twitter username, or any other suitable name
- Copy all files into the same directory
- Make changes to the config.py file
- Run the program - download_tweets_user.ipynb or download_tweets_user.py
What happens when the program is run?
The code first uses the API to fetch the most recent 3200 tweets, then uses Selenium to fetch any older tweets, distributing the work across the workers by date.
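One way to picture the date-based distribution is to split the remaining date range into contiguous chunks, one per worker. The sketch below is only an illustration under assumptions: `split_date_range`, `search_query`, and `example_user` are hypothetical names, and the query string uses Twitter's `from:`, `since:`, and `until:` search operators, which may not match what the program actually sends.

```python
from datetime import date, timedelta


def split_date_range(start, end, num_workers):
    """Split [start, end) into at most num_workers contiguous (since, until) chunks."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    total_days = (end - start).days
    if total_days <= 0:
        return []
    # Ceiling division so that the chunks cover the whole range.
    chunk_days = -(-total_days // num_workers)
    chunks = []
    cursor = start
    while cursor < end:
        chunk_end = min(cursor + timedelta(days=chunk_days), end)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end
    return chunks


def search_query(screen_name, since, until):
    """Build a search query for one chunk (hypothetical helper)."""
    return f"from:{screen_name} since:{since.isoformat()} until:{until.isoformat()}"


# Ten days split across three workers.
chunks = split_date_range(date(2020, 1, 1), date(2020, 1, 11), 3)
queries = [search_query("example_user", s, u) for s, u in chunks]
```

Each worker would then load the search results for its own chunk, so no two browsers scroll through the same dates.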
- Tweet text - denoted by text
- Number of replies to the tweet - denoted by replies_count
- Number of retweets to the tweet - denoted by retweet_count
- Number of times this tweet has been favorited - denoted by favorite_count
- Url of the tweet - denoted by tweet_url
- Creation date/time of the tweet - denoted by created_date
- If a video was attached to the tweet, what is the url - denoted by video_url
- If a video was attached to the tweet, how many times it is viewed - denoted by video_views
- The twitter username - denoted by screen_name
- The language of the tweet - denoted by language
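To make the field names above concrete, here is a minimal sketch that writes one record with those columns as CSV. It uses the standard `csv` module as a stand-in; the actual output format of the program and the values in `row` are assumptions made for illustration, not real data.

```python
import csv
import io

# Column names as listed above.
FIELDS = [
    "text", "replies_count", "retweet_count", "favorite_count", "tweet_url",
    "created_date", "video_url", "video_views", "screen_name", "language",
]

# Placeholder row for illustration only, not real data.
row = {
    "text": "Hello world",
    "replies_count": 1,
    "retweet_count": 2,
    "favorite_count": 3,
    "tweet_url": "https://twitter.com/example_user/status/1",
    "created_date": "2020-01-01 12:00:00",
    "video_url": None,   # no video attached; written as an empty cell
    "video_views": None,
    "screen_name": "example_user",
    "language": "en",
}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerow(row)
header_line = buffer.getvalue().splitlines()[0]
```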
Both Chrome and Firefox can be used for the Selenium part of the download. Adding more threads can make the process faster, but can also cause issues such as throttling by Twitter or too many browsers consuming a lot of RAM. The number of threads needs to be tuned for the workload, as explained below.
Variable Name | Description | Default value |
---|---|---|
DATE_IN_PAST | Download tweets until this date in the past, if available. This date should be no earlier than the creation date of the account; if it is earlier, the creation date is used | 01-01-2020 |
DAYS_IN_PAST | Download tweets until this many days in the past. This is similar to DATE_IN_PAST. The earlier of DATE_IN_PAST and DAYS_IN_PAST is used | 5 |
NUM_TWEETS_TO_DOWNLOAD | Number of tweets to download for the user. The earlier of NUM_TWEETS_TO_DOWNLOAD, DATE_IN_PAST and DAYS_IN_PAST is used when there is a conflict. If this value is less than 3200, tweets are downloaded with the API only. Keep this value above 3200 to download all available tweets | 100 |
OUTPUT_FILE_NAME_SUFFIX | Add any suffix name to the file | None |
TIME_SLEEP | Time to sleep between page loads in Selenium. This helps avoid detection by the server and the resulting throttling of connection requests. Choose this value with the user's total number of tweets and the acceptable download time in mind | 5 |
TIME_SLEEP_BROWSER_CLOSE | The selenium browser is closed and opened to delete any possible cookies. Other details as above | 2 |
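The interaction between DATE_IN_PAST, DAYS_IN_PAST, and the account creation date can be sketched as follows. This is a hedged illustration, not the program's actual logic: `resolve_cutoff` is a hypothetical helper, the `DD-MM-YYYY` format is an assumption, and "earlier" is read literally as the earlier calendar date. If the program instead stops at whichever limit is reached first, `min` would become `max`.

```python
from datetime import date, datetime, timedelta


def resolve_cutoff(date_in_past, days_in_past, account_created, today):
    """Return the oldest date to download tweets for."""
    # Assumption: DATE_IN_PAST is written as DD-MM-YYYY.
    from_date = datetime.strptime(date_in_past, "%d-%m-%Y").date()
    from_days = today - timedelta(days=days_in_past)
    # Assumption: "earlier" means the earlier calendar date.
    cutoff = min(from_date, from_days)
    # Never go back further than the creation date of the account.
    return max(cutoff, account_created)


# Default settings from the table, with a made-up account and run date.
cutoff = resolve_cutoff("01-01-2020", 5, date(2019, 6, 1), today=date(2020, 7, 1))
```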
Variable Name | Description | Default value |
---|---|---|
TWITTER_USER_NAME | The twitter username without quotes | |
CONSUMER_KEY | The consumer key of the Twitter developer API | |
CONSUMER_SECRET | The consumer secret of the Twitter developer API | |
ACCESS_TOKEN | The access token of the Twitter developer API | |
ACCESS_TOKEN_SECRET | The access token secret of the Twitter developer API | |
Chrome
Variable Name | Description | Default value |
---|---|---|
CHROME_GECKODRIVER_LOCATION | The location of already downloaded chromedriver from https://chromedriver.chromium.org/downloads, else it is downloaded from the web based on the operating system | None |
USE_CHROME | Use chrome to download the tweets via selenium (bool) | 0 |
NUM_THREADS_CHROME | Number of threads to use. Each thread will have its own Chrome browser. This should depend on the number of tweets to download, the urgency, and the capacity of the system. If USE_CHROME is True and NUM_THREADS_CHROME is 0, NUM_THREADS_CHROME defaults to 1 | 1 |
linux64 | Edit the URL if your system is a linux based system | https://chromedriver.storage.googleapis.com/83.0.4103.14/chromedriver_linux64.zip |
windows | Edit the URL if your system is a windows based system | https://chromedriver.storage.googleapis.com/83.0.4103.14/chromedriver_win32.zip |
macos | Edit the URL if your system is a mac based system | https://chromedriver.storage.googleapis.com/83.0.4103.14/chromedriver_mac64.zip |
Firefox
Variable Name | Description | Default value |
---|---|---|
FIREFOX_GECKODRIVER_LOCATION | The location of already downloaded geckodriver from https://github.com/mozilla/geckodriver/releases, else it is downloaded from the web based on the operating system | None |
USE_FIREFOX | Use firefox to download the tweets via selenium (bool) | 1 |
NUM_THREADS_FIREFOX | Number of threads to use. Each thread will have its own Firefox browser. This should depend on the number of tweets to download, the urgency, and the capacity of the system. If USE_FIREFOX is True and NUM_THREADS_FIREFOX is 0, NUM_THREADS_FIREFOX defaults to 1 | 1 |
macos | Edit the URL if your system is a mac based system | https://github.com/mozilla/geckodriver/releases/download/v0.26.0/geckodriver-v0.26.0-macos.tar.gz |
linux32 | Edit the URL if your system is a linux 32 bit based system | https://github.com/mozilla/geckodriver/releases/download/v0.26.0/geckodriver-v0.26.0-linux32.tar.gz |
linux64 | Edit the URL if your system is a linux 64 bit based system | https://github.com/mozilla/geckodriver/releases/download/v0.26.0/geckodriver-v0.26.0-linux64.tar.gz |
windows32 | Edit the URL if your system is a windows 32 bit based system | https://github.com/mozilla/geckodriver/releases/download/v0.26.0/geckodriver-v0.26.0-win32.zip |
windows64 | Edit the URL if your system is a windows 64 bit based system | https://github.com/mozilla/geckodriver/releases/download/v0.26.0/geckodriver-v0.26.0-win64.zip |
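The thread-count rule stated for both browsers (an enabled browser with 0 threads falls back to 1) can be expressed in a few lines. `effective_threads` is a hypothetical helper written for illustration, and returning 0 for a disabled browser is an assumption.

```python
def effective_threads(use_browser, num_threads):
    """Return how many browser threads to start for one browser type."""
    if not use_browser:
        # Assumption: a disabled browser starts no threads at all.
        return 0
    # Documented rule: an enabled browser with 0 threads falls back to 1.
    return max(1, num_threads)


# Defaults from the tables: Chrome disabled, Firefox enabled with 1 thread.
chrome_threads = effective_threads(use_browser=False, num_threads=1)
firefox_threads = effective_threads(use_browser=True, num_threads=1)
total_browsers = chrome_threads + firefox_threads
```

Since each thread opens its own browser, `total_browsers` is a rough guide to how much RAM a run will need.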
Discuss
Feel free to post any questions, comments, or bugs. The Twitter UI changes from time to time, so the Selenium part might break.
Regular Syntax: https://twitter.com/search-advanced
Advanced Syntax: https://help.twitter.com/en/using-twitter/advanced-tweetdeck-features
API Reference: http://docs.tweepy.org/en/latest/api.html
Cursor Tutorial: http://docs.tweepy.org/en/latest/cursor_tutorial.html