user-tweet-download

Timeline of Mr. Narendra Modi, Prime Minister of India

Timeline of Mr. Donald Trump, President of the USA

Note: Please use at your own discretion. I used this code to pull data for two Twitter handles for research purposes. Twitter provides an API to download tweets.

Download User tweets

This code can be used to download a user's tweets from Twitter.com. It can help bypass the 3200-tweet limit imposed by the Twitter API. The code is provided as a Jupyter notebook and as a Python file.

Requirements

requests
tweepy
selenium
pandas

How to Run?

Install the necessary dependencies
Create a folder named after the Twitter username, or any other suitable name
Copy all files into the same directory
Edit config.py to set the variables described in the sections below
Run the program - download_tweets_user.ipynb or download_tweets_user.py

What happens when the program is run?

The code first uses the API to fetch the most recent 3200 tweets, then uses Selenium to fetch any older tweets, splitting the remaining date range across the worker threads.
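
The date-based distribution described above can be pictured with a small sketch. The repository's actual splitting logic is not shown in this README, so the helper name `split_date_range` and the even-split strategy are assumptions made purely for illustration.

```python
from datetime import date, timedelta


def split_date_range(start, end, num_workers):
    """Split the inclusive range [start, end] into contiguous chunks,
    one per worker (hypothetical helper, not taken from the repository)."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    if start > end:
        raise ValueError("start must not be after end")
    total_days = (end - start).days + 1
    # Never create more chunks than there are days to cover.
    workers = min(num_workers, total_days)
    base, extra = divmod(total_days, workers)
    chunks = []
    cursor = start
    for i in range(workers):
        # The first `extra` chunks take one additional day each.
        length = base + (1 if i < extra else 0)
        chunk_end = cursor + timedelta(days=length - 1)
        chunks.append((cursor, chunk_end))
        cursor = chunk_end + timedelta(days=1)
    return chunks


chunks = split_date_range(date(2020, 1, 1), date(2020, 1, 10), 3)
for since, until in chunks:
    print(since.isoformat(), until.isoformat())
```

Each `(since, until)` pair could then be handed to one browser thread, so that no two threads search the same dates.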

What datapoints are provided?

Tweet text - denoted by text

Number of replies to the tweet - denoted by replies_count

Number of retweets of the tweet - denoted by retweet_count

Number of times this tweet has been favorited - denoted by favorite_count

URL of the tweet - denoted by tweet_url

Creation date/time of the tweet - denoted by created_date

If a video was attached to the tweet, what is the url - denoted by video_url

If a video was attached to the tweet, how many times it is viewed - denoted by video_views

The twitter username - denoted by screen_name

The language of the tweet - denoted by language
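
To make the shape of one output row concrete, here is a minimal sketch of a record holding the fields listed above. The field names come from this README; the Python types, the `TweetRecord` class, the `to_csv` helper, and the sample values are assumptions for illustration only and do not describe the repository's actual output format.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class TweetRecord:
    # Field names follow the README; the types are assumed.
    text: str
    replies_count: int
    retweet_count: int
    favorite_count: int
    tweet_url: str
    created_date: str
    video_url: Optional[str]    # None when no video is attached
    video_views: Optional[int]  # None when no video is attached
    screen_name: str
    language: str


def to_csv(records):
    """Serialize records to CSV text with one column per field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(TweetRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Made-up sample values, not real data.
example = TweetRecord(
    text="Hello",
    replies_count=1,
    retweet_count=2,
    favorite_count=3,
    tweet_url="https://twitter.com/example_user/status/1",
    created_date="2020-01-01 10:00:00",
    video_url=None,
    video_views=None,
    screen_name="example_user",
    language="en",
)
csv_text = to_csv([example])
print(csv_text)
```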

Tradeoffs

Both Chrome and Firefox can be used for the Selenium part of the download. Adding more threads can make the process faster, but can also cause problems such as throttling by Twitter or too many browsers consuming a lot of RAM. The number of threads needs to be tuned to the workload, as explained below.
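
A back-of-the-envelope estimate can help when choosing the number of threads. The sketch below is only a rough model: it counts sleep time alone (so it is a lower bound on run time), and the page count and per-browser memory figure are hypothetical inputs that you would need to measure on your own system. The function name `estimate_run` is likewise an assumption.

```python
import math


def estimate_run(pages, time_sleep, num_threads, mb_per_browser):
    """Rough lower bound on run time and approximate RAM use.

    pages           -- hypothetical number of page loads needed in total
    time_sleep      -- seconds slept between page loads (TIME_SLEEP)
    num_threads     -- number of browser threads
    mb_per_browser  -- assumed memory use of one browser, in MB
    """
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    # Pages are spread across threads; the slowest thread sets the run time.
    pages_per_thread = math.ceil(pages / num_threads)
    return {
        "min_seconds": pages_per_thread * time_sleep,
        "approx_ram_mb": num_threads * mb_per_browser,
    }


for threads in (1, 2, 4):
    print(threads, estimate_run(1000, 5, threads, 500))
```

Doubling the threads roughly halves the sleep time but doubles the memory footprint, and more parallel requests also make throttling more likely.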

Default

| Variable Name | Description | Default value |
| --- | --- | --- |
| DATE_IN_PAST | Download tweets back to this date, if available. This date should be no earlier than the creation date of the account; if it is earlier, the creation date is used instead | 01-01-2020 |
| DAYS_IN_PAST | Download tweets going back this many days. This is similar to DATE_IN_PAST. The earlier of DATE_IN_PAST and DAYS_IN_PAST is used | 5 |
| NUM_TWEETS_TO_DOWNLOAD | Number of tweets to download for the user. The earlier of NUM_TWEETS_TO_DOWNLOAD, DATE_IN_PAST and DAYS_IN_PAST is used when there is a conflict. If this value is less than 3200, tweets are downloaded with the API only. Keep this value above 3200 to download all available tweets | 100 |
| OUTPUT_FILE_NAME_SUFFIX | Suffix to add to the output file name | None |
| TIME_SLEEP | Time to sleep between page loads in Selenium. This is to avoid detection by the server and the resulting throttling of connection requests. Ideally, choose this value with the user's total number of tweets and the time available for the download in mind | 5 |
| TIME_SLEEP_BROWSER_CLOSE | The Selenium browser is closed and reopened to delete any cookies. Other details as for TIME_SLEEP | 2 |
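
The two date settings can be illustrated with a short sketch. It covers only what the table states directly: DAYS_IN_PAST converts to a date counted back from today, and DATE_IN_PAST falls back to the account creation date when it is earlier than that date. The `%d-%m-%Y` format is an assumption (the default `01-01-2020` does not disambiguate day and month), and the helper names are hypothetical. How the repository combines the two limits with NUM_TWEETS_TO_DOWNLOAD is not shown here.

```python
from datetime import date, datetime, timedelta

DATE_FORMAT = "%d-%m-%Y"  # assumption: day-month-year


def parse_date_in_past(value):
    """Parse a DATE_IN_PAST string such as '01-01-2020'."""
    return datetime.strptime(value, DATE_FORMAT).date()


def days_to_date(days_in_past, today):
    """Turn DAYS_IN_PAST into the calendar date that many days before today."""
    return today - timedelta(days=days_in_past)


def clamp_to_creation(cutoff, account_created):
    """A cutoff before the account existed falls back to the creation date."""
    return max(cutoff, account_created)


today = date(2020, 6, 1)              # fixed here so the example is reproducible
account_created = date(2020, 3, 15)   # hypothetical account creation date

date_cutoff = clamp_to_creation(parse_date_in_past("01-01-2020"), account_created)
days_cutoff = days_to_date(5, today)

print(date_cutoff.isoformat())
print(days_cutoff.isoformat())
```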

Twitter

| Variable Name | Description | Default value |
| --- | --- | --- |
| TWITTER_USER_NAME | The twitter username without quotes | |
| CONSUMER_KEY | The consumer key of the Twitter developer API | |
| CONSUMER_SECRET | The consumer secret of the Twitter developer API | |
| ACCESS_TOKEN | The access token of the Twitter developer API | |
| ACCESS_TOKEN_SECRET | The access token secret of the Twitter developer API | |
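
As a rough illustration of how these four credentials are typically used with tweepy, the sketch below checks that they are all set and then builds an API client. The repository's own tweepy code is not shown in this README, so the helper names (`missing_credentials`, `build_api`, `recent_tweets`) and the use of a plain dict for settings are assumptions. tweepy is imported inside the functions so the credential check can run without it installed.

```python
CREDENTIAL_NAMES = (
    "CONSUMER_KEY",
    "CONSUMER_SECRET",
    "ACCESS_TOKEN",
    "ACCESS_TOKEN_SECRET",
)


def missing_credentials(settings):
    """Return the credential names that are absent or empty in `settings`."""
    return [name for name in CREDENTIAL_NAMES if not settings.get(name)]


def build_api(settings):
    """Create an authenticated tweepy API client (hypothetical helper)."""
    missing = missing_credentials(settings)
    if missing:
        raise ValueError("Missing Twitter credentials: " + ", ".join(missing))
    import tweepy  # lazy import: only needed once the credentials look complete

    auth = tweepy.OAuthHandler(settings["CONSUMER_KEY"], settings["CONSUMER_SECRET"])
    auth.set_access_token(settings["ACCESS_TOKEN"], settings["ACCESS_TOKEN_SECRET"])
    # wait_on_rate_limit makes tweepy sleep instead of failing when rate limited.
    return tweepy.API(auth, wait_on_rate_limit=True)


def recent_tweets(api, screen_name, limit):
    """Fetch up to `limit` of the user's most recent tweets via the timeline API."""
    import tweepy

    cursor = tweepy.Cursor(api.user_timeline, screen_name=screen_name, tweet_mode="extended")
    return list(cursor.items(limit))


print(missing_credentials({"CONSUMER_KEY": "placeholder"}))
```

Failing early on an empty credential avoids starting a long download that would stop at the first API call.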

Chrome

| Variable Name | Description | Default value |
| --- | --- | --- |
| CHROME_GECKODRIVER_LOCATION | Location of an already downloaded chromedriver (from https://chromedriver.chromium.org/downloads). If not set, it is downloaded from the web based on the operating system | None |
| USE_CHROME | Use Chrome to download the tweets via Selenium (bool) | 0 |
| NUM_THREADS_CHROME | Number of threads to use. Each thread will have its own Chrome browser. This should depend on the number of tweets to download, the urgency, and the capacity of the system. If USE_CHROME is True and NUM_THREADS_CHROME is 0, NUM_THREADS_CHROME defaults to 1 | 1 |
| linux64 | Edit the URL if your system is Linux based | https://chromedriver.storage.googleapis.com/83.0.4103.14/chromedriver_linux64.zip |
| windows | Edit the URL if your system is Windows based | https://chromedriver.storage.googleapis.com/83.0.4103.14/chromedriver_win32.zip |
| macos | Edit the URL if your system is Mac based | https://chromedriver.storage.googleapis.com/83.0.4103.14/chromedriver_mac64.zip |

Firefox

| Variable Name | Description | Default value |
| --- | --- | --- |
| FIREFOX_GECKODRIVER_LOCATION | Location of an already downloaded geckodriver (from https://github.com/mozilla/geckodriver/releases). If not set, it is downloaded from the web based on the operating system | None |
| USE_FIREFOX | Use Firefox to download the tweets via Selenium (bool) | 1 |
| NUM_THREADS_FIREFOX | Number of threads to use. Each thread will have its own Firefox browser. This should depend on the number of tweets to download, the urgency, and the capacity of the system. If USE_FIREFOX is True and NUM_THREADS_FIREFOX is 0, NUM_THREADS_FIREFOX defaults to 1 | 1 |
| macos | Edit the URL if your system is Mac based | https://github.com/mozilla/geckodriver/releases/download/v0.26.0/geckodriver-v0.26.0-macos.tar.gz |
| linux32 | Edit the URL if your system is 32-bit Linux | https://github.com/mozilla/geckodriver/releases/download/v0.26.0/geckodriver-v0.26.0-linux32.tar.gz |
| linux64 | Edit the URL if your system is 64-bit Linux | https://github.com/mozilla/geckodriver/releases/download/v0.26.0/geckodriver-v0.26.0-linux64.tar.gz |
| windows32 | Edit the URL if your system is 32-bit Windows | https://github.com/mozilla/geckodriver/releases/download/v0.26.0/geckodriver-v0.26.0-win32.zip |
| windows64 | Edit the URL if your system is 64-bit Windows | https://github.com/mozilla/geckodriver/releases/download/v0.26.0/geckodriver-v0.26.0-win64.zip |
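
The operating-system-based choice of download URL could look something like the sketch below, which maps the running platform to one of the keys in the table above (macos, linux32, linux64, windows32, windows64). The function names and the 64-bit heuristic are assumptions; the repository's own detection code is not shown in this README.

```python
import platform


def geckodriver_key(system, is_64bit):
    """Map an OS name (as returned by platform.system()) to a table key."""
    if system == "Darwin":
        return "macos"
    if system == "Linux":
        return "linux64" if is_64bit else "linux32"
    if system == "Windows":
        return "windows64" if is_64bit else "windows32"
    raise ValueError("Unsupported operating system: " + system)


def detect_geckodriver_key():
    """Detect the key for the current machine.

    Heuristic (an assumption): treat the machine as 64-bit when its
    architecture string ends in '64', e.g. 'x86_64' or 'AMD64'. This does
    not distinguish x86_64 from ARM builds.
    """
    return geckodriver_key(platform.system(), platform.machine().endswith("64"))


print(geckodriver_key("Linux", True))
print(geckodriver_key("Windows", False))
```

The returned key would then select the matching URL from the configuration, unless FIREFOX_GECKODRIVER_LOCATION already points to a downloaded driver.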

Discuss

Feel free to post any questions, comments, or bugs. The Twitter UI changes from time to time, so the Selenium part might break.

References

Twitter Search:

Regular Syntax: https://twitter.com/search-advanced

Advanced Syntax: https://help.twitter.com/en/using-twitter/advanced-tweetdeck-features

Tweepy API

API Reference: http://docs.tweepy.org/en/latest/api.html

Cursor Tutorial: http://docs.tweepy.org/en/latest/cursor_tutorial.html
