The aim of this project is to gather data about the WeRateDogs Twitter page from a variety of sources and in different formats, then wrangle and analyze it.
We worked with the following data:
twitter-archive-enhanced.csv
was downloaded manually and could be read directly. It contains details such as tweet_id, timestamp, source, tweet text, retweet id, retweet user id, retweet timestamp, rating numerator, rating denominator, dog name and dog classification provided by WeRateDogs.
image_predictions.tsv
was downloaded programmatically from a URL. This is a tab-separated file and was read accordingly using pandas. As background, this dataset was generated by a neural network that predicts the dog breed. It contains only the top 3 image predictions, their corresponding confidence ratings and a boolean column indicating whether each prediction is a dog.
tweet_json.txt
Tweets were extracted from Twitter based on the tweet_id values in twitter-archive-enhanced.csv using the Twitter API and tweepy. We extracted data for the tweet_ids that are still available online. Each tweet's JSON dump was written to the file on its own line. In the next step, we read the records from tweet_json.txt one by one and extracted tweet_id, retweet_count and favorite_count into a pandas dataframe.
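The line-by-line read of tweet_json.txt described above could look roughly like the sketch below. It is only an illustration, not the project's actual code: the helper name tweets_to_frame is hypothetical, the JSON keys (id, retweet_count, favorite_count) are assumed to match the tweet objects that were dumped, and the sample values are made up.

```python
import io
import json

import pandas as pd


def tweets_to_frame(lines):
    """Build a dataframe from an iterable of JSON strings, one tweet per line.

    Hypothetical helper; the key names below are assumptions about the dump.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return pd.DataFrame(rows, columns=["tweet_id", "retweet_count", "favorite_count"])


# In-memory stand-in for tweet_json.txt with made-up values
sample = io.StringIO(
    '{"id": 1, "retweet_count": 10, "favorite_count": 25}\n'
    '{"id": 2, "retweet_count": 3, "favorite_count": 7}\n'
)
tweet_counts = tweets_to_frame(sample)
```

With the real file, the same helper could be fed an open file handle, for example `with open("tweet_json.txt") as f: tweet_counts = tweets_to_frame(f)`.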
The main parameters that we have taken into account for this analysis are individual tweets, the rating they received and the number of retweets/favorites that they got. Retweet and favorite counts show how popular and well liked a particular tweet was among the followers.
From the data we gathered for the WeRateDogs Twitter archive, we came up with some interesting findings.
- 87.08 % of the tweets posted by WeRateDogs get a rating in the range (0.8 – 1.4]. Put another way, if WeRateDogs posts a tweet, the probability that it receives a rating in the range (0.8 – 1.4] is 0.8708.
- 71.58 % of the tweets posted by WeRateDogs get a rating in the range (0.8 – 1.2].
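Percentages like the ones above can be obtained by binning the ratings. The sketch below assumes the rating is rating_numerator divided by rating_denominator and that the bins are 0.2 wide and right-closed; both are assumptions for illustration, and the data is made up rather than taken from the archive.

```python
import pandas as pd

# Made-up ratings for illustration only
df = pd.DataFrame({
    "rating_numerator": [13, 12, 10, 11, 5, 14, 9, 12],
    "rating_denominator": [10] * 8,
})
# Assumption: rating = numerator / denominator
df["rating"] = df["rating_numerator"] / df["rating_denominator"]

# Assumed bin edges; pd.cut builds right-closed intervals such as (1.2, 1.4]
edges = [0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, float("inf")]
df["rating_bin"] = pd.cut(df["rating"], bins=edges)

# Percentage of tweets per bin, kept in bin order
share_pct = df["rating_bin"].value_counts(normalize=True, sort=False) * 100

# Percentage of tweets with a rating in (0.8, 1.4]
in_range = df["rating"].gt(0.8) & df["rating"].le(1.4)
pct_08_14 = in_range.mean() * 100
```

The mean of a boolean column is the fraction of True values, which is why `in_range.mean()` can be read directly as the empirical probability quoted in the first bullet.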
Another interesting analysis shows the average retweet and favorite counts received for each rating bin.
- The bars for the range (1.2 - 1.4] clearly shoot up. This shows that the dogs/tweets that receive a rating in the range (1.2 - 1.4] are liked the most and hence retweeted/favorited the most. Around 15.51 % of tweets receive ratings in this range.
- The next highest bars are for ratings above 1.4. A very small percentage of tweets (0.29 %) receive these ratings. Surprisingly, they are liked less than tweets in (1.2 - 1.4].
- One more interesting point in this graph concerns dogs falling in the range (0 - 0.2]. Again surprisingly, they receive even more retweets and favorites than those in the range (0.2 - 1].
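The per-bin averages behind such a bar chart can be computed with a group-by on the rating bin. This is a minimal sketch under assumptions: the column names rating, retweet_count and favorite_count, the 0.2-wide bin edges and the sample values are all illustrative, not the project's actual data.

```python
import pandas as pd

# Made-up data for illustration only
df = pd.DataFrame({
    "rating": [1.3, 1.2, 1.0, 1.1, 0.1, 1.4],
    "retweet_count": [900, 300, 100, 200, 400, 1100],
    "favorite_count": [3000, 1000, 400, 600, 1200, 4000],
})

# Assumed bin edges, with readable labels such as "(1.2, 1.4]"
edges = [0, 0.2, 0.4, 0.6, 0.8, 1.0, 1.2, 1.4, float("inf")]
labels = [f"({lo}, {hi}]" for lo, hi in zip(edges[:-1], edges[1:])]
df["rating_bin"] = pd.cut(df["rating"], bins=edges, labels=labels)

# Average retweets and favorites per bin; observed=True drops empty bins
bin_means = (
    df.groupby("rating_bin", observed=True)[["retweet_count", "favorite_count"]]
    .mean()
)
```

Calling `bin_means.plot.bar()` with matplotlib installed would then draw one pair of bars per rating bin.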
We also have data from the neural network that predicts the breed of the dog. Here p1 represents the algorithm's most likely prediction, p2 the second most likely prediction and p3 the third most likely prediction.
- For p1, the neural network predicts a dog in the image 73.8 % of the time. Also, the probability of correctly identifying the dog's breed, given that the prediction is a dog, is 0.614 as per p1.
- The probability of correctly identifying the dog's breed, given that the prediction is a dog, is 0.140 as per p2.
- The probability of correctly identifying the dog's breed, given that the prediction is a dog, is 0.062 as per p3.
Given the above observations, the decline in the probability of correctly identifying the dog's breed from p1 to p2 to p3 is clearly evident.
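One possible way to arrive at figures of this kind is sketched below. It rests on loudly stated assumptions: the column names p1_dog/p1_conf (and likewise for p2, p3) are guesses at the layout of image_predictions.tsv, the "probability of correctly identifying the breed" is read here as the mean confidence among predictions flagged as dogs, and the values are made up. The original analysis may have been computed differently.

```python
import pandas as pd

# Made-up predictions; column names are assumptions about image_predictions.tsv
preds = pd.DataFrame({
    "p1_conf": [0.9, 0.6, 0.3, 0.8],
    "p1_dog": [True, True, False, True],
    "p2_conf": [0.05, 0.20, 0.10, 0.15],
    "p2_dog": [True, False, True, True],
    "p3_conf": [0.02, 0.05, 0.04, 0.03],
    "p3_dog": [True, True, False, False],
})

summary = {}
for k in ("p1", "p2", "p3"):
    is_dog = preds[f"{k}_dog"]
    summary[k] = {
        # Fraction of images where this prediction is a dog
        "share_dog": is_dog.mean(),
        # Assumed reading: mean confidence among predictions flagged as dogs
        "mean_conf_given_dog": preds.loc[is_dog, f"{k}_conf"].mean(),
    }

summary_df = pd.DataFrame(summary).T
```

Each row of summary_df then holds the two numbers for one prediction rank, which makes the decline from p1 to p3 easy to compare side by side.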
Disclaimer: This project was developed as part of Udacity's Data Analyst Nanodegree.
- pip install pandas
- pip install tweepy
- pip install matplotlib