The dataset to be wrangled (and analyzed and visualized) is the tweet archive of Twitter user @dog_rates, also known as WeRateDogs, a Twitter account that rates people's dogs with a humorous comment about the dog. These ratings almost always have a denominator of 10. The numerators, though? Almost always greater than 10: 11/10, 12/10, 13/10, etc.
Wrangle WeRateDogs Twitter data to create interesting and trustworthy analyses and visualizations by investigating the following questions (a pandas sketch for the first few appears after the list):
- What are the top 30 most tweeted dog breeds by WeRateDogs?
- Which dog breed received the highest rating on average?
- Which dog breed received the highest retweet and favorite counts on average?
- Which dog stage received the highest rating, retweet, and favorite counts?
- Do the hashtags included in a tweet impact its retweet and favorite counts?
- How do retweet and favorite counts vary, on average, by the day of the week a tweet was posted?
- Do ratings impact retweet and favorite counts?
- What is the relationship between retweet and favorite counts?
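As a minimal sketch of how the first few questions might be answered, assuming a cleaned master DataFrame loaded from a merged file (the filename and the columns `breed`, `rating_numerator`, `retweet_count`, and `favorite_count` are assumptions, not the project's exact schema):

```python
import pandas as pd

# Assumed cleaned, merged dataset with one row per original rated tweet.
df_master = pd.read_csv('twitter_archive_master.csv')  # filename assumed

# Top 30 most tweeted dog breeds.
top_breeds = df_master['breed'].value_counts().head(30)
print(top_breeds)

# Average rating, retweet count, and favorite count per breed.
breed_stats = (
    df_master
    .groupby('breed')[['rating_numerator', 'retweet_count', 'favorite_count']]
    .mean()
    .sort_values('favorite_count', ascending=False)
)
print(breed_stats.head(10))
```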
The following libraries can be installed via conda or pip (see the command after the list):
- pandas
- NumPy
- requests
- tweepy
- matplotlib
- seaborn

The json, re, string, and datetime modules ship with the Python standard library and need no installation.
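For example, with pip (assuming the standard PyPI package names):

```
pip install pandas numpy requests tweepy matplotlib seaborn
```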
- Enhanced Twitter Archive: basic tweet data (tweet ID, timestamp, text, etc.) for WeRateDogs tweets as they stood on August 1, 2017, provided by Udacity.
- Twitter API: used to gather retweet counts and favorite counts, two notable columns missing from the archive, along with details about the hashtags used (see the sketch after this list).
- Image Predictions File: the predicted breed for each tweet's dog image, provided by Udacity.
- Only original ratings (no retweets) that have images are considered.
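A minimal sketch of the gathering step, assuming Tweepy v3 (in v4 the exception class is `tweepy.errors.TweepyException`); the URL, credential strings, and filenames are placeholders:

```python
import json

import pandas as pd
import requests
import tweepy

# Download the image predictions file (placeholder URL; use the link
# from the project page).
url = 'https://example.com/image-predictions.tsv'
with open('image_predictions.tsv', 'wb') as f:
    f.write(requests.get(url).content)

# Authenticate with the Twitter API (placeholder credentials).
auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True)

# Fetch each tweet's full JSON, which includes retweet_count,
# favorite_count, and entities such as hashtags.
tweet_ids = pd.read_csv('twitter-archive-enhanced.csv')['tweet_id']
with open('tweet_json.txt', 'w') as f:
    for tweet_id in tweet_ids:
        try:
            status = api.get_status(tweet_id, tweet_mode='extended')
            f.write(json.dumps(status._json) + '\n')
        except tweepy.TweepError:
            pass  # skip deleted or otherwise unavailable tweets
```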
Assess data for:
- Quality: inconsistent data, inaccurate data, non-descriptive headers, missing values (NaN).
- Tidiness: structural issues that prevent easy analysis. Tidy data requires that each variable forms a column, each observation forms a row, and each type of observational unit forms a table.
Types of assessment:
- Visual assessment
- Programmatic assessment (using pandas; see the sketch after this list)
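A few typical programmatic checks, as a sketch (the filename and the `tweet_id` column are assumptions based on the archive's description):

```python
import pandas as pd

df_archive = pd.read_csv('twitter-archive-enhanced.csv')  # filename assumed

# Structure, dtypes, and non-null counts.
df_archive.info()

# Missing values per column.
print(df_archive.isnull().sum())

# Duplicate rows and duplicate tweet IDs.
print(df_archive.duplicated().sum())
print(df_archive['tweet_id'].duplicated().sum())

# Spot-check a random sample visually.
print(df_archive.sample(5))
```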
Programmatic data cleaning process:
- Define: convert the assessments into defined cleaning tasks.
- Code: convert those definitions to code and run that code.
- Test: check the dataset, visually or with code, to make sure the cleaning operations worked (a worked example follows).
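One define-code-test cycle as a minimal sketch; the filename, `df_archive`, and the `timestamp` column are assumptions:

```python
import pandas as pd

df_archive = pd.read_csv('twitter-archive-enhanced.csv')  # filename assumed
df_clean = df_archive.copy()  # clean a copy, never the original

# Define: `timestamp` is stored as a string; convert it to datetime.
# Code:
df_clean['timestamp'] = pd.to_datetime(df_clean['timestamp'])

# Test: confirm the conversion took effect.
assert pd.api.types.is_datetime64_any_dtype(df_clean['timestamp'])
```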