Project 4 from Udacity Data Analyst Nanodegree
In this project I gathered, assessed and cleaned data from the WeRateDogs twitter account.
The data was gathered from three different sources:
- The twitter-archive-enchaced.csv was downloaded manually and I used the pandas function read_csv to read it into a dataframe.
- The image-prediction.tsv was downloaded programmatically using the requests library and I used the pandas function read_csv to read it into a dataframe.
- The tweet-json.txt was gathered using the Twitter API and the tweepy library. Data Assessing After gathering the data and loading it into three different dataframes, I assessed it both visually and programmatically and the following quality and tidiness issues were found:
- The
tweet_id
column in thedf_image_prediction
and in thedf_twitter_archive
is an int, while it is a string in the other dataframe. - The
retweet_count
and thefavorite_count
in thedf_api
are strings, they should be int as they represent numbers. - The
df_twitter_archive
has rows of tweets that were retweets. We only want original ratings from the WeRateDogs account. - The
timestamp
column in thedf_twitter_archive
dataframe should be of datetime type. - The
rating_denominator
in thedf_twitter_archive
is not always equal to 10. - The
name
,doggo
,floffer
,pupper
andpuppo
columns in thedf_twitter_archive
dataframe have the string value 'None' instead of the null value for null values. - Dogs named 'a' and 'the' in the
df_twitter_archive
. - The
df_twitter_archive
has rows of tweets that were replys. We only want original ratings in the dataframe.
- As all tables are about the tweets, we should have a single dataframe.
- The
doggo
,floffer
,pupper
andpuppo
columns in thedf_twitter_archive
dataframe should be just one categorical column.
Finally, I copied the original data and cleaned it using a range of Python and pandas functions such as:
- Replacing incorrect values with the null value using replace
- Dropping rows that weren't useful for the analysis, such as retweets and replies from the WeRateDogs account
- Used the melt function from pandas to create a categorical column for the different types of dogs (doggo, floffer, pupper and puppo)
- Converted data types from columns that were int and should be strings
- Converted data types from columns that were strings and should be int
- Joined all the three dataframes into a single master dataframe
- Converted columns with information about date and time to the datetime data type
- Replaced names of dogs that weren't correct with the null value
- Corrected the rating denominator so that it would be 10 for all cases, as expected