The Movie Database(TMDb) is a community built movie and TV database. Every piece of data has been added by the amazing community dating back to 2008. TMDb's strong international focus and breadth of data is largely unmatched and something they're incredibly proud of. Put simply, they live and breathe community and that's precisely what makes them different.
(Image is from a copyright-free website: https://www.pexels.com/royalty-free-images/.)
This data set contains information about 10,000 movies collected from TMDb, including user ratings and revenue.
- Certain columns, like ‘cast’ and ‘genres’, contain multiple values separated by pipe (|) characters;
- There are some odd characters in the ‘cast’ column;
- The final two columns ending with “_adj” show the budget and revenue of the associated movie in terms of 2010 dollars, accounting for inflation over time.
Table of Contents |
---|
Prerequisites 🔍📜 |
Design 📐 |
Conclusions 📌 |
License 🔖 |
- Python 3.6.3
- Jupyter Notebook
- Anaconda-Navigator
Step One - Choose Data Set
Click this link to download the corresponding data.
Step Two - Get Organized
This project eventually contain:
- The report communicating any findings;
- Any Python code used during the analysis;
- The data set;
Step Three - Analyze
Brainstorm some questions that could be answered using the data set, then start answering those questions, we would mainly focus on looking at the relationships between multiple variables.
In current study, a good amount of profound analysis has been carried out. Prior to each step, deailed instructions was given and interpretions was also provided afterwards. The dataset included 10866 pieces of film information ranging from 1960 to 2015, which consisted most of the main stream movies. Based on such substantial data, the analysis would be more reliable as opposed to small scale analysis. The limitations of current study were NaN values, which could affect the process of analysis. Luckily, those NaN values were all of category type, thus it has limited impact on arithmetric computing.
However, it might matter when comparing category column with numerical column for analysis. The stragety appiled in current study is to keep those NaN value, but convert them as 'No record' which is a string type of data. Among the 19 questions, only 2 questions were affected by the NaN value, thus most of the analysis are highly reliable.