Welcome to the Titanic Analysis repository.
This is a mini-project for SC1015 - Introduction to Data Science and Artificial Intelligence.
This project focuses on predicting who survived the Titanic disaster based on passenger data.
- @jdengoh: Data pre-processing, Data Splitting and Feature Analysis, KNN, XGBoost, Conclusion
- @ananyakapoor12: Data pre-processing, Data Visualisation and EDA, Random Forest Classifier, Conclusion
Our team's objective is to analyse and predict the likely survivors of the Titanic disaster using passenger data.
The underlying motivation is to provide new insights into the Titanic incident by:
- Identifying factors that may influence passengers' survival rate.
- Predicting the likelihood of an individual's survival.
Such insights may tell us more about different demographics' chances of survival and the factors that affect their survival rate.
We hope that such insights can be useful in the future for:
- Identifying undiscovered loopholes in safety measures or precautions.
- Further improving existing safety infrastructure to prevent such a disaster from occurring in the future.
https://www.kaggle.com/competitions/titanic/data
- KNN
- XGBoost
- Random Forest Classifier
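
The sketch below shows, assuming `train.csv` has been downloaded from the Kaggle link above, how the data might be loaded, lightly encoded and split before fitting the three models. The column choices, imputation and encoding here are illustrative only, not the notebook's exact preprocessing steps.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")

# Illustrative encoding and imputation (not the notebook's exact steps)
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
df = pd.get_dummies(df, columns=["Embarked"])  # one-hot encode port of embarkation

features = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare",
            "Embarked_C", "Embarked_Q", "Embarked_S"]
X = df[features].astype(float)
y = df["Survived"]

# Hold out a test split for evaluating each model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```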
- Children and women having a higher chance of survival could be attributed to passengers aboard prioritising the saving of women and children first.
- While ticket class should, in principle, not have much impact on one's survival rate, it is interesting to see that first-class passengers had a higher survival rate.
  - This may be linked to their cabin location and their access to safety infrastructure at the time of the disaster.
  - However, more data would be required to confidently explain how ticket class affects survival rate directly.
- Passengers who travelled with smaller families seem to have had a higher chance of surviving.
  - Perhaps those with many family members aboard were unwilling to leave them behind, since it was unlikely that everyone in their family could be saved.
  - For those travelling alone, it is possible that they prioritised saving smaller families first.
  - Another important factor is that most large families were travelling in third class, which also contributed to their lower survival rate.
- First-class passengers were more likely to survive than second- or third-class ones. Combined with women being more likely to survive than men, this meant that almost all female passengers in first class survived (a quick tabulation of these rates is sketched below).
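
Observations like the ones above can be checked by tabulating survival rates by sex, ticket class and family size straight from the raw Kaggle `train.csv` (column names as in the competition data). This is an illustrative snippet, not the project's full EDA.

```python
import pandas as pd

df = pd.read_csv("train.csv")

# Survival rate by sex and by ticket class
print(df.groupby("Sex")["Survived"].mean())
print(df.groupby("Pclass")["Survived"].mean())

# Survival rate for each (class, sex) combination, e.g. first-class women
print(df.groupby(["Pclass", "Sex"])["Survived"].mean().unstack())

# Family size = siblings/spouses + parents/children + the passenger themself
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
print(df.groupby("FamilySize")["Survived"].mean())
```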
- Using the KNN model, our best attempt used the 8 best features, likely due to the Curse of Dimensionality. In this attempt, our False Negatives were noticeably lower than in our other attempts, and our False Positives were manageable too.
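
A minimal sketch of what an 8-best-feature KNN attempt could look like, reusing the `X_train`/`X_test` split from the loading sketch above. The chi-squared feature selector and the `n_neighbors` value are illustrative assumptions, not the notebook's exact settings.

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix

# Keep only the 8 highest chi-squared-scoring features
selector = SelectKBest(chi2, k=8)
X_train_8 = selector.fit_transform(X_train, y_train)
X_test_8 = selector.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)  # placeholder value, not our tuned k
knn.fit(X_train_8, y_train)
pred = knn.predict(X_test_8)

# Rows are true labels, columns are predictions: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_test, pred))
```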
- Using XGBoost, our best model used all features for prediction, likely because XGBoost is a robust model in itself. While our first attempt had a high accuracy, it also had a relatively higher number of False Negatives compared to Attempt 3 and Attempt 4, which used the best 10 and 8 features respectively.
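
A hedged sketch of the all-features XGBoost attempt, again reusing the earlier split; the hyperparameters shown are generic defaults, not our tuned values.

```python
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# All features, generic hyperparameters
xgb = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1,
                    eval_metric="logloss")
xgb.fit(X_train, y_train)
pred = xgb.predict(X_test)

print(accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
```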
- Using the Random Forest Classifier, our attempts with the 10 and 8 best features yielded the best results; the main issue here is likely overfitting. False Positives in both of these attempts were much higher than False Negatives, a general trend observed across all our models.
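
A minimal sketch of a Random Forest attempt with a few settings commonly used to curb overfitting (`max_depth`, `min_samples_leaf`); the values are placeholders rather than the project's tuned ones.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Shallow trees and a minimum leaf size help limit overfitting
rf = RandomForestClassifier(n_estimators=200, max_depth=5,
                            min_samples_leaf=2, random_state=42)
rf.fit(X_train, y_train)

print(confusion_matrix(y_test, rf.predict(X_test)))
```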
- Overall, across the 4 attempts on each of our three models, the highest test accuracy of 0.83708 was achieved by our XGBoost model using all features for prediction. This is likely because XGBoost is a robust model with built-in mechanisms for correcting errors, reducing overfitting and improving accuracy.
Our key learnings from this project are as follows:
- Data pre-processing and cleaning to prepare the data for model fitting.
- Feature analysis on data.
- Encoding of data when needed.
- New prediction models - K-Nearest Neighbour, XGBoost, Random Forest Classifier.
- Using the best features for predictions in different models, and cross-validation.
- The overall machine learning pipeline, from getting the data to making our predictions (see the sketch below).
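
The snippet below illustrates the pipeline idea in miniature, combining feature selection, a classifier and cross-validation with scikit-learn's `Pipeline`. It is a generic sketch under the same assumptions as the earlier snippets, not the notebook's exact workflow.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ("select", SelectKBest(chi2, k=8)),           # keep the k best features
    ("knn", KNeighborsClassifier(n_neighbors=5))  # any of the three models could go here
])

# 5-fold cross-validated accuracy on the training split from earlier
scores = cross_val_score(pipe, X_train, y_train, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())
```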
- https://towardsdatascience.com/k-nearest-neighbors-and-the-curse-of-dimensionality-e39d10a6105d#:~:text=The%20%E2%80%9CCurse%20of%20Dimensionality%E2%80%9D%20is,to%20keep%20the%20same%20density.
- https://medium.com/swlh/stop-one-hot-encoding-your-categorical-features-avoid-curse-of-dimensionality-16743c32cea4#:~:text=One%2Dhot%20encoding%20categorical%20variables,problem%20of%20parallelism%20and%20multicollinearity.
- https://neptune.ai/blog/knn-algorithm-explanation-opportunities-limitations
- https://www.simplilearn.com/tutorials/machine-learning-tutorial/knn-in-python
- https://realpython.com/knn-python/#tune-and-optimize-knn-in-python-using-scikit-learn
- https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
- https://crunchingthedata.com/random-forest-overfitting/#:~:text=Overfitting%20happens%20when%20a%20model,not%20generalize%20to%20other%20observations.
- https://www.analyticsvidhya.com/blog/2018/09/an-end-to-end-guide-to-understand-the-math-behind-xgboost/
- https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d
- https://datascience.stackexchange.com/questions/68931/scikit-learn-onehotencoder-effect-on-feature-selection
- https://github.com/aswintechguy/Data-Science-Concepts/blob/main/Machine%20Learning/
- https://www.analyticsvidhya.com/blog/2021/06/decoding-the-chi-square-test%E2%80%8A-%E2%80%8Ause-along-with-implementation-and-visualization/
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html
- https://www.analyticsvidhya.com/blog/2021/05/feature-engineering-how-to-detect-and-remove-outliers-with-python-code/
- https://seaborn.pydata.org/
- https://www.kaggle.com/code/allohvk/titanic-missing-age-imputation-tutorial-advanced
- https://www.kaggle.com/code/jirakst/titanic-auc-92/notebook
- https://www.kaggle.com/code/nhlr21/complete-titanic-tutorial-with-ml-nn-ensembling/notebook
- https://www.kaggle.com/code/dantefilu/keras-neural-network-a-hitchhiker-s-guide-to-nn/notebook#Appendix