🌆 City Predictor: UofT Student Survey Analysis 🌇

Welcome to the City Predictor project! This repository contains the code and data for a machine learning model that predicts which city a University of Toronto student is describing in a survey response. Given the answers to 10 questions, the model predicts one of four cities: Rio de Janeiro, Dubai, New York City, or Paris.

Our model achieved a test accuracy of 87.5% (63/72 correct).

📊 Project Overview

In this project, we:

- Collected survey data with responses to 10 questions describing a city.
- Explored and cleaned the data, addressing missing values and outliers.
- Tested multiple machine learning algorithms.
- Chose the Random Forest model for its high accuracy.
- Built a custom model from scratch, including preprocessing pipelines and hyperparameter tuning.

Feel free to explore the repository to see how different models were tested and to understand the custom implementation of our chosen model.
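
As a rough illustration of the cleaning step mentioned above (missing values and outliers), here is a minimal sketch using only numpy. It is not the code in preprocessing.py: the function names, the choice of median imputation, and the 1.5 × IQR clipping rule are all assumptions made for this example.

```python
import numpy as np


def to_float(value):
    """Return value as a float, or NaN when it cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return np.nan


def impute_median(values):
    """Coerce raw answers to floats and replace missing ones with the median."""
    numeric = np.array([to_float(v) for v in values], dtype=float)
    median = np.nanmedian(numeric)  # median of the parseable answers only
    return np.where(np.isnan(numeric), median, numeric)


def clip_outliers(values, k=1.5):
    """Clip values lying more than k * IQR outside the quartiles."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return np.clip(values, q1 - k * iqr, q3 + k * iqr)


# Hypothetical Q7-style answers (average temperature), not rows from the dataset.
raw_answers = ["25", "30", None, "28", "1000", "abc", "27"]
cleaned = clip_outliers(impute_median(raw_answers))
print(cleaned)
```

The missing and unparseable answers take the median of the valid ones, and the implausible 1000 is pulled back to the upper fence instead of being dropped, so no survey row is lost.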

🚀 Getting Started

To run the code, you'll need Python 3 installed on your machine, along with pandas and numpy. The cool part of this project is that we did not use any prebuilt models or libraries like scikit-learn for the final implementation!

Prerequisites

Make sure you have the following packages installed:

pip install pandas numpy

Running the Model

Clone the repository:

git clone https://github.com/kxnoun/city-predictor.git
cd city-predictor

Explore the code:
- preprocessing.py: Functions for data preprocessing.
- pipeline.py: Custom pipeline and transformers.
- random_forest.py: Custom Random Forest implementation.
- alternate_models/: Directory containing different models tried during the project.
- main.py: Main script for training and predicting.

Run the main script:
```
python main.py
```

This will train the model on the provided dataset and make predictions based on the survey responses.

📂 Project Structure

city-predictor/
│
├── preprocessing.py
├── pipeline.py
├── random_forest.py
├── main.py
├── alternate_models/
│   ├── kNearestNeighbour.ipynb
│   ├── LogisticRegression.ipynb
│   └── ...
├── clean_dataset.csv
└── stopwords.txt

📚 Data

The dataset contains survey responses with the following features:

- Q1-Q4: Likert scale questions (1 to 5)
- Q5: Categories of travel companions (binary: Siblings, Co-worker, Partner, Friends)
- Q6: Rankings of city aspects (1 to 6: Skyscrapers, Sport, Art and Music, Carnival, Cuisine, Economic)
- Q7: Average temperature
- Q8: Number of languages spoken
- Q9: Number of fashion styles
- Q10: Descriptive text about the city
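
To make the Q5 and Q6 features more concrete, here is a small sketch of how such answers could be turned into numbers. The raw string formats are assumptions for illustration (a comma-separated list for Q5, `Aspect=>rank` pairs for Q6), and `encode_companions` and `extract_rankings` are hypothetical helpers, not functions from this repository.

```python
Q5_CATEGORIES = ["Siblings", "Co-worker", "Partner", "Friends"]
Q6_ASPECTS = ["Skyscrapers", "Sport", "Art and Music", "Carnival", "Cuisine", "Economic"]


def encode_companions(answer):
    """Turn a comma-separated Q5 answer into one 0/1 flag per category."""
    chosen = {part.strip() for part in (answer or "").split(",")}
    return [1 if category in chosen else 0 for category in Q5_CATEGORIES]


def extract_rankings(answer):
    """Turn 'Aspect=>rank' pairs into ranks ordered like Q6_ASPECTS (None if absent)."""
    ranks = {}
    for pair in (answer or "").split(","):
        name, sep, rank = pair.partition("=>")
        if sep and rank.strip().isdigit():  # skip pairs with a missing rank
            ranks[name.strip()] = int(rank)
    return [ranks.get(aspect) for aspect in Q6_ASPECTS]


# Assumed example answers, not rows from clean_dataset.csv.
print(encode_companions("Partner,Friends"))
print(extract_rankings(
    "Skyscrapers=>6,Sport=>2,Art and Music=>4,Carnival=>1,Cuisine=>5,Economic=>3"
))
```

Ranks that are missing come back as `None`, which leaves the decision about how to fill them to a later imputation step.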

🛠️ Custom Pipeline

The custom pipeline includes steps like:

- Imputing missing values.
- Converting numeric inputs.
- Adjusting outliers.
- Encoding categorical data.
- Extracting rankings.
- Cleaning text.
- Vectorizing text using TF-IDF.
- Training the Random Forest model.
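
The text steps in the list above (cleaning, then TF-IDF) can be sketched without scikit-learn in a few lines of standard-library Python. This is an illustrative version, not the code in pipeline.py: the tokenizer, the tiny stand-in stop-word set (the repository ships its own stopwords.txt), and the smoothed idf formula are assumptions chosen for the example.

```python
import math
import re
from collections import Counter

# Tiny stand-in stop-word list; the real list would come from stopwords.txt.
STOPWORDS = {"the", "of", "a", "is", "and"}


def clean_text(text):
    """Lowercase, keep alphabetic tokens only, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def tfidf_vectorize(documents):
    """Return (vocabulary, matrix) where matrix[i][j] is tf * idf of word j in doc i."""
    tokenized = [clean_text(doc) for doc in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    # Smoothed idf, one common variant: log((1 + N) / (1 + df)) + 1.
    idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty documents
        matrix.append([counts[w] / total * idf[w] for w in vocabulary])
    return vocabulary, matrix


# Made-up Q10-style descriptions, not taken from the dataset.
docs = ["The city of love", "The city that never sleeps", "City of gold"]
vocab, X = tfidf_vectorize(docs)
print(vocab)
print([round(v, 3) for v in X[0]])
```

Words that appear in every description, such as "city" here, get the lowest idf, so rarer words like "love" carry more weight in the resulting vector.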

📈 Model Performance

After trying various models, we found that the Random Forest model provided the highest accuracy on the test set. We then tuned its hyperparameters; the final model reaches the 87.5% test accuracy (63/72) reported above.
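
For readers curious what a Random Forest built without scikit-learn can look like, here is a compact numpy sketch of the general technique: Gini-based decision trees, bootstrap sampling, a random feature subset per split, and majority voting. It is not the implementation in random_forest.py. The function names, default hyperparameters, and synthetic data are assumptions for illustration only.

```python
import numpy as np


def gini(labels):
    """Gini impurity of an integer label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)


def best_split(X, y, feature_ids):
    """Find the (feature, threshold) pair with the lowest weighted Gini impurity."""
    best = (None, None, gini(y))  # only accept splits that improve on the parent
    for f in feature_ids:
        for t in np.unique(X[:, f])[:-1]:  # skip the max: it would leave one side empty
            left = X[:, f] <= t
            score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / len(y)
            if score < best[2]:
                best = (f, t, score)
    return best[0], best[1]


def build_tree(X, y, rng, max_depth, n_features):
    """Grow a tree recursively; leaves store the majority class."""
    majority = np.bincount(y).argmax()
    if max_depth == 0 or len(np.unique(y)) == 1:
        return majority
    feature_ids = rng.choice(X.shape[1], size=n_features, replace=False)
    f, t = best_split(X, y, feature_ids)
    if f is None:
        return majority
    left = X[:, f] <= t
    return (f, t,
            build_tree(X[left], y[left], rng, max_depth - 1, n_features),
            build_tree(X[~left], y[~left], rng, max_depth - 1, n_features))


def predict_tree(node, row):
    """Walk from the root to a leaf for a single sample."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def fit_forest(X, y, n_trees=25, max_depth=5, seed=0):
    """Train each tree on a bootstrap sample with sqrt(n_features) features per split."""
    rng = np.random.default_rng(seed)
    n_features = max(1, int(np.sqrt(X.shape[1])))
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
        trees.append(build_tree(X[idx], y[idx], rng, max_depth, n_features))
    return trees


def predict_forest(trees, X):
    """Majority vote across trees for every row of X."""
    votes = np.array([[predict_tree(tree, row) for tree in trees] for row in X])
    return np.array([np.bincount(v).argmax() for v in votes])


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return float(np.mean(y_true == y_pred))


# Synthetic stand-in data: 4 classes (one per city) in a 6-feature space.
rng = np.random.default_rng(42)
centers = rng.normal(scale=4.0, size=(4, 6))
y = np.repeat(np.arange(4), 60)
X = centers[y] + rng.normal(size=(240, 6))
order = rng.permutation(240)
X, y = X[order], y[order]

trees = fit_forest(X[:180], y[:180])
acc = accuracy(y[180:], predict_forest(trees, X[180:]))
print(f"held-out accuracy on synthetic data: {acc:.3f}")
```

Accuracy here is simply correct predictions divided by total predictions, the same calculation behind the reported 63/72 = 87.5%. The hyperparameters worth tuning in a sketch like this are `n_trees`, `max_depth`, and the number of features considered per split.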

🎉 Contributors

- Adam: Data Pre-processing, Model Exploration (Random Forest), Model Implementation, Hyper-parameter Fine-tuning, Cross-validation
- Neha: Data and Model Exploration (Logistic Regression, kNN), pred.py script Implementation, Report Writing
- Katerina: Data Exploration and Visualization
- Manahil: Model Exploration (Decision Tree), Custom RF Implementation for pred.py, Report Writing
