This project implements a machine learning pipeline to analyze and predict survival on the Titanic dataset. It leverages a Bagging Classifier with decision trees to enhance model robustness. The project includes data preprocessing, model training, evaluation, and saving the results.
- Interactive column selection for preprocessing.
- Handles missing values and encodes categorical variables.
- Implements Bagging Classifier with decision trees.
- Exports the processed dataset to .csv and Google Sheets.
Before running the code, ensure you have the following:
- Python 3.8+
- Kaggle API credentials for downloading the dataset.
- Necessary Python libraries (see requirements.txt).
- Access to Google Sheets API (if using the csv_to_sheets function).
```shell
git clone https://github.com/<your-username>/<repository-name>.git
cd <repository-name>
pip install -r requirements.txt
```
- Download your kaggle.json file from Kaggle API.
- Place it in the appropriate directory (`~/.kaggle` on Unix or `%USERPROFILE%\.kaggle` on Windows).
- Follow Google Sheets API documentation to set up credentials.
- Place the credentials in the project directory.
```shell
python bagging.py
```
- When prompted, enter a search term to find datasets on Kaggle (e.g., "Titanic", "Housing Prices").
- A list of datasets matching your search will be displayed.
- Enter the number corresponding to the dataset you want to download.
- Enter a name for a new folder where the dataset will be downloaded and unzipped.
- If the downloaded dataset contains multiple .csv files, the script will load the first .csv file by default.
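The first-CSV default can be sketched roughly as below. The folder name `demo_data` and its contents are placeholders, and whether "first" means alphabetical order depends on how the script lists files:

```python
import glob
import os

import pandas as pd

# Demo: create a folder with two CSV files, standing in for an unzipped
# Kaggle dataset (hypothetical names and values)
os.makedirs("demo_data", exist_ok=True)
pd.DataFrame({"a": [1]}).to_csv("demo_data/train.csv", index=False)
pd.DataFrame({"b": [2]}).to_csv("demo_data/test.csv", index=False)

# Load the first .csv found in the dataset folder
csv_files = sorted(glob.glob(os.path.join("demo_data", "*.csv")))
df = pd.read_csv(csv_files[0])
```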
- The dataset is automatically loaded into a Pandas DataFrame.
- Interactively select columns for analysis.
- Handle missing values automatically.
- Specify the target column (dependent variable).
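The preprocessing steps above can be sketched with pandas. The toy frame, the median/one-hot strategy, and the column names are illustrative assumptions, not necessarily what `bagging.py` does internally:

```python
import pandas as pd

# Toy frame standing in for the interactively selected columns (hypothetical values)
df = pd.DataFrame({
    "Age": [22.0, None, 38.0],
    "Sex": ["male", "female", "female"],
    "Survived": [0, 1, 1],
})

# Fill missing numeric values with the column median (one common strategy)
df["Age"] = df["Age"].fillna(df["Age"].median())

# One-hot encode categorical variables
df = pd.get_dummies(df, columns=["Sex"], drop_first=True)

# Separate the target column (dependent variable) from the features
X = df.drop(columns=["Survived"])
y = df["Survived"]
```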
- The script splits the data into training and testing sets.
- Trains a Bagging Classifier using decision trees.
- Accuracy on the test set is displayed in the console.
- Processed data is saved as a .csv file in the save/ directory.
- Optionally, upload the dataset to Google Sheets using the `google_sheets_utils.py` script.
- Looker Studio: Example - Titanic Survival Rate
- Kaggle datasets: [Kaggle Datasets](https://www.kaggle.com/datasets)
- Scikit-learn: [Scikit-learn Documentation](https://scikit-learn.org/stable/)
- Pandas: [Pandas Documentation](https://pandas.pydata.org/docs/)
- Curses: [Curses Documentation](https://docs.python.org/3/library/curses.html)