
bagging-with-kaggle

General Machine Learning Pipeline with Bagging Classifier

This project implements a general machine learning pipeline that can be trained on any dataset downloaded from Kaggle (reusing the 'connect with Kaggle' project), with the Titanic dataset as the running example. It leverages a Bagging Classifier built on decision trees to improve model robustness, and covers data preprocessing, model training, evaluation, and saving the results.

Features

  • Interactive column selection for preprocessing.
  • Handles missing values and encodes categorical variables.
  • Implements Bagging Classifier with decision trees.
  • Exports the processed dataset to .csv and Google Sheets.

Prerequisites

Before running the code, ensure you have the following:

  • Python 3.8+
  • Kaggle API credentials for downloading the dataset.
  • Necessary Python libraries (see requirements.txt).
  • Access to Google Sheets API (if using the csv_to_sheets function).

Installation

1. Clone this repository:

git clone https://github.com/<your-username>/<repository-name>.git 
cd <repository-name>

2. Install the required Python libraries:

pip install -r requirements.txt

3. Set up the Kaggle API:

  • Download your kaggle.json file from your Kaggle account settings (API section).
  • Place it in the appropriate directory (~/.kaggle on Linux/macOS or %USERPROFILE%\.kaggle on Windows).
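
Once kaggle.json is in place, a quick way to confirm the credentials are picked up (a minimal sketch using the kaggle package from requirements.txt):

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads kaggle.json from the directory above
print("Kaggle API authenticated")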

4. Configure the Google Sheets API (only needed for the csv_to_sheets export). The usual service-account setup is:

  • Enable the Google Sheets API (and Google Drive API) in a Google Cloud project.
  • Create a service account and download its JSON key file.
  • Share the target spreadsheet with the service account's email address.

Usage

1. Run the main script:

python bagging.py

2. Fetch a dataset from Kaggle:

  • When prompted, enter a search term to find datasets on Kaggle (e.g., "Titanic", "Housing Prices").
  • A list of datasets matching your search will be displayed.
  • Enter the number corresponding to the dataset you want to download.
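
Under the hood, this search step can be expressed with the Kaggle API; a sketch of the idea (the exact prompts in bagging.py may differ):

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()

term = input("Search term: ")             # e.g. "Titanic"
datasets = api.dataset_list(search=term)  # matching datasets
for i, ds in enumerate(datasets):
    print(f"{i}: {ds.ref}")               # e.g. "owner/dataset-name"
choice = int(input("Dataset number: "))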

3. Specify a folder to save the dataset:

  • Enter a name for a new folder where the dataset will be downloaded and unzipped.


4. Dataset selection:

  • If the downloaded dataset contains multiple .csv files, the script will load the first .csv file by default.
  • The dataset is automatically loaded into a Pandas DataFrame.

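Steps 3 and 4 reduce to downloading the chosen dataset into the new folder and reading the first .csv it contains. A sketch, assuming pandas and the api, datasets, and choice variables from the previous snippet:

import glob
import pandas as pd

folder = input("Folder name: ")
api.dataset_download_files(datasets[choice].ref, path=folder, unzip=True)

csv_files = sorted(glob.glob(f"{folder}/*.csv"))
df = pd.read_csv(csv_files[0])  # the first .csv is loaded by default
print(df.head())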

5. Follow the prompts in bagging.py:

  • Interactively select columns for analysis.
  • Handle missing values automatically.
  • Specify the target column (dependent variable).

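The preprocessing stage boils down to keeping the selected columns, filling missing values, and encoding categorical features. A minimal sketch of that logic (column names are illustrative for the Titanic example; the actual prompts in bagging.py may differ):

from sklearn.preprocessing import LabelEncoder

selected = ["Pclass", "Sex", "Age", "Fare", "Survived"]  # chosen interactively
data = df[selected].copy()

# Fill missing values: median for numeric columns, mode for the rest.
for col in data.columns:
    if data[col].dtype.kind in "if":
        data[col] = data[col].fillna(data[col].median())
    else:
        data[col] = data[col].fillna(data[col].mode()[0])

# Label-encode the remaining categorical columns.
for col in data.select_dtypes(include="object").columns:
    data[col] = LabelEncoder().fit_transform(data[col])

target = "Survived"  # dependent variable chosen at the prompt
X, y = data.drop(columns=target), data[target]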

6. Model Training and Evaluation:

  • The script splits the data into training and testing sets.
  • Trains a Bagging Classifier using decision trees.
  • Accuracy on the test set is displayed in the console.

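A minimal version of this training step (hyperparameters are illustrative; scikit-learn 1.2+ takes the estimator argument, older releases base_estimator):

from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Bagging: train many decision trees on bootstrap samples and vote.
model = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    random_state=42,
)
model.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))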

7. Save Results:

  • Processed data is saved as a .csv file in the save/ directory.

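The save step amounts to the following (the file name is illustrative):

import os

os.makedirs("save", exist_ok=True)
data.to_csv("save/processed_dataset.csv", index=False)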


  • Optionally, upload the processed dataset to Google Sheets using the google_sheets_utils.py script (see the sketch below).

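google_sheets_utils.py presumably wraps a gspread flow along these lines (the spreadsheet title and credentials path are assumptions):

import gspread

gc = gspread.service_account(filename="credentials.json")  # service-account key
sh = gc.create("bagging-with-kaggle results")              # or gc.open(...) for an existing sheet
ws = sh.sheet1

# Write the header row plus all data rows in one call.
ws.update(values=[data.columns.tolist()] + data.values.tolist(), range_name="A1")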

8. Create a Looker Studio report from the Google Sheet:

  • In Looker Studio, add the exported Google Sheet as a data source to build dashboards on top of the results.



Acknowledgments

