Ensemble-Based Machine Learning Approach for Accurate and Cost-Effective Early Detection of Harmful Algal Blooms

This repository contains the official implementation of the research paper "Ensemble-Based Machine Learning Approach for Accurate and Cost-Effective Early Detection of Harmful Algal Blooms." The approach combines ensemble learning with data augmentation and multi-agent systems to improve the early detection of harmful algal blooms (HABs).

Overview

Harmful Algal Blooms (HABs) present significant ecological and economic challenges, costing the U.S. $50 million annually while harming public health and degrading water quality. Traditional detection methods, such as manual sampling, analysis, satellite monitoring, and sensing, are time-consuming and expensive, and they lack real-time monitoring capabilities.

This study introduces an ensemble-based machine learning approach to predict corrected Chlorophyll-a concentration, a key indicator of HABs. Initial tests showed that the ensemble model predicted low Chlorophyll-a concentrations more accurately than high ones. To address this, we used large language models (LLMs) to generate synthetic data for high-value cases, effectively oversampling the long tail of the data. This data augmentation reduced the ensemble model's prediction error by 4.77% compared to training on the original dataset alone. Moreover, our final model achieved a 66.10% reduction in RMSE compared to conventional models that use satellite data.

This approach provides a scalable, cost-effective solution for early HAB detection and strengthens AI-driven environmental monitoring and prediction systems.

Requirements

To install the required dependencies, run:

pip install -r requirements.txt

Processing Data

Before processing the data, delete the output folder and its contents if they exist. The preprocess.py script recreates the output folder and generates new files in it.

To process the data, run this command:

python preprocess.py
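The cleanup step described above can be scripted so that stale files are never left behind. The following is a minimal sketch, not part of the repository: it assumes the folder is literally named `output` and sits next to `preprocess.py`, and the helper names `reset_output_dir` and `run_preprocessing` are hypothetical.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def reset_output_dir(output_dir: str = "output") -> bool:
    """Delete the output folder and its contents if it exists.

    The default folder name "output" is an assumption; adjust it to
    match the folder that preprocess.py actually writes to.
    Returns True if a folder was removed, False if there was nothing to do.
    """
    path = Path(output_dir)
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False


def run_preprocessing(output_dir: str = "output", script: str = "preprocess.py") -> None:
    """Clear old outputs, then run the preprocessing script with the current interpreter."""
    reset_output_dir(output_dir)
    # check=True raises CalledProcessError if the script exits with an error.
    subprocess.run([sys.executable, script], check=True)
```

Calling `run_preprocessing()` from the repository root would then be equivalent to deleting the folder by hand and running the command above.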

Training

This project trains four models: a Random Forest, Gradient Boosting, a Neural Network, and a stacked ensemble model. To train all models at once, run:

python train.py
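For readers unfamiliar with stacking, the sketch below shows one common way to combine the three base model types into a stacked ensemble using scikit-learn. It is an illustration under stated assumptions, not the code in train.py: the hyperparameters, the Ridge meta-learner, the synthetic data from `make_regression`, and the function names `build_stacked_model` and `train_and_score` are all hypothetical choices.

```python
import math

from sklearn.datasets import make_regression
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    StackingRegressor,
)
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_stacked_model(random_state: int = 0) -> StackingRegressor:
    """Stack a random forest, gradient boosting, and a small neural network."""
    base_models = [
        ("random_forest", RandomForestRegressor(n_estimators=50, random_state=random_state)),
        ("gradient_boosting", GradientBoostingRegressor(random_state=random_state)),
        # Neural networks are sensitive to feature scale, so standardize first.
        ("neural_network", make_pipeline(
            StandardScaler(),
            MLPRegressor(hidden_layer_sizes=(32,), max_iter=1000, random_state=random_state),
        )),
    ]
    # The meta-learner (Ridge, an assumed choice) is fit on cross-validated
    # predictions of the base models.
    return StackingRegressor(estimators=base_models, final_estimator=Ridge(), cv=3)


def train_and_score(random_state: int = 0) -> float:
    """Fit the stacked model on synthetic data and return its test RMSE."""
    X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=random_state)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=random_state
    )
    model = build_stacked_model(random_state)
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    return math.sqrt(mean_squared_error(y_test, predictions))
```

The synthetic data stands in for the preprocessed dataset only so the sketch runs on its own; the RMSE it produces says nothing about the results reported in the paper.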

Evaluation

We provide multiple scripts to evaluate the models' performance on different metrics:

To evaluate the Root Mean Squared Error (RMSE) of the models, run:

python loss_RMSE.py

To evaluate the residuals of the models' predictions, run:

python residuals.py

To evaluate the models' percent error, run:

python percent_error.py
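The three metrics above can be written out in a few lines. This is a minimal sketch using common textbook definitions, which may differ in detail from what loss_RMSE.py, residuals.py, and percent_error.py compute: residuals are taken as observed minus predicted, and percent error as the absolute error relative to the observed value. The helper `percent_reduction` illustrates how a relative RMSE reduction, like those quoted in the Overview, is typically calculated. All function names and sample values are hypothetical.

```python
import math


def residuals(y_true, y_pred):
    """Observed minus predicted value for each sample (assumed sign convention)."""
    return [t - p for t, p in zip(y_true, y_pred)]


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    res = residuals(y_true, y_pred)
    return math.sqrt(sum(r * r for r in res) / len(res))


def percent_error(y_true, y_pred):
    """Absolute error as a percentage of the observed value (observed must be nonzero)."""
    return [abs(t - p) / abs(t) * 100.0 for t, p in zip(y_true, y_pred)]


def percent_reduction(baseline, improved):
    """Relative reduction of an error metric, in percent of the baseline."""
    return (baseline - improved) / baseline * 100.0


# Illustrative values only; these are not measurements from the paper.
observed = [2.0, 4.0, 10.0]
predicted = [1.0, 5.0, 8.0]

print(residuals(observed, predicted))      # [1.0, -1.0, 2.0]
print(percent_error(observed, predicted))  # [50.0, 25.0, 20.0]
print(percent_reduction(2.0, 1.5))         # 25.0
```

Because percent error divides by the observed value, it weights errors on low concentrations more heavily than RMSE does, which is one reason to look at more than one metric.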

Pre-trained Models

You can download pre-trained models from the following link:

Google Drive

Results

The results are reported in three parts: Percent Error, Residuals, and RMSE Cross Validation.

Web-Application Demo

We have developed a simple web application to demonstrate the functionality and use of our model for HAB prediction/detection. You can access it here: Web-app demo.

The web application is built using Streamlit. The code for the app is in the following folder: Streamlit App Code.

Contributing

We welcome contributions! To contribute:

1. Fork the repository.
2. Clone your fork: git clone https://github.com/your-username/Ensemble-ML-for-HABs-Detection.git
3. Create a new branch: git checkout -b feature-or-bugfix-name
4. Make your changes and commit: git commit -m "Description of changes"
5. Push to your fork: git push origin feature-or-bugfix-name
6. Open a Pull Request.

License

This project is licensed under the MIT License. See the LICENSE file for details.
