OpenFoodFactClustering

This project explores the application of clustering algorithms to categorize food products based on their nutritional content. The goal is to identify distinct nutritional profiles within a diverse dataset using K-Means, Fuzzy C-Means, and DBSCAN clustering methods.

Introduction

The project addresses the challenge of clustering food products based on nutritional attributes to improve dietary recommendations and health outcomes. By leveraging unsupervised learning methods, this research aims to identify meaningful clusters in food data.

Installation

To set up the project environment, follow these steps:

  1. Clone the repository:
    git clone https://github.com/Wei-RongRong2/OpenFoodFactClustering.git
  2. Navigate to the project directory:
    cd OpenFoodFactClustering
  3. Install the required Python packages:
    pip install -r requirements.txt

Usage

To run the clustering analysis, follow these steps:

  1. Ensure you have Jupyter Notebook installed. If not, you can install it using:

    pip install notebook
  2. Navigate to the project directory where the Jupyter Notebook is located:

    cd OpenFoodFactClustering
  3. Launch Jupyter Notebook:

    jupyter notebook
  4. In the Jupyter Notebook interface, open the OpenFoodFactClustering.ipynb file.

  5. Download the dataset from Open Food Facts, rename it to en.openfoodfacts.org.products.tsv, and place it in the same directory as the Jupyter Notebook.

  6. Run the cells in the notebook to execute the clustering analysis.
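
For orientation, here is a minimal sketch of how the renamed file might be loaded with pandas. It is not taken from the notebook: the helper name `load_products` and its `nrows` option are illustrative assumptions. The export is tab-separated, so `sep="\t"` is required.

```python
import pandas as pd


def load_products(path="en.openfoodfacts.org.products.tsv", nrows=None):
    """Read the tab-separated Open Food Facts export into a DataFrame.

    Hypothetical helper for illustration; the notebook may load the file differently.
    """
    # low_memory=False lets pandas infer each column's dtype from the whole
    # file, which avoids mixed-type warnings on wide, sparse exports.
    return pd.read_csv(path, sep="\t", low_memory=False, nrows=nrows)
```

Calling `load_products(nrows=10_000)` reads only the first rows, which can be handy for a quick check before running the full analysis.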

Dashboard

A simple dashboard has been created using Streamlit to visualize the clustering results. You can view the dashboard online at the following URL:

OpenFoodFactClustering Dashboard

The code for the dashboard and the CSV files containing the results are located in the Dashboard folder within this repository.

To run the dashboard locally, follow these steps:

  1. Navigate to the Dashboard folder:

    cd Dashboard
  2. If you have not installed the full set of dependencies for the project and only want to view the dashboard, install the required packages by running:

    pip install -r requirements.txt

    (This requirements.txt file is located in the Dashboard folder.)

  3. Run the Streamlit application:

    streamlit run Dashboard.py

Methodology

The project utilizes the Open Food Facts dataset and applies K-Means, Fuzzy C-Means, and DBSCAN algorithms to cluster food products. The dataset undergoes preprocessing, including missing value handling, data validation, and outlier removal.

Data Collection

  • Source: Open Food Facts dataset (see References for the Kaggle link)
  • Size: 356,027 rows and 163 columns
  • Attributes: Product names, categories, nutritional information, ingredients, labels, and packaging details

Preprocessing

  • Missing Values: Removed columns with >20% missing data; imputed missing values in the remaining columns.
  • Data Validation: Identified and corrected or removed invalid data and extreme outliers.
  • Data Types: One-hot encoded categorical variables; scaled numerical features.
  • Duplicate Data: Removed duplicate rows and redundant columns.
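
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's code: the helper name `preprocess`, median imputation for numeric columns, and the `"unknown"` placeholder for categorical columns are all assumptions, and validation and outlier handling are left out because the README does not specify the rules used.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess(df, max_missing=0.20):
    """Hypothetical helper sketching the preprocessing steps listed above."""
    # Remove exact duplicate rows.
    df = df.drop_duplicates()
    # Keep only columns whose share of missing values is at most max_missing.
    df = df.loc[:, df.isna().mean() <= max_missing].copy()

    num_cols = list(df.select_dtypes(include="number").columns)
    cat_cols = [c for c in df.columns if c not in num_cols]

    if num_cols:
        # Assumption: median imputation for numeric columns.
        df[num_cols] = df[num_cols].fillna(df[num_cols].median())
        # Standardize to zero mean and unit variance.
        df[num_cols] = StandardScaler().fit_transform(df[num_cols])
    if cat_cols:
        # Assumption: a placeholder category for missing categorical values.
        df[cat_cols] = df[cat_cols].fillna("unknown")
        # One-hot encode the categorical columns.
        df = pd.get_dummies(df, columns=cat_cols)
    return df
```

On the real dataset, one-hot encoding is only practical for low-cardinality columns; free-text fields such as product names would need to be dropped or handled separately before a call like this.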

Clustering Algorithms

  1. K-Means Clustering: Used for partitioning the data into k clusters based on nutritional attributes.
  2. Fuzzy C-Means Clustering: Allows for overlapping clusters with varying degrees of membership.
  3. DBSCAN Clustering: Density-based algorithm to identify clusters of varying shapes and sizes, with noise detection.
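
As a rough illustration of how the three algorithms differ in use, the sketch below runs each on small synthetic data. It assumes scikit-learn for K-Means and DBSCAN; because scikit-learn does not ship Fuzzy C-Means, a plain NumPy version (the hypothetical `fuzzy_c_means` function) stands in for whatever implementation the notebook uses. All parameter values (`n_clusters=3`, `eps=0.5`, `min_samples=5`, `m=2.0`) are placeholders, not the project's tuned settings.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans


def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Plain NumPy Fuzzy C-Means sketch; returns (centers, memberships)."""
    rng = np.random.default_rng(seed)
    # Random initial memberships, each row normalized to sum to 1.
    U = rng.random((X.shape[0], c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Centers are membership-weighted means of the points.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Distance from every point to every center (floored to avoid division by zero).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)
        # Membership update: u_ik is proportional to d_ik ** (-2 / (m - 1)).
        inv = d ** (-2.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return centers, U


# Synthetic stand-in for scaled nutritional features (not the real dataset).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2)) for loc in (0.0, 3.0, 6.0)])

# K-Means: every point gets exactly one hard label.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Fuzzy C-Means: every point gets a membership degree for each cluster.
centers, U = fuzzy_c_means(X, c=3)
fcm_labels = U.argmax(axis=1)  # hard labels derived from the soft memberships

# DBSCAN: the number of clusters is not fixed in advance; label -1 marks noise.
dbscan_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
```

The membership matrix `U` is what distinguishes Fuzzy C-Means here: a row with several similar values indicates a product sitting between nutritional profiles, which a hard K-Means label would hide.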

Results

The clustering analysis aimed to uncover distinct nutritional profiles within the dataset, but all three methods produced clusters that overlap to some degree. The key findings are:

  • K-Means: Four clusters were identified, but there was notable overlap, which may indicate the inherent complexity of the data.
  • Fuzzy C-Means: Clustering coherence and separation improved after tuning, yet some overlap persisted.
  • DBSCAN: Tuning led to better-defined clusters, although overlap remained a challenge.

These results suggest that while clustering algorithms provided some insights, the complexity of the data presented difficulties in achieving clear, non-overlapping clusters. Further refinement or alternative approaches may be needed to enhance cluster distinctiveness.
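
One way to quantify the kind of overlap described above is with the internal validation metrics named in the References. The sketch below assumes scikit-learn's implementations and uses synthetic, deliberately overlapping data; the helper name `score_clustering` and the range of `k` values are illustrative assumptions, and the numbers it prints say nothing about the project's actual results.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)


def score_clustering(X, labels):
    """Return three internal validation metrics for one labelling (hypothetical helper)."""
    return {
        "silhouette": silhouette_score(X, labels),                # in [-1, 1], higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
    }


# Synthetic, deliberately overlapping blobs as a stand-in for the real features.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 1.0, size=(100, 2)) for loc in (0.0, 2.0, 4.0)])

# Compare several values of k; heavy overlap tends to show up as low silhouette scores.
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores = score_clustering(X, labels)
    print(k, {name: round(float(value), 3) for name, value in scores.items()})
```

When scoring DBSCAN output this way, points labelled -1 (noise) are usually filtered out first, since the metrics would otherwise treat noise as one more cluster.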

For a more detailed explanation of these steps and results, refer to the full report: Report - Clustering Food Products based on Nutritional Attributes.pdf.

Contributing

Contributions are welcome! Please fork this repository, make your changes in a new branch, and submit a pull request for review.

  1. Fork the repo
  2. Create a feature branch (git checkout -b feature-name)
  3. Commit your changes (git commit -am 'Add some feature')
  4. Push to the branch (git push origin feature-name)
  5. Create a new Pull Request

Acknowledgments

This project was developed in collaboration with limjosun. We worked together on the clustering analysis, dashboard development, and project documentation.

License

This project is part of an academic course and is intended for educational purposes only. It may contain references to copyrighted materials, and the use of such materials is strictly for academic use. Please consult your instructor or institution for guidance on sharing or distributing this work.

For more details, see the LICENSE file.

Contact

Created by Wei-RongRong2 - feel free to contact me!
For any inquiries, you can also reach out to limjosun.

References

  • Open Food Facts Dataset: Kaggle Link
  • Machine Learning Algorithms: Scikit-Learn Documentation
  • Evaluation Metrics: "Silhouette Score," "Davies-Bouldin Index," "Calinski-Harabasz Index"