Summary

This project was developed for CS4168 - Data Mining at UL. It showcases data mining techniques including exploratory data analysis (EDA), data preparation, clustering, classification, and regression.

Overview

The project examined energy consumption in the steel industry, using Python and Jupyter Notebook for the analysis. It began with Exploratory Data Analysis (EDA), covering statistical summaries, distribution patterns, correlations, and temporal trends in the dataset. Data preparation followed, with cleaning, formatting, imputation, and transformation to make the data consistent and usable. Clustering techniques were then applied to identify groupings in the data that may correspond to distinct energy consumption profiles. Classification tasks used a variety of classifiers to predict energy consumption behaviour. Finally, regression analysis was used to model and predict energy consumption, with the aim of supporting decisions on optimizing energy usage and improving operational efficiency in the steel industry.
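The imputation and transformation steps mentioned above can be illustrated with a minimal, dependency-free sketch. It is not taken from the repository: the function names `impute_mean` and `min_max_scale`, the choice of mean imputation and min-max scaling, and the sample readings are all assumptions made for illustration.

```python
from statistics import mean


def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to [0, 1]; a constant column maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical energy readings (kWh) with one missing entry.
usage_kwh = [2.0, None, 4.0, 6.0]
imputed = impute_mean(usage_kwh)   # the gap is filled with the mean, 4.0
scaled = min_max_scale(imputed)    # values now lie between 0 and 1
print(imputed, scaled)
```

Mean imputation is only one option; a median or a time-aware interpolation may suit skewed or temporal data better.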

Summary

The project analyzed energy consumption in the steel industry using Python and Jupyter Notebook. It began with Exploratory Data Analysis (EDA) to uncover trends and patterns: statistical summaries, analysis of distributions and correlations, and time series analysis, visualized with histograms, pair plots, and heatmaps. Data preparation included cleaning, formatting, imputation, transformation, feature selection, sampling, and validation. The clustering stage used K-Means and hierarchical clustering together with MDS and t-SNE (both projection techniques rather than clustering algorithms), evaluated with the elbow method and silhouette score. Classification involved training SVM, Random Forest, K-Neighbors, MLP, and Naïve Bayes models and comparing them on accuracy, precision, recall (also called the true positive rate, TPR), F1-score, and AUC. Regression models, including Random Forest, Linear Regression, and Lasso Regression, were evaluated and tuned using grid search and dimensionality reduction techniques. Together, these steps aimed to improve the understanding and management of energy consumption in the steel industry.
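The classification metrics listed above can be computed from confusion-matrix counts. The following is a minimal sketch for the binary case only, written without external libraries; the function names, the binary setting, and the sample labels are assumptions for illustration and are not taken from the repository.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall (TPR), and F1 as a dict."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0  # identical to TPR
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels: 1 = high consumption, 0 = otherwise.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

AUC is omitted here because it needs predicted scores or probabilities rather than hard labels.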

How to use this repo

Clone the repository with git clone
Create a virtual environment named venv
Install the dependencies: pip install -r requirements.txt
Check out the main branch
Create a new branch, for example "task-{task name, e.g. eda}-{your name}"
Push your branch to origin [DO NOT MERGE WITH MAIN]


mihson95/data_mining_project
