This project was developed for CS4168 - Data Mining at UL. It showcases data mining techniques such as EDA, data preparation, clustering, classification, and regression.
The project examines energy consumption in the steel industry, using Python and Jupyter Notebook for the analysis. It starts with Exploratory Data Analysis (EDA), covering statistical summaries, distribution patterns, correlations, and temporal trends in the dataset. Data preparation follows, with cleaning, formatting, imputation, and transformation to keep the data consistent and usable. Clustering is then applied to identify groupings in the data that may correspond to distinct energy consumption profiles. Classification tasks predict energy consumption behaviour using a variety of classifiers. Lastly, regression analysis models and predicts energy consumption, with the aim of supporting decisions on optimizing energy usage and improving operational efficiency in the steel industry.
In more detail, the EDA involved creating statistical summaries, analyzing distributions and correlations, and performing time series analysis, with visualizations such as histograms, pair plots, and heatmaps. Data preparation included cleaning, formatting, imputation, transformation, feature selection, sampling, and validation. Clustering used K-Means and hierarchical clustering alongside MDS and t-SNE embeddings, evaluated with the elbow method and silhouette score. Classification involved training SVM, Random Forest, K-Neighbors, MLP, and Naïve Bayes models and comparing them on accuracy, precision, recall, F1-score, TPR, and AUC. Regression models, including Random Forest, Linear Regression, and Lasso Regression, were evaluated and tuned using grid search and dimensionality reduction techniques.
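As an illustration of the clustering evaluation described above, the following is a minimal sketch of comparing K-Means models with the elbow method (inertia) and the silhouette score. It does not reproduce the project's notebooks: the data is a synthetic stand-in generated with `make_blobs`, and the helper name `evaluate_kmeans`, the range of `k` values, and the `random_state` are assumptions chosen for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): three well-separated groups in 2-D.
# In the project this would be the prepared steel industry feature matrix.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=42,
)

# K-Means is distance based, so features are standardized first.
X_scaled = StandardScaler().fit_transform(X)


def evaluate_kmeans(X, k_values, random_state=42):
    """Fit K-Means for each k and collect inertia and silhouette score."""
    inertias, silhouettes = [], []
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        inertias.append(model.inertia_)  # within-cluster sum of squares (elbow method)
        silhouettes.append(silhouette_score(X, labels))  # higher means better separation
    return inertias, silhouettes


k_values = list(range(2, 7))
inertias, silhouettes = evaluate_kmeans(X_scaled, k_values)

# Pick the k with the highest silhouette score.
best_k = k_values[int(np.argmax(silhouettes))]

for k, inertia, sil in zip(k_values, inertias, silhouettes):
    print(f"k={k}: inertia={inertia:.2f}, silhouette={sil:.3f}")
print(f"Best k by silhouette score: {best_k}")
```

Plotting `inertias` against `k_values` gives the usual elbow curve. The silhouette score offers a second, less visual criterion, and on real data the two may not agree, so it is worth inspecting both.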
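- Clone the repository with `git clone`.
- Create and activate a virtual environment named `venv`.
- Install the dependencies: `pip install -r requirements.txt`.
- Check out the `main` branch.
- Create a new branch named `task-{task name}-{your name}`, for example `task-eda-{your name}`.
- Push your branch to `origin` [DO NOT MERGE WITH MAIN].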