Author: Andrew Kwon
This project compares three regression models (Decision Tree Regressor, Random Forest Regressor, and Linear Regression) for predicting a numerical target value. Model performance is evaluated using the symmetric Mean Absolute Percentage Error (sMAPE). The main tasks completed in this project are:
- Data inspection and cleaning
- Data analysis and visualization
- Feature analysis
- Training and evaluating regression models using cross-validation and a custom scorer (sMAPE)
The Python code, analysis, and solution are contained in a Jupyter notebook.
This project was provided by Zyfra, a developer of efficiency solutions for heavy industry. The task is to prepare a prototype machine learning model that predicts the amount of gold recovered from gold ore. The model will help optimize gold production and eliminate unprofitable parameters.
The source dataset contains all of the pertinent parameters and outputs of the technological process for gold ore refinement. It has already been split into a training set and a test set. Some features are absent from the test set; this is addressed during the project. The project goal is to find the best model according to the sMAPE metric, using the predicted rougher and final concentrate recovery values.
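One way to find the features that are absent from the test set is to compare column indexes with pandas. The sketch below uses toy DataFrames with hypothetical column names in place of the real CSV files; the actual columns and the handling chosen in the notebook may differ.

```python
import pandas as pd

# Toy stand-ins for the real training and test files.
# All column names here are hypothetical, chosen only for illustration.
train = pd.DataFrame({
    "feature_a": [1.0, 2.0],
    "feature_b": [3.0, 4.0],
    "target_rougher": [80.0, 85.0],
    "target_final": [60.0, 65.0],
})
test = pd.DataFrame({"feature_a": [1.5], "feature_b": [3.5]})

# Columns present in the training set but absent from the test set.
missing_in_test = train.columns.difference(test.columns)

# Restrict the training features to columns the test set also provides,
# so a model is never fitted on inputs it will not see at prediction time.
shared_features = train.columns.intersection(test.columns)
X_train = train[shared_features]

print(sorted(missing_in_test))
```

Training only on the shared columns is one simple design choice; it avoids fitting on values that are unavailable when the model is applied.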
We are provided data on gold ore extraction and purification in the following csv files:
- gold_recovery_train.csv
- gold_recovery_test.csv
- gold_recovery_full.csv
The 87 columns in the original dataset mostly describe concentration levels of various substances at different stages of the refining process; they are not enumerated here. Due to upload size limitations, each file was compressed into a 7-Zip archive. Users will need to extract the files and place them in the appropriate directory before use.
The formulas for the calculations are detailed in the notebook. The recovery formula uses the following quantities:
- C - share of gold in the concentrate right after flotation (rougher concentrate recovery), or after purification (final concentrate recovery)
- F - share of gold in the feed before flotation (rougher concentrate recovery), or after flotation (final concentrate recovery)
- T - share of gold in the rougher tails right after flotation (rougher concentrate recovery), or after purification (final concentrate recovery)
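With the quantities defined above, recovery can be sketched in Python. This is a minimal sketch that assumes the recovery formula commonly used for this flotation task, $\text{Recovery} = \frac{C \times (F - T)}{F \times (C - T)} \times 100\%$; the notebook remains the authoritative source. The function name `recovery` and the choice to return NaN for undefined cases are assumptions made for illustration.

```python
import numpy as np

def recovery(c, f, t):
    """Recovery in percent from the gold shares in concentrate (c), feed (f), and tails (t).

    Assumes Recovery = C * (F - T) / (F * (C - T)) * 100.
    """
    c, f, t = (np.asarray(x, dtype=float) for x in (c, f, t))
    denominator = f * (c - t)
    # The formula is undefined where the denominator is zero; return NaN there.
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(denominator != 0, c * (f - t) / denominator * 100, np.nan)

# Illustrative values only, not taken from the dataset.
example = recovery(c=20.0, f=8.0, t=2.0)
```

Comparing values computed this way against a recovery column already present in the data is a common sanity check for the training set.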
The sMAPE formula uses the following quantities:
- $y_i$ : target value for observation $i$
- $\hat{y_i}$ : predicted target value for observation $i$
- $n$ : number of observations in the sample
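The metric can be sketched with numpy using the symbols above. This assumes the usual definition $\text{sMAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|y_i - \hat{y_i}|}{(|y_i| + |\hat{y_i}|) / 2} \times 100\%$. Treating a term as zero when both the target and the prediction are zero is an assumption of this sketch, not something stated in the project.

```python
import numpy as np

def smape(y_true, y_pred):
    """Symmetric Mean Absolute Percentage Error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = np.abs(y_true - y_pred)
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 2
    # Assumption: when target and prediction are both zero, count the term as zero error.
    ratio = np.divide(diff, denominator, out=np.zeros_like(diff), where=denominator != 0)
    return float(np.mean(ratio) * 100)

# Illustrative values only.
score = smape([100.0, 200.0], [110.0, 180.0])
```

Lower values are better, and a perfect prediction gives 0.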
The notebook uses the following libraries and modules:
- pandas
- numpy
- plotly.express
- plotly.graph_objects
- sklearn.metrics
- sklearn.tree
- sklearn.ensemble
- sklearn.linear_model
- sklearn.dummy
- sklearn.model_selection
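Using the modules listed above, the cross-validated model comparison with a custom sMAPE scorer might look like the following. This is a sketch on synthetic data from `make_regression`, not the project's dataset; the hyperparameters, the number of folds, and the `DummyRegressor` baseline are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

def smape(y_true, y_pred):
    """sMAPE in percent; assumes no pair of values is simultaneously zero."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    denominator = (np.abs(y_true) + np.abs(y_pred)) / 2
    return float(np.mean(np.abs(y_true - y_pred) / denominator) * 100)

# greater_is_better=False makes scikit-learn negate the metric,
# so the scores returned by cross_val_score are <= 0.
smape_scorer = make_scorer(smape, greater_is_better=False)

# Synthetic regression data, shifted so that all targets are positive.
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
y = y - y.min() + 50.0

models = {
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "linear_regression": LinearRegression(),
    "dummy_mean": DummyRegressor(strategy="mean"),  # baseline for a sanity check
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring=smape_scorer)
    results[name] = -scores.mean()  # flip the sign back to a positive sMAPE

for name, value in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: {value:.2f}%")
```

The baseline gives a reference point: a model is only worth keeping if its mean sMAPE is clearly below that of the constant predictor.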