Project carried out for the Computer-Aided Decision-Making course
The project should include:
- Description of the input data and its exploratory analysis/visualisation
- Normalisation or standardisation of data if necessary
- Theoretical introduction to the tool used and its learning methods
- Analysis of the quality of the resulting tool (classifier, regression, ...) using relevant metrics
- Analysis of how the quality of the proposed tool changes when the configuration parameters are changed. Have the model parameters been chosen optimally? (cross-validation, grid search)
Column name | Description |
---|---|
duration | Video duration |
codec | Type of video coding |
width | Frame width in pixels |
height | Frame height in pixels |
bitrate | Video bit rate |
framerate | Number of frames per second |
i, p, b | Number of I, P and B frames in the video |
frames | Total number of frames in the video |
i_size, p_size, b_size | Total size of the I-, P- and B-frames, respectively |
size | Total video size |
o_bitrate | Video bit rate after compression |
o_framerate | Frame rate after compression |
o_width, o_height | Width and height of frame after compression |
umem | Memory used by the video compression process |
utime | The time it takes to process video |
Parameters that may have a substantial impact on transcoding time are the video resolution (both input and output), the bitrate, and the number of frames of each type in the video (I-frames, P-frames, B-frames). The input and output codecs are not included in the heatmap above because they are not represented as numerical values; they may nevertheless affect transcoding time and are worth analysing.
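A minimal sketch of how such a correlation heatmap can be produced with pandas and seaborn (the file name and separator are assumptions; substitute the actual dataset path):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the transcoding measurements (file name and separator are assumptions).
df = pd.read_csv("transcoding_measurement.tsv", sep="\t")

# Correlation matrix over the numeric columns only; 'utime' is the transcoding-time target.
corr = df.select_dtypes(include="number").corr()

# Heatmap of the pairwise correlations referenced above.
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation between numeric features and transcoding time (utime)")
plt.tight_layout()
plt.show()
```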
Since two columns in the data contained non-numerical values, we converted them with One-Hot Encoding (the number of new columns equals the number of distinct values in the original non-numerical columns), so that the resulting data is suitable for regression (all features are numerical).
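A minimal sketch of the encoding step using pandas' `get_dummies`; the column names `codec` and `o_codec` (input and output codec) are assumptions:

```python
import pandas as pd

# One-Hot Encode the two non-numeric columns; each distinct codec value becomes its own 0/1 column.
# The column names 'codec' and 'o_codec' are assumed here.
df_encoded = pd.get_dummies(df, columns=["codec", "o_codec"])

# All remaining features are now numeric and suitable for regression.
print(df_encoded.dtypes)
```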
Tools and functions used in the project:
- Jupyter Notebook: an interactive computing environment for creating and sharing documents with live code, equations, visualizations, and narrative text.
- scikit-learn: a machine learning library in Python.
- train_test_split: a function for splitting data into training and testing sets, commonly used in machine learning workflows.
- fit_transform: a method used in transformers to fit the model to the data and transform it simultaneously.
- transform: a method used to apply transformations to the input data based on the parameters learned by fit_transform.
- predict: a method used to make predictions on new data based on a trained model.
- mean_squared_error: a metric for evaluating the performance of a regression model by measuring the average squared difference between predicted and actual values.
- cross_val_score: a function for performing cross-validated scoring of a machine learning model to assess its generalization performance.
- pandas: a fast, powerful, and flexible open-source data manipulation and analysis library for Python; it provides data structures such as DataFrame for efficient data handling.
- NumPy: a fundamental package for scientific computing in Python; it provides support for large, multi-dimensional arrays and matrices, along with mathematical functions.
- Matplotlib: a comprehensive library for creating static, animated, and interactive visualizations in Python, often used for creating plots and charts.
- Seaborn: a statistical data visualization library based on Matplotlib; it provides a high-level interface for drawing attractive and informative statistical graphics.
We first carried out tests using linear regression, standardising the data beforehand.
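A sketch of this baseline under the stated assumptions: `utime` is the regression target, the encoded frame `df_encoded` from the previous step is used, and the split proportions and random seed are illustrative:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# 'utime' (transcoding time) is the target; 'umem' could also be excluded if desired.
X = df_encoded.drop(columns=["utime"])
y = df_encoded["utime"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardise on the training set only, then apply the same transformation to the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit the baseline linear model and evaluate it on the held-out data.
model = LinearRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

print("MSE:", mean_squared_error(y_test, y_pred))
print("R2: ", r2_score(y_test, y_pred))
```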
We then evaluated PolynomialFeatures. Unfortunately, for our data the polynomial degree cannot exceed 2, because the number of generated features, and with it the memory and computation required, becomes prohibitive. With this many samples and features, PolynomialFeatures is too demanding, and RFE (recursive feature elimination) did not help much either, because it was applied on top of the polynomial features.
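For illustration, a degree-2 expansion on the standardised split from the previous sketch could look as follows (the degree and `include_bias` setting are assumptions):

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Degree 2 is the practical upper limit here: the number of generated features grows
# combinatorially with the degree, which quickly exhausts memory and compute time.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)

model = LinearRegression()
model.fit(X_train_poly, y_train)
print("MSE:", mean_squared_error(y_test, model.predict(X_test_poly)))
```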
We then tried DecisionTreeRegressor and RandomForestRegressor, which produced the favourable results shown in the table below; a sketch of the accompanying grid search follows the table.
Model | MSE | R2 Score | Best Hyperparameters |
---|---|---|---|
Linear Regression | 90.3715 | 0.6478 | {} |
Decision Tree Regression | 6.4607 | 0.9748 | {'max_depth': 30, 'min_samples_split': 2} |
Random Forest Regression | 3.4193 | 0.9867 | {'max_depth': 30, 'min_samples_split': 2, 'n_estimators': 200} |
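A hedged sketch of how the grid search over tree hyperparameters could be set up (the parameter grids, number of CV folds and scoring choice are assumptions built around the best values reported above; the tree models are fitted on the unscaled features):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Grid-search the decision tree; the grid values are illustrative.
tree_grid = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={"max_depth": [10, 20, 30], "min_samples_split": [2, 5, 10]},
    scoring="neg_mean_squared_error",
    cv=5,
)
tree_grid.fit(X_train, y_train)
print("Decision tree best params:", tree_grid.best_params_)

# Grid-search the random forest over the same depth/split values plus the number of trees.
forest_grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"max_depth": [10, 20, 30], "min_samples_split": [2, 5], "n_estimators": [100, 200]},
    scoring="neg_mean_squared_error",
    cv=5,
)
forest_grid.fit(X_train, y_train)
print("Random forest best params:", forest_grid.best_params_)
```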
In conclusion, the ensemble approach of the Random Forest, coupled with careful hyperparameter tuning, resulted in superior regression performance compared to both the Linear Regression and Decision Tree models. These findings provide useful guidance for selecting the most suitable model for predictive tasks in this context.