Project carried out for the Computer-Aided Decision-Making course
The project should include:
- Description of the input data and its exploratory analysis/visualisation
- Normalisation or standardisation of data if necessary
- Theoretical introduction to the tool used and its learning methods
- Analysis of the quality of the resulting tool (classifier, regression, ...) using relevant metrics
- Analysis of how the quality of the proposed tool changes when the configuration parameters are changed. Have the model parameters been chosen optimally? (cross-validation, grid search)
Column name | Description |
---|---|
duration | Video duration |
codec | Type of video coding |
width | Frame width in pixels |
height | Frame height in pixels |
bitrate | Video bit rate |
framerate | Number of frames per second |
i, p, b | Number of I, P and B frames in the video |
frames | Total number of frames in the video |
i_size, p_size, b_size | Total size of the I-, P- and B-frames, respectively |
size | Total video size |
o_bitrate | Video bit rate after compression |
o_framerate | Frame rate after compression |
o_width, o_height | Width and height of frame after compression |
umem | Memory used by the video compression process |
utime | The time it takes to process video |
Parameters that may have a substantial impact on transcoding time are the video resolution (both input and output), the bitrate, and the number of frames of each type in the video (I-frames, P-frames, B-frames). The input and output codecs are not included in the heatmap above because they are not represented as numerical values; they may nevertheless affect transcoding time and are worth analysing.
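A minimal sketch of how such a correlation heatmap can be produced with pandas and seaborn (the file name and separator are assumptions; substitute the actual dataset path):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the transcoding measurements (file name and separator are assumptions).
df = pd.read_csv("transcoding_measurement.tsv", sep="\t")

# Correlation matrix over the numeric columns only; 'utime' is the transcoding-time target.
corr = df.select_dtypes(include="number").corr()

# Heatmap of the pairwise correlations referenced above.
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation between numeric features and transcoding time (utime)")
plt.tight_layout()
plt.show()
```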
Since two columns in the data contained non-numerical values, we converted them with One-Hot Encoding (the number of new columns equals the number of distinct values in the original non-numerical columns), so that the resulting data is suitable for regression (all features are numerical).
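A minimal sketch of the encoding step using pandas' `get_dummies`; the column names `codec` and `o_codec` (input and output codec) are assumptions:

```python
import pandas as pd

# One-Hot Encode the two non-numeric columns; each distinct codec value becomes its own 0/1 column.
# The column names 'codec' and 'o_codec' are assumed here.
df_encoded = pd.get_dummies(df, columns=["codec", "o_codec"])

# All remaining features are now numeric and suitable for regression.
print(df_encoded.dtypes)
```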
Tools and functions used in the project:
- Jupyter Notebook: an interactive computing environment for creating and sharing documents with live code, equations, visualizations, and narrative text.
- scikit-learn: a machine learning library in Python.
- train_test_split: a function for splitting data into training and testing sets, commonly used in machine learning workflows.
- fit_transform: a method used in transformers to fit the model to the data and transform it simultaneously.
- transform: a method used to apply transformations to the input data based on the parameters learned by fit_transform.
- predict: a method used to make predictions on new data based on a trained model.
- mean_squared_error: a metric for evaluating the performance of a regression model by measuring the average squared difference between predicted and actual values.
- cross_val_score: a function for performing cross-validated scoring of a machine learning model to assess its generalization performance.
- pandas: a fast, powerful, and flexible open-source data manipulation and analysis library for Python; it provides data structures such as DataFrame for efficient data handling.
- NumPy: a fundamental package for scientific computing in Python; it provides support for large, multi-dimensional arrays and matrices, along with mathematical functions.
- Matplotlib: a comprehensive library for creating static, animated, and interactive visualizations in Python, often used for creating plots and charts.
- Seaborn: a statistical data visualization library based on Matplotlib; it provides a high-level interface for drawing attractive and informative statistical graphics.
We first carried out tests using linear regression, standardising the data beforehand.
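A sketch of this baseline under the stated assumptions: `utime` is the regression target, the encoded frame `df_encoded` from the previous step is used, and the split proportions and random seed are illustrative:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# 'utime' (transcoding time) is the target; 'umem' could also be excluded if desired.
X = df_encoded.drop(columns=["utime"])
y = df_encoded["utime"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardise on the training set only, then apply the same transformation to the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit the baseline linear model and evaluate it on the held-out data.
model = LinearRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

print("MSE:", mean_squared_error(y_test, y_pred))
print("R2: ", r2_score(y_test, y_pred))
```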
We then evaluated PolynomialFeatures. Unfortunately, for our data the polynomial degree cannot exceed 2, because the number of generated features, and with it the memory and computation required, becomes prohibitive. With this many samples and features, PolynomialFeatures is too demanding, and RFE (recursive feature elimination) did not help much either, because it was applied on top of the polynomial features.
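For illustration, a degree-2 expansion on the standardised split from the previous sketch could look as follows (the degree and `include_bias` setting are assumptions):

```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Degree 2 is the practical upper limit here: the number of generated features grows
# combinatorially with the degree, which quickly exhausts memory and compute time.
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train_scaled)
X_test_poly = poly.transform(X_test_scaled)

model = LinearRegression()
model.fit(X_train_poly, y_train)
print("MSE:", mean_squared_error(y_test, model.predict(X_test_poly)))
```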
We then tried DecisionTreeRegressor and RandomForestRegressor, which produced the favourable results shown in the table below; a sketch of the accompanying grid search follows the table.
Model | MSE | R2 Score | Best Hyperparameters |
---|---|---|---|
Linear Regression | 90.3715 | 0.6478 | {} |
Decision Tree Regression | 6.4607 | 0.9748 | {'max_depth': 30, 'min_samples_split': 2} |
Random Forest Regression | 3.4193 | 0.9867 | {'max_depth': 30, 'min_samples_split': 2, 'n_estimators': 200} |
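A hedged sketch of how the grid search over tree hyperparameters could be set up (the parameter grids, number of CV folds and scoring choice are assumptions built around the best values reported above; the tree models are fitted on the unscaled features):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Grid-search the decision tree; the grid values are illustrative.
tree_grid = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid={"max_depth": [10, 20, 30], "min_samples_split": [2, 5, 10]},
    scoring="neg_mean_squared_error",
    cv=5,
)
tree_grid.fit(X_train, y_train)
print("Decision tree best params:", tree_grid.best_params_)

# Grid-search the random forest over the same depth/split values plus the number of trees.
forest_grid = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid={"max_depth": [10, 20, 30], "min_samples_split": [2, 5], "n_estimators": [100, 200]},
    scoring="neg_mean_squared_error",
    cv=5,
)
forest_grid.fit(X_train, y_train)
print("Random forest best params:", forest_grid.best_params_)
```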
In conclusion, the ensemble approach of the Random Forest, coupled with careful hyperparameter tuning, resulted in superior regression performance compared to both the Linear Regression and Decision Tree models. These findings provide useful guidance for selecting the most suitable model for predictive tasks in this context.