
# Online Video Characteristics and Transcoding Time

A project carried out for the Computer-Aided Decision-Making course.

The project should include:

  • Description of the input data and its exploratory analysis/visualisation
  • Normalisation or standardisation of data if necessary
  • Theoretical introduction to the tool used and its learning methods
  • Analysis of the quality of the resulting tool (classifier, regression, ...) using relevant metrics
  • Analysis of how the quality of the proposed tool changes when the configuration parameters change. Have the model parameters been chosen optimally? (cross-validation, grid search)

## Description of the input data and its exploratory analysis/visualisation

| Column name | Description |
| --- | --- |
| duration | Video duration |
| codec | Type of video coding |
| width | Frame width in pixels |
| height | Frame height in pixels |
| bitrate | Video bit rate |
| framerate | Number of frames per second |
| i, p, b | Number of I-, P- and B-frames in the video |
| frames | Total number of frames in the video |
| i_size, p_size, b_size | Total size of the I-, P- and B-frames |
| size | Total video size |
| o_bitrate | Video bit rate after compression |
| o_framerate | Frame rate after compression |
| o_width, o_height | Width and height of frame after compression |
| umem | Memory used by the video compression process |
| utime | Time taken to process the video |

### Exploratory analysis

*(figure: summary statistics of the dataset)*

### Exploratory visualisation

*(figure: correlation heatmap of the numerical features)*

Parameters that may have a substantial impact on transcoding time are: video resolution (both input and output), bitrate, and the number of frames of each type in the video (I-frames, P-frames, B-frames). The heatmap above does not include the input and output codecs, as they are not represented as numerical values. They may nevertheless affect transcoding time and are worth analysing.
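The correlation analysis behind such a heatmap can be sketched as follows; the toy DataFrame below stands in for the real dataset, with a small subset of the column names from the table above.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Toy stand-in for the transcoding dataset (a few numerical columns only)
df = pd.DataFrame({
    "duration": [10.0, 20.0, 15.0, 30.0],
    "bitrate": [500.0, 1200.0, 800.0, 1500.0],
    "utime": [1.2, 3.5, 2.1, 4.8],
})

# Pairwise correlations between the numerical columns
corr = df.select_dtypes("number").corr()

# Visualise the correlation matrix as a heatmap
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Correlation heatmap")
```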

## Normalisation or standardisation of data if necessary

Since two columns in the data contained non-numerical values, we expanded them using One-Hot Encoding (one new 0/1 column per distinct value in each non-numerical column), so that the resulting data is suitable for regression: all features are numerical.

*(figure: the data after One-Hot Encoding)*
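A minimal sketch of this encoding step with `pandas.get_dummies`; the column names `codec` and `o_codec` for the two non-numerical columns are assumptions for illustration.

```python
import pandas as pd

# Toy frame with two categorical columns standing in for the real dataset
# (column names "codec" and "o_codec" are assumed here)
df = pd.DataFrame({
    "codec": ["h264", "mpeg4", "h264", "vp8"],
    "o_codec": ["h264", "h264", "vp8", "mpeg4"],
    "utime": [1.2, 3.5, 2.1, 4.8],
})

# One-Hot Encoding: each distinct value becomes its own 0/1 column,
# so the number of new columns equals the number of distinct values
encoded = pd.get_dummies(df, columns=["codec", "o_codec"])
print(encoded.columns.tolist())
```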

## Theoretical introduction to the tool used and its learning methods

  • Jupyter Notebook

    An interactive computing environment for creating and sharing documents with live code, equations, visualizations, and narrative text.

  • Scikit-Learn (sklearn)

    A machine learning library in Python.

    • train_test_split

      A function for splitting data into training and testing sets, commonly used in machine learning workflows.

    • fit_transform

      A method used in transformers to fit the model to the data and transform it simultaneously.

    • transform

      A method used to apply transformations to the input data based on the learned parameters from fit_transform.

    • predict

      A method used to make predictions on new data based on a trained model.

    • mean_squared_error

      A metric for evaluating the performance of a regression model by measuring the average squared difference between predicted and actual values.

    • cross_val_score

      A function for performing cross-validated scoring of a machine learning model to assess its generalization performance.

  • Pandas

    A fast, powerful, and flexible open-source data manipulation and analysis library for Python. It provides data structures like DataFrame for efficient data handling.

  • NumPy

    A fundamental package for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with mathematical functions.

  • Matplotlib

    A comprehensive library for creating static, animated, and interactive visualizations in Python. It is often used for creating plots and charts.

  • Seaborn

    A statistical data visualization library based on Matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.
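The sklearn pieces listed above fit together roughly as follows; this sketch uses synthetic data in place of the transcoding dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline

# Synthetic regression data standing in for the real features/target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.5, -2.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=200)

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Standardise: fit on training data only, then reuse the learned parameters
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# Train, predict, and evaluate with mean squared error
model = LinearRegression().fit(X_train_s, y_train)
mse = mean_squared_error(y_test, model.predict(X_test_s))

# Cross-validated scoring of the whole scale-then-fit pipeline
scores = cross_val_score(
    make_pipeline(StandardScaler(), LinearRegression()), X, y, cv=5
)
```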

## Analysis of the quality of the resulting tool (classifier, regression, ...) using relevant metrics

We first ran tests with linear regression, standardising the data beforehand.

*(figure: linear regression results)*

We then evaluated PolynomialFeatures.

*(figure: polynomial regression results)*

Unfortunately, on our data the polynomial degree cannot exceed 2: with this number of samples and features, PolynomialFeatures generates too many columns and the computation becomes infeasible. RFE does not help here either, because it would have to operate on those same polynomial features.
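The blow-up is combinatorial: with n input features, a degree-d expansion produces C(n + d, d) columns. A small sketch of the degree-2 step:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# 3 samples, 2 features (toy data)
X = np.arange(6, dtype=float).reshape(3, 2)

# Degree-2 expansion: 2 features -> 6 columns (1, a, b, a^2, ab, b^2)
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
print(X_poly.shape)  # (3, 6)
```

With the dataset's dozens of features (especially after One-Hot Encoding), the column count at degree 3 and above grows far faster, which is what makes higher degrees impractical here.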

We then tried DecisionTreeRegressor and RandomForestRegressor, which produced the much better results shown below.

*(figure: decision tree regression results)*

*(figure: random forest regression results)*

| Model | MSE | R² Score | Best Hyperparameters |
| --- | --- | --- | --- |
| Linear Regression | 90.3715 | 0.6478 | {} |
| Decision Tree Regression | 6.4607 | 0.9748 | {'max_depth': 30, 'min_samples_split': 2} |
| Random Forest Regression | 3.4193 | 0.9867 | {'max_depth': 30, 'min_samples_split': 2, 'n_estimators': 200} |
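A sketch of the kind of grid search that yields such hyperparameters, using GridSearchCV on synthetic data; the parameter grid mirrors the tuned values in the table, but the data and grid choices here are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic nonlinear regression data standing in for the real dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=100)

# Grid over the hyperparameters reported in the results table
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [10, 30],
    "min_samples_split": [2, 5],
}

# Exhaustive cross-validated search over the grid
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```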

In conclusion, the ensemble approach of the Random Forest, combined with careful hyperparameter tuning, delivered superior regression performance compared with both the Linear Regression and Decision Tree models. These findings provide useful guidance for selecting the most suitable model for predictive tasks in this context.