diff --git a/Linear-reg.html b/Linear-reg.html index 15daa45..a1e8c90 100644 --- a/Linear-reg.html +++ b/Linear-reg.html @@ -189,192 +189,26 @@

Mathematical Explanation

-

Relationship of regression lines

+
  1. Positive Linear Relationship: If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, such a relationship is termed a positive linear relationship.
  2. Negative Linear Relationship: If the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis, such a relationship is called a negative linear relationship.
  3. -
- - -

Types of Linear Regression

Linear regression can be further divided into two types of algorithm:

-
    -
  1. Simple Linear Regression: If a single independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear Regression.
  2. -
  3. Multiple Linear Regression: If more than one independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Multiple Linear Regression.
  4. -
- -

Mathematical Explanation:

-

There are parameters $\beta_0$, $\beta_1$, and $\sigma^2$, such that for any fixed value of the independent variable $x$, the dependent variable is a random variable related to $x$ through the model equation:

- - $$y=\beta_0 + \beta_1 x +\epsilon$$ - -

where

- -

The goal of linear regression is to estimate the values of the regression coefficients

- -

This algorithm explains the linear relationship between the dependent (output) variable $y$ and the independent (predictor) variable $x$ using a straight line $y = \beta_0 + \beta_1 x$.

- - -

1.2. Goal

- - - -

Note:

- - - -

1.3. Calculating the regression parameters

In simple linear regression, there is only one independent variable ( x ) and one dependent variable ( y ). The parameters (coefficients) in simple linear regression can be calculated using the method of ordinary least squares (OLS). The equations and formulas involved in calculating the parameters are as follows:

-

Model Representation:

-

The simple linear regression model can be represented as: - $$y = \beta_0 + \beta_1 x + \epsilon$$

-

Therefore, we can write:

-

$\epsilon = y - \beta_0 - \beta_1 x$.

- -
    -
  1. Cost Function or mean squared error (MSE):

    -

    The MSE measures the average squared difference between the predicted values ($\hat{y}$) and the actual values of the dependent variable ($y$). It is given by:

    -

    $$MSE = \frac{1}{n} \sum (y_i - \hat{y}_i)^2$$

    -

    Where:

    - -
  2. +
+ -
  • Minimization of the Cost Function:

    -

    The parameters β 0 and β 1 are estimated by minimizing the cost function. The formulas for calculating the parameter estimates are derived from the derivative of the cost function with respect to each parameter.

    -

    The parameter estimates are given by:

    - -

    The estimated parameters β 0 ^ and β 1 ^ provide the values of the intercept and slope that best fit the data according to the simple linear regression model.

    -
  • -
  • Prediction:

    -

    Once the parameter estimates are obtained, predictions can be made using the equation:

    -

    $$\hat{y} = \hat{\beta_0} + \hat{\beta_1} x$$

    -

    Where:

    - -
  • -

    These equations and formulas allow for the calculation of the parameters in simple linear regression using the method of ordinary least squares (OLS). By minimizing the sum of squared differences between predicted and actual values, the parameters are determined to best fit the data and enable prediction of the dependent variable.

    - - -
    -

    Gradient Descent for Linear Regression: -

    -
    +

    Types of Linear Regression

    +

    Linear regression can be further divided into two types of algorithm (a brief fitting sketch follows the list below):

    +
      +
    1. Simple Linear Regression: If a single independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Simple Linear Regression.
    2. +
    3. Multiple Linear Regression: If more than one independent variable is used to predict the value of a numerical dependent variable, then such a Linear Regression algorithm is called Multiple Linear Regression.
    4. +
    +
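For concreteness, here is a minimal sketch (added for illustration, not part of the original tutorial) of fitting both variants with scikit-learn; the data and variable names are invented for the example, and NumPy and scikit-learn are assumed to be available:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Simple linear regression: a single independent variable.
x_simple = rng.uniform(0, 10, size=(100, 1))
y_simple = 3.0 + 2.0 * x_simple[:, 0] + rng.normal(0, 1, size=100)
simple_model = LinearRegression().fit(x_simple, y_simple)
print("simple:", simple_model.intercept_, simple_model.coef_)

# Multiple linear regression: several independent variables.
X_multi = rng.uniform(0, 10, size=(100, 3))
y_multi = 1.0 + X_multi @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1, size=100)
multi_model = LinearRegression().fit(X_multi, y_multi)
print("multiple:", multi_model.intercept_, multi_model.coef_)
```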

    Assumptions of Linear Regression

      @@ -406,7 +240,7 @@

      Assumptions of Linear Regression

    -

    Evaluation Metrics for Linear Regression

    +

    Model Evaluation

    To train an accurate linear regression model, we need a way to quantify how well (or poorly) our model performs. In machine learning, we call such performance-measuring functions loss functions. Several popular loss functions exist for regression problems. To measure our model's performance, we'll use one of the most popular: mean squared error (MSE). Here are some commonly used evaluation metrics:

    @@ -465,109 +299,293 @@

    Evaluation Metrics for Linear Regression

    -
  • Adjusted R-squared:

    The Adjusted R-squared accounts for the number of independent variables in the model. It penalizes the inclusion of irrelevant variables and rewards the inclusion of relevant variables.

    $$\boxed{\text{Adjusted}~R^2 = 1-\left[\frac{(1 - R^2)(n - 1)}{n - p - 1}\right]}$$

    Where:

    -

    A higher Adjusted $R^2$ value indicates a better fit of the model while considering the complexity of the model.

    -

    These evaluation metrics help assess the performance of a linear regression model by quantifying the accuracy of the predictions and the extent to which the independent variables explain the dependent variable. It is important to consider multiple metrics to gain a comprehensive understanding of the model's performance.

    +

    A higher Adjusted R-squared value indicates a better fit while accounting for the complexity of the model.

    +

    These evaluation metrics help assess the performance of a linear regression model by quantifying the accuracy of the predictions and the extent to which the independent variables explain the dependent variable. It is important to consider multiple metrics to gain a comprehensive understanding of the model's performance.
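To make the formula concrete, the following small sketch (an illustrative addition, assuming NumPy) computes $R^2$ and Adjusted $R^2$ directly from predictions, with n and p as defined above and hypothetical example values:

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

def adjusted_r2(y_true, y_pred, p):
    """Adjusted R^2 for a model with p independent variables."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical values for illustration only.
y_true = np.array([3.1, 4.2, 5.0, 6.3, 7.1])
y_pred = np.array([3.0, 4.0, 5.2, 6.1, 7.4])
print(adjusted_r2(y_true, y_pred, p=1))
```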

  • -
    - -
    -

    Understanding and Addressing Fitting Issues in Machine Learning Models

    -

    Overfitting and underfitting are two common problems encountered in machine learning. They occur when a machine learning model fails to generalize well to new data.

    -
    - -
    -
    -

    Overfitting

    - - -

    Underfitting

    - - -

    Bias (Systematic Error):

    - -

    Variance (Random Error):

    - +

    + Selecting An Evaluation Metric: +

    Many methods exist for evaluating regression models, each with different concerns around interpretability, theory, and usability. The evaluation metric should reflect whatever it is you actually care about when making predictions. For example, when we use MSE, we are implicitly saying that we think the cost of our prediction error should reflect the quadratic (squared) distance between what we predicted and what is correct. This may work well if we want to punish outliers, or if our loss is minimized by the mean of the data, but it comes at the cost of interpretability: we report our error in squared units (though this can be fixed with RMSE). If instead we want our error to reflect the linear distance between what we predicted and what is correct, or we want a loss minimized by the median, we could try something like Mean Absolute Error (MAE). Whatever the case, you should think of your evaluation metric as part of your modeling process, and select the best metric based on the specific concerns of your use case.
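As a rough illustration of how these choices weight errors differently (this snippet is an addition, with made-up numbers), the sketch below compares MSE, RMSE, and MAE on the same predictions, one of which is an outlier:

```python
import numpy as np

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_pred = np.array([10.5, 11.5, 11.0, 13.5, 30.0])  # the last prediction is a large outlier

errors = y_true - y_pred
mse = np.mean(errors ** 2)      # penalises the outlier quadratically
rmse = np.sqrt(mse)             # same idea, but back in the original units
mae = np.mean(np.abs(errors))   # treats all errors linearly

print(f"MSE={mse:.2f}  RMSE={rmse:.2f}  MAE={mae:.2f}")
```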

    +
    + Are Our Coefficients Valid?: +

    In research publications and statistical software, coefficients of regression models are often presented with associated p-values. These p-values come from traditional null hypothesis statistical tests: t-tests are used to measure whether a given coefficient is significantly different from zero (the null hypothesis that a particular coefficient $\beta_i$ equals zero), while F-tests are used to measure whether any of the terms in a regression model are significantly different from zero. Different opinions exist on the utility of such tests.
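If you want to see such tests in practice, one common option (shown here only as an illustrative sketch with synthetic data; statsmodels is assumed to be installed) is the OLS summary from statsmodels, which reports a t-test and p-value per coefficient along with an overall F-test:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# The second feature has no true effect, so its coefficient should look insignificant.
y = 1.0 + 2.0 * X[:, 0] + rng.normal(size=200)

# add_constant appends an intercept column; the summary table reports a t-test
# (and p-value) per coefficient and an F-test for the model as a whole.
results = sm.OLS(y, sm.add_constant(X)).fit()
print(results.summary())
```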

    +
    + -

    Data Leakage:

    - +
    +

    Mathematical Explanation:

    +

    There are parameters $\beta_0$ and $\beta_1$, and a random error term $\epsilon$, such that for any fixed value of the independent variable $x$, the dependent variable is a random variable related to $x$ through the model equation:

    + + $$y=\beta_0 + \beta_1 x +\epsilon$$ -

    Model Instability:

    +

    where

    - -

    Multicollinearity:

    +

    The goal of linear regression is to estimate the values of the regression coefficients

    + +

    This algorithm explains the linear relationship between the dependent (output) variable $y$ and the independent (predictor) variable $x$ using a straight line $y = \beta_0 + \beta_1 x$.

    + + +

    1.2. Goal

    + +
  • The goal of the linear regression algorithm is to get the best values for $\beta_0$ and $\beta_1$ to find the best-fit line.
  • +
  • The best-fit line is the line with the least error, meaning the error between predicted values and actual values should be minimal.
  • +
  • For a dataset with $n$ observations $(x_i, y_i)$, where $i = 1, 2, 3, \dots, n$, the above function can be written as follows:

    + +

    $$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
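For intuition, here is a brief sketch (an addition to the text) that generates synthetic observations from this model equation, assuming example values $\beta_0 = 2$, $\beta_1 = 3$ and Gaussian noise for $\epsilon_i$:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100
beta_0, beta_1 = 2.0, 3.0        # assumed "true" parameters for the example

x = rng.uniform(0, 5, size=n)    # independent variable
eps = rng.normal(0, 1, size=n)   # random error term
y = beta_0 + beta_1 * x + eps    # y_i = beta_0 + beta_1 * x_i + eps_i
```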

    -

    Imbalanced Data:

    - +

    1.3. Calculating the regression parameters

    In simple linear regression, there is only one independent variable ( x ) and one dependent variable ( y ). The parameters (coefficients) in simple linear regression can be calculated using the method of ordinary least squares (OLS). The equations and formulas involved in calculating the parameters are as follows:

    +

    Model Representation:

    +

    The simple linear regression model can be represented as: + $$y = \beta_0 + \beta_1 x + \epsilon$$

    +

    Therefore, we can write:

    +

    $\epsilon = y - \beta_0 - \beta_1 x$.

    +
      +
    1. Cost Function or mean squared error (MSE):

      +

      The MSE measures the average squared difference between the predicted values ($\hat{y}$) and the actual values of the dependent variable ($y$). It is given by:

      + + $$MSE = \frac{1}{n} \sum (y_i - \hat{y}_i)^2$$ + +

      Where:

      + +
        +
      • n is the number of data points.
      • +
      • $y_i$ is the actual value of the dependent variable for the i-th data point.
      • +
      • $\hat{y}_i$ is the predicted value of the dependent variable for the i-th data point.
      • +
      +
    2. +
    3. Minimization of the Cost Function:

      +

      The parameters $\beta_0$ and $\beta_1$ are estimated by minimizing the cost function. The formulas for calculating the parameter estimates are derived from the derivative of the cost function with respect to each parameter.

      +

      The parameter estimates are given by:

      +
        +
      • $$\hat{\beta}_1 = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$
      • +
      • $$\hat{\beta}_0 = \text{mean}(y) - \hat{\beta}_1 \times \text{mean}(x)$$

      • +

        Where:

        +
          +
        • $\hat{\beta}_0$ is the estimated $y$-intercept.

          +
        • +
        • $\hat{\beta}_1$ is the estimated slope.
        • +
        • $\mathrm{Cov}(x, y)$ is the covariance between $x$ and $y$.
        • +
        • $\mathrm{Var}(x)$ is the variance of $x$.
        • +
        • $\text{mean}(x)$ is the mean of $x$.
        • +
        • $\text{mean}(y)$ is the mean of $y$.

          +
        + +
      +

      The estimated parameters $\hat{\beta}_0$ and $\hat{\beta}_1$ provide the values of the intercept and slope that best fit the data according to the simple linear regression model.

      +
    4. +
    5. Prediction:

      +

      Once the parameter estimates are obtained, predictions can be made using the equation:

      +

      $$\hat{y} = \hat{\beta_0} + \hat{\beta_1} x$$

      +

      Where:

      +
        +
      • $\hat{y}$ is the predicted value of the dependent variable.
      • +
      • $\hat{\beta}_0$ is the estimated y-intercept.
      • +
      • $\hat{\beta}_1$ is the estimated slope.
      • +
      • x is the value of the independent variable for which the prediction is being made.

        +
      • +
      +
    6. +

      These equations and formulas allow for the calculation of the parameters in simple linear regression using the method of ordinary least squares (OLS). By minimizing the sum of squared differences between predicted and actual values, the parameters are determined to best fit the data and enable prediction of the dependent variable.
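Putting these steps together, the following minimal from-scratch sketch (added for illustration; the data points are hypothetical) computes the OLS estimates, makes predictions, and reports the MSE:

```python
import numpy as np

def ols_fit(x, y):
    """Closed-form simple linear regression via ordinary least squares."""
    x_bar, y_bar = np.mean(x), np.mean(y)
    # beta_1_hat = Cov(x, y) / Var(x)
    beta_1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # beta_0_hat = mean(y) - beta_1_hat * mean(x)
    beta_0_hat = y_bar - beta_1_hat * x_bar
    return beta_0_hat, beta_1_hat

def predict(x, beta_0_hat, beta_1_hat):
    return beta_0_hat + beta_1_hat * x

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Hypothetical data for demonstration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.4, 10.1])
b0, b1 = ols_fit(x, y)
print("intercept:", b0, "slope:", b1, "MSE:", mse(y, predict(x, b0, b1)))
```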

      +
    + +
    +

    Gradient Descent for Linear Regression: +

      +
    • A regression model uses the gradient descent algorithm to update the coefficients of the line: starting from randomly selected coefficient values, it iteratively updates them to reduce the cost function until a minimum is reached.
    • +
    • Gradient Descent is an iterative optimization algorithm commonly used in machine learning to find the optimal parameters in a model. It can also be applied to linear regression to estimate the parameters (coefficients) that minimize the cost function.
    • +
    • The steps involved in using Gradient Descent for Linear Regression are as follows: +
        +
      1. Define the Cost Function: The cost function for linear regression is the Mean Squared Error (MSE), which measures the average squared difference between the predicted values (ŷ) and the actual values (y) of the dependent variable.
      2. $$MSE = \frac{1}{2n} \sum (y_i - \hat{y}_i)^2$$ (Here the cost is scaled by an extra factor of $\frac{1}{2}$, which simplifies the derivatives below and does not change where the minimum lies.)

        Where:

        +
          +
        • n is the number of data points.
        • +
        • $y_i$ is the actual value of the dependent variable for the i-th data point; $\hat{y}_i$ is the predicted value of the dependent variable for the i-th data point.
        • +
        +
      3. Initialize the Parameters: Start by initializing the parameters (coefficients) with random values. Typically, they are initialized as zero or small random values.
      4. +
      5. Calculate the Gradient: Compute the gradient of the cost function with respect to each parameter. The gradient represents the direction of steepest ascent in the cost function space. $$\frac{\partial (MSE)}{\partial \beta_0} = \frac{1}{n}\sum (\hat{y}_i - y_i)$$ $$\frac{\partial (MSE)}{\partial \beta_1} = \frac{1}{n}\sum (\hat{y}_i - y_i)\times x_i$$

        Where:

        +
          +
        • $\frac{\partial (MSE)}{\partial \beta_0}$ is the gradient with respect to the y-intercept parameter ($\beta_0$).
        • +
        • $\frac{\partial (MSE)}{\partial \beta_1}$ is the gradient with respect to the slope parameter ($\beta_1$).
        • +
        • $\hat{y}_i$ is the predicted value of the dependent variable for the i-th data point.
        • +
        • $y_i$ is the actual value of the dependent variable for the i-th data point.
        • +
        • $x_i$ is the value of the independent variable for the i-th data point.
        • +
        +
      6. +
      7. Update the Parameters: Update the parameters using the gradient and a learning rate ($\alpha$), which determines the step size in each iteration. $$\beta_0 = \beta_0 - \alpha \times \frac{\partial (MSE)}{\partial \beta_0}$$ $$\beta_1 = \beta_1 - \alpha \times \frac{\partial (MSE)}{\partial \beta_1}$$

        Repeat this update process for a specified number of iterations or until the change in the cost function becomes sufficiently small.

        +
      8. +
      9. Predict: Once the parameters have converged or reached the desired number of iterations, use the final parameter values to make predictions on new data. $$\hat{y} = \beta_0 + \beta_1 x$$

        Where:

        +
          +
        • $\hat{y}$ is the predicted value of the dependent variable.
        • +
        • $\beta_0$ is the $y$-intercept parameter.
        • +
        • $\beta_1$ is the slope parameter.
        • +
        • x is the value of the independent variable for which the prediction is being made.
        • +
        +

        Gradient Descent iteratively adjusts the parameters by updating them in the direction of the negative gradient until it reaches a minimum point in the cost function. This process allows for the estimation of optimal parameters in linear regression, enabling the model to make accurate predictions on unseen data.
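The steps above translate into a short loop. The following sketch (added for illustration; the learning rate, iteration count, and data are assumed values) implements the updates for simple linear regression:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, n_iters=5000):
    """Estimate beta_0 and beta_1 by minimising the MSE with gradient descent."""
    beta_0, beta_1 = 0.0, 0.0              # initialise the parameters
    for _ in range(n_iters):
        y_hat = beta_0 + beta_1 * x        # current predictions
        grad_0 = np.mean(y_hat - y)        # d(MSE)/d(beta_0)
        grad_1 = np.mean((y_hat - y) * x)  # d(MSE)/d(beta_1)
        beta_0 -= alpha * grad_0           # update step with learning rate alpha
        beta_1 -= alpha * grad_1
    return beta_0, beta_1

# Hypothetical data for demonstration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 6.2, 8.4, 10.1])
print(gradient_descent(x, y))
```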

        +
        + +
        +
        +

        Let’s take an example to understand this. If we want to go from the top-left of the cost curve down to the bottom of the pit, a discrete number of steps can be taken to reach the bottom.

        +
          +
        • If you decide to take larger steps each time, you may reach the bottom sooner, but there’s a chance that you overshoot the bottom of the pit and end up nowhere near it.
        • +
        • In the gradient descent algorithm, the number of steps you’re taking can be considered as the learning rate, i.e. $\alpha$, and this decides how fast the algorithm converges to the minima.
        • +
        +


        +
      10. +
      +
    • +
    +
    +
  • +

    -

    +
    +

    Understanding and Addressing Fitting Issues in Machine Learning Models

    +

    Overfitting and underfitting are two common problems encountered in machine learning. They occur when a machine learning model fails to generalize well to new data.

    +
    + +
    +
    +
      +
    1. Overfitting: +
        +
      • Description: Overfitting occurs when a machine learning model learns the training data too well, including the noise and irrelevant patterns. As a result, the model becomes too complex and fails to capture the underlying relationships in the data. This leads to poor performance on unseen data.
      • +
      • Signs of overfitting: +
          +
        • The model performs well on the training data but poorly on unseen data.
        • +
        • The model is complex and has a large number of parameters.
        • +
        +
      • +
      • Causes: Too complex model, excessive training time, or insufficient regularization.
      • +
      +
    2. + + +
    3. Underfitting
    4. + + +
    5. Bias (Systematic Error):
    6. + + +
    7. Variance (Random Error):
    8. + + +
    9. Data Leakage:
    10. + + +
    11. Model Instability:
    12. + + +
    13. Multicollinearity:
    14. + + +
    15. Imbalanced Data:
    16. + +
    + +
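One practical way to see these failure modes (a sketch added here, assuming scikit-learn and synthetic data) is to compare training and validation error as model complexity grows: an underfit model scores poorly on both, while an overfit model scores well on training data but poorly on held-out data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(x[:, 0]) + rng.normal(0, 0.2, size=80)
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.3, random_state=0)

for degree in (1, 3, 15):  # likely underfit, reasonable, likely overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(x_train))
    val_mse = mean_squared_error(y_val, model.predict(x_val))
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  validation MSE={val_mse:.3f}")
```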

    Preventing Overfitting and Underfitting

    There are a number of techniques that can be used to prevent overfitting and underfitting. These include: +
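As one concrete, hedged example of such a technique (regularization, mentioned above as a remedy for overfitting; this snippet is an illustrative addition with synthetic data), ridge regression shrinks the coefficients of an over-flexible model and often improves validation error compared with plain least squares:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(x[:, 0]) + rng.normal(0, 0.2, size=80)
x_tr, x_val, y_tr, y_val = train_test_split(x, y, test_size=0.3, random_state=1)

# Fit the same over-flexible polynomial model with and without an L2 penalty.
for name, estimator in [("plain OLS", LinearRegression()), ("ridge (alpha=1.0)", Ridge(alpha=1.0))]:
    model = make_pipeline(PolynomialFeatures(degree=15), estimator)
    model.fit(x_tr, y_tr)
    print(name, "validation MSE:", round(mean_squared_error(y_val, model.predict(x_val)), 3))
```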
    + +
    +
    +
    + +
    +
    diff --git a/assets/img/data-engineering/overfitting-preventaion.png b/assets/img/data-engineering/overfitting-preventaion.png new file mode 100644 index 0000000..dc3332d Binary files /dev/null and b/assets/img/data-engineering/overfitting-preventaion.png differ diff --git a/assets/img/data-engineering/overfitting-preventaion1.png b/assets/img/data-engineering/overfitting-preventaion1.png new file mode 100644 index 0000000..93cf6ad Binary files /dev/null and b/assets/img/data-engineering/overfitting-preventaion1.png differ