A straight line showing the relationship between the dependent and independent variables is called a regression line.
A regression line can show two types of relationship:
Positive Linear Relationship: If the dependent variable increases on the Y-axis as the independent variable increases on the X-axis, the relationship is called a positive linear relationship.

Negative Linear Relationship: If the dependent variable decreases on the Y-axis as the independent variable increases on the X-axis, the relationship is called a negative linear relationship.
Linear regression can be further divided into two types of algorithm:

Simple Linear Regression: If a single independent variable is used to predict the value of a numerical dependent variable, the algorithm is called Simple Linear Regression.

Multiple Linear Regression: If more than one independent variable is used to predict the value of a numerical dependent variable, the algorithm is called Multiple Linear Regression.
Mathematical Explanation:

There are parameters $\beta_0$, $\beta_1$, and $\sigma^2$, such that for any fixed value of the independent variable $x$, the dependent variable $y$ is a random variable related to $x$ through the model equation:

$$y=\beta_0 + \beta_1 x +\epsilon$$
where

$y$ = Dependent Variable (Target Variable)

$x$ = Independent Variable (Predictor Variable)

$\beta_0$ = intercept of the line (gives an additional degree of freedom)

$\beta_1$ = linear regression coefficient (scale factor applied to each input value)

$\epsilon$ = random error
The goal of linear regression is to estimate the values of the regression coefficients $\beta_0$ and $\beta_1$.

The algorithm describes the linear relationship between the dependent (output) variable $y$ and the independent (predictor) variable $x$ using a straight line.

The goal of the linear regression algorithm is to find the best values for $\beta_0$ and $\beta_1$, i.e. the best-fit line.

The best-fit line is the line with the least error, meaning the error between the predicted values and the actual values should be minimal.
For a dataset with $n$ observations $(x_i, y_i)$, where $i = 1, 2, \dots, n$, the above function can be written as follows:

$$y_i=\beta_0 + \beta_1 x_i +\epsilon_i$$

where $y_i$ is the value of the $i$-th observation of the dependent variable (outcome variable) in the sample, $x_i$ is the value of the $i$-th observation of the independent variable or feature in the sample, $\epsilon_i$ is the random error (also known as the residual) in predicting the value of $y_i$, and $\beta_0$ and $\beta_1$ are the regression parameters (or regression coefficients or feature weights).
Note:

The quantity $\epsilon$ in the model equation is the "error" -- a random variable, assumed to be symmetrically distributed with mean zero, i.e. $E(\epsilon)=0$.

It is to be noted that no further assumptions have been made about the distribution of $\epsilon$ yet.

$\beta_0$ (the intercept of the true regression line) is the average value of $Y$ when $x$ is zero.

$\beta_1$ (the slope of the true regression line) is the expected (average) change in $Y$ associated with a 1-unit increase in the value of $x$.

What is $\sigma^2$? It is a measure of how much the values of $Y$ spread out about the mean value (the homogeneity-of-variance assumption).
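To make the roles of $\beta_0$, $\beta_1$, and $\sigma^2$ concrete, here is a minimal Python sketch that simulates data from the model equation; the specific parameter values are illustrative choices, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative "true" parameters (chosen for this example only)
beta_0, beta_1 = 2.0, 0.5   # intercept and slope of the true regression line
sigma = 1.0                 # standard deviation of the error term (sigma^2 is its variance)

n = 100
x = rng.uniform(0, 10, size=n)          # independent variable
epsilon = rng.normal(0, sigma, size=n)  # random error with E(epsilon) = 0
y = beta_0 + beta_1 * x + epsilon       # model equation: y = beta_0 + beta_1*x + epsilon
```

These synthetic `x` and `y` arrays are reused in the sketches that follow.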
In simple linear regression, there is only one independent variable ($x$) and one dependent variable ($y$). The parameters (coefficients) in simple linear regression can be calculated using the method of ordinary least squares (OLS). The equations and formulas involved in calculating the parameters are as follows:
Model Representation:

The simple linear regression model can be represented as:

$$y = \beta_0 + \beta_1 x + \epsilon$$

Therefore, the predicted value can be written as:

$$\hat{y} = \beta_0 + \beta_1 x$$
Cost Function or Mean Squared Error (MSE):

The MSE measures the average squared difference between the predicted values ($\hat{y}_i$) and the actual values of the dependent variable ($y_i$). It is given by:

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Where:

$n$ is the number of data points.

$y_i$ is the actual value of the dependent variable for the i-th data point.

$\hat{y}_i$ is the predicted value of the dependent variable for the i-th data point.
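As a quick illustration, here is a minimal NumPy sketch of this cost function (the function and variable names are my own):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error between actual and predicted values."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Example: three actual values vs. three predicted values
print(mse([3.0, 5.0, 7.0], [2.5, 5.5, 8.0]))  # 0.5
```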
Minimization of the Cost Function:

The parameters $\beta_0$ and $\beta_1$ are estimated by minimizing the cost function. The formulas for the parameter estimates are derived by setting the derivative of the cost function with respect to each parameter to zero.

The parameter estimates are given by:

$$\hat{\beta_1} = \frac{Cov(x, y)}{Var(x)} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

$$\hat{\beta_0} = \bar{y} - \hat{\beta_1}\,\bar{x}$$

Where:

$\hat{\beta_0}$ is the estimated $y$-intercept.

$\hat{\beta_1}$ is the estimated slope.

$Cov(x, y)$ is the covariance between $x$ and $y$.

$Var(x)$ is the variance of $x$.

$\bar{x}$ is the mean of $x$.

$\bar{y}$ is the mean of $y$.

The estimated parameters $\hat{\beta_0}$ and $\hat{\beta_1}$ provide the values of the intercept and slope that best fit the data according to the simple linear regression model.
Prediction:

Once the parameter estimates are obtained, predictions can be made using the equation:

$$\hat{y} = \hat{\beta_0} + \hat{\beta_1} x$$

Where:

$\hat{y}$ is the predicted value of the dependent variable.

$\hat{\beta_0}$ is the estimated $y$-intercept.

$\hat{\beta_1}$ is the estimated slope.

$x$ is the value of the independent variable for which the prediction is being made.

These equations and formulas allow for the calculation of the parameters in simple linear regression using the method of ordinary least squares (OLS). By minimizing the sum of squared differences between predicted and actual values, the parameters are determined to best fit the data and enable prediction of the dependent variable.
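A minimal NumPy sketch of these closed-form OLS estimates, reusing the synthetic `x` and `y` arrays from the earlier simulation (the function and variable names are my own):

```python
import numpy as np

def fit_simple_ols(x, y):
    """Estimate beta_0 and beta_1 with the closed-form OLS formulas."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    # beta_1_hat = Cov(x, y) / Var(x)
    beta_1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    # beta_0_hat = y_bar - beta_1_hat * x_bar
    beta_0_hat = y_bar - beta_1_hat * x_bar
    return beta_0_hat, beta_1_hat

beta_0_hat, beta_1_hat = fit_simple_ols(x, y)
y_hat = beta_0_hat + beta_1_hat * x   # predictions from the fitted line
```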
Gradient Descent for Linear Regression:
A regression model can also be fit with the gradient descent algorithm, which updates the coefficients of the line by reducing the cost function: coefficient values are initialized (often randomly) and then iteratively updated until the cost function reaches its minimum.

Gradient Descent is an iterative optimization algorithm commonly used in machine learning to find the optimal parameters of a model. It can also be applied to linear regression to estimate the parameters (coefficients) that minimize the cost function.

The steps involved in using Gradient Descent for Linear Regression are as follows:

Define the Cost Function: The cost function for linear regression is the Mean Squared Error (MSE), which measures the average squared difference between the predicted values ($\hat{y}_i$) and the actual values ($y_i$) of the dependent variable.

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Where:

$n$ is the number of data points.

$y_i$ is the actual value of the dependent variable for the i-th data point.

$\hat{y}_i$ is the predicted value of the dependent variable for the i-th data point.
Initialize the Parameters: Start by initializing the parameters (coefficients). Typically, they are initialized to zero or to small random values.
Calculate the Gradient: Compute the gradient of the cost function with respect to each parameter. The gradient represents the direction of steepest ascent in the cost-function space.

$$\frac{\partial (MSE)}{\partial \beta_0} = \frac{2}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)$$

$$\frac{\partial (MSE)}{\partial \beta_1} = \frac{2}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)\, x_i$$

(The constant factor of 2 is sometimes dropped in practice, since it can be absorbed into the learning rate.)

Where:

$\frac{\partial (MSE)}{\partial \beta_0}$ is the gradient with respect to the y-intercept parameter ($\beta_0$).

$\frac{\partial (MSE)}{\partial \beta_1}$ is the gradient with respect to the slope parameter ($\beta_1$).

$\hat{y}_i$ is the predicted value of the dependent variable for the i-th data point.

$y_i$ is the actual value of the dependent variable for the i-th data point.

$x_i$ is the value of the independent variable for the i-th data point.

Update the Parameters: Update the parameters using the gradient and a learning rate ($\alpha$), which determines the step size in each iteration.

$$\beta_0 = \beta_0 - \alpha \times \frac{\partial (MSE)}{\partial \beta_0}$$

$$\beta_1 = \beta_1 - \alpha \times \frac{\partial (MSE)}{\partial \beta_1}$$

Repeat this update process for a specified number of iterations or until the change in the cost function becomes sufficiently small.
Predict: Once the parameters have converged or the desired number of iterations is reached, use the final parameter values to make predictions on new data.

$$\hat{y} = \beta_0 +\beta_1 x$$

Where:

$\hat{y}$ is the predicted value of the dependent variable.

$\beta_0$ is the $y$-intercept parameter.

$\beta_1$ is the slope parameter.

$x$ is the value of the independent variable for which the prediction is being made.
Gradient Descent iteratively adjusts the parameters by updating them in the direction of the negative gradient until it reaches a minimum point in the cost function. This process allows for the estimation of optimal parameters in linear regression, enabling the model to make accurate predictions on unseen data.
Let's take an example to understand this. If we want to go from the top-left point of the surface down to the bottom of the pit, a discrete number of steps can be taken to reach the bottom.

If you decide to take larger steps each time, you may reach the bottom sooner, but there is a risk that you overshoot the bottom of the pit and end up not even near it.

In the gradient descent algorithm, the size of the steps you take corresponds to the learning rate $\alpha$, and this decides how fast the algorithm converges to the minima.
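A minimal NumPy sketch of these steps, reusing the synthetic `x` and `y` arrays from the earlier simulation; the learning rate and iteration count are illustrative choices, not values from the text:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.01, n_iters=5000):
    """Fit simple linear regression with batch gradient descent on the MSE."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    beta_0, beta_1 = 0.0, 0.0          # initialize parameters at zero
    for _ in range(n_iters):
        y_hat = beta_0 + beta_1 * x    # current predictions
        grad_0 = (2 / n) * np.sum(y_hat - y)         # d(MSE)/d(beta_0)
        grad_1 = (2 / n) * np.sum((y_hat - y) * x)   # d(MSE)/d(beta_1)
        beta_0 -= alpha * grad_0       # update step scaled by the learning rate
        beta_1 -= alpha * grad_1
    return beta_0, beta_1

beta_0_gd, beta_1_gd = gradient_descent(x, y)  # should land close to the OLS estimates
```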
Assumptions of Linear Regression

Classical linear regression relies on a few standard assumptions: a linear relationship between the independent and dependent variables, independence of the errors, constant error variance (homoscedasticity), approximately normally distributed errors, and, in multiple regression, no severe multicollinearity among the predictors.
Model Evaluation
To train an accurate linear regression model, we need a way to quantify how good (or bad) our model performs. In machine learning, we call such performance-measuring functions loss functions. Several popular loss functions exist for regression problems.
To measure our model's performance, we'll use one of the most popular: mean-squared error (MSE). Here are some commonly used evaluation metrics:
Mean Squared Error (MSE): the average of the squared differences between the predicted and actual values, $\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$.

Root Mean Squared Error (RMSE): the square root of the MSE, which puts the error back in the same units as the dependent variable.

Mean Absolute Error (MAE): the average of the absolute differences between the predicted and actual values.

R-squared ($R^2$): the proportion of the variance in the dependent variable that is explained by the independent variable(s).

Adjusted R-squared: The Adjusted R-squared accounts for the number of independent variables in the model. It penalizes the inclusion of irrelevant variables and rewards the inclusion of relevant variables. A higher Adjusted $R^2$ value indicates a better fit of the model while considering the complexity of the model.
These evaluation metrics help assess the performance of a linear regression model by quantifying the accuracy of the predictions and the extent to which the independent variables explain the dependent variable. It is important to consider multiple metrics to gain a comprehensive understanding of the model's performance.
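A minimal NumPy sketch of these metrics, reusing `y` and `y_hat` from the OLS sketch above (the function name is my own; `p` is the number of independent variables, used for the adjusted R-squared):

```python
import numpy as np

def regression_metrics(y_true, y_pred, p=1):
    """Compute common regression evaluation metrics."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    n = len(y_true)
    residuals = y_true - y_pred
    mse = np.mean(residuals ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(residuals))
    ss_res = np.sum(residuals ** 2)                  # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)    # penalizes extra predictors
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "R2": r2, "Adjusted R2": adj_r2}

print(regression_metrics(y, y_hat))  # evaluate the fitted line from earlier
```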
Selecting an Evaluation Metric:

Many methods exist for evaluating regression models, each with different concerns around interpretability, theory, and usability. The evaluation metric should reflect whatever it is you actually care about when making predictions. For example, when we use MSE, we are implicitly saying that we think the cost of our prediction error should reflect the quadratic (squared) distance between what we predicted and what is correct. This may work well if we want to punish outliers or if our data is minimized by the mean, but it comes at the cost of interpretability: we output our error in squared units (though this may be fixed with RMSE). If instead we wanted our error to reflect the linear distance between what we predicted and what is correct, or we wanted our data minimized by the median, we could try something like Mean Absolute Error (MAE). Whatever the case, you should think of your evaluation metric as part of your modeling process, and select the best metric based on the specific concerns of your use case.
Are Our Coefficients Valid?

In research publications and statistical software, coefficients of regression models are often presented with associated p-values. These p-values come from traditional null hypothesis statistical tests: t-tests are used to measure whether a given coefficient is significantly different from zero (the null hypothesis being that a particular coefficient $\beta_i$ equals zero), while F-tests are used to measure whether any of the terms in a regression model are significantly different from zero. Different opinions exist on the utility of such tests.
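As an illustration, here is a hedged sketch of how such tests are commonly obtained in Python with the statsmodels library, assuming `x` and `y` are one-dimensional NumPy arrays like those used above:

```python
import statsmodels.api as sm

X = sm.add_constant(x)          # add an intercept column to the design matrix
model = sm.OLS(y, X).fit()      # ordinary least squares fit
print(model.params)             # estimated beta_0 and beta_1
print(model.pvalues)            # t-test p-values for each coefficient
print(model.f_pvalue)           # F-test p-value for the overall model
```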
Understanding and Addressing Fitting Issues in Machine Learning Models

Overfitting and underfitting are two common problems encountered in machine learning. They occur when a machine learning model fails to generalize well to new data.

Overfitting:

Description: Overfitting occurs when a machine learning model learns the training data too well, including the noise and irrelevant patterns. As a result, the model becomes too complex and fails to capture the underlying relationships in the data. This leads to poor performance on unseen data.
Signs of overfitting:
The model performs well on the training data but poorly on unseen data.
The model is complex and has a large number of parameters.
Causes: Too complex a model, excessive training time, or insufficient regularization.

Underfitting:

Description: Underfitting occurs when a machine learning model is too simple and does not capture the underlying relationships in the data. This results in poor performance on both the training data and unseen data.
Signs of underfitting:
The model performs poorly on both the training data and unseen data.
The model is simple and has a small number of parameters.
Causes: Model complexity is too low, insufficient training, or inadequate feature representation.

Bias (Systematic Error):

Description: The model consistently makes predictions that deviate from the true values.
Symptoms: Consistent errors in predictions across different datasets.
Causes: Insufficiently complex model, inadequate feature representation, or biased training data.

Variance (Random Error):

Description: The model's predictions are highly sensitive to variations in the training data.
Symptoms: High variability in predictions when trained on different subsets of the data.
Causes: Too complex a model, small dataset, or noisy training data.

Data Leakage:

Description: Information from the validation or test set inadvertently influences the model during training.
Causes: Improper splitting of data, using future information during training.

Model Instability:

Description: Small changes in the input data lead to significant changes in model predictions.
Symptoms: Lack of robustness in the model's performance.
Causes: Sensitivity to outliers, highly nonlinear relationships.

Multicollinearity:

Description: High correlation among independent variables in regression models.
Symptoms: Unstable coefficient estimates, difficulty in isolating the effect of individual variables.
Causes: Redundant or highly correlated features.

Imbalanced Data:

Description: A disproportionate distribution of classes in classification problems.
Symptoms: Biased models toward the majority class, poor performance on minority classes.
Causes: Inadequate representation of minority class, biased sampling.
Preventing Overfitting and Underfitting
There are a number of techniques that can be used to prevent overfitting and underfitting. These include:
Regularization: Regularization is a technique that penalizes complex models. This helps to prevent the model from learning the noise and irrelevant patterns in the training data. Common regularization techniques include L1 regularization, L2 regularization, and dropout (a minimal L2 sketch follows this list).
Early stopping: Early stopping is a technique that stops training the model when it starts to overfit on the validation data. The validation data is a subset of the training data that is held out during training and used to evaluate the model's performance.
Cross-validation: Cross-validation is a technique that divides the training data into multiple folds. The model is trained on a subset of the folds and evaluated on the remaining folds. This process is repeated multiple times so that the model is evaluated on all of the data. Cross-validation can be used to select the best hyperparameters for the model.
Model selection: Model selection is a technique that compares different models and selects the one that performs best on the validation data. This can be done using a variety of techniques, such as k-fold cross-validation or Akaike Information Criterion (AIC).
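As one concrete illustration of regularization, here is a minimal NumPy sketch of ridge (L2-regularized) linear regression in closed form; the penalty strength, feature matrix, and function name are illustrative assumptions rather than anything specified above:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: beta = (X^T X + lam*I)^(-1) X^T y.

    X is an (n, p) feature matrix without an intercept column; the intercept
    is left unpenalized by centering y and the columns of X first.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean                 # center so the intercept is not penalized
    p = Xc.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - X_mean @ beta
    return intercept, beta

# Example with the earlier one-feature data: a larger lam shrinks the slope toward zero.
intercept, beta = ridge_fit(x.reshape(-1, 1), y, lam=10.0)
```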