What is Polynomial Regression?
Polynomial Regression is a process by which, given a set of inputs and their corresponding outputs, we find an nth-degree polynomial f(x) that maps the inputs to the outputs.
This f(x) is of the form:
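In general, writing the intercept as $b$ and the coefficients as $w_1, \dots, w_n$ (the same names the code below uses for the bias and the weights), the polynomial is:

$$f(x) = b + w_1 x + w_2 x^2 + \dots + w_n x^n$$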
Polynomial regression has an important advantage over linear regression: it can capture patterns that a straight line cannot. For example, if a ball is thrown upwards, its height over time follows a quadratic function, h(t) = h0 + v0*t - (1/2)*g*t^2. Similarly, Kepler's third law relates the cube of a planet's orbital distance to the square of its orbital period. Patterns like these cannot be identified using linear regression.
Generating a random dataset
To do any Polynomial Regression, the first thing we need is data.
In the first part of this tutorial, we perform polynomial regression on a randomly generated dataset to understand the concepts. Then we will do the same on some real data.
Part 1: Using a generated dataset
https://colab.research.google.com/drive/1_Xa5QG-HLPV8yxIOd5vD-dA6PHYAfvd8
We start by importing some libraries that we will be using in this tutorial.
- import numpy as np
- import matplotlib.pyplot as plt
- import tensorflow as tf
- import operator
- from sklearn.metrics import mean_squared_error, r2_score
- import pandas as pd
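Note that this code uses the TensorFlow 1.x API (placeholders, sessions and tf.train.GradientDescentOptimizer). If you are running on TensorFlow 2.x, one common workaround (an assumption about your environment, not part of the original notebook) is to fall back to the compatibility module:

```python
# Only needed on TensorFlow 2.x: fall back to the 1.x-style graph/session API
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
```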
Next we generate 20 random points, with both coordinates drawn from a standard normal distribution, and visualise them using a scatter plot.
- np.random.seed(0)
- x = np.random.normal(0, 1, 20)
- y = np.random.normal(0, 1, 20)
- plt.scatter(x,y, s=10)
- plt.show()
Doing Polynomial Regression
We perform Polynomial Regression using TensorFlow. We have to specify the degree of the polynomial we want and feed in the x data. The degree is an important parameter that we will cover later. First, we have to modify the data into a form that TensorFlow can accept. Then we set some parameters, such as the optimizer and the loss function. Finally, we train the model for 12,000 steps (epochs).
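Concretely, "modifying the data" means expanding each input value into its first deg powers. For deg = 3, the m input values become an m × 3 matrix whose i-th row holds $x_i$, $x_i^2$ and $x_i^3$; the modify_input function below builds this matrix and also divides every column by its maximum value to keep the features in a similar range:

$$X_{\text{modified}} = \begin{bmatrix} x_1 & x_1^2 & x_1^3 \\ x_2 & x_2^2 & x_2^3 \\ \vdots & \vdots & \vdots \\ x_m & x_m^2 & x_m^3 \end{bmatrix}$$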
- deg=3
- W = tf.Variable(tf.random_normal([deg,1]), name='weight')
- #bias
- b = tf.Variable(tf.random_normal([1]), name='bias')
- x_=tf.placeholder(tf.float32,shape=[None,deg])
- y_=tf.placeholder(tf.float32,shape=[None, 1])
- def modify_input(x, x_size, n_value):
-     # build the feature matrix: column i holds x^(i+1), scaled by its maximum value
-     x_new = np.zeros([x_size, n_value])
-     for i in range(n_value):
-         x_new[:, i] = np.power(x, (i + 1))
-         x_new[:, i] = x_new[:, i] / np.max(x_new[:, i])
-     return x_new
- x_modified=modify_input(x,x.size,deg)
- Y_pred=tf.add(tf.matmul(x_,W),b)
- #loss function: mean squared error
- loss = tf.reduce_mean(tf.square(Y_pred -y_ ))
- #training algorithm
- optimizer = tf.train.GradientDescentOptimizer(0.05).minimize(loss)
- #initializing the variables
- init = tf.global_variables_initializer()
- #starting the session
- sess = tf.Session()
- sess.run(init)
- epoch=12000
- for step in range(epoch):
-     # reshape y into a column vector so it matches the [None, 1] placeholder
-     _, c = sess.run([optimizer, loss], feed_dict={x_: x_modified, y_: y.reshape(-1, 1)})
-     if step % 1000 == 0:
-         print ("loss: " + str(c))
- y_test=sess.run(Y_pred, feed_dict={x_:x_modified})
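As an optional sanity check (not part of the original notebook), you can compare the gradient-descent fit against NumPy's closed-form polynomial fit. np.polyfit works on the unscaled x values, so its coefficients will not match W and b directly, but the predicted curve should look similar:

```python
# Closed-form least-squares fit of the same degree, for comparison
coeffs = np.polyfit(x, y, deg)        # coefficients, highest degree first
y_np_pred = np.polyval(coeffs, x)     # evaluate the fitted polynomial at the training points
print("np.polyfit coefficients:", coeffs)
```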
Finally, we calculate the error metrics. Note that applying np.sqrt to the mean squared error gives the root mean squared error (RMSE).
- rmse = np.sqrt(mean_squared_error(y, y_test))
- r2 = r2_score(y, y_test)
- print (rmse)
- print (r2)
1.1507521092081143
0.061440511342737425
Loss functions
We need to measure how well our model captures the patterns in the data. There are two common ways of doing this:
- Mean Squared Error (MSE)
- R-squared score (R2 score)
Let us understand the math behind these two:
Mean Square Error:
For every value of x, we have the actual value of y and the value of y that our line predicts. We find the difference between the two. Then we add the differences for each value of x. Finally we divide this by the number of values of x.
This equation has a problem though. Some times the difference will be positive and other times it will be negative. These values can cancel out and even though there may be large errors the output will show that there is no error. So to tackle this problem, we square each difference.
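Putting this together, if there are $n$ data points with actual values $y_i$ and predicted values $\hat{y}_i$, the Mean Squared Error is:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$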
R2 score:
First we have to find the mean m of all the values of y:
Then we get the difference between each value of y and the mean. We square each difference and add them. Let this value be k.
Now we take the sum of the squared differences between the actual and predicted y values (the same quantity we summed for the MSE, before dividing by n), divide it by k, and subtract the result from 1. This gives us the R2 score. The R2 score usually lies between 0 and 1 (it can be negative for a very poor fit). A large R2 score means that x predicts y well and the fitted curve explains most of the variation in y.
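In formula form, with $m$ denoting the mean of the actual y values and $k = \sum_i (y_i - m)^2$ as defined above:

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{k}$$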
Let us see how this looks in code. There are some inbuilt functions that handle the calculations for us:
- y_pred = y_test  # the model's predictions computed earlier
- rmse = np.sqrt(mean_squared_error(y, y_pred))
- r2 = r2_score(y, y_pred)
- print (rmse)
- print (r2)
1.1832766119182259
0.007636444138149345
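If you want to verify the inbuilt functions, both metrics are easy to compute by hand with NumPy. This is a small sketch, assuming y holds the actual values and y_test holds the model's predictions from above:

```python
# Manual versions of the two metrics, for comparison with sklearn
diff = np.ravel(y) - np.ravel(y_test)          # per-point prediction errors
mse_manual = np.mean(diff ** 2)                # mean squared error
k = np.sum((np.ravel(y) - np.mean(y)) ** 2)    # total sum of squares
r2_manual = 1 - np.sum(diff ** 2) / k          # R2 score
print(np.sqrt(mse_manual), r2_manual)          # RMSE and R2
```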
Visualising the results
Now let us try to visualise the results.
First we find the coefficients and the intercept of the polynomial that was generated.
- print ("Model paramters:")
- print (sess.run(W))
- print ("bias:%f" %sess.run(b))
Model parameters:
[[ 1.1229055 ]
[-2.1566594 ]
[ 0.67295593]]
bias:0.128522
Using this, we can find the equation itself:
- res = "y = f(x) = " + str(sess.run(b)[0])
- for i, r in enumerate(sess.run(W)):
-     res = res + " + {}*x^{}".format("%.2f" % r[0], i + 1)
- print (res)
y = f(x) = 0.128522 + 1.12*x^1 + -2.16*x^2 + 0.67*x^3
Finally, we can visualise the function by plotting it. We plot a line graph of the equation.
- plt.scatter(x, y, s=10)
- sort_axis = operator.itemgetter(0)
- sorted_zip = sorted(zip(x,y_test), key=sort_axis)
- x, y_poly_pred = zip(*sorted_zip)
- plt.plot(x, y_poly_pred, color='red')
- plt.show()
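For a smoother curve than the one drawn through the 20 training points, you can evaluate the trained model on a dense grid of x values. This is a sketch that assumes the session, deg and the training data from above are still available; the grid features are scaled by the same training-column maxima as in modify_input:

```python
# Evaluate the fitted polynomial on a dense grid for a smoother plot
x_train = np.asarray(x, dtype=float)
x_grid = np.linspace(x_train.min(), x_train.max(), 200)
grid_feats = np.array([[np.power(v, i + 1) / np.max(np.power(x_train, i + 1))
                        for i in range(deg)] for v in x_grid])
y_grid = sess.run(Y_pred, feed_dict={x_: grid_feats})
plt.plot(x_grid, y_grid, color='green')
plt.show()
```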
Part 2: Using real data
https://colab.research.google.com/drive/1S0wz7xquJ5-6MaREEMnxx-_HA7r6BVZw
In this part of the tutorial, we will be using some data (you can get it here: https://github.com/SiddhantAttavar/PolynomialRegression/blob/master/Position_Salaries.csv ) about the relationship between position level and salary in a company. As you can see, as the level increases, so does the salary; however, the relationship is not linear.
First we import the data.
- url = 'https://raw.githubusercontent.com/SiddhantAttavar/PolynomialRegression/master/Position_Salaries.csv'
- datas = pd.read_csv(url)
- print (datas)
Position,Level,Salary
Business Analyst,1,45000
Junior Consultant,2,50000
Senior Consultant,3,60000
Manager,4,80000
Country Manager,5,110000
Region Manager,6,150000
Partner,7,200000
Senior Partner,8,300000
C-level,9,500000
CEO,10,1000000
The data is stored as a csv (comma separated values) file. In this file, each column is separated by a comma, which makes it easy to read.
In this case the x values are the level column and the y values are the salary column. We create the arrays using some functions in the pandas library.
- X = datas.iloc[:, 1].values
- Y = datas.iloc[:, 2].values
- Y = Y[:, np.newaxis]
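Equivalently (a small alternative, not from the original notebook), you can select the columns by name rather than by position, which makes the intent clearer:

```python
# Same arrays, selected by column name instead of column index
X = datas['Level'].values
Y = datas['Salary'].values.reshape(-1, 1)   # column vector, as the Y_ placeholder expects
```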
Now we can plot the data using a scatter plot.
- plt.scatter(X, Y, s=10)
- plt.show()
We can do Polynomial Regression for this data with degree 2. We will modify the degree later in the tutorial.
- deg = 2 #@param {type:"slider", min:1, max:20, step:1}
- W = tf.Variable(tf.random_normal([deg,1]), name='weight')
- #bias
- b = tf.Variable(tf.random_normal([1]), name='bias')
- X_=tf.placeholder(tf.float32,shape=[None,deg])
- Y_=tf.placeholder(tf.float32,shape=[None, 1])
- X_modified=modify_input(X,X.size,deg)
- Y_pred=tf.add(tf.matmul(X_,W),b)
- #loss function: mean squared error
- loss = tf.reduce_mean(tf.square(Y_pred -Y_ ))
- #training algorithm
- optimizer = tf.train.GradientDescentOptimizer(0.05).minimize(loss)
- #initializing the variables
- init = tf.global_variables_initializer()
- #starting the session
- sess = tf.Session()
- sess.run(init)
- epoch=12000
- for step in range(epoch):
-     _, c = sess.run([optimizer, loss], feed_dict={X_: X_modified, Y_: Y})
-     if step % 1000 == 0:
-         print ("loss: " + str(c))
- Y_test=sess.run(Y_pred, feed_dict={X_:X_modified})
Now we can check how well our model is performing.
- rmse = np.sqrt(mean_squared_error(Y, Y_test))
- r2 = r2_score(Y, Y_test)
- print (rmse)
- print (r2)
After this, we visualise the results. First we get the coefficients and print the formula; then we plot the equation.
- print ("Model paramters:")
- print (sess.run(W))
- print ("bias:%f" %sess.run(b))
- res = "y = f(x) = " + str(sess.run(b)[0])
- for i, r in enumerate(sess.run(W)):
-     res = res + " + {}*x^{}".format("%.2f" % r[0], i + 1)
- print (res)
- plt.scatter(X, Y, s=10)
- sort_axis = operator.itemgetter(0)
- sorted_zip = sorted(zip(X,Y_test), key=sort_axis)
- X, Y_poly_pred = zip(*sorted_zip)
- plt.plot(X, Y_poly_pred, color='red')
- plt.show()
Lastly, we predict what the salary for level 11 would be. Since the model was trained on scaled powers of the level, we build the feature row for level 11 using the same column maxima that modify_input used during training.
- # scale the powers of 11 by the training-column maxima used in modify_input
- x_new = [[np.power(11.0, i + 1) / np.max(np.power(np.asarray(X, dtype=float), i + 1)) for i in range(deg)]]
- print (sess.run(Y_pred, feed_dict={X_: x_new})[0][0])
This predicts a salary of roughly 1,121,833 for level 11.
Overfitting and underfitting
Underfitting and overfitting are two things that you must always try to avoid.
Underfitting is when your model is not able to capture the relationship between the two quantities. It occurs, for example, when you try to fit a linear model to a quadratic relationship. Common symptoms are a high MSE and a low R2 score.
Overfitting is the opposite problem: the model performs very well on the training data but fails on new, unseen data. In this case the generated curve passes through all or nearly all of the data points; the model memorises the training set instead of learning the overall pattern, so it cannot generalize.
There are two common ways of reducing these problems:
- Providing more data: with more data, the model is more likely to pick up the general pattern rather than the noise
- Finding the correct degree for the polynomial
Finding the correct degree
In our position vs salary example, we have very limited data, so adding more data is not possible. We have to find the correct degree instead. Here is a table of some degrees with their MSE and R2 score:
Degree | MSE | R2 score
---|---|---
1 | 163388.73 | 0.66
2 | 82212.12 | 0.91
3 | 38931.5 | 0.9812097727913367
5 | 4047.5 | 0.9997969027099755
10 | 0.0008 | 1.0
Here the degree-1 (linear) polynomial underfits, since it fails to capture the pattern. It also has a high MSE and a low R2 score.
The degree 5 and degree 10 polynomials overfit the data. They have very good scores; in fact, the degree 10 polynomial has an R2 score of 1, which is the best possible. However, given new data that lies slightly off the curve, they will not be able to generalize.
The degree 2 and degree 3 polynomials are a good fit, as they capture the pattern without overfitting. Note that in practice the best fits usually do not need a degree greater than 3.
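One simple way to automate this search is to fit each candidate degree and compare the scores. The sketch below is not from the original notebook and, for brevity, uses NumPy's closed-form polynomial fit instead of retraining the TensorFlow model for every degree, so the numbers it prints will differ slightly from the table above:

```python
# Fit several candidate degrees and report RMSE / R2 for each
def score_degrees(x_vals, y_vals, degrees):
    for d in degrees:
        coeffs = np.polyfit(x_vals, y_vals, d)   # least-squares fit of degree d
        preds = np.polyval(coeffs, x_vals)       # predictions on the training data
        rmse = np.sqrt(mean_squared_error(y_vals, preds))
        r2 = r2_score(y_vals, preds)
        print("degree %d: RMSE = %.2f, R2 = %.4f" % (d, rmse, r2))

# Example with the salary data; degree 10 triggers a RankWarning
# because 11 coefficients are fitted to only 10 data points
score_degrees(np.ravel(X), np.ravel(Y), [1, 2, 3, 5, 10])
```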
Summary
In this tutorial, we learnt the following concepts:
- Linear Regression
- Generating datasets
- Mean Square Error
- R2 score
- Polynomial Regression
- Plotting scatter plots and line graphs
- Importing datasets from csv files
- Overfitting and underfitting
Hope you are able to use these concepts in your own projects.