
Polynomial Regression in Python

What is Polynomial Regression?

Polynomial Regression is a process by which, given a set of inputs and their corresponding outputs, we find an nth degree polynomial f(x) that maps the inputs to the outputs.

This f(x) is of the form:

`f(x) = a_0 + a_1*x + a_2*x^2 + ... + a_n*x^n`

Polynomial regression has an advantage over linear regression because it can identify patterns that linear regression cannot. For example, if a ball is thrown upwards, its height over time follows a quadratic function, and cubic equations are used to calculate planetary motion. These patterns cannot be identified using linear regression.

Generating a random dataset

To do any Polynomial Regression, the first thing we need is data.

In the first part of this tutorial, we perform polynomial regression on a randomly generated dataset to understand the concepts. Then we do the same on some real data.

Part 1: Using generated dataset

https://colab.research.google.com/drive/1_Xa5QG-HLPV8yxIOd5vD-dA6PHYAfvd8

We start by importing some libraries that we will be using in this tutorial.

```python
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import operator
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
```

Next, we generate 20 random points with normally distributed coordinates. We can visualise them using a scatter plot.

```python
np.random.seed(0)
x = np.random.normal(0, 1, 20)
y = np.random.normal(0, 1, 20)
plt.scatter(x, y, s=10)
plt.show()
```

Doing Polynomial Regression

We are doing Polynomial Regression using TensorFlow. We have to feed in the degree of the polynomial that we want and the x data. The degree is an important parameter that we will cover later. First, we modify the data so that it can be accepted by TensorFlow. Then we set some parameters, such as the optimizer and the loss function. Finally, we train the model for 12000 steps / epochs.

```python
deg = 3
W = tf.Variable(tf.random_normal([deg, 1]), name='weight')
# bias
b = tf.Variable(tf.random_normal([1]), name='bias')
x_ = tf.placeholder(tf.float32, shape=[None, deg])
y_ = tf.placeholder(tf.float32, shape=[None, 1])

def modify_input(x, x_size, n_value):
    # build a matrix whose columns are x, x^2, ..., x^n, each scaled by its maximum
    x_new = np.zeros([x_size, n_value])
    for i in range(n_value):
        x_new[:, i] = np.power(x, (i + 1))
        x_new[:, i] = x_new[:, i] / np.max(x_new[:, i])
    return x_new

x_modified = modify_input(x, x.size, deg)
Y_pred = tf.add(tf.matmul(x_, W), b)
# loss function
loss = tf.reduce_mean(tf.square(Y_pred - y_))
# training algorithm
optimizer = tf.train.GradientDescentOptimizer(0.05).minimize(loss)
# initializing the variables
init = tf.global_variables_initializer()
# starting the session
sess = tf.Session()
sess.run(init)
epoch = 12000
for step in range(epoch):
    # y is reshaped to a column vector so it matches the placeholder shape
    _, c = sess.run([optimizer, loss], feed_dict={x_: x_modified, y_: y.reshape(-1, 1)})
    if step % 1000 == 0:
        print("loss: " + str(c))
y_test = sess.run(Y_pred, feed_dict={x_: x_modified})
```

Finally we calculate the errors.

```python
rmse = np.sqrt(mean_squared_error(y, y_test))
r2 = r2_score(y, y_test)
print(rmse)
print(r2)
```

1.1507521092081143

0.061440511342737425

Loss functions

We need to calculate how efficient our model is at capturing the patterns in the data. There are 2 common ways of doing this:

  1. Mean Square Error
  2. R square score (R2 score)

Let us understand the math behind these two:

Mean Square Error:

For every value of x, we have the actual value of y and the value of y that our line predicts. We find the difference between the two, add up the differences for every value of x, and finally divide by the number of values of x.

This equation has a problem though. Sometimes the difference will be positive and other times it will be negative. These values can cancel out, so even when there are large errors the output can show little or no error. To tackle this problem, we square each difference before adding it.
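
To make the calculation concrete, here is a minimal sketch of the mean square error computed by hand with NumPy. The arrays `y_actual` and `y_predicted` are hypothetical values chosen purely for illustration:

```python
import numpy as np

# hypothetical actual and predicted values, purely for illustration
y_actual = np.array([1.0, 2.0, 3.0])
y_predicted = np.array([1.1, 1.9, 3.3])

# square each difference, add them up, and divide by the number of values
mse = np.sum((y_actual - y_predicted) ** 2) / y_actual.size
print(mse)  # 0.0366...
```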

R2 score:

First we have to find the mean m of all the values of y:

Then we get the difference between each value of y and the mean. We square each difference and add them. Let this value be k.

Now we divide the sum of the squared errors by k and subtract the result from 1. This gives us the R2 score. The R2 score is typically a value between 0 and 1. An R2 score close to 1 means x correlates to y well and the line can predict the y values well.
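
Here is the same kind of hand-worked sketch for the R2 score, again with hypothetical values, following the steps described above:

```python
import numpy as np

# hypothetical actual and predicted values, purely for illustration
y_actual = np.array([1.0, 2.0, 3.0])
y_predicted = np.array([1.1, 1.9, 3.3])

# m: mean of all the values of y
m = np.mean(y_actual)

# k: sum of the squared differences between each y value and the mean
k = np.sum((y_actual - m) ** 2)

# sum of the squared errors of the predictions
sse = np.sum((y_actual - y_predicted) ** 2)

# R2 score: divide the sum of squared errors by k and subtract from 1
r2 = 1 - sse / k
print(r2)  # 0.945
```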

Let us see how this looks in code. There are some inbuilt functions that handle the calculations for us:

```python
rmse = np.sqrt(mean_squared_error(y, y_test))
r2 = r2_score(y, y_test)
print(rmse)
print(r2)
```

1.1832766119182259

0.007636444138149345

Visualising the results

Now let us try to visualise the results.

First, we find the coefficients and the intercept of the polynomial that was generated.

  1. print ("Model paramters:")
  2. print (sess.run(W))
  3. print ("bias:%f" %sess.run(b))

```
Model parameters:
[[ 1.1229055 ]
 [-2.1566594 ]
 [ 0.67295593]]
bias:0.128522
```

Using this, we can find the equation itself:

  1. res = "y = f(x) = " + str(sess.run(b)[0])
  2. for i, r in enumerate(sess.run(W)):
  3. res = res + " + {}\*x^{}".format("%.2f" % r[0], i + 1)
    
  4. print (res)

y = f(x) = 0.128522 + 1.12*x^1 + -2.16*x^2 + 0.67*x^3

Finally, we can visualise the function by plotting it. We plot a line graph of the equation.

```python
plt.scatter(x, y, s=10)
# sort the values of x before the line plot
sort_axis = operator.itemgetter(0)
sorted_zip = sorted(zip(x, y_test), key=sort_axis)
x, y_test = zip(*sorted_zip)
plt.plot(x, y_test, color='red')
plt.show()
```

Part 2: Using real data

https://colab.research.google.com/drive/1S0wz7xquJ5-6MaREEMnxx-_HA7r6BVZw

In this part of the tutorial, we will be using some data (you can get it here: https://github.com/SiddhantAttavar/PolynomialRegression/blob/master/Position_Salaries.csv) about the relationship between position level and salary in a company. As you can see, as the level increases, so does the salary. However, the relationship is not linear.

First we import the data.

```python
# importing the dataset
url = 'https://raw.githubusercontent.com/SiddhantAttavar/PolynomialRegression/master/Position_Salaries.csv'
datas = pd.read_csv(url)
print(datas)
```

```
Position,Level,Salary
Business Analyst,1,45000
Junior Consultant,2,50000
Senior Consultant,3,60000
Manager,4,80000
Country Manager,5,110000
Region Manager,6,150000
Partner,7,200000
Senior Partner,8,300000
C-level,9,500000
CEO,10,1000000
```

The data is stored as a csv (comma separated values) file. In this file, each column is separated by a comma, which makes it easy to read.

In this case, the x values are the Level column and the y values are the Salary column. We create the arrays using some functions from the pandas library.

```python
X = datas.iloc[:, 1].values
Y = datas.iloc[:, 2].values
Y = Y[:, np.newaxis]
```

Now we can plot the data using a scatter plot.

```python
plt.scatter(X, Y, s=10)
plt.show()
```

We can do Polynomial Regression for this data with degree 2. We will come back to the choice of degree later in the tutorial.

```python
deg = 2 #@param {type:"slider", min:1, max:20, step:1}
W = tf.Variable(tf.random_normal([deg, 1]), name='weight')
# bias
b = tf.Variable(tf.random_normal([1]), name='bias')
X_ = tf.placeholder(tf.float32, shape=[None, deg])
Y_ = tf.placeholder(tf.float32, shape=[None, 1])
X_modified = modify_input(X, X.size, deg)
Y_pred = tf.add(tf.matmul(X_, W), b)
# loss function
loss = tf.reduce_mean(tf.square(Y_pred - Y_))
# training algorithm
optimizer = tf.train.GradientDescentOptimizer(0.05).minimize(loss)
# initializing the variables
init = tf.global_variables_initializer()
# starting the session
sess = tf.Session()
sess.run(init)
epoch = 12000
for step in range(epoch):
    _, c = sess.run([optimizer, loss], feed_dict={X_: X_modified, Y_: Y})
    if step % 1000 == 0:
        print("loss: " + str(c))
Y_test = sess.run(Y_pred, feed_dict={X_: X_modified})
```

Now we can find out how well our model is performing.

```python
rmse = np.sqrt(mean_squared_error(Y, Y_test))
r2 = r2_score(Y, Y_test)
print(rmse)
print(r2)
```

82212.12

0.91

After this, we visualise the results. First, we get the coefficients and print the formula, and then we plot the equation.

  1. print ("Model paramters:")
  2. print (sess.run(W))
  3. print ("bias:%f" %sess.run(b))
  4. res = "y = f(x) = " + str(sess.run(b)[0])
  5. for i, r in enumerate(sess.run(W)):
  6. res = res + " + {}\*x^{}".format("%.2f" % r[0], i + 1)
    
  7. print (res)
  8. plt.scatter(X, Y, s=10)
  9. sort the values of x before line plot

  10. sort_axis = operator.itemgetter(0)
  11. sorted_zip = sorted(zip(X,Y_test), key=sort_axis)
  12. X, Y_poly_pred = zip(*sorted_zip)
  13. plt.plot(X, Y_poly_pred, color='red')
  14. plt.show()

Lastly, we predict what the salary for level 11 would be. To keep the prediction code short, we fit an equivalent degree 2 polynomial with scikit-learn and use it to extrapolate.

```python
# predicting a new result with Polynomial Regression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
poly = PolynomialFeatures(degree=deg)
lin_poly = LinearRegression().fit(poly.fit_transform(np.asarray(X, dtype=float).reshape(-1, 1)), Y)
lin_poly.predict(poly.fit_transform([[11.0]]))[0]
```

1121833.333333334

Overfitting

Under-fitting and over-fitting are 2 things that you must always try to avoid.

Under-fitting is when your model is not able to recognise the relationship between the two quantities. For example, it may occur when trying to fit a linear model to a quadratic relationship. Common symptoms of this are a high MSE and a low R2 score.

On the other hand, overfitting is when the model performs very well on the training data but fails to perform on new, unseen data. In this case, the curve generated passes through all or nearly all of the data points, but the model fails to understand the overall pattern and cannot generalize.

There are 2 ways of eliminating these problems:

  1. Providing more data: If you provide more data, the model is more likely to identify the general pattern
  2. Finding the correct degree for the polynomial

Finding the correct degree

In our position vs salary example, we have very limited data, so adding more data is not possible. We have to find the correct degree. Here is a table of some degrees with their MSE and R2 score:

| Degree | MSE | R2 score |
| ------ | --------- | ------------------ |
| 1 | 163388.73 | 0.66 |
| 2 | 82212.12 | 0.91 |
| 3 | 38931.5 | 0.9812097727913367 |
| 5 | 4047.5 | 0.9997969027099755 |
| 10 | 0.0008 | 1.0 |

Here the linear polynomial is an underfit, since it fails to capture the pattern. It also has a high MSE and a low R2 score.

The degree 5 and degree 10 polynomials overfit the data. They have very low MSE and very high R2 scores; in fact, the degree 10 polynomial has an R2 score of 1, which is the best possible. However, given data slightly off the curve, they will not be able to generalize.

The degree 2 and degree 3 polynomials are a good fit as they capture the pattern but do not overfit. Note that in general the best fits usually do not have a degree greater than 3.
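
If you want to experiment with the degree yourself, here is a rough sketch of how the comparison could be made on the salary data. It uses NumPy's np.polyfit instead of the TensorFlow model above, purely to keep the loop short, so the exact numbers will differ slightly from the table, and fitting degree 10 to only 10 points will trigger a conditioning warning:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# assumes X and Y still hold the Level and Salary data loaded in Part 2
levels = np.asarray(X, dtype=float).ravel()
salaries = np.asarray(Y, dtype=float).ravel()

for degree in [1, 2, 3, 5, 10]:
    coeffs = np.polyfit(levels, salaries, degree)   # least-squares fit of the given degree
    predictions = np.polyval(coeffs, levels)        # evaluate the fit at the training levels
    rmse = np.sqrt(mean_squared_error(salaries, predictions))
    print(degree, rmse, r2_score(salaries, predictions))
```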

Summary

In this tutorial, we learnt the following concepts:

  1. Linear Regression
  2. Generating datasets
  3. Mean Square Error
  4. R2 score
  5. Polynomial Regression
  6. Plotting scatter plots and line graphs
  7. Importing datasets from csv files
  8. Overfitting and underfitting

Hope you are able to use these concepts in your own projects.

About

A tutorial for Polynomial Regression created during Google Code-in 2019