-
Notifications
You must be signed in to change notification settings - Fork 0
/
Sklearn Predict sentdex tutorial forcast_out explaination
91 lines (81 loc) · 4.73 KB
/
Sklearn Predict sentdex tutorial forcast_out explaination
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
This code is from sentdex's machine learning youtube video. I will add some extra lines of code to explain the purpose of forecast_out.
In sklearn, you need to define "Features" and "Label". Features are the variables/descriptions of your target value.
Label is the target value you want to predict. In this Google stock prediction example, features include 'Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume'
Label is ‘Adj.Close’ which is the future stock price you are predicting.
We differentiate the ‘Adj.Close’ in Label vs. Features by shifting 10% days of the total days of prices in the dataset. For example, assuming you have 360 days of data, you will use 36 days of future stock close price as target value by observing the trend of previous stock movements.
If we change the code by shifting the forcast days by 2, you will see the differeces.
ORIGINAL DATAFRAME
Adj. Close HL_PCT PCT_change Adj. Volume
Date
2018-03-21 1094.00 1.964351 0.130884 1990515.0
2018-03-22 1053.15 3.254997 -2.487014 3418154.0
2018-03-23 1026.55 4.082607 -2.360729 2413517.0
2018-03-26 1054.09 4.619150 0.332191 3272409.0
2018-03-27 1006.94 6.645878 -5.353887 2940957.0
DATAFRAME AFTER SHIFTING FORCASTING-OUT BY 2
Adj. Close HL_PCT PCT_change Adj. Volume label
Date
2018-03-21 1094.00 1.964351 0.130884 1990515.0 1026.55
2018-03-22 1053.15 3.254997 -2.487014 3418154.0 1054.09
2018-03-23 1026.55 4.082607 -2.360729 2413517.0 1006.94
2018-03-26 1054.09 4.619150 0.332191 3272409.0 NaN
2018-03-27 1006.94 6.645878 -5.353887 2940957.0 NaN
DATAFRAME AFTER DROPNA
Adj. Close HL_PCT PCT_change Adj. Volume label
Date
2018-03-19 1100.07 2.768006 -1.582630 3076349.0 1094.00
2018-03-20 1095.80 2.110787 -0.236708 2709310.0 1053.15
2018-03-21 1094.00 1.964351 0.130884 1990515.0 1026.55
2018-03-22 1053.15 3.254997 -2.487014 3418154.0 1054.09
2018-03-23 1026.55 4.082607 -2.360729 2413517.0 1006.94
THIS IS X: FEATURES
Adj. Close HL_PCT PCT_change Adj. Volume
Date
2018-03-19 1100.07 2.768006 -1.582630 3076349.0
2018-03-20 1095.80 2.110787 -0.236708 2709310.0
2018-03-21 1094.00 1.964351 0.130884 1990515.0
2018-03-22 1053.15 3.254997 -2.487014 3418154.0
2018-03-23 1026.55 4.082607 -2.360729 2413517.0
THIS IS Y: LABEL (TARGET VALUE)
Date
2018-03-19 1094.00
2018-03-20 1053.15
2018-03-21 1026.55
2018-03-22 1054.09
2018-03-23 1006.94
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import quandl
from sklearn.model_selection import train_test_split
from sklearn import cross_validation
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
df=quandl.get("WIKI/GOOGL")
#print (df.tail())
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Close'] * 100.0
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0
df = df[['Adj. Close', 'HL_PCT', 'PCT_change', 'Adj. Volume']]
print (df.tail()) ##by printing the tail of this dataframe, you will see the last 5 days of stock data before shifting
forecast_col = 'Adj. Close'
df.fillna(value=-99999, inplace=True)
forecast_out = int(math.ceil(0.01 * len(df))) ###forcast out 10% days
df['label'] = df[forecast_col].shift(-forecast_out)##shift 'adj.close' price up by 10% of total trading days, this will be the target value you are predicting
df.dropna(inplace=True)##after shifting 'adj.close' column up, the bottom 10% rows of data will be NA
print (df.tail())##you will see NA on the bottom of the data frame in 'Label' column
X = df.drop(['label'], 1) #by dropping the 'label' column, you have all the 4 columns of data, these are your features
y = df['label'] #this is the target value, which is the 'adj.close' after shifting all the values up by 10% of total days
print (X.tail())
print (y.tail())
X = preprocessing.scale(X)
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)
linreg = LinearRegression().fit(X_train, y_train)
print('linear model coeff (w): {}'
.format(linreg.coef_))
print('linear model intercept (b): {:.3f}'
.format(linreg.intercept_))
print('R-squared score (training): {:.3f}'
.format(linreg.score(X_train, y_train)))
print('R-squared score (test): {:.3f}'
.format(linreg.score(X_test, y_test)))