Prediction of wind turbine power generation from real-time SCADA data

The models built for this problem predict the power that a wind turbine is expected to produce. Comparing the turbine's actual production with these estimates shows the investor how far the turbine falls short of its expected output. The investor can then recognize that the turbine has a performance problem and initiate a root cause analysis.
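The comparison described above can be sketched as follows. This is only an illustration: the function name, the sample values, and the 10% threshold are assumptions and do not come from the original project.

```python
import numpy as np

def underperformance_ratio(actual_kw, predicted_kw):
    """Fraction by which actual output falls short of the prediction.

    Positive values mean the turbine produced less than expected.
    Intervals with a non-positive prediction are returned as NaN.
    """
    actual_kw = np.asarray(actual_kw, dtype=float)
    predicted_kw = np.asarray(predicted_kw, dtype=float)
    ratio = np.full(actual_kw.shape, np.nan)
    valid = predicted_kw > 0
    ratio[valid] = (predicted_kw[valid] - actual_kw[valid]) / predicted_kw[valid]
    return ratio

# Hypothetical values in kW, for illustration only.
actual = [850.0, 1500.0, 0.0]
predicted = [1000.0, 1500.0, 0.0]

ratio = underperformance_ratio(actual, predicted)
THRESHOLD = 0.10  # assumed threshold, not taken from the project
flags = ratio >= THRESHOLD  # NaN entries compare as False
print(flags)
```

Intervals flagged this way would be candidates for a closer look; a single flagged interval is not by itself evidence of a fault.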

The data set provided for the problem consists of real-time SCADA data. Each data value belongs only to its own time period: the input variables supplied for a given period are meant to be used to predict the power generated in that same period.

In the shared data set, the real-time power generation (Power(kW)) of a wind turbine owned by Enerjisa Üretim is given at 10-minute intervals for the period from 01.01.2019 to 14.08.2021.
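Because the records are expected every 10 minutes, it can be useful to check the timestamp grid for gaps before modelling. The sketch below uses made-up timestamps in the same format as the Timestamp column; it is an optional check and not part of the original pipeline.

```python
import pandas as pd

# Hypothetical timestamps in the same format as the "Timestamp" column.
timestamps = pd.Series(pd.to_datetime([
    "2019-01-01 00:00:00",
    "2019-01-01 00:10:00",
    "2019-01-01 00:20:00",
    "2019-01-01 00:40:00",  # the 00:30 record is missing
]))

expected_step = pd.Timedelta(minutes=10)

# Time difference between consecutive records.
steps = timestamps.diff().dropna()

# Keep only the steps that are not exactly 10 minutes long.
gaps = steps[steps != expected_step]

# Each gap of k*10 minutes hides k-1 missing records.
n_missing = int((gaps / expected_step - 1).sum())
print(len(gaps), n_missing)
```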

Information presented in the dataset and its units

Column	Unit
Timestamp	()
Gearbox_T1_High_Speed_Shaft_Temperature	(°C)
Gearbox_T3_High_Speed_Shaft_Temperature	(°C)
Gearbox_T1_Intermediate_Speed_Shaft_Temperature	(°C)
Temperature Gearbox Bearing Hollow Shaft	(°C)
Tower Acceleration Normal	(mm/s²)
Gearbox_Oil-2_Temperature	(°C)
Tower Acceleration Lateral	(mm/s²)
Temperature Bearing_A	(°C)
Temperature Trafo-3	(°C)
Gearbox_T3_Intermediate_Speed_Shaft_Temperature	(°C)
Gearbox_Oil-1_Temperature	(°C)
Gearbox_Oil_Temperature	(°C)
Torque	(%)
Converter Control Unit Reactive Power	(kVAr)
Temperature Trafo-2	(°C)
Reactive Power	(kVAr)
Temperature Shaft Bearing-1	(°C)
Gearbox_Distributor_Temperature	(°C)
Moment D Filtered	(kNm)
Moment D Direction	(kNm)
N-set 1	(rpm)
Operating State	( )
Power Factor	( )
Temperature Shaft Bearing-2	(°C)
Temperature_Nacelle	(°C)
Voltage A-N	(V)
Temperature Axis Box-3	(°C)
Voltage C-N	(V)
Temperature Axis Box-2	(°C)
Temperature Axis Box-1	(°C)
Voltage B-N	(V)
Nacelle Position_Degree	(°)
Converter Control Unit Voltage	(V)
Temperature Battery Box-3	(°C)
Temperature Battery Box-2	(°C)
Temperature Battery Box-1	(°C)
Hydraulic Prepressure	(bar)
Angle Rotor Position	(°)
Temperature Tower Base	(°C)
Pitch Offset-2 Asymmetric Load Controller	(°)
Pitch Offset Tower Feedback	(°)
Line Frequency	(Hz)
Internal Power Limit	(kW)
Circuit Breaker cut-ins	( )
Particle Counter	( )
Tower Accelaration Normal Raw	(mm/s²)
Torque Offset Tower Feedback	(Nm)
External Power Limit	(kW)
Blade-2 Actual Value_Angle-B	(°)
Blade-1 Actual Value_Angle-B	(°)
Blade-3 Actual Value_Angle-B	(°)
Temperature Heat Exchanger Converter Control Unit	(°C)
Tower Accelaration Lateral Raw	(mm/s²)
Temperature Ambient	(°C)
Nacelle Revolution	( )
Pitch Offset-1 Asymmetric Load Controller	(°)
Tower Deflection	(ms)
Pitch Offset-3 Asymmetric Load Controller	(°)
Wind Deviation 1 seconds	(°)
Wind Deviation 10 seconds	(°)
Proxy Sensor_Degree-135	(mm)
State and Fault	( )
Proxy Sensor_Degree-225	(mm)
Blade-3 Actual Value_Angle-A	(°)
Scope CH 4	( )
Blade-2 Actual Value_Angle-A	(°)
Blade-1 Actual Value_Angle-A	(°)
Blade-2 Set Value_Degree	(°)
Pitch Demand Baseline_Degree	(°)
Blade-1 Set Value_Degree	(°)
Blade-3 Set Value_Degree	(°)
Moment Q Direction	(kNm)
Moment Q Filltered	(kNm)
Proxy Sensor_Degree-45	(mm)
Turbine State	( )
Proxy Sensor_Degree-315	(mm)

Preprocessing

#!/usr/bin/env python
# coding: utf-8
import os
import pickle
import time
import warnings

import numpy as np
import pandas as pd

# Silence library warnings to keep the output readable.
warnings.filterwarnings('ignore')

unzip dataset

import zipfile
path_to_zip_file="enerjisa-uretim-hackathon.zip"
with zipfile.ZipFile(path_to_zip_file, 'r') as zip_ref:
    zip_ref.extractall(path_to_zip_file[:-4])

take dataset files as a list in "csvs"

def find_the_way(path,file_format):
    """Walk `path` recursively and return every file whose name contains `file_format`."""
    files_add = []

    for r, d, f in os.walk(path):
        for file in f:
            if file_format in file:
                files_add.append(os.path.join(r, file))
    return files_add

path=path_to_zip_file[:-4]
csvs=find_the_way(path,'.csv')
csvs

['enerjisa-uretim-hackathon\\features.csv',
 'enerjisa-uretim-hackathon\\feature_units.csv',
 'enerjisa-uretim-hackathon\\power.csv',
 'enerjisa-uretim-hackathon\\sample_submission.csv']

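replace NaN and inf values with 0

# csvs[0] is features.csv and csvs[2] is power.csv in the listing above.
features=pd.read_csv(csvs[0])
labels=pd.read_csv(csvs[2])

# Turn +/-inf into NaN first so that a single fillna covers both cases.
features.replace([np.inf, -np.inf], np.nan, inplace=True)
features=features.fillna(0)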
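The effect of this cleaning step can be seen on a small toy frame. The column names and values below are invented for illustration. Note that after filling, a replaced value can no longer be told apart from a genuinely measured zero, so counting the affected cells beforehand may be worthwhile.

```python
import numpy as np
import pandas as pd

# Toy frame with one +inf, one -inf and one NaN (hypothetical values).
toy = pd.DataFrame({
    "a": [1.0, np.inf, np.nan],
    "b": [-np.inf, 2.0, 3.0],
})

# Count the cells that are NaN or infinite before they are overwritten.
n_bad = int((~np.isfinite(toy)).sum().sum())

# Same two-step cleaning as in the pipeline: inf -> NaN, then NaN -> 0.
cleaned = toy.replace([np.inf, -np.inf], np.nan).fillna(0)

print(n_bad)
print(cleaned)
```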

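create and add a new feature derived from the timestamp

# Encode each timestamp as month*100 + part of the month, where the part is
# 1 (days 1-9), 2 (days 10-19) or 3 (days 20-31). Example: 2021-08-15 -> 802.
# "ay_ve_gun" is Turkish for "month and day".
ay_ve_gun=[]
for i in features["Timestamp"]:
    month=int(i[5:7])*100
    day=(int(i[8:10])//10+1)
    if day==4:  # days 30 and 31 fall into the third part
        day=3
    ay_ve_gun.append(month+day)

features["ay_ve_gun"]=ay_ve_gun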
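The same month-and-day code can also be computed without an explicit loop. The following is a sketch of an equivalent vectorized version using the pandas datetime accessor; the function name `month_and_day_bucket` is hypothetical and the loop above remains the project's own implementation.

```python
import pandas as pd

def month_and_day_bucket(timestamps):
    """Return month*100 + part of month (1, 2 or 3) for each timestamp."""
    ts = pd.to_datetime(timestamps)
    # day // 10 + 1 gives 1..4; days 30 and 31 are merged into part 3.
    bucket = (ts.dt.day // 10 + 1).clip(upper=3)
    return ts.dt.month * 100 + bucket

sample = pd.Series([
    "2021-08-15 00:00:00",
    "2021-12-14 23:50:00",
    "2019-01-31 00:00:00",
])
codes = month_and_day_bucket(sample)
print(codes.tolist())  # [802, 1202, 103]
```

The first two values match the ay_ve_gun column shown for the same timestamps in the unlabelled data.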

split labelled and unlabelled data

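# The first len(labels) rows of features have a matching power label;
# the remaining rows are the unlabelled submission period.
train_size=len(labels)
main=features[0:train_size].copy()  # copy, so the label column is not added to a slice
submission=features[train_size:]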

add labels to dataframe

main["Power(kW)"]=labels["Power(kW)"]

show the unlabelled data, which we will not use

submission

	Timestamp	Gearbox_T1_High_Speed_Shaft_Temperature	Gearbox_T3_High_Speed_Shaft_Temperature	Gearbox_T1_Intermediate_Speed_Shaft_Temperature	Temperature Gearbox Bearing Hollow Shaft	Tower Acceleration Normal	Gearbox_Oil-2_Temperature	Tower Acceleration Lateral	Temperature Bearing_A	Temperature Trafo-3	...	Blade-2 Set Value_Degree	Pitch Demand Baseline_Degree	Blade-1 Set Value_Degree	Blade-3 Set Value_Degree	Moment Q Direction	Moment Q Filltered	Proxy Sensor_Degree-45	Turbine State	Proxy Sensor_Degree-315	ay_ve_gun
136730	2021-08-15 00:00:00	60.068333	62.0	56.000000	58.000000	125.218666	60.000000	64.707336	54.348331	121.000000	...	9.493241	8.925109	9.014512	8.266594	-41.861877	-37.917656	5.739297	1.0	5.734730	802
136731	2021-08-15 00:10:00	60.000000	62.0	56.000000	57.036667	145.160309	59.279999	64.127480	58.098331	120.971664	...	7.507399	6.937748	7.022389	6.287027	-19.210815	-19.602339	5.720869	1.0	5.726634	802
136732	2021-08-15 00:20:00	60.000000	62.0	55.853333	57.000000	129.239914	59.000000	54.563091	60.360001	120.028336	...	8.065812	7.497398	7.581376	6.844808	-28.144068	-34.329105	5.727475	1.0	5.728649	802
136733	2021-08-15 00:30:00	60.000000	62.0	55.000000	57.000000	140.151611	59.000000	61.899250	61.715000	120.000000	...	8.132490	7.565773	7.654368	6.909220	-7.592476	-11.718444	5.728980	1.0	5.739824	802
136734	2021-08-15 00:40:00	60.000000	62.0	55.000000	57.000000	126.124702	59.000000	56.804501	62.698334	120.000000	...	9.546413	8.974770	9.064083	8.313858	-7.760864	-9.863355	5.736651	1.0	5.747692	802
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
154257	2021-12-14 23:10:00	65.811668	0.0	59.945000	62.808334	225.038239	65.300003	109.889709	61.000000	97.000000	...	15.820095	15.199166	15.235223	14.540556	-29.340843	-27.513502	5.746916	1.0	5.756082	1202
154258	2021-12-14 23:20:00	68.586670	0.0	62.084999	65.413330	229.905838	67.871666	106.016670	61.116665	97.000000	...	16.504293	15.876278	15.917643	15.207320	-31.925669	-30.197918	5.749150	1.0	5.755406	1202
154259	2021-12-14 23:30:00	63.746666	0.0	59.965000	64.051666	223.352631	64.461670	111.690208	61.293335	97.000000	...	15.331903	14.720088	14.768394	14.064686	-53.071564	-48.306511	5.751807	1.0	5.747936	1202
154260	2021-12-14 23:40:00	66.643333	0.0	60.678333	63.421665	227.704514	66.081665	119.716499	60.786667	97.000000	...	16.481724	15.887610	15.945046	15.230121	-28.747763	-23.844364	5.747686	1.0	5.757787	1202
154261	2021-12-14 23:50:00	65.593330	0.0	60.738335	64.731667	223.235413	65.891670	103.372475	60.395000	97.000000	...	16.198933	15.591414	15.635881	14.941538	-28.904552	-30.457935	5.753047	1.0	5.761520	1202

17532 rows × 78 columns

split the labelled data into training (67%) and testing (33%) parts

train_size = int(len(main) * 0.67)
test_size = len(main) - train_size
train, test = main[0:train_size], main[train_size:]
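The split above is positional, so with time-ordered rows the training part covers the earlier period and the test part the later one. The sketch below illustrates this on a toy frame with invented values; it assumes the rows are sorted by timestamp, which the slicing itself does not enforce.

```python
import pandas as pd

# Toy frame standing in for `main` (hypothetical values, 10-minute grid).
toy = pd.DataFrame({
    "Timestamp": pd.date_range("2019-01-01", periods=100, freq="10min"),
    "Power(kW)": range(100),
})

# Same positional 67% / 33% split as in the pipeline.
train_size = int(len(toy) * 0.67)
train, test = toy[0:train_size], toy[train_size:]

# With sorted rows, every training record precedes every test record.
no_overlap = train["Timestamp"].max() < test["Timestamp"].min()
print(len(train), len(test), no_overlap)
```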

save the training, testing and unlabelled datasets as CSV files

submission.to_csv("submission.csv",index=False)
train.to_csv("TT.csv",index=False)
test.to_csv("t.csv",index=False)

Machine Learning Step

from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold
import sklearn

from sklearn.ensemble import BaggingRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import  LinearRegression
from sklearn.linear_model import BayesianRidge
from sklearn.linear_model import ElasticNet
from sklearn.linear_model import Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import LinearSVR
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.isotonic import IsotonicRegression
from sklearn.ensemble import VotingRegressor
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, WhiteKernel
from xgboost import XGBRegressor

kernel = DotProduct() + WhiteKernel()

List of Machine learning algorithms

from sklearn.linear_model import TweedieRegressor

# Base learners and meta-learner for the stacking ensemble.
estimators = [('ridge', RidgeCV()),
              ('lasso', LassoCV(random_state=42)),
              ('knr', KNeighborsRegressor(n_neighbors=20,
                                          metric='euclidean'))]
final_estimator = GradientBoostingRegressor(
    n_estimators=25, subsample=0.5, min_samples_leaf=25, max_features=1,
    random_state=42)
reg = StackingRegressor(
    estimators=estimators,
    final_estimator=final_estimator)

# Members of the voting ensemble.
reg1 = GradientBoostingRegressor(random_state=1)
reg2 = RandomForestRegressor(random_state=1)
reg3 = LinearRegression()

# Models to train and evaluate, keyed by the short name used in the results.
ml_list = {
    'LR': LinearRegression(),
    'DT': DecisionTreeRegressor(),
    'BR': BayesianRidge(),
    'EL': ElasticNet(),
    'twd': TweedieRegressor(),
    'LAS': Lasso(),
    'rcv': RidgeCV(),
    'lcv': LassoCV(),
    'BAG': BaggingRegressor(),
    'GBR': GradientBoostingRegressor(),
    'RF': RandomForestRegressor(),
    'KNN': KNeighborsRegressor(),
    #'LRVR':LinearSVR(),'SVR':SVR(),
    #'iso':IsotonicRegression(),
    'vot': VotingRegressor(estimators=[('gb', reg1), ('rf', reg2), ('lr', reg3)]),
    'stc': StackingRegressor(
        estimators=estimators,
        final_estimator=final_estimator),
    'XGB': XGBRegressor(),
}

split dataframe as data (X) and label (y)

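def data_and_label(name):
    """Load a CSV file and return (X, y) as NumPy arrays.

    The last column is taken as the label (Power(kW) in TT.csv and t.csv);
    every other column except Timestamp is used as a feature.
    """
    df = pd.read_csv(name)
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    df=df.fillna(0)
    del df["Timestamp"]
    X =df[df.columns[:-1]]
    X=np.array(X)
    y=np.array(df[df.columns[-1]])
    return X,y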
path='./csv/'

Evaluation - Calculate error score

def score_erros(altime,train_time,test_time,expected,predicted,class_based_results,i,cv,dname,ii):     
      
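    mse = mean_squared_error(expected, predicted)
    mae=mean_absolute_error(expected, predicted)
    # RMSE is the square root of MSE; computing it directly avoids the
    # `squared=False` argument, which is deprecated in recent scikit-learn releases.
    rmse=mse**0.5
    r2=r2_score(expected, predicted)
    # Classification metrics do not apply to regression; they are kept as
    # zero placeholders so the output columns stay unchanged.
    precision,recall,f_score=0,0,0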
    print ('%-10s %-3s %-3s %-10s  %-8s %-8s %-8s %-11s %-8s %-8s %-8s %-6s %-6s %-16s' % (dname,i,cv,ii[0:6],str(round((precision),2)),str(round((recall),2)),str(round((f_score),2)),str(round((mse),2)), str(round((mae),2)),
        str(round((rmse),2)), str(round((r2),2)),str(round((train_time),2)),str(round((test_time),2)),altime))
    lines=str(dname)+","+str(i)+","+str(cv)+","+str(ii)+","+str(round((precision),15))+","+str(round((recall),15))+","+str(round((f_score),15))+","+str(round((mse),15))+","+str(round((mae),15))+","+str(round((rmse),15))+","+ str(round((r2),15))+","+str(round((train_time),15))+","+str(round((test_time),15))+"\n"
    
    return lines,class_based_results,mae
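To make the four reported scores concrete, the sketch below computes them by hand with NumPy on three invented values. It follows the standard single-output definitions of MSE, MAE, RMSE and R²; the function name `regression_errors` is hypothetical, and the project itself uses the scikit-learn functions above.

```python
import numpy as np

def regression_errors(expected, predicted):
    """Return MSE, MAE, RMSE and R^2 for two equally long 1-D sequences."""
    expected = np.asarray(expected, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residual = expected - predicted

    mse = float(np.mean(residual ** 2))
    mae = float(np.mean(np.abs(residual)))
    rmse = mse ** 0.5

    # R^2 compares the model with always predicting the mean of `expected`.
    # It assumes `expected` is not constant (otherwise ss_tot is zero).
    ss_res = float(np.sum(residual ** 2))
    ss_tot = float(np.sum((expected - expected.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot

    return {"mse": mse, "mae": mae, "rmse": rmse, "r2": r2}

# Hypothetical power values in kW.
scores = regression_errors([100.0, 200.0, 300.0], [110.0, 190.0, 330.0])
print(scores)
```

Since R² is measured against the mean predictor, it becomes negative whenever a model has a larger squared error than simply predicting the mean.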

ML Function

def ML(output,file,test_file,i):
    ths = open(output, "a")
    X_test,y_test=data_and_label(test_file)
    ths.write ("Dataset,T,CV,ML_alg,precision,recall,f_scor,mse,mae,rmse, r2  ,tra-T,test-T,total\n")


    fold=5
    repetition=1
    class_based_results= pd.DataFrame()
    target_names=[0,1]


    for ii in ml_list:
        mae_min=1000

        cv=0
        dataset=file[-20:-4]
        clf = ml_list[ii]
        second=time.time()
        X_train,y_train=data_and_label(file)
        clf.fit(X_train, y_train)  
        train_time=(float((time.time()-second)) )
        second=time.time()
        predicted=clf.predict(X_test)
        test_time=(float((time.time()-second)) )
        expected = y_test
        
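        # Sort the absolute errors, pick the values at the 68% and 95%
        # positions and report their square roots (printed as the
        # "CDF68" and "CDF95" columns).
        error=[]
        for j in range(len(y_test)):
            error.append(abs(float(y_test[j])-float(predicted[j])))
        error.sort()
        cep68 = round((error[round(68 * len(error) / 100)])**(1/2),2)
        cep95 = round((error[round(95 * len(error) / 100)])**(1/2),2)
        cep=str(cep68)+'   '+str(cep95)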
        
        line,cb,mae=score_erros(cep,train_time,test_time,expected, predicted,class_based_results,i,cv,dataset,ii)

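        # Save the fitted model. The f-string in the original has no
        # placeholder, so every model overwrote the same ".sav" file;
        # naming the file after the model key is an assumed fix.
        filename=f"{ii}.sav"
        filename=filename.replace('\\','_')
        with open(filename, 'wb') as model_file:
            pickle.dump(clf, model_file)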

        ths.write (line)
    ths.close()
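The percentile values computed inside the ML function can be isolated into a small helper for inspection. The sketch below mirrors that logic with NumPy; the helper name `sqrt_abs_error_at` is hypothetical, and the clamping of the index to the last element is an added guard for very short inputs that is not present in the original code.

```python
import numpy as np

def sqrt_abs_error_at(expected, predicted, percent):
    """Square root of the sorted absolute error at the given percent position."""
    errors = np.sort(np.abs(np.asarray(expected, dtype=float)
                            - np.asarray(predicted, dtype=float)))
    # Same index rule as in ML(); min() only guards against an index
    # that would run past the end of a very short error list.
    index = min(round(percent * len(errors) / 100), len(errors) - 1)
    return round(float(errors[index]) ** 0.5, 2)

# Hypothetical data: the absolute errors are exactly 0, 1, ..., 99.
expected = np.zeros(100)
predicted = np.arange(100)

cdf68 = sqrt_abs_error_at(expected, predicted, 68)
cdf95 = sqrt_abs_error_at(expected, predicted, 95)
print(cdf68, cdf95)  # 8.25 9.75
```

Because of the square root, these two columns are not in kW, so they are best read as a relative ranking aid alongside MAE and RMSE.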

training file

csvs=find_the_way("./",'TT')
csvs

['./TT.csv']

Results

print ('%-10s %-3s %-3s %-10s  %-8s %-8s %-8s %-11s %-8s %-8s %-8s %-6s %-6s %-16s' %
                   ("Dataset","T","CV","ML_alg",'prec','rec','f1',"mse","mae","rmse", "r2"  ,"T","t","CDF68    CDF95"))

for num,csv_path in enumerate(csvs):
    output="./results.csv"  # file that collects the scores
    test_file=csv_path.replace('TT','t')  # matching test data (t.csv)
    ML(output,csv_path,test_file,num)

Dataset    T   CV  ML_alg      prec     rec      f1       mse         mae      rmse     r2       T      t      CDF68    CDF95  
./TT       0   0   LR          0        0        0        1167041.09  989.28   1080.3   -0.01    2.23   0.01   34.42   40.37   
./TT       0   0   DT          0        0        0        24983.24    21.8     158.06   0.98     13.39  0.03   2.47   4.86     
./TT       0   0   BR          0        0        0        1168839.83  991.49   1081.13  -0.01    2.95   0.01   34.34   39.97   
./TT       0   0   EL          0        0        0        1167041.13  989.28   1080.3   -0.01    2.49   0.01   34.42   40.37   
./TT       0   0   twd         0        0        0        1167041.1   989.28   1080.3   -0.01    2.53   0.01   34.42   40.37   
./TT       0   0   LAS         0        0        0        1167041.15  989.28   1080.3   -0.01    2.12   0.01   34.42   40.37   
./TT       0   0   rcv         0        0        0        1167054.07  989.28   1080.3   -0.01    2.87   0.01   34.42   40.37   
./TT       0   0   lcv         0        0        0        1166893.56  991.61   1080.23  -0.01    4.54   0.01   34.37   39.94   
./TT       0   0   BAG         0        0        0        8307.35     14.41    91.14    0.99     57.68  0.34   2.2   4.96      
./TT       0   0   GBR         0        0        0        9996.79     52.03    99.98    0.99     149.85 0.12   6.76   13.03    
./TT       0   0   RF          0        0        0        5502.6      12.68    74.18    1.0      501.61 1.39   2.15   5.24     
./TT       0   0   KNN         0        0        0        557629.39   463.57   746.75   0.52     5.2    267.91 22.23   42.1    
./TT       0   0   vot         0        0        0        139551.09   341.29   373.57   0.88     761.01 2.17   20.05   23.42   
./TT       0   0   stc         0        0        0        519621.29   610.33   720.85   0.55     649.77 333.07 27.57   36.77   
./TT       0   0   XGB         0        0        0        9760.45     51.78    98.79    0.99     41.54  0.19   6.81   12.99
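To compare the models at a glance, the scores can be loaded into a DataFrame and sorted. The sketch below copies the mae and training time (column T, in seconds) of a few rows from the table above by hand; the column name `train_s` is an assumption made for readability.

```python
import pandas as pd

# A subset of the rows above, copied by hand from the printed results.
results = pd.DataFrame({
    "ML_alg": ["LR", "DT", "BAG", "GBR", "RF", "KNN", "XGB"],
    "mae": [989.28, 21.8, 14.41, 52.03, 12.68, 463.57, 51.78],
    "train_s": [2.23, 13.39, 57.68, 149.85, 501.61, 5.2, 41.54],
})

# Rank the models from the lowest to the highest mean absolute error.
ranked = results.sort_values("mae").reset_index(drop=True)
print(ranked)
```

Sorting this way puts the tree-based ensembles (RF, BAG) first, with RF also showing the longest training time among the listed rows.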
