In this project, I have tried to predict house prices using Linear Regression Model.
I have used Ames Housing Dataset. This dataset contains 79 variables describing almost every aspect of residential homes in Ames, Iowa, United States. Dataset contains categorical(nominal & ordinal), discrete and continuous variables and also have missing values for some variables for train & test dataset. So, first step is to pre-process the available data.
Since dataset contains categorical, discrete & continuous variables, so I have created a csv file containing the information for each variable. This file has helped me in handling missing values for each variables and also for handling categorical variables.
- Missing values for continuous & discrete variables are replaced by mean value of that variable while the same have been replaced by a new category for categorical variables. If you wish to change this methadology then change here
- Categorical variables in this dataset have two types i.e. nominal (doesn't have ordering sense) & ordinal (have ordering sense e.g. rating of furniture varying from poor to excelent).
If a nominal categorical variable holds large number of categories, then it will unnecessarily increase the complexity of the model. To overcome this problem, I have used Frequency Reduction technique to club all the categories having frequency less than the defined frequency. If you want to change this methadology then you can change it here. Nominal variables are then encoded to numerical values using sklearn LabelEncoder
Ordinal variables are dealt with manual identification of their order. This has been done here
Now, all the variables are in numeric form but categorical variables which are converted to numbers might mislead the model. So, it's important to convert transformed categorical variables to one-hot encoded vectors using sklearn OneHotEncoder
Now, we have transformed all the variables to their final form to feed into the model.
I have used the very basic linear regression model to predict the price of houses.
This problem can also be found on kaggle
You can use principal component analysis to find the relevant continuous features. I will update this portion after applying advanced regression techniques.