Machine Learning 101
-
Data Cleansing
- Missing Data
- fill missing values in each column with relevant values
- median for numeric fields
- most common label for categorical fields
- (optional) look for a linear correlation with another feature and estimate the missing value from it.
- (optional) create a boolean feature that marks where values were missing.
- (optional) closest fit: copy the value from the most similar record that has one (nearest-neighbour imputation).
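
The basic filling rules above (median for numeric fields, most common label for categorical fields, plus the optional missing-value flag) could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes pandas, the helper name `impute` and the toy columns `age` and `city` are made up for illustration, and every column is assumed to have at least one observed value.

```python
import pandas as pd

def impute(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values: median for numeric columns, most common label otherwise."""
    out = df.copy()
    for col in df.columns:
        if not df[col].isna().any():
            continue  # nothing to fill in this column
        # Optional boolean feature recording where the value was missing.
        out[f"{col}_missing"] = df[col].isna()
        if pd.api.types.is_numeric_dtype(df[col]):
            out[col] = df[col].fillna(df[col].median())
        else:
            # mode() ignores missing values; take the first if there is a tie.
            out[col] = df[col].fillna(df[col].mode().iloc[0])
    return out

# Toy data, invented for this example.
df = pd.DataFrame({
    "age": [25.0, None, 40.0, 31.0],
    "city": ["Paris", "Paris", None, "Rome"],
})
clean = impute(df)
print(clean)
```

Adding the `_missing` flag before filling keeps the information that a value was imputed, which a model can sometimes use.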
- Noisy Data
- Outlier Detection
- Nearest Neighbours - flag points that lie far from their nearest neighbours as outliers and remove them.
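
One possible reading of nearest-neighbour outlier detection is sketched below: score each point by its mean distance to its k nearest neighbours and flag points whose score is well above the typical score. The function name `knn_outlier_mask`, the parameters `k` and `threshold`, and the "threshold times the median score" rule are assumptions made for this illustration, not something the notes specify.

```python
import numpy as np

def knn_outlier_mask(X, k=3, threshold=2.0):
    """Return a boolean mask that is True for points flagged as outliers."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances between all points.
    diffs = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    # A point is not its own neighbour.
    np.fill_diagonal(dist, np.inf)
    # Mean distance to the k nearest neighbours of each point.
    score = np.sort(dist, axis=1)[:, :k].mean(axis=1)
    # Assumed rule: flag scores far above the median score.
    return score > threshold * np.median(score)

# Toy data: four clustered points and one far away.
points = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]])
mask = knn_outlier_mask(points, k=2)
cleaned = points[~mask]  # keep only the non-outliers
print(mask)
```

The pairwise distance matrix is quadratic in the number of rows, so this form only suits small datasets.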
- Data transformation.
- Convert categorical columns to boolean (0/1) columns, one per category.
- Convert boolean columns to binary (0/1) values.
- (Optional) Create categories by grouping values into bins (linear bins for age, logarithmic bins for number of employees).
- Scaling
- linear ( Xi / max(X) )
- (Optional) balance data
- if one label has far more occurrences than the others, rebalance the data (for example by resampling).
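
The transformation steps above might look like this in pandas. It is a sketch under assumptions: the column names, the bin edges, and the helper `downsample` are invented for illustration, and downsampling the larger class is only one of several ways to balance labels.

```python
import numpy as np
import pandas as pd

# Toy data, invented for this example.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "is_member": [True, False, True, False],
    "age": [15, 34, 52, 71],
    "employees": [3, 40, 500, 6000],
    "income": [20.0, 50.0, 100.0, 80.0],
})

# Categories -> one 0/1 column per category.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)

# Boolean column -> binary 0/1.
encoded["is_member"] = df["is_member"].astype(int)

# Linear grouping: fixed-width age bins (edges are an assumption).
encoded["age_group"] = pd.cut(df["age"], bins=[0, 20, 40, 60, 80], labels=False)

# Logarithmic grouping: order of magnitude of the employee count.
encoded["employees_group"] = np.floor(np.log10(df["employees"])).astype(int)

# Linear scaling: Xi / max(X).
encoded["income"] = df["income"] / df["income"].max()

def downsample(frame, label_col, seed=0):
    """Balance labels by sampling every class down to the rarest class size."""
    n = frame[label_col].value_counts().min()
    return frame.groupby(label_col).sample(n=n, random_state=seed)

toy = pd.DataFrame({"label": ["a"] * 6 + ["b"] * 2, "x": range(8)})
balanced = downsample(toy, "label")
print(encoded)
print(balanced["label"].value_counts())
```

Dividing by the maximum only maps values into [0, 1] when the column is non-negative.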
-
Feature selection
- variance filter - remove features with low variance
- filter methods
- select top 25% features with f_classif
- select top 25% features with mutual_info_classif (mutual information)
- wrapper methods
- run RFECV with cv = 3 (3-fold cross-validation)
- run RFECV with StratifiedKFold(2)
- Embedded methods
- decision tree - pick the 25% of features with the highest importance weights
- Combine the features selected by all the methods above.
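
The selection steps above map fairly directly onto scikit-learn. The sketch below is illustrative only: the synthetic dataset, the choice of a decision tree as the RFECV estimator, the zero variance threshold, and the vote-counting way of combining the selections are all assumptions, since the notes do not say how the results are merged.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import (
    RFECV, SelectPercentile, VarianceThreshold, f_classif, mutual_info_classif,
)
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only so the example runs on its own.
X, y = make_classification(n_samples=200, n_features=12, n_informative=4,
                           random_state=0)

# Variance filter: drop features whose variance is not above the threshold.
X = VarianceThreshold(threshold=0.0).fit_transform(X)
n_features = X.shape[1]

masks = {}

# Filter methods: top 25% by ANOVA F-score and by mutual information.
masks["f_classif"] = SelectPercentile(f_classif, percentile=25).fit(X, y).get_support()
mi_score = partial(mutual_info_classif, random_state=0)
masks["mutual_info"] = SelectPercentile(mi_score, percentile=25).fit(X, y).get_support()

# Wrapper methods: recursive feature elimination with cross-validation.
tree = DecisionTreeClassifier(random_state=0)
masks["rfecv_cv3"] = RFECV(tree, cv=3).fit(X, y).support_
masks["rfecv_skf2"] = RFECV(tree, cv=StratifiedKFold(2)).fit(X, y).support_

# Embedded method: top 25% of features by decision-tree importance.
importances = tree.fit(X, y).feature_importances_
k = max(1, n_features // 4)
tree_mask = np.zeros(n_features, dtype=bool)
tree_mask[np.argsort(importances)[-k:]] = True
masks["tree_top25"] = tree_mask

# Combine: count how many methods picked each feature (assumed merge rule).
votes = np.sum(list(masks.values()), axis=0)
selected = np.flatnonzero(votes >= 1)
print(dict(zip(range(n_features), votes)))
```

Keeping the vote count rather than a plain union makes it easy to tighten the rule later, for example to features chosen by at least two methods.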
-
Evaluating model
- Trials - try out different models.
- Examination - after the trials, zoom in on the best models.
- Training - train the chosen models.
- Prediction - predict with the chosen models.
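
The four stages above could be sketched with scikit-learn as follows. The candidate models, the synthetic dataset, the 5-fold cross-validation, and the choice to keep the top two models are assumptions made for illustration; the notes do not name specific models or a metric.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used here only so the example runs on its own.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Trials: cross-validate several candidate models on the training split.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "knn": KNeighborsClassifier(),
}
scores = {
    name: cross_val_score(model, X_train, y_train, cv=5).mean()
    for name, model in candidates.items()
}

# Examination: keep the best-scoring models (top two is an assumption).
top = sorted(scores, key=scores.get, reverse=True)[:2]

# Training: fit the chosen models on the full training split.
fitted = {name: candidates[name].fit(X_train, y_train) for name in top}

# Prediction: predict on held-out data with the chosen models.
predictions = {name: model.predict(X_test) for name, model in fitted.items()}
for name in top:
    print(name, round(scores[name], 3), fitted[name].score(X_test, y_test))
```

Holding out the test split before the trials keeps the final accuracy estimate independent of the model choice.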