A regression model to predict calories burnt using multiple sensor readings.
- Object Oriented Design.
- Individual Train and Prediction Pipelines.
- User Friendly UI as well as API support to easily initiate the train/prediction pipelines.
- Azure Blob Storage For Files/Models and all intermediary data.
- End to End Log Capture in MongoDB.
- Combining Clustering and Regression Techniques For Better Results.
- Multiple Performance Metrics Capture in MongoDB to compare Model performance.
- End to End ML pipeline deployment with production ready code.
- K Means Clustering.
- DecisionTree Regressor.
- RandomForest Regressor.
- K Nearest Neighbors Regressor.
- SGD Regressor.
- XGBoost Regressor.
- Support Vector Regressor.
- Median Absolute Error.
- Mean Squared Error.
- R2 Score. (Main metric used to compare model performance and select the best model for each cluster.)
- Explained Variance Score.
- Mean Absolute Error.
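
As a rough illustration of what the five metrics above measure, here is a minimal sketch that computes them from paired true and predicted values using only the standard library. The function name `regression_metrics` and the sample calorie values are made up for this example; the project itself may well rely on a library implementation instead.

```python
from statistics import mean, median


def regression_metrics(y_true, y_pred):
    """Compute the five regression metrics for paired true/predicted values.

    Assumes y_true is not constant (otherwise the R2 denominator is zero).
    """
    errors = [t - p for t, p in zip(y_true, y_pred)]
    y_bar = mean(y_true)
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)   # total sum of squares
    ss_res = sum(e ** 2 for e in errors)             # residual sum of squares
    e_bar = mean(errors)
    ss_err_centered = sum((e - e_bar) ** 2 for e in errors)
    return {
        "median_absolute_error": median(abs(e) for e in errors),
        "mean_squared_error": ss_res / len(errors),
        "r2": 1 - ss_res / ss_tot,
        # Same as R2 except that a constant bias in the errors is not penalised.
        "explained_variance": 1 - ss_err_centered / ss_tot,
        "mean_absolute_error": mean(abs(e) for e in errors),
    }


# Made-up calorie values, for illustration only.
scores = regression_metrics([2000, 2500, 3000, 3500], [2100, 2400, 3100, 3300])
```

R2 and explained variance are higher-is-better, while the three error metrics are lower-is-better, which is worth keeping in mind when comparing the stored values across models.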
Description Of Input File Attributes:
- Id: The customer ID
- ActivityDate: The date for which the activity is getting tracked.
- TotalSteps: Total Steps taken on that day.
- TotalDistance: Total distance covered.
- TrackerDistance: Distance as per the tracker
- LoggedActivitiesDistance: The distance from logged activities.
- VeryActiveDistance: The distance for which the user was the most active.
- ModeratelyActiveDistance: The distance for which the user was moderately active.
- LightActiveDistance: The distance for which the user was the least active.
- SedentaryActiveDistance: The distance for which the user was almost inactive.
- VeryActiveMinutes: The number of minutes for the most activity.
- FairlyActiveMinutes: The number of minutes for moderate activity.
- LightlyActiveMinutes: The number of minutes for the least activity
- SedentaryMinutes: The number of minutes for almost no activity
- Calories(Target): The calories burnt.
Apart from training files, we also require a "schema" file from the client, which contains all the relevant information about the training files such as: Name of the files, Length of Date value in FileName, Length of Time value in FileName, Number of Columns, Name of the Columns, and their datatype.
List of data validation performed before data preprocessing stage:
- Name Validation - We validate the name of the files based on the given name in the schema file. We have created a regex pattern as per the name given in the schema file to use for validation. After validating the pattern in the name, we check for the length of date in the file name as well as the length of time in the file name. If all the values are as per requirement, we move such files to "Good_Data_Folder" else we move such files to "Bad_Data_Folder."
- Number of Columns - We validate the number of columns present in the files, and if it doesn't match with the value given in the schema file, then the file is moved to "Bad_Data_Folder."
- Name of Columns - The name of the columns is validated and should be the same as given in the schema file. If not, then the file is moved to "Bad_Data_Folder".
- Datatype of Columns - The datatype of each column is given in the schema file. This is validated when we insert the files into the database. If the datatype is wrong, then the file is moved to "Bad_Data_Folder".
- Null values in columns - If any of the columns in a file have all the values as NULL or missing, we discard such a file and move it to "Bad_Data_Folder".
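
The name, column and null-value checks above can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the schema keys, the `calories_<date>_<time>.csv` file name pattern, the three-column layout and the function names are hypothetical and are not taken from the actual schema file.

```python
import re

# Hypothetical schema; the real schema file's keys and values may differ.
SCHEMA = {
    "date_length": 8,
    "time_length": 6,
    "column_count": 3,
    "columns": ["Id", "TotalSteps", "Calories"],
}

# Assumed naming convention: calories_<date digits>_<time digits>.csv
NAME_PATTERN = re.compile(r"calories_(\d+)_(\d+)\.csv")


def name_is_valid(file_name, schema=SCHEMA):
    """Check the name pattern, then the lengths of the date and time parts."""
    match = NAME_PATTERN.fullmatch(file_name)
    if match is None:
        return False
    date_part, time_part = match.groups()
    return (len(date_part) == schema["date_length"]
            and len(time_part) == schema["time_length"])


def columns_are_valid(header, rows, schema=SCHEMA):
    """Check column count, column names, and that no column is entirely missing."""
    if len(header) != schema["column_count"]:
        return False
    if list(header) != schema["columns"]:
        return False
    for i in range(len(header)):
        # A file with no rows is also rejected, since every column is empty.
        if all(row[i] in ("", "NULL", None) for row in rows):
            return False
    return True


def route_file(file_name, header, rows, schema=SCHEMA):
    """Return the folder a file would be moved to after validation."""
    ok = name_is_valid(file_name, schema) and columns_are_valid(header, rows, schema)
    return "Good_Data_Folder" if ok else "Bad_Data_Folder"
```

The datatype check is left out of this sketch because, as described above, it happens at database insertion time.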
- After the initial set of validations, the data is inserted into a single table, 'Good_Data'.
- MongoDB Atlas is used to store all the data.
- Data Export from Db - The data stored in the database is exported as a CSV file to be used for model training.
- Data Preprocessing
a. Drop columns not useful for training the model. Such columns were selected while doing the EDA.
b. Replace the invalid values with numpy “nan” so we can use imputer on such values.
c. Check for null values in the columns. If present, impute the null values.
d. Scale the training and test data separately.
- Clustering - The KMeans algorithm is used to create clusters in the preprocessed data. The optimum number of clusters is selected from the elbow plot, and for dynamic selection of the number of clusters we use the "KneeLocator" function. The idea behind clustering is to train different algorithms on the data in different clusters. The KMeans model is trained on the preprocessed data and saved for further use in prediction.
- Model Selection - After the clusters are created, we find the best model for each cluster. We use five algorithms: "RandomForest Regressor", "XGBoost Regressor", "DecisionTree Regressor", "K-Nearest Neighbors" and "SGDRegressor". For each cluster, all five algorithms are trained with the best parameters derived from GridSearch. We calculate the R-squared score for each model and select the model with the best score. A model is selected in the same way for every cluster, and all the selected models are saved for use in prediction.
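
The clustering and per-cluster model selection steps can be sketched as below, assuming scikit-learn and NumPy are available. This is a simplified illustration, not the project's code: the number of clusters is passed in directly rather than found with "KneeLocator", only three of the candidate algorithms are included, the grid search is omitted, and the function name `train_per_cluster` and the synthetic data are invented for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor


def train_per_cluster(X, y, n_clusters, random_state=0):
    """Fit KMeans, then keep the candidate with the best held-out R2 per cluster."""
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(X)
    labels = kmeans.labels_
    best_models = {}
    for cluster in range(n_clusters):
        mask = labels == cluster
        X_tr, X_te, y_tr, y_te = train_test_split(
            X[mask], y[mask], test_size=0.3, random_state=random_state
        )
        # Subset of the candidate algorithms, with default hyperparameters.
        candidates = {
            "RandomForest": RandomForestRegressor(n_estimators=50, random_state=random_state),
            "DecisionTree": DecisionTreeRegressor(random_state=random_state),
            "KNN": KNeighborsRegressor(),
        }
        scored = []
        for name, model in candidates.items():
            model.fit(X_tr, y_tr)
            scored.append((r2_score(y_te, model.predict(X_te)), name, model))
        best_score, best_name, best_model = max(scored, key=lambda s: s[0])
        best_models[cluster] = (best_name, best_model, best_score)
    return kmeans, best_models


# Synthetic data with two well-separated groups, for illustration only.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 3)), rng.normal(10, 1, (60, 3))])
y = X.sum(axis=1) * 50 + rng.normal(0, 5, 120)

kmeans, best_models = train_per_cluster(X, y, n_clusters=2)
```

In a full pipeline, the fitted `kmeans` object and each selected model would then be persisted so that the prediction pipeline can load them later.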
- Data Export from Db - The data in the stored database is exported as a CSV file to be used for prediction.
- Data Preprocessing
a. Drop columns not useful for training the model. Such columns were selected while doing the EDA.
b. Replace the invalid values with numpy “nan” so we can use imputer on such values.
c. Check for null values in the columns. If present, impute the null values.
d. Scale the prediction data.
- Clustering - The KMeans model created during training is loaded, and the clusters for the preprocessed prediction data are predicted.
- Prediction - Based on the cluster number, the respective model is loaded and is used to predict the data for that cluster.
- Once the predictions are made for all the clusters, the predictions, along with the original names before label encoding, are saved in a CSV file at a given location, and the location is returned to the client.
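
The cluster-based routing in the prediction steps above can be sketched as follows, assuming scikit-learn and NumPy are available. To keep the example self-contained, a small KMeans model and one regressor per cluster are fitted inline on synthetic data; in the actual pipeline these would be the saved models loaded from storage. The function name `predict_by_cluster` is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeRegressor


def predict_by_cluster(X, kmeans, models):
    """Assign each row to a cluster, then predict with that cluster's model."""
    clusters = kmeans.predict(X)
    predictions = np.empty(len(X), dtype=float)
    for cluster in np.unique(clusters):
        mask = clusters == cluster
        predictions[mask] = models[int(cluster)].predict(X[mask])
    return clusters, predictions


# Stand-in for the saved training artifacts (synthetic data, illustration only).
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))])
y_train = X_train.sum(axis=1)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_train)
models = {
    c: DecisionTreeRegressor(random_state=0).fit(
        X_train[kmeans.labels_ == c], y_train[kmeans.labels_ == c]
    )
    for c in range(2)
}

# Two new rows, one near each group.
X_new = np.array([[0.0, 0.0], [10.0, 10.0]])
clusters, predictions = predict_by_cluster(X_new, kmeans, models)
```

Filling a preallocated array through boolean masks keeps the predictions in the same row order as the input, which makes it straightforward to write them out alongside the original records.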