Welcome to my portfolio of the tasks I carried out and contributed to during the minor Applied Data Science at The Hague University of Applied Sciences.
Name: Hassan Ali Student number: 17021308
Jargon | Description |
---|---|
Raw data | Data from the Flock of Birds (FoB) system: X, Y, Z coordinates of each sensor |
Converted data | Sensor data transformed into rotation angles between bones |
During the first few weeks of the minor I mostly worked on my Python skills. I finished the DataCamp courses, which helped me a lot during the project. See my DataCamp statements of accomplishment here. Furthermore, I did a lot of research on how to use certain code and how to write object-oriented Python code. I would like to thank my fellow students with a programming background, who helped me write better code.
The machine learning lectures helped me understand the bigger picture of machine learning and how to apply different techniques. To keep up with all the different terms in data science, and machine learning in particular, I made a list of the terms that were new to me. You can find this list here. To practise what we learned in the machine learning lectures, I built a classifier that distinguishes the left and right arm in the converted data. For this, see this Reader.
I also followed this online course on Udemy. From this course I learned the basic concepts of deep neural networks. For the summary I wrote of this course, see this link. I completed the entire course except the business case. For the exercises I did during this course, see this link. Apart from this course, I read multiple articles on convolutional neural networks and on how to optimize their hyperparameters. For more on the neural networks I applied during this project, see chapter 7.
Project Ortho Eyes is a collaboration between The Hague University of Applied Sciences and Leiden University Medical Center (LUMC). The project focuses on improving the treatment and diagnosis of musculoskeletal issues, in particular issues related to shoulder disabilities. When treating patients with limited shoulder movement, physiotherapists use either a protractor or a sensor system to measure the limitations of the shoulder. The former is inaccurate, while the latter is time-consuming.
The long-term goal of the project is therefore to find an easy and accurate measurement system. The short-term goal is to answer two questions: Is it possible to cluster patients into groups with similar complaints and/or similar diagnoses based on the Flock of Birds data? Which parameters are used for this clustering?
This is not the first iteration of the project; there have been two earlier iterations. After fact-checking the research done by the previous group, we found that they had made assumptions about the columns (bone names) and about the labeling of the exercises. That is why we wanted to redo their analysis using data labeled by LUMC physicians to categorize the patient groups.
Furthermore, at the beginning of the semester I wrote a cooperation agreement to help us as a project group finish the project successfully. As mentioned in the agreement, we used Microsoft DevOps as a Scrum tool. How Scrum is used is explained in the cooperation agreement.
Here are the tasks that I did during this minor.
ID | Work Item Type | Title | State | Area Path | Changed Date |
---|---|---|---|---|---|
191 | Task | Data: CNN datashape | To Do | Data Science | 12/16/2019 2:05 PM |
183 | Task | Data: Remove Idle | Doing | Data Science | 1/10/2020 9:53 AM |
175 | Task | Find the best architecture for CNN | Doing | Data Science | 12/20/2019 9:37 AM |
169 | Task | Log result of the CNN script | Done | Data Science | 12/12/2019 6:32 AM |
155 | Task | Understand how to train and improve RNN's | Done | Data Science | 12/3/2019 7:47 PM |
148 | Task | Listing reached goals | Done | Data Science | 11/19/2019 2:12 PM |
134 | Task | Check if the idle at the start and end of an exercise is an anomaly | Done | Data Science | 11/15/2019 9:07 AM |
121 | Task | creating dataframe of 650 columns for the ml model | Done | Data Science | 10/24/2019 9:01 PM |
120 | Task | read in filenames, files, and create metadata. | Done | Data Science | 10/24/2019 9:01 PM |
106 | Task | I would like to make a presentation on how the data is prepared for ML models | Done | Data Science | 10/31/2019 9:57 AM |
101 | Task | Write and train a ML model on our data | Done | Data Science | 10/31/2019 9:55 AM |
81 | Task | Understand the conversion of the original data to CSV. The conversion with MATLAB | Done | Data Science | 12/16/2019 1:13 PM |
69 | Task | What kind of parameters are (ideally) used by the doctors / researchers? | Done | Data Science | 10/12/2019 9:22 AM |
33 | Task | General projectplanning | Done | Data Science | 9/11/2019 2:08 PM |
31 | Task | Cooperation agreement | Done | Data Science | 9/11/2019 2:07 PM |
27 | Task | define the process used to clean the data | Done | Data Science | 9/16/2019 9:03 AM |
22 | Task | Read paper | Done | Data Science | 9/16/2019 9:03 AM |
10 | Task | Hassan | Done | Data Science | 9/6/2019 9:30 PM |
In this project we use motion data obtained from the Laboratory for Kinematics and Neuromechanics (LK&N) of the LUMC. The data is recorded using the Flock of Birds (FoB) system, a six-degrees-of-freedom electromagnetic measurement system that measures the position and orientation of targets.
The FoB sensors are placed at fixed positions on a patient, and the patient performs exercises as instructed by a physician. The sensors then return the position (X, Y, Z coordinates) of each sensor. This raw data is later converted by the LUMC into rotation angles relative to each bone.
See the figure below, made by Vincent, a member of Ortho Eyes 2018/2019.
The dataset consists of four patient groups with similar complaints and/or diagnoses. Each patient group consists of multiple patients, and each patient has done multiple exercises. There are five main exercises that all the patients have in common:
Abbreviation | Description |
---|---|
AB | Abduction |
AF | Anteflexion |
RF | Retroflexion |
EH | Endo/Exorotation coronal |
EL | Endo/Exorotation humerus |
The visualization below made by Raphi and Eddie helped me understand how the data is represented.
The data cleaning steps are:
Type | What it does |
---|---|
Removing idle | Removing stationary data at the start and end of exercises |
Splitting double exercises | Detecting and splitting double exercises recorded in one file |
Detecting wrongly named exercises | Detecting files that are named incorrectly |
After inspecting the sensor data, we noticed that almost every exercise contained an idle period at its beginning and end. An idle period arises between the moment the physician starts the recording and the moment the patient actually starts the exercise, and again between the moment the patient stops the exercise and the moment the physician stops the recording. During these periods the sensors register an almost stationary pose that is not part of the exercise.
Removing the idle was one of the tasks I did during this minor. To remove the idle I developed a script that detects, at the start and end of an exercise, whether the movement is below or above the mean of the data. For more on this, see this Reader.
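The idea behind idle removal can be illustrated with a minimal sketch. It assumes a single rotation-angle series per exercise and uses the mean frame-to-frame movement as the threshold; the function name `remove_idle`, the sample values, and the exact threshold rule are illustrative assumptions and not the project script.

```python
from statistics import mean


def remove_idle(signal):
    """Trim near-stationary frames from both ends of a 1-D series."""
    if len(signal) < 3:
        return list(signal)
    # Frame-to-frame movement: absolute change between consecutive samples.
    movement = [abs(b - a) for a, b in zip(signal, signal[1:])]
    # Assumption: frames moving less than the mean movement count as idle.
    threshold = mean(movement)
    active = [i for i, m in enumerate(movement) if m > threshold]
    if not active:
        return list(signal)  # nothing exceeds the threshold, keep everything
    # movement[i] is the step from frame i to frame i + 1.
    start, end = active[0], active[-1] + 1
    return list(signal[start:end + 1])


# Hypothetical angle series: idle, movement, idle.
angles = [10.0, 10.0, 10.1, 10.1, 15.0, 25.0, 40.0, 55.0, 60.0, 60.1, 60.1, 60.1]
trimmed = remove_idle(angles)
print(trimmed)
```

Only the leading and trailing frames are trimmed, so a short pause in the middle of an exercise is kept.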
These are the data enrichments we did.
Type | What it does |
---|---|
Default (n frames) | Taking n frames (the exercise length) that are evenly spaced from each exercise |
Resample exercises | Resampling all the exercises to a fixed number of frames (the exercise length) |
Frame generator | Selecting more data points before and after each of the n frames |
Occupied space (360) | The spatial description of an exercise: the movement of each exercise in 360 space |
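The first two enrichments in the table can be sketched as follows. This is a hedged illustration: the function names `evenly_spaced_frames` and `resample` are hypothetical, and linear interpolation is an assumption about how the resampling could be done, not a description of the project code.

```python
def evenly_spaced_frames(frames, n):
    """Pick n frames at evenly spaced indices (the 'Default (n frames)' idea)."""
    if n == 1:
        return [frames[0]]
    last = len(frames) - 1
    return [frames[round(i * last / (n - 1))] for i in range(n)]


def resample(values, n):
    """Linearly interpolate a 1-D series to exactly n frames."""
    if n == 1 or len(values) == 1:
        return [values[0]] * n
    last = len(values) - 1
    out = []
    for i in range(n):
        pos = i * last / (n - 1)   # fractional position in the original series
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(values[lo] * (1 - frac) + values[hi] * frac)
    return out


picked = evenly_spaced_frames(list(range(10)), 4)
stretched = resample([0.0, 10.0], 5)
print(picked, stretched)
```

Selecting frames keeps only values that were actually measured, while resampling can also lengthen a short exercise, which is why it produces the fixed number of rows needed later.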
A convolutional neural network (CNN) is a deep learning algorithm that can classify images by assigning importance to various aspects of the image. After reading this paper, which used convolutional neural networks to classify motion data, I was motivated to try this technique. I used convolutional neural networks to classify patients into groups with similar complaints using the Flock of Birds data.
Since CNNs work well with image data, I created image data from the resampled exercises. For more on this, see data preparation. I used TensorFlow to build the model, because I had taken a course on this library and there is a lot of information about it online.
After trying other options, I found that classifying the patient groups at exercise level worked best for the CNN without introducing bias.
These are the steps I took to prepare the image data.
- Use the resampled exercise data, since it has a fixed number of rows (frames)
- Generate image data for each exercise
- Reshape the features that belong together into an RGB format
- Reshape the x, y, z coordinates of one body part into an RGB format
- Place a decoder (colour bar) between the bones so the model can better differentiate between them
- Normalize the data to values between 0 and 1, which helps the model train
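The steps above can be sketched in plain Python. The function name `to_rgb_image`, the white separator colour, and the use of one global min-max range for normalization are assumptions made for illustration; the project code may differ on all three.

```python
def to_rgb_image(frames, separator=(1.0, 1.0, 1.0)):
    """Turn an exercise into an image: rows are frames, pixels are bones.

    frames: list of frames, each a list of (x, y, z) tuples, one per bone.
    The x, y, z values of a bone become the R, G, B channels of one pixel.
    """
    values = [v for frame in frames for bone in frame for v in bone]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant data
    image = []
    for frame in frames:
        row = []
        for index, bone in enumerate(frame):
            if index > 0:
                row.append(separator)  # colour bar between neighbouring bones
            # Min-max normalization to the range [0, 1].
            row.append(tuple((v - lo) / span for v in bone))
        image.append(row)
    return image


# Two hypothetical frames with two bones each.
frames = [
    [(0.0, 5.0, 10.0), (10.0, 0.0, 5.0)],
    [(5.0, 5.0, 5.0), (0.0, 10.0, 0.0)],
]
image = to_rgb_image(frames)
print(image[0])
```

Because the resampled exercises all have the same number of frames, every image produced this way has the same height and width.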
I also prepared the motion dataset in other shapes, for instance a dataset with velocity. However, I was not able to build a model for each and every data representation. Arjun worked with me on this part; he helped me prepare the different data representations. See this code for the data preparation.
The figure below is a representation of the model I configured to read the image data. The model consists of two main parts: the feature learning part and the classification part.
These are the different layers of the model and what they do:
Layer | Description |
---|---|
Conv2D | The convolutional layer filters (convolves) the input data to extract useful information (feature maps). |
Maxpooling | The max-pooling layer reduces the dimensionality of the convolved data by keeping only the maximum values. |
Dropout | During training a percentage (40%) of the neurons is deactivated to counter overfitting and force the model to generalize. |
Flatten | Flattens the output of the last layer into a vector so that it can be fed into a dense layer. |
Fully connected layer (FC layer) | The fully connected layer combines all the features in order to predict the right patient group. |
Output | After the fully connected layer, I used the softmax function as the activation function of the output layer. The softmax function interprets the final activations produced by the FC layer as probabilities. |
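To make two of the layers in the table concrete, here is a small plain-Python sketch of 2x2 max pooling and of the softmax function. It only illustrates what these operations compute; the helper names and the sample numbers are hypothetical, and the actual model uses the TensorFlow layers rather than this code.

```python
import math


def max_pool_2x2(feature_map):
    """Keep the maximum of every non-overlapping 2x2 block."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, cols - 1, 2)
        ]
        for r in range(0, rows - 1, 2)
    ]


def softmax(logits):
    """Turn raw output activations into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum keeps exp() numerically stable
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
pooled = max_pool_2x2(feature_map)

# One activation per patient group (four groups in this dataset).
probabilities = softmax([1.0, 2.0, 3.0, 0.5])
predicted_group = probabilities.index(max(probabilities))
print(pooled, predicted_group)
```

The predicted patient group is the index with the highest probability, which is how the softmax output is usually turned into a class label.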
Kernel size - Since the data does not have the same width and height, I chose a rectangular kernel instead of the usual square kernel. An even kernel size was also advised by most of the documents I read.
Dropout percentage - After trial and error I found 40% to be a good percentage for the dropout layer.
Adam vs SGD - After trial and error I found Adam to be the better optimizer, since it converged to a minimum much faster.
Loss function - As advised by TensorFlow, I used sparse categorical crossentropy to compute the cross-entropy loss between the labels and the predictions.
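What the sparse categorical crossentropy loss computes can be written out by hand. The sketch below is an illustration in plain Python, not the TensorFlow implementation: the function name, the clipping constant `epsilon`, and the sample predictions are assumptions.

```python
import math


def sparse_categorical_crossentropy(labels, probabilities, epsilon=1e-7):
    """Mean negative log-probability assigned to the true class.

    labels: integer class indices (hence 'sparse', no one-hot encoding).
    probabilities: one list of class probabilities per sample.
    """
    losses = [
        -math.log(max(p[y], epsilon))  # clip to avoid log(0)
        for y, p in zip(labels, probabilities)
    ]
    return sum(losses) / len(losses)


# Two hypothetical samples with four patient groups each.
labels = [0, 1]
predictions = [
    [1.0, 0.0, 0.0, 0.0],      # fully confident and correct: loss 0
    [0.25, 0.25, 0.25, 0.25],  # uniform guess: loss log(4)
]
loss = sparse_categorical_crossentropy(labels, predictions)
print(loss)
```

The loss is low when the model puts high probability on the correct patient group and grows quickly when the correct group gets a probability close to zero.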
See this link for the complete code of the model.
These are the results of the model.
Accuracy | Precision | Recall | F1-score |
---|---|---|---|
0.726 | 0.657 | 0.605 | 0.630 |
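As a quick check on how these metrics relate, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall; the function name `f1_score` is just an illustrative helper, not a library call.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall taken from the results table above.
f1 = f1_score(0.657, 0.605)
print(round(f1, 3))  # 0.63, consistent with the reported F1-score of 0.630
```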
The performance measured on the validation data shows that the model performed as expected. The validation loss decreased while the accuracy increased, which indicates that the model is not overfitting. The overall performance of the model is measured on the test data and is reported in the results table of this section.
These are the predictions per class on the training dataset. I used Excel to calculate these predictions.
For our research we came up with the main question below.
To what extent, and in what way, can different supervised data science techniques be used on kinematic recordings to contribute to a more valid and more reliable diagnosis, made by a doctor, of shoulder disability?
Furthermore, we came up with a list of sub-questions that would assist us in answering our main question. For the sub-questions, see this link.
Here is the link to the research paper we wrote: Research paper
For the contributions to the research see my self reflection.
In the research I did with my fellow students, I found that even after extensive data cleaning, logistic regression was not able to distinguish patient group 2 from patient group 3. When I tried a CNN on the same cleaned dataset, it gave better results in differentiating patient group 2 from patient group 3. That is why I recommend that the next project group look at CNNs as an alternative to logistic regression.
Given the research question mentioned above, after extensive data cleaning and data normalization, logistic regression gave an accuracy of 69% and a precision and recall of:
This shows that logistic regression is able to classify (cluster) patients into groups with similar complaints and/or similar diagnoses based on the Flock of Birds (FoB) data.
These are the presentations I gave during this minor.
See this Reader for all my git commits on the project.
For the self reflection see this link.