1. Task: Classify the given genetic variations/mutations based on evidence from text-based clinical literature
- Let's take an example: a person shows symptoms of cancer and visits a hospital, where the tumor is removed and then genetically sequenced.
- Sequencing the tumor gives us the genes and their mutations/variations. Even a small variation in a gene can cause cancer by disrupting how that gene normally works.
- Note: Not all mutations cause cancer; only a few of them do.
- So this is what we are going to do: using the features (gene, gene mutation), we will predict which of the 9 different class labels a mutation belongs to, and from that determine which mutations actually cause cancer.
- Take a domain expert who understands gene mutations and selects a list of [genetic variations] to analyze. Each genetic variation has 2 parts (1: gene {on which the variation occurs} 2: variation {the exact variation}). Both can be treated as [categorical random variables].
- The domain expert then searches for and collects all the research papers/evidence/text published on these gene variations.
- The expert spends time analyzing the text [i.e. evidence / research work] and then determines which of the classes (1, 2, 3, 4, ..., 9) the genetic variation belongs to.
Information gathered from: https://www.kaggle.com/c/msk-redefining-cancer-treatment/discussion/35336#198462
- Training variants (ID, Gene, Variation, Class label)
- Training_Text (ID, TEXT)
There are nine different classes a genetic mutation can be classified into => Multi class classification problem
- Multi-class Log loss
- Confusion matrix
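The two metrics listed above can be sketched in plain Python to show what they measure. This is only an illustration and not the notebook's own code: the function names `multiclass_log_loss` and `confusion_matrix`, the clipping constant `eps`, and the toy labels are all assumptions made for this example.

```python
import math

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean negative log of the probability assigned to the true class.

    y_true: class labels in 1..9
    probs:  one list of 9 predicted probabilities per data point
    """
    total = 0.0
    for label, p in zip(y_true, probs):
        # Clip so that a probability of exactly 0 or 1 never reaches log()
        p_true = min(max(p[label - 1], eps), 1 - eps)
        total += -math.log(p_true)
    return total / len(y_true)

def confusion_matrix(y_true, y_pred, n_classes=9):
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t - 1][p - 1] += 1
    return cm

# Toy labels (hypothetical, not taken from the dataset)
y_true = [1, 4, 7, 7, 2]

# A model that knows nothing and predicts 1/9 for every class
uniform = [[1 / 9] * 9 for _ in y_true]
print(multiclass_log_loss(y_true, uniform))  # equals log(9), about 2.197

# One class-7 point is mistaken for class 2
y_pred = [1, 4, 7, 2, 2]
cm = confusion_matrix(y_true, y_pred)
print(cm[6])  # row for true class 7
```

The uniform prediction gives a useful reference point: with 9 classes, a model is only learning something if its log loss falls below log(9). The confusion matrix complements this by showing which classes get mistaken for which.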
Observation: We used a stratified train/test/CV split, so the class distribution is nearly the same in all 3 plots. From the histograms we can conclude that class labels {7, 4, 1, 2} make up more than 60-70% of the data, so this is an imbalanced dataset.
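To make the idea of a stratified split concrete, here is a minimal plain-Python sketch. The notebook's own split code is not shown in this section, so the helper names `stratified_split` and `class_shares`, the `test_frac` and `seed` parameters, and the toy labels are all assumptions for illustration. In practice a library helper would normally do this.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split in the same ratio."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        # Shuffle within the class, then cut off the test portion
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

def class_shares(labels):
    """Fraction of the data that falls into each class."""
    counts = Counter(labels)
    return {y: counts[y] / len(labels) for y in sorted(counts)}

# Hypothetical imbalanced labels, not the real data
labels = [7] * 50 + [4] * 30 + [1] * 15 + [3] * 5
train_idx, test_idx = stratified_split(labels, test_frac=0.2)

print(class_shares([labels[i] for i in train_idx]))
print(class_shares([labels[i] for i in test_idx]))
```

Both printed dictionaries show the same class shares, which is what makes the histograms of the individual splits look alike. The same `class_shares` helper can be used to check how much of the data a handful of classes account for.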
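import numpy as np
import matplotlib.pyplot as plt

# Number of training points per gene, sorted from most to least frequent
unique_genes = train_df["Gene"].value_counts()
print(unique_genes.shape)
# (240,)
# Fraction of the data covered by each gene, then the running total
s = unique_genes.values.sum()
h = unique_genes.values / s
c = np.cumsum(h)
plt.plot(c, label='Cumulative distribution of Genes')
plt.grid()
plt.legend()
plt.show()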
Observation: Out of the 240 unique genes (per the shape printed above), the top 50 genes alone account for more than 70% of the data, and all the remaining genes together for only about 30%.
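The observation above is read off the cumulative plot by eye. As a rough sketch, the same number can be computed directly: count how many of the most frequent categories are needed before their cumulative share reaches a target. The helper name `categories_needed` and the toy gene list below are assumptions for illustration, not part of the notebook.

```python
from collections import Counter
from itertools import accumulate

def categories_needed(values, target=0.7):
    """Smallest number of most-frequent categories covering `target` of the rows."""
    # Counts sorted from most to least frequent, like value_counts()
    counts = [c for _, c in Counter(values).most_common()]
    total = sum(counts)
    for k, running in enumerate(accumulate(counts), start=1):
        if running / total >= target:
            return k
    return len(counts)

# Hypothetical gene column, not the real data
genes = ["BRCA1"] * 5 + ["TP53"] * 3 + ["EGFR"] * 1 + ["PTEN"] * 1
print(categories_needed(genes, target=0.7))  # 2, since 5/10 < 0.7 <= 8/10
```

Applied to the real `Gene` column (for example `categories_needed(train_df["Gene"], 0.7)`), this would give the exact count behind the "top 50 genes" estimate.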
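# Number of training points per variation, sorted from most to least frequent
unique_variation = train_df["Variation"].value_counts()
print(unique_variation.shape)
# (1941,)
# Fraction of the data covered by each variation, then the running total
s = unique_variation.values.sum()
h = unique_variation.values / s
c = np.cumsum(h)
plt.plot(c, label='Cumulative distribution of Variations')
plt.grid()
plt.legend()
plt.show()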