Personalized Cancer Diagnosis

1. Task: Classify the given genetic variations/mutations based on evidence from text-based clinical literature

1.1 Detailed explanation:

Take an example: a person shows symptoms of cancer and visits a hospital, where the tumor is removed and then genetically sequenced.
Sequencing the tumor gives us the genes and their mutations/variations. Even a small variation in a gene can cause cancer by disrupting the normal genetic machinery of the cell.
Note: Not all mutations cause cancer; only a few of them do.
So, using the features (gene, gene mutation), we predict one of 9 different class labels for each mutation and from that determine which mutations actually cause cancer.

1.2 Workflow:

A domain expert who understands gene mutations selects a list of genetic variations to analyze. Each genetic variation has 2 parts (1: gene {the gene in which the variation occurs} 2: variation {the exact variation}). Both can be treated as categorical random variables.
The domain expert then searches for and collects all research papers/evidence/text published on these gene variations.
The expert spends time analyzing the text [i.e. evidence / research work] and then determines which of the classes (1, 2, 3, ..., 9) the variation belongs to.

Information gathered from: https://www.kaggle.com/c/msk-redefining-cancer-treatment/discussion/35336#198462

1.3 Data overview:

Source: https://www.kaggle.com/c/msk-redefining-cancer-treatment/data

Training variants (ID, Gene, Variation, Class label)
Training text (ID, TEXT)
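
The two files can be joined on ID before any analysis. The sketch below is dependency-free and assumes the Kaggle layout: a comma-separated variants file with the header ID,Gene,Variation,Class, and a text file whose rows are `ID||text` after one header line. The sample rows, the helper names and the `TEXT` key are made up for illustration; the notebook itself works with a pandas DataFrame (`train_df`), whose loading code is not shown here.

```python
import csv
import io

# Tiny in-memory stand-ins for the two files (made-up rows, illustration only).
VARIANTS_CSV = "ID,Gene,Variation,Class\n0,GENE_A,V1,1\n1,GENE_B,V2,2\n"
TEXT_FILE = "ID,Text\n0||First abstract, with a comma.\n1||Second abstract.\n"


def read_variants(handle):
    """Parse the comma-separated variants file into a dict keyed by integer ID."""
    return {int(row["ID"]): row for row in csv.DictReader(handle)}


def read_text(handle):
    """Parse the text file, assumed to be 'ID||text' per line after a header line."""
    next(handle)  # skip the header line
    texts = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        row_id, text = line.split("||", 1)  # split once: the text may contain commas
        texts[int(row_id)] = text
    return texts


def merge_on_id(variants, texts):
    """Attach each variant's text under the key 'TEXT' (empty string if missing)."""
    return [dict(row, TEXT=texts.get(row_id, "")) for row_id, row in sorted(variants.items())]


rows = merge_on_id(read_variants(io.StringIO(VARIANTS_CSV)), read_text(io.StringIO(TEXT_FILE)))
print(rows[0]["Gene"], rows[0]["Class"], rows[0]["TEXT"])
```

Splitting on the first `||` only is the important detail: the literature text is free-form and full of commas, so it cannot be parsed as an ordinary CSV column.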

1.4 Type of Machine learning problem

There are nine different classes a genetic mutation can be classified into => Multi class classification problem

1.5 Performance Metric

Multi-class Log loss
Confusion matrix
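
As a reminder of what these two metrics compute, here is a minimal from-scratch sketch. Multi-class log loss is the mean of -log(p) over all points, where p is the probability the model assigned to the true class. The function names and toy inputs are illustrative assumptions, not code from the notebook.

```python
import math


def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    y_true: list of class labels in 1..K
    y_prob: list of K-length probability rows (each row sums to 1)
    """
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        p = min(max(probs[label - 1], eps), 1 - eps)  # clip to avoid log(0)
        total += -math.log(p)
    return total / len(y_true)


def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes (labels in 1..K)."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t - 1][p - 1] += 1
    return matrix


K = 9
uniform = [[1.0 / K] * K for _ in range(4)]
baseline = multiclass_log_loss([1, 4, 7, 2], uniform)
print(baseline)  # a uniform guess over 9 classes scores ln(9)
print(confusion_matrix([1, 2, 2], [1, 2, 1], n_classes=3))
```

The uniform-guess value, ln(9) ≈ 2.197, is a handy baseline: a model is only useful here if its log loss is clearly below it.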

2. Exploratory Data Analysis

2.1 Distribution of Train, Test, CV data (64:20:16)

Distribution of yi in Train

Distribution of yi in Test

Distribution of yi in CV

Observation: We used a stratified Train/Test/CV split, so the class distribution is nearly the same in all 3 plots. The histograms show that class labels {7, 4, 1, 2} make up more than 60%-70% of the data, so this is an imbalanced dataset.
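
A 64:20:16 split can be obtained by holding out 20% for Test and then holding out 20% of the remaining 80% for CV (0.8 × 0.2 = 0.16). The sketch below shows the idea with a small hand-written stratified splitter; the function name, seeds and labels are made up for illustration. In practice scikit-learn's `train_test_split` with its `stratify` argument is the usual tool, and the notebook's exact split code is not shown here.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_frac, seed=0):
    """Return (rest_idx, held_idx), holding out about test_frac of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rest, held = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n_held = round(len(idxs) * test_frac)
        held.extend(idxs[:n_held])
        rest.extend(idxs[n_held:])
    return sorted(rest), sorted(held)


# Made-up imbalanced labels, for illustration only.
labels = [7] * 500 + [4] * 300 + [1] * 200

# Step 1: hold out 20% of every class as Test.
train_cv_idx, test_idx = stratified_split(labels, test_frac=0.20)

# Step 2: hold out 20% of what is left as CV (16% of the full data).
sub_labels = [labels[i] for i in train_cv_idx]
train_pos, cv_pos = stratified_split(sub_labels, test_frac=0.20, seed=1)
train_idx = [train_cv_idx[i] for i in train_pos]
cv_idx = [train_cv_idx[i] for i in cv_pos]

for name, idx in [("train", train_idx), ("test", test_idx), ("cv", cv_idx)]:
    print(name, len(idx), Counter(labels[i] for i in idx))
```

Because every class is split separately, each of the three sets keeps the same class proportions as the full data, which is why the three histograms look alike.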

2.1 Univariate Analysis

2.1.2 Univariate Analysis on Gene feature

Q. How many categories are there and how are they distributed?

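# Count how many training rows each gene has (sorted, most frequent first)
unique_genes = train_df["Gene"].value_counts()
print(unique_genes.shape)

(240,)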

Plot the CDF of the Gene feature distribution

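import numpy as np
import matplotlib.pyplot as plt

# Fraction of rows per gene, most frequent first (value_counts sorts descending)
s = sum(unique_genes.values)
h = unique_genes.values/s
# Running total of those fractions, i.e. the cumulative distribution
c = np.cumsum(h)
plt.plot(c,label='Cumulative distribution of Genes')
plt.grid()
plt.legend()
plt.show()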

Observation: Of all the unique genes in the training data, the top 50 alone account for more than 70% of the rows; all the remaining genes together cover only about 30%.
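
A statement like "the top k categories cover x% of the data" can also be read off numerically instead of from the plot. The sketch below is dependency-free and uses a made-up gene column; the helper names are illustrative assumptions and do not appear in the notebook.

```python
from collections import Counter
from itertools import accumulate


def cumulative_share(values):
    """Cumulative fraction of rows covered by categories, most frequent first."""
    counts = [c for _, c in Counter(values).most_common()]
    total = sum(counts)
    return [running / total for running in accumulate(counts)]


def categories_needed(values, share):
    """Smallest number of top categories whose rows cover at least `share`."""
    for k, covered in enumerate(cumulative_share(values), start=1):
        if covered >= share:
            return k
    return len(set(values))


# Made-up gene column, for illustration only.
genes = ["G1"] * 50 + ["G2"] * 30 + ["G3"] * 10 + ["G4"] * 5 + ["G5"] * 5
shares = cumulative_share(genes)
needed = categories_needed(genes, 0.70)
print(shares)
print(needed)
```

On the real training data, calling `categories_needed(train_df["Gene"], 0.70)` would give the exact number of genes behind the 70% figure.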

2.1.3 Univariate Analysis on Variation feature

Q. How many categories are there and how are they distributed?

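# Count how many training rows each variation has (sorted, most frequent first)
unique_variation = train_df["Variation"].value_counts()
print(unique_variation.shape)

(1941,)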

Plot the CDF of the Variation feature distribution

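# Fraction of rows per variation, most frequent first
s = sum(unique_variation.values)
h = unique_variation.values/s
# Running total of those fractions, i.e. the cumulative distribution
c = np.cumsum(h)
plt.plot(c,label='Cumulative distribution of Variations')
plt.grid()
plt.legend()
plt.show()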

3. Machine Learning models used

1. Multinomial Naive Bayes

2. K-Nearest Neighbours

3. Logistic Regression with Class Balancing

4. Logistic Regression without Class Balancing

5. Linear SVM
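
The list distinguishes Logistic Regression with and without class balancing, but does not say how the balancing is done. One common reading, assumed here, is per-class weights inversely proportional to class frequency, which is the heuristic scikit-learn applies when `class_weight="balanced"` is set. The sketch below computes those weights by hand on made-up labels; the function name is illustrative.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight for class c = n_samples / (n_classes * count_c)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * counts[c]) for c in sorted(counts)}


# Made-up imbalanced labels, for illustration only.
labels = [7] * 60 + [4] * 30 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

Rare classes receive larger weights, so an error on a minority class costs the model more during training. On an imbalanced dataset like this one, comparing the weighted and unweighted models shows whether that trade-off helps the log loss.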
