This dataset is a knowledge database of disease-symptom associations generated by an automated method based on information in textual discharge summaries of patients at New York-Presbyterian Hospital admitted during 2004. The dataset can be found here.
The first column shows the disease, the second shows the number of discharge summaries containing a positive and current mention of the disease, and the third shows the associated symptom.
Data extraction and cleaning
: Basic cleaning, segmentation of columns, and string formatting were performed in Excel.
Data preprocessing
: Data preprocessing tasks performed include:
- Spelling mistakes in the names of diseases or symptoms, or in their codes, were rectified
- The codes which were given to diseases and symptoms were removed as they were irrelevant for our task
- A cumulative list of all symptoms was made
- Each symptom was assigned a Boolean value of 0 or 1 for each disease, according to whether the symptom occurs with the disease or not
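The last two preprocessing steps can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the `pairs` rows and the `build_indicator_table` helper are hypothetical, and the real dataset has many more disease-symptom rows.

```python
# Hypothetical (disease, symptom) pairs standing in for the cleaned dataset.
pairs = [
    ("hypertensive disease", "pain chest"),
    ("hypertensive disease", "shortness of breath"),
    ("diabetes", "polyuria"),
    ("diabetes", "shortness of breath"),
]


def build_indicator_table(pairs):
    """Return (symptoms, table), where table[disease] is a list of 0/1 flags."""
    # Cumulative, de-duplicated list of all symptoms, sorted for a stable column order.
    symptoms = sorted({symptom for _, symptom in pairs})
    column = {symptom: i for i, symptom in enumerate(symptoms)}

    table = {}
    for disease, symptom in pairs:
        # Start each disease with an all-zero row, then flag the symptoms it occurs with.
        row = table.setdefault(disease, [0] * len(symptoms))
        row[column[symptom]] = 1
    return symptoms, table


symptoms, table = build_indicator_table(pairs)
```

Each row of `table` is then a fixed-length 0/1 vector over the same symptom list, which is the shape a classifier expects as input.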
Data visualization
: Built correlation heatmaps for the relationship between the symptoms and the relationship between the diseases.
Model Building
: Used two algorithms on this dataset, a Multinomial Naive Bayes Classifier and a Decision Tree, and compared their results to see which performed better.
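A comparison of this kind might look like the following scikit-learn sketch. The toy matrix `X`, the labels in `y`, and the choice to score on the training rows are all assumptions made for illustration; the project's real features, labels, and evaluation protocol are described in the detailed documentation.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

# Made-up 0/1 symptom matrix: one row per disease, one column per symptom.
X = np.array([
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
])
y = np.array(["disease_a", "disease_b", "disease_c"])

models = {
    "multinomial_nb": MultinomialNB(),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Fit each model and record its accuracy on the same rows it was trained on
# (an assumption here; a held-out split would be needed to measure generalization).
scores = {}
for name, model in models.items():
    model.fit(X, y)
    scores[name] = model.score(X, y)

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

Keeping both models in one dictionary means they are fitted and scored on identical data, so the resulting numbers are directly comparable.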
Find the detailed documentation here.
The results of all the tasks can be viewed by running this code in Google Colab or in the detailed documentation above.
The entire decision tree is too large to include, so only part of it is shown. The entire decision tree can be found here.
Mihir Gandhi - mihir-m-gandhi
Jasdeep Singh Grover - jasdeep100
Hardik Chodvadiya - willyhardik
Amit Dave - amitdave1998
This project is licensed under the MIT License - see the LICENSE file for details.