Given a medical diagnosis text, the task is to identify the medical conditions mentioned in it and map them to standardized medical codes.
The data directory contains:
- The disease mentions extracted from the text files, stored in entities.tsv.
- The text files containing the medical text, stored in the text directory.
The data is taken from the English version of multilingual resources of the DisTEMIST 2022 task: https://zenodo.org/record/6532684
The pre-processing stage involves:
- Splitting the medical text in each file into sentences.
- Tokenizing the sentences into words/tokens.
- Calculating IOB tags for the tokens for the named entity recognition (NER) task.
- Code: Pre-processing.ipynb
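
The three pre-processing steps above can be sketched as follows. This is a minimal illustration, not the code in Pre-processing.ipynb: the regex-based sentence splitter and tokenizer, the `DISEASE` label name, and the representation of mentions as `(start, end)` character offsets are all assumptions made for the example.

```python
import re

def sentence_spans(text):
    """Naive splitter: a sentence ends at '.', '!' or '?' followed by whitespace."""
    spans, start = [], 0
    for m in re.finditer(r"(?<=[.!?])\s+", text):
        spans.append((start, m.start()))
        start = m.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans

def tokenize_with_offsets(text):
    """Split text into word and punctuation tokens, keeping character offsets."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

def iob_tags(tokens, mentions, label="DISEASE"):
    """Tag each token B-/I-/O given mention spans as (start, end) offsets."""
    tags = ["O"] * len(tokens)
    for m_start, m_end in sorted(mentions):
        inside = False
        for i, (_, t_start, t_end) in enumerate(tokens):
            # A token belongs to a mention if it lies fully inside the span.
            if t_start >= m_start and t_end <= m_end:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags

# Hypothetical example text and mention offsets (end offset exclusive).
text = "Patient with chronic kidney disease and fever. No diabetes reported."
mentions = [(13, 35), (40, 45)]  # "chronic kidney disease", "fever"

tokens = tokenize_with_offsets(text)
tags = iob_tags(tokens, mentions)

# Group the tagged tokens by sentence for the sentence-level input variant.
for s_start, s_end in sentence_spans(text):
    sentence = [(tok, tag) for (tok, t_start, t_end), tag in zip(tokens, tags)
                if t_start >= s_start and t_end <= s_end]
    print(sentence)
```

Tagging on document-level offsets first and grouping by sentence afterwards avoids having to shift the mention offsets for each sentence.
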
- Two types of models are built:
  - The entire clinical case/document is given as input.
  - The text is split into sentences, and the sentences are given as input.
- The basic models used are:
- Disease mention identification is framed as a token classification problem.
- Code: Entities_NER.ipynb
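
In a token classification setup, the model predicts one IOB tag per token, and the disease mentions are recovered by grouping consecutive tags. The sketch below shows only that decoding step, under the assumption that predictions arrive as a plain list of tag strings. The function name `iob_to_spans` is hypothetical and is not taken from Entities_NER.ipynb.

```python
def iob_to_spans(tags):
    """Convert IOB tags into (start, end, label) token spans, end exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # An I- tag with the same label continues the currently open span.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            # A stray I- tag without a preceding B- is treated as a span start.
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans

# Hypothetical predictions for an eight-token sentence.
tokens = ["Patient", "with", "chronic", "kidney", "disease", "and", "fever", "."]
predicted = ["O", "O", "B-DISEASE", "I-DISEASE", "I-DISEASE", "O", "B-DISEASE", "O"]

for start, end, label in iob_to_spans(predicted):
    print(label, " ".join(tokens[start:end]))
```

Treating a stray `I-` tag as the start of a span is one possible convention; a stricter decoder could discard such tags instead.
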
- The disease mentions are linked to SNOMED CT codes.
- The models used are:
- Code: EL.ipynb (SapBERT), EL_roberta.ipynb (RoBERTa-Large), EL_pubmedbert.ipynb (PubMedBERT)
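
One common way to link a mention to a terminology is nearest-neighbor search over embeddings: embed the mention, embed every candidate term, and pick the code whose term is most similar. The sketch below illustrates that idea only and may differ from what the EL notebooks do. The `embed` function is a toy character n-gram stand-in for an encoder such as SapBERT, and the codes `CODE_A` to `CODE_C` are placeholders, not real SNOMED CT identifiers.

```python
import math
from collections import Counter

def embed(text, n=3):
    """Toy stand-in for a neural encoder: a bag of character n-grams."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def link(mention, terminology):
    """Return (code, term, score) for the most similar terminology entry."""
    query = embed(mention)
    scored = ((code, term, cosine(query, embed(term)))
              for code, term in terminology.items())
    return max(scored, key=lambda item: item[2])

# Placeholder terminology: hypothetical codes, not real SNOMED CT identifiers.
terminology = {
    "CODE_A": "chronic kidney disease",
    "CODE_B": "fever",
    "CODE_C": "diabetes mellitus",
}

for mention in ["chronic renal disease", "fevers"]:
    code, term, score = link(mention, terminology)
    print(f"{mention!r} -> {code} ({term}), score={score:.2f}")
```

With a real encoder, only `embed` and the similarity function would change; the ranking step stays the same.
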