Use of Probabilistic Topic Models to Medical Semantic Indexing in Spanish

librairy/mesinesp2


Supervised Graph-based Topic Model for Medical Semantic Indexing in Spanish

Task

The MESINESP2 task of the 9th BioASQ Workshop aims to create an automatic semantic indexing system for Spanish medical documents based on structured vocabularies. In particular, texts have to be annotated with DeCS headings. These health sciences descriptors form a trilingual, structured vocabulary created by BIREME to serve as a unique language for indexing articles from scientific journals, books, conference proceedings, technical reports, and other types of materials, as well as for searching and retrieving subjects from the scientific literature in information sources available on the Virtual Health Library (VHL), such as LILACS and MEDLINE, among others.

Three types of documents are proposed for the task: scientific literature, clinical trials and patents. The texts are short and are usually assigned several categories. On average, scientific articles contain 1,332 characters and 10 categories, clinical trials contain 7,283 characters and 15 categories, and patents contain 1,640 characters and 10 categories.

Proposal

We propose a probabilistic topic-based representation of DeCS categories, built from previously annotated texts. Each category is described by a probability distribution over the vocabulary used in the training texts. The resulting topic model can then infer the presence of DeCS categories in texts that were not used during training.
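The idea of one word distribution per category can be illustrated with a deliberately simplified sketch. It is not the model used in this repository: it credits every token of a document to each of its categories (a supervised topic model would learn a per-token assignment instead), and the function names, the smoothing value and the `CAT_A`/`CAT_B` codes are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_category_distributions(docs, smoothing=0.01):
    """Estimate one smoothed word distribution per category.

    docs: iterable of (tokens, codes) pairs. Every token of a document is
    credited to each of its codes, a crude stand-in for the per-token
    label assignment that a supervised topic model would learn.
    """
    counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, codes in docs:
        vocabulary.update(tokens)
        for code in codes:
            counts[code].update(tokens)

    distributions = {}
    for code, counter in counts.items():
        # Additive smoothing keeps unseen words at a small non-zero probability.
        total = sum(counter.values()) + smoothing * len(vocabulary)
        distributions[code] = {
            word: (counter[word] + smoothing) / total for word in vocabulary
        }
    return distributions


def rank_categories(tokens, distributions):
    """Rank categories by the average log-probability of the known tokens."""
    scores = {}
    for code, dist in distributions.items():
        known = [t for t in tokens if t in dist]
        if known:
            scores[code] = sum(math.log(dist[t]) for t in known) / len(known)
    return sorted(scores, key=scores.get, reverse=True)


# Toy training data with placeholder category codes (not real DeCS codes).
training = [
    (["prolactina", "paciente", "tratamiento"], ["CAT_A"]),
    (["patente", "dispositivo", "paciente"], ["CAT_B"]),
]
model = train_category_distributions(training)
print(rank_categories(["prolactina", "tratamiento"], model))
```

Words outside the training vocabulary are ignored when ranking, which mirrors the fact that a topic model can only reason about terms it has seen.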

Challenges

The characteristics of the documents proposed for the task and the assumptions behind probabilistic topic models lead to several challenges: (1) topic models rely on bags of words (i.e., word order does not matter, but word repetition does), yet in texts this short, raw word frequency may not be an adequate measure of relevance; (2) topic models designed for short texts assume a single topic per text, whereas the documents proposed for the task may have more than one category; (3) topic creation must be supervised so that each topic maps to a DeCS category, since the predicted categories must match DeCS headings; and finally (4) topic inference should return only the most relevant topics, one or several, since each text may have several categories.
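Challenge (1) suggests replacing raw frequencies with relevance scores before building the bag of words. The sketch below shows one way this could be done, assuming that per-term scores already exist (for instance from a keyword-ranking algorithm) and that they are rescaled linearly into integer pseudo-frequencies. The function names and the `max_count` parameter are hypothetical, and the exact normalization used in this repository is not documented here.

```python
def scores_to_pseudo_counts(scores, max_count=10):
    """Linearly rescale relevance scores to integer pseudo-frequencies.

    scores: dict mapping a term to a relevance score. The top-scoring term
    receives max_count repetitions and every term keeps at least one.
    """
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    span = high - low
    counts = {}
    for term, score in scores.items():
        # When all scores are equal, every term gets the maximum weight.
        ratio = (score - low) / span if span > 0 else 1.0
        counts[term] = 1 + round(ratio * (max_count - 1))
    return counts


def to_bag_of_words(counts):
    """Expand pseudo-counts into repeated tokens, as a bag-of-words model expects."""
    return " ".join(" ".join([term] * n) for term, n in counts.items())


# Hypothetical scores for three terms of a short abstract.
weights = scores_to_pseudo_counts(
    {"paciente": 1.0, "episodio": 3.0, "hiperprolactinemia": 5.0}, max_count=5
)
print(to_bag_of_words(weights))
```

Repeating a term in proportion to its score lets an unmodified bag-of-words topic model treat relevance as if it were frequency.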

Corpora

A Solr index has been created to process and annotate the texts proposed for the task. The structure of the documents is as follows:

  • id: unique identifier
  • title_s: document name (string)
  • abstract_t: text paragraph (terms)
  • journal_s: publication journal (string)
  • size_i: number of characters (integer)
  • year_i: publication date (integer)
  • db_s: document database (string)
  • codes: list of DeCS categories (list-of-string)
  • scope_s: training, development or test (string)
  • diseases: list of diseases retrieved from the abstract (list-of-string)
  • medications: list of medications retrieved from the abstract (list-of-string)
  • procedures: list of procedures retrieved from the abstract (list-of-string)
  • symptoms: list of symptoms retrieved from the abstract (list-of-string)
  • sentences: list of words obtained after pre-processing the abstract (list-of-string)
  • tokens_t: base text for creating bags of words (terms)
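Given these fields, documents can be retrieved through Solr's standard `/select` handler. The following sketch only builds the query URL; the host, port and the core name `documents` are assumptions, since the actual deployment is not described here.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, scope, db=None, rows=10):
    """Build a Solr /select URL that filters documents by scope and database."""
    params = [
        ("q", "*:*"),                          # match every document
        ("fq", f'scope_s:"{scope}"'),          # filter by training/development/test
        ("fl", "id,title_s,codes,size_i"),     # fields to return
        ("rows", str(rows)),
        ("wt", "json"),
    ]
    if db is not None:
        params.append(("fq", f'db_s:"{db}"'))  # optional filter by source database
    return f"{base_url}/{core}/select?{urlencode(params)}"


# Host, port and core name are placeholders for illustration.
print(build_select_url("http://localhost:8983/solr", "documents", "Development", db="IBECS"))
```

Using filter queries (`fq`) for `scope_s` and `db_s` keeps the main query free for text search over `abstract_t` or `tokens_t`.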

This is an example document:

{
        "id":"ibc-ET1-3794",
        "title_s":"Caso clínico: Manejo clínico de la hiperprolactinemia secundaria al tratamiento de un episodio maníaco con características psicóticas y mixtas en una paciente con un inicio posparto de trastorno bipolar tipo I",
        "abstract_t":"Se presenta el caso de una paciente que ingresa por un primer episodio maníaco con sintomatología psicótica y mixta. El tratamiento inicial instaurado permitió un control parcial de los síntomas agudos y ocasionó una intensa elevación de los niveles séricos de prolactina. Ante esta situación, se planteó una solución terapéutica basada en la evidencia",
        "journal_s":"Psiquiatr. biol. (Internet)",
        "size_i":352,
        "year_i":2015,
        "db_s":"IBECS",
        "codes":["D006966",
          "D001714",
          "D005260",
          "D000068105",
          "D011388",
          "D006801",
          "D011570",
          "D011618",
          "D049590"],
        "scope_s":"Development",
        "diseases":["maníaco_con_sintomatología_psicótica"],
        "medications":["prolactina"],
        "sentences":["presentar",
          "casar",
          "paciente",
          "...."],
        "tokens_t":" ingresar ingresar ..",
        "_version_":1699080371235192832}
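Since `tokens_t` stores the text from which bags of words are created, term counts can be recovered by splitting it on whitespace. This is a minimal sketch under that assumption; the `bag_of_words` helper and the shortened document below are illustrative only.

```python
import json
from collections import Counter


def bag_of_words(document):
    """Count the whitespace-separated tokens in the document's tokens_t field."""
    return Counter(document.get("tokens_t", "").split())


# Shortened, made-up document that follows the structure shown above.
raw = '{"id": "doc-1", "codes": ["CAT_A"], "tokens_t": " ingresar ingresar paciente"}'
document = json.loads(raw)
print(bag_of_words(document).most_common())
```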

Algorithms

More details coming soon.

Results

Our models are publicly available as REST web services packaged as Docker images. A service can be started with `docker run -p 8080:7777 <model-as-a-service name>`, after which a Swagger-based interface is available at http://localhost:8080.

| Algorithm | Reference | Bag-of-Words | Model-as-a-Service | Precision | Recall | F-Measure |
|---|---|---|---|---|---|---|
| LabeledLDA | Ramage et al. (2009) | Frequency | librairy/llda-mesinesp:latest* | TBD | TBD | TBD |
| TR-LLDA | novel | TextRank + linear normalization | librairy/tr-llda-mesinesp:latest* | TBD | TBD | TBD |
| TR?-LLDA | novel | TextRank + ? normalization | librairy/tr?-llda-mesinesp:latest* | TBD | TBD | TBD |
| R-LLDA | novel | Rake + linear normalization | librairy/r-llda-mesinesp:latest* | TBD | TBD | TBD |
| R?-LLDA | novel | Rake + ? normalization | librairy/r?-llda-mesinesp:latest* | TBD | TBD | TBD |

* not available yet
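Precision, recall and F-measure for multi-label annotation can be computed in several ways. The sketch below assumes micro-averaging over all (document, code) pairs, a common choice for semantic indexing; the averaging actually used for these results is not stated here, and the function name and toy codes are hypothetical.

```python
def micro_scores(gold, predicted):
    """Micro-averaged precision, recall and F-measure over sets of codes.

    gold, predicted: dicts mapping a document id to a set of codes.
    Only documents present in gold are evaluated.
    """
    tp = fp = fn = 0
    for doc_id, gold_codes in gold.items():
        pred_codes = predicted.get(doc_id, set())
        tp += len(gold_codes & pred_codes)   # codes correctly predicted
        fp += len(pred_codes - gold_codes)   # codes predicted but not annotated
        fn += len(gold_codes - pred_codes)   # annotated codes that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall) if precision + recall else 0.0
    )
    return precision, recall, f_measure


# Toy annotations with placeholder codes.
gold = {"d1": {"A", "B"}, "d2": {"C"}}
predicted = {"d1": {"A"}, "d2": {"C", "D"}}
print(micro_scores(gold, predicted))
```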
