This repository provides the code for our experiments on entity-based retrieval that takes semantic relations between entities into account.
We used the following scorers implemented in GATE Mímir:
- BM25
- TF-IDF (CF-IDF)
Because we used an annotation index with URIs, both ranking algorithms already operate on concepts rather than terms.
We added a variation of CF-IDF (called CF-IDF exact) that only considers hits whose URIs occur in the query. This gives more credit to URI matches in cases where entity linking did not work properly and could extract only a semantic type, not a concept in an ontology.
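The filtering idea behind CF-IDF exact can be illustrated with a minimal sketch. This is not the scorer plugin shipped in this repository: the class `CfIdfExactSketch`, the `Hit` type, and the method names are hypothetical, and the weighting is the textbook form (concept frequency times log of N over document frequency), which may differ from what the Mímir scorer computes.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class CfIdfExactSketch {

    /** Hypothetical annotation hit: a semantic type plus an optional concept URI. */
    public static final class Hit {
        public final String type;
        public final String uri; // null when entity linking found only a semantic type

        public Hit(String type, String uri) {
            this.type = type;
            this.uri = uri;
        }
    }

    /** Counts only those hits whose URI is one of the query URIs. */
    public static Map<String, Integer> exactConceptFrequencies(List<Hit> hits, Set<String> queryUris) {
        Map<String, Integer> cf = new HashMap<>();
        for (Hit hit : hits) {
            if (hit.uri != null && queryUris.contains(hit.uri)) {
                cf.merge(hit.uri, 1, Integer::sum);
            }
        }
        return cf;
    }

    /** Textbook CF-IDF (an assumption): sum of cf * log(N / df) over the counted concepts. */
    public static double score(Map<String, Integer> cf, Map<String, Integer> docFreq, int totalDocs) {
        double sum = 0.0;
        for (Map.Entry<String, Integer> entry : cf.entrySet()) {
            int df = docFreq.getOrDefault(entry.getKey(), 0);
            if (df > 0) {
                sum += entry.getValue() * Math.log((double) totalDocs / df);
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // Example data is made up for illustration.
        List<Hit> hits = Arrays.asList(
                new Hit("Organism", "http://example.org/A"),
                new Hit("Organism", null), // type only, ignored by the exact variant
                new Hit("Organism", "http://example.org/B"));
        Set<String> queryUris = Collections.singleton("http://example.org/A");
        Map<String, Integer> cf = exactConceptFrequencies(hits, queryUris);
        Map<String, Integer> docFreq = Collections.singletonMap("http://example.org/A", 1);
        System.out.println(score(cf, docFreq, 10));
    }
}
```

In this sketch, hits that carry only a semantic type or a URI outside the query contribute nothing to the score.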
In addition, we implemented the following entity-based retrieval models:
For the node level computation in HCF-IDF, we used slib with a slight extension (see the descriptions below):
- SML extension (node level computation for HCF-IDF)
- slibAPI (Wrapper)
- Mímir Search API (Wrapper)
Prerequisites: The following instructions have been tested with Java 8 and GATE version 8.6.1. Download and install GATE 8.6. Also download and unzip the Ontology plugin, then load it: start GATE, open the Plugin Manager (the jigsaw icon) from the menu, select "open from folder", and navigate to the folder containing the Ontology plugin.
Please download the pipelines from Zenodo:
To load a new application in the GATE UI, right-click "Application" and select "Restore Application from file". Then navigate to the application.xgapp file in the BEFCHina or BioCADDIE folder. Create a new corpus, add the metadata files you want to annotate (e.g., the downloaded BEFChina datasets), and run the pipeline.
Prerequisites: The following instructions have been tested with Java 8 and Maven 3.3.9.
Download the slib project. Further information about semantic similarity measures implemented in slib can be found on the webpage: http://www.semantic-measures-library.org/
To compute the node level for HCF-IDF, we adjusted the Semantic Measures Engine in the slib-sml subproject. As slib has no plugin API, add the methods given in slib.sml.sm.core.engine.SM_engine to the respective file in slib/sml.
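As a rough illustration of what a node level is, the following sketch computes, for each concept in a small hierarchy, its distance from the root via breadth-first search. It does not reproduce the methods added to SM_engine and does not use the slib API. The class `NodeLevelSketch`, the method `nodeLevels`, and the choice of the shortest path from the root as the level are all assumptions made for this example.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class NodeLevelSketch {

    /**
     * Returns the level of every node reachable from the root, where the root
     * has level 0 and each child sits one level below its closest parent.
     * The hierarchy is given as a parent-to-children map (hypothetical input format).
     */
    public static Map<String, Integer> nodeLevels(Map<String, List<String>> children, String root) {
        Map<String, Integer> level = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        level.put(root, 0);
        queue.add(root);
        while (!queue.isEmpty()) {
            String node = queue.remove();
            for (String child : children.getOrDefault(node, Collections.<String>emptyList())) {
                // The first visit in BFS order yields the shortest path from the root.
                if (!level.containsKey(child)) {
                    level.put(child, level.get(node) + 1);
                    queue.add(child);
                }
            }
        }
        return level;
    }

    public static void main(String[] args) {
        // Made-up hierarchy: "c" has two parents, as is common in ontologies.
        Map<String, List<String>> children = new HashMap<>();
        children.put("root", Arrays.asList("a", "b"));
        children.put("a", Arrays.asList("c"));
        children.put("b", Arrays.asList("c"));
        children.put("c", Arrays.asList("d"));
        System.out.println(nodeLevels(children, "root"));
    }
}
```

Ontologies are usually directed acyclic graphs rather than trees, so a concept can have several parents. The sketch resolves this by taking the shortest path; a longest-path definition would need a different traversal.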
Build and install the SML library with Maven.
The slibAPI is a wrapper project for the SML library and loads all ontologies as external knowledge sources. Download and install the slibAPI. Create a resource/vocabs folder under the root folder slibAPI. Download the following ontologies in OWL format from OBO Foundry and store them in the vocabs folder:
- bfo
- bto
- chebi
- cl
- doid
- envo
- flopo
- go
- hp
- ino
- mod
- ncbitaxon
- ncit
- obi
- pato
- po
- ppo
- rex
- symp
- to
- uberon (core ontology - uberon.owl + Uberon extended - ext.owl)
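Before building, it can help to check that the vocabs folder actually contains the expected OWL files. The sketch below only lists the `.owl` files in a directory; it is not how slibAPI loads ontologies, and the class `VocabsCheck` and the method `findOntologyFiles` are hypothetical names. The default path `resource/vocabs` is taken from the folder layout described above and assumes the program is run from the slibAPI root folder.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class VocabsCheck {

    /** Returns all regular files ending in ".owl" directly inside the given folder, sorted by path. */
    public static List<Path> findOntologyFiles(Path vocabsDir) {
        try (Stream<Path> entries = Files.list(vocabsDir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".owl"))
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Default path is an assumption based on the folder layout described in this README.
        Path vocabsDir = Paths.get(args.length > 0 ? args[0] : "resource/vocabs");
        if (!Files.isDirectory(vocabsDir)) {
            System.out.println("Folder not found: " + vocabsDir.toAbsolutePath());
            return;
        }
        for (Path ontology : findOntologyFiles(vocabsDir)) {
            System.out.println(ontology.getFileName());
        }
    }
}
```

Comparing the printed file names against the list above is a quick way to spot a missing download, for example the second Uberon file.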
Build and install the slibAPI project.
Download and install GATE Mímir version 6.2. Further information on the requirements of GATE Mímir and a user guide can be found on their webpage: https://gate.ac.uk/mimir/.
Install the scorer plugins provided in the GATE Mímir/plugins folder as described in Mímir's user guide.
All ontologies need to be loaded when Mímir starts. Therefore, initialize the slibAPI/SML class in the MimirScorerService. Have a look at the provided file in the Mimir6.2/webapp folder.
Now index a corpus with GATE Mímir, e.g., download the BEFChina datasets and annotate them with the OrganismTagger and the BiodivTagger. For testing purposes, you can also use any text files and annotate them with the default GATE ANNIE pipeline (which extracts, e.g., Location, Person, Date and Time annotations).
We provide an index template for the annotations obtained from the OrganismTagger and BiodivTagger.
The Mímir Search API project is a wrapper for Mímir that gives developers more convenient access to Mímir's search.
Download and install the Mímir Search API project. In the MimirSearch.java file, adjust the index URL.
The MimirTest project provides the code we ran to evaluate entity expansion and entity-based ranking functions on two test collections. It can also serve as example code for your own search applications and evaluations.
The evaluation results are available at Zenodo:
The code in this project is distributed under the terms of the GNU LGPL v3.0.