GNormPlus

Introduction

GNormPlus was developed at the NLM, is described in [1], and can be downloaded here. It is a tool that recognizes gene mentions in the biomedical literature and assigns NCBI Gene IDs to them. This repository has the following purposes:

Add code changes to output entity mentions tagged as FamilyName.
Refactor the code to allow multi-threaded processing.
Refactor the code so that it can be embedded into another Java program, with the resources (gene and species dictionaries, GNR model etc.) loaded only once at the beginning.
Create a Maven artifact to be distributed to Maven Central for usage in a JULIE Lab UIMA JCoRe component.

Code changes to output FamilyNames

There are a few places in the code marked with a comment containing "Erik Faessler". Two of those changes lead to the output of the FamilyName entities. Those entities do not receive an ID from NCBI Gene.

Refactoring to allow multi-threaded processing

In its original form, the GNormPlus code operates entirely on static fields in the GNormPlus class. The new class GNPProcessingData was created to hold all the data fields that are manipulated during the processing of a batch of files. An instance of GNPProcessingData is created for each file to be processed, which avoids race conditions on field access. Accordingly, every access to the former static fields had to be changed to an access to the GNPProcessingData instance, and that instance is passed to the classes that implement the individual tasks (GNR, species assignment, mapping etc.).
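The idea of per-file state can be illustrated with a minimal, self-contained sketch. It does not use the actual GNormPlus classes: `ProcessingData`, `MentionTagger` and `processAll` are hypothetical stand-ins, and the "tagging" is a toy rule (capitalized tokens) chosen only to give each task something to write into its own state object.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerFileStateSketch {

    /** Hypothetical stand-in for GNPProcessingData: mutable state of ONE file. */
    static class ProcessingData {
        final String fileName;
        final List<String> mentions = new ArrayList<>();

        ProcessingData(String fileName) {
            this.fileName = fileName;
        }
    }

    /** Hypothetical stand-in for a task class such as the gene mention recognizer. */
    static class MentionTagger {
        // Writes only to the instance it is handed, never to static fields.
        void tag(ProcessingData data, String text) {
            for (String token : text.split("\\s+")) {
                // Toy rule for illustration: treat capitalized tokens as mentions.
                if (!token.isEmpty() && Character.isUpperCase(token.charAt(0))) {
                    data.mentions.add(token);
                }
            }
        }
    }

    /** Runs one task per text; each task creates its own ProcessingData. */
    static List<ProcessingData> processAll(List<String> texts) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        MentionTagger tagger = new MentionTagger(); // stateless, safe to share
        try {
            List<Future<ProcessingData>> futures = new ArrayList<>();
            for (int i = 0; i < texts.size(); i++) {
                final String name = "doc" + i + ".txt";
                final String text = texts.get(i);
                futures.add(pool.submit(() -> {
                    ProcessingData data = new ProcessingData(name);
                    tagger.tag(data, text);
                    return data;
                }));
            }
            List<ProcessingData> results = new ArrayList<>();
            for (Future<ProcessingData> future : futures) {
                results.add(future.get()); // results come back in submission order
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> texts = new ArrayList<>();
        texts.add("BRCA1 binds TP53");
        texts.add("no genes here");
        for (ProcessingData data : processAll(texts)) {
            System.out.println(data.fileName + " -> " + data.mentions);
        }
    }
}
```

Because no task touches shared mutable fields, the tasks need no locking; only read-only objects such as the tagger are shared between threads.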

Refactoring to allow usage from other Java programs without reloading the resources

The normal GNormPlus workflow consists of:

loading the dictionaries and models
iterating over the files in the input directory
processing each file and writing its output
ending the program

However, we would like to keep the program running as new documents come in, without knowing in advance how many documents there are or where to find them. This would allow us to use GNormPlus in a UIMA component. Thus, there is now a loadResources() method in the GNormPlus class that should be called once before processing starts. After that call, the dictionaries are loaded and GNormPlus is ready to process incoming documents via the processFile() method. Parallel processing is possible as long as the files delivered have different names.
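The load-once, process-many pattern described above might look roughly like the following sketch. The README does not show the real signatures of loadResources() and processFile(), so the `Tagger` class, its method parameters, the `loadCount` counter and the two-entry toy dictionary are all assumptions made for illustration only.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class LoadOnceSketch {

    /** Hypothetical stand-in for the GNormPlus class; NOT the real API. */
    static class Tagger {
        // Shared, read-only after loading. Toy entries instead of real dictionaries.
        private static volatile Map<String, String> geneDictionary;
        // Counts actual loads, only to make the load-once behavior observable.
        static final AtomicInteger loadCount = new AtomicInteger();

        /** Loads the expensive resources; repeated calls are no-ops. */
        static synchronized void loadResources() {
            if (geneDictionary != null) {
                return;
            }
            Map<String, String> dictionary = new ConcurrentHashMap<>();
            dictionary.put("BRCA1", "672");
            dictionary.put("TP53", "7157");
            geneDictionary = dictionary;
            loadCount.incrementAndGet();
        }

        /** Processes one document using the already loaded resources. */
        static String processFile(String fileName, String text) {
            if (geneDictionary == null) {
                throw new IllegalStateException("call loadResources() first");
            }
            StringBuilder out = new StringBuilder(fileName).append(':');
            for (String token : text.split("\\s+")) {
                String id = geneDictionary.get(token);
                if (id != null) {
                    out.append(' ').append(token).append('=').append(id);
                }
            }
            return out.toString();
        }
    }

    public static void main(String[] args) {
        Tagger.loadResources(); // once, at startup
        // Documents can then be handed over one by one as they arrive.
        System.out.println(Tagger.processFile("a.txt", "BRCA1 binds TP53"));
        System.out.println(Tagger.processFile("b.txt", "nothing to tag"));
    }
}
```

In an embedding program such as a UIMA component, the load call would typically sit in the component's initialization and the per-document call in its processing method, with each document passed under a distinct file name.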

[1] Wei, C. H., Kao, H. Y., & Lu, Z. (2015). GNormPlus: An Integrative Approach for Tagging Genes, Gene Families, and Protein Domains. BioMed Research International, 2015. https://doi.org/10.1155/2015/918710
