Applying machine learning techniques to characterising and naming lncRNA genes
Long non-coding RNA (lncRNA) genes are far less well understood than protein-coding genes, yet they are estimated to be far more numerous. In-depth manual inspection of annotations, as currently performed for protein-coding genes, therefore cannot scale to lncRNA. The aim of this project is to examine existing lncRNA annotations produced by RefSeq or Ensembl HAVANA and identify those that are consistent and therefore worthy of an HGNC (HUGO Gene Nomenclature Committee) approved gene symbol and name. We are currently compiling a hand-curated dataset to serve as training data, so the project will focus on machine learning, although some background knowledge of molecular biology will be useful for feature design.
Not quite: two automated pipelines (Ensembl/GENCODE and RefSeq) are already doing so. The purpose of this exercise is to compare these two sets of calls and use ML to determine which calls are incorrect.
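As a rough illustration of what comparing the two sets of calls could look like, the sketch below scores each Ensembl call against its best-overlapping RefSeq call using a Jaccard index on the gene spans. Everything here is an assumption made for illustration: the `Call` record, the `jaccard` and `best_matches` helpers, the identifiers and the coordinates are all made up, and a span-level Jaccard score is only one possible feature, not the project's chosen method.

```python
from typing import List, NamedTuple, Optional, Tuple


class Call(NamedTuple):
    """A gene call from one pipeline (hypothetical record for illustration)."""
    source: str   # "ensembl" or "refseq"
    gene_id: str  # made-up identifier
    chrom: str
    strand: str   # "+" or "-"
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive


def jaccard(a: Call, b: Call) -> float:
    """Jaccard index of two gene spans; 0.0 if chromosome or strand differ."""
    if a.chrom != b.chrom or a.strand != b.strand:
        return 0.0
    inter = min(a.end, b.end) - max(a.start, b.start) + 1
    if inter <= 0:
        return 0.0
    union = (a.end - a.start + 1) + (b.end - b.start + 1) - inter
    return inter / union


def best_matches(
    ensembl: List[Call], refseq: List[Call]
) -> List[Tuple[str, Optional[str], float]]:
    """For each Ensembl call, return (gene_id, best RefSeq id or None, score)."""
    out = []
    for e in ensembl:
        best_id, best_score = None, 0.0
        for r in refseq:
            score = jaccard(e, r)
            if score > best_score:
                best_id, best_score = r.gene_id, score
        out.append((e.gene_id, best_id, best_score))
    return out


# Toy data, invented for this example only.
ensembl_calls = [
    Call("ensembl", "E1", "1", "+", 100, 199),
    Call("ensembl", "E2", "1", "-", 500, 599),
]
refseq_calls = [
    Call("refseq", "R1", "1", "+", 150, 249),
    Call("refseq", "R2", "2", "-", 500, 599),
]
matches = best_matches(ensembl_calls, refseq_calls)
```

A score near 1 suggests the two pipelines agree on the locus, while a call with no counterpart (score 0) is a candidate for closer inspection. In practice one would probably compare exon structure rather than whole-gene spans, and use an interval index instead of the quadratic loop.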
Not yet. We are still curating the calls manually, but you can already examine them via the RESTful APIs at Ensembl or RefSeq.
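For example, a minimal sketch of pulling gene calls for a region from the Ensembl REST API might look like the following. It assumes the public server at `https://rest.ensembl.org`, its `overlap/region` endpoint, and that `lncRNA` is the biotype label of interest. The helper names (`overlap_url`, `fetch_genes`, `summarise`) are hypothetical, and the field names read in `summarise` should be checked against the current API documentation.

```python
import json
import urllib.request
from urllib.parse import urlencode

ENSEMBL_REST = "https://rest.ensembl.org"  # public Ensembl REST server


def overlap_url(region, species="human", biotype="lncRNA"):
    """Build an overlap/region URL for gene features of one biotype."""
    query = urlencode({"feature": "gene", "biotype": biotype})
    return f"{ENSEMBL_REST}/overlap/region/{species}/{region}?{query}"


def fetch_genes(region, species="human", biotype="lncRNA"):
    """Fetch gene features overlapping `region` (needs network access)."""
    request = urllib.request.Request(
        overlap_url(region, species, biotype),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def summarise(genes):
    """Reduce each returned gene record to a small tuple for inspection."""
    return [
        (
            g.get("id"),
            g.get("external_name"),  # may be absent for unnamed loci
            g.get("seq_region_name"),
            g.get("start"),
            g.get("end"),
            g.get("strand"),
        )
        for g in genes
    ]
```

Calling `summarise(fetch_genes("7:140424943-140624564"))` would then list the lncRNA gene calls Ensembl reports for that region. The region string is just an example; any `chromosome:start-end` range within the server's size limit should work.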