Modify README

bioinf · Dec 13, 2022 · 1f0cf24
1 parent d93beb8
commit 1f0cf24

Showing 2 changed files with 31 additions and 21 deletions.
diff --git a/README.md b/README.md
@@ -1,11 +1,5 @@
-# uORF Annotator v. 0.7
-*uORF Annotator* is the tool for annotation of the functional impact of genetic variants in upstream open reading frames (uORFs) in the human genome, which are predicted by [uBert model](https://github.com/skoblov-lab/uBERTa).
-
-New in v. 0.7:
-* annotation of `stop_gained` variants with potentional activating effect on main ORF (`overlap_removed`)
-* generation of a separate VCF file as a companion for bED visualization
-* re-organization of TSV output and VCF fields
-* gnomAD constraint metrics annotation added under `-gc` option. 
+# uORF Annotator v. 1.0
+*uORF Annotator* is a tool for annotating the functional impact of genetic variants in upstream open reading frames (uORFs) in the human genome. The uORFs were manually annotated in 3641 OMIM genes based on publicly available Ribo-seq and other data types.
 
 ## Conda environment
 Install all dependencies from `requirements.yml` as a new conda environment.
@@ -32,8 +26,8 @@ python uORF_annotator.py \
 ```
 ## Output formats specification
 ### tab-separated (tsv) file
-Each row represents annotation of a single variant in particular uORF (per uORF annotation).
-#### Fields in the TSV output
+Each row represents the annotation of a single variant in a particular uORF (per-uORF annotation). The fields in the file are as follows:
+
 1) #CHROM - contig name  
 2) POS - position  
 3) REF - reference allele
@@ -53,7 +47,23 @@ Each row represents annotation of a single variant in particular uORF (per uORF
 17) LOEUF score (if gnomAD constraint annotation is provided)
 18) INFO - original INFO field from the input `.vcf` file  
 
-### The Variant Call Format (vcf)
-Add uBERT field to INFO fields of input vcf file.
-#### FORMAT:
-ORF_START|ORF_END|ORF_SYMB|ORF_CONSEQ|overlapped_CDS|utid|overlapping_type|dominance_type|codon_type
+### Variant call format (VCF)
+
+The generated VCF output contains all variants affecting uORF sequences. Each variant is annotated with the following INFO fields: `uORFs`, `uORFs_ATG`, `uORFs_eff`. The description of fields is given below:
+
+* `uORFs` - a full consequence annotation for each variant-uORF combination. Format: 'ORF_START|ORF_END|ORF_SYMB|ORF_CONSEQ|main_cds_effect|in_known_CDS|in_known_ORF|utid|overlapping_type|dominance_type|codon_type'
+* `uORFs_ATG` - a flag indicating whether a variant falls within an ATG-starting uORF.
+* `uORFs_eff` - a short notation of the main CDS effect: `ext` - N-terminal extension, `overl` - out-of-frame overlap, `activ` - overlap removal with possible main ORF activation, `unaff` - no effect on main CDS.
+
+### BED format
+
+A BED file generated by the *uORF Annotator* contains all uORFs affected by variants that alter the uORF length. The BED file contains one entry for each affected uORF, and one entry for each variant-uORF combination that leads to changes in the anticipated length of the uORF product. Color legend:
+* Grey features - cases where the variant does not change the overlap between the uORF and the main CDS.
+* Orange features - cases where (a) a uORF-truncating variant eliminates the existing overlap between the uORF and the main CDS; or (b) the variant leads to the production of a chimeric protein product of the gene, possessing an extension at the N-terminus resulting from uORF translation.
+* Red features - cases where the variant leads to the appearance of a new overlapping segment between the uORF and the main gene CDS, with the two sequences translated in different frames.
+
+## Additional files
+
+This repository contains two additional files:
+1) `Annotated_uORFs_and_alt.CDS.starts_v4.8.bed` - Manually annotated alternative open reading frames (including non-overlapping and overlapping uORFs, CDS_extensions and CDS_truncations) found in 3641 human genes from the OMIM database. BED file; the "name" field contains 'gene_name|isoform|type_of_ORF|start_codon'.
+2) `High-confidence_uORFs_v2.bed` - List of high-confidence uORFs found in 3641 human genes from the OMIM database. This list includes only uORFs predicted in at least two out of four different studies (the present study, Ji et al. (PMID: 26687005), McGillivray et al. (PMID: 29562350), Scholz et al. (PMID: 31513641)). BED file; the "name" field contains 'gene_name|isoform|type_of_uORF|type_of_start_codon(ATG/non-ATG)'.
diff --git a/uORF_annotator.py b/uORF_annotator.py
@@ -1009,7 +1009,7 @@ def main(input_vcf, bed, fasta, gtf, output, utr_only, gnomad_constraint) -> Non
 
                 # add fields from 4-field bed file
                 df['INFO_new'] = df['INFO_new'].str.cat(df['bed_anno'], sep = '|')
-                df['INFO_new'] = df['INFO_new'].str.replace(',uBERT_uORFs=', ',')
+                df['INFO_new'] = df['INFO_new'].str.replace(',uORFs=', ',')
 #                df['INFO_new'] = df['INFO_new'].astype(str) + ';'
                 atg_binarizer = {True: 'yes', False: 'no'}
                 # set main VCF fields
@@ -1018,11 +1018,11 @@ def main(input_vcf, bed, fasta, gtf, output, utr_only, gnomad_constraint) -> Non
                 # per variant multi-u-transcript annotation
                 df = df.groupby(['#CHROM', 'POS', 'REF', 'ALT', 'INFO'], sort=False)['INFO_new'].apply(','.join)
                 df = df.reset_index()
-                df['INFO_new'] = [f'{x};uBERT_ATG={atg_binarizer["|ATG" in x]}' for x in df['INFO_new']]
+                df['INFO_new'] = [f'{x};uORFs_ATG={atg_binarizer["|ATG" in x]}' for x in df['INFO_new']]
                 mce_short = {'': 'none', 'unassigned': 'none', 'main_CDS_unaffected': 'unaff', 'N-terminal_extension': 'ext', \
                         'out-of-frame_overlap': 'overl', 'overlap_removal': 'activ'}
                 main_eff_list = ['&'.join(set([mce_short[x.split('|')[-7]] for x in y.split(',')])) for y in df['INFO_new']]
-                df['INFO_new'] = [f'{x};uBERT_eff={y}' for x, y in zip(df['INFO_new'], main_eff_list)]
+                df['INFO_new'] = [f'{x};uORFs_eff={y}' for x, y in zip(df['INFO_new'], main_eff_list)]
 
 
                 # add rest obligate VCF-fields
@@ -1041,10 +1041,10 @@ def main(input_vcf, bed, fasta, gtf, output, utr_only, gnomad_constraint) -> Non
 
                 # add header ##INFO uORF_annotator (uBERT) line
                 h += \
-                f'##INFO=<ID=uBERT_uORFs,Number=.,Type=String,Description="Consequence uORF_annotator from uBERT. ' + \
+                f'##INFO=<ID=uORFs,Number=.,Type=String,Description="Consequence uORF_annotator from uBERT. ' + \
                 f'Format: ORF_START|ORF_END|ORF_SYMB|ORF_CONSEQ|main_cds_effect|in_known_CDS|in_known_ORF{bed_4col_info}">\n' + \
-                f'##INFO=<ID=uBERT_ATG,Number=.,Type=String,Description="A flag indicating if a variant falls within ATG-starting uORF.">\n' + \
-                f'##INFO=<ID=uBERT_eff,Number=.,Type=String,Description="Short notation of main CDS effect. ext - N-terminal extension, ' + \
+                f'##INFO=<ID=uORFs_ATG,Number=.,Type=String,Description="A flag indicating if a variant falls within ATG-starting uORF.">\n' + \
+                f'##INFO=<ID=uORFs_eff,Number=.,Type=String,Description="Short notation of main CDS effect. ext - N-terminal extension, ' + \
                 'overl - out-of-frame overlap, activ - overlap removal with possible main ORF activation, unaff - no effect on main CDS">'
                 # write VCF-header and VCF-body in output file
                 with open(f'{output}.vcf', 'a') as w:
@@ -1063,4 +1063,4 @@ def main(input_vcf, bed, fasta, gtf, output, utr_only, gnomad_constraint) -> Non
                 sp.run(f'sort -k1,1 -k2,2n -k3,3n {tmp_atg_bed.name} | uniq - >> {atg_bed}', shell=True)
                 sp.run(f'sort -k1,1 -k2,2n -k3,3n {tmp_non_atg_bed.name} | uniq - >> {non_atg_bed}', shell=True)
         # run analysis
-        main()
+        main()
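
For readers consuming the renamed INFO fields downstream, the following is a minimal sketch of how the `uORFs`, `uORFs_ATG` and `uORFs_eff` annotations could be split back into their parts. It assumes the field order given in the README hunk above, that per-uORF records are joined with `,` and short effects with `&` (as the `groupby(...).apply(','.join)` and `'&'.join(...)` calls in this diff suggest), and that no field value itself contains `,`, `;` or `|`. The function name `parse_uorf_info` and all values in the example INFO string are hypothetical and are not part of the repository.

```python
from typing import Dict, List

# Field order as listed in the README description of the `uORFs` INFO field.
UORF_FIELDS = [
    "ORF_START", "ORF_END", "ORF_SYMB", "ORF_CONSEQ", "main_cds_effect",
    "in_known_CDS", "in_known_ORF", "utid", "overlapping_type",
    "dominance_type", "codon_type",
]


def parse_uorf_info(info: str) -> Dict[str, object]:
    """Split a VCF INFO string into the three uORF annotations (hypothetical helper)."""
    # INFO entries are separated by ';'; key-less flags are ignored here.
    entries = dict(item.split("=", 1) for item in info.split(";") if "=" in item)

    # Each variant-uORF combination is one '|'-delimited record; records are ','-joined.
    records: List[Dict[str, str]] = []
    for raw in entries.get("uORFs", "").split(","):
        if raw:
            records.append(dict(zip(UORF_FIELDS, raw.split("|"))))

    # Short main-CDS effects are '&'-joined when a variant has several.
    eff = entries.get("uORFs_eff")
    return {
        "uORFs": records,
        "uORFs_ATG": entries.get("uORFs_ATG"),
        "uORFs_eff": eff.split("&") if eff else [],
    }


# Illustrative INFO string; every value below is made up for the example.
example_info = (
    "uORFs=1000|1050|GENE1|stop_gained|overlap_removal|no|no|tx1|overlapping|dominant|ATG"
    ";uORFs_ATG=yes;uORFs_eff=activ"
)
parsed = parse_uorf_info(example_info)
print(parsed["uORFs"][0]["main_cds_effect"], parsed["uORFs_ATG"], parsed["uORFs_eff"])
```

Because `main_cds_effect` is the fifth of eleven fields, it is also the seventh from the end, which matches the `x.split('|')[-7]` lookup used in the script when building `uORFs_eff`.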