Modify README
mrbarbitoff committed Dec 13, 2022
1 parent d93beb8 commit 1f0cf24
Showing 2 changed files with 31 additions and 21 deletions.
38 changes: 24 additions & 14 deletions README.md
@@ -1,11 +1,5 @@
# uORF Annotator v. 0.7
*uORF Annotator* is the tool for annotation of the functional impact of genetic variants in upstream open reading frames (uORFs) in the human genome, which are predicted by [uBert model](https://github.com/skoblov-lab/uBERTa).

New in v. 0.7:
* annotation of `stop_gained` variants with potentional activating effect on main ORF (`overlap_removed`)
* generation of a separate VCF file as a companion for bED visualization
* re-organization of TSV output and VCF fields
* gnomAD constraint metrics annotation added under `-gc` option.
# uORF Annotator v. 1.0
*uORF Annotator* is the tool for annotation of the functional impact of genetic variants in upstream open reading frames (uORFs) in the human genome, which were manually annotated based on publicly available Ribo-seq and other data types in 3641 OMIM genes.

## Conda environment
Install all dependencies from `requirements.yml` as new conda environment.
@@ -32,8 +26,8 @@ python uORF_annotator.py \
```
## Output formats specification
### tab-separated (tsv) file
Each row represents annotation of a single variant in particular uORF (per uORF annotation).
#### Fields in the TSV output
Each row represents annotation of a single variant in particular uORF (per uORF annotation). Fields in the file have the following content:

1) #CHROM - contig name
2) POS - position
3) REF - reference allele
@@ -53,7 +47,23 @@ Each row represents annotation of a single variant in particular uORF (per uORF
17) LOEUF score (if gnomAD constraint annotation is provided)
18) INFO - old INFO field from inputed `.vcf` file

### The Variant Call Format (vcf)
Add uBERT field to INFO fields of input vcf file.
#### FORMAT:
ORF_START|ORF_END|ORF_SYMB|ORF_CONSEQ|overlapped_CDS|utid|overlapping_type|dominance_type|codon_type
### Variant call format (VCF)

The generated VCF output contains all variants affecting uORF sequences. Each variant is annotated with the following INFO fields: `uORFs`, `uORFs_ATG`, `uORFs_eff`. The description of fields is given below:

* `uORFs` - a full consequence annotation for each variant-uORF combination. Format: 'ORF_START|ORF_END|ORF_SYMB|ORF_CONSEQ|main_cds_effect|in_known_CDS|in_known_ORF|utid|overlapping_type|dominance_type|codon_type'
* `uORFs_ATG` - a flag indicating if a variant falls within ATG-starting uORF.
* `uORFs_eff` - a short notation of main CDS effect. ext - N-terminal extension, overl - out-of-frame overlap, activ - overlap removal with possible main ORF activation, unaff - no effect on main CDS.
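
The INFO layout described above can be read back with a few lines of Python. This is only a minimal sketch, not part of the tool. The helper names `parse_info` and `parse_uorfs` and the sample INFO string are hypothetical. It assumes that entries for several uORFs are comma-separated within `uORFs` and that each entry carries the eleven pipe-separated fields listed in the format string.

```python
# Field order taken from the `uORFs` format string in the README.
UORF_FIELDS = [
    "ORF_START", "ORF_END", "ORF_SYMB", "ORF_CONSEQ", "main_cds_effect",
    "in_known_CDS", "in_known_ORF", "utid", "overlapping_type",
    "dominance_type", "codon_type",
]


def parse_info(info):
    """Split a VCF INFO string into a key -> value dict (bare flags map to True)."""
    parsed = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        parsed[key] = value if sep else True
    return parsed


def parse_uorfs(info):
    """Return one dict per variant-uORF combination found in the `uORFs` field."""
    raw = parse_info(info).get("uORFs")
    if not isinstance(raw, str) or not raw:
        return []
    # Assumption: multiple uORF annotations are joined with commas.
    return [dict(zip(UORF_FIELDS, entry.split("|"))) for entry in raw.split(",")]


# Hypothetical INFO value; all coordinates and identifiers are made up.
example_info = (
    "AC=1;uORFs=100|160|GENE1|stop_gained|overlap_removal|no|no|"
    "TX1|overlapping|dominant|ATG;uORFs_ATG=yes;uORFs_eff=activ"
)

records = parse_uorfs(example_info)
info_map = parse_info(example_info)
```

Here `records[0]["main_cds_effect"]` holds the long form of the effect, while `info_map["uORFs_eff"]` holds its short notation.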

### BED format

A BED file generated by the *uORF Annotator* contains all uORFs affected by variants that alter the uORF length. The BED file contains one entry for each affected uORF, and one entry for each variant-uORF combination that leads to changes in anticipated length of uORF product. Color legend:
* Grey features - cases when the variant does not change the overlap between uORF and main CDS.
* Orange features - cases when (a) uORF-truncating variant eliminates the existing overlap between uORF and main CDS; or (b) variant leads to the production of a chimeric protein product of the gene, possessing an extension at the N-terminus resulting from uORF translation
* Red features - cases where variant leads to the appearance of a new overlapping segment between uORF and main gene CDS, with the two sequences translated in different frames.

## Additional files

This repository contains two additional files:
1) `Annotated_uORFs_and_alt.CDS.starts_v4.8.bed` - Manually annotated alternative open reading frames (including non-overlapping and overlapping uORFs, CDS_extensions and CDS_truncations) found in 3641 human genes from the OMIM database. BED-file, field "name" contains information about 'gene_name|isoform|type_of_ORF|start_codon'.
2) `High-confidence_uORFs_v2.bed` - List of high confidence uORFs found in 3641 human genes from the OMIM database. In this list, we included only uORFs predicted in at least two out of four different studies (the present study, Ji et al. (PMID: 26687005), McGillivray et al. (PMID: 29562350), Scholz et al. (PMID: 31513641)). BED-file, field "name" contains information about 'gene_name|isoform|type_of_uORF|type_of_start_codon(ATG/non-ATG)'.
14 changes: 7 additions & 7 deletions uORF_annotator.py
@@ -1009,7 +1009,7 @@ def main(input_vcf, bed, fasta, gtf, output, utr_only, gnomad_constraint) -> Non

# add fields from 4-field bed file
df['INFO_new'] = df['INFO_new'].str.cat(df['bed_anno'], sep = '|')
df['INFO_new'] = df['INFO_new'].str.replace(',uBERT_uORFs=', ',')
df['INFO_new'] = df['INFO_new'].str.replace(',uORFs=', ',')
# df['INFO_new'] = df['INFO_new'].astype(str) + ';'
atg_binarizer = {True: 'yes', False: 'no'}
# set main VCF fields
@@ -1018,11 +1018,11 @@ def main(input_vcf, bed, fasta, gtf, output, utr_only, gnomad_constraint) -> Non
# per variant multi-u-transcript annotation
df = df.groupby(['#CHROM', 'POS', 'REF', 'ALT', 'INFO'], sort=False)['INFO_new'].apply(','.join)
df = df.reset_index()
df['INFO_new'] = [f'{x};uBERT_ATG={atg_binarizer["|ATG" in x]}' for x in df['INFO_new']]
df['INFO_new'] = [f'{x};uORFs_ATG={atg_binarizer["|ATG" in x]}' for x in df['INFO_new']]
mce_short = {'': 'none', 'unassigned': 'none', 'main_CDS_unaffected': 'unaff', 'N-terminal_extension': 'ext', \
'out-of-frame_overlap': 'overl', 'overlap_removal': 'activ'}
main_eff_list = ['&'.join(set([mce_short[x.split('|')[-7]] for x in y.split(',')])) for y in df['INFO_new']]
df['INFO_new'] = [f'{x};uBERT_eff={y}' for x, y in zip(df['INFO_new'], main_eff_list)]
df['INFO_new'] = [f'{x};uORFs_eff={y}' for x, y in zip(df['INFO_new'], main_eff_list)]


# add rest obligate VCF-fields
@@ -1041,10 +1041,10 @@ def main(input_vcf, bed, fasta, gtf, output, utr_only, gnomad_constraint) -> Non

# add header ##INFO uORF_annotator (uBERT) line
h += \
f'##INFO=<ID=uBERT_uORFs,Number=.,Type=String,Description="Consequence uORF_annotator from uBERT. ' + \
f'##INFO=<ID=uORFs,Number=.,Type=String,Description="Consequence uORF_annotator from uBERT. ' + \
f'Format: ORF_START|ORF_END|ORF_SYMB|ORF_CONSEQ|main_cds_effect|in_known_CDS|in_known_ORF{bed_4col_info}">\n' + \
f'##INFO=<ID=uBERT_ATG,Number=.,Type=String,Description="A flag indicating if a variant falls within ATG-starting uORF.">\n' + \
f'##INFO=<ID=uBERT_eff,Number=.,Type=String,Description="Short notation of main CDS effect. ext - N-terminal extension, ' + \
f'##INFO=<ID=uORFs_ATG,Number=.,Type=String,Description="A flag indicating if a variant falls within ATG-starting uORF.">\n' + \
f'##INFO=<ID=uORFs_eff,Number=.,Type=String,Description="Short notation of main CDS effect. ext - N-terminal extension, ' + \
'overl - out-of-frame overlap, activ - overlap removal with possible main ORF activation, unaff - no effect on main CDS">'
# write VCF-header and VCF-body in output file
with open(f'{output}.vcf', 'a') as w:
@@ -1063,4 +1063,4 @@ def main(input_vcf, bed, fasta, gtf, output, utr_only, gnomad_constraint) -> Non
sp.run(f'sort -k1,1 -k2,2n -k3,3n {tmp_atg_bed.name} | uniq - >> {atg_bed}', shell=True)
sp.run(f'sort -k1,1 -k2,2n -k3,3n {tmp_non_atg_bed.name} | uniq - >> {non_atg_bed}', shell=True)
# run analysis
main()
main()
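
For readers following the `uORFs_eff` change, the short-notation step in the hunk above can be sketched as a standalone function. This is an illustration under assumptions, not the script's code. The name `short_effects` is hypothetical. It reads `main_cds_effect` as the fifth field from the left, whereas the script indexes it as `[-7]` from the right. It also sorts the labels so the result is deterministic, while the script joins an unordered `set`.

```python
# Mapping copied from the diff: long main-CDS effect -> short label.
MCE_SHORT = {
    "": "none",
    "unassigned": "none",
    "main_CDS_unaffected": "unaff",
    "N-terminal_extension": "ext",
    "out-of-frame_overlap": "overl",
    "overlap_removal": "activ",
}


def short_effects(uorfs_value, effect_index=4):
    """Collapse the main-CDS effects of all uORF entries into one '&'-joined label.

    `uorfs_value` is the comma-separated, pipe-delimited `uORFs` annotation;
    `effect_index` is the assumed position of `main_cds_effect` in each entry.
    """
    labels = {
        MCE_SHORT[entry.split("|")[effect_index]]
        for entry in uorfs_value.split(",")
    }
    return "&".join(sorted(labels))


# Hypothetical annotation with two uORF entries for one variant.
example = (
    "1|2|GENE1|stop_gained|overlap_removal|no|no|TX1|overlapping|dominant|ATG,"
    "5|9|GENE1|stop_lost|N-terminal_extension|no|no|TX2|overlapping|dominant|CTG"
)
combined = short_effects(example)
```

A variant that hits several uORFs with different effects therefore gets a combined label such as `activ&ext`.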
