diff --git a/README.md b/README.md
index 2de026d..07921d1 100644
--- a/README.md
+++ b/README.md
@@ -8,7 +8,7 @@ conda env create -f requirements.yml
 ```
 ## Required input data
 * VCF file of variants for further annotation
-* BED file with available uORFs (`sorted.v3.bed` in this repository)
+* BED file with available uORFs (`sorted.v4.bed` in this repository)
 * GTF file with genomic features annotation
 * \[optional\] TSV file with gene-level gnomAD constraint statistics
 
@@ -26,7 +26,8 @@ python uORF_annotator.py \
 ```
 ## Output formats specification
 ### tab-separated (tsv) file
-Each row represents annotation of a single variant in particular uORF (per uORF annotation). Fields in the file have the following content:
+
+Two TSV outputs are generated - one for ATG-started uORFs and one - for non-ATG-started ones. Each row represents annotation of a single variant in particular uORF (per uORF annotation). Fields in the file have the following content:
 
 1) #CHROM - contig name
 2) POS - position
@@ -52,16 +53,21 @@ Each row represents annotation of a single variant in particular uORF (per uORF
 
 The generated VCF output contains all variants affecting uORF sequences. Each variant is annotated with the following INFO fields: `uORFs`, `uORFs_ATG`, `uORFs_eff`. The description of fields is given below:
 * `uORFs` - a full consequence annotation for each variant-uORF combination. Format: 'ORF_START|ORF_END|ORF_SYMB|ORF_CONSEQ|main_cds_effect|in_known_CDS|in_known_ORF|utid|overlapping_type|dominance_type|codon_type'
-* `uORFs_ATG` - a flag indicating if a variant falls within ATG-starting uORF.
-* `uORFs_eff` - a short notation of main CDS effect. ext - N-terminal extension, overl - out-of-frame overlap, activ - overlap removal with possible main ORF activation, unaff - no effect on main CDS.
+* `uORFs_ATG` - a flag indicating if a variant falls within at least one ATG-starting uORF.
+* `uORFs_eff` - a short notation of how a change in uORF structure resulting from a variant affects the main coding part (CDS) of a gene. ext - N-terminal extension, overl - out-of-frame overlap, activ - overlap removal with possible main ORF activation, unaff - no effect on main CDS. If the variant falls into more than one uORF, the effects for each uORF are listed separated by `&`.
 ### BED format
-A BED file generated by the *uORF Annotator* contains all uORFs affected by variants that alter the uORF length. The BED file contains one entry for each affected uORF, and one entry for each variant-uORF combination that leads to changes in anticipated length of uORF product. Color legend:
+*uORF Annotator* generates two BED files with uORFs affected by variants that alter the uORF length (one file contains ATG uORFs and the other contains non-ATG-started uORFs). Both BED files contain two entries for each affected uORF:
+1) initial uORF, its `name` field format: uORF_unique_number-gene_name|uORF_type|start_codon_type(ATG/non-ATG), filled with black color;
+2) resulting uORF after introduction of a variant, its `name` field format: uORF_unique_number-gene_name|variant|variant_type|main_CDS_effect, filled with different colors depending on the effect.
+
+Color legend:
 * Grey features - cases when the variant does not change the overlap between uORF and main CDS.
 * Orange features - cases when (a) uORF-truncating variant eliminates the existing overlap between uORF and main CDS; or (b) variant leads to the production of a chimeric protein product of the gene, possessing an extension at the N-terminus resulting from uORF translation
 * Red features - cases where variant leads to the appearance of a new overlapping segment between uORF and main gene CDS, with the two sequences translated in different frames.
+
 ## Supplementary data files
 This repository contains two additional files:
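
Note for readers of this change: the `uORFs` and `uORFs_eff` INFO formats documented in the diff could be consumed along the following lines. This is a minimal sketch, not code from *uORF Annotator*. The function names `parse_uorf_entry` and `parse_uorfs_eff` are hypothetical, and the example values are invented purely for illustration. The sketch assumes each `uORFs` entry carries exactly the eleven pipe-separated fields listed in the README and that `uORFs_eff` joins per-uORF effects with `&`. How several `uORFs` entries are separated within one INFO value is not stated in the README, so the sketch handles a single entry only.

```python
# Field names copied from the README's description of the `uORFs` INFO field.
UORF_FIELDS = [
    "ORF_START", "ORF_END", "ORF_SYMB", "ORF_CONSEQ", "main_cds_effect",
    "in_known_CDS", "in_known_ORF", "utid", "overlapping_type",
    "dominance_type", "codon_type",
]

# Short codes documented for `uORFs_eff`.
KNOWN_EFFECTS = {"ext", "overl", "activ", "unaff"}


def parse_uorf_entry(entry: str) -> dict:
    """Split one pipe-separated variant-uORF annotation into named fields.

    Hypothetical helper: raises ValueError if the field count differs
    from the eleven fields listed in the README.
    """
    values = entry.split("|")
    if len(values) != len(UORF_FIELDS):
        raise ValueError(
            f"expected {len(UORF_FIELDS)} fields, got {len(values)}"
        )
    return dict(zip(UORF_FIELDS, values))


def parse_uorfs_eff(value: str) -> list:
    """Split a `uORFs_eff` value into per-uORF effect codes.

    Hypothetical helper: assumes effects are joined with '&' and
    rejects codes not documented in the README.
    """
    effects = value.split("&")
    unknown = [e for e in effects if e not in KNOWN_EFFECTS]
    if unknown:
        raise ValueError(f"unknown effect codes: {unknown}")
    return effects


if __name__ == "__main__":
    # Invented example values, not real tool output.
    example = "100|160|GENE1|stop_gained|activ|no|no|TX1|overlapping|dominant|ATG"
    record = parse_uorf_entry(example)
    print(record["ORF_SYMB"], record["main_cds_effect"])
    print(parse_uorfs_eff("ext&unaff"))
```

Validating the field count and the effect codes up front is a design choice that makes a format change between annotator versions fail loudly instead of silently shifting columns.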