Based on Rfam 14.8 (May 2022). See releases for previous versions.
This repository contains the code and data for analysing the taxonomic distribution of the Rfam families. The goal is to identify domain-specific subsets of Rfam covariance models for annotating bacterial, eukaryotic, and other genomes with the Infernal software.
The code uses the Rfam public MySQL database to compare the taxonomic domains of sequences from the manually curated seed alignments and the automatically identified full region hits.
📂 The results are organised in several files in the domains folder. Each file contains seven columns:
- Family = Rfam accession (e.g. RF00001)
- Domain = Taxonomic domain where the family is found (❕ this is the most important column)
- Seed domains = All taxonomic domains from the seed alignment
- Full region domains = All taxonomic domains from full region hits
- Rfam ID = Rfam identifier (e.g. 5S_rRNA)
- Description = Family description
- RNA type = One of the Rfam RNA types
Domain can be:
- a single domain (for example, Bacteria or Eukaryota) if the majority of hits (>=90%) are from the same domain both in seed and full region hits;
- <seed domain>/<full region domain> if the seed and full region domains are not the same, in which case both are listed. For example, Viruses/Eukaryota means that the seed alignment contains mostly Viruses and the full region hits contain mostly Eukaryota;
- Mixed if there is no single domain where the family occurs. For example, 5S rRNA RF00001 is expected to be found in Bacteria, Archaea, and Eukaryota;
- <seed region domain>/Mixed or Mixed/<full region domain> if only one of the two sets of hits has a single dominant domain. For example, Bacterial SSU RF00177 has only Bacteria in the seed alignment but the full region hits also contain Eukaryota, because the mitochondrial and plastid SSU is similar to the bacterial SSU and is expected to match the bacterial model.
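The labelling rules above can be sketched in a few lines of Python. This is only an illustration of the rules as described here, not the code in rfam-taxonomy.py: the function names, the input format (a dict of hit counts per domain) and the treatment of an empty hit set as Mixed are all assumptions.

```python
THRESHOLD = 90.0  # percent; the ">=90%" majority rule described above


def dominant_domain(counts, threshold=THRESHOLD):
    """Return the domain holding >= threshold percent of hits, else 'Mixed'.

    `counts` maps a domain name to its number of hits (hypothetical format).
    """
    total = sum(counts.values())
    if total == 0:
        return "Mixed"  # assumption: no hits means no dominant domain
    domain, count = max(counts.items(), key=lambda item: item[1])
    return domain if 100.0 * count / total >= threshold else "Mixed"


def classify_family(seed_counts, full_counts):
    """Combine the seed and full region labels into the Domain value."""
    seed = dominant_domain(seed_counts)
    full = dominant_domain(full_counts)
    # Identical labels collapse to one value; otherwise both are listed.
    return seed if seed == full else f"{seed}/{full}"
```

For instance, `classify_family({"Viruses": 10}, {"Eukaryota": 99, "Viruses": 1})` gives `Viruses/Eukaryota`, matching the second rule in the list.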
✅ View summary with the number of families observed in each domain.
The latest version of the files can be retrieved directly from GitHub using the following URL format:
- https://raw.githubusercontent.com/Rfam/rfam-taxonomy/master/domains/all-domains.csv
- https://raw.githubusercontent.com/Rfam/rfam-taxonomy/master/domains/bacteria.csv
- https://raw.githubusercontent.com/Rfam/rfam-taxonomy/master/domains/archaea.csv
- https://raw.githubusercontent.com/Rfam/rfam-taxonomy/master/domains/viruses.csv
It is also possible to download the data and use it locally or regenerate the files (see the Installation section below).
- If you are interested in a subset of Rfam families that match Bacteria, you can use the bacteria.csv file. For example, the following command generates a bacteria.cm file with a subset of Rfam covariance models that can be used with the Infernal cmscan program:

      curl https://raw.githubusercontent.com/Rfam/rfam-taxonomy/master/domains/bacteria.csv | \
      cut -f 1,1 -d ',' | \
      tail -n +2 | \
      cmfetch -o bacteria.cm -f Rfam.cm.gz -

  where cmfetch is part of the Infernal suite and Rfam.cm.gz can be downloaded from ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz.
- You can also further process the all-domains.csv file. For example, to eliminate any families that find hits outside Bacteria, you can focus on rows where the second column is Bacteria and the third and the fourth columns contain Bacteria (100.0%). Note that such a subset would ignore many important RNA families that detect some contamination in eukaryotic sequences.
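The stricter filter described in the last item could look roughly like this in Python. It is a minimal sketch that assumes the file has a header row (as the `tail -n +2` step above implies) and that columns are addressed by position; the function name `strict_domain_families` is hypothetical.

```python
import csv
import io


def strict_domain_families(csv_text, domain="Bacteria"):
    """Return accessions whose seed and full region hits are 100% `domain`."""
    exact = f"{domain} (100.0%)"
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row (assumed to be present)
    return [
        row[0]  # column 1: Rfam accession
        for row in reader
        if len(row) >= 4
        and row[1] == domain  # column 2: Domain
        and exact in row[2]   # column 3: Seed domains
        and exact in row[3]   # column 4: Full region domains
    ]
```

Passing the contents of a downloaded all-domains.csv as `csv_text` returns a list of accessions, which can be written one per line to a key file for `cmfetch -f`.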
Clone or download this repository and run the following commands:
virtualenv ENV
source ENV/bin/activate
pip install -r requirements.txt
After each Rfam release, the data in this repo need to be updated locally and pushed to GitHub.
- Generate new data

      # when running for the first time (needs to run in this order):
      python rfam-taxonomy.py --precompute-full
      python rfam-taxonomy.py --precompute-seed
      # after precompute is done, run:
      python rfam-taxonomy.py
      # to see additional options:
      python rfam-taxonomy.py --help

- Review the changes

  The results must be manually reviewed before committing the new files by checking the difference between the old and the new versions using git.
  It is normal for the values in the 3rd and 4th columns to change, but Domain, the 2nd column, should stay stable unless the affected family has been significantly updated.
- Update release info in Readme
- Create new GitHub release
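As an aid for the review step, the check that the Domain column stays stable could be scripted along these lines. This is a sketch only and not part of the repository: the function names are hypothetical, and it assumes a header row with the accession in the first column and Domain in the second.

```python
import csv
import io


def read_domains(csv_text):
    """Map each Rfam accession (column 1) to its Domain value (column 2)."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row (assumed to be present)
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def changed_domains(old_text, new_text):
    """Return {accession: (old_domain, new_domain)} for families whose Domain changed."""
    old = read_domains(old_text)
    new = read_domains(new_text)
    return {
        acc: (old[acc], new[acc])
        for acc in sorted(old.keys() & new.keys())  # families present in both versions
        if old[acc] != new[acc]
    }
```

The old contents can be taken from git (for example, the output of `git show HEAD:domains/all-domains.csv`) and the new contents from the regenerated file. Families that appear in only one version are not reported by this sketch and still need to be checked in the git diff.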
Feel free to create GitHub issues to ask questions or provide feedback. Pull requests are also welcome.