This repository contains the pipelines and resulting data from the analyses of drug resistance patterns in the UK. The analyses are described in the article below:
Zhukova A., Dunn D., Tostevin A., Gascuel O., on behalf of the UK HIV Drug Resistance Database & the Collaborative HIV, Anti-HIV Drug Resistance Network "Modelling drug resistance emergence and transmission in HIV-1 in the UK" Viruses 2023, 15(6), 1244 doi:10.3390/v15061244
We used the data from the UK HIV Drug Resistance Database [Dunn and Pillay 2007]: protease (PR) and reverse transcriptase (RT) sequences extracted during the resistance tests and the corresponding metadata (e.g., treatment status of the patient before the test: treatment-experienced, -naive, or unknown; and the date of the test). Such data can be requested from the Database.
The Snakemake [Köster et al., 2012] pipepiles for data analyses can be found in the snakemake.
They include:
1. Subtyping pipeline Snakefile_subtype
This pipeline first reads and reformats the input data provided by the UK HIV Drug Resistance database (sequences and metadata). It then subtypes and aligns the sequences with jpHMM [Schultz et al. 2006] against the reference pol alignment from the Los Alamos HIV database [Kuiken et al. 2003].
The input data is assumed to be placed in the data/input folder and contain:
- a fasta file sequences.fasta.xz with the input sequences;
- a comma-separated metadata file with the following columns:
- patientindex --id of the patient, allowing to identify sequence samples from the same patient;
- sequence_id -- id of the sequence, corresponding to the ids in the fasta file. This is assumed to be the second column.
- sampledate_my -- sequence sampling date in MM/YYYY format;
- treatmentstatus -- treatment status before the sample: 1 (naive), 2 (experienced), or 3 (unknown).
We do not provide the above two files. They should be requested from the UK HIV Drug Resistance Database.
- the reference pol alignment: provided in data/input/HIV1_REF_2010_pol_DNA.fa.xz.
The pipeline subtypes and aligns the sequences with jpHMM [[Schultz et al. 2006](https://doi.org/10.1 186/1471-2105-7-265)] against the reference POL alignment from the Los Alamos HIV database [Kuiken et al. 2003] (provided in HIV1_REF_2010_pol_DNA.fa.xz).
This pipeline outputs:
- an alignment file aln.uk.fa.xz (in the data/aln folder, not provided)
- a tab-delimited metadata file data/metadata/metadata.uk.tab
with the following columns:
- id -- sequence id (the first column);
- patientindex -- patient id ;
- sampledate_my -- sequence sampling date in MM/YYYY format;
- treatmentstatus -- treatment status before the sample: 'naive' or 'experienced'; if the value is left blank it means that the status is unknown;
- subtype_jpHMM -- sequence subtype.
snakemake --snakefile Snakefile_subtype --keep-going --cores 4 --use-singularity --singularity-args "--home ~"
2. B and C dataset preparation pipeline Snakefile_aln
This pipeline extracts alignments corresponding to subtypes B and C (adding 5 sequences of other pure subtypes as outgroups), keeping one sequence per patient (the first sample); extracts the DRM information for their sequences with sierrapy; and removes the DRM positions from these alignments.
- an alignment file aln.uk.fa.xz (in the data/aln folder, not provided);
- a tab-delimited metadata file data/metadata/metadata.uk.tab
with (at least) the following columns:
- id -- sequence id (the first column);
- patientindex -- patient id ;
- sampledate_my -- sequence sampling date in MM/YYYY format;
- subtype_jpHMM -- sequence subtype.
- alignment files aln.nodrm.fa.xz without DRM positions for subtypes B and C (in the data/aln/B and data/aln/C folders, not provided);
- outgroup files (containing five outgroup sequence ids, one per line) for B and C, data/aln/B/outgroup.tab and data/aln/C/outgroup.tab;
- tab-delimited DRM metadata files for B and C,
data/metadata/B/metadata.drm.tab and data/metadata/C/metadata.drm.tab,
with the following columns:
- inputSequence -- sequence id (the first column);
- drm columns, e.g. PR_L10F -- the value is either 'sensitive' (if the amino acid is position PR:10 is known and not F), 'resistant' (if it's F) or blank (if the amino acid is unknown).
snakemake --snakefile Snakefile_aln --keep-going --cores 4 --use-singularity --singularity-args "--home ~"
3. Time-scaled tree (for B and C) reconstruction pipeline Snakefile_trees
The pipeline reconstructs phylogenetic trees for B and C with RAxML-NG [Stamatakis, 2014] (model: GTR+G4+FO+IO), roots it with an outgroup (which is removed after rooting) and dated with LSD2 [To et al., 2015] (with outlier removal).
- alignment files aln.nodrm.fa.xz without DRM positions for subtypes B and C (in the data/aln/B and data/aln/C folders, not provided);
- outgroup files (containing five outgroup sequence ids, one per line) for B and C: data/aln/B/outgroup.tab and data/aln/C/outgroup.tab;
- tab-delimited metadata file data/metadata/metadata.uk.tab
with (at least) the following columns:
- id -- sequence id (the first column);
- sampledate_my -- sequence sampling date in MM/YYYY format.
- rooted phylogenetic trees for B and C in newick format (without outgroup): data/trees/B/rooted_raxml.nwk and data/trees/C/rooted_raxml.nwk;
- LSD2-compatible sampling date file: data/metadata/lsd2.dates;
- time-scaled trees for B and C in newick format (with named nodes and associated date and CI metadata, without outliers): data/timetrees/B/raxml.lsd2.nwk and data/timetrees/C/raxml.lsd2.nwk;
- information on mutation rates and outliers, associated with the time-scaled trees for B and C: data/timetrees/B/raxml.lsd2.log and data/timetrees/C/raxml.lsd2.log.
snakemake --snakefile Snakefile_trees --keep-going --cores 4 --use-singularity --singularity-args "--home ~"
4. Data set statistics pipeline Snakefile_data_stats
The pipeline calculated various statistics for B and C datasets (Tables 1 and A1 of the article), as well as lists of DRMs with prevalence > 0.5% in B and C (only teh first sequence is considered for each patient), reformats the DRM metadata by position and extracts information on their polymorphic status and associated ARVs.
- tab-delimited metadata file data/metadata/metadata.uk.tab
with (at least) the following columns:
- id -- sequence id (the first column);
- patientindex -- patient id ;
- sampledate_my -- sequence sampling date in MM/YYYY format;
- subtype_jpHMM -- sequence subtype.
- rooted phylogenetic trees for B and C in newick format: data/trees/B/rooted_raxml.nwk and data/trees/C/rooted_raxml.nwk;
- time-scaled trees for B and C in newick format (with named nodes and associated date and CI metadata, without outliers): data/timetrees/B/raxml.lsd2.nwk and data/timetrees/C/raxml.lsd2.nwk;
- tab-delimited DRM metadata files for B and C,
data/metadata/B/metadata.drm.tab and data/metadata/C/metadata.drm.tab,
with the following columns:
- inputSequence -- sequence id (the first column);
- drm columns, e.g. PR_L10F -- the value is either 'sensitive' (if the amino acid is position PR:10 is known and not F), 'resistant' (if it's F) or blank (if the amino acid is unknown).
- lists of DRMs with prevalence >0.5% for B and C: data/metadata/B/common_drms.txt and data/metadata/C/common_drms.txt;
- lists of DRM positions with DRMs with prevalence >0.5% for B and C: data/metadata/B/common_drms.txt and data/metadata/C/common_drms.txt;
- tab-delimited DRM metadata files for B and C, organised by position and containing information on DRMs with prevalence > 0.5%:
data/metadata/B/metadata.drm.by_position.tab
and data/metadata/C/metadata.drm.by_position.tab,
with the following columns:
- inputSequence -- sequence id (the first column);
- drm position columns, e.g. RT_K219ENQ -- the value is either 'sensitive' (if the amino acid is position RT:219 is known and is not E, N or Q), RT_K219E (if it's E), RT_K219N (if it's N), RT_K219Q (if it's Q) or blank (if the amino acid is unknown);
- tab-delimited table containing ARV metadata for DRMs with prevalence > 0.5% data/metadata/arv_metadata.tab, with following columns:
- mutation -- DRM (if several DRMs are present on the same position, the name is aggregated, e.g. RT_K219ENQ);
- drug class -- NNRTI, NRTI or PI;
- drug full name
- drug abbreviation
- year -- year of drug acceptance;
- tab-delimited table metadata on polymorphicity for DRMs with prevalence > 0.5% data/metadata/drm_types.tab, with following columns:
- DRM -- DRM;
- type -- polymorphic or non-polymorphic;
- Table 1 of the article (with data set stats): data/metadata/Table1.tab
- Table A1 of the article (with stats on multiple drm counts): data/metadata/TableA1.tab
snakemake --snakefile Snakefile_data_stats --keep-going --cores 4 --use-singularity --singularity-args "--home ~"
5. DRM pipeline Snakefile_drm
The pipeline reconstructs and visualizes ancestral characters for common DRMs with PastML [Ishikawa, Zhukova et al. 2019].
- time-scaled trees for B and C in newick format (with named nodes and associated date and CI metadata, without outliers): data/timetrees/B/raxml.lsd2.nwk and data/timetrees/C/raxml.lsd2.nwk;
- lists of DRM positions with DRMs with prevalence >0.5% for B and C: data/metadata/B/common_drms.txt and data/metadata/C/common_drms.txt;
- tab-delimited DRM metadata files for B and C, organised by position and containing information on DRMs with prevalence > 0.5%:
data/metadata/B/metadata.drm.by_position.tab
and data/metadata/C/metadata.drm.by_position.tab,
with the following columns:
- inputSequence -- sequence id (the first column);
- drm, e.g. PR_L10F -- the value is either 'sensitive' (if the amino acid is position PR:10 is known and not F), 'resistant' (if it's F) or blank (if the amino acid is unknown).
- tab-delimited table containing ARV metadata for DRMs with prevalence > 0.5% data/metadata/arv_metadata.tab,
with (at least) the following columns:
- mutation -- DRM (if several DRMs are present on the same position, the name is aggregated, e.g. RT_K219ENQ);
- drug class -- NNRTI, NRTI or PI;
- drug abbreviation
- year -- year of drug acceptance;
- tab-delimited table metadata on polymorphicity for DRMs with prevalence > 0.5% data/metadata/drm_types.tab, with following columns:
- DRM -- DRM;
- type -- polymorphic or non-polymorphic;
- DRM visualizations for B and C (all common DRMs together): data/metadata/B/acr/map.consensus.raxml.lsd2.html and data/metadata/C/acr/map.consensus.raxml.lsd2.html;
- visualization for RT:T215DFSY in B subtype: data/metadata/B/acr/map.RT_T215DFSY.raxml.lsd2.html
- ACR-related files, found in the folders: data/acr//pastml/<DRM_position>.raxml.lsd2/ , in particular marginal probability tables
- PastML tables containing marginal probabilities of ACR for different common DRMs: data/acr//pastml/<DRM_position>.raxml.lsd2/marginal_probabilities.character_<DRM_position>.model_F81.tab
snakemake --snakefile Snakefile_drm --keep-going --cores 4 --use-singularity --singularity-args "--home ~"
6. DRM statistics pipeline Snakefile_drm_stats
The pipeline calculates various statistics for common DRMs.
- time-scaled trees for B and C in newick format (with named nodes and associated date and CI metadata, without outliers): data/timetrees/B/raxml.lsd2.nwk and data/timetrees/C/raxml.lsd2.nwk;
- lists of DRM positions with DRMs with prevalence >0.5% for B and C: data/metadata/B/common_drms.txt and data/metadata/C/common_drms.txt;
- tab-delimited DRM metadata files for B and C, containing information on DRMs with prevalence > 0.5%:
data/metadata/B/metadata.drm.tab
and data/metadata/C/metadata.drm.tab,
with the following columns:
- inputSequence -- sequence id (the first column);
- drm position columns, e.g. RT_K219ENQ -- the value is either 'sensitive' (if the amino acid is position RT:219 is known and is not E, N or Q), RT_K219E (if it's E), RT_K219N (if it's N), RT_K219Q (if it's Q) or blank (if the amino acid is unknown);
- tab-delimited table containing ARV metadata for DRMs with prevalence > 0.5% data/metadata/arv_metadata.tab,
with (at least) the following columns:
- mutation -- DRM (if several DRMs are present on the same position, the name is aggregated, e.g. RT_K219ENQ);
- drug class -- NNRTI, NRTI or PI;
- drug abbreviation
- year -- year of drug acceptance;
- tab-delimited table metadata on polymorphicity for DRMs with prevalence > 0.5% data/metadata/drm_types.tab, with following columns:
- DRM -- DRM;
- type -- polymorphic or non-polymorphic;
- a tab-delimited metadata file data/metadata/metadata.uk.tab
with the following columns:
- id -- sequence id (the first column);
- patientindex -- patient id ;
- sampledate_my -- sequence sampling date in MM/YYYY format;
- treatmentstatus -- treatment status before the sample: 'naive' or 'experienced'; if the value is left blank it means that the status is unknown;
- subtype_jpHMM -- sequence subtype;
- PastML tables containing marginal probabilities of ACR for different common DRMs: data/acr//pastml/<DRM_position>.raxml.lsd2/marginal_probabilities.character_<DRM_position>.model_F81.tab
- ACR visualizations for common B and C DRMs: data/acr/B/map.consensus.raxml.lsd2.html and data/acr/C/map.consensus.raxml.lsd2.html
- ACR visualizations for RT:T215DFSY in B: data/acr/B/map.RT_T215DFSY.raxml.lsd2.html;
- tab-delimited table with DRM loss statistics from [Castro et al. 2013]: data/input/Castro.tab.
- Tables 2 and 3 of the article (in .tex format): data/metadata/B/raxml.lsd2_drm_stats.tex and data/metadata/C/raxml.lsd2_drm_stats.tex;
- Table 4 of the article: data/metadata/Table4.tab
- Supp Mat Tables on DRM TDR/ADR/loss count statistics over time (in .tex format): data/metadata/Supp_DRM_Tables.tex
snakemake --snakefile Snakefile_drm_stats --keep-going --cores 4 --use-singularity --singularity-args "--home ~"