-
Notifications
You must be signed in to change notification settings - Fork 0
05. Basic ATAC analyses
The snakemake pipeline requires only two files: a) the .snakefile
, containing all the rules that will be run; b) the configuration.yaml
file, in which the user can define and customize all the parameters for the different pipeline steps.
To partially avoid unexpected errors during the execution of the pipeline, a so called 'dry-run' is strongly recommended. Indeed, adding a -n
at the end of the snakemake running command will allow snakemake to check that all links and file/parameters dependencies are satisfied before to run the "real" processes. This command will therefore help the debugging process.
snakemake \
-s </target/folder>/snakeATAC/workflow/snakeATAC_analyses.snakefile \
--configfile </target/folder>/snakeATAC/config/snakeATAC_analyses_config.yaml \
--cores 12 \
-n
If no errors occur, the pipeline can be run with the same command line without the terminal -n
:
Notice that the absence of errors does not mean that the pipeline will run without any issues; the "dry-run" is only checking whether all the resources are available.
Furthermore, there is the possibility to run the pipeline only partailly. An example of usage could be if someone wants to have a look to the fast quality controls (fastQC and multiQC reports) before to perform the alignment. To do that, it is sufficient to run a dry-run (-n
mode); pick the name of the rule at which you want the pipeline to stop; type the following command:
snakemake \
-s </target/folder>/snakeATAC/workflow/snakeATAC_analyses.snakefile \
--configfile </target/folder>/snakeATAC/config/snakeATAC_analyses_config.yaml \
--cores 12 \
--until stop_rule_name
The snakefile are contained all the condition checks and rules (processing steps) that will be performed by the pipeline. In the following schematic mapping the main steps performed by the pipeline are depicted.
Briefly, first of a quality control (fastQC) of the raw fastq data is performed. Then, bwa-mem2
is used to align the paired-end sequences onto the genome of reference generating temporary BAM files that are further filtered for mapping quality score and deprived of mitochondrial reads. From this alignment files duplicated reads are removed (or marked) by PICARD
/GATK4
and then shifted to correct the Tn5 nick repair bias (Yan F., et al., Genome Biol. 2020). A quality control for each alignment and a multiQC report are generated, as well samtools flagstat metrics and deeptools fragment size distribution plots (see 03_BAM_dedup / 03_BAM_mdup paragraph for more details).
Signal coverage is then normalized as RPM (Reads Per Million: coverage / (total_reads/106)), and peaks are called by MACS3
which will generate narrowPeak files.
Moreover, deeptools multiBigwigSummary
will be used to generate a whole genome score matrix useful to infer the overall correlation/variability of the samples. This matrix is the input of deeptools plotCorrealtion
and poltPCA
(see 06_Overall_quality_and_info paragraph for more details).
Another analyses performed to determine sample variability is the quantification of the signal at any peak called in any sample. For this analysis, all the narrowPeak files are concatenated and merged together by bedtools merge and the score at each peak for each sample computed by multiBigwigSummary
. The resulting matrix may be used to generate two clustered heatmaps: one on the log1p of the raw scores, and one on the z-score (on rows).
More details on configuration parameters and structure/interpretation of the results can be found in the next paragraphs.
The configuration file is a yaml-formatted file containing all the parameters that are passed to different steps of the pipelines such as the directory with the input files, reference genome, threads of the tools, etc.
Hereafter, the meaning of the different parameters required for the basic analyses are described.
Parameter | Description |
---|---|
workflow_configuration: | |
runs_directory | The full path to the directory were the input fastq files are contained, e.g. /home/user/ATAC/00_runs/ . Importantly, the name of the files, deprived of the read suffix (e.g., _R1/_R2) and file extension (e.g., .fastq.gz) will be used as sample name. |
output_directory | The full path to the folder in which the results should be stored, e.g. "/home/user/ATAC/" . |
bam_features: | |
bam_suffix | Default: ".bam" . String with the suffix of the source bam files. |
skip_bam_filtering | Default: True . True/False to indicate whether the bam MAPQ filtering and MarkDuplicates should be skipped. Useful to save computation when the bam files have been generated by the Reads mapping pipeline. The bams will be linked in a folder, while the .bai index and the flagstat are re-computed. |
remove_other_chromosomes_pattern | Default: "CMV|HBV|HTLV|HPV|SV40|MCV|KSHV|Un|random|HCV|HIV|EBV" . By default only the mitochondrial chromosome is removed. However, with this option, other chromosomes can be removed indicating the pattern to use. |
umi_present | Default: False . True/False to indicate whether the data contain UMIs (ignored for single-end data). |
remove_duplicates | Default: True (recommended). True/False to define whether remove the duplicates from the bam files (if true the tag in the bams will be _dedup instead of _mdup). |
MAPQ_threshold | Default: 20 . All reads with a mapping quality (MAPQ) score lower than this value will be filtered out from the bam files. |
minFragmentLength | Default: 0 (bp). Number indicating the minimal length (in base-pairs, bp) required for a fragment to be retrained during signal normalization. |
minFragmentLength | Default: 2000 (bp). Number indicating the maximal length (in base-pairs, bp) required for a fragment to be retrained during signal normalization. |
genomic_annotations: | |
genome_id | No default. A string indicating the genome id (used only for the CNV correction; e.g., "hg19", "hg38", ...). |
genome_fasta | The full path to a fasta-formatted file containing the sequence of there reference genome into which the reads will be aligned. If the index (.fai) file is not present in the same folder, it will be build by the pipeline during its execution. The reference genomes can be downloaded, for instance, from the UCSC golden path. |
blacklist | The full path to a BED-formatted file containing the regions corresponding to blacklisted regions (regions masked for normalization and peak calling). Blacklisted regions for the most common genome assemblies can be downloaded from the Boyle lab's git-page. |
effective_genomeSize | Effective genome size used to normalize the ATAC-seq signal; e.g. for Hg38: 2900338458. A table for the most common genome assemblies is available in this repository at SPACCa/resources/chromosome_sizes_for_normalization.yaml. |
ignore_for_normalization | A string of space separated chromosome names that should be excluded for ChIP-seq signal normalization, e.g. "X Y MT M KI270728.1". A table for the most common genome assemblies is available in this repository at snakeATAC/resources/chromosome_sizes_for_normalization.yaml. |
peak_calling: | |
qValue_cutoff | Default: 0.001 . False Discovery Ratio (FDR) (q-value) cutoff used by MACS3 to filter the significant peaks. |
call_summits | Default: True . Boolean to indicate whether summits should be called by MACS3. |
FRiP_threshold | Default: 20 (%). This value will be used to label the FRiP (Fraction of Reads in Peaks) score as good or bad in the summary table for each single sample. A FRiP above the 20% (FRiP = 0.2) is considered a good score for ATAC-seq peaks by the ENCODE guidelines. |
peak_score_heatmap | Computation of this heatmap can be quite heavy and clustering sometimes not possibile. For this reason the sub-parameter plot_heatmap: False allows for the skipping of this step. If True, the parameters raw_heatmap_color and zScore_heatmap_color can be used to define the color scale for the raw and z-score normalized heatmaps, respectively. These values are passed to matplotlib/seaborn. Therefore, available options (see matplotlib page for examples) are: 'Accent', 'Blues', 'BrBG', 'BuGn', 'BuPu', 'CMRmap', 'Dark2', 'GnBu', 'Greens', 'Greys', 'OrRd', 'Oranges', 'PRGn', 'Paired', 'Pastel1', 'Pastel2', 'PiYG', 'PuBu', 'PuBuGn', 'PuOr', 'PuRd', 'Purples', 'RdBu', 'RdGy', 'RdPu', 'RdYlBu', 'RdYlGn', 'Reds', 'Set1', 'Set2', 'Set3', 'Spectral', 'Wistia', 'YlGn', 'YlGnBu', 'YlOrBr', 'YlOrRd', 'afmhot', 'autumn', 'binary', 'bone', 'brg', 'bwr', 'cividis', 'cool', 'coolwarm', 'copper', 'cubehelix', 'flag', 'gist_earth', 'gist_gray', 'gist_heat', 'gist_ncar', 'gist_rainbow', 'gist_stern', 'gist_yarg', 'gnuplot', 'gnuplot2', 'gray', 'hot', 'hsv', 'icefire', 'inferno', 'jet', 'magma', 'mako', 'nipy_spectral', 'ocean', 'pink', 'plasma', 'prism', 'rainbow', 'rocket', 'seismic', 'spring', 'summer', 'tab10', 'tab20', 'tab20b', 'tab20c', 'terrain', 'twilight', 'twilight_shifted', 'viridis', 'vlag', 'winter'. |
quality_controls | |
fragmentSize_window_length | Default: 1000 (bp). Size, in bp, of the sliding window used to compute the fragment size distribution by deeptools bamPEFragmentSize. |
multiBigwigSummary_binning_window_size | Default: 10000 (bp). Size of the bins, in bp, by which the genome should be subdivided in order to compute the average score matrix by deeptools multiBigwigSummary. This matrix is used to compute sample correlation and Principal Component Analyses (PCA). |
correlation_heatmap_color | Default: "Blues" . A string indicating the color gradient pattern to use for the correlation heatmaps. This value is passed to matplotlib/seaborn. Therefore, available options (see matplotlib page for examples) are the following: 'Accent', 'Blues', 'BrBG', 'BuGn', 'BuPu', 'CMRmap', 'Dark2', 'GnBu', 'Greens', 'Greys', 'OrRd', 'Oranges', 'PRGn', 'Paired', 'Pastel1', 'Pastel2', 'PiYG', 'PuBu', 'PuBuGn', 'PuOr', 'PuRd', 'Purples', 'RdBu', 'RdGy', 'RdPu', 'RdYlBu', 'RdYlGn', 'Reds', 'Set1', 'Set2', 'Set3', 'Spectral', 'Wistia', 'YlGn', 'YlGnBu', 'YlOrBr', 'YlOrRd', 'afmhot', 'autumn', 'binary', 'bone', 'brg', 'bwr', 'cividis', 'cool', 'coolwarm', 'copper', 'cubehelix', 'flag', 'gist_earth', 'gist_gray', 'gist_heat', 'gist_ncar', 'gist_rainbow', 'gist_stern', 'gist_yarg', 'gnuplot', 'gnuplot2', 'gray', 'hot', 'hsv', 'icefire', 'inferno', 'jet', 'magma', 'mako', 'nipy_spectral', 'ocean', 'pink', 'plasma', 'prism', 'rainbow', 'rocket', 'seismic', 'spring', 'summer', 'tab10', 'tab20', 'tab20b', 'tab20c', 'terrain', 'twilight', 'twilight_shifted', 'viridis', 'vlag', 'winter'. |
plotFingerprint | Parameters for the plotting of the Lorenz curves:
|
The output is organized in multiple folders including:
- 01_BAM_filtered: when the reads are aligned onto the reference genome by bwa, the resulting files are filtered for mapping quality (MAPQ) and the mithocondrial (suffix: woMT) reads are removed before sorting. Flagstat metrics is generated for each file and stored in the homonym folder. Then, PICARD/GATK4 is used to remove (suffix: dedup) or mark (suffix: mdup) duplicates in the BAM files. The PICARD metrics is stored in the "metrics" folder.
- 02_BAM_fastQC: fastqc is run on filtered and deduplicated bams and the results for each sample are stored in this folder.
- 03_Normalization: The Tn5 nick reparation bias is corrected by converting bams to BEDPE (BED Paired-End), unpaired reads and reads mapping in blacklisted regions are removed. Then fragments are shifted and converted to bedGraphs representing the genome coverage. The coverage is at the end normalized to RPM (Reads Per Million: coverage / (total_reads/106)) and converted in bigWigs. If Copy Number Variations (CNV) calling have been requested, an extra folder will show the bigWigs RPM normalized and corrected for CNVs. However, these bigWig files can be normalized more precisely normalized in the case that you dispone of a corresponding RNA-seq data set using CHIPIN (L. Polit et.al, BMC Bioinformatics, 2021). Examples of CHIPIN usage can be found at S. Gregoricchio et al., Nucleic Acids Research (2022).
- 04_MACS3_peaks: Peaks and summits (if required by the user) are called by MACS3 (narrow mode). The FDR (False Discovery Ratio) threshold used is indicated in the file name (suffix: qValue).
- 05_Footprinting_TOBIAS: results of Tn5-bias corrected data performed by TOBIAS.
- 06_Overall_quality_and_info: this folder contains multiple quality controls, feature counts and sample correlation plots. More details are available in the next paragraph.
This folder contains the results of featureCounts (from subread) with the counts of reads and other statistics on called peaks for each sample. It is available also tab-separated file containing a summary of the main features counts for each sample:
Summary counts table description
Column | Description |
---|---|
Sample | Sample name |
dedup_BAM | Total number of reads left after BAM reads deduplication. |
shifted_BAM | Number of reads in the shifted BAMs. |
loss_post_shifting | Number of reads lost upon BAM shifting. Consider that reads falling in blacklisted regions are removed. |
n.peaks | Total number of peaks called. |
FRiP.perc | Frequency Reads in Peaks percentage, corresponds to the number of reads falling in peak regions divide by the total number of reads and multiplied by 100. |
FRiP.quality | A label ("good" or "bad") to indicate whether the FRiP score is good or not for a given sample. The threshold can be changed in the config file by the user, by the default 20 (as suggested by the ENCODE guidelines). |
in this folder the distribution of the fragment sizes for each sample and a file collecting all the plots in a single file. Here after an example of a good (left) and a bad (right) fragment size distribution.
An optimal fragment size distribution should be included within a range of 50-800bp, with a periodicity of ~150bp (corresponding to mono-, di-, tri-, ... nucleosomes) with a lower count for larger fragments.
The Lorenz_curve_deeptools.plotFingreprint_allSamples.pdf
is a plot showing the enrichment of the signal allover the genome. Indeed, if a sample does not show any enrichment the reads will equally distributed over the genome resulting in a diagonal line in the plot (left panel). When instead the signal is specific for the feature sought (e.g., open chromatin) it will be enriched only at specific location and the curve will be closer to the bottom-right corner of the plot (right panel).
This folder contains a report combining all the individual fastqc available in the 02_BAM_fastQC directory.
The plots in this folder help the user to understand the variability of the samples.
-
multiBigWigSummary_matrix_allSamples.npz
: result of deeptools multiBigWigSummary used to plot the PCA and correlation plots; -
PCA_on_BigWigs_wholeGenome.pdf
: Principal Component Analyses results of the signal allover the genome; -
Peak_comparison
:-
all_samples_peaks_concatenation_collapsed_sorted.bed
: the peaks called in all samples are merged and collapsed in this bed file; -
peaks_score_matrix_all_samples_MACS3.npz
: a matrix containing the average score at each peak (previous bed file) for each samples is generated; -
peaks_score_matrix_all_samples_table_MACS3.tsv
: same matrix as before, but in tab-separated format. -
Heatmaps
: the matrix generated on all peaks is used to cluster the samples and two heatmaps are plotted: one on the log1p of the raw scores, and one on the z-score (on rows)
-
-
Sample_correlation
: scatter and heatmap correlation plots are generated based on the signal over the whole genome. Both Pearson and Spearman methods are used.
The structure of the output_folder is the following:
output_folder ├── 01_BAM_filtered │ ├── sample_mapQ20_woMT_dedup_shifted_sorted.bam │ ├── sample_mapQ20_woMT_dedup_shifted_sorted.bam.bai │ ├── flagstat │ │ ├── sample_flagstat_filtered_bam_woMT_dedup.txt │ │ └── sample_flagstat_woMT_dedup_shifted_sorted.txt │ ├── metrics │ │ └── sample_metrics_woMT_dedup_bam.txt │ └── unshifted_bams │ ├── sample_mapQ20_sorted_woMT_dedup.bam │ └── sample_mapQ20_sorted_woMT_dedup.bai │ ├── 02_BAM_fastQC │ ├── sample_sorted_woMT_dedup_fastqc.html │ └── sample_sorted_woMT_dedup_fastqc.zip | ├── 03_Normalization │ ├── RPM_normalized │ │ └── sample_mapq20_woMT_dedup_shifted_RPM.normalized.bw │ └── RPM_normalized_CNA.corrected │ └── sample_mapq20_woMT_dedup_shifted_RPM.normalized_CNA.corrected_bs10.bw │ ├── 04_MACS3_peaks │ ├── sample_mapQ20_woMT_dedup_shifted_qValue0.01_peaks.narrowPeak │ ├── sample_mapQ20_woMT_dedup_shifted_qValue0.01_peaks_chr.narrowPeak │ ├── sample_mapQ20_woMT_dedup_shifted_qValue0.01_peaks.xls │ ├── sample_mapQ20_woMT_dedup_shifted_qValue0.01_summits.bed │ └── log │ └── sample_mapQ20_woMT_dedup_shifted_FDR0.01.log | ├── 05_Footprinting_TOBIAS │ ├── sample_mapq20_sorted_woMT_dedup_AtacBias.pickle │ ├── sample_mapq20_sorted_woMT_dedup_atacorrect.pdf │ ├── sample_mapq20_sorted_woMT_dedup_bias.bw │ ├── sample_mapq20_sorted_woMT_dedup_corrected.bw │ ├── sample_mapq20_sorted_woMT_dedup_expected.bw │ ├── sample_mapq20_sorted_woMT_dedup_expected.bw │ └── sample_mapq20_sorted_woMT_dedup_uncorrected.bw │ ├── 06_Overall_quality_and_info | ├── Lorenz_curve_deeptools.plotFingreprint_allSamples.pdf | ├── Counts | │ ├── counts_summary.txt | │ └── subread_featureCounts_output | │ └── sample | │ ├── sample.readCountInPeaks | │ ├── sample.readCountInPeaks.log | │ └── sample.readCountInPeaks.summary │ ├── fragmentSizeDistribution_plots | | ├── ALL.samples_fragment_size_distribution.pdf │ │ └── sample_fragment_size_distribution.pdf │ ├── multiQC_dedup_bams │ │ ├── multiQC_report_BAMs_dedup_data │ │ │ ├── multiqc_citations.txt │ │ │ ├── multiqc_data.json │ │ │ ├── multiqc_fastqc.txt │ │ │ ├── multiqc_general_stats.txt │ │ │ ├── multiqc.log │ │ │ └── multiqc_sources.txt │ │ └── multiQC_report_BAMs_dedup.html | └── Sample_comparisons | ├── multiBigWigSummary_matrix_allSamples.npz | ├── PCA_on_BigWigs_wholeGenome.pdf | ├── Peak_comparison | │ ├── all_samples_peaks_concatenation_collapsed_sorted.bed | │ ├── peaks_score_matrix_all_samples_MACS3.npz | │ └── peaks_score_matrix_all_samples_table_MACS3.tsv | | └── Heatmaps | | ├── Heatmap_on_log1p.rawScores_for_MACS3.peaks_union_population.pdf | │ └── Heatmap_on_zScores_for_MACS3.peaks_union_population.pdf | └── Sample_correlation | ├── Correlation_heatmap_on_BigWigs_wholeGenome_pearsonMethod.pdf | ├── Correlation_heatmap_on_BigWigs_wholeGenome_spearmanMethod.pdf | ├── Correlation_scatterplot_on_BigWigs_wholeGenome_pearsonMethod.pdf | └── Correlation_scatterplot_on_BigWigs_wholeGenome_spearmanMethod.pdf | ...
Contributors