Skip to content

05. Peak calling and normalization

Sebastian Gregoricchio edited this page Oct 28, 2023 · 18 revisions

Workflow configuration files

Sample table

The pipeline requires a sample configuration file which provides information about ChIP-Input pairs and the type of peak calling to perform (broad or narrow).
This configuration file must be in a tab-delimited text file (including column names) containing the following information (respect the column order):

target_id input_id broad
sample_A input_A-B False
sample_B input_A-B False
sample_C input_C True

A dummy-table could be found in resources/peakCalling_sampleConfig_example.txt.


Pipeline parameters

Despite many parameters are already available in the SPACCA_ChIPanalyses_configfile.yaml file, some additional information must be updated in the configuration file:

  • the source bam directory (e.g. rename folder)
  • the output directory where you want your results to be stored (if not already available, the pipeline will make it for you)
  • whether your data contain UMIs
  • whether your data are paired- or single-end
  • the genome to use (in fasta format, and the genome id for CNV calling)
  • the path to the sample configuration table

Other pipeline parameters

Hereafter there are some details of additional parameters available in the configfile_peakCalling.yaml. However, default parameters are already pre-set and should not be changed without expert advices.
If you wish to make changes, just make a copy of the config file and provide the path to the new file in the snakemake running command line.

Parameter Description
bam_suffix Default: ".bam". String with the suffix of the source bam files.
skip_bam_filtering Default: False. True/False to indicate whether the bam MAPQ filtering and MarkDuplicates should be skipped. Useful to save computation when the bam files have been generated by the DNA-mapping pipeline. The bams will be linked in a folder, while the .bai index and the flagstat are re-computed.
umi_present Default: True. True/False to indicate whether the data contain UMIs (ignored for single-end data).
remove_duplicates Default: False. True/False to define whether remove the duplicates from the bam files (if true the tag in the bams will be _dedup instead of _mdup).
MAPQ_threshold Default: 20. All reads with a mapping quality (MAPQ) score lower than this value will be filtered out from the bam files.
bigWig_binSize Default: 50. Size, in bp, of the bins used to compute the normalized bigWig files.
use_macs3 Default: False. True/False to define whether to run macs3 instead of macs2.
macs_qValue_cutoff Default: 0.01. False Discovery Ratio (FDR) (q-value) cutoff used by MACS to filter the significant peaks.
perform_plotFingerprint Default: False. True/False to define whether perform the finger printing (Lorenz curve).
perform_fragmentSizeDistribution Default: False. True/False to define whether to plot the fragment size distribution (Paired-end only).
fragment_length Default: 200. Size in bp of the virtual fragment length at which each read should be extended in order to perform the plotFingerprint.
correlation_heatmap_colorMap Default: 'PuBuGn'. A string indicating the color gradient pattern to use for the correlation heatmaps. This value is passed to matplotlib/seaborn. Therefore, available options (see matplotlib page for examples) are the following: 'Accent', 'Blues', 'BrBG', 'BuGn', 'BuPu', 'CMRmap', 'Dark2', 'GnBu', 'Greens', 'Greys', 'OrRd', 'Oranges', 'PRGn', 'Paired', 'Pastel1', 'Pastel2', 'PiYG', 'PuBu', 'PuBuGn', 'PuOr', 'PuRd', 'Purples', 'RdBu', 'RdGy', 'RdPu', 'RdYlBu', 'RdYlGn', 'Reds', 'Set1', 'Set2', 'Set3', 'Spectral', 'Wistia', 'YlGn', 'YlGnBu', 'YlOrBr', 'YlOrRd', 'afmhot', 'autumn', 'binary', 'bone', 'brg', 'bwr', 'cividis', 'cool', 'coolwarm', 'copper', 'cubehelix', 'flag', 'gist_earth', 'gist_gray', 'gist_heat', 'gist_ncar', 'gist_rainbow', 'gist_stern', 'gist_yarg', 'gnuplot', 'gnuplot2', 'gray', 'hot', 'hsv', 'icefire', 'inferno', 'jet', 'magma', 'mako', 'nipy_spectral', 'ocean', 'pink', 'plasma', 'prism', 'rainbow', 'rocket', 'seismic', 'spring', 'summer', 'tab10', 'tab20', 'tab20b', 'tab20c', 'terrain', 'twilight', 'twilight_shifted', 'viridis', 'vlag', 'winter'.



Performing peak calling and normalization

To partially avoid unexpected errors during the execution of the pipeline, a so called 'dry-run' is strongly recommended. Indeed, adding a -n flag at the end of the snakemake running command will allow snakemake to check that all links and file/parameters dependencies are satisfied before to run the "real" processes. This command will therefore help the debugging process.
Always activate your environment, otherwise the pipeline won't be able to find the packages required for the analyses.

NOTE: if the bam files derive from the DNA-mapping pipeline you can save time by adding the flag skip_bam_filtering="True" (MAPQ filter and MarkDuplicates are skipped). Notice that you may need to add/modify the flag for the bam suffix to something like "_mapq20_mdup_sorted.bam" in order to match the extension of the output files of the DNA-mapping pipeline.

snakemake \
--cores 10 \
-s </target/folder>/ChIP_Zwart/workflow/SPACCA_ChIPanalyses.snakefile \
--configfile </target/folder>/ChIP_Zwart/config/SPACCA_ChIPanalyses_configfile.yaml \
--config \
-n

If no errors occur, the pipeline can be run with the same command but without the final -n flag:

Notice that the absence of errors does not mean that the pipeline will run without any issues; the "dry-run" is only checking whether all the resources are available.



Peak calling workflow

Here after you can see the full potential workflow of the paired-end and single-end ChIP-seq pipeline:

SINGLE-END

PE workflow



PAIRED-END

PE workflow



Peak calling results

The results structure is the following:

  • 01_BAM_filtered -> bams filtered for mapping quality (MAPQ) and with the duplicates marked/removed (if filtering was skipped, this folder will contain a symbolic link to the source bam and new indexes). If GCbias correction has been requested, a folder named GCbias_corrected_files will contain GC corrected bams (do not apply any deduplication on these files!)
  • 02_fastQC_on_BAM_filtered -> individual fastQC for each filtered bam
  • 03_bigWig_bamCoverage -> bigWig of the bam coverage normalized (RPGC = Reads Per Genomic Content) or not (raw_coverage) depending on the sequencing depth. If GC-bias and/or CNV have been requested, other sub-folders may be included: RPGC_normalized_CNA.corrected, RPGC_normalized_GCcorrected.
  • 04_Called_peaks -> peaks called with macs2/3 (de-blacklisted). If single-end, it can be found also a folder with the output of phantompeakqualtools as the calculated fragment length is used for running macs2/3 with single-end data. If the bed file does not contain already the 'chr' for the "canonical" chromosomes, it will be added in a separated file ending by _chr.narrow/broadPeak
  • 05_Quality_controls_and_statistics -> this folder contains sample correlations heatmaps and PCAs, a multiQC report containing multiple info (number of reads, duplicates, peak counts and fragmenth lenght, phantom results), statistics on the called peaks (FRiP, number, etc.)

Here an example directory tree:

output_folder
├── 01_BAM_filtered
│   ├── sample_mapq20_mdup_sorted.bam
│   ├── sample_mapq20_mdup_sorted.bai
│   ├── flagstat
│   │   └── sample_mapq20_mdup_sorted_flagstat.txt
|   ├── MarkDuplicates_metrics
│   │   └── sample_MarkDuplicates_metrics.txt
│   └── umi_metrics  ### (if UMI present) ##
│       └── sample_UMI_metrics.txt
|
├── 02_fastQC_on_BAM_filtered
│   ├── sample_sorted_woMT_dedup_fastqc.html
│   └── sample_sorted_woMT_dedup_fastqc.zip
|
├── 03_bigWig_bamCoverage
│   ├── raw_coverage
│   │   └── sample_mapq20_mdup_raw.coverage_bs10.bw
│   └── RPGC_normalized
│       └── sample_mapq20_mdup_RPGC.normalized_bs10.bw
│
├── 04_Called_peaks   ### BAM if Single-End ###
│   ├──phantom
│   │   └── sample.phantom.ssp.out
│   ├── sample.filtered.BAMPE_peaks_chr.narrowPeak
│   ├── sample.filtered.BAMPE_peaks.narrowPeak
│   └── sample.filtered.BAMPE_peaks.xls
|
└── 05_Quality_controls_and_statistics
    ├── multiQC
    │   └── multiQC_report.html
    ├── peaks_stats
    │    └── all_samples_FRiP_report.tsv
    ├── plotFingerprint   ### optional ###
    |   ├── quality_metrics
    |   │   └── em>sample_fingerPrinting_quality_metrics.txt
    |   └── em>sample_fingerPrinting_plot.pdf
    ├── sample_comparisons_atPeaks
    │   ├── all_peaks_merged_sorted.bed
    │   ├── multiBigWigSummary_matrix_atPeaks.npz
    │   ├── sample_pearson.correlation_heatmap_atPeaks.pdf
    │   ├── sample_spearman.correlation_heatmap_atPeaks.pdf
    │   ├── sample_correlation_PCA.1-2_heatmap_atPeaks.pdf
    │   └── sample_correlation_PCA.2-3_heatmap_atPeaks.pdf
    └── sample_comparisons_wholeGenome
        ├── all_peaks_merged_sorted.bed
        ├── multiBigWigSummary_matrix_atPeaks.npz
        ├── sample_pearson.correlation_heatmap_wholeGenome.pdf
        ├── sample_spearman.correlation_heatmap_wholeGenome.pdf
        ├── sample_correlation_PCA.1-2_heatmap_wholeGenome.pdf
        └── sample_correlation_PCA.2-3_heatmap_wholeGenome.pdf