Method for generating a master list / Index of DNaseI-Hypersensitive Sites ("consensus DHSs").

All code, implementation details and design by Wouter Meuleman and Eric Rynes. Docker implementation by Jemma Nelson.

How to run within Docker:

This approach is not recommended, because it will process biosamples serially, which will take an enormous amount of time for any real-life dataset.

  1. Build the docker image from this repo, with docker build . --tag=masterlist
  2. Change to the directory that contains your input files
  3. Run docker run --mount type=bind,source="$(pwd)",target=/data masterlist run_sequential.sh my_output_dir chrom_sizes.bed listOfFiles.txt

my_output_dir will be created inside the working directory. chrom_sizes.bed is a standard BED file containing the sizes of the chromosomes. listOfFiles.txt is a plain text file listing the starch-formatted peak files, one per line. Each entry must be a relative path to a file inside the working directory; absolute paths are not supported, due to the way files are mounted into the Docker container.
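Because absolute paths will silently fail in Docker mode, it can be worth checking the list before launching. A minimal sketch (the file names are hypothetical, purely for illustration):

```shell
# Create an illustrative list of peak files, one of which mistakenly
# uses an absolute path.
printf 'sampA.peaks.starch\n/data/elsewhere/sampB.peaks.starch\n' > listOfFiles.txt

# Report any entries that start with '/'; these would fail in Docker mode.
grep -n '^/' listOfFiles.txt || echo "no absolute paths found"
```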

How to run with SLURM:

From any directory, execute

[path to]ML_build_slurm.sh workdir outputID chrom_sizes.bed listOfFiles.txt partitionName memSize

where

  • messages and intermediate files will be written into workdir (it will be created if it doesn't already exist)
  • outputID is an identifier to be used in the output filenames (e.g., containing the date, number of samples, organism, etc.)
  • chrom_sizes.bed is a 0-based 3-column BED file containing the lengths of the relevant chromosomes
  • listOfFiles.txt is a plain text file containing the paths to variable-width peak files, one per biological sample, one path per line
  • partitionName is the name of the SLURM partition on which the jobs will run (--partition=partitionName)
  • memSize is the amount of memory to require for each step, best if tailored to the needs of the collation of peak files (--mem=memSize)

Output files will be written to the current directory. Messages and intermediate files will be written into workdir.
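If you only have a UCSC-style two-column chrom.sizes file (chromosome name, length), it can be converted into the 0-based 3-column BED file expected here with a one-line awk command. A sketch, with illustrative file names and two hg38 chromosome lengths:

```shell
# Build an example two-column chrom.sizes file (hg38 lengths for chr1/chr2).
printf 'chr1\t248956422\nchr2\t242193529\n' > hg38.chrom.sizes

# Convert to a 0-based 3-column BED file: chromosome, 0, length.
awk -v OFS='\t' '{print $1, 0, $2}' hg38.chrom.sizes > chrom_sizes.bed
cat chrom_sizes.bed
```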

Prerequisites:

For each biological sample to be represented in the Index, a variable-width peak file generated by hotspot2 is required. To create an Index, a plain text file containing the locations of these peak files is required, with one peak file per sample and one path per line. When using Docker to create an Index, all peak files must reside in the same directory as the file containing the list of them, and no path information can be present in the list of files. (Users knowledgeable about Docker mounting can use mounts to loosen these restrictions.) When using SLURM, absolute and relative paths are both allowed in this text file; multiple directories and symbolic links can be used in this case.
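Since a single missing peak file can derail a long build, a quick existence check on the list is cheap insurance. A hedged sketch (the file names are hypothetical):

```shell
# Set up an illustrative list of peak files that all exist.
touch sampA.peaks.starch sampB.peaks.starch
printf 'sampA.peaks.starch\nsampB.peaks.starch\n' > listOfFiles.txt

# Confirm every listed peak file exists before launching a build.
while IFS= read -r f; do
  [ -f "$f" ] || echo "MISSING: $f"
done < listOfFiles.txt
echo "check complete"
```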

All peak files must have been generated using the same FDR threshold. After hotspot2 has generated per-base results for a biosample up to a given FDR threshold (e.g. 5% or 100%), hotspots and peaks at any stricter FDR threshold of interest can be extracted from the results by a pair of scripts contained in the hotspot2 package. The per-base results reside in a file whose name ends in allcalls.starch. To obtain variable-width peaks at a desired FDR threshold, this file is needed, along with the hotspot2 output files containing the mapped cleavages (name ending in cutcounts.starch) and the total number of mapped cleavages (name ending in cleavage.total).

Suppose that each file that hotspot2 created for a certain biological sample has a name beginning with "sampXYZ," that the per-base results were thresholded at FDR 5%, and that you want to create an Index using peaks called at FDR 0.1%. For each sample, you would need to run the following two hotspot2 scripts, the second after the first has completed:

hsmerge.sh -f 0.001 sampXYZ.allcalls.starch sampXYZ.hotspots.fdr0.001.starch

density-peaks.bash tmpdirXYZ varWidth_20_XYZ sampXYZ.cutcounts.starch sampXYZ.hotspots.fdr0.001.starch chromSizes.bed sampXYZ.density.starch sampXYZ.peaks.starch `cat sampXYZ.cleavage.total`

The first script, hsmerge.sh, uses sampXYZ.allcalls.starch to call hotspots at FDR 0.1% and write them to sampXYZ.hotspots.fdr0.001.starch. The second script, density-peaks.bash, uses the newly created FDR 0.1% hotspot file, together with the pre-existing files sampXYZ.cutcounts.starch and sampXYZ.cleavage.total, to create the desired FDR 0.1% peak file sampXYZ.peaks.starch. (It also creates file sampXYZ.density.starch, which is not needed for Index creation.)

A 0-based, 3-column BED file containing the lengths of the relevant chromosomes, named chromSizes.bed in the above command, is needed to create each peak file, and is also needed to create an Index. tmpdirXYZ in the second command is the name/location of a directory where temporary files will be stored. (Important: if density-peaks.bash is run multiple times simultaneously to process samples in parallel, a distinct temporary directory must be specified for each run/sample.)

The argument varWidth_20_XYZ consists of 3 parts, separated by underscores: "varWidth", which is required to specify the creation of variable-width peaks; an integer specifying the minimum width in bp that a variable-width peak must have to be reported (default = 20); and a unique identifier (ID) for the biological sample ("XYZ" in the above example). The ID will be written into column 4 of every line of the peaks file, and is used to define DHSs in the Index and to identify which biological samples contribute to each entry in the Index; it may contain underscores if desired.
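When many samples must be processed, the two commands above are naturally generated in a loop, with a distinct temporary directory per sample so the runs can safely execute in parallel. A dry-run sketch that only prints the command pairs (the sample IDs are hypothetical; pipe the output to a shell or a job scheduler to actually run them):

```shell
# Emit the hsmerge.sh / density-peaks.bash command pair for each sample,
# giving each sample its own temporary directory (tmpdir${ID}).
for ID in XYZ ABC; do
  echo "hsmerge.sh -f 0.001 samp${ID}.allcalls.starch samp${ID}.hotspots.fdr0.001.starch"
  echo "density-peaks.bash tmpdir${ID} varWidth_20_${ID} samp${ID}.cutcounts.starch samp${ID}.hotspots.fdr0.001.starch chromSizes.bed samp${ID}.density.starch samp${ID}.peaks.starch \$(cat samp${ID}.cleavage.total)"
done > cmds.txt
cat cmds.txt
```

Note that density-peaks.bash must not start for a sample until hsmerge.sh has finished for that same sample.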

Output

As a basis, we generate a full set of DHSs that may contain partially overlapping elements (_all_chunkIDs.bed). This is the version we recommend using for downstream annotations and analyses. However, as an alternative, we also provide two additional versions in which overlaps were either resolved fully (_nonovl_any_chunkIDs.bed), or only for the central "peak summit zones" (_nonovl_core_chunkIDs.bed).

The columns in these files contain the following information:

  • column 1: chromosome name
  • column 2: start coordinate of the element (a.k.a. DHS); all coordinates are 0-based
  • column 3: end coordinate
  • column 4: DHS name/identifier
  • column 5: sum of normalized densities at the summit of the DHS
  • column 6: number of samples that contributed a peak to the DHS
  • column 7: number of distinct peaks that contributed to the DHS (col7 usually equals col6, but sometimes multiple nearby peaks within one sample contribute to the delineation of the DHS)
  • column 8: width of the DHS (equals col3 minus col2)
  • column 9: coordinate of the summit of the DHS
  • column 10: left boundary of the "core" of the DHS (summit coordinate minus dispersion; dispersion = median absolute deviation)
  • column 11: right boundary of the "core" of the DHS (summit plus dispersion)
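The column layout above makes the Index easy to filter with standard tools; for instance, column 6 can be used to keep only DHSs supported by a minimum number of samples. A sketch using awk (the two 11-column rows are fabricated purely to illustrate the layout):

```shell
# Build a tiny fabricated Index file with the 11-column layout described above.
printf 'chr1\t100\t300\tDHS1\t5.2\t4\t4\t200\t180\t160\t200\n'  > index.bed
printf 'chr1\t500\t620\tDHS2\t1.1\t1\t1\t120\t560\t540\t580\n' >> index.bed

# Keep only DHSs supported by at least 3 samples (column 6).
awk -F'\t' '$6 >= 3' index.bed
```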

For each of these 3 files, a 12-column version (.bed12) and a bigBed version (.bb) are also created. The latter can be viewed in the UCSC Genome Browser; instructions for doing so are provided in the browser documentation. A (sub)directory will be filled with messages and intermediate files; it will be created if it doesn't already exist. If any output files are incomplete or are not created, check the errorMessages and R_Messages subdirectories of the output directory for error messages that can help identify the underlying cause.

Example using provided data

In subdirectory data of this repository, we provide variable-width peak files for 3 biological samples, and a text file containing their names (fileOf3PeakFiles.txt). These files can be used to produce an example Index. We also provide file chromSizes.hg38.bed3, which contains chromosome lengths in hg38 coordinates, in this subdirectory.

To create the example Index using SLURM:

  • cd into data
  • execute: ../ML_build_slurm.sh mySubdir MyResults chromSizes.hg38.bed3 fileOf3PeakFiles.txt <your computing cluster name> 100M

To create the example Index using Docker:

  • (build the Docker image, as instructed above in section "How to run within Docker")
  • cd into data
  • execute: docker run --mount type=bind,source="$(pwd)",target=/data masterlist run_sequential.sh mySubdir chromSizes.hg38.bed3 fileOf3PeakFiles.txt MyResults

Either approach will create the following 3 files, each containing a "flavor" of an Index as described in section "Output," in directory data:

  • masterlist_DHSs_MyResults_all_chunkIDs.bed: an Index in which a small percentage of the DHSs overlap one another
  • masterlist_DHSs_MyResults_nonovl_any_chunkIDs.bed: same as previous, but with overlaps removed
  • masterlist_DHSs_MyResults_nonovl_core_chunkIDs.bed: same as the first, but with overlaps resolved only within the central "peak summit zones" (see section "Output")

For each of these 3 files, a 12-column version (.bed12) and bigBed version (.bb) are also created. The latter can be viewed in the UCSC Genome Browser; instructions for doing so are provided in the browser documentation. Subdirectory mySubdir will also be created, and filled with subdirectories and many intermediate files. If any "masterlist" files are incomplete or are not created, check the errorMessages and R_Messages subdirectories of mySubdir for error messages that can help identify the underlying cause.

Files of interest:

  • chunk_bed.awk: script for partitioning the peaks into genomic islands or "chunks" that can be processed in parallel
  • code_ML.R: common routines
  • code_build.R: code used for converting a genomic chunk of peak calls into tentative DHSs
  • code_overlap.R: code used to detect and resolve overlapping elements, if so desired
  • code_gen_masterlist.sh: code used to concatenate the output of all chunks and generate browser tracks
  • ML_build_slurm.sh: SLURM submission script that executes each of the above in sequence
  • run_sequential.sh: script for executing each of the above in sequence using Docker