Merge pull request #422 from ComparativeGenomicsToolkit/graphmap-split
update hal2vg
glennhickey authored Feb 12, 2021
2 parents 91154bd + 0aaba1e commit dec2b45
Showing 4 changed files with 22 additions and 31 deletions.
2 changes: 1 addition & 1 deletion ReleaseNotes.md
@@ -1,4 +1,4 @@
# Release 2.0.0 ???
# Release 1.3.0 2021-02-11

This release introduces the [Cactus Pangenome Pipeline](https://github.com/ComparativeGenomicsToolkit/cactus/blob/master/doc/pangenome.md), which can be used to align together samples from the same species in order to create a pangenome graph:
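
At a high level, the pipeline documented in `doc/pangenome.md` below chains a minigraph mapping step with a Cactus alignment step. A simplified sketch on the bundled primate example (flags and the `primates.gfa.fa` output name are illustrative; see the full commands later in this commit):

```
# 1) map each assembly to the minigraph GFA, producing a PAF of pairwise alignments
cactus-graphmap ./jobstore ./examples/evolverPrimates.txt primates.gfa primates.paf --outputFasta primates.gfa.fa
# 2) build the Cactus alignment from the seqfile + PAF and export a pangenome graph in vg format
cactus-align ./jobstore ./examples/evolverPrimates.txt primates.paf primates.hal --pangenome --pafInput --outVG
```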

6 changes: 3 additions & 3 deletions build-tools/downloadPangenomeTools
@@ -165,7 +165,7 @@ else
fi

# hal2vg
wget -q https://github.com/ComparativeGenomicsToolkit/hal2vg/releases/download/v1.0.6/hal2vg
wget -q https://github.com/ComparativeGenomicsToolkit/hal2vg/releases/download/v1.0.7/hal2vg
chmod +x hal2vg
if [[ $STATIC_CHECK -ne 1 || $(ldd hal2vg | grep so | wc -l) -eq 0 ]]
then
@@ -174,7 +174,7 @@ else
exit 1
fi
# clip-vg
wget -q https://github.com/ComparativeGenomicsToolkit/hal2vg/releases/download/v1.0.6/clip-vg
wget -q https://github.com/ComparativeGenomicsToolkit/hal2vg/releases/download/v1.0.7/clip-vg
chmod +x clip-vg
if [[ $STATIC_CHECK -ne 1 || $(ldd clip-vg | grep so | wc -l) -eq 0 ]]
then
@@ -183,7 +183,7 @@ else
exit 1
fi
# halRemoveDupes
wget -q https://github.com/ComparativeGenomicsToolkit/hal2vg/releases/download/v1.0.6/halRemoveDupes
wget -q https://github.com/ComparativeGenomicsToolkit/hal2vg/releases/download/v1.0.7/halRemoveDupes
chmod +x halRemoveDupes
if [[ $STATIC_CHECK -ne 1 || $(ldd halRemoveDupes | grep so | wc -l) -eq 0 ]]
then
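
The `ldd | grep so | wc -l` test in this script verifies that each downloaded binary has no shared-library dependencies when a static build is expected. A standalone illustration of the same check (the binary name and messages are placeholders, not part of this script):

```
# illustrative only: confirm a downloaded binary such as hal2vg is statically linked
BIN=hal2vg
if [[ $(ldd "$BIN" 2>/dev/null | grep -c '\.so') -eq 0 ]]
then
    echo "$BIN has no shared-library dependencies (static build)"
else
    echo "ERROR: $BIN is dynamically linked" >&2
    exit 1
fi
```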
43 changes: 17 additions & 26 deletions doc/pangenome.md
@@ -47,7 +47,7 @@ cactus-graphmap ./jobstore ./examples/evolverPrimates.txt primates.gfa primates.

Create the cactus alignment from the seqfile and PAF, and export the output to vg.
```
cactus-align ./jobstore ./examples/evolverPrimates.txt primates.paf primates.hal --pangenome --pafInput --realTimeLogging --outVG
cactus-align ./jobstore ./examples/evolverPrimates.txt primates.paf primates.hal --pangenome --pafInput --realTimeLogging --outVG --acyclic simChimp
```
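
To sanity-check the exported graph, the vg toolkit can report basic size statistics. A quick look, assuming the run above wrote its vg output as `primates.vg` in the working directory (the exact filename is an assumption):

```
# node and edge counts of the exported pangenome graph
vg stats -z primates.vg
```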

## HPRC Chromosome-by-Chromosome Graph Construction
@@ -65,40 +65,34 @@ Make a node on AWS with `toil launch-cluster --leaderNodeType t2.medium --leader

### Preprocessing (about 7 hours)

minigraph is unable to map within centromeres. While `cactus-bar` can fill in alignments within these regions to some extent, we currently recommend removing them instead. This substantially reduces complexity in the resulting pangenome, as well as lowers the probability that assembly or alignment errors get passed off as real variation. `cactus-preprocess` can be used to remove centromeres as follows.
In order to limit spurious alignments and graph complexity, it is recommended to mask difficult regions out from the start. Out of an abundance of caution, we use two methods to detect these regions. First, align the sequences to the minigraph to create a PAF:

Make a `config.xml` file where `cpu` attribute of the `dna-brnn` `preprocessorJob` is increased from `2` to `16` (it takes a little under 2 hours to mask a genome with 16 cores). Also change the `active` attribute of the `checkUniqueHeaders` job from `1` to `0` (to save time). (leaving everything else default is fine)

Make a seqfile for the output fastas in `./masked`
```
cactus-prepare ./seqfile.txt --outDir ./masked > /dev/null
# edit masked/seqfile.txt to remove Anc0, the tree, and verify paths (and put them in S3!!)
cactus-graphmap <jobstore> seqfile.txt <minigraph GFA> s3://<bucket>/GRCh38-freeze1-orig.paf --realTimeLogging --outputFasta s3://<bucket>/GRCh38-freeze1.gfa.fa --logFile graphmap-1.log --batchSystem mesos --provisioner aws --nodeTypes r3.8xlarge:0.7 --maxNodes 20 --defaultPreemptable --betaInertia 0 --targetTime 1
```

Run the preprocessor, clipping out brnn-masked regions > 100kb (except on reference). (bed files of what was clipped also written).
Next, we combine the coverage gaps from the PAF with the dna-brnn predictions to produce the final masking, using a 100kb length threshold (the default in the cactus config). Note that we need to create a `seqfile.masked.txt` file specifying the locations of the masked fasta sequences.

```
cactus-preprocess <jobstore> ./seqfile.txt ./masked/seqfile.txt --clipAlpha 100000 --configFile ./config.xml --ignore hg38 --realTimeLogging --logFile preprocess.log --batchSystem mesos --provisioner aws --defaultPreemptable --nodeTypes r3.8xlarge:0.7 --maxNodes 20
cactus-preprocess <jobstore> seqfile.txt seqfile.masked.txt --realTimeLogging --logFile preprocess.log --batchSystem mesos --provisioner aws --nodeTypes r3.8xlarge:0.7 --maxNodes 25 --defaultPreemptable --betaInertia 0 --targetTime 1 --maskPAF s3://<bucket>/GRCh38-freeze1-orig.paf --maskAlpha --brnnCores 8
```
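
A seqfile is a plain text table of sample names and FASTA locations (an optional Newick tree on the first line is not needed for the pangenome pipeline); `seqfile.masked.txt` points the same sample names at the paths where the masked FASTAs should live. A hypothetical sketch with made-up sample names and bucket paths:

```
hg38        s3://<bucket>/masked/hg38.fa
HG002.mat   s3://<bucket>/masked/HG002.mat.fa
HG002.pat   s3://<bucket>/masked/HG002.pat.fa
HG00733.mat s3://<bucket>/masked/HG00733.mat.fa
```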

Note: To mask the centromeres and leave them as unaligned bubbles in the graph (instead of clipping them out), do the following:
* Pass `--maskAlpha 100000` instead of `--clipAlpha 100000` to `cactus-preprocess` above
* Pass `--maskFilter 100000` to `cactus-graphmap` below to tell `minigraph` to ignore these regions
* Pass `--barMaskFilter 100000` to `cactus-align` below to tell `cactus-bar` to ignore these regions
Note that instead of softmasking sequences and leaving them unaligned, they can be clipped out entirely using `--clipAlpha` instead of `--maskAlpha` above.
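
Roughly, the clipping variant is the same `cactus-preprocess` call with the flag swapped (AWS batch-system options omitted here for brevity):

```
cactus-preprocess <jobstore> seqfile.txt seqfile.masked.txt --realTimeLogging --logFile preprocess.log --maskPAF s3://<bucket>/GRCh38-freeze1-orig.paf --clipAlpha --brnnCores 8
```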

### Map contigs to minigraph (about 3 hours)
### Map contigs to minigraph (about 2 hours)

Use `cactus-graphmap` to do the mapping. It will produce a PAF file of pairwise alignments for cactus to build on. It will also edit the input seqfile in order to add the minigraph fasta (specfied with `--outputFasta`). Note that `--refFromGFA hg38` is used in order to not map hg38 and instead pull its alignments from the rGFA tags.
Use `cactus-graphmap` to do the mapping. It will produce a PAF file of pairwise alignments for cactus to build on. This run will be much faster than the one above, as it ignores the masked sequences. A future version of Cactus will clean up the interface to make this second call unnecessary!

```
cactus-graphmap <jobstore> `masked/seqfile.txt` <minigraph GFA> s3://<bucket>/GRCh38-freeze1.paf --realTimeLogging --logFile graphmap.log --batchSystem mesos --provisioner aws --defaultPreemptable --nodeTypes r3.8xlarge:0.7 --maxNodes 20 --outputFasta s3://<bucket>/GRCh38-freeze1.gfa.fa --refFromGFA hg38
cactus-graphmap <jobstore> seqfile.masked.txt <minigraph GFA> s3://<bucket>/GRCh38-freeze1.paf --maskFilter 100000 --realTimeLogging --logFile graphmap2.log --batchSystem mesos --provisioner aws --defaultPreemptable --nodeTypes r3.8xlarge:0.7 --maxNodes 20 --outputFasta s3://<bucket>/GRCh38-freeze1.gfa.fa
```
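
PAF is a simple tab-separated text format, so mapping coverage can be spot-checked directly after copying the file down from S3 (illustrative only, not part of the pipeline; columns 1, 3 and 4 are query name, start and end):

```
aws s3 cp s3://<bucket>/GRCh38-freeze1.paf .
# total aligned query bases per contig, largest first
awk '{aln[$1] += $4 - $3} END {for (q in aln) print aln[q], q}' GRCh38-freeze1.paf | sort -rn | head
```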

### Split the sequences by reference chromosome (about 1 hour)
### Split the sequences by reference chromosome (about 2 hours)

Cactus hasn't yet been tested on the entire input set at once. In the meantime, we split into chromosomes and make a graph for each. The GFA results can be simply catted to make a single pangenome graph. All output will be written in bucket specified by `--outDir`. A separate cactus project will be written for each chromosome (`--refContigs`). The decomposition is determined from mapping: an input contig maps to the reference chromosome to which it has the most alignment in the PAF. Input contigs that cannot be assigned confidently to a reference chromosome will be flagged as `_AMBIGUOUS_`. It's currently important to use `--reference hg38` in order for it not to be filtered out with ambiguity filter.
Cactus hasn't yet been tested on the entire input set at once. In the meantime, we split the input into chromosomes and make a graph for each. All output will be written to the bucket specified by `--outDir`. A separate cactus project will be written for each chromosome (`--refContigs`). The decomposition is determined from the mapping: an input contig is assigned to the reference chromosome to which it has the most alignment in the PAF. Input contigs that cannot be assigned confidently to a reference chromosome are flagged as `_AMBIGUOUS_`. It's currently important to use `--reference hg38` so that hg38 itself is not removed by the ambiguity filter.

```
cactus-graphmap-split <jobstore> `masked/seqfile.txt` <minigraph GFA> s3://<bucket>/GRCh38-freeze1.paf --refContigs $(for i in $(seq 1 22; echo X; echo Y; echo M); do echo chr$i; done) --reference hg38 --outDir s3://<bucket>/chroms --realTimeLogging --logFile graphmap-split.log --batchSystem mesos --provisioner aws --defaultPreemptable --nodeTypes r3.8xlarge:0.7 --maxNodes 20
cactus-graphmap-split <jobstore> seqfile.masked.txt <minigraph GFA> s3://<bucket>/GRCh38-freeze1.paf --refContigs $(for i in $(seq 1 22; echo X; echo Y; echo M); do echo chr$i; done) --reference hg38 --maskFilter 100000 --outDir s3://<bucket>/chroms --realTimeLogging --logFile graphmap-split.log --batchSystem mesos --provisioner aws --defaultPreemptable --betaInertia 0 --targetTime 1 --nodeTypes r3.8xlarge:0.7 --maxNodes 20
```
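
The `--refContigs` argument above is just a shell expansion of the human chromosome names, which can be checked on its own:

```
echo $(for i in $(seq 1 22; echo X; echo Y; echo M); do echo chr$i; done)
# chr1 chr2 ... chr22 chrX chrY chrM
```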

### Make a pangenome graph for each chromosome (about 20 hours)
@@ -113,17 +107,14 @@ aws s3 cp s3://<bucket>/chroms/chromfile.txt ./
Use `cactus-align-batch` to align each chromosome on its own aws instance. The `--alignCoresOverrides` option is used to ensure that the larger chromosomes get run on bigger instances.

```
cactus-align-batch ./chromfile.txt s3://<bucket>/align-batch --alignCores 32 --alignCoresOverrides chr1,64 chr2,64, chr3,64 chr4,64, chr5,64 --alignOptions "--pafInput --pangenome --outVG --realTimeLogging" --batchSystem mesos --provisioner aws --defaultPreemptable --nodeTypes r4.16xlarge:1.7,r4.8xlarge:0.7 --nodeStorage 1000 --maxNodes 5,20 --betaInertia 0 --targetTime 1 --logFile align-batch.log --realTimeLogging
cactus-align-batch ./chromfile.txt s3://<bucket>/align-batch --alignCores 32 --alignCoresOverrides chr1,64 chr2,64 chr3,64 chr4,64 chr5,64 --alignOptions "--pafInput --pangenome --outVG --barMaskFilter 100000 --realTimeLogging" --batchSystem mesos --provisioner aws --defaultPreemptable --nodeTypes r4.16xlarge:1.7,r4.8xlarge:0.7 --nodeStorage 1000 --maxNodes 5,20 --betaInertia 0 --targetTime 1 --logFile align-batch.log --realTimeLogging
```

### Combining the output into a single graph

It is important to first make sure that the chromosome graphs have unique ids:
```
vg ids -j $(for i in $(seq 1 22; echo X; echo Y; echo M); do echo chr${i}.vg; done)
```

The can then be merged with:
The chromosome graphs can then be merged with:
```
vg combine $(for i in $(seq 1 22; echo X; echo Y; echo M); do echo chr${i}.vg; done) > GRCh38-freeze1.cactus.vg
```
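
If a GFA version of the merged graph is wanted for downstream tools, it can be exported afterwards (assumes vg is installed locally; not part of this commit):

```
vg view GRCh38-freeze1.cactus.vg > GRCh38-freeze1.cactus.gfa
```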


2 changes: 1 addition & 1 deletion src/cactus/shared/common.py
@@ -892,7 +892,7 @@ def getDockerImage():

def getDockerRelease(gpu=False):
"""Get the most recent docker release."""
r = "quay.io/comparative-genomics-toolkit/cactus:v1.2.3"
r = "quay.io/comparative-genomics-toolkit/cactus:v1.3.0"
if gpu:
r += "-gpu"
return r
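
The string returned by `getDockerRelease()` is a fully qualified image tag, so it can be used directly with docker. For instance, pulling the updated release image (illustrative shell usage, not part of this file):

```
docker pull quay.io/comparative-genomics-toolkit/cactus:v1.3.0
# the gpu=True variant resolves to the same tag with a "-gpu" suffix
docker pull quay.io/comparative-genomics-toolkit/cactus:v1.3.0-gpu
```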
