-
Notifications
You must be signed in to change notification settings - Fork 21
3. Phasing assessment with hap mers
When parental read sets are available, we can get the parental specific mers and overlay on the child’s read set. (See build hap-mer dbs for trios for more details).
Here, we use the parental assemblies of COL and CVI strain to get the parental specific k-mers, and generated the hap-mers by intersecting with the F1’s read set. Download and untar the pre-built inherited meryl dbs: col0.hapmer.meryl.tar and cvi0.hapmer.meryl.tar.
The hap-mer painted on the F1’s read set, along with the cut-offs for the solid k-mers are as follows. The stacked version gives an overall view, unstacked line and filled versions allows visually confirm the hap-mer distribution follows binomial distribution. A properly phased assembly is expected to contain each hap-mer in the corresponding haplotype assembly.
The blob plots provides a quick glance on the overall phasing across an assembly. Each dot (circle) represents a sequence, a contig or scaffold, with its size relative to the sequence length. Each sequence is colored by assembly. The x and y axis are the number of hap-mers found.
A more phased sequence will have less hap-mers found from the other haplotype, so the closer the circles are aligned along the axis, the better the phasing performance is.
- Note: There will be an option available soon to transform the axis 45 degrees, to less overlap the axis.
Hap-mer completeness is computed similar to the k-mer completeness, but this time, per each hap-mer set. The hap-mer itself is already filtered, so we can use it directly as solid k-mers.
hap-mer completeness = found hap-mers in an assembly / total hap-mers
Similar to the overall k-mer completeness, this time, the second column is the hap-mer set used for evaluation.
col col0.hapmer 18316089 18408825 99.4962
cvi col0.hapmer 200302 18408825 1.08808
both col0.hapmer 18344405 18408825 99.6501
col cvi0.hapmer 132496 18264244 0.725439
cvi cvi0.hapmer 18148554 18264244 99.3666
both cvi0.hapmer 18161738 18264244 99.4388
Here, we see the col assembly is capturing 99.5% of the col hap-mers, with 1% of cvi hap-mers which are likely switch errors or base pair errors.
Depending on the de novo phasing assembly strategy, these could be useful to measure if the assembly achieved complete phasing per haplotypes.
both
is showing the stats for both assemblies combined, which could be used as a measure for overall haplotype completeness.
Associated spectra-cn plots are generated per assembly and per hap-mer.
Here are the examples of the col assembly with col0 and cvi0 hapmers.
The col assembly shows col0 hap-mers in the expected copy number range, with almost no hap-mer missing. On the other hand, we have almost no cvi0 hap-mers found, regardless of the copy number.
See example/triocanu_clr...spectra-cn.*.png for more details.
The phased blocks are defined by hap-mers found in the assembly. By using the hap-mers as markers, we can define a phased block where we see more than 2 markers from the same haplotype. Assemblies usually come out with local switches, introduced by consensus errors, and we want to tolerate those short-range switches when calculating switch errors within a phased block.
Short-range switches are defined as consecutive hap-mers from the other haplotype, in a defined length of sequence. By default, merqury is set to
-
num_switch
: 100 -
short_range
: 20,000
which allows at most 100 marker switches, bounded in 20kb within a phased-block. If there are more markers, or the markers observed are stretching over 20kb, a long-range switch is occurring, switching the haplotype block.
In short, switch
=100 and range
=20000 allows 0.5% of switches in ~20kb window.
The output summary statistics is saved as *.phased_block.stats
.
head *.100_20000.phased_block.stats
==> triocanu_clr.col.100_20000.phased_block.stats <==
triocanu_clr.col.100_20000 354 123,190,467 19 347,996 4,222,794 12,619,854 56418 19792291 0.28505%
==> triocanu_clr.cvi.100_20000.phased_block.stats <==
triocanu_clr.cvi.100_20000 309 122,007,476 24 394,846 5,524,119 10,997,312 49891 19682578 0.253478%
Each column is:
- Assembly name with short-range switch parameters
- Number of blocks
- Total bases in blocks (Block sum)
- Smallest block size
- Avg. block size
- Block N50 size
- Longest block size
- Number of markers from the other haplotype
- Total number of markers in blocks
- Switch error rate
Blocks are visualized in the N* plots, ordered by its’ size along the block sum coverage. The blocks are colored according to the haplotype. The trio binned assemblies have small blocks of switches to the other haplotype, which indicates the small fraction of reads being mis-binned. (We expect this will become smaller when the error rate in the long reads go down.)
We can compare the phased blocks vs. contig (and scaffolds when provided) N* plots, to see how relatively a contig (or scaffold) is phased.
When using the stats for publication, make sure to report both num_switch and short_range along with the phased block N50 and switch error rate. The larger the num_switch is defined, we tolerate higher short-range switch error rate, thus get longer phased block stats. However, short range switches may be critical for some analysis. In those cases, to get more reliable phased blocks, use smaller num_switches.
For example, to get more stringent phased blocks allowing only 10 switches in a raw, run:
sh $MERQURY/trio/switch_error.sh triocanu_clr.col.sort.bed triocanu_clr.col 10 20000
sh $MERQURY/trio/switch_error.sh triocanu_clr.cvi.sort.bed triocanu_clr.cvi 10 20000
sh $MERQURY/trio/block_n_stats.sh col.fasta triocanu_clr.col.10_20000.phased_block.bed cvi.fasta triocanu_clr.cvi.10_20000.phased_block.bed triocanu_clr.10_20000
Here are the col block continuity as for comparison. Left: 100_20000, right: 10_20000.
As we can see, the more stringent we get, the blocks become smaller and local stretches of switches forms its own block.
There are several files generated loadable on genome browsers (IGV) for further investigating haplotype consistency.
- phased_block.bed per assembly and haplotype
- hapmer.bed and hapmer.tdf per assembly and haplotype