Skip to content

ucladx/pancan-panel-design

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Design a capture-based DNA sequencing panel for cancer

GOAL: To assess frequently mutated genes and other somatic/germline alterations common to many different types of cancer. This "pan-cancer" analysis can lead to targeted treatment options for patients with biomarkers that are both common or uncommon in their cancer type. Tumor specimens must be FFPE and matched normal DNA is required from patient's blood or saliva.

Overview of required targets:

  • Cancer genes: Coding exons of all major transcripts of ~1k genes
  • Microsatellites: Frequently instable loci in tumor samples with MSI per PCR testing
  • Non-coding loci: UTRs, introns, promoters, etc. with cancer-associated variants or breakpoints
  • Germline SNPs: Commonly heterozygous ~16k loci to measure allelic imbalance and copy-number

Cancer genes

Gather, rename, compare, review, and shortlist genes from 16 sources:

Shortlist 1080 genes for a new panel as follows, and label them as "Y" under column "Selected" in data/cancer_genes_review.txt:

  1. Include all genes from: fmi_cdx_324, fmi_heme_407, tcga_pancan_299, oncokids_170, goal_core_497, and resistance_77.
  2. Include genes seen in at least 2 of the 16 lists.
  3. Include genes seen in at least 1 of the 16 lists, with >26 mutations per Mbp.
  4. Include genes seen in at least 1 of the 16 lists, with >3.15 mean gain (CN>=2).
  5. Include genes seen in at least 1 of the 16 lists, with >1.5 and <1.82 mean loss (CN<=2). Mean loss <1.5 was on sex chromosomes.
  6. Exclude genes PDE4DIP and FAT3 to reduce DNA-seq costs. They are each ~20kbp and of unknown relevance to cancer.
  7. Add OncoKB genes based on their descriptions, mut/CN burdens, and literature review (ARID3A, ARID4B, ATP6AP1, ATP6V1B2, ATXN7, CRBN, CYP19A1, DEK, DKK4, EGR1, ERF, EZH1, EZHIP, FLI1, GAB1, GAB2, KBTBD4, KLF3, KNSTRN, LCK, LRP5, LRP6, LTB, MEF2D, MIDEAS, MIR142, NADK, PGBD5, PPP4R2, PRKD1, PTP4A1, PTPN1, RAC2, ROBO1, RRAS, SAMHD1, SERPINB3, SERPINB4, SESN2, SESN3, SETD1B, SETDB2, SMYD3, SP140, SPRTN, STAG1, STAT1, STAT2, STK19, TCL1B, TET3, WIF1, WWTR1).
  8. Add more cancer genes requested by UCLA colleagues.
    • INO80 - DNA repair, regulates abundance and positioning of nucleosomes, mutated in 4% of DLBCL, algorithmically predicted to be a cancer driver per intogen.org
    • LUC7L2 - Splicing factor, low expression in 14% of MDS patients, del(7q) is common and truncating mutations have been reported
    • MBD4 - Germline loss predisposes to Uveal Melanoma and Leukemia, targeted by DFCI's OncoPanel (aka Profile)
    • SRP72 - Germline loss predisposes to Familial Aplasia and Myelodysplasia, targeted by Mayo's OncoHeme panel
    • FRK - Somatic mutations in 6% of hepatocellular adenomas, and frequent rearrangements

Mutations per Mbp calculated using TCGA+TARGET MuTect2 MAFs from NCI GDC, and gene sizes from Gencode v35. Mean gain/loss calculated using TCGA+TARGET Gistic2 gene-level absolute CN from NCI GDC.

Extract the gene names and their Ensembl ENSG IDs:

cut -f1,2,9 data/cancer_genes_review.txt | grep -w Y$ | cut -f1,2 | sort > data/exon_targets_gene_list.txt

Create a BED file for these genes' coding regions with 2bp flanks, using their Gencode basic transcripts except level 3 (not verified nor curated):

gzip -dc /hot/tracks/gencode.v38.basic.annotation.gff3.gz | grep -w "$(cut -f2 data/exon_targets_gene_list.txt)" | perl -a -F'\t' -ne '%t=map{split("=")} split(";",$F[8]); if(($t{gene_type} eq "protein_coding" and $F[2] eq "CDS" and $t{level} ne "3" and $t{ID}!~m/PAR/) or ($t{gene_type}=~/lncRNA|miRNA|pseudogene/ and $F[2] eq "exon")){$F[3]-=3; $F[4]+=2; print join("\t",@F[0,3,4],$t{gene_name}.":".$F[2],@F[5,6])."\n"}' | sort -s -k1,1V -k2,2n -k3,3n | bedtools merge -i - -c 4 -o distinct > targets/exon_targets_grch38.bed

Source: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.basic.annotation.gff3.gz

Try different combinations of transcripts/exons/annotations in steps above and measure total target space:

awk -F"\t" '{sum+=$3-$2} END {print sum}' targets/exon_targets_grch38.bed

2798232 bps in total; 2805929 if we include level 3 transcripts; 2986850 if we used 8bp flanks; 6117858 if we included all UTRs.

Microsatellites

  • GOAL consortium targets 28 microsatellites including the 7 sites targeted by the Promega PCR kit
  • There will be hundreds more frequently mutable microsatellites in other regions we are targeting

Fetch loci of upstream/downstream probes around microsatellites targeted by GOAL consortium:

curl -sL https://github.com/GoalConsortium/goal_misc/raw/e9966b5/GOAL_GRCh38%2Bviral/Consortium_Probes_All_Final.probes_GRCh38%2Bviral.bed | cut -f-2 -d\| | grep -w MSI | sed -E 's/MSI\|//; s/$/:Microsatellite/' > targets/goal_msi_targets_grch38.bed

Missing downstream probe for MONO-27, NR-24, D18S58 and upstream probe for D10S197, D17S250. But these are short enough to be captured by 1 probe each.

Non-coding loci

  • GOAL consortium targets fusion breakpoints for ALK, BRAF, CD74, EGFR, ETV6, FGFR1, FGFR2, FGFR3, MET, NTRK1, NTRK2, PAX8, RAF1, RET, ROS1, RSPO3, TFE3, TFEB

Fetch loci of GOAL consortium probes targeting fusion breakpoints and merge probes <=60bp (half a probe) apart:

curl -sL https://github.com/GoalConsortium/goal_misc/raw/e9966b5/GOAL_GRCh38%2Bviral/Consortium_Probes_All_Final.probes_GRCh38%2Bviral.bed | cut -f1 -d\| | grep _Fusion | sed -E 's/_Fusion/:FusionSite/' | bedtools merge -i - -d 60 -c 4 -o distinct > targets/goal_fusion_targets_grch38.bed

For each gene in the panel, find associated GeneHancer clusters with score>325, and then fetch loci of overlapping ENCODE cCREs with score>450:

curl -sL 'https://api.genome.ucsc.edu/getData/track?genome=hg38;track=geneHancerInteractions' | jq -r '.geneHancerInteractions[] | [.geneHancerChrom,.geneHancerStart,.geneHancerEnd,.name,.score,.geneStrand] | @tsv' | perl -a -F'\t' -ne 'BEGIN{%gs=map{chomp; ($_,1)}`cut -f1 data/exon_targets_gene_list.txt`} $F[1]--; ($g)=split("/",$F[3]); print join("\t",@F) if($F[4]>325 && $gs{$g})' | sort -s -k1,1V -k2,2n -k3,3n > data/genehancer_regions_grch38.bed
curl -sL 'https://api.genome.ucsc.edu/getData/track?genome=hg38;track=encodeCcreCombined' | jq -r '.encodeCcreCombined[] | [.chrom,.chromStart,.chromEnd,.ucscLabel,.name,.score,.strand] | @tsv' | perl -a -F'\t' -ne '$F[1]--; print join("\t",@F[0..3,5,6])' | sort -s -k1,1V -k2,2n -k3,3n | bedtools intersect -f 1 -wo -a - -b data/genehancer_regions_grch38.bed | perl -a -F'\t' -ne '($g)=split("/",$F[9]); print join("\t",@F[0..2],"$g:$F[3]",@F[4,5])."\n" if($F[4]>450)' > data/encode_ccre_grch38.bed

The ENCODE cCREs cover ~800Kbp which is too large. Reduce this to ~5Kbp by targeting only cCREs associated with APC, FOXA1, PMS2, PTEN, and TERT:

perl -a -F'\t' -pe '($g,$t)=split(":",$F[3]); $_="" unless($g=~m/^(APC|FOXA1|PMS2|PTEN|TERT)$/)' data/encode_ccre_grch38.bed > targets/non_coding_targets_grch38.bed

For each gene in the panel, target the ends of 5'UTRs of MANE transcripts where mutations could cause loss of function:

gzip -dc /hot/tracks/gencode.v38.basic.annotation.gff3.gz | grep -w "$(cut -f2 data/exon_targets_gene_list.txt)" | perl -a -F'\t' -ne '%t=map{split("=")} split(";",$F[8]); if($t{gene_type} eq "protein_coding" and $F[2] eq "five_prime_UTR" and $t{tag}=~/MANE/ and $t{ID}!~m/PAR/){print join("\t",$F[0],$F[3]-2,($F[3]+118<$F[4]?$F[3]+118:$F[4]),$t{gene_name}.":5pUTR")."\n".join("\t",$F[0],($F[4]-118>$F[3]?$F[4]-118:$F[3]),$F[4]+2,$t{gene_name}.":5pUTR")."\n"}' | sort -s -k1,1V -k2,2n -k3,3n | bedtools merge -i - -c 4 -o distinct >> targets/non_coding_targets_grch38.bed
sort -su -k1,1V -k2,2n -k3,3n targets/non_coding_targets_grch38.bed -o targets/non_coding_targets_grch38.bed

Fetch loci of Pathogenic and Likely Pathogenic (P/LP) mutations with decent evidence from ClinVar related to cancer:

curl -sL 'https://api.genome.ucsc.edu/getData/track?genome=hg38;track=clinvarMain' | jq -r '.clinvarMain[] | [.chrom,.chromStart,.chromEnd,._variantId,._clinSignCode,.reviewStatus,.phenotypeList] | @tsv' | perl -a -F'\t' -ne 'print if($F[4]=~m/^P|LP$/ && $F[5]=~m/practice guideline|expert panel|multiple submitters/i) && $F[6]=~m/cancer|lynch|neoplas|tumor|adenoma|carcinoma|li-fraumeni|polyposis|hippel-lindau/i' | cut -f1-4 | sort -su -k1,1V -k2,2n -k3,3n > data/clinvar_plp_muts_grch38.bed

Target the subset of P/LP ClinVar mutations that do not overlap targeted exons:

bedtools subtract -a data/clinvar_plp_muts_grch38.bed -b targets/exon_targets_grch38.bed | sed 's/$/:ClinVar/' | sort -su -k1,1V -k2,2n -k3,3n >> targets/non_coding_targets_grch38.bed
sort -s -k1,1V -k2,2n -k3,3n targets/non_coding_targets_grch38.bed -o targets/non_coding_targets_grch38.bed

Manually added the following into targets/non_coding_targets_grch38.bed:

  • Breakpoints of MSH2 inversion (PMID: 18335504, 12203789, 24114314)
  • Breakpoints of PMS2 retrotransposon insertion and intronic regions homologous to PMS2CL pseudogene (PMID: 22461402, 29792936)
  • Breakpoints of 40Kbp duplication between GREM1 and SCG5 (PMID: 22561515, 26493165)
  • Pathogenic germline variants near cancer susceptibility genes (ClinVar, April 2020)

Downloaded TSV-format catalog data from ebi.ac.uk/gwas for these genome-wide association studies listing GRCh38 genomic loci per trait-associated SNP:

  • 265 SNP loci for Prostate cancer risk prediction (PMID: 33398198)
  • 461 SNP loci for Pan-cancer risk prediction (PMID: 32887889)

Manually converted into targets/gwas_targets_grch38.bed with 711 unique loci. Data provenance not possible, and parsing out loci was messy and needed regexes. So, cannot document steps for reproducibility.

Germline SNPs

  • Ensure frequent heterozygosity across each human subpopulation, to ensure uniform sensitivity of allelic imbalance (AI)
  • Usable to detect loss of heterozygosity (LOH), copy-number aberration (CNA), homologous recombination deficiency (HRD)
  • SNPs per gene for gene-level AI/CNA, evenly distributed across genome for arm-level CNA, at telomeres to detect telomeric AI
  • Usable for QCs steps like fingerprinting, detecting tumor-normal swaps, cross contamination, and identifying contaminant samples

gnomAD v3 lists GRCh38 variants from 71702 genomes (and no exomes). For each variant, they list AC, AN, and nhomalt per subpopulation. Frequency of heterozygosity per subpopulation can be calculated as: (AC - 2*nhomalt) / (AN/2)

Use a script to find ~2m candidate SNPs with a frequency of heterozygotes >25% in all 9 gnomAD v3 subpopulations:

python3 scripts/select_gnomad_snps.py --gnomad-vcf /hot/ref/gnomad/gnomad.genomes.r3.0.sites.vcf.bgz --max-snps 2100000 --output-bed data/snp_candidates_grch38.bed
sort -su -k1,1V -k2,2n -k3,3n data/snp_candidates_grch38.bed -o data/snp_candidates_grch38.bed

Source: https://storage.googleapis.com/gcp-public-data--gnomad/release/3.0/vcf/genomes/gnomad.genomes.r3.0.sites.vcf.bgz

We captured and sequenced ~42k of these SNPs in the v0.9 design using 16 FFPE samples and analyzed them. Per testing, major contributors to off-bait capture are Alu repeats.

Exclude SNPs <90bp from an Alu repeat, or within genomic loci flagged by vendor as contributors to off-bait capture:

curl -sL https://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/rmsk.txt.gz | gzip -dc | perl -a -F'\t' -ne 'print join("\t",@F[5..7],"$F[12]:$F[10]",$F[2],$F[9])."\n"' > data/repeat_masker_grch38.bed
grep Alu: data/repeat_masker_grch38.bed | grep -P "^chr(\d+|X|Y)\t" | bedtools slop -b 90 -g /hot/ref/GRCh38_Verily_v1.genome.fa.fai -i | cut -f1-3 - data/vendor_flagged_probes.bed data/vendor_flagged_targets.bed | sort -su -k1,1V -k2,2n -k3,3n | bedtools subtract -a data/snp_candidates_grch38.bed -b - > data/snp_candidates_good_grch38.bed

Using per-target Picard HsMetrics across 16 samples (MQ>=1), choose SNPs with >0.5 and <1.5 normalized depth in >=12 samples:

grep -hv ^chrom /hot/bams/mdl_cancer_v1.2/*.mq1.hsmetrics_by_target | cut -f1-3,8 | awk -F'\t' '{if($4>0.5 && $4<1.5){c[$1"\t"$2-1"\t"$3]++}} END{for(v in c){if(c[v]>=12){print v}}}' | sort -su -k1,1V -k2,2n -k3,3n | bedtools window -w 60 -a - -b data/snp_candidates_good_grch38.bed | cut -f4- | sort -su -k1,1V -k2,2n -k3,3n > data/snp_candidates_good_grch38.bed_
mv -f data/snp_candidates_good_grch38.bed_ data/snp_candidates_good_grch38.bed

Target 2474 SNPs within exon targets or <240bp near them, which gives 664 genes at least 1 SNP likely to help detect LOH (don't choose >1 SNP <200bp apart):

bedtools window -w 240 -a targets/exon_targets_grch38.bed -b data/snp_candidates_good_grch38.bed | cut -f5-8 | sort -su -k1,1V -k2,2n -k3,3n | bedtools spacing -i - | awk -F'\t' '{if($5>=200) print}' | cut -f1-4 | sed 's/$/:SNP_LOH/' > targets/snp_targets_grch38.bed

Add 13526 more SNPs (16k total, sufficient for HRD) that are most distant from their nearest SNPs:

grep SNP_LOH$ targets/snp_targets_grch38.bed | bedtools slop -b 200 -g /hot/ref/GRCh38_Verily_v1.genome.fa.fai -i | bedtools subtract -a data/snp_candidates_good_grch38.bed -b - | bedtools spacing -i - | sort -k7,7rn | head -n13526 | cut -f1-4 | sed 's/$/:SNP_CNV/' >> targets/snp_targets_grch38.bed
sort -s -k1,1V -k2,2n -k3,3n targets/snp_targets_grch38.bed -o targets/snp_targets_grch38.bed

Probe design

Combined all targets into a single merged BED file:

cat targets/*_targets_grch38.bed | sort -s -k1,1V -k2,2n -k3,3n > data/ucla_mdl_targets_grch38.bed

Estimated how many probes will be needed for 1x tiling (targets <=120bp get one 120bp probe each, others get total bps ÷ 120):

bedtools merge -i data/ucla_mdl_targets_grch38.bed -c 4 -o distinct | awk -F"\t" '{len=$3-$2; sum+=(len<120?120:len)} END {print sum/120}'

At this point, we sent our targets to the custom panel vendor in BED format. They ran bioinformatics tools to design the tiling and content of 120bp probes across our targets. They also flagged tricky targeted regions that are most likely to cause off-bait capture (usually homology with common genomic repeats), and sent us data/vendor_flagged_targets.bed. Most of these are intergenic SNPs overlapping Alu repeats that we either excluded, or we shifted the probes slightly to avoid overlap with Alu repeats. We reviewed the remaining targets based on importance.

Review non-SNP targets with partial or no coverage by vendor's probe design:

echo -e "Region\tLabels\tLength\tSkipped_Length\tFraction_Skipped\tReason_to_Keep" > data/ucla_mdl_tricky_targets_grch38.txt
bedtools intersect -wo -a data/ucla_mdl_targets_grch38.bed -b data/vendor_flagged_targets.bed | perl -ane '$l=$F[2]-$F[1]; $s=$F[6]-$F[5]; print join("\t","$F[0]:$F[1]-$F[2]",$F[3],$l,$s,$s/$l,"")."\n" unless($F[3]=~m/^(rs\d+|\.)/)' >> data/ucla_mdl_tricky_targets_grch38.txt

Shortlisted 17 important targets in data/ucla_mdl_tricky_targets_grch38.txt we asked vendor to capture those anyway, at the cost of some off-bait reads. Final target/bait loci of manufactured probes are in the repo release as ucla_mdl_cancer_ngs_v1_baits.hg38.bed.gz and ucla_mdl_cancer_ngs_v1_targets.hg38.bed.gz. These are useful for calculating hybrid selection metrics. Note that bait loci are merged and precise tiling/genomic loci of each 120bp bait is not indicated. Our vendor considered this proprietary and shared it with us under an NDA.

Testing

Described below is the testing of the v0.9 panel on FFPE specimens to optimize the wet lab workflow and to identify probes that cause the most off-target capture. This data helped add, remove, or reposition probes in the v1.0 panel that was actually put into service. Version numbers below do not correspond to repo releases.

Captured 2 pools of 8 FFPE samples each and sequenced on a HiSeq 2500 Rapid at 2x100bp. Ran GATK's best-practice for secondary analysis of TN-pairs using Sarek v2.7.1. The 16 BAMs after Picard MarkDuplicates and GATK BQSR are stored at /hot/bams/mdl_cancer_v1/*.bam. Per Picard HsMetrics, overall capture efficiency (FOLD_ENRICHMENT) was poor due to abundance of off-bait reads, mostly from intergenic SNPs. On vendor's recommendation, we tried it again with post-hyb wash temperature increased from 68deg to 70deg. FOLD_ENRICHMENT roughly doubled, though could be better. BAMs for these are stored at /hot/bams/mdl_cancer_v1.2/*.bam. These were sequenced on a NovaSeq 6000 SP at 2x150bp and shared with vendor for further analysis. They ran scripts that find the closest matching probe per off-bait read, and then ranked all probes by percent-contribution to the total number of off-bait reads in data/ucla_mdl_cancer_ngs_v1_ranked_probes.bed.

Create Picard-friendly interval lists for baits/targets, and generate HsMetrics per target requiring MQ>=1:

picard BedToIntervalList --INPUT ucla_mdl_cancer_ngs_v1_baits.hg38.bed --OUTPUT ucla_mdl_cancer_ngs_v1_baits.hg38.ilist --SEQUENCE_DICTIONARY /hot/bams/mdl_cancer_v1.2/NGSPanel_FFPE001.md.bam
picard BedToIntervalList --INPUT ucla_mdl_cancer_ngs_v1_targets.hg38.bed --OUTPUT ucla_mdl_cancer_ngs_v1_targets.hg38.ilist --SEQUENCE_DICTIONARY /hot/bams/mdl_cancer_v1.2/NGSPanel_FFPE001.md.bam
find /hot/bams/mdl_cancer_v1.2 -name "*.bam" | parallel -j16 picard CollectHsMetrics --INPUT {} --OUTPUT {.}.mq1.hsmetrics --PER_TARGET_COVERAGE {.}.mq1.hsmetrics_by_target --TARGET_INTERVALS ucla_mdl_cancer_ngs_v1_targets.hg38.ilist --BAIT_INTERVALS ucla_mdl_cancer_ngs_v1_baits.hg38.ilist --REFERENCE_SEQUENCE /hot/ref/GRCh38_Verily_v1.genome.fa --INCLUDE_INDELS --MINIMUM_MAPPING_QUALITY 1

Find the type of repeats that contribute the most to off-bait reads:

bedtools intersect -wao -f 0.25 -a data/ucla_mdl_cancer_ngs_v1_ranked_probes.bed -b data/repeat_masker_grch38.bed | awk -F"\t" 'OFS="\t" {sum[$8]+=$4} END {for (i in sum) print sum[i],i}' | sort -k1,1rg | less

77% of off-bait reads are from targets overlapping Alu repeats of which 45% are AluS repeats, 23% are AluY, and 7% are AluJ. Another 12% of off-bait reads are from simple 2-mer repeats like (TG)n, (AC)n, (CA)n, and (GT)n.

Create a BED file of the 4479 probes that contributed >0.001% of off-bait reads:

awk '{if($4>0.00001){print}}' data/ucla_mdl_cancer_ngs_v1_ranked_probes.bed > data/vendor_flagged_probes.bed

Of those probes, review and shortlist the ones that provide coverage for important targets:

bedtools intersect -wo -a data/vendor_flagged_probes.bed -b data/ucla_mdl_targets_grch38.bed | perl -a -F'\t' -ne '$F[3]=sprintf("%.5f",$F[3]); print join("\t",@F[0,1,2,7,3])."\n"' | bedtools merge -i - -d -120 -c 4,5 -o distinct | sort -k5,5rg > data/ucla_mdl_tricky_probes_grch38.bed

Shared data/ucla_mdl_tricky_probes_grch38.bed with the vendor to request repositioning these probes to avoid the overlapping repeats that are causing off-bait capture.

Due to paralogous genes, many coding exons will not be genotypable if we require MQ>=1. This can be alleviated by shearing the input DNA into longer fragments aka inserts. Insert sizes for these libraries are somewhat short at around 210bp (lots of overlapping read-pairs with 2x150bp sequencing), but MQ improves significantly for many genes as we get closer to 350bp inserts (PMID: 32523024).

Re-generate HsMetrics per target requiring MQ>=0:

find /hot/bams/mdl_cancer_v1.2 -name "*.bam" | parallel -j16 picard CollectHsMetrics --INPUT {} --OUTPUT {.}.mq0.hsmetrics --PER_TARGET_COVERAGE {.}.mq0.hsmetrics_by_target --TARGET_INTERVALS ucla_mdl_cancer_ngs_v1_targets.hg38.ilist --BAIT_INTERVALS ucla_mdl_cancer_ngs_v1_baits.hg38.ilist --REFERENCE_SEQUENCE /hot/ref/GRCh38_Verily_v1.genome.fa --INCLUDE_INDELS --MINIMUM_MAPPING_QUALITY 0

Shortlist targets with mean min_normalized_coverage <0.3 (too low) or mean max_normalized_coverage >1.9 (too high) across 16 samples (MQ>=0):

grep -hv ^chrom /hot/bams/mdl_cancer_v1.2/*.mq0.hsmetrics_by_target | cut -f1-3,9 | awk -F'\t' 'OFS="\t" {c[$1"\t"$2-1"\t"$3]+=$4} END{for(v in c){m=c[v]/16; if(m<0.3){print v,m}}}' | sort -su -k1,1V -k2,2n -k3,3n | bedtools intersect -wo -a data/ucla_mdl_targets_grch38.bed -b - | cut -f1-4,8 | bedtools merge -i - -c 4,5 -o distinct | sort -k5,5g > data/ucla_mdl_lowcov_targets_grch38.bed
grep -hv ^chrom /hot/bams/mdl_cancer_v1.2/*.mq0.hsmetrics_by_target | cut -f1-3,10 | awk -F'\t' 'OFS="\t" {c[$1"\t"$2-1"\t"$3]+=$4} END{for(v in c){m=c[v]/16; if(m>1.9){print v,m}}}' | sort -su -k1,1V -k2,2n -k3,3n | bedtools intersect -wo -a data/ucla_mdl_targets_grch38.bed -b - | cut -f1-4,8 | bedtools merge -i - -c 4,5 -o distinct | sort -k5,5rg > data/ucla_mdl_2hicov_targets_grch38.bed