-
Notifications
You must be signed in to change notification settings - Fork 1
/
clip_targets.Rmd
111 lines (89 loc) · 4.38 KB
/
clip_targets.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
---
title: "Clip 38 regions to CpH/CpGs"
author: "Kim Siegmund"
date: "10/14/2019"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(GenomicAlignments)
library(tidyverse)
library(writexl)
library(BSgenome.Hsapiens.UCSC.hg19)
```
I want to analyze reads that cover the entire region, but I don't want to require bases that do not appear in my analysis. So, I will clip the intervals to capture the Cs.
# Load annotation information
Load the 38 amplicons & CpG positions.
```{r loadannot}
load("data/cxpos.rda")
load("data/targetgr.rda")
load("data/targetseqs.rda")
```
# Clip the ranges to 1st and last C position
Create and save new CpH and CpG positions to correspond to position in clipped target regions. We do this to require the read cover all Cs in the amplicon.
I'm going to still do this for just the CpGs to know how much wider the intervals will be when saving all Cs, not just CpGs.
```{r cliprange-cgs}
cliptargetgr <- targetgr
firstcg <- map_int(cxpos[["cgpos"]],pluck,1)
lastcg <- map_int(cxpos[["cgpos"]],~pluck(.x, length(.x)))
start(cliptargetgr) <- start(targetgr) - 1 + firstcg
end(cliptargetgr) <- start(targetgr) - 1 + lastcg
cliptargetgr[c(1:2,38)]
```
This is the correct range to capture all Cs for the Watson strand. I'm going to also apply it when measuring the Crick strand and see if it captures a reasonable number of Gs to evaluate non-conversion.
```{r cliprange-allcs}
ccliptargetgr <- targetgr
firstc <- map_int(cxpos[["cposWS"]],pluck,1)
lastc <- map_int(cxpos[["cposWS"]],~pluck(.x, length(.x)))
start(ccliptargetgr) <- start(targetgr) - 1 + firstc
end(ccliptargetgr) <- start(targetgr) - 1 + lastc
ccliptargetgr[c(1:2,38)]
```
How wide are the intervals we're mapping and how much wider than when we restrict to the CpGs only?
```{r how-much-longer}
width(ccliptargetgr)
summary(width(ccliptargetgr)-width(cliptargetgr))
```
These look plenty long for assessing non-conversion.
50\% of the amplicons are 12 or fewer bases wider. Those that are much wider are probably the amplicons with few CpGs (i.e. there are 6 with only 1 CpG).
Let's check these against the human ref genome. I'll add 1 base to see if the targets with a C in the final position is a CpG. This applies for the W strand. I have not done the same for the C strand.
```{r viewclippedtargets}
#to view complete CpGs, I will include 1 base 3' of amplicon.
clipgrplus1 <- ccliptargetgr
end(clipgrplus1) <- end(ccliptargetgr) + 1
vi = suppressWarnings(Views(Hsapiens,clipgrplus1))
clipseqplus1 <- as(vi,"DNAStringSet")
clipseqplus1
```
The above region will be my interval for counting reads. Then I need to get the C positions in this interval so we can call the conversion/methylation properly at those positions. (C positions on the W strand; G positions for the C strand)
This looks like a good time to split the amplicons by whether they are on the Watson or Crick strand.
```{r getWS-sequences-and-positions}
Ws <- which(values(clipgrplus1)$WC_strand=="W")
firstc <- map_int(cxpos[["cposWS"]],pluck,1)
cgposWSafterclip <- map2(cxpos$cgpos,firstc, ~ .x -.y + 1) # 38 obj
chposWSafterclip <- map2(cxpos$chposWS,firstc[Ws], ~ .x -.y + 1) # 16 obj
```
I will clip the C strand as if it is a W strand (already done above). It will give me all CpGs, and should capture enough Gs (C-complements) to assess bisulfite (non-) conversion. So, let's grab just the Crick strands and find the HpGs for assessing bisulfite non-conversion.
```{r getCS-sequences-and-positions}
Cs <- which(values(clipgrplus1)$WC_strand=="C")
nCtargets <- length(Cs)
# Now let's find the CpHs (HpGs)
chposCSafterclip <- vector(mode = "list", length = nCtargets)
for (i in 1:nCtargets)
{ts <- clipseqplus1[Cs[i]]
dn <- BStringSet(toString(ts), end=2:width(ts), width=2)
chposCSafterclip[[i]] <- c(which(dn=="AG"),which(dn=="GG"),which(dn=="TG"))+1
chposCSafterclip[[i]] <- sort(as.integer(chposCSafterclip[[i]]))
}
```
Save these results for calling DNA methylation haplotypes. Save the trailing G position because the minus strand-first mate in pair reads will need to be reverse engineered.
```{r saveme}
seqlevels(clipgrplus1) <- unique(as.vector(seqnames(clipgrplus1)))
save(clipgrplus1,file="data/clipgrplus1.rda")
save(cgposWSafterclip,file="data/cgposWSafterclip.rda")
save(chposWSafterclip,file="data/chposWSafterclip.rda")
save(chposCSafterclip,file="data/chposCSafterclip.rda")
```
```{r sI}
sessionInfo()
```