-
Notifications
You must be signed in to change notification settings - Fork 1
/
README.Rmd
216 lines (153 loc) · 7.4 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "README-"
)
```
## Overview
## Installation
```{r, eval = FALSE}
# install.packages("devtools")
devtools::install_github("samabs/conliga")
```
## Usage
Load the `conliga` package.
```{r}
library(conliga)
```
The typical workflow:
1. `choose_loci()`: filter loci.
2. `fit_controls()`: fit a vector of proportions, using a set of control samples
3. `run_MCMC()`: run the MCMC to obtain posterior probabilities of the latent variables in the conliga model
4. `process_MCMC()`: process the MCMC to summarise the posterior probabilities to obtain relative copy number profiles and basic plots
## Example
Let's go through an example using the count data from the paper.
### Set-up
First let's load the loci count data.
```{r}
# load the loci count data for the samples
data(loci_counts)
loci_counts[1:6, 1:6]
```
Now we have a `loci_counts` data frame with `r nrow(loci_counts)` rows corresponding to genomic loci. The data frame contains `r ncol(loci_counts)` columns. The first 3 columns contain `chr`, `leftPos` and `strand` and correspond to the chromosome, position and strand of a locus. The other `r ncol(loci_counts) - 3` columns represent count observations for each sample.
### Filtering loci using `choose_loci()`
Now we filter the rows of the data frame to filter out noise and retain 'robust' loci for further analysis. We choose a set of samples to use as a basis for filtering and a threshold on the number of zeros permitted in the set of samples. It can be useful to list the samples we want to use as a basis for filtering in a csv or tsv file.
```{r, message=FALSE}
samples_file = system.file("extdata",
"samples_for_robust_loci.csv",
package = "conliga",
mustWork = TRUE)
samples_for_filtering = readr::read_csv(samples_file) %>%
.$Sample
samples_for_filtering
```
At the very least, we must ensure that at least one of the control samples used for fitting the proportions (see next section) has a non-zero count. In this example, we aggressively filter the loci by removing a locus with a zero count in any of out control samples.
Now we run the `choose_loci` function, passing it:
* an analysis path which acts as out base directory for all output in our analysis.
* the counts data frame
* the chromosomes we are interested in (in this case chr1-22)
* the samples we use as a basis for filtering
* the number of permitted zeros
If a locus has more than the number of permitted zeros in the samples list and/or it is in a chromosome we are not interested in, it is filtered.
```{r, message=FALSE}
# the base directory for the analysis output files
analysis_path = "~/my_analysis_path"
# the chromosomes we want to analyse
chromosomes = paste0("chr", 1:22)
filtered_counts = choose_loci(analysis_path=analysis_path,
counts=loci_counts,
chr_order=chromosomes,
samples_used_to_filter=samples_for_filtering,
zeros_permitted=0)
filtered_counts[1:6, 1:6]
```
Now we are left with `r nrow(filtered_counts)` loci in the `filtered_counts` data frame.
## Fitting the proportion of read counts using `fit_controls()`
Now we use a set of controls (which are assumed to be diploid at each locus) from a batch, to fit the proportion of read counts at each locus.
We need to provide a data frame which describes the samples we will be analysing and defines the samples we will use as controls:
```{r, message=FALSE}
sample_types_file = system.file("extdata",
"samples_info.csv",
package = "conliga",
mustWork = TRUE)
sample_types = readr::read_csv(sample_types_file)
sample_types
```
The column `Control` is `1` if the sample is a control and `0` otherwise.
```{r, eval=FALSE}
fit_data = fit_controls(analysis_path=analysis_path,
counts = filtered_counts,
sample_types=sample_types,
iterations = 20000,
burn_in = 5000,
precision_prior = c(1.5, 1000000) # shape and scale of gamma prior on the sample inverse dispersion parameter
)
```
This returns a list including `map_means`, which is a vector of MAP estimates for the loci count proportion bias $\mathbf{m}$. It also includes `run_data`, which is the loci counts for the samples we are interested in.
### Running the MCMC to infer the relative copy number profiles using `run_MCMC()`
Get data ready to run the MCMC.
```{r, message=FALSE}
# read cytoband information cytoBandIdeo.txt file - obtained from UCSC
cytoband_file = system.file("extdata",
"cytoBandIdeo.txt",
package = "conliga",
mustWork = TRUE)
cytoband = read_cytobands(cytoband_file,
chr_order=chromosomes)
# alternatively get cytoband information by querying UCSC
# cytoband = get_cytobands(genome="hg38",
# chr_order=chromosomes)
cytoband
# we need to state our priors in the following format
priors_file = system.file("extdata",
"MCMC_priors.csv",
package = "conliga",
mustWork = TRUE)
priors = readr::read_csv(priors_file)
priors
# which samples do we want to infer relative copy number profiles for?
samples_to_run = c("SLX-9396.FastSeqB",
"SLX-9396.FastSeqC")
# or run all non-controls:
# samples_to_run = sample_types %>%
# dplyr::filter(Control == 0) %>%
# .$Sample
num_cores = 2
iterations = 50000
thin = 10
```
Now we run the function `run_MCMC`.
```{r, eval=FALSE}
run_MCMC(analysis_path=analysis_path,
cytoband=cytoband,
samples=samples_to_run,
fit_data=fit_data,
priors=priors,
num_cores=num_cores,
iterations=iterations,
thin=thin,
gamma = 1, # set gamma set to 1,
max_states=30,
init_states=TRUE) # start chain off by k-means clustering
```
Note that we have fixed the hyperparameter gamma to 1. The other hyperparameters alpha+kappa and rho will be inferred. The `run_MCMC` function usually takes approximately 30 mins per 10,000 iterations of the MCMC (depending on your machine).
### Processing the MCMC chains and obtaining results with `process_MCMC()`
Now we process the MCMC using the the `process_MCMC` function. Here we have selected to burn the first 10000 iterations.
```{r, eval=FALSE}
burn_in = 10000
samples_to_run %>% purrr::map(~process_MCMC(analysis_path=analysis_path,
sample_name=.x,
fit_data=fit_data,
cytoband=cytoband,
burn_in=burn_in))
```
This function saves the inferred relative copy number calls to `analysis_path/results/sample_name/sample_name.tsv` and provides a number of basic plots under `analysis_path/results/sample_name/standard_plots/`.
Finally, we can delete the MCMC output files if we wish.
```{r, eval=FALSE}
unlink(file.path(analysis_path, "MCMC"), recursive = TRUE, force = TRUE)
```