README.Rmd

---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-"
)
```


## Overview


## Installation

```{r, eval = FALSE}
# install.packages("devtools")
devtools::install_github("samabs/conliga")
```


## Usage

Load the `conliga` package.
```{r}
library(conliga)
```

The typical workflow:

1. `choose_loci()`: filter loci.
2. `fit_controls()`: fit a vector of proportions, using a set of control samples
3. `run_MCMC()`: run the MCMC to obtain posterior probabilities of the latent variables in the conliga model
4. `process_MCMC()`: process the MCMC to summarise the posterior probabilities to obtain relative copy number profiles and basic plots

## Example

Let's go through an example using the count data from the paper.

### Set-up

First let's load the loci count data.

```{r}
# load the loci count data for the samples
data(loci_counts)

loci_counts[1:6, 1:6]
```

Now we have a `loci_counts` data frame with `r nrow(loci_counts)` rows corresponding to genomic loci.  The data frame contains `r ncol(loci_counts)` columns.  The first 3 columns contain `chr`, `leftPos` and `strand` and correspond to the chromosome, position and strand of a locus.  The other `r ncol(loci_counts) - 3` columns represent count observations for each sample.

### Filtering loci using `choose_loci()`

Now we filter the rows of the data frame to filter out noise and retain 'robust' loci for further analysis.  We choose a set of samples to use as a basis for filtering and a threshold on the number of zeros permitted in the set of samples.  It can be useful to list the samples we want to use as a basis for filtering in a csv or tsv file.

```{r, message=FALSE}
samples_file = system.file("extdata",
                           "samples_for_robust_loci.csv",
                           package = "conliga",
                           mustWork = TRUE)

samples_for_filtering = readr::read_csv(samples_file) %>%
  .$Sample

samples_for_filtering
```

At the very least, we must ensure that at least one of the control samples used for fitting the proportions (see next section) has a non-zero count.  In this example, we aggressively filter the loci by removing a locus with a zero count in any of out control samples.

Now we run the `choose_loci` function, passing it:
* an analysis path which acts as out base directory for all output in our analysis.
* the counts data frame
* the chromosomes we are interested in (in this case chr1-22)
* the samples we use as a basis for filtering
* the number of permitted zeros

If a locus has more than the number of permitted zeros in the samples list and/or it is in a chromosome we are not interested in, it is filtered.

```{r, message=FALSE}
# the base directory for the analysis output files
analysis_path = "~/my_analysis_path"
 
# the chromosomes we want to analyse
chromosomes = paste0("chr", 1:22)

filtered_counts = choose_loci(analysis_path=analysis_path,
                              counts=loci_counts,
                              chr_order=chromosomes, 
                              samples_used_to_filter=samples_for_filtering,
                              zeros_permitted=0)

filtered_counts[1:6, 1:6]
```

Now we are left with `r nrow(filtered_counts)` loci in the `filtered_counts` data frame.

## Fitting the proportion of read counts using `fit_controls()`

Now we use a set of controls (which are assumed to be diploid at each locus) from a batch, to fit the proportion of read counts at each locus.

We need to provide a data frame which describes the samples we will be analysing and defines the samples we will use as controls:

```{r, message=FALSE}
sample_types_file = system.file("extdata",
                                "samples_info.csv",
                                package = "conliga",
                                mustWork = TRUE)

sample_types = readr::read_csv(sample_types_file)
sample_types
```

The column `Control` is `1` if the sample is a control and `0` otherwise.

```{r, eval=FALSE}
fit_data = fit_controls(analysis_path=analysis_path,
                        counts = filtered_counts,
                        sample_types=sample_types,
                        iterations = 20000,
                        burn_in = 5000,
                        precision_prior = c(1.5, 1000000) # shape and scale of gamma prior on the sample inverse dispersion parameter 
                        )
```


This returns a list including `map_means`, which is a vector of MAP estimates for the loci count proportion bias $\mathbf{m}$.  It also includes `run_data`, which is the loci counts for the samples we are interested in.

### Running the MCMC to infer the relative copy number profiles using `run_MCMC()`

Get data ready to run the MCMC.
```{r, message=FALSE}
# read cytoband information cytoBandIdeo.txt file - obtained from UCSC

cytoband_file = system.file("extdata",
                            "cytoBandIdeo.txt",
                            package = "conliga",
                            mustWork = TRUE)

cytoband = read_cytobands(cytoband_file,
                          chr_order=chromosomes)

# alternatively get cytoband information by querying UCSC
# cytoband = get_cytobands(genome="hg38",
#                          chr_order=chromosomes)

cytoband

# we need to state our priors in the following format
priors_file = system.file("extdata",
                          "MCMC_priors.csv",
                          package = "conliga",
                          mustWork = TRUE)

priors = readr::read_csv(priors_file)

priors

# which samples do we want to infer relative copy number profiles for?
samples_to_run = c("SLX-9396.FastSeqB",
                   "SLX-9396.FastSeqC")

# or run all non-controls:
# samples_to_run = sample_types %>%
#   dplyr::filter(Control == 0) %>%
#   .$Sample

num_cores = 2
iterations = 50000
thin = 10

```

Now we run the function `run_MCMC`.

```{r, eval=FALSE}

run_MCMC(analysis_path=analysis_path,
         cytoband=cytoband,
         samples=samples_to_run,
         fit_data=fit_data,
         priors=priors,
         num_cores=num_cores,
         iterations=iterations,
         thin=thin,
         gamma = 1, # set gamma set to 1,
         max_states=30,
         init_states=TRUE) # start chain off by k-means clustering
```
Note that we have fixed the hyperparameter gamma to 1.  The other hyperparameters alpha+kappa and rho will be inferred.  The `run_MCMC` function usually takes approximately 30 mins per 10,000 iterations of the MCMC (depending on your machine).

### Processing the MCMC chains and obtaining results with `process_MCMC()`

Now we process the MCMC using the the `process_MCMC` function.  Here we have selected to burn the first 10000 iterations.
```{r, eval=FALSE}

burn_in = 10000

samples_to_run %>% purrr::map(~process_MCMC(analysis_path=analysis_path,
                                            sample_name=.x,
                                            fit_data=fit_data,
                                            cytoband=cytoband,
                                            burn_in=burn_in))
```
This function saves the inferred relative copy number calls to `analysis_path/results/sample_name/sample_name.tsv` and provides a number of basic plots under `analysis_path/results/sample_name/standard_plots/`.

Finally, we can delete the MCMC output files if we wish.
```{r, eval=FALSE}
unlink(file.path(analysis_path, "MCMC"), recursive = TRUE, force = TRUE)
```