Skip to content

Latest commit

 

History

History
1515 lines (1139 loc) · 49.2 KB

README.md

File metadata and controls

1515 lines (1139 loc) · 49.2 KB

Chick 10x Neural Development

This repository provides the code to run the 10x single cell RNA-eq analysis in Trevers et al. 2023.


Data availability

This repository contains the required code to run the entire alignment and downstream analysis pipeline. For those who would like to re-run the analysis, the following files should be downloaded:

  • to run the analysis including the alignment, the raw fastq sequencing files can be found here.
  • to run the downstream analysis from the UMI counts generated from 10x Genomics Cell Ranger are embedded can be found within the repository here.

Reproducing our analyses

Reproducing our analysis is computationally intensive, therefore we recommend to run the pipeline on a HPC.

In order to make our analysis fully reproducible, we have developed a Nextflow pipeline which executes the analysis within our custom Docker container.

To re-run our analysis you will need to:

Important! All shell scripts are to be executed from the base directory of the project!


Download GitHub repository

In order to reproduce our analysis, you will first need to download our otic-repregramming GitHub repository. To do this run:

git clone https://github.com/alexthiery/10x_neural_tube

Download data

To download the GalGal5 genome from Ensembl, run:

rsync -av rsync://ftp.ensembl.org/ensembl/pub/release-94/fasta/gallus_gallus/dna/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel.fa.gz <LOCAL_PATH>

To download the Galgal5 gtf from Ensembl, run:

rsync -av rsync://ftp.ensembl.org/ensembl/pub/release-94/gtf/gallus_gallus/Gallus_gallus.Gallus_gallus-5.0.94.gtf.gz <LOCAL_PATH>

Once the genome files have been downloaded, they need to be unzipped before running the alignment.

ADD SEQUENCING DATA DOWNLOAD PATHS


Nextflow configuration file

Configuration properties and file paths are passed to Nextflow via configuration files.

In order to re-run the alignments and downstream analysis for this project, you will need to create a custom configuration file which contains:

  • max system resource allocation
  • Docker/Singularity activation

Below is a copy of the config file used to run the pipeline at The Francis Crick Institute HPC. Given that this alignment was run on a HPC, we configure Nextflow to run with Singularity instead of Docker.

#!/usr/bin/env nextflow

singularity {
  enabled = true
  autoMounts = true
  docker.enabled = false
}

process {
  executor = 'slurm'
}

params {
  max_memory = 224.GB
  max_cpus = 32
  max_time = 72.h
}

Further information on Nextflow configuration can be found here.

After you have saved your custom configuration file, you are ready to align the data!


Sequence alignment

Our 10x data is aligned using a custom Nextflow pipeline.

Sample FASTQ paths are provided via a samplesheet csv. The template csv used to run this analysis can be found template.

Once you have changed the sample paths in the samplesheet csv, execute the following commans to align the 10x data.

export NXF_VER=20.07.1

nextflow run NF-10x_alignment/main.nf \
-c <PATH_TO_CONFIG> \
--samplesheet <PATH_TO_SAMPLESHEET> \
--output output/NF-10x_alignment \
--gtf <PATH_TO_GTF> \
--fasta <PATH_TO_GENOME> \
-resume

Interactive downstream analysis

If do not want to re-run the alignment, but would like to run the downstream analysis from the count files, you can run RStudio from within our Docker container. This will ensure that you have all of the same packages and dependencies required to carry out the analysis.

To interactively explore the data, follow these steps:

  1. clone our GitHub repository to your local computer - git clone https://github.com/alexthiery/10x_neural_tube
  2. start a terminal session and pull the Docker image from Dockerhub - docker pull alexthiery/10x_neural_tube:v1.1
  3. within terminal launch a Docker container interactively - docker run --rm -e PASSWORD=password -p 8787:8787 -v <PATH_TO_LOCAL_REPOSITORY>:/home/rstudio alexthiery/10x_neural_tube:v1.1
  4. go to localhost:8787 in the browser to open RStudio
  5. enter the username rstudio and the password password
  6. access the downstream analysis R script in the Files tab in R studio by following the path ./bin/R/seurat_full.R

Downstream analysis pipeline

This analysis is ran using Seurat v3.1.5 For more information see https://satijalab.org/seurat/.

To navigate between the different sections of our analysis click here:

  1. Calculating sex effect and removing sex genes
  2. Identify and remove contamination
  3. Remove poor quality clusters
  4. Check for cell cycle effect
  5. Cell state classification
  6. Gene modules
  7. Subset neural clusters
  8. Pseudotime

Load packages

#!/usr/bin/env Rscript

reticulate::use_python('/usr/bin/python3.7')
library(Seurat)

library(getopt)
library(future)
library(cowplot)
library(clustree)
library(gridExtra)
library(grid)
library(pheatmap)
library(RColorBrewer)
library(Antler)
library(ggbeeswarm)
library(tidyverse)

Set environmental variables, load data and make Seurat object In order to be able to run the script from either Rstudio, local terminal, or cluster terminal, I add a switch which looks for command line arguments. This then sets the directory paths accordingly.

# set arguments for Rscript
spec = matrix(c(
  'runtype', 'l', 2, "character",
  'cores'   , 'c', 2, "integer",
  'customFuncs', 'm', 2, "character",
  'networkGenes', 'd', 2, "character"
), byrow=TRUE, ncol=4)
opt = getopt(spec)

# set default location
if(length(commandArgs(trailingOnly = TRUE)) == 0){
  cat('No command line arguments provided, user defaults paths are set for running interactively in Rstudio on docker\n')
  opt$runtype = "user"
} else {
  if(is.null(opt$runtype)){
    stop("--runtype must be either 'user', 'nextflow' or 'docker'")
  }
  if(tolower(opt$runtype) != "docker" & tolower(opt$runtype) != "user" & tolower(opt$runtype) != "nextflow"){
    stop("--runtype must be either 'user', 'nextflow' or 'docker'")
  }
  if(tolower(opt$runtype) == "nextflow"){
    if(is.null(opt$customFuncs)){
      stop("--customFuncs path must be specified")
    }
    if(is.null(opt$networkGenes)){
      stop("--networkGenes path must be specified")
    }
  }
}

####################################################################
# user paths need to be defined here in order to run interactively #
####################################################################
if (opt$runtype == "user"){
  
  # Input data paths
  data_path = "./output/NF-10x_alignment/cellrangerCounts"
  custom_func_path = './bin/R/custom_functions/'
  network_genes_path = './input_files/network_expression.csv'

  # Output data paths  
  plot_path = "./output/NF-downstream_analysis/plots/"
  rds_path = "./output/NF-downstream_analysis/rds_files/"
  antler_dir = "./output/NF-downstream_analysis/antler_input/"

} else if (opt$runtype == "nextflow"){
  cat('pipeling running through nextflow\n')

  # Input data paths
  data_path = "./input/cellrangerCounts"
  custom_func_path = opt$customFuncs
  network_genes_path = opt$networkGenes
  
  # Output data paths
  plot_path = "./plots/"
  rds_path = "./rds_files/"
  antler_dir = "./antler_input/"

}

# Load custom functions
sapply(list.files(custom_func_path, full.names = T), source)

# Create output directories
dir.create(plot_path, recursive = T)
dir.create(rds_path, recursive = T)
dir.create(antler_dir, recursive = T)
  
# Read input files
files <- list.files(data_path, recursive = T, full.names = T)
# Remove file suffix
file.path <- dirname(files)[!duplicated(dirname(files))]
# Make dataframe with tissue matching directory
tissue = c("hh4", "hh6", "ss4", "ss8")
matches <- sapply(tissue, function(x) file.path[grep(pattern = x, x = file.path)])
sample.paths <- data.frame(tissue = names(matches), path = matches, row.names = NULL)

# Read in network genes and remove 0 timepoint
network_genes <- read.csv(network_genes_path) %>% filter(!timepoint == 0)

Set number of cores to use for parallelisation

if(is.null(opt$cores)){ncores = 4}else{ncores= opt$cores}
cat(paste0("script ran with ", ncores, " cores\n"))

Make Seurat objects for each of the different samples.

for(i in 1:nrow(sample.paths["path"])){
  name<-paste(sample.paths[i,"tissue"])
  assign(name, CreateSeuratObject(counts= Read10X(data.dir = paste(sample.paths[i,"path"])), project = paste(sample.paths[i, "tissue"])))
}

The four Seurat objects are then merged, before running CreateSeuratObject again on the output in order to apply the min.cells parameter on the final merged dataset.

temp <- merge(hh4, y = c(hh6, ss4, ss8), add.cell.ids = c("hh4", "hh6", "ss4", "ss8"), project = "chick.10x")
merged.data<-CreateSeuratObject(GetAssayData(temp), min.cells = 3, project = "chick.10x.mincells3")

Make seurat object with ensembl names and save as separate dataframe for adding to misc slot

for(i in 1:nrow(sample.paths["path"])){
  name<-paste(sample.paths[i,"tissue"])
  assign(paste0(name, "_ensID"), CreateSeuratObject(counts= Read10X(data.dir = paste(sample.paths[i,"path"]), gene.column = 1), project = paste(sample.paths[i, "tissue"])))
}
temp <- merge(hh4_ensID, y = c(hh6_ensID, ss4_ensID, ss8_ensID), add.cell.ids = c("hh4", "hh6", "ss4", "ss8"), project = "chick.10x")
merged.data_ensID<-CreateSeuratObject(GetAssayData(temp), min.cells = 3, project = "chick.10x.mincells3")

Add gene IDs dataframe to merged data object

Misc(merged.data, slot = "geneIDs") <- cbind("gene_ID" = rownames(merged.data_ensID), "gene_name" =  rownames(merged.data))

The original Seurat objects are then removed from the global environment

rm(hh4, hh6, ss4, ss8, sample.paths, temp, hh4_ensID, hh6_ensID, ss4_ensID, ss8_ensID, merged.data_ensID)

Store mitochondrial percentage in object meta data

merged.data <- PercentageFeatureSet(merged.data, pattern = "^MT-", col.name = "percent.mt")

Remove data which do not pass filter threshold

merged.data <- subset(merged.data, subset = c(nFeature_RNA > 1000 & nFeature_RNA < 6000 & percent.mt < 15))

Log normalize data and find variable features

norm.data <- NormalizeData(merged.data, normalization.method = "LogNormalize", scale.factor = 10000)
norm.data <- FindVariableFeatures(norm.data, selection.method = "vst", nfeatures = 2000)

Enable parallelisation

plan("multiprocess", workers = ncores)
options(future.globals.maxSize = 2000 * 1024^2)

Scale data and regress out MT content

norm.data <- ScaleData(norm.data, features = rownames(norm.data), vars.to.regress = "percent.mt")

Perform dimensionality reduction by PCA and UMAP embedding

Change plot path

curr.plot.path <- paste0(plot.path, '0_filt_data/')
dir.create(curr.plot.path)

Run PCA analysis on the each set of data

norm.data <- RunPCA(object = norm.data, verbose = FALSE)

Seurat's clustering algorithm is based on principle components, so we need to ensure that only the informative PCs are kept

Plot heatmap of top variable genes across top principle components

png(paste0(curr.plot.path, "dimHM.png"), width=30, height=50, units = 'cm', res = 200)
DimHeatmap(norm.data, dims = 1:30, balanced = TRUE, cells = 500)
graphics.off()

Another heuristic method is ElbowPlot which ranks PCs based on the % variance explained by each PC

png(paste0(curr.plot.path, "elbowplot.png"), width=24, height=20, units = 'cm', res = 200)
print(ElbowPlot(norm.data, ndims = 40))
graphics.off()

Run clustering and UMAP at different PCA cutoffs - save this output to compare the optimal number of PCs to be used

png(paste0(curr.plot.path, "UMAP_PCA_comparison.png"), width=40, height=30, units = 'cm', res = 200)
PCA.level.comparison(norm.data, PCA.levels = c(7, 10, 15, 20), cluster_res = 0.5)
graphics.off()
Dimensions heatmap ElbowPlot PCA level comparison

Use PCA=15 as elbow plot is relatively stable across stages Use clustering resolution = 0.5 for filtering

norm.data <- FindNeighbors(norm.data, dims = 1:15, verbose = FALSE)
norm.data <- RunUMAP(norm.data, dims = 1:15, verbose = FALSE)
norm.data <- FindClusters(norm.data, resolution = 0.5, verbose = FALSE)

Plot UMAP for clusters and developmental stage

png(paste0(curr.plot.path, "UMAP.png"), width=40, height=20, units = 'cm', res = 200)
clust.stage.plot(norm.data)
graphics.off()

Plot QC for each cluster

png(paste0(curr.plot.path, "cluster.QC.png"), width=40, height=14, units = 'cm', res = 200)
QC.plot(norm.data)
graphics.off()
UMAP Clsuter QC

Find differentially expressed genes and plot heatmap of top 15 DE genes for each cluster

# Find differentially expressed genes and plot heatmap of top DE genes for each cluster
markers <- FindAllMarkers(norm.data, only.pos = T, logfc.threshold = 0.25)

# get automated cluster order based on percentage of cells in adjacent stages
cluster.order = order.cell.stage.clust(seurat_object = norm.data, col.to.sort = seurat_clusters, sort.by = orig.ident)

# Re-order genes in top15 based on desired cluster order in subsequent plot - this orders them in the heatmap in the correct order
top15 <- markers %>% group_by(cluster) %>% top_n(n = 15, wt = avg_logFC) %>% arrange(factor(cluster, levels = cluster.order))

png(paste0(curr.plot.path, 'HM.top15.DE.png'), height = 50, width = 75, units = 'cm', res = 700)
tenx.pheatmap(data = norm.data, metadata = c("seurat_clusters", "orig.ident"), custom_order_column = "seurat_clusters",
              custom_order = cluster.order, selected_genes = unique(top15$gene), gaps_col = "seurat_clusters")
graphics.off()

Heatmap clearly shows clusters segregate by sex - check this and remove sex genes


1) Calculating sex effect and removing sex genes

Change plot path

curr.plot.path <- paste0(plot.path, '1_sex_filt/')
dir.create(curr.plot.path)

There is a strong sex effect - this plot shows DE genes between clusters 1 and 2 which are hh4 clusters. Clustering is driven by sex genes

png(paste0(curr.plot.path, 'HM.top15.DE.pre-sexfilt.png'), height = 40, width = 70, units = 'cm', res = 500)
tenx.pheatmap(data = norm.data[,rownames(norm.data@meta.data[norm.data$seurat_clusters == 1 | norm.data$seurat_clusters == 2,])],
              metadata = c("seurat_clusters", "orig.ident"), selected_genes = rownames(FindMarkers(norm.data, ident.1 = 1, ident.2 = 2)),
              hclust_rows = T, gaps_col = "seurat_clusters")
graphics.off()


Use W chromosome genes to K-means cluster the cells into male (zz) and female (zw)

W_genes <- as.matrix(norm.data@assays$RNA[grepl("W-", rownames(norm.data@assays$RNA)),])
k_clusters <- kmeans(t(W_genes), 2)
k_clusters <- data.frame(k_clusters$cluster)
norm.data@meta.data$k_clusters <- k_clusters[match(colnames(norm.data@assays$RNA), rownames(k_clusters)),]

Get rownames for kmeans clusters 1 and 2

k_clus_1 <- rownames(norm.data@meta.data[norm.data@meta.data$k_clusters == 1,])
k_clus_2 <- rownames(norm.data@meta.data[norm.data@meta.data$k_clusters == 2,])

K clustering identities are stochastic, so I need to identify which cluster is male and female Sum of W genes is order of magnitude greater in cluster 2 - these are the female cells

sumclus1 <- sum(W_genes[,k_clus_1])
sumclus2 <- sum(W_genes[,k_clus_2])

if(sumclus1 < sumclus2){
  k.male <- k_clus_1
  k.female <- k_clus_2
} else {
  k.female <- k_clus_1
  k.male <- k_clus_2
}

cell.sex.ID <- list("male.cells" = k.male, "female.cells" = k.female)

Add sex data to meta.data

norm.data@meta.data$sex <- unlist(lapply(rownames(norm.data@meta.data), function(x)
  if(x %in% k.male){"male"} else if(x %in% k.female){"female"} else{stop("cell sex is not assigned")}))

Filter W chrom genes

Following subsetting of cells and/or genes, the same pipeline as above is repeated i.e. Find variable features -> Scale data/regress out confounding variables -> PCA -> Find neighbours -> Run UMAP -> Find Clusters -> Cluster QC -> Find top DE genes

Remove W genes

norm.data.sexscale <- norm.data[rownames(norm.data)[!grepl("W-", rownames(norm.data))],]

Re-run findvariablefeatures and scaling

norm.data.sexscale <- FindVariableFeatures(norm.data.sexscale, selection.method = "vst", nfeatures = 2000)

# Enable parallelisation
plan("multiprocess", workers = ncores)
options(future.globals.maxSize = 2000 * 1024^2)

norm.data.sexscale <- ScaleData(norm.data.sexscale, features = rownames(norm.data.sexscale), vars.to.regress = c("percent.mt", "sex"))

Set plot path

curr.plot.path <- paste0(plot.path, '1_sex_filt/')

PCA

norm.data.sexfilt <- RunPCA(object = norm.data.sexfilt, verbose = FALSE)

png(paste0(curr.plot.path, "dimHM.png"), width=30, height=50, units = 'cm', res = 200)
DimHeatmap(norm.data.sexfilt, dims = 1:30, balanced = TRUE, cells = 500)
graphics.off()

png(paste0(curr.plot.path, "elbowplot.png"), width=24, height=20, units = 'cm', res = 200)
print(ElbowPlot(norm.data.sexfilt, ndims = 40))
graphics.off()

png(paste0(curr.plot.path, "UMAP_PCA_comparison.png"), width=40, height=30, units = 'cm', res = 200)
PCA.level.comparison(norm.data.sexfilt, PCA.levels = c(7, 10, 15, 20), cluster_res = 0.5)
graphics.off()
Dimensions heatmap ElbowPlot PCA level comparison

Use PCA=15 as elbow plot is relatively stable across stages

norm.data.sexscale <- FindNeighbors(norm.data.sexscale, dims = 1:15, verbose = FALSE)
norm.data.sexscale <- RunUMAP(norm.data.sexscale, dims = 1:15, verbose = FALSE)

Find optimal cluster resolution

png(paste0(curr.plot.path, "clustree.png"), width=70, height=35, units = 'cm', res = 200)
clust.res(seurat.obj = norm.data.sexscale, by = 0.1)
graphics.off()

Use clustering resolution = 0.5 to look for contamination clusters

norm.data.sexscale <- FindClusters(norm.data.sexscale, resolution = 0.5, verbose = FALSE)

Plot UMAP for clusters and developmental stage

png(paste0(curr.plot.path, "UMAP.png"), width=40, height=20, units = 'cm', res = 200)
clust.stage.plot(norm.data.sexscale)
graphics.off()

Plot QC for each cluster

png(paste0(curr.plot.path, "cluster.QC.png"), width=40, height=14, units = 'cm', res = 200)
QC.plot(norm.data.sexscale)
graphics.off()
ClusTree UMAP Cluster QC

Find differentially expressed genes and plot heatmap of top DE genes for each cluster

markers <- FindAllMarkers(norm.data.sexscale, only.pos = T, logfc.threshold = 0.25)
# get automated cluster order based on percentage of cells in adjacent stages
cluster.order = order.cell.stage.clust(seurat_object = norm.data.sexscale, col.to.sort = seurat_clusters, sort.by = orig.ident)
# Re-order genes in top15 based on desired cluster order in subsequent plot - this orders them in the heatmap in the correct order
top15 <- markers %>% group_by(cluster) %>% top_n(n = 15, wt = avg_logFC) %>% arrange(factor(cluster, levels = cluster.order))

png(paste0(curr.plot.path, 'HM.top15.DE.post-sexfilt.png'), height = 75, width = 100, units = 'cm', res = 500)
tenx.pheatmap(data = norm.data.sexscale, metadata = c("seurat_clusters", "orig.ident"), custom_order_column = "seurat_clusters",
              custom_order = cluster.order, selected_genes = unique(top15$gene), gaps_col = "seurat_clusters")
graphics.off()


2) Identify and remove contamination

Change plot path

curr.plot.path <- paste0(plot.path, "2_cluster_filt/")
dir.create(curr.plot.path)

Identify mesoderm and PGCs using candidate genes

genes <- c("EYA2", "SIX1", "TWIST1", "PITX2", "SOX17", "DAZL", "DND1", "CXCR4")

ncol = 3
png(paste0(curr.plot.path, "UMAP_GOI.png"), width = ncol*10, height = 12*ceiling(length(genes)/ncol), units = "cm", res = 200)
multi.feature.plot(seurat.obj = norm.data.sexfilt, gene.list = genes, plot.clusters = T,
                   plot.stage = T, label = "", cluster.col = "RNA_snn_res.1.4", n.col = ncol)
graphics.off()

Dotplot for identifying PGCs, Early mesoderm and Late mesoderm

png(paste0(curr.plot.path, "dotplot.GOI.png"), width = 20, height = 12, units = "cm", res = 200)
DotPlot(norm.data.sexfilt, features = c( "SOX17", "CXCR4","EYA2", "TWIST1", "SIX1",  "PITX2", "DAZL"))
graphics.off()

Feature plots DotPlots

Remove contaminating cells from clusters

Clust 8 = early mesoderm - expresses sox17, eya2, pitx2, cxcr4 Clust 10 = Late mesoderm - expresses twist1, six1, eya2 Clust 11 = PGC's - expresses dazl very highly

norm.data.contamfilt <- rownames(norm.data.sexscale@meta.data)[norm.data.sexscale@meta.data$seurat_clusters ==  8 |
                                                                norm.data.sexscale@meta.data$seurat_clusters == 10 |
                                                                norm.data.sexscale@meta.data$seurat_clusters == 11]

norm.data.contamfilt <- subset(norm.data.sexscale, cells = norm.data.contamfilt, invert = T)

Re-run findvariablefeatures and scaling

norm.data.contamfilt <- FindVariableFeatures(norm.data.contamfilt, selection.method = "vst", nfeatures = 2000)

# Enable parallelisation
plan("multiprocess", workers = ncores)
options(future.globals.maxSize = 2000 * 1024^2)

norm.data.contamfilt <- ScaleData(norm.data.contamfilt, features = rownames(norm.data.contamfilt), vars.to.regress = c("percent.mt", "sex"))

PCA

norm.data.contamfilt <- RunPCA(object = norm.data.contamfilt, verbose = FALSE)

png(paste0(curr.plot.path, "dimHM.png"), width=30, height=50, units = 'cm', res = 200)
DimHeatmap(norm.data.contamfilt, dims = 1:30, balanced = TRUE, cells = 500)
graphics.off()

png(paste0(curr.plot.path, "elbowplot.png"), width=24, height=20, units = 'cm', res = 200)
print(ElbowPlot(norm.data.contamfilt, ndims = 40))
graphics.off()

png(paste0(curr.plot.path, "UMAP_PCA_comparison.png"), width=40, height=30, units = 'cm', res = 200)
PCA.level.comparison(norm.data.contamfilt, PCA.levels = c(7, 10, 15, 20), cluster_res = 0.5)
graphics.off()
Dimensions heatmap ElbowPlot PCA level comparison

Use PCA=15 as elbow plot is relatively stable across stages

norm.data.contamfilt <- FindNeighbors(norm.data.contamfilt, dims = 1:15, verbose = FALSE)
norm.data.contamfilt <- RunUMAP(norm.data.contamfilt, dims = 1:15, verbose = FALSE)

Find optimal cluster resolution

png(paste0(curr.plot.path, "clustree.png"), width=70, height=35, units = 'cm', res = 200)
clust.res(seurat.obj = norm.data.contamfilt, by = 0.2)
graphics.off()

Use clustering resolution = 1.4 in order to make lots of clusters and identify any remaining poor quality cells

norm.data.contamfilt <- FindClusters(norm.data.contamfilt, resolution = 1.4)

Plot UMAP for clusters and developmental stage

png(paste0(curr.plot.path, "UMAP.png"), width=40, height=20, units = 'cm', res = 200)
clust.stage.plot(norm.data.contamfilt)
graphics.off()

Plot QC for each cluster

png(paste0(curr.plot.path, "cluster.QC.png"), width=45, height=14, units = 'cm', res = 200)
QC.plot(norm.data.contamfilt)
graphics.off()
ClusTree UMAP Cluster QC

3) Remove poor quality clusters

Clust 11, 14, 16, 18 = poor quality cells

norm.data.clustfilt <- rownames(norm.data.contamfilt@meta.data)[norm.data.contamfilt@meta.data$seurat_clusters ==  11 |
                                                                  norm.data.contamfilt@meta.data$seurat_clusters == 14 |
                                                                  norm.data.contamfilt@meta.data$seurat_clusters == 16 |
                                                                  norm.data.contamfilt@meta.data$seurat_clusters == 18]

norm.data.clustfilt <- subset(norm.data.contamfilt, cells = norm.data.clustfilt, invert = T)

Re-run findvariablefeatures and scaling

norm.data.clustfilt <- FindVariableFeatures(norm.data.clustfilt, selection.method = "vst", nfeatures = 2000)

# Enable parallelisation
plan("multiprocess", workers = ncores)
options(future.globals.maxSize = 2000 * 1024^2)

norm.data.clustfilt <- ScaleData(norm.data.clustfilt, features = rownames(norm.data.clustfilt), vars.to.regress = c("percent.mt", "sex"))

Change plot path

curr.plot.path <- paste0(plot.path, "3_cluster_filt/")
dir.create(curr.plot.path)

PCA

norm.data.clustfilt <- RunPCA(object = norm.data.clustfilt, verbose = FALSE)

png(paste0(curr.plot.path, "dimHM.png"), width=30, height=50, units = 'cm', res = 200)
DimHeatmap(norm.data.clustfilt, dims = 1:30, balanced = TRUE, cells = 500)
graphics.off()

png(paste0(curr.plot.path, "elbowplot.png"), width=24, height=20, units = 'cm', res = 200)
print(ElbowPlot(norm.data.clustfilt, ndims = 40))
graphics.off()

png(paste0(curr.plot.path, "UMAP_PCA_comparison.png"), width=40, height=30, units = 'cm', res = 200)
PCA.level.comparison(norm.data.clustfilt, PCA.levels = c(7, 10, 15, 20), cluster_res = 0.5)
graphics.off()
Dimensions heatmap ElbowPlot PCA level comparison

Use PCA=15 as elbow plot is relatively stable across stages

norm.data.clustfilt <- FindNeighbors(norm.data.clustfilt, dims = 1:15, verbose = FALSE)
norm.data.clustfilt <- RunUMAP(norm.data.clustfilt, dims = 1:15, verbose = FALSE)

Find optimal cluster resolution

png(paste0(curr.plot.path, "clustree.png"), width=70, height=35, units = 'cm', res = 200)
clust.res(seurat.obj = norm.data.clustfilt, by = 0.2)
graphics.off()

Use clustering resolution = 0.8

norm.data.clustfilt <- FindClusters(norm.data.clustfilt, resolution = 0.8)

Plot UMAP for clusters and developmental stage

png(paste0(curr.plot.path, "UMAP.png"), width=40, height=20, units = 'cm', res = 200)
clust.stage.plot(norm.data.clustfilt)
graphics.off()

Plot QC for each cluster

png(paste0(curr.plot.path, "cluster.QC.png"), width=45, height=14, units = 'cm', res = 200)
QC.plot(norm.data.contamfilt)
graphics.off()
ClusTree UMAP Cluster QC

4) Check for cell cycle effect

Set plot path

curr.plot.path <- paste0(plot.path, "4_cell_cycle/")
dir.create(curr.plot.path)

Calculate cell cycle for each cell

s.genes <- cc.genes$s.genes
g2m.genes <- cc.genes$g2m.genes
pre.cell.cycle.dat <- CellCycleScoring(norm.data.clustfilt, s.features = s.genes, g2m.features = g2m.genes, set.ident = TRUE)

Re-run findvariablefeatures and scaling

norm.data.clustfilt.cc <- FindVariableFeatures(pre.cell.cycle.dat, selection.method = "vst", nfeatures = 2000)

# Enable parallelisation
plan("multiprocess", workers = ncores)
options(future.globals.maxSize = 2000 * 1024^2)

norm.data.clustfilt.cc <- ScaleData(norm.data.clustfilt.cc, features = rownames(norm.data.clustfilt.cc), vars.to.regress = c("percent.mt", "sex", "S.Score", "G2M.Score"))

PCA

norm.data.clustfilt.cc <- RunPCA(object = norm.data.clustfilt.cc, verbose = FALSE)

png(paste0(curr.plot.path, "dimHM.png"), width=30, height=50, units = 'cm', res = 200)
DimHeatmap(norm.data.clustfilt.cc, dims = 1:30, balanced = TRUE, cells = 500)
graphics.off()

png(paste0(curr.plot.path, "elbowplot.png"), width=24, height=20, units = 'cm', res = 200)
print(ElbowPlot(norm.data.clustfilt.cc, ndims = 40))
graphics.off()

png(paste0(curr.plot.path, "UMAP_PCA_comparison.png"), width=40, height=30, units = 'cm', res = 200)
PCA.level.comparison(norm.data.clustfilt.cc, PCA.levels = c(7, 10, 15, 20), cluster_res = 0.5)
graphics.off()
Dimensions heatmap ElbowPlot PCA level comparison

Use PCA=15 as elbow plot is relatively stable across stages

norm.data.clustfilt.cc <- FindNeighbors(norm.data.clustfilt.cc, dims = 1:15, verbose = FALSE)
norm.data.clustfilt.cc <- RunUMAP(norm.data.clustfilt.cc, dims = 1:15, verbose = FALSE)

Find optimal cluster resolution
png(paste0(curr.plot.path, "clustree.png"), width=70, height=35, units = 'cm', res = 200)
clust.res(seurat.obj = norm.data.clustfilt.cc, by = 0.2)
graphics.off()

Use clustering resolution = 1.2

norm.data.clustfilt.cc <- FindClusters(norm.data.clustfilt.cc, resolution = 1.2)

Plot UMAP for clusters and developmental stage

png(paste0(curr.plot.path, "UMAP.png"), width=40, height=20, units = 'cm', res = 200)
clust.stage.plot(norm.data.clustfilt.cc)
graphics.off()

Plot QC for each cluster

png(paste0(curr.plot.path, "cluster.QC.png"), width=40, height=14, units = 'cm', res = 200)
QC.plot(norm.data.clustfilt.cc)
graphics.off()
ClusTree UMAP Cluster QC

UMAP of cell cycle before and after regressing out

png(paste0(curr.plot.path, "cell.cycle.png"), width=40, height=20, units = 'cm', res = 200)
pre.plot <- DimPlot(pre.cell.cycle.dat, group.by = "Phase", reduction = "umap")
post.plot <- DimPlot(norm.data.clustfilt.cc, group.by = "Phase", reduction = "umap")
print(gridExtra::grid.arrange(pre.plot, post.plot, ncol = 2))
graphics.off()


Find differentially expressed genes and plot heatmap of top DE genes for each cluster

markers <- FindAllMarkers(norm.data.clustfilt.cc, only.pos = T, logfc.threshold = 0.25)
# get automated cluster order based on percentage of cells in adjacent stages
cluster.order = order.cell.stage.clust(seurat_object = norm.data.clustfilt.cc, col.to.sort = seurat_clusters, sort.by = orig.ident)
# Re-order genes in top15 based on desired cluster order in subsequent plot - this orders them in the heatmap in the correct order
top15 <- markers %>% group_by(cluster) %>% top_n(n = 15, wt = avg_logFC) %>% arrange(factor(cluster, levels = cluster.order))

png(paste0(curr.plot.path, 'HM.top15.DE.png'), height = 75, width = 100, units = 'cm', res = 500)
tenx.pheatmap(data = norm.data.clustfilt.cc, metadata = c("seurat_clusters", "orig.ident"), custom_order_column = "seurat_clusters",
              custom_order = cluster.order, selected_genes = unique(top15$gene), gaps_col = "seurat_clusters")
graphics.off()


5) Cell state classification

Set plot path

curr.plot.path <- paste0(plot.path, "5_cell_state_classification/")
dir.create(curr.plot.path)

Genes of interest used for cell state classification

GOI = rev(c( "EOMES", "ADMP","YEATS4", "MAFA", "ING5", "LIN28B", "AATF", "SETD2", "GATA2", "DLX5", "TFAP2A", "BMP4", "SIX1", "EYA2", "MSX1", "PAX7", "CSRNP1", "SOX10", "SOX2", "SOX21", "SIX3", "OLIG2", "PAX6", "FOXA2", "SHH", "PAX2", "WNT4", "HOXB2", "HOXA2", "GBX2"))

Change order or clusters for plotting dotplots

cluster_order <- factor(norm.data.clustfilt.cc$seurat_clusters, levels = c(12,5,2,1,3,8,0,11, 10,7,4,14,9,6,13))

Set factor levels for plotting

norm.data.clustfilt.cc$seurat_clusters <- cluster_order

Generate pie charts for cluster dev stage composition

venn_data <- norm.data.clustfilt.cc@meta.data %>%
  rownames_to_column('cell_name') %>%
  dplyr::select(c(cell_name, orig.ident, seurat_clusters)) %>%
  group_by(seurat_clusters) %>%
  count(orig.ident, .drop = FALSE)

venn_data <- venn_data %>%
  mutate(total_cells = sum(n)) %>%
  mutate(n = n/total_cells)

Reverse levels to deal with seurat dotplot reversing y axis

norm.data.clustfilt.cc$seurat_clusters <- fct_rev(cluster_order)

Generate DotPlot

dotplot <- DotPlot(norm.data.clustfilt.cc, group.by = "seurat_clusters", features = GOI) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
        legend.position="bottom", legend.box = "horizontal",
        axis.title.x=element_blank(), legend.text=element_text(size=10),
        legend.title=element_text(size=12))

Generate Pie charts

numeric(n), fill=orig.ident, width = total_cells)) +
  geom_bar(position = 'fill', stat = "identity") +
  facet_wrap( ~ seurat_clusters, nrow = nlevels(norm.data.clustfilt.cc@meta.data$seurat_clusters)) +
  coord_polar("y") +
  theme_void() +
  theme(strip.background = element_blank(), strip.text.x = element_blank(),
        legend.position=c(0,0), legend.direction = "horizontal",
        plot.margin = margin(t = 14, r = 0, b = 119, l = -110, unit = "pt"),
        legend.margin=margin(t = 70, r = 0, b = -100, l = -200, unit = "pt"),
        legend.text=element_text(size=10),
        legend.title=element_text(size=12)) +
  labs(fill = "Stage")

Plot dotplot and pies

png(paste0(curr.plot.path, "dotplot.png"), width = 32, height = 18, units = "cm", res = 200)
print(plot_grid(dotplot, pies, rel_widths = c(5,1)))
graphics.off()

Plot dotplot with cell classification labels

dotplot <- DotPlot(norm.data.clustfilt.cc, group.by = "seurat_clusters", features = GOI) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
        legend.position="bottom", legend.box = "horizontal",
        axis.title.x=element_blank(), legend.text=element_text(size=10),
        axis.title.y = element_blank(), 
        legend.title=element_text(size=12)) +
  scale_y_discrete(labels = rev(c('Node', 'HH4-1', 'HH4-2', 'HH4-3', 'Non-Neural Plate Progenitors',
                                  'Placodes', 'Neural Plate Progenitors', 'NC Progenitors', 'Delaminating NC',
                                  'Early Forebrain', 'Late Forebrain', 'Ventral Forebrain', 'Early Midbrain', 'Late Midbrain', 'Hindbrain')))

png(paste0(curr.plot.path, "dotplot_classified.png"), width = 36, height = 18, units = "cm", res = 200)
print(plot_grid(dotplot, pies, rel_widths = c(5,1)))
graphics.off()
Cluster ID dotplot Cell state dotplot

6) Gene modules

Unbiased identification of modules of co-correlated genes using Antler.

For further information on Antler gene module identification, click here.

Set plot path

curr.plot.path <- paste0(plot.path, "6_gene_modules/")
dir.create(curr.plot.path, recursive = TRUE)

Extract expression data and make dataset compatible with Antler

# strip end of cell names as this is incorrectly reformated in Antler
norm.data.clustfilt.cc <- RenameCells(norm.data.clustfilt.cc, new.names = sub('-.*','',colnames(norm.data.clustfilt.cc)))

antler_data <- data.frame(row.names = colnames(norm.data.clustfilt.cc),
                          "timepoint" = as.numeric(substring(colnames(norm.data.clustfilt.cc), 3, 3)),
                          "treatment" = rep("null", ncol(norm.data.clustfilt.cc)),
                          "replicate_id" = rep(1, ncol(norm.data.clustfilt.cc))
)
# save pheno data
write.table(antler_data, file = paste0(antler.dir, "phenoData.csv"), row.names = T, sep = "\t", col.names = T)
# save count data
write.table(GetAssayData(norm.data.clustfilt.cc, assay = "RNA", slot = "counts"), file = paste0(antler.dir, "assayData.csv"), row.names = T, sep = "\t", col.names = T, quote = F)

Load data into antler

antler <- Antler$new(output_folder = curr.plot.path, num_cores = 4)
antler$load_dataset(folder_path = antler.dir)

Remove genes which do not have >= 1 UMI count in >= 10 cells

antler$exclude_unexpressed_genes(min_cells=10, min_level=1, verbose=T, data_status='Raw')

Normalise data

antler$normalize(method = 'MR')

Calculate unbiased gene modules

antler$gene_modules$identify(
  name                  = "unbiasedGMs",
  corr_t                = 0.3,  # the Spearman correlation treshold
  corr_min              = 3,    # min. number of genes a gene must correlate with
  mod_min_cell          = 10,   # min. number of cells expressing the module
  mod_consistency_thres = 0.4,  # ratio of expressed genes among "positive" cells
  process_plots         = TRUE)

Copy seurat object for plotting

plot_data <- norm.data.clustfilt.cc
plot_data@meta.data <- plot_data@meta.data %>%
  rename(Stage = orig.ident,
         Clusters = seurat_clusters)

Get automated cluster order based on percentage of cells in adjacent stages

cluster.order = order.cell.stage.clust(seurat_object = plot_data, col.to.sort = Clusters, sort.by = Stage)

Plot all gene modules

png(paste0(curr.plot.path, 'allmodules.png'), height = 150, width = 120, units = 'cm', res = 500)
GM.plot(data = plot_data, metadata = c("Clusters", "Stage"), gene_modules = antler$gene_modules$lists$unbiasedGMs$content,
        show_rownames = F, custom_order = cluster.order, custom_order_column = "Clusters")
graphics.off()

Plot gene modules with at least 50% of genes DE logFC > 0.25 & FDR < 0.001

# Find DEGs
DEgenes <- FindAllMarkers(plot_data, only.pos = T, logfc.threshold = 0.25) %>% filter(p_val_adj < 0.001)

# Filter GMs with 50% genes DE logFC > 0.25 & FDR < 0.001
gms <- subset.gm(antler$gene_modules$lists$unbiasedGMs$content, selected_genes = DEgenes$gene, keep_mod_ID = T, selected_gene_ratio = 0.5)

png(paste0(curr.plot.path, 'DE.GM.png'), height = 160, width = 80, units = 'cm', res = 500)
GM.plot(data = plot_data, metadata = c("Clusters", "Stage"), gene_modules = gms, gaps_col = "Clusters",
        show_rownames = T, custom_order = cluster.order, custom_order_column = "Clusters", fontsize = 25, fontsize_row = 10)
graphics.off()

Screen DE GMs for neural induction GRN genes

# Intersect DE GMs with network genes
filtered_gms <- lapply(gms, function(x) x[x %in% network_genes$gene])

# Remove empty list elements
filtered_gms <- filtered_gms[lapply(filtered_gms,length)>0]

png(paste0(curr.plot.path, 'network.GM.png'), height = 33, width = 55, units = 'cm', res = 1200)
GM.plot(data = plot_data, metadata = c("Clusters", "Stage"), gene_modules = filtered_gms, gaps_col = "Clusters",
        show_rownames = T, custom_order = cluster.order, custom_order_column = "Clusters", fontsize = 15, fontsize_row = 19)
graphics.off()

7) Subset neural clusters

Set plot path

curr.plot.path <- paste0(plot.path, "7_neural_subset/")
dir.create(curr.plot.path)

Subset neural clusters

# Clust 12, 3, 11, 8, 10 = not neural clusters of interest
neural_subset <- rownames(norm.data.clustfilt.cc@meta.data)[norm.data.clustfilt.cc@meta.data$seurat_clusters ==  12 |
                                                                    norm.data.clustfilt.cc@meta.data$seurat_clusters == 3 |
                                                                    norm.data.clustfilt.cc@meta.data$seurat_clusters == 11 |
                                                                    norm.data.clustfilt.cc@meta.data$seurat_clusters == 8|
                                                                    norm.data.clustfilt.cc@meta.data$seurat_clusters == 10]

neural_subset <- subset(norm.data.clustfilt.cc, cells = neural_subset, invert = T)

Re-run findvariablefeatures and scaling

# Enable parallelisation
plan("multiprocess", workers = ncores)
options(future.globals.maxSize = 2000 * 1024^2)

neural_subset <- ScaleData(neural_subset, features = rownames(neural_subset), vars.to.regress = c("percent.mt", "sex", "S.Score", "G2M.Score"))

PCA

neural_subset <- RunPCA(object = neural_subset, verbose = FALSE)

png(paste0(curr.plot.path, "dimHM.png"), width=30, height=50, units = 'cm', res = 200)
DimHeatmap(neural_subset, dims = 1:30, balanced = TRUE, cells = 500)
graphics.off()

png(paste0(curr.plot.path, "elbowplot.png"), width=24, height=20, units = 'cm', res = 200)
print(ElbowPlot(neural_subset, ndims = 40))
graphics.off()

png(paste0(curr.plot.path, "UMAP_PCA_comparison.png"), width=40, height=30, units = 'cm', res = 200)
PCA.level.comparison(neural_subset, PCA.levels = c(7, 10, 15, 20), cluster_res = 0.5)
graphics.off()
Dimensions heatmap ElbowPlot PCA level comparison

Use PCA=15 as elbow plot is relatively stable across stages

neural_subset <- FindNeighbors(neural_subset, dims = 1:15, verbose = FALSE)
neural_subset <- RunUMAP(neural_subset, dims = 1:15, verbose = FALSE)

Find optimal cluster resolution

png(paste0(curr.plot.path, "clustree.png"), width=70, height=35, units = 'cm', res = 200)
clust.res(seurat.obj = neural_subset, by = 0.2)
graphics.off()

Use clustering resolution = 1.2

neural_subset <- FindClusters(neural_subset, resolution = 1.2)

Plot UMAP for clusters and developmental stage

png(paste0(curr.plot.path, "UMAP.png"), width=40, height=20, units = 'cm', res = 200)
clust.stage.plot(neural_subset)
graphics.off()

Plot QC for each cluster

png(paste0(curr.plot.path, "cluster.QC.png"), width=40, height=14, units = 'cm', res = 200)
QC.plot(neural_subset)
graphics.off()
ClusTree UMAP Cluster QC

Find differentially expressed genes and plot heatmap of top DE genes for each cluster

markers <- FindAllMarkers(neural_subset, only.pos = T, logfc.threshold = 0.25)
# get automated cluster order based on percentage of cells in adjacent stages
cluster.order = order.cell.stage.clust(seurat_object = neural_subset, col.to.sort = seurat_clusters, sort.by = orig.ident)
# Re-order genes in top15 based on desired cluster order in subsequent plot - this orders them in the heatmap in the correct order
top15 <- markers %>% group_by(cluster) %>% top_n(n = 15, wt = avg_logFC) %>% arrange(factor(cluster, levels = cluster.order))

png(paste0(curr.plot.path, 'HM.top15.DE.png'), height = 75, width = 100, units = 'cm', res = 500)
tenx.pheatmap(data = neural_subset, metadata = c("seurat_clusters", "orig.ident"), custom_order_column = "seurat_clusters",
              custom_order = cluster.order, selected_genes = unique(top15$gene), gaps_col = "seurat_clusters")
graphics.off()


8) Pseudotime

Set plot path

curr.plot.path <- paste0(plot.path, "8_pseudotime/")
dir.create(curr.plot.path)

Extract PC1 values

pc1 <- neural_subset@meta.data[,'orig.ident', drop=F] %>%
  tibble::rownames_to_column(var = "cell_name") %>%
  mutate(pc1 = Embeddings(object = neural_subset[["pca"]])[, 1])

Plot cell stage along PC1 - x-axis inverted to match developmental stage

png(paste0(curr.plot_path, 'pc1.png'), height = 18, width = 26, units = 'cm', res = 400)
ggplot(pc1, aes(x = pc1, y = orig.ident, colour = orig.ident)) +
  geom_quasirandom(groupOnX = FALSE) +
  theme_classic() +
  scale_x_reverse() +
  xlab("PC1") + ylab("Timepoint") +
  ggtitle("Cells ordered by first principal component")
graphics.off()


Developmental time is negatively correlated with pc1 - inverse PC1 and calculate cell rank.

Rank is then converted to 0-99 pseudotime scale

pc1_rank <- pc1 %>%
  mutate(rank = rank(-pc1)) %>%
  mutate(pseudotime = rank*(99/max(rank)))

Add housekeeping genes as a control group

housekeeping <- data.frame(gene_name = c('GAPDH', 'SDHA', 'HPRT1', 'HBS1L', 'TUBB1', 'YWHAZ'), timepoint = 'Housekeeping')
plot_genes <- rbind(network_genes, housekeeping)

Filter scaled seurat counts by network_genes -> generate long df with corresponding pseudotime and normalised count

plot_data <- as.data.frame(t(as.matrix(GetAssayData(neural_subset, slot = 'scale.data')[rownames(neural_subset) %in% plot_genes$gene_name,]))) %>%
  tibble::rownames_to_column(var = "cell_name") %>%
  dplyr::full_join(pc1_rank) %>%
  pivot_longer(!c(cell_name, orig.ident, pseudotime, pc1, rank), names_to = "gene_name", values_to = "scaled_expression") %>%
  dplyr::left_join(plot_genes) %>%
  droplevels() %>%
  mutate(timepoint = factor(timepoint, levels = c(1,3,5,7,9,12,'Housekeeping')))

Mean and SE summary data

plot_data_summary <- plot_data %>%
  mutate(rank_bin = pseudotime - (pseudotime %% 2.5)) %>% 
  group_by(rank_bin, timepoint) %>% 
  summarise(mn = mean(scaled_expression),
            se = sd(scaled_expression)/sqrt(n()))

Plot GAM for all stages without standard error

png(paste0(curr.plot_path, 'gam_pseudotime_allnetwork.png'), height = 18, width = 26, units = 'cm', res = 400)
ggplot(plot_data, aes(x = pseudotime, y = scaled_expression, colour = timepoint)) +
  scale_color_manual(values=c("#6600ff", "#0000ff", "#009900", "#ffcc00", "#ff8533", "#ee0000", "#c0c0c0")) +
  geom_smooth(method="gam", se=FALSE) +
  xlab("Ranking of PC1") + ylab("Average expression (scaled)") +
  theme_classic()
graphics.off()

Plot GAM for all stages with standard error

png(paste0(curr.plot_path, 'gam_se_pseudotime_allnetwork.png'), height = 18, width = 26, units = 'cm', res = 400)
ggplot(plot_data, aes(x = pseudotime, y = scaled_expression, colour = timepoint)) +
  geom_errorbar(data = plot_data_summary, aes(x = rank_bin, y = mn, 
                                              ymax = mn + se, ymin = mn - se), width = 2) +
  geom_point(data = plot_data_summary, aes(x = rank_bin, y = mn)) +
  scale_color_manual(values=c("#6600ff", "#0000ff", "#009900", "#ffcc00", "#ff8533", "#ee0000", "#c0c0c0")) +
  geom_smooth(method="gam", se=FALSE) +
  xlab("Ranking of PC1") + ylab("Average expression (scaled)") +
  theme_classic()
graphics.off()
Pseudotime Pseudotime SE