The package mixhvg works for highly variable gene selection, including popular public available methods, and also the mixture of multiple highly variable gene selection methods. The mixture of methods can combine the advantages captured by each single method.
This function FindVariableFeaturesMix inherits from FindVariableFeatures function of Seurat Package, which can be used the same as FindVariableFeatures. Also, it accepts the dense or sparse matrix input.
Find the tutorial website here https://mixhvg.github.io.
Sept. 15th, 2024: The package mixhvg
is updated to version 1.0.1, which can support the use in Seurat version 5.
Sept. 1st, 2024: The package mixhvg
is updated to version 0.2.1.
Aug. 26th, 2024: Our paper is on biorXiv as a preprint version: Zhao, R., Lu, J., Zhou, W., Zhao, N., & Ji, H. (2024). A systematic evaluation of highly variable gene selection methods for single-cell RNA-sequencing. bioRxiv, 2024-08.
Please download the mixhvg package from CRAN.
install.packages("mixhvg")
Please use GitHub repo to download the most updated package.
devtools::install_github("RuzhangZhao/mixhvg")
There are two inputs can be used in FindVariableFeaturesMix function.
The example data comes from 10x Genomics. You may use this link to download the processed data. The processed data is named as pbmc3k_rna.rds
.
pbmc<-readRDS("pbmc3k_rna.rds")
library(Seurat)
library(mixhvg)
object<-CreateSeuratObject(pbmc)
object<-FindVariableFeaturesMix(object,nfeatures=2000)
head(VariableFeatures(object))
# [1] "CD74" "LYZ" "IGLC3" "IGKC" "RPS29" "IGHA1"
One may use FindVariableFeaturesMix to replace the FindVariableFeatures function in the analysis pipeline of Seurat. For example,
object<-CreateSeuratObject(pbmc)
object<-FindVariableFeaturesMix(object)
object<-NormalizeData(object,verbose=FALSE)
object<-ScaleData(object,verbose = FALSE)
object<-RunPCA(object,npcs=30,verbose=FALSE)
object<-RunUMAP(object,dims=1:30,verbose = FALSE)
pbmc<-readRDS("pbmc3k_rna.rds")
library(mixhvg)
pbmc_hvg<-FindVariableFeaturesMix(pbmc,nfeatures=2000)
head(pbmc_hvg)
# [1] "CD74" "LYZ" "IGLC3" "IGKC" "RPS29" "IGHA1"
The method.names
can take one method or multiple methods for mixture.
pbmc_hvg<-FindVariableFeaturesMix(pbmc,method.names="seuratv3")
pbmc_hvg<-FindVariableFeaturesMix(pbmc,method.names="scran")
pbmc_hvg<-FindVariableFeaturesMix(pbmc,
method.names=c("scran","scran_pos","seuratv1"))
In the following example, we use a small cell sorting single-cell RNA-seq dataset with true label, downloaded from a large PBMC dataset. Download the example dataset using this link.
The data was provided by the work (Wang, J., Agarwal, D., Huang, M., Hu, G., Zhou, Z., Ye, C., & Zhang, N. R. (2019). Data denoising with transfer learning in single-cell transcriptomics. Nature methods, 16(9), 875-878.), which was initially downsampled from (Zheng, G. X., Terry, J. M., Belgrader, P., Ryvkin, P., Bent, Z. W., Wilson, R., ... & Bielas, J. H. (2017). Massively parallel digital transcriptional profiling of single cells. Nature communications, 8(1), 14049.).
Load the scRNA-seq data and its corresponding labels.
expr<-readRDS("pbmc_900.rds")
cell_label<-read.table("cell_labels.txt",header = T)
cell_label<-factor(cell_label$cell_type_labels,levels = c("regulatory_t","naive_t","cd34","memory_t","cd56_nk","cytotoxic_t","cd14_monocytes","b_cells","naive_cytotoxic"))
There are in total 9 cell type labels, with 100 cells for each cell type.
Here, we check the performance of two different HVG selection methods: Seurat v3 v.s. mixHVG. The evaluation is illustrated by the UMAP plots.
First, we show the standard pipeline for visualizing the dataset with default Seurat pipeline. Check Seurat tutorial for more details.
library(Seurat)
npcs=30
set.seed(1)
obj<-CreateSeuratObject(expr)
obj$cell_label<-cell_label
obj<-NormalizeData(obj,verbose=FALSE)
obj<-FindVariableFeatures(obj,verbose=FALSE)
obj<-ScaleData(obj,verbose=FALSE)
suppressWarnings(obj<-RunPCA(obj,npcs=npcs,verbose=FALSE))
obj<-RunUMAP(obj,dims = 1:npcs,verbose=FALSE)
pdf(paste0("umap_hvg_by_seuratv3.pdf"),height = 5.5,width = 7)
DimPlot(obj,reduction = "umap",group.by = "cell_label",shuffle = T)
dev.off()
Second, we can simply replace the line FindVariableFeatures
by FindVariableFeaturesMix
with package mixhvg loaded to apply the mixHVG method.
library(Seurat)
library(mixhvg)
npcs=30
set.seed(1)
pbmc<-CreateSeuratObject(expr)
pbmc$cell_label<-cell_label
pbmc<-NormalizeData(pbmc,verbose=FALSE)
pbmc<-FindVariableFeaturesMix(pbmc,verbose=FALSE)
pbmc<-ScaleData(pbmc,verbose=FALSE)
suppressWarnings(pbmc<-RunPCA(pbmc,npcs=npcs,verbose=FALSE))
pbmc<-RunUMAP(pbmc,dims = 1:npcs,verbose=FALSE)
pdf(paste0("umap_hvg_by_mixhvg.pdf"),height = 5.5,width = 7)
DimPlot(pbmc,reduction = "umap",group.by = "cell_label",shuffle = T)
dev.off()
The evaluation is by adjusted rand index (ARI). Refer to our paper for more details.
maxARI<-function(embedding,label){
label = as.numeric(as.factor(label))
snn_<- FindNeighbors(object = embedding,
nn.method = "rann",
verbose = F)$snn
res<-sapply( seq(0,2,0.1) ,function(cur){
cluster_label <- FindClusters(snn_,resolution = cur,
verbose = F)[[1]]
mclust::adjustedRandIndex(cluster_label,label)})
max(res)
}
ari_seuratv3<-maxARI(obj@reductions$umap@cell.embeddings,cell_label)
ari_mixhvg<-maxARI(pbmc@reductions$umap@cell.embeddings,cell_label)
ARI for UMAP with HVG selection by Seurat v3 is 0.6348
ARI for UMAP with HVG selection by mixHVG is 0.6767
The following methods can be chosen. And also, any mixture of the following methods is acceptable. For example, the default is c("scran","seuratv1","mv_PFlogPF","scran_pos")
- scran: Use mean-variance curve adjustment on lognormalized count matrix, which is scran ModelGeneVar.
- mv_ct: Use mean-variance curve adjustment on count matrix, inherited from scran ModelGeneVar.
- mv_nc: Use mean-variance curve adjustment on normalized count matrix, inherited from scran ModelGeneVar.
- mv_lognc: The same as scran.
- mv_PFlogPF: Use mean-variance curve adjustment on PFlog1pPF matrix, inherited from scran ModelGeneVar.
- scran_pos: Use scran poisson version, modelGeneVarByPoisson.
- seuratv3: Use logmean-logvariance curve adjustment on count matrix, which is vst, Seurat FindVariableFeatures Function(https://satijalab.org/seurat/reference/findvariablefeatures).
- logmv_ct: The same as seuratv3.
- logmv_nc: Use logmean-logvariance curve adjustment on normalized count matrix, inherited from seuratv3(vst).
- logmv_lognc: Use logmean-logvariance curve adjustment on lognormalized count matrix, inherited from seuratv3(vst).
- logmv_PFlogPF: Use logmean-logvariance curve adjustment on PFlog1pPF matrix, inherited from seuratv3(vst).
- seuratv1: Use dispersion on lognormalized count matrix, which is dispersion (disp), Seurat FindVariableFeatures Function(https://satijalab.org/seurat/reference/findvariablefeatures).
- disp_lognc: The same as seuratv1.
- disp_PFlogPF: Use dispersion on PFlog1pPF matrix, inherited from seuratv1(disp).
- mean_max_ct: Highly Expressed Features with respect to count matrix.
- mean_max_nc: Highly Expressed Features with respect to normalized count matrix.
- mean_max_lognc: Highly Expressed Features with respect to lognormalized count matrix