scGFT (single-cell Generative Fourier Transformer) is a generative model built upon the principles of the Fourier Transform. It employs a one-shot transformation paradigm to synthesize single-cell gene expression profiles that reflect the natural biological variability found in authentic datasets.
scGFT can be installed directly from GitHub with:
if (!require("devtools", quietly = TRUE))
  install.packages("devtools")
devtools::install_github("Sanofi-Public/PMCB-scGFT",
                         build_vignettes=FALSE)
The scGFT framework is designed to be compatible with Seurat R analysis pipelines. To install Seurat, run:
# Enter commands in R (or RStudio, if installed)
install.packages("Seurat")
install.packages("SeuratObject")
Visit Seurat for more details.
The scGFT package comprises only two functions: one to synthesize cells and a second to evaluate the synthesis quality.
# to synthesize cells
RunScGFT(object, nsynth, ncpmnts = 1, groups = NULL, cells = NULL)
RunScGFT requires, at a minimum, a Seurat object (object), the number of cells to synthesize (nsynth), and either a metadata variable indicating groups of cells (groups) or cells for cell-specific synthesis. cells specifies a list of barcodes of the cells to be used for cell-based synthesis. For each list element containing a single cell barcode, nsynth cells will be synthesized from that cell. If a list element is a vector of barcodes, nsynth cells will be synthesized for the specified group of barcodes.
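To make the role of ncpmnts more concrete, here is a minimal conceptual sketch of Fourier-based perturbation using only base R. It is not the scGFT implementation: the helper perturb_profile, its scale parameter, the toy Poisson profile, and the relative-deviation formula are all assumptions made for illustration. The sketch transforms one expression profile with a discrete Fourier transform, rescales a few randomly chosen complex components, and transforms back.

```r
# Hypothetical helper, not part of scGFT: rescale `ncpmnts` randomly chosen
# complex Fourier components of one expression profile and transform back.
perturb_profile <- function(x, ncpmnts = 1, scale = 0.1) {
  X <- fft(x)                                  # forward DFT (base R)
  idx <- sample(seq_along(X)[-1], ncpmnts)     # skip the DC term at index 1
  X[idx] <- X[idx] * (1 + rnorm(ncpmnts, sd = scale))  # perturb amplitudes
  Re(fft(X, inverse = TRUE)) / length(x)       # inverse DFT; keep the real part
}

set.seed(1)
original  <- log1p(rpois(200, lambda = 3))     # toy log-normalized profile
synthetic <- perturb_profile(original, ncpmnts = 10)

# One possible relative deviation (%), assumed here for illustration only.
rel_dev <- 100 * sqrt(sum((synthetic - original)^2)) / sqrt(sum(original^2))
```

Because the DC term is left untouched in this sketch, the synthetic profile keeps the same mean as the original while its finer structure varies.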
# to evaluate synthesized cells
statsScGFT(object, groups)
statsScGFT requires a Seurat object that includes synthesized cells (object) and a character variable from the original object metadata (groups). It calculates the likelihood that synthesized cells will have the same identity as their original counterparts. It also reports the relative deviation of synthesized gene expression profiles from original cells.
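As a rough illustration of the identity-matching idea, the sketch below computes the share of synthesized cells whose label equals that of the original cell they were derived from. The vectors original_label and synthesized_label are toy data invented for this example, and the calculation is an assumption about what a matching-based accuracy could look like, not the internals of statsScGFT.

```r
# Toy labels (invented for illustration): one entry per synthesized cell.
original_label    <- c("Basal", "Basal", "Ciliated", "Ionocyte", "PNEC")
synthesized_label <- c("Basal", "Basal", "Ciliated", "Ionocyte", "Basal")

# Count synthesized cells that kept the identity of their original counterpart.
matching <- sum(original_label == synthesized_label)
accuracy <- round(100 * matching / length(synthesized_label), 2)

cat("Synthesized cells:", length(synthesized_label), "\n")
cat("Matching cells:", matching, "\n")
cat("Accuracy (%):", accuracy, "\n")
```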
We provide the dataset PRJEB44878 (Wohnhaas 2021), which comprises 34,200 processed cells derived from primary small airway epithelial cells from healthy individuals and patients with chronic obstructive pulmonary disease. To download and load this dataset, run:
# Enter commands in R (or RStudio, if installed)
data_url <- "https://zenodo.org/records/12516896/files/scGFT_GitHub_PRJEB44878.rds"
data_path <- "~/scGFT_GitHub_PRJEB44878.rds" # correct destination path includes the filename
download.file(url=data_url, destfile=data_path, method="auto")
data_obj <- readRDS(data_path)
cnts <- data_obj$counts
mtd <- data_obj$metadata
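Before building a Seurat object, it can help to confirm that the count matrix and the metadata describe the same cells. The sketch below runs on toy stand-ins. It assumes, as CreateSeuratObject expects, that cells are the columns of the count matrix and that the metadata row names are cell barcodes. The helper check_alignment and the toy objects are hypothetical and not part of scGFT or Seurat.

```r
# Hypothetical helper: report dimensions and whether every cell has metadata.
check_alignment <- function(counts, metadata) {
  stopifnot(!is.null(colnames(counts)), !is.null(rownames(metadata)))
  missing_meta <- setdiff(colnames(counts), rownames(metadata))
  list(n_genes = nrow(counts),
       n_cells = ncol(counts),
       all_cells_annotated = length(missing_meta) == 0)
}

# Toy genes-by-cells count matrix and matching metadata (invented data).
toy_counts <- matrix(rpois(12, 2), nrow = 3,
                     dimnames = list(paste0("gene", 1:3), paste0("cell", 1:4)))
toy_meta <- data.frame(sample = c("S1", "S1", "S2", "S2"),
                       row.names = paste0("cell", 1:4))

res <- check_alignment(toy_counts, toy_meta)
```

With the downloaded data, the analogous call would be check_alignment(cnts, mtd), provided both objects carry barcodes as described above.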
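# Attach the packages used below; %>% is provided by magrittr (or dplyr).
library(Seurat)
library(harmony)
library(magrittr)
library(scGFT) # assumes the package installed above attaches under this name
set.seed(1234)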
sobj_synt <- CreateSeuratObject(counts=cnts,
meta.data=mtd) %>%
NormalizeData(., normalization.method="LogNormalize", scale.factor=1e6) %>%
FindVariableFeatures(., nfeatures=2000) %>%
ScaleData(.) %>%
RunPCA(., seed.use=42) %>%
RunHarmony(., group.by.vars="sample") %>% # sample-specific batch correction
FindNeighbors(., reduction="harmony", dims=1:30) %>%
FindClusters(., random.seed=42) %>%
# ================================
# synthesize 34,200 cells (1x the original count) by modifying 10 complex components.
RunScGFT(., nsynth=1*dim(.)[2], ncpmnts=10, groups="seurat_clusters") %>%
# The combined dataset of original and synthetic cells undergoes another round.
# Re-normalization is not needed as the new cells are synthesized from already normalized data.
# ================================
FindVariableFeatures(., nfeatures=2000) %>%
ScaleData(.) %>%
RunPCA(., seed.use=42) %>%
RunHarmony(., group.by.vars="sample") %>% # sample-specific batch correction
FindNeighbors(., reduction="harmony", dims=1:30) %>%
FindClusters(., random.seed=42) %>%
RunUMAP(., reduction="harmony", seed.use=42, dims=1:30)
RunScGFT console output:
Discrete fourier transform...
Inverse fourier transform...
synthesizing 34,200 cells...
4,902 cells synthesized...
9,712 cells synthesized...
14,389 cells synthesized...
17,116 cells synthesized...
19,708 cells synthesized...
21,988 cells synthesized...
24,052 cells synthesized...
25,812 cells synthesized...
27,264 cells synthesized...
28,563 cells synthesized...
29,474 cells synthesized...
30,334 cells synthesized...
31,123 cells synthesized...
31,898 cells synthesized...
32,570 cells synthesized...
33,226 cells synthesized...
33,600 cells synthesized...
33,882 cells synthesized...
34,057 cells synthesized...
34,150 cells synthesized...
34,200 cells synthesized...
Deviation from originals (%): 6.13 +/- 1.22
Synthesis completed in: 1.91 min
Integrating data (1/2)
[==================================================] 100% in 36s
Integrating data (2/2)
[==================================================] 100% in 2m
A Seurat object with 68,400 cells, including 34,200 synthesized.
statsScGFT(object=sobj_synt, groups="seurat_clusters")
statsScGFT console output:
Synthesized cells: 34,200
Matching cells: 32,157
Accuracy (%): 94.03
Utilizing UMAP for a qualitative evaluation, we project synthesized and real cells onto the embedded manifold:
Because generative models are stochastic and numerical results can vary across operating systems, your results may differ from those shown here.
In this showcase, we expand rare epithelial subtypes, including aberrant basaloid cells, PNECs, and ionocytes, each comprising less than 0.3% of the population. An individual cell from each cell type was randomly selected for synthesis:
set.seed(1234)
sobj_exp <- CreateSeuratObject(counts=cnts,
meta.data=mtd) %>%
NormalizeData(., normalization.method="LogNormalize", scale.factor=1e6) %>%
FindVariableFeatures(., nfeatures=2000) %>%
ScaleData(.) %>%
RunPCA(., seed.use=42) %>%
RunHarmony(., group.by.vars="sample") %>% # sample-specific batch correction
FindNeighbors(., reduction="harmony", dims=1:30) %>%
# ================================
# synthesize 1,000 cells, by modifying 10 complex components, for each of the given annotated rare epithelial subtypes
RunScGFT(., nsynth=1000, ncpmnts=10,
         cells=list("S2_ACGGAGAGTTCCCGAG-1",  # a pre-annotated "Ionocyte" cell
                    "S1_ATTACTCTCGTTGCCT-1",  # a pre-annotated "PNEC" cell
                    "S1_AAGCCGCGTGCCTGCA-1")  # a pre-annotated "Aberrant basaloid" cell
) %>%
# ================================
FindVariableFeatures(., nfeatures=2000) %>%
ScaleData(.) %>%
RunPCA(., seed.use=42) %>%
RunHarmony(., group.by.vars=c("sample")) %>% # sample-specific batch correction
FindNeighbors(., reduction="harmony", dims=1:30) %>%
FindClusters(., random.seed=42) %>%
RunUMAP(., reduction="harmony", seed.use=42, dims=1:30)
RunScGFT console output:
Discrete fourier transform...
Inverse fourier transform...
synthesizing 1,000 cells...
1,000 cells synthesized...
Deviation from originals (%): 8.67 +/- 2.65
Discrete fourier transform...
Inverse fourier transform...
synthesizing 1,000 cells...
1,000 cells synthesized...
Deviation from originals (%): 8.82 +/- 2.9
Discrete fourier transform...
Inverse fourier transform...
synthesizing 1,000 cells...
1,000 cells synthesized...
Deviation from originals (%): 8.39 +/- 2.78
Synthesis completed in: 0.02 min
Integrating data (1/2)
[==================================================] 100% in 1s
Integrating data (2/2)
[==================================================] 100% in 3s
A Seurat object with 37,200 cells, including 3,000 synthesized.
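The cells argument above holds one barcode per rare subtype, so three list elements with nsynth=1000 yield the 3,000 synthesized cells reported in the log. The sketch below shows one way such a list could be assembled from annotated metadata. The data frame toy_meta, its celltype column, and the barcodes are invented for illustration; they are not the columns or cells of the PRJEB44878 dataset.

```r
set.seed(1234)

# Toy metadata (invented): 1,000 cells, three rare subtypes with 2 cells each.
toy_meta <- data.frame(
  celltype = c(rep("Basal", 994), rep("Ionocyte", 2), rep("PNEC", 2),
               rep("Aberrant basaloid", 2)),
  row.names = paste0("S1_cell", 1:1000)
)

# Share of the population per subtype, in percent.
counts_by_type <- table(toy_meta$celltype)
pct <- 100 * as.numeric(counts_by_type) / nrow(toy_meta)
rare_types <- names(counts_by_type)[pct < 0.3]

# Draw one barcode at random from each rare subtype.
pick_one <- function(type) {
  barcodes <- rownames(toy_meta)[toy_meta$celltype == type]
  barcodes[sample.int(length(barcodes), 1)]
}
cells <- lapply(rare_types, pick_one)

# Each single-barcode list element produces nsynth synthesized cells.
nsynth <- 1000
expected_total <- nsynth * length(cells)
```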
Next, we evaluate the consistency of cell types in synthesized cells relative to the originals. The cells go through another round of cell type annotation using Sargent, an automated, cluster-free, score-based annotation method that classifies cell types based on distinct markers (a helper script can be found here). The annotations of the synthesized cells are then evaluated with:
statsScGFT(object=sobj_exp, groups="sargent_celltype")
statsScGFT console output:
Synthesized cells: 3,000
Matching cells: 2,998
Accuracy (%): 99.93
Utilizing UMAP for a qualitative evaluation, we project synthesized and real cells onto the embedded manifold:
Because generative models are stochastic and numerical results can vary across operating systems, your results may differ from those shown here.
For help and questions, please contact the scGFT maintenance team.