This repository contains tools for analyzing embedding data with dimension reduction and multivariate analysis, to help reveal structure in high-dimensional data.
It currently contains a script that reads the data, formats the embeddings, and visualizes the results using functions from the scDiagnostics package.
To run the main R script, you will need some standard Bioconductor packages which you can install with the following command:
BiocManager::install(c("SingleCellExperiment", "scater"))
In addition, we will use the latest version of the scDiagnostics package, which you can install with the following command:
BiocManager::install("ccb-hms/scDiagnostics")
The main script begins by reading the file bge-small-en-v1.5_embedding.csv, which contains the embeddings of corpus nodes from several scientific articles in the references folder, as well as the embeddings of several questions generated from the text of each of these corpus nodes.
Next, it formats the embeddings for both the corpus and question data and creates SingleCellExperiment objects from the formatted corpus and question embeddings.
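As a minimal sketch of this step, the snippet below wraps a small simulated embedding matrix in a SingleCellExperiment object. The data, the object names (`emb`, `corpus_sce`), the `file` metadata column, and the choice of `logcounts` as the assay name are all assumptions made for illustration; the actual script and CSV layout may differ. Embedding dimensions are stored as rows and nodes as columns, following the features-by-cells convention of SingleCellExperiment.

```r
library(SingleCellExperiment)

set.seed(1)

# Simulated stand-in for formatted embeddings: 6 nodes x 4 dimensions
# (hypothetical sizes and names, not taken from the real CSV file).
emb <- matrix(
  rnorm(6 * 4),
  nrow = 6,
  dimnames = list(paste0("node", 1:6), paste0("dim", 1:4))
)

# Hypothetical per-node metadata: the manuscript each node came from.
meta <- DataFrame(
  file = rep(c("paper_A", "paper_B"), each = 3),
  row.names = rownames(emb)
)

# SingleCellExperiment expects features in rows and observations in columns,
# so the node-by-dimension matrix is transposed.
corpus_sce <- SingleCellExperiment(
  assays = list(logcounts = t(emb)),
  colData = meta
)

dim(corpus_sce)
```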
It then generates a multidimensional scaling (MDS) plot colored by file type.
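The idea behind an MDS plot can be sketched in base R with `cmdscale()`. This is not the plotting function the script uses; the simulated embeddings, the Euclidean distance, and the `file_type` labels are assumptions for illustration only.

```r
set.seed(42)

# Simulated embeddings: two groups of 20 points in 8 dimensions
# (hypothetical data standing in for the real embeddings).
emb <- rbind(
  matrix(rnorm(20 * 8, mean = 0), nrow = 20),
  matrix(rnorm(20 * 8, mean = 3), nrow = 20)
)
file_type <- factor(rep(c("type_1", "type_2"), each = 20))

# Classical MDS on pairwise Euclidean distances, keeping two coordinates.
mds <- cmdscale(dist(emb), k = 2)

# Scatter plot of the two MDS coordinates, colored by the grouping label.
plot(
  mds,
  col = as.integer(file_type),
  pch = 19,
  xlab = "MDS 1",
  ylab = "MDS 2"
)
legend("topright", legend = levels(file_type), col = seq_along(levels(file_type)), pch = 19)
```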
It then runs PCA on the corpus node embeddings. The question embeddings are projected onto the PCA space defined by the text of the corpus nodes.
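Projecting new points onto an existing PCA space amounts to centering them with the reference means and multiplying by the reference loadings. The sketch below shows this with base R's `prcomp()` and `predict()` on simulated matrices; the object names and data are hypothetical, and the script itself may rely on scDiagnostics functions instead.

```r
set.seed(1)

# Simulated embeddings (hypothetical): 50 corpus nodes and 5 questions, 10 dimensions.
corpus <- matrix(rnorm(50 * 10), nrow = 50)
questions <- matrix(rnorm(5 * 10), nrow = 5)
colnames(corpus) <- colnames(questions) <- paste0("dim", 1:10)

# Fit PCA on the corpus embeddings only.
pca <- prcomp(corpus, center = TRUE, scale. = FALSE)

# Project the questions onto the corpus-defined principal components.
question_scores <- predict(pca, newdata = questions)

# The same projection written out by hand: subtract the corpus column means,
# then multiply by the rotation (loading) matrix.
manual_scores <- sweep(questions, 2, pca$center) %*% pca$rotation

all.equal(unname(question_scores), unname(manual_scores))
```

Fitting on the corpus alone keeps the axes tied to the reference text, so the position of each question can be read relative to the corpus nodes.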
Finally, it fits a discriminant space model to the corpus node embeddings for each pairwise combination of manuscripts. The question embeddings are then projected onto the discriminant space defined by the corpus nodes of that pair of manuscripts.
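The pairwise structure of this step can be sketched with linear discriminant analysis from the MASS package. Note that `MASS::lda()` is used here as a stand-in and is not necessarily the discriminant model that scDiagnostics fits; the simulated data and the names `corpus`, `manuscript`, and `questions` are assumptions for illustration.

```r
library(MASS)

set.seed(7)
n <- 30  # corpus nodes per manuscript (hypothetical)
d <- 5   # embedding dimensions (hypothetical)

# Simulated corpus embeddings for three manuscripts with shifted means.
corpus <- rbind(
  matrix(rnorm(n * d, mean = 0), nrow = n),
  matrix(rnorm(n * d, mean = 1.5), nrow = n),
  matrix(rnorm(n * d, mean = -1.5), nrow = n)
)
colnames(corpus) <- paste0("dim", 1:d)
manuscript <- rep(c("A", "B", "C"), each = n)

# Simulated question embeddings to be projected.
questions <- matrix(rnorm(4 * d), nrow = 4, dimnames = list(NULL, colnames(corpus)))

# All unordered pairs of manuscripts.
pairs <- combn(unique(manuscript), 2, simplify = FALSE)

# For each pair, fit LDA on that pair's corpus nodes, then project both
# the corpus nodes and the questions onto the discriminant axis.
projections <- lapply(pairs, function(p) {
  keep <- manuscript %in% p
  fit <- lda(x = corpus[keep, ], grouping = factor(manuscript[keep]))
  list(
    pair = p,
    corpus = predict(fit, corpus[keep, ])$x,
    questions = predict(fit, questions)$x
  )
})
names(projections) <- vapply(pairs, paste, character(1), collapse = "_vs_")

names(projections)
```

With two groups per model there is a single discriminant axis, so each question receives one score per pair of manuscripts.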