ccb-hms/embeddingsAnalytics

embeddingsAnalytics

This repository contains toolkits for analyzing embeddings data through dimension reduction techniques and multivariate analysis, facilitating insights into high-dimensional data structures.

Introduction

This repository currently contains an R script for analyzing embeddings and performing multivariate analysis. The script reads the embeddings data, formats it, and visualizes the results using functions from the scDiagnostics package.

Required Packages

Installation

To run the main R script, you will need some standard Bioconductor packages, which you can install with the following command:

BiocManager::install(c("SingleCellExperiment", "scater"))

In addition, the script uses the latest version of the scDiagnostics package, which you can install from GitHub with the following command:

BiocManager::install("ccb-hms/scDiagnostics")
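Before running the script, it can help to confirm that the packages named above are actually available. The following is a minimal sketch that only checks for them and installs nothing; the variable names `required` and `installed` are illustrative.

```r
# Report which of the packages named above are installed.
# Nothing is installed or attached by this check.
required <- c("SingleCellExperiment", "scater", "scDiagnostics")
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
print(installed)
```

Any package reported as FALSE still needs to be installed with the commands above.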

Data Preprocessing

Read in Data

The main script begins by reading the file bge-small-en-v1.5_embedding.csv, which contains the embeddings of corpus nodes from several scientific articles in the references folder, as well as the embeddings of several questions generated from the text of each of these corpus nodes.
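The column layout of bge-small-en-v1.5_embedding.csv is not described here, so the sketch below works on a small toy file instead. It assumes a hypothetical schema with a `file` column, a `type` column that distinguishes corpus nodes from questions, and numeric columns named `dim_1`, `dim_2`, and so on. The real file may be organized differently.

```r
# Toy stand-in for bge-small-en-v1.5_embedding.csv. The column names
# (file, type, dim_*) are assumptions, not the real file's schema.
set.seed(1)
toy <- data.frame(
  file = rep(c("paper_a", "paper_b"), each = 4),
  type = rep(c("corpus", "corpus", "question", "question"), times = 2),
  matrix(rnorm(8 * 5), nrow = 8, dimnames = list(NULL, paste0("dim_", 1:5)))
)
csv_path <- file.path(tempdir(), "toy_embedding.csv")
write.csv(toy, csv_path, row.names = FALSE)

# Read the file and split the rows into corpus nodes and generated questions.
embeddings <- read.csv(csv_path, stringsAsFactors = FALSE)
corpus_df <- embeddings[embeddings$type == "corpus", ]
question_df <- embeddings[embeddings$type == "question", ]

dim(corpus_df)
dim(question_df)
```

To try this on the real data, point `csv_path` at the actual file and adjust the filtering to match its columns.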

Format Embeddings

Next, the script formats the embeddings for both the corpus and the question data, and creates SingleCellExperiment objects from the formatted corpus and question embeddings.
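One way this formatting step could look is sketched below. It treats embedding dimensions as rows and nodes as columns, which matches the features-by-cells layout that SingleCellExperiment expects. The toy data, the `dim_` column prefix, the `node_` names, and the choice of `logcounts` as the assay name are all assumptions made for illustration; the script itself may differ.

```r
# Toy embeddings: 6 corpus nodes x 4 dimensions (real embeddings are wider).
set.seed(2)
corpus_df <- data.frame(
  file = rep(c("paper_a", "paper_b"), each = 3),
  matrix(rnorm(6 * 4), nrow = 6, dimnames = list(NULL, paste0("dim_", 1:4)))
)

# Keep only the numeric embedding columns and transpose so that
# rows are embedding dimensions and columns are nodes.
dim_cols <- grep("^dim_", names(corpus_df), value = TRUE)
emb <- t(as.matrix(corpus_df[, dim_cols]))
colnames(emb) <- paste0("node_", seq_len(ncol(emb)))

# Build the SingleCellExperiment only if the package is installed.
if (requireNamespace("SingleCellExperiment", quietly = TRUE)) {
  corpus_sce <- SingleCellExperiment::SingleCellExperiment(
    assays = list(logcounts = emb),  # assay name is an assumption
    colData = S4Vectors::DataFrame(file = corpus_df$file,
                                   row.names = colnames(emb))
  )
  print(dim(corpus_sce))
}
```

The same steps would be repeated for the question embeddings to obtain a second object.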

Visualizations

MDS Scatter Plot

Generates a multidimensional scaling (MDS) scatter plot of the embeddings, with points colored by file type.
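As a rough illustration of the idea, the sketch below runs classical MDS with base R's `cmdscale()` on toy embeddings and colors the points by a label. It is not the plotting function the script uses; the Euclidean distance, the toy data, and the `file_type` labels are assumptions.

```r
# Toy embeddings: two groups of 10 points in 5 dimensions.
set.seed(3)
emb <- rbind(
  matrix(rnorm(10 * 5, mean = 0), ncol = 5),
  matrix(rnorm(10 * 5, mean = 2), ncol = 5)
)
file_type <- factor(rep(c("corpus", "question"), each = 10))  # hypothetical labels

# Classical MDS on Euclidean distances between embeddings.
mds <- cmdscale(dist(emb), k = 2)

# Scatter plot of the two MDS coordinates, colored by label.
palette_cols <- c("steelblue", "tomato")
plot(mds, col = palette_cols[file_type], pch = 19,
     xlab = "MDS 1", ylab = "MDS 2", main = "MDS of embeddings")
legend("topright", legend = levels(file_type), col = palette_cols, pch = 19)
```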

PCA Plot

Runs PCA on the embeddings of the corpus nodes. The question embeddings are then projected onto the PCA space defined by the text of the corpus nodes.
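The fit-then-project pattern described above can be sketched with base R's `prcomp()` and `predict()`. This is an illustration on toy data, not the scDiagnostics function the script calls; the matrices `corpus` and `questions` and the `dim_` column names are assumptions.

```r
# Toy data: 30 corpus nodes and 8 questions in a shared 6-dimensional space.
set.seed(4)
dims <- paste0("dim_", 1:6)
corpus <- matrix(rnorm(30 * 6), ncol = 6, dimnames = list(NULL, dims))
questions <- matrix(rnorm(8 * 6), ncol = 6, dimnames = list(NULL, dims))

# Fit PCA on the corpus embeddings only.
pca <- prcomp(corpus, center = TRUE, scale. = FALSE)

# Project the questions using the corpus centering and rotation.
question_scores <- predict(pca, newdata = questions)

# Corpus nodes as circles, projected questions as triangles.
plot(pca$x[, 1:2], col = "grey40", pch = 19, xlab = "PC1", ylab = "PC2")
points(question_scores[, 1:2], col = "tomato", pch = 17)
```

Because the questions play no part in fitting, the principal axes reflect only the variation in the corpus text, and the questions are simply placed within that space.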

Discriminant Space Plot

Fits a discriminant space model to the corpus node embeddings for each pairwise combination of manuscripts. The question embeddings are then projected onto the discriminant space defined by the corpus nodes of that pair of manuscripts.
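To make the pairwise structure concrete, the sketch below loops over every pair of manuscripts, fits a linear discriminant model on the corpus embeddings of that pair, and projects question embeddings onto the resulting axis. It uses `MASS::lda()` as a stand-in for the script's discriminant space model, which may be computed differently. The toy data, the manuscript names, and the choice to project all questions for every pair are assumptions.

```r
library(MASS)  # lda() stands in for the script's discriminant model

# Toy data: 3 manuscripts with 15 corpus nodes each, in 4 dimensions.
set.seed(5)
dims <- paste0("dim_", 1:4)
manuscripts <- rep(c("paper_a", "paper_b", "paper_c"), each = 15)
# Shift each manuscript's corpus embeddings so the groups are separable.
corpus <- matrix(rnorm(45 * 4), ncol = 4, dimnames = list(NULL, dims)) +
  rep(c(0, 2, 4), each = 15)
questions <- matrix(rnorm(6 * 4), ncol = 4, dimnames = list(NULL, dims)) + 2

# Every pairwise combination of manuscripts.
pairs <- combn(unique(manuscripts), 2, simplify = FALSE)

projections <- lapply(pairs, function(pair) {
  keep <- manuscripts %in% pair
  # Fit on the corpus nodes of this pair only.
  fit <- lda(corpus[keep, ], grouping = factor(manuscripts[keep]))
  list(
    corpus = predict(fit, corpus[keep, ])$x,   # corpus scores on LD1
    questions = predict(fit, questions)$x      # questions projected onto LD1
  )
})
names(projections) <- vapply(pairs, paste, character(1), collapse = "_vs_")
```

With two groups per model there is a single discriminant axis, so each projection is a one-column matrix of scores.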
