Skip to content

Commit

Permalink
added placeholder pre-processing notebook
Browse files Browse the repository at this point in the history
  • Loading branch information
cornhundred committed Jan 7, 2025
1 parent 5249922 commit 3588b52
Show file tree
Hide file tree
Showing 3 changed files with 207 additions and 12 deletions.
5 changes: 3 additions & 2 deletions docs/examples/index.md
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
# Jupyter Notebook Examples

[Landscape View Xenium](short_notebooks/Landscape_View_Xenium.ipynb)
## Brief Notebooks
[Landscape View Xenium](brief_notebooks/Landscape_View_Xenium.ipynb)

[Mouse-Brain_Alpha-Shape-Neighborhood](short_notebooks/Mouse-Brain_Alpha-Shape-Neighborhood.ipynb)
[Mouse-Brain_Alpha-Shape-Neighborhood](brief_notebooks/Mouse-Brain_Alpha-Shape-Neighborhood.ipynb)
208 changes: 201 additions & 7 deletions docs/overview/file_formats.md
Original file line number Diff line number Diff line change
@@ -1,15 +1,15 @@
# Celldega File Formats
While there has been tremendous progress in developing standardized data formats and architectures for spatial-omics data, namely [SpatialData](https://spatialdata.scverse.org/en/stable/) and the related [AnnData](https://anndata.readthedocs.io/en/stable/), these approaches currently lack support for interactive cloud-based visualization of large (>100M transcripts) Spatial Transcriptomics (ST) data. Furthermore, all-in-one data format approaches preclude the development of more efficient visualization-specific data formats.
While there has been tremendous progress in developing standardized data formats and architectures for spatial-omics data, namely [SpatialData](https://spatialdata.scverse.org/en/stable/) and the related [AnnData](https://anndata.readthedocs.io/en/stable/), these approaches currently lack support for interactive cloud-based visualization of large (>100M transcripts) Spatial Transcriptomics (ST) data. Furthermore, all-in-one data format approaches preclude the development of compact visualization-specific data formats.

The Celldega project addresses these challenges with the development of a new ST data format called LandscapeFiles, specifically built for cloud-based visualization. LandscapeFiles support Celldega's Landscape visualization method by leveraging compact [image formats](../technologies/index.md/#webp) and [cloud-native data formats](../technologies/index.md/#apache-parquet) to enable efficient storage and visualization of image (e.g., microscopy images) and vectorized data (e.g., transcript coordinates). This approach is highly scalable, enabling the visualization of very large ST datasets (>400M transcripts), while remaining compact enough that LandscapeFiles for an entire Xenium dataset can be hosted in a [public GitHub repository](https://github.com/broadinstitute/celldega_Xenium_Prime_Human_Skin_FFPE_outs/).
The Celldega project addresses these challenges with the development of a new ST data format called LandscapeFiles, specifically built for cloud-based visualization. LandscapeFiles support Celldega's Landscape visualization method by leveraging compact [image formats](../technologies/index.md/#webp) and [cloud-native data formats](../technologies/index.md/#apache-parquet) to enable efficient storage and visualization of image (e.g., microscopy images) and vectorized data (e.g., transcript coordinates). This approach is highly scalable, enabling the visualization of very large ST datasets (>400M transcripts), while remaining compact enough that the LandscapeFiles for an entire Xenium dataset can be hosted in a [public GitHub repository](https://github.com/broadinstitute/celldega_Xenium_Prime_Human_Skin_FFPE_outs/).


## LandscapeFiles
LandscapeFiles are generated using the Celldega [pre](../python/pre/api.md) module (see example Google Colab notebook [Celldega-Landscape-Pre-Process_Xenium-Pancreas-Dataset](https://colab.research.google.com/drive/1guUFhXP3nlZ4Es2-tsnraFKlAKHZSCZC?usp=sharing)) and are used by Celldega's JavaScript front-end to interactively visualize ST data. Users have several options for hosting LandscapeFiles both locally on the cloud (e.g., Terra.bio buckets) or locally (e.g., running a local server to locally host LandscapeFiles).

### iST LandscapeFiles

The file structure for a Xenium Prime dataset's LandscapeFiles is shown below:
The file structure for a Xenium Prime dataset's LandscapeFiles is shown below.

```
.
Expand All @@ -30,8 +30,10 @@ The file structure for a Xenium Prime dataset's LandscapeFiles is shown below:
```

The LandscapeFiles for an an example public 10X Genomics Xenium dataset can be found [here](https://github.com/broadinstitute/celldega_Xenium_Prime_Human_Skin_FFPE_outs).

#### Cell-by-Gene
The `cbg` directory contains parquet files for each gene. Each file has a table of all the non-zero single cell expression counts - see example below
The `cbg` directory contains parquet files for each gene. Each file has a table of all the non-zero single cell expression counts. See example below:

```
A2ML1
Expand All @@ -47,7 +49,8 @@ aaacknep-1 1
The `cell_clusters` directory contains single-cell clustering data. For Xenium data, these will include the default clustering results stored in two parquet files.

##### cluster.parquet
This file contains the cluster identity of each cell.
This file contains the cluster identity of each cell. See example below:

```
cluster
aaaaljij-1 28
Expand All @@ -59,7 +62,7 @@ aaacknep-1 28
```

##### meta_cluster.parquet
This file contains metadata on the cell clusters, which includes the color and cell count.
This file contains metadata on the cell clusters, which includes the color and cell count. See example below:
```
color count
1 #1f77b4 12742
Expand All @@ -71,20 +74,211 @@ This file contains metadata on the cell clusters, which includes the color and c
```

#### Cell Metadata
The `cell_metadata.parquet` file contains the centroid positions of all cells. See example below:

```
cell_id name geometry
aaaaljij-1 aaaaljij-1 [819.7626194690856, 10819.416697734863]
aaabgfcl-1 aaabgfcl-1 [861.7377139772034, 10683.254024123535]
aaacghkb-1 aaacghkb-1 [876.8403955191346, 10627.146491566895]
aaachnfg-1 aaachnfg-1 [799.6315031020813, 10692.094786328125]
aaacknep-1 aaacknep-1 [760.0623424668274, 10729.360408533203]
```

#### Cell Segmentation
The `cell_segmentation` directory contains tiled parquet files that contain cell segmentation polygons for the cells within a given tile. See example below:

```
cell_id GEOMETRY name
mnnojdjm-1 [[[35052.998290142576, 2648.999973659546], [35... mnnojdjm-1
mnoafgkh-1 [[[35229.99735775, 2654.999800875], [35227.998... mnoafgkh-1
mnodjmcf-1 [[[35090.99920641015, 2657.9998580948486], [35... mnodjmcf-1
moelbbjj-1 [[[35233.997817008785, 2602.9998622198486], [3... moelbbjj-1
moemfhce-1 [[[35242.998275892576, 2539.9998095], [35239.9... moemfhce-1
```

#### Cell Cluster Gene Expression Signatures
The `df_sig.parquet` file contains the gene expression signatures of the cell clusters - defined as the average gene expression level of a cluster's cells. See example below:

```
1 2 3 4 5 6 7 8 9 10 ... 20 21 22 23 24 25 26 27 28 29
A2ML1 0.000235 0.000597 0.000109 0.000114 0.002320 0.000823 0.000996 0.000372 0.000760 0.000473 ... 0.018356 0.0000 0.000000 0.000000 0.000000 0.000000 2.593168 8.309091 5.602632 0.000000
AAMP 0.296029 0.298668 0.032494 0.052727 0.003222 0.515027 0.014938 0.070061 0.395857 0.470894 ... 0.122905 0.0960 0.429596 0.058206 0.128743 0.511224 0.580745 0.456566 0.136842 0.052632
AAR2 0.075655 0.069994 0.015375 0.023118 0.002964 0.118705 0.006971 0.038840 0.091410 0.117132 ... 0.054270 0.0424 0.129148 0.022901 0.030938 0.130612 0.154244 0.117172 0.057895 0.021053
AARSD1 0.074557 0.156194 0.013412 0.017880 0.001546 0.200357 0.005477 0.028805 0.121057 0.120208 ... 0.047087 0.0272 0.093274 0.019084 0.028942 0.223469 0.120083 0.024242 0.005263 0.010526
ABAT 0.004787 0.008053 0.009814 0.015830 0.000902 0.009743 0.004481 0.004832 0.003801 0.006626 ... 0.008779 0.0072 0.004484 0.017176 0.008982 0.002041 0.006211 0.002020 0.000000 0.005263
```

#### Landscape Parameters
This file contains the configuration information about the dataset. See example below:

```
{
"technology": "Xenium",
"max_pyramid_zoom": 16,
"tile_size": 250,
"image_info": [
{
"name": "dapi",
"button_name": "DAPI",
"color": [
0,
0,
255
]
},
{
"name": "bound",
"button_name": "BOUND",
"color": [
0,
255,
0
]
},
{
"name": "rna",
"button_name": "RNA",
"color": [
255,
0,
0
]
},
{
"name": "prot",
"button_name": "PROT",
"color": [
255,
255,
255
]
}
],
"image_format": ".webp"
}
```

#### Gene Metadata
The `gene_metadata.parquet` file contains gene level metadata including: average expression across all cells, standard deviation, max expression, proportion of cells with non-zero expression, and the color assigned to each gene. See example below:

```
mean std max non-zero color
A2ML1 0.078391 3.128721 46.0 0.000009 #1f77b4
AAMP 0.175449 1.621841 7.0 0.000009 #ff7f0e
AAR2 0.048494 0.780702 4.0 0.000009 #2ca02c
AARSD1 0.060533 0.897824 4.0 0.000009 #d62728
ABAT 0.006575 0.285613 3.0 0.000009 #9467bd
```

#### Pyramid Images
The `pyramid_images` directory contains iST images from all available channels saved [Deep Zoom pyramids](../technologies/index.md#deep-zoom) using the image file format [WebP](../technologies/index.md#webp). An example directory structure for a Xenium multi-modal dataset looks like:

```
.
├── bound.dzi
├── bound_files
│   ├── 0
│   ├── 1
│   ├── 10
│   ├── 11
│   ├── 12
│   ├── 13
│   ├── 14
│   ├── 15
│   ├── 16
│   ├── 2
│   ├── 3
│   ├── 4
│   ├── 5
│   ├── 6
│   ├── 7
│   ├── 8
│   ├── 9
│   └── vips-properties.xml
├── dapi.dzi
├── dapi_files
│   ├── 0
│   ├── 1
│   ├── 10
│   ├── 11
│   ├── 12
│   ├── 13
│   ├── 14
│   ├── 15
│   ├── 16
│   ├── 2
│   ├── 3
│   ├── 4
│   ├── 5
│   ├── 6
│   ├── 7
│   ├── 8
│   ├── 9
│   └── vips-properties.xml
├── prot.dzi
├── prot_files
│   ├── 0
│   ├── 1
│   ├── 10
│   ├── 11
│   ├── 12
│   ├── 13
│   ├── 14
│   ├── 15
│   ├── 16
│   ├── 2
│   ├── 3
│   ├── 4
│   ├── 5
│   ├── 6
│   ├── 7
│   ├── 8
│   ├── 9
│   └── vips-properties.xml
├── rna.dzi
└── rna_files
├── 0
├── 1
├── 10
├── 11
├── 12
├── 13
├── 14
├── 15
├── 16
├── 2
├── 3
├── 4
├── 5
├── 6
├── 7
├── 8
├── 9
└── vips-properties.xml
```

#### Transcript Tiles

#### Image Transformation
The `transcript_tiles` directory contains tiled parquet files that contain transcript data for transcripts within a given tile. See example below:

```
name geometry
20862147 AARSD1 [25663.38, 11758.09]
20862230 ABCA1 [25650.44, 11757.28]
20862780 ABCD1 [25634.19, 11754.12]
20862819 ABCD1 [25635.51, 11755.29]
20863051 ABHD6 [25506.54, 11753.75]
```

#### Image Transformation
The `xenium_transform.csv` file contains the 3x3 image transformation matrix to transition from physical coordinates into image coordinates.

### sST LandscapeFiles
6 changes: 3 additions & 3 deletions mkdocs.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -37,9 +37,9 @@ nav:
- Xenium Multi Dataset: gallery/gallery_xenium_multi.md
- Example Notebooks:
- examples/index.md
- Short Notebooks:

- Landscape View Xenium: examples/short_notebooks/Landscape_View_Xenium.ipynb
- Brief Notebooks:
- Landscape View Xenium: examples/brief_notebooks/Landscape_View_Xenium.ipynb
- Pre-Process LandscapeFiles: examples/brief_notebooks/Pre-process_Xenium_V1_human_Pancreas_FFPE_outs.ipynb
# - Tutorials:
# - examples/Landscape_View_Xenium.ipynb

Expand Down

0 comments on commit 3588b52

Please sign in to comment.