Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AnnData spatial analysis part 1 #1105

Open
1 of 3 tasks
camisowers opened this issue Jan 10, 2024 · 4 comments · May be fixed by #1112
Open
1 of 3 tasks

AnnData spatial analysis part 1 #1105

camisowers opened this issue Jan 10, 2024 · 4 comments · May be fixed by #1112
Assignees
Labels
design_doc Detailed implementation plan

Comments

@camisowers
Copy link
Contributor

camisowers commented Jan 10, 2024

This is for internal use only; if you'd like to open an issue or request a new feature, please open a bug or enhancement issue

Relevant background

Now that we have the AnnData conversion notebook, we can start implementing AnnData with some of the spatial analysis functions.

Design overview

We have two options:

  • Hard fork: Change all the functions to pass in AnnData directory and load the cell table information. The AnnData table conversion is a very quick process so forcing people to use it moving forwards seems reasonable.
  • Soft fork: Add an optional anndata_dir arg which will load in and save to the AnnData table if provided. When left as None, the functions will continue to use the cell_table.csv and process the same as before. This isn't too much double code because the Anndata reading and writing calls are very simple, but it'll add quite a few extra if statements.

Things which will be adapted for AnnData compatibility in this new PR:

  1. Distance matrices calculation
  2. Neighborhood matrices calculation (will save to both the existing AnnData table and as individual csv)
  3. Cell Distances analysis

Since all the spatial notebooks (except cell distances) only take the neighborhood mats as inputs, these changes will not break any other notebooks if we choose to do a hard fork.
** Note: This will break the spatial enrichment code we have in ark, but I'm pretty confident we can just scrap the old stuff and build on squidpy's nhood_enrichment function.

Code mockup
I can write up more specific pseudocode when we decide between a hard or soft fork, but for now here's which functions will need to be altered regardless.

1. Distance matrices calculation

Currently we load in the segmentations masks to get the labels and centroids of the cells by running regionprops(). This doesn't make much sense in our pipeline anymore since this is already done in the segmentation notebook and the centroids are stored in the cell table; I think moving forward we can skip loading in the segmentation masks and re-running regionprops (it's very time consuming using the NAS and also redundant).

Also I will probably do some efficiency testing to find a replacement for cdist() that creates a sparse matrix rather than computing ever pairwise distance between cells.

Adjusted functions:
calc_dist_matrix - adjust for AnnData

Soft fork example:

def calc_dist_matrix(cell_table, save_path, anndata_dir=None):
    """Generate matrix of distances between center of pairs of cells.

    Saves each one individually to `save_path`.

    Args:
        anndata_dir (str):
           path where the AnnData object is stored
    """

    if anndata_dir:
       fov_names = list_files(anndata_dir, substr=".zarr")

    else:
       # check that both label_dir and save_path exist
       io_utils.validate_paths([label_dir, save_path])

       fov_names = cell_table.fov.unique()

    # iterate for each fov
    with tqdm(total=len(fov_names), desc="Distance Matrix Generation", unit="FOVs") \
            as dist_mat_progress:
        for fov in fov_names:
 
            if anndata_dir:
               fov_adata = anndata.read_zarr(os.path.join(anndata_dir, f"{fov}.zarr"))
               centroids = fov_adata.obsm["spatial"]
            else:
                fov_data = cell_table[cell_table["fov"] == fov]
                fov_data = [fov_data.centroid-0, fov_data.centroid-1]

            # generate the distance matrix, then assign centroid_labels as coords
            dist_matrix = cdist(centroids, centroids).astype(np.float32)
            dist_mat_xarr = xr.DataArray(dist_matrix, coords=[centroid_labels, centroid_labels])

            # save distances to AnnData table
            if anndata_dir:
               fov_adata.obsp["distance"] = dist_mat_xarr
               fov_adata.write_zarr(os.path.join(anndata_dir, f"{fov}.zarr"), chunks=(1000, 1000))

            # save the distance matrix to save_path
            dist_mat_xarr.to_netcdf(
                os.path.join(save_path, fov_name + '_dist_mat.xr'),
                format='NETCDF3_64BIT'
            )

            dist_mat_progress.update(1)

Hard fork example:

def calc_dist_matrix(anndata_dir):
    """Generate matrix of distances between center of pairs of cells.

    Args:
        anndata_dir (str):
           path where the AnnData object is stored
    """

    # load fov names
    fov_names = list_files(anndata_dir, substr=".zarr")
  

    # iterate for each fov
    with tqdm(total=len(fov_names), desc="Distance Matrix Generation", unit="FOVs") \
            as dist_mat_progress:
        for fov in fov_names:

           fov_adata = anndata.read_zarr(os.path.join(anndata_dir, f"{fov}.zarr"))
           centroids = fov_adata.obsm["spatial"]

           # generate the distance matrix, then assign centroid_labels as coords
           dist_matrix = cdist(centroids, centroids).astype(np.float32)
           dist_mat_xarr = xr.DataArray(dist_matrix, coords=[centroid_labels, centroid_labels])

           # save distances to AnnData table
           fov_adata.obsp["distance"] = dist_mat_xarr
           fov_adata.write_zarr(os.path.join(anndata_dir, f"{fov}.zarr"), chunks=(1000, 1000))

           dist_mat_progress.update(1)

2. Neighborhood matrices calculation
Adjusted functions:
create_neighborhood_matrix

3. Cell Distances analysis
Adjusted functions:
generate_cell_distance_analysis
possibly calculate_mean_distance_to_all_cell_types andcalculate_mean_distance_to_cell_type as well

Required inputs

Cell table / AnnData table.

Output files

Since no one seems to really open or use the saved distance matrix xarrays anyways, I think we can just append them to the AnnData table and save some storage space.

The neighborhood mats and cell distance results will be saved to both the AnnData table and as individual csvs.

Timeline
Give a rough estimate for how long you think the project will take. In general, it's better to be too conservative rather than too optimistic.

  • A couple days
  • A week
  • Multiple weeks. For large projects, make sure to agree on a plan that isn't just a single monster PR at the end.

Estimated date when a fully implemented version will be ready for review:
1/19

Estimated date when the finalized project will be merged in:
1/23

  • delayed with distance matrix calculation benchmarking, should be in by 1/27
@camisowers camisowers added the design_doc Detailed implementation plan label Jan 10, 2024
@camisowers camisowers self-assigned this Jan 10, 2024
@camisowers
Copy link
Contributor Author

camisowers commented Jan 10, 2024

@srivarra @alex-l-kong @ngreenwald @jranek Lmk your thoughts. I'm leaning towards a hard switch since we just released v0.7.0 of ark and anyone super committed to staying away from AnnData can always use that version.

@ngreenwald
Copy link
Member

I agree, I think it makes sense to just switch. Everything else looks good.

@srivarra
Copy link
Contributor

Looks good to me!

@alex-l-kong
Copy link
Contributor

Looks good

@camisowers camisowers linked a pull request Jan 25, 2024 that will close this issue
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
design_doc Detailed implementation plan
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants