cbiohub

WARNING ⚠️: This package is still under construction.

cbiohub is a Python package that provides convenience functions for analyzing data files from cBioPortal. Several Python API clients already exist, but they work on slices of cBioPortal data retrieved via the REST API and do not make it easy to analyze all the data files in bulk. This package aims to provide a more user-friendly interface to cBioPortal data such as the files stored in the public datahub. Because the data is stored as parquet files rather than flat csv/tsv files, it can be analyzed much more quickly and efficiently.

Usage

Analyze Local files

Step 1: Obtain data files

For example, you can download the cBioPortal datahub files:

git clone git@github.com:cbioportal/datahub ~/git/datahub

Step 2: Ingest and Combine

Now ingest them, i.e. convert them into parquet files on your local machine:

cbiohub ingest ~/git/datahub/public/

By default, all data is stored in ~/cbiohub/. Combine the data from all studies into a single combined study:

cbiohub combine
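To illustrate what combining means conceptually, here is a minimal pandas sketch that stacks per-study tables into one table and records which study each row came from. It is not the actual implementation of `cbiohub combine`; the function name `combine_studies`, the `study_id` column, and the toy data are all hypothetical.

```python
import pandas as pd


def combine_studies(frames: dict) -> pd.DataFrame:
    """Stack per-study frames and tag each row with its study identifier."""
    # 'study_id' is a hypothetical column name chosen for this sketch.
    tagged = [frame.assign(study_id=study_id) for study_id, frame in frames.items()]
    return pd.concat(tagged, ignore_index=True)


# Toy per-study tables; every value is made up for illustration.
per_study = {
    "study_a": pd.DataFrame({"sample": ["sample_1", "sample_2"], "gene": ["BRAF", "TP53"]}),
    "study_b": pd.DataFrame({"sample": ["sample_3"], "gene": ["BRAF"]}),
}

combined = combine_studies(per_study)
print(combined)
```

Keeping the study identifier as a column is what makes cross-study questions (such as "in how many studies does this variant occur?") a single group-by on the combined table.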

Step 3: Analyze

Now you can use the cbiohub package to analyze the data quickly. For example, you can load the combined study data into a pandas DataFrame:

import cbiohub

df = cbiohub.get_combined_df()
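Once the data is in a DataFrame, ordinary pandas filtering applies. The sketch below uses a small synthetic frame in place of the result of `get_combined_df()` so that it runs without any local data. It assumes MAF-style columns `Hugo_Symbol`, `HGVSp_Short`, and `Tumor_Sample_Barcode`, plus a hypothetical `study_id` column; the columns actually returned by cbiohub may differ, so check `df.columns` first.

```python
import pandas as pd

# Synthetic stand-in for cbiohub.get_combined_df(); all values are made up
# and the column names are assumptions (MAF-style plus a hypothetical study_id).
df = pd.DataFrame({
    "study_id": ["study_a", "study_a", "study_b", "study_b"],
    "Tumor_Sample_Barcode": ["sample_1", "sample_2", "sample_3", "sample_4"],
    "Hugo_Symbol": ["BRAF", "BRAF", "BRAF", "TP53"],
    "HGVSp_Short": ["p.V600E", "p.V600E", "p.V600K", "p.R175H"],
})


def find_protein_change(frame: pd.DataFrame, gene: str, change: str) -> pd.DataFrame:
    """Return rows where `gene` carries the protein change, e.g. 'V600E'."""
    # MAF files usually write the short protein change with a 'p.' prefix.
    target = change if change.startswith("p.") else f"p.{change}"
    mask = (frame["Hugo_Symbol"] == gene) & (frame["HGVSp_Short"] == target)
    return frame[mask]


hits = find_protein_change(df, "BRAF", "V600E")
n_samples = hits["Tumor_Sample_Barcode"].nunique()
n_studies = hits["study_id"].nunique()
print(f"Variant found in {n_samples} samples across {n_studies} studies")
```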

Or you can use the cbiohub CLI to do quick analyses:

> cbiohub find BRAF V600E
✅ Variant found in 3595 samples across 117 studies:
kirp_tcga:TCGA-AL-3467-01
kirp_tcga:TCGA-UZ-A9PP-01
...

Or search for the same BRAF V600E variant, but restricted to a specific genomic change (A>T):

> cbiohub find 7 140453136 140453136 A T
✅ Variant found in 3571 samples across 117 studies:
kirp_tcga:TCGA-AL-3467-01
kirp_tcga:TCGA-UZ-A9PP-01
...
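The coordinate search returns slightly fewer samples (3571) than the protein-level search (3595), which is consistent with more than one nucleotide change producing the same protein change. The sketch below shows the same kind of lookup in pandas on a synthetic frame. It assumes the standard MAF columns `Chromosome`, `Start_Position`, `End_Position`, `Reference_Allele`, and `Tumor_Seq_Allele2`, plus a hypothetical `study_id` column; it is an illustration, not how the `cbiohub find` command is implemented.

```python
import pandas as pd

# Synthetic stand-in for the combined mutation data; all values are made up.
df = pd.DataFrame({
    "study_id": ["study_a", "study_a", "study_b"],
    "Tumor_Sample_Barcode": ["sample_1", "sample_2", "sample_3"],
    "Chromosome": ["7", "7", "7"],
    "Start_Position": [140453136, 140453136, 140453135],
    "End_Position": [140453136, 140453136, 140453136],
    "Reference_Allele": ["A", "A", "CA"],
    "Tumor_Seq_Allele2": ["T", "T", "TT"],
})


def find_genomic_change(frame, chrom, start, end, ref, alt):
    """Return rows that match an exact genomic change."""
    mask = (
        (frame["Chromosome"].astype(str) == str(chrom))  # tolerate int or str input
        & (frame["Start_Position"] == start)
        & (frame["End_Position"] == end)
        & (frame["Reference_Allele"] == ref)
        & (frame["Tumor_Seq_Allele2"] == alt)
    )
    return frame[mask]


hits = find_genomic_change(df, 7, 140453136, 140453136, "A", "T")
n_samples = hits["Tumor_Sample_Barcode"].nunique()
n_studies = hits["study_id"].nunique()
print(f"Variant found in {n_samples} samples across {n_studies} studies")
```

In this toy data the third row overlaps the same position but is a two-base substitution, so the exact A>T match skips it.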

Clean

Remove all local parquet files.

cbiohub clean

Development

To set up the development environment, install the development dependencies:

poetry install

You can run the CLI using e.g.:

poetry run cbiohub ingest ~/git/datahub/public/

and

poetry run cbiohub find BRAF V600E

You can also use IPython for interactive exploration:

poetry run ipython

TODO

Add a GitHub Action for datahub that uses cbiohub to push the combined parquet data to Hugging Face (https://huggingface.co/datasets/cBioPortal/datahub)
