From 1d8fd83087d235cc2722222ada8f59b3a990246c Mon Sep 17 00:00:00 2001
From: Nick Semenkovich
Date: Fri, 15 Dec 2023 10:54:37 -0600
Subject: [PATCH] + Design doc

---
 LICENSE   |  2 +-
 README.md | 19 ++++++++++++++++---
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/LICENSE b/LICENSE
index 68fb5fd..79a392a 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,6 +1,6 @@
 MIT License
 
-Copyright © 2023 Nick Semenkovich
+Copyright © 2023 Nick Semenkovich \
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
diff --git a/README.md b/README.md
index 143dd97..bc805ff 100644
--- a/README.md
+++ b/README.md
@@ -30,10 +30,10 @@ bam2tensor is a Python package for converting .bam files to dense representation
 ## Features
 - Parses .bam files using [pysam](https://github.com/pysam-developers/pysam)
 - Extracts methylation data from all CpG sites
-- Easily parallelizable
 - Supports any genome (Hg38, T2T-CHM13, mm10, etc.)
-- Stores methylation data as .npz NumPy arrays
 - Stores data in sparse format (COO matrix) for efficient loading
+- Exports methylation data to .npz NumPy arrays
+- Easily parallelizable
 
 ## Requirements
 
@@ -50,7 +50,18 @@ pip install bam2tensor
 
 ## Usage
 
-Please see the [Reference Guide] for details.
+Please see the [Reference Guide] for full details.
+
+## Data Structure
+
+One `.npz` file is generated for each separate `.bam`, which can be loaded using `scipy.sparse.load_npz()`. Each `.npz` file contains a single sparse SciPy [COO matrix].
+
+In the COO matrix, each row represents a read and each column represents a CpG site. The value at each row/column is the methylation state (`0` = unmethylated, `1` = methylated, `-1` = no data). Note that `-1` can represent indels or point mutations.
+
+## Todo
+- Consider storing a Read ID: Row ID mapping?
+- Export / more stably store & import embedding mapping? (.npz or other instead of .json?)
+- Store metadata / object reference in .npz file?
 
 ## Contributing
 
@@ -79,9 +90,11 @@ This project was generated from [Statistics Norway]'s [SSB PyPI Template].
 [ssb pypi template]: https://github.com/statisticsnorway/ssb-pypitemplate
 [file an issue]: https://github.com/mcwdsi/bam2tensor/issues
 [pip]: https://pip.pypa.io/
+[COO matrix]: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html
 [license]: https://github.com/mcwdsi/bam2tensor/blob/main/LICENSE
 [contributor guide]: https://github.com/mcwdsi/bam2tensor/blob/main/CONTRIBUTING.md
 [reference guide]: https://mcwdsi.github.io/bam2tensor/reference.html
+q
\ No newline at end of file
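
Reviewer note: the "Data Structure" section this patch adds to README.md can be exercised with a short sketch. It does not run bam2tensor. It builds a toy COO matrix using the encoding the README describes, writes it with `scipy.sparse.save_npz()`, and reads it back with `scipy.sparse.load_npz()`. The file name, matrix shape, and values are all hypothetical. Whether bam2tensor stores unmethylated calls (`0`) as explicit entries is also an assumption, so the sketch only inspects stored entries.

```python
import os
import tempfile

import numpy as np
import scipy.sparse

# Toy stand-in for one bam2tensor output (hypothetical data):
# 3 reads (rows) x 5 CpG sites (columns).
# Encoding per the README: 0 = unmethylated, 1 = methylated, -1 = no data.
rows = np.array([0, 0, 1, 2, 2])
cols = np.array([0, 1, 1, 3, 4])
vals = np.array([1, 0, -1, 1, 1], dtype=np.int8)
toy = scipy.sparse.coo_matrix((vals, (rows, cols)), shape=(3, 5))

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name; the README only says one .npz per .bam.
    path = os.path.join(tmp, "example.npz")
    scipy.sparse.save_npz(path, toy)
    loaded = scipy.sparse.load_npz(path)

n_reads, n_cpgs = loaded.shape

# A COO matrix exposes its stored entries as three parallel arrays:
# loaded.row (read index), loaded.col (CpG index), loaded.data (state).
methylated = int((loaded.data == 1).sum())
no_data = int((loaded.data == -1).sum())

# Number of stored entries per read (row).
entries_per_read = np.bincount(loaded.row, minlength=n_reads)

print(f"{n_reads} reads x {n_cpgs} CpG sites")
print(f"methylated entries: {methylated}, no-data entries: {no_data}")
print(f"stored entries per read: {entries_per_read.tolist()}")
```

Cells with no stored entry read back as `0` if the matrix is densified with `loaded.toarray()`, which would be indistinguishable from an unmethylated call. Working from `row`, `col`, and `data` avoids that ambiguity.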