Deploying to gh-pages from @ 78ce346 🚀

regulatory-genomics · Oct 1, 2024 · 7e49d0f · 7e49d0f
commit 7e49d0f
Show file tree

Hide file tree

Showing 167 changed files with 23,485 additions and 0 deletions.
diff --git a/.buildinfo b/.buildinfo
@@ -0,0 +1,4 @@
+# Sphinx build info version 1
+# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
+config: f2af386d6e99bb22829f120be7f0f3d7
+tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/.doctrees/_autosummary/precellar.SeqSpec.add_read.doctree b/.doctrees/_autosummary/precellar.SeqSpec.add_read.doctree
diff --git a/.doctrees/_autosummary/precellar.SeqSpec.doctree b/.doctrees/_autosummary/precellar.SeqSpec.doctree
diff --git a/.doctrees/_autosummary/precellar.SeqSpec.to_yaml.doctree b/.doctrees/_autosummary/precellar.SeqSpec.to_yaml.doctree
diff --git a/.doctrees/_autosummary/precellar.align.doctree b/.doctrees/_autosummary/precellar.align.doctree
diff --git a/.doctrees/_autosummary/precellar.make_fragment.doctree b/.doctrees/_autosummary/precellar.make_fragment.doctree
diff --git a/.doctrees/_autosummary/precellar.make_genome_index.doctree b/.doctrees/_autosummary/precellar.make_genome_index.doctree
diff --git a/.doctrees/_autosummary/precellar.utils.strip_barcode_from_fastq.doctree b/.doctrees/_autosummary/precellar.utils.strip_barcode_from_fastq.doctree
diff --git a/.doctrees/api.doctree b/.doctrees/api.doctree
diff --git a/.doctrees/environment.pickle b/.doctrees/environment.pickle
diff --git a/.doctrees/index.doctree b/.doctrees/index.doctree
diff --git a/.doctrees/nbsphinx/tutorials/txg_multiome.ipynb b/.doctrees/nbsphinx/tutorials/txg_multiome.ipynb
@@ -0,0 +1,275 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Preprocess 10X Genomics Gene Expression + ATAC Data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import precellar"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We first create a SeqSpec object from the 10X multiome seqspec template file."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "\u001b[90m[\u001b[0m2024-09-30T16:24:31Z \u001b[32mINFO \u001b[0m cached_path::cache\u001b[90m]\u001b[0m Cached version of https://raw.githubusercontent.com/regulatory-genomics/precellar/refs/heads/main/seqspec_templates/10x_rna_atac.yaml is up-to-date\n"
+     ]
+    }
+   ],
+   "source": [
+    "assay = precellar.SeqSpec(\"https://raw.githubusercontent.com/regulatory-genomics/precellar/refs/heads/main/seqspec_templates/10x_rna_atac.yaml\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The SeqSpec object can be visualized as a tree structure."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "\n",
+       "├── rna(153-1150)\n",
+       "│   ├── rna-illumina_p5(29)\n",
+       "│   ├── rna-truseq_read1(29)\n",
+       "│   ├── rna-cell_barcode(16)\n",
+       "│   ├── rna-umi(12)\n",
+       "│   ├── rna-cDNA(1-1000)\n",
+       "│   ├── rna-truseq_read2(34)\n",
+       "│   ├── rna-index7(8)\n",
+       "│   └── rna-illumina_p7(24)\n",
+       "└── atac(153-1150)\n",
+       "    ├── atac-illumina_p5(29)\n",
+       "    ├── atac-cell_barcode(16)\n",
+       "    ├── atac-linker(8)\n",
+       "    ├── atac-nextera_read1(33)\n",
+       "    ├── atac-gDNA(1-1000)\n",
+       "    ├── atac-nextera_read2(34)\n",
+       "    ├── atac-index7(8)\n",
+       "    └── atac-illumina_p7(24)"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "assay"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We then add ATAC data to the SeqSpec object. Specifically, we add R1, R2 and index fastq files. Note that the index read, ususally named \"*_R2_*\" in the 10X multiome data, can be sequenced in either direction, depending on the sequencing platform. Older platforms such as NovaSeq 6000 with v1.0 reagent kits, MiniSeq with Rapid Reagent kits, MiSeq, HiSeq 2500, and HiSeq 2000 use the forward strand workflow. iSeq 100, MiniSeq with Standard reagent kits, NextSeq Systems, NovaSeq 6000 with v1.5 reagent kits, HiSeq X, HiSeq 4000, and HiSeq 3000 use the reverse strand workflow.\n",
+    "\n",
+    "We need to pay special attention to the orientation of the index read. In this example, the index read is sequenced in the reverse direction."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Read 1\n",
+    "assay.add_read(\n",
+    "    read_id=\"atac-R1\",\n",
+    "    fastq=\"/data/kzhang/projects/workshop/data/PS_DNA01-1_S2_L001_R1_001.fq.zst\",\n",
+    "    modality=\"atac\",\n",
+    "    is_reverse=False,\n",
+    "    primer_id=\"atac-nextera_read1\",\n",
+    ")\n",
+    "\n",
+    "# Index Read\n",
+    "assay.add_read(\n",
+    "    read_id=\"atac-I1\",\n",
+    "    fastq=\"/data/kzhang/projects/workshop/data/PS_DNA01-1_S2_L001_R2_001.fq.zst\",\n",
+    "    modality=\"atac\",\n",
+    "    is_reverse=True,\n",
+    "    primer_id=\"atac-nextera_read1\",\n",
+    ")\n",
+    "\n",
+    "# Read 2\n",
+    "assay.add_read(\n",
+    "    read_id=\"atac-R2\",\n",
+    "    fastq=\"/data/kzhang/projects/workshop/data/PS_DNA01-1_S2_L001_R3_001.fq.zst\",\n",
+    "    modality=\"atac\",\n",
+    "    is_reverse=True,\n",
+    "    primer_id=\"atac-nextera_read2\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can visualize the SeqSpec object again after adding the ATAC data. Note that read information is added to the tree."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "\n",
+       "├── rna(153-1150)\n",
+       "│   ├── rna-illumina_p5(29)\n",
+       "│   ├── rna-truseq_read1(29)\n",
+       "│   ├── rna-cell_barcode(16)\n",
+       "│   ├── rna-umi(12)\n",
+       "│   ├── rna-cDNA(1-1000)\n",
+       "│   ├── rna-truseq_read2(34)\n",
+       "│   ├── rna-index7(8)\n",
+       "│   └── rna-illumina_p7(24)\n",
+       "└── atac(153-1150)\n",
+       "    ├── atac-illumina_p5(29)\n",
+       "    ├── atac-cell_barcode(16)\n",
+       "    ├── atac-linker(8)\n",
+       "    ├── atac-nextera_read1(33) [↓atac-R1(150), ↑atac-I1(24)]\n",
+       "    ├── atac-gDNA(1-1000)\n",
+       "    ├── atac-nextera_read2(34) [↑atac-R2(150)]\n",
+       "    ├── atac-index7(8)\n",
+       "    └── atac-illumina_p7(24)"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "assay"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can save the SeqSpec object to a yaml file and load it next time."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "with open(\"10x_rna_atac.yaml\", \"w\") as f:\n",
+    "    f.write(assay.to_yaml())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can now start reads mapping and create the fragment file. If you do not have the genome index, you can create it using the `make_genome_index` function."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "\u001b[90m[\u001b[0m2024-09-30T16:24:39Z \u001b[32mINFO \u001b[0m precellar::align\u001b[90m]\u001b[0m Counting barcodes...\n",
+      "\u001b[90m[\u001b[0m2024-09-30T16:24:39Z \u001b[32mINFO \u001b[0m cached_path::cache\u001b[90m]\u001b[0m Cached version of https://teichlab.github.io/scg_lib_structs/data/10X-Genomics/atac_737K-arc-v1.txt.gz is up-to-date\n",
+      "\u001b[90m[\u001b[0m2024-09-30T16:24:39Z \u001b[33mWARN \u001b[0m seqspec\u001b[90m]\u001b[0m Reads (atac-R1) may contain additional bases downstream of the variable-length region, e.g., adapter sequences.\n",
+      "\u001b[90m[\u001b[0m2024-09-30T16:24:46Z \u001b[32mINFO \u001b[0m precellar::align\u001b[90m]\u001b[0m Found 12387063 barcodes. 94.19% of them have an exact match in whitelist\n",
+      "\u001b[90m[\u001b[0m2024-09-30T16:24:46Z \u001b[32mINFO \u001b[0m precellar::align\u001b[90m]\u001b[0m Aligning reads...\n",
+      "\u001b[90m[\u001b[0m2024-09-30T16:24:46Z \u001b[33mWARN \u001b[0m seqspec\u001b[90m]\u001b[0m Reads (atac-R1) may contain additional bases downstream of the variable-length region, e.g., adapter sequences.\n",
+      "\u001b[90m[\u001b[0m2024-09-30T16:24:46Z \u001b[33mWARN \u001b[0m seqspec\u001b[90m]\u001b[0m Reads (atac-R2) may contain additional bases downstream of the variable-length region, e.g., adapter sequences.\n",
+      "100%|██████████| 12387063/12387063 [05:29<00:00, 37567.71it/s]"
+     ]
+    }
+   ],
+   "source": [
+    "qc = precellar.align(\n",
+    "    assay, \"/data/kzhang/genome/mm10\",\n",
+    "    modality=\"atac\",\n",
+    "    output_fragment=\"atac_fragments.tsv.zst\",\n",
+    "    num_threads=32,\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "QC metrics: "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'frac_q30_bases_barcode': 0.9219054690365263, 'frac_nonnuclear': 7.443249461151526e-05, 'frac_valid_barcode': 0.9815396111249293, 'frac_unmapped': 0.008819422106194463, 'frac_confidently_mapped': 0.9292046064511015, 'frac_fragment_in_nucleosome_free_region': 4.518862917979508e-05, 'frac_duplicates': 0.33183332695206896, 'sequenced_reads': 24774126.0, 'frac_q30_bases_read1': 0.9173089515515771, 'frac_q30_bases_read2': 0.9229901653577338, 'sequenced_read_pairs': 12387063.0, 'frac_fragment_flanking_single_nucleosome': 2.912761037492815e-05}\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(qc)"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.13"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/.doctrees/tutorials/index.doctree b/.doctrees/tutorials/index.doctree
diff --git a/.doctrees/tutorials/txg_multiome.doctree b/.doctrees/tutorials/txg_multiome.doctree
diff --git a/.nojekyll b/.nojekyll