1M-cells

This repository contains the code used to generate the results and figures of the paper “Single-cell RNA-sequencing reveals widespread personalized, context-specific gene expression regulation in immune cells” (https://doi.org/10.1038/s41467-022-30893-5).

Data availability

Expression data is available in three flavours at https://eqtlgen.org/sc/datasets/1m-scbloodnl-dataset.html:

  • QC-ed and normalised (A)
  • QC-ed without normalisation (B)
  • pre-QC (C)

A. normalized and QC-ed data

To use the normalized and QC-ed data, the following files are required (for the v2 samples):

  • 10x_v2_barcodes.tsv.gz
  • 10x_v2_SCT_features.tsv.gz
  • 10x_v2_SCT_matrix.mtx.gz

If these three files are placed in one folder and renamed to barcodes.tsv.gz, features.tsv.gz and matrix.mtx.gz, they can be loaded into Seurat with the following command:

library(Seurat)
m1_processed_v2 <- Read10X('/dir/to/three/files/', gene.column = 1, cell.column = 1)

or in Scanpy using the original filenames:

import pandas as pd
import scanpy as sc

# load count data
m1_processed_v2 = sc.read_mtx('/dir/to/three/files/10x_v2_SCT_matrix.mtx.gz')
# read barcodes (tab-separated, no header row)
m1_bc_v2 = pd.read_csv('/dir/to/three/files/10x_v2_barcodes.tsv.gz', sep='\t', header=None)
# read features (tab-separated, no header row)
m1_features_v2 = pd.read_csv('/dir/to/three/files/10x_v2_SCT_features.tsv.gz', sep='\t', header=None)
# transpose to scanpy format (cells as rows, genes as columns)
m1_processed_v2 = m1_processed_v2.T
# add barcodes and genes to obs and var, using the first column of each file
m1_processed_v2.obs['cell_id'] = m1_bc_v2[0].tolist()
m1_processed_v2.var['gene_name'] = m1_features_v2[0].tolist()
# set indices for obs and var
m1_processed_v2.obs.index = m1_processed_v2.obs['cell_id']
m1_processed_v2.var.index = m1_processed_v2.var['gene_name']

The procedure is the same for the v3 samples.
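The Scanpy snippet above can also be approximated with scipy and pandas alone, which may be convenient when only the count matrix and its labels are needed. This is a minimal sketch, not the code used for the paper: the function name `load_10x_triplet` is hypothetical, and it assumes the matrix is stored as genes x cells (as the transpose step above implies) and that the first tab-separated column of each TSV file holds the identifier.

```python
import pandas as pd
import scipy.io
import scipy.sparse


def load_10x_triplet(matrix_path, barcodes_path, features_path):
    """Hypothetical helper: return (cells x genes CSR matrix, barcodes, genes)."""
    # read the Matrix Market file; scipy opens .gz paths directly
    counts = scipy.io.mmread(matrix_path)
    # assumption: stored as genes x cells, so transpose to cells x genes
    counts = scipy.sparse.csr_matrix(counts.T)
    # first column of each file is used as the label
    barcodes = pd.read_csv(barcodes_path, sep='\t', header=None)[0].astype(str).tolist()
    genes = pd.read_csv(features_path, sep='\t', header=None)[0].astype(str).tolist()
    # fail early if the labels do not match the matrix dimensions
    if counts.shape != (len(barcodes), len(genes)):
        raise ValueError(
            f'matrix shape {counts.shape} does not match '
            f'{len(barcodes)} barcodes and {len(genes)} features'
        )
    return counts, barcodes, genes
```

Calling `load_10x_triplet` with the paths to 10x_v2_SCT_matrix.mtx.gz, 10x_v2_barcodes.tsv.gz and 10x_v2_SCT_features.tsv.gz would return the normalized v2 data. The shape check is a design choice to catch a mismatched barcode or feature file before any downstream analysis.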

B. raw QC-ed data

To use the non-normalized counts, the following files are required (for the v2 samples):

  • 10x_v2_barcodes.tsv.gz
  • 10x_v2_RNA_features.tsv.gz
  • 10x_v2_RNA_matrix.mtx.gz

See section A for how to load these into Seurat or Scanpy, using the RNA feature and matrix files in place of the SCT ones. The procedure is the same for the v2 and v3 samples.

C. pre-QC data

To use the pre-QC non-normalized counts, the following files are required (these are not split by 10x chemistry):

  • unfiltered_barcodes.tsv.gz
  • unfiltered_features_raw.tsv.gz
  • unfiltered_matrix_raw.mtx.gz

See section A for how to load these into Seurat or Scanpy.
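Since all three flavours follow the same pattern, the renaming step needed before calling Seurat's Read10X can be scripted. The sketch below copies one file triple into a separate folder under the names barcodes.tsv.gz, features.tsv.gz and matrix.mtx.gz. The `FLAVOURS` table, its keys and the `stage_for_read10x` function are hypothetical names chosen for illustration; the source file names are the ones listed in sections A to C for the v2 samples and the pre-QC data.

```python
import shutil
from pathlib import Path

# source file names per flavour in the order (barcodes, features, matrix);
# the dictionary keys are illustrative labels, not names used by the dataset
FLAVOURS = {
    'v2_normalized': ('10x_v2_barcodes.tsv.gz',
                      '10x_v2_SCT_features.tsv.gz',
                      '10x_v2_SCT_matrix.mtx.gz'),
    'v2_raw': ('10x_v2_barcodes.tsv.gz',
               '10x_v2_RNA_features.tsv.gz',
               '10x_v2_RNA_matrix.mtx.gz'),
    'pre_qc': ('unfiltered_barcodes.tsv.gz',
               'unfiltered_features_raw.tsv.gz',
               'unfiltered_matrix_raw.mtx.gz'),
}

# file names expected in the folder passed to Read10X
TARGET_NAMES = ('barcodes.tsv.gz', 'features.tsv.gz', 'matrix.mtx.gz')


def stage_for_read10x(src_dir, dest_dir, flavour):
    """Copy one file triple into dest_dir under the standard 10x names."""
    src_dir, dest_dir = Path(src_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for source_name, target_name in zip(FLAVOURS[flavour], TARGET_NAMES):
        target = dest_dir / target_name
        # copy rather than rename so the downloaded files stay untouched
        shutil.copyfile(src_dir / source_name, target)
        staged.append(target)
    return staged
```

Using one destination folder per flavour avoids overwriting one triple with another, because every triple ends up with the same three file names.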

Test data

A Seurat object for testing is available at https://molgenis26.gcc.rug.nl/downloads/1m-scbloodnl/small-test-dataset/. It contains the v3 samples in the UT condition, as well as the SNP affecting RPS26 co-expression.

Processing overview

The code is organized by the steps taken to get from the raw data to the results. The languages and packages used are listed below:

  • R >= 3.6.1
  • Seurat >= 3.1
  • Python 3.7.4
  • numpy 1.19.5
  • pandas 1.2.1
  • scipy 1.6.0
  • statsmodels 0.12.2
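To see how a local Python environment compares with the package versions listed above, a small check along the following lines may help. It is a sketch only: `LISTED_VERSIONS` and `compare_versions` are hypothetical names, and an exact string comparison is used for simplicity, although other versions may work equally well.

```python
import importlib

# Python package versions as listed above
LISTED_VERSIONS = {
    'numpy': '1.19.5',
    'pandas': '1.2.1',
    'scipy': '1.6.0',
    'statsmodels': '0.12.2',
}


def compare_versions(listed=LISTED_VERSIONS):
    """Return {package: (listed, installed or None, versions_match)}."""
    report = {}
    for name, wanted in listed.items():
        try:
            installed = importlib.import_module(name).__version__
        except ImportError:
            # package is not installed in this environment
            installed = None
        report[name] = (wanted, installed, installed == wanted)
    return report


if __name__ == '__main__':
    for name, (wanted, installed, same) in compare_versions().items():
        status = 'match' if same else 'differs'
        print(f'{name}: listed {wanted}, installed {installed} ({status})')
```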

External tools used were:

If you want to rerun any of the analysis steps in R, consider using the Singularity image used for most of the analyses: https://github.com/royoelen/single-cell-container-server

Steps and their respective directories are the following:

License

The code in this repository is available under the 2-Clause BSD License: https://opensource.org/licenses/BSD-2-Clause

Hardware

Analyses were performed on a 2019 MacBook Pro (16GB), on the Gearshift cluster (http://docs.gcc.rug.nl/gearshift/cluster/) or, specifically for dataset normalization via SCTransform, on the Peregrine cluster (https://wiki.hpc.rug.nl/peregrine/start).