maleBC_single_cells

This repository contains the code for processing male breast cancer scRNA-seq and scATAC-seq data and for generating the figures.

The workflow has three stages: (1) scRNA-seq data processing from 10x Genomics scRNA-seq FASTQ files, (2) scATAC-seq data processing from 10x Genomics scATAC-seq FASTQ files, and (3) figure generation.

Installing

git clone https://github.com/hyunsoo77/maleBC_single_cells.git

scRNA-seq data processing

Step 1: Align the reads in the scRNA-seq FASTQ files to the GRCh38 reference transcriptome with 10x Genomics cellranger count. This yields one filtered_feature_bc_matrix.h5 file per sample (two files for the two samples here).

Step 2: Create the following directory structure by copying or symlinking the files.
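As a rough sketch of Step 1, the loop below calls cellranger count once per sample. REF_DIR, FASTQ_ROOT, DRY_RUN, and the run helper are hypothetical names introduced for illustration, and using the sample name for both --id and --sample is an assumption about how the FASTQ files are named. Newer Cell Ranger releases may require additional options (for example --create-bam), so check cellranger count --help for your version.

```shell
#!/bin/sh
# Sketch only: REF_DIR and FASTQ_ROOT are placeholder paths, not paths from this repository.
REF_DIR="${REF_DIR:-/path/to/GRCh38-transcriptome-reference}"
FASTQ_ROOT="${FASTQ_ROOT:-/path/to/scrna-fastq}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the command instead of running it

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

for sample in Patient1 Patient2; do
  # cellranger writes its results to ./<id>/outs/
  run cellranger count \
    --id="$sample" \
    --transcriptome="$REF_DIR" \
    --fastqs="$FASTQ_ROOT/$sample" \
    --sample="$sample"
done
```

Because cellranger count writes to ./<id>/outs/, running it from inside ../count_male-bc with DRY_RUN=0 should produce the layout described in Step 2 directly.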

../count_male-bc
├── Patient1
│   └── outs
│       └── filtered_feature_bc_matrix.h5
└── Patient2
    └── outs
        └── filtered_feature_bc_matrix.h5
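If Cell Ranger was run elsewhere, the layout above can be built with symbolic links. The function below is a minimal sketch; link_rna_counts and its arguments are hypothetical names, and it assumes each sample's Cell Ranger output sits under <source root>/<sample>/outs/.

```shell
#!/bin/sh
# Sketch: symlink filtered_feature_bc_matrix.h5 into <dst_root>/<sample>/outs/.
# Usage: link_rna_counts <source root> <destination root> <sample>...
link_rna_counts() {
  src_root="$1"
  dst_root="$2"
  shift 2
  for sample in "$@"; do
    src="$src_root/$sample/outs/filtered_feature_bc_matrix.h5"
    if [ ! -f "$src" ]; then
      echo "missing: $src" >&2
      return 1
    fi
    mkdir -p "$dst_root/$sample/outs"
    # Use an absolute target so the link resolves from any working directory.
    abs_dir="$(cd "$(dirname "$src")" && pwd)"
    ln -sf "$abs_dir/filtered_feature_bc_matrix.h5" "$dst_root/$sample/outs/"
  done
}
```

For example, link_rna_counts /data/cellranger_runs ../count_male-bc Patient1 Patient2, where /data/cellranger_runs is a placeholder path.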

Step 3: Make a Seurat object for each sample with the following command:

./make_sc-rna-seq_seurat_obj.R --dir_count ../count_male-bc --dir_output ./output_male-bc --dir_seurat_obj ./output_male-bc/rds_male-bc --type_qc arguments --min_ncount_rna 5000 --min_nfeature_rna 2000 --th_percent.mt 25 --max_dimstouse 30 --seurat_resolution 0.8 --method_to_update_cell_types epithelial_cell_types --method_to_identify_subtypes none --type_infercnv_argset vignettes --method_to_determine_th_cna_value_corr fixed --th_cna_value 0.05 --th_cna_corr 0.35 male-bc Patient1

The example above processes Patient1 only. To make the Seurat object for Patient2, change the last argument. The contents of the output directory ./output_male-bc are as follows:

output_male-bc/
├── infercnv
│   ├── male-bc_Patient1_cnv_postdoublet
│   └── male-bc_Patient2_cnv_postdoublet
├── output
│   └── log
├── rds_male-bc
│   ├── male-bc_Patient1_sc-rna-seq_sample_seurat_obj.rds
│   ├── male-bc_Patient2_sc-rna-seq_sample_seurat_obj.rds
│   └── wilcox_degs
├── tsv
│   ├── infercnv_input_barcode_group_male-bc_Patient1.tsv
│   └── infercnv_input_barcode_group_male-bc_Patient2.tsv
└── xlsx
    ├── male-bc_Patient1_sc-rna-seq_pipeline_summary.xlsx
    └── male-bc_Patient2_sc-rna-seq_pipeline_summary.xlsx
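To process both patients in one go, the Step 3 command can be wrapped in a loop. This is only a convenience sketch: the flags are copied from the command above, while make_sample_obj and DRY_RUN are hypothetical names. With DRY_RUN=1 (the default here) the commands are printed rather than executed.

```shell
#!/bin/sh
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the command instead of running it

# Build (and optionally run) the Step 3 command for one sample.
make_sample_obj() {
  sample="$1"
  set -- ./make_sc-rna-seq_seurat_obj.R \
    --dir_count ../count_male-bc \
    --dir_output ./output_male-bc \
    --dir_seurat_obj ./output_male-bc/rds_male-bc \
    --type_qc arguments \
    --min_ncount_rna 5000 --min_nfeature_rna 2000 --th_percent.mt 25 \
    --max_dimstouse 30 --seurat_resolution 0.8 \
    --method_to_update_cell_types epithelial_cell_types \
    --method_to_identify_subtypes none \
    --type_infercnv_argset vignettes \
    --method_to_determine_th_cna_value_corr fixed \
    --th_cna_value 0.05 --th_cna_corr 0.35 \
    male-bc "$sample"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

for sample in Patient1 Patient2; do
  make_sample_obj "$sample" || exit 1
done
```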

Step 4: Merge the per-sample Seurat objects into a single merged Seurat object with the following command:

./make_sc-rna-seq_merged_seurat_obj.R --dir_output ./output_male-bc --dir_seurat_obj ./output_male-bc/rds_male-bc --type_parsing_rds_filename unc-male-bc --method_integration none --max_dimstouse 30 --seurat_resolution 0.2 --harmony_theta 0 male-bc

The output file is written to ./output_male-bc/rds_male-bc, the directory given by --dir_seurat_obj.

output_male-bc/
│   ...
├── rds_male-bc
│   ├── male-bc_Patient1_sc-rna-seq_sample_seurat_obj.rds
│   ├── male-bc_Patient2_sc-rna-seq_sample_seurat_obj.rds
│   ├── male-bc_sc-rna-seq_merged_seurat_obj.rds
│   └── wilcox_degs
...
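A quick way to confirm that the per-sample and merged objects were written is to test for the expected files. The helper below is a small sketch (check_outputs is a hypothetical name); the file names are the ones shown in the listing above.

```shell
#!/bin/sh
# Report which of the expected files exist under a directory.
# Returns non-zero if any file is missing.
check_outputs() {
  dir="$1"
  shift
  rc=0
  for f in "$@"; do
    if [ -e "$dir/$f" ]; then
      echo "ok       $f"
    else
      echo "MISSING  $f"
      rc=1
    fi
  done
  return "$rc"
}

check_outputs ./output_male-bc/rds_male-bc \
  male-bc_Patient1_sc-rna-seq_sample_seurat_obj.rds \
  male-bc_Patient2_sc-rna-seq_sample_seurat_obj.rds \
  male-bc_sc-rna-seq_merged_seurat_obj.rds \
  || echo "some expected outputs are missing" >&2
```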

scATAC-seq data processing

Step 1: Align the reads in the scATAC-seq FASTQ files to the GRCh38 reference genome with 10x Genomics cellranger-atac count. This yields one fragments.tsv.gz file per sample (two files for the two samples here).

Step 2: Create the following directory structure by copying or symlinking the files.
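The scATAC-seq alignment follows the same pattern as the scRNA-seq one, except that cellranger-atac count takes the reference through --reference. The sketch below only prints the commands unless DRY_RUN=0; ATAC_REF_DIR, ATAC_FASTQ_ROOT, DRY_RUN, and atac_count are hypothetical names, and reusing the sample name for --id and --sample is an assumption.

```shell
#!/bin/sh
# Sketch only: placeholder paths, not paths from this repository.
ATAC_REF_DIR="${ATAC_REF_DIR:-/path/to/GRCh38-atac-reference}"
ATAC_FASTQ_ROOT="${ATAC_FASTQ_ROOT:-/path/to/scatac-fastq}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the command instead of running it

# Build (and optionally run) cellranger-atac count for one sample.
atac_count() {
  sample="$1"
  set -- cellranger-atac count \
    --id="$sample" \
    --reference="$ATAC_REF_DIR" \
    --fastqs="$ATAC_FASTQ_ROOT/$sample" \
    --sample="$sample"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

for sample in Patient1 Patient2; do
  atac_count "$sample" || exit 1
done
```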

../count_male-bc
├── Patient1
│   └── outs
│       ├── fragments.tsv.gz
│       └── fragments.tsv.gz.tbi
└── Patient2
    └── outs
        ├── fragments.tsv.gz
        └── fragments.tsv.gz.tbi
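Both the fragments file and its index need to be present for each sample. The sketch below links the pair into place and fails early if either is missing; link_atac_fragments is a hypothetical name, and it assumes the cellranger-atac output sits under <source root>/<sample>/outs/.

```shell
#!/bin/sh
# Sketch: symlink fragments.tsv.gz and its .tbi index into <dst_root>/<sample>/outs/.
# Usage: link_atac_fragments <source root> <destination root> <sample>...
link_atac_fragments() {
  src_root="$1"
  dst_root="$2"
  shift 2
  for sample in "$@"; do
    # Resolve to an absolute path so the links work from any directory.
    src_dir="$(cd "$src_root/$sample/outs" 2>/dev/null && pwd)" || {
      echo "missing outs directory for $sample" >&2
      return 1
    }
    mkdir -p "$dst_root/$sample/outs"
    for f in fragments.tsv.gz fragments.tsv.gz.tbi; do
      if [ ! -f "$src_dir/$f" ]; then
        echo "missing: $src_dir/$f" >&2
        return 1
      fi
      ln -sf "$src_dir/$f" "$dst_root/$sample/outs/$f"
    done
  done
}
```

If only the .tbi index is missing, tabix -p bed fragments.tsv.gz can usually regenerate it, provided the fragments file is bgzip-compressed and coordinate-sorted.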

Step 3: Make an ArchRProject object for the two samples with the following command:

./make_sc-atac-seq_archr_obj.R --n_cores 60 --dir_count ../count_male-bc --dir_output ./output_male-bc --dir_seurat_obj ./dir_seurat_obj_male-bc --keep_most_common_cell_type_for_each_cluster --max_dimstouse 30 --seurat_resolution 0.2 --min_tss 0 --min_frags 1000 --colorlim 1.5 --umap_n_neighbors 30 --umap_min_dist 0.3 --umap_metric cosine --umap_metric_doubletscores cosine --harmony_theta 0 --harmony_lambda 20 male-bc

This processes all samples under ../count_male-bc. The contents of the output directory ./output_male-bc are as follows:

output_male-bc
├── archr_output
│   ├── Annotations
│   ├── ArrowFiles
│   ├── Embeddings
│   ├── LSI_ATAC
│   ├── Peak2GeneLinks
│   ├── PeakCalls
│   │   ├── InsertionBeds
│   │   └── ReplicateCalls
│   ├── Plots
│   └── RNAIntegration
│       └── GeneIntegrationMatrix_ArchR
├── log
├── pdf
├── png
│   ├── male-bc_Patient1_qc.png
│   ├── male-bc_Patient2_qc.png
│   └── ...
├── qc
│   ├── Patient1
│   │   ├── Patient1-Doublet-Summary.pdf
│   │   ├── Patient1-Fragment_Size_Distribution.pdf
│   │   └── Patient1-TSS_by_Unique_Frags.pdf
│   └── Patient2
│       ├── Patient2-Doublet-Summary.pdf
│       ├── Patient2-Fragment_Size_Distribution.pdf
│       └── Patient2-TSS_by_Unique_Frags.pdf
├── rds
│   └── male-bc_archrproj_obj_final.rds
├── tmp
└── xlsx
    └── male-bc_sc-atac-seq_pipeline_summary.xlsx

Before continuing, review the QC plots for each sample (male-bc_Patient1_qc.png and male-bc_Patient2_qc.png) under output_male-bc/png.

Step 4: Add peak-to-gene links to the final ArchRProject object and perform motif enrichment analysis with the following command:

./analyze_cancer_specific_p2g.R --n_cores 4 --dir_output ./output_p2g_male-bc --dir_seurat_obj ./dir_seurat_obj_male-bc --dir_archr_output ./output_male-bc/archr_output --subset_archrproject_force_to_update --max_dimstouse 30 --harmony_theta 0 --seed_kmeans 4 --exclude_cluster_epi_unassigned --exclude_cluster_highly_overlapped_with_normal_peaks --exclude_cluster_low_nfrags male-bc

This subsets the final ArchRProject to remove clusters of epithelial cells left unassigned by InferCNV and clusters with a low mean nFrags. The peak-to-gene links are obtained with addPeak2GeneLinks() (see the ArchR book chapter peak2genelinkage-with-archr). The contents of the output directory ./output_p2g_male-bc are as follows:

output_p2g_male-bc
├── archr_output
│   ├── Annotations
│   ├── ArrowFiles
│   ├── Embeddings
│   ├── LSI_ATAC
│   ├── Peak2GeneLinks
│   │   ├── seATAC-Group-KNN.rds
│   │   └── seRNA-Group-KNN.rds
│   ├── PeakCalls
│   │   ├── ...
│   │   ├── InsertionBeds
│   │   ├── ReplicateCalls
│   │   ├── X0.Epi..Tumor-reproduciblePeaks.gr.rds
│   │   ├── X2.Epi..Tumor-reproduciblePeaks.gr.rds
│   │   └── ...
│   ├── Plots
│   ├── RNAIntegration
│   └── Save-ArchR-Project.rds
├── log
├── pdf
│   ├── scatterplot_cisbp_motif_up.normal-vs-cancer_cancer-specific_enhancer.pdf
│   ├── venn_chippeakanno_overlaps_of_enhancer_peaks.pdf
│   └── ...
├── png
├── rds
│   ├── all_p2g_observed.rds
│   ├── cancer_enriched_enhancer_p2g_table.rds
│   ├── cancer_specific_enhancer_p2g_table.rds
│   ├── cancer_specific_p2g_table_degs.rds
│   ├── find_overlaps_of_enhancer_peaks_output_overlappingpeaks_obj.rds
│   ├── male-bc_archrproj_obj_p2gs.rds
│   ├── markerpeaks.for_comparison.normal-vs-cancer_cancer-specific_enhancer.rds
│   ├── p2g.df.sub.plot_enhancer.rds
│   ├── proj.archr.for_comparison.normal-vs-cancer_cancer-specific_enhancer.rds
│   ├── proj.archr.for_comparison.normal-vs-cancer_non-cancer-specific_enhancer.rds
│   ├── venn_chippeakanno_overlaps_of_enhancer_peaks.rds
│   └── ...
├── tmp
├── tsv
│   ├── cancer_specific_enhancer_p2g_table.tsv
│   ├── peakannoenrichment_cisbp_motif_up.normal-vs-cancer_non-cancer-specific_enhancer.tsv
│   └── ...
└── xlsx
    └── male-bc_sc-atac-seq_cancer_specific_p2g_summary.xlsx

The final ArchRProject object (male-bc_archrproj_obj_p2gs.rds) is located under output_p2g_male-bc/rds. The peak2gene data.frame for all enhancers associated with epithelial cells is stored in cancer_enriched_enhancer_p2g_table.rds, and the peak2gene data.frame for cancer-specific enhancers is stored in cancer_specific_enhancer_p2g_table.rds. Cancer-specific enhancers are defined as the cancer-enriched enhancers that remain after removing peak2gene links whose peaks overlap enhancer peaks in normal epithelial cells (in our case, H3K27ac peaks from human mammary epithelial cells (HMEC)).
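The subtraction that defines cancer-specific enhancers can be illustrated on toy BED-style intervals. This awk sketch is not the pipeline's implementation (the output file names suggest the repository computes overlaps in R); it only shows the set logic: keep each cancer-enriched peak that overlaps no normal-cell peak on the same chromosome. The function name and all coordinates below are made up for illustration, and intervals are treated as half-open as in BED.

```shell
#!/bin/sh
# Print rows of file A (chrom, start, end, ...) that overlap no interval in file B.
# Note: the NR == FNR idiom assumes file B is not empty.
subtract_overlapping() {
  awk '
    NR == FNR {                       # first file: normal-cell peaks (B)
      n[$1]++
      s[$1, n[$1]] = $2
      e[$1, n[$1]] = $3
      next
    }
    {                                 # second file: cancer-enriched peaks (A)
      hit = 0
      for (i = 1; i <= n[$1]; i++) {
        # Half-open intervals overlap when each starts before the other ends.
        if ($2 < e[$1, i] && $3 > s[$1, i]) { hit = 1; break }
      }
      if (!hit) print
    }
  ' "$2" "$1"
}

# Toy data (hypothetical coordinates).
cancer="$(mktemp)"
normal="$(mktemp)"
printf 'chr1\t100\t200\tpeakA\nchr1\t500\t600\tpeakB\nchr2\t100\t200\tpeakC\n' > "$cancer"
printf 'chr1\t150\t250\nchr2\t300\t400\n' > "$normal"

# peakA overlaps chr1:150-250 and is dropped; peakB and peakC are kept.
subtract_overlapping "$cancer" "$normal"
rm -f "$cancer" "$normal"
```

The nested loop is quadratic in the number of peaks, which is fine for a toy example; for real peak sets a dedicated interval tool is the better choice.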

Jupyter notebook

The figures are generated by Jupyter notebooks. To install Jupyter Notebook or JupyterLab, see jupyter.org. In each notebook, set dir_rna and/or dir_atac to the location of the merged Seurat object or final ArchRProject object you generated. The output includes PDF files, which are written to the pdf directory.

./
├── figure1.ipynb
├── figure2_dge.ipynb
├── figure3_01_piedonut.ipynb
├── figure3_02_venn.ipynb
├── figure3_03_heatmap_peak2gene.ipynb
├── figure3_04_normal-vs-cancer_enhancers.ipynb
├── figure4_01_boxplot_for_browser_track.ipynb
├── figure4_02_volcanoplot_dge.cse.ipynb
├── figure4_03_heatmap_dge.cse.ipynb
├── figure4_04_browser_track.ipynb
├── figure_s1_01_sc-rna-seq_qc.ipynb
├── figure_s1_02_sc-atac-seq_qc.ipynb
├── figure_s1_03_sc-atac-seq_qc.ipynb
├── log
├── pdf
│   ├── ...
│   ├── featureplot_male-bc_er+bc-epi_er+bc_vs_male-bc.pdf
│   ├── heatmap_male-bc_er+bc-epi_er+bc_vs_male-bc.pdf
│   ├── heatmap_male-bc_er+bc-epi_er+bc_vs_male-bc_zscore.pdf
│   ├── heatmap_peak2gene_legend.pdf
│   ├── piedonut_peak_call_summary.pdf
│   ├── umap_male-bc_cluster_labels_atac.pdf
│   ├── umap_male-bc_cluster_labels_rna.pdf
│   ├── umap_male-bc_cluster_types_atac.pdf
│   ├── umap_male-bc_cluster_types_rna.pdf
│   ├── ...
│   └── volcanoplot_male-bc_sc-rna-seq_female-bc-vs-male-bc.enhancer_overlap.pdf
├── r
├── rds
├── reference
├── tsv
│   ├── df_dge.cse.tsv
│   └── male-bc_er+bc-epi_er+bc_vs_male-bc.tsv
├── txt
└── xlsx
    ├── ...
    └── male-bc_er+bc-epi_er+bc_vs_male-bc.xlsx
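Once dir_rna and dir_atac have been edited, the notebooks can also be executed without opening the browser interface by using jupyter nbconvert. The sketch below assumes that nbconvert and the kernel the notebooks were written for are installed; run_notebooks and DRY_RUN are hypothetical names, and by default the commands are only printed.

```shell
#!/bin/sh
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the command instead of running it

# Execute every figure*.ipynb in a directory in place, in lexical order.
run_notebooks() {
  dir="${1:-.}"
  for nb in "$dir"/figure*.ipynb; do
    [ -e "$nb" ] || continue          # glob matched nothing
    if [ "$DRY_RUN" = "1" ]; then
      echo "jupyter nbconvert --to notebook --execute --inplace $nb"
    else
      jupyter nbconvert --to notebook --execute --inplace "$nb" || return 1
    fi
  done
}

run_notebooks .
```

Some notebooks may depend on files written by earlier ones, so it is worth checking the printed order before running with DRY_RUN=0.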

The scRNA-seq and scATAC-seq pipelines are under active development. Other single-cell data analysis projects will use the current version with different parameters, or an upgraded version of these pipelines.
