BME-237-spr2021: Repository of R scripts & terminal commands used to complete RNA Bioinformatics problem sets

This readme lists each file and describes what they are in relation to my assignments

List of projects from most recent->oldest

Final project: Performance evaluation of different differential expression pipelines using publically available sequencing data

Dataset: NBCI-SRA SRP126155
Pipelines: STAR>HTSEQ-count>DESq2 ; Kallisto>Sleuth
Analytical benchmarks & QC: FastP, FastQC, diffexp comparison between housekeeping genes in human lung bronchial epithelial cells (HBECs) not expected to be differentially expressed (negative ctrl) & genes validated by qPCR to be differentially expressed in HBECs exposed to cigarette smoke (positive ctrl).
Conclusion: Only Kallisto/Sleuth was able to recapitulate differential expression of positive control gene CYP1A1 in its heatmap. This gene may have failed to show up in DESeq2's heatmap clustering despite appearing in DESeq2's significantly differentially expressed genes because of an outlier HTSeq-count .txt result in CSC sample 3.

Problem set 7: Reproduce published results of eCLIP peaks found in different regions of pre-mRNA or non-coding RNAs for RNA binding proteins UPF1 & PTBP1. https://www.nature.com/articles/s41586-020-2077-3

Dataset: Reproducible eCLIP peaks from K562 cell line. Encode accession #s ENCFF597NVR (UPF1 peaks), ENCFF907HNN (PTBP1 peaks)
Pipelines: bedtools
Conclusion: The fractions of reproducible eCLIP peaks for UPF1 and PTBP1 in the K562 cell line is similar to the fractions of peaks for these proteins in the HepG2 cell line.

Problem set 2: Re-analysis of HOXA1 KD effect on gene expression using DESeq2

Dataset: Illumina MiSeq of human primary lung fibroblasts WT & KD for HOXA1. Encode accession #s SRR493372, SRR493373, SRR493374 (WT); SRR493375, SRR493376, SRR493377 (KD)
Pipelines: HISAT2>HTSeq-count>DESeq2
Conclusions: Trapnell et al. found TBX3, CDC14B, CDK2, and ORC6 to be differentially expressed between WT & KD cells. Only TBX3 & CDK2 were found by my analysis with FDR < 1%. The other genes could have been missed by my analysis because of our different aligners (Trapnell et al. used Tophat2, Hisat2's predecessor). When Tophat2 alignment is compared with HISAT2 on human fetal lung fibroblast short-read RNA seq data in "HISAT: a fast spliced aligner with low memory requirements", Tophat2 aligns fewer reads than HISAT2. Tophat2's alignment accuracy for human RNA-seq appears lower than HISAT2 in that comparison so the extra genes may have been inaccurately counted by Tophat2.

Problem set 1: Comparison of splice junctions found by alignment methods STAR vs. HISAT2

Dataset: paired-end (76bp) mRNA sequencing of lung fibroblast primary cells. Encode accession #s ENCFF000IJA (read 1), ENCFF000IJH (read 2)
Pipelines: STAR>Regtools ; HISAT2>Regtools
Conclusions: STAR>Regtools detected 146054 unique splice junctions, while HISAT2 detected 159451. HISAT2 may have detected more unique SJs than STAR because HISAT2 is better at aligning intron-spanning reads with small anchors (1–7 bp aligned to one of the exons) but does not perform as well for reads with long anchors. It’s possible the junctions not detected by HISAT2 had longer anchors.

File & Description

Final-project-terminal-commands Terminal commands from downloading dataset to getting Kallisto & HTSeq-count output for R
final_deseq2_script.R R script for running DESeq2 on HTSeq-count files
final_kallistosleuth_script.R R script for running Sleuth from Kallisto files
ps7-terminal-commands Terminal commands for using bedtools to find reproducible eCLIP peaks for UPF1 & PTBP1
ps2-terminal-and-R-commands Terminal & R commands for using Hisat2>HTSeq-count>DESeq2
ps1-terminal-commands Terminal commands for comparing splice junctions found by STAR vs. HISAT2

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

BME-237-spr2021: Repository of R scripts & terminal commands used to complete RNA Bioinformatics problem sets

List of projects from most recent->oldest

File & Description

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
Final-project-terminal-commands		Final-project-terminal-commands
README.md		README.md
final_deseq2_script.R		final_deseq2_script.R
final_kallistosleuth_script.R		final_kallistosleuth_script.R
ps1-terminal-commands		ps1-terminal-commands
ps2-terminal-and-R-commands		ps2-terminal-and-R-commands
ps7-terminal-commands		ps7-terminal-commands

cindyeliang/BME-237-spr2021

Folders and files

Latest commit

History

Repository files navigation

BME-237-spr2021: Repository of R scripts & terminal commands used to complete RNA Bioinformatics problem sets

List of projects from most recent->oldest

File & Description

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages