TCGA pipeline: (pan)cancer stats.
A series of Python scripts to download select pieces of TCGA, store them as a local MySQL database, and extract various statistics.
Its current use is as a prep step for merging with the ICGC pipeline. The only two subdirs remaining in use are 00_common_tasks and 01_somatic_mutations.
'Common tasks' refers to the tasks needed to make a functional local subset of TCGA. The only non-obsolete piece remaining is 200_find_maf_files_in_GDC.py, which can be used to download somatic mutation tables from GDC, a repository of legacy TCGA data.
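As an illustration of what a search for MAF files in GDC can look like, here is a minimal sketch that only builds the query parameters for the public GDC files endpoint (https://api.gdc.cancer.gov/files). It is not taken from 200_find_maf_files_in_GDC.py: the function name build_maf_query, the choice of returned fields, and the default page size are assumptions made for the example.

```python
import json

# Public GDC REST endpoint for file metadata.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_maf_query(project_id, size=100):
    """Build query parameters that select MAF files of one TCGA project.

    Hypothetical helper: the name, the returned fields and the page size
    are illustration choices, not the pipeline's actual settings.
    """
    filters = {
        "op": "and",
        "content": [
            {"op": "in",
             "content": {"field": "cases.project.project_id",
                         "value": [project_id]}},
            {"op": "in",
             "content": {"field": "files.data_format",
                         "value": ["MAF"]}},
        ],
    }
    return {
        "filters": json.dumps(filters),   # GDC expects the filter as a JSON string
        "fields": "file_id,file_name",    # assumed minimal set of fields
        "format": "JSON",
        "size": str(size),
    }


if __name__ == "__main__":
    params = build_maf_query("TCGA-LIHC")
    print(GDC_FILES_ENDPOINT)
    print(params["filters"])
```

The resulting dictionary can be passed as the query string of a GET request to the endpoint; the request itself is left out here so that the sketch runs without network access.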
001_drop_maf_tables through 002_create_maf_tables
003_maf_meta through 012_drop_annotation
Some samples in TCGA have serious problems with assembly or data interpretation. Example:
broad.mit.edu_LIHC.IlluminaGA_DNASeq_automated.Level_2.1.0.0/
An_TCGA_LIHC_External_capture_All_Pairs.aggregated.capture.tcga.uuid.curated.somatic.maf
273933 RPL5 Frame_Shift_Del p.K270fs
273933 RPL5 Frame_Shift_Del p.K277fs
273933 RPL5 Frame_Shift_Del p.R279fs
273933 RPL5 Frame_Shift_Del p.Q282fs
Such samples stand out sharply. Here we detect them as samples in which two frameshift mutations within 5 nucleotides of each other are reported more than 100 times. Such samples are marked in 003_maf_meta.py. Later we decided to drop them, in 014_drop_stuttering_samples.py.
After this point we can move to ICGC; TCGA data will be fused into the combined dataset over there.
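The detection rule above can be sketched as follows. This is one reading of the criterion, not the code in 003_maf_meta.py: it assumes mutations arrive as (sample_id, chromosome, position, classification) tuples, counts neighbouring frameshift positions on the same chromosome that lie within 5 nucleotides of each other, and flags a sample when that count exceeds 100. The function name and the input layout are hypothetical.

```python
from collections import defaultdict

# MAF Variant_Classification values that denote frameshifts.
FRAMESHIFT = {"Frame_Shift_Del", "Frame_Shift_Ins"}


def find_stuttering_samples(mutations, max_distance=5, min_pairs=100):
    """Return ids of samples with more than min_pairs close frameshift pairs.

    mutations: iterable of (sample_id, chromosome, position, classification).
    The tuple layout is an assumption made for this sketch.
    """
    # Collect frameshift positions per sample and chromosome.
    positions = defaultdict(list)
    for sample_id, chrom, pos, classification in mutations:
        if classification in FRAMESHIFT:
            positions[(sample_id, chrom)].append(pos)

    # Count neighbouring positions that are at most max_distance apart.
    close_pairs = defaultdict(int)
    for (sample_id, _chrom), pos_list in positions.items():
        pos_list.sort()
        for prev, curr in zip(pos_list, pos_list[1:]):
            if curr - prev <= max_distance:
                close_pairs[sample_id] += 1

    return {s for s, n in close_pairs.items() if n > min_pairs}
```

Only adjacent positions in sorted order are compared, so a dense run of frameshifts is counted once per neighbouring pair rather than once per every possible pair.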
... provided by 020_db_stats.py through 027_patient_freqs.py.