List of group members:
- Krishna Bathina
- Guillaume Dury
- Xuan Wang
- R. Taylor Raborn (co-advisor)
Agenda items from 9-12 meeting
- Read and become familiar with background literature, especially Hoskins and Lenhard et al (2011 and 2012, respectively).
- To decide which PopGen dataset(s) to use in our study, and to become familiar with it (/them).
- To determine whether peaked/broad promoter classifications are available from Hoskins or whether this needs to be via Taylor's pipeline.
Agenda items from 9-19 meeting
- Agreed on using the Clark population data from Fly Nexus database.
- Plan of attack: create a figure that evaulates pi within the core promoter.
- Need to add:i) pi calculator, which can then be incorporated into larger scripts, and ii) the promoter data (TR).
- Agreed on using github for code and and small files and Box for very large (>50MB) files.
Misc updates 9-22
- Krishna added a pi calculator script: py.pi (thanks to Guillaume for adding his code on Monday). Taylor tweaked the code very slightly so it automatically outputs pi given the embedded test data.
- Our storage allocation is now available at /N/dc2/projects/PromoterPopGen . Please check to see if you i) can navigate into the directory and ii) encounter any permissions issues creating a test directory/file.
Misc updates 10-4
- Meeting scheduled for 10-11 at 5:30PM (location TBA).
- RTR added a relevant Population Genomics link to reading list.
- RTR wrote and added 'importSeq', an R file that imports fasta data derived from SEQ data.
Misc updates 10-5
- RTR added seqConvert.sh, which converts all SEQ files to FASTA format; the header is the filename before the .seq prefix.
- RTR converted all SEQ files in our
/N/dc2/projects/PromoterPopGen/
folder to FASTA format using seqConvert.sh.
Misc updates 10-11
- Our meeting was extremely productive. We discussed our current progress, the data structure (DNAstring object in R) and the things that need to be done.
- We agreed to focus on the following two issues: i) how to manipulate the DNAstring object to calculate pi and ii) subsetting the DNAstring object to retrieve only the intervals of interest (i.e. promoters)
- RTR added an example of the DNAstring object (as an R binary in .RData format) to our folder on UITS (see /data). Load the file into your R workspace using the following command:
R load("DNAstrings_example.RData")
Misc updates 10-20
- Guillaume and Taylor met to get through our problem with accessing strings from DNAStringSet objects. We're happy to report that we have solved this; we'll report how below.
- RTR modified immportSEQ.R to operate on a single combined (multiple SEQ files concatenated toegher) fasta file at a time.
- Character information can be extracted from a DNAStringSet object (please see the sample object in /data) as follows:
R load("DNAstring_example.RData") as.character(sample_DNAstring[[1]][20000:20100]) [1] "TGTGGCCGAATTTATTCTAAACTGAAAATAATAATAAAAATTAATCAAATTTTCAATAAGTAAAAAATTAAAAAGGAACTTGTATATTTTTTCACTCTTAT"
- Guillaume is now re-implementing the pi caculator in R, and RTR is writing a function to extract sequences from a given interval (e.g. from a BED file).
Misc updates 11-1
- Taylor committed updates, which built on Guillaume's commit from last week.
- At present, we have a workflow to extract pi for a given set of coordinates from a DNAString object. The function retrievePi() does this by calling other subfunctions.
- We had a good meeting (5:30 to 6:25PM) discussing next steps. Our major priority is acessing intervals from a BED file containing promoters.
- A sample DNAString object was added to the top level of our shared /N/dc2/projects folder.
Misc updates 11-5/6
- Taylor wrote bedToPi.R, which is the 'master' script we needed to pull information from BED files and calculate pi across the intervals.
- bedToPi.R was run on our promoter dataset with sample data: 'sample_DNAstring' (Chr2L; 10 individuals), and the results are in the newly-created results/ directory, along with the command used to generate the file.
- We're getting there! We now need to ramp this up on the entire dataset (85 indv. x all chromosome arms).
Updates Dec 8. Things to do:
- Make sure it doesn't vary over population
- Make sure each site doesn't differ over various regions
- Shape values - Top 10% is peaked, bottom 10% is broad, and middle 80% is unclassified (KB)
- Make comparisons with other parts of the genome - How does the mean H value compare between TSR and coding regions?