Pipeline tutorial
This tutorial takes you through running two example pipelines that start with fastq files and end with mapped bam files. You can see all the pipelines by peeking in here; they all work in essentially similar ways, so just read their individual POD documentation for the differences.
The system needs information about your fastq files before it can proceed, and it stores this information in a MySQL database. We have a script that takes the information from simple flat-file formats that you can create for yourself and imports it into the database.
A sequence.index file holds the information about each fastq file. Your fake_sequence.index must be valid in format (25 tab-separated columns, since the 0-based indexes below run from 0 to 24), with the following required fields:
- [0] FASTQ_FILE
- [1] MD5
- [2] RUN_ID (a common substring of the fastq files for this run that is unique to those fastq files)
- [3] STUDY_ID
- [5] CENTER_NAME (e.g. 'SC' for Sanger)
- [8] SAMPLE_ID (can be same as next column)
- [9] SAMPLE_NAME (aka 'individual')
- [12] INSTRUMENT_PLATFORM ('ILLUMINA' or 'LS454')
- [14] LIBRARY_NAME
- [17] INSERT_SIZE
- [19] PAIRED_FASTQ (filename of paired fastq)
- [20] WITHDRAWN (empty if not withdrawn, which is the typical case)
- [23] READ_COUNT (can be left blank if calculate_missing_information option will be used with HierarchyUpdate pipeline)
- [24] BASE_COUNT (can be left blank if calculate_missing_information option will be used with HierarchyUpdate pipeline)
Empty columns should just be left truly empty.
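The column layout above can be sketched as a small shell snippet that writes a single row and then checks it. This is only an illustration: every value (file names, the MD5 string, study, sample and library names, insert size) is a hypothetical placeholder, and the row width of 25 is an assumption inferred from the highest 0-based index (24) in the list.

```sh
#!/bin/sh
# Sketch: write one sequence.index row with values at the 0-based indexes
# listed above. All values are hypothetical placeholders.
workdir=$(mktemp -d)
out="$workdir/fake_sequence.index"

awk 'BEGIN {
    n = 25                                  # assumption: indexes 0..24 => 25 columns
    for (i = 0; i < n; i++) f[i] = ""       # unused columns stay truly empty
    f[0]  = "RUN0001_1.fastq.gz"            # FASTQ_FILE
    f[1]  = "d41d8cd98f00b204e9800998ecf8427e"  # MD5 (placeholder value)
    f[2]  = "RUN0001"                       # RUN_ID: substring shared by both fastqs
    f[3]  = "STUDY1"                        # STUDY_ID
    f[5]  = "SC"                            # CENTER_NAME
    f[8]  = "SAMPLE1"                       # SAMPLE_ID
    f[9]  = "SAMPLE1"                       # SAMPLE_NAME (same as SAMPLE_ID here)
    f[12] = "ILLUMINA"                      # INSTRUMENT_PLATFORM
    f[14] = "LIB1"                          # LIBRARY_NAME
    f[17] = "300"                           # INSERT_SIZE
    f[19] = "RUN0001_2.fastq.gz"            # PAIRED_FASTQ
    # f[20] WITHDRAWN, f[23] READ_COUNT and f[24] BASE_COUNT are left empty
    line = f[0]
    for (i = 1; i < n; i++) line = line "\t" f[i]
    print line
}' > "$out"

# awk fields are 1-based, so 0-based index 12 is $13
ncols=$(awk -F'\t' '{ print NF }' "$out")
platform=$(awk -F'\t' '{ print $13 }' "$out")
echo "columns=$ncols platform=$platform"
```

Building the row from an indexed array keeps the untouched columns empty rather than filled with spaces or dummy text, which is what "truly empty" asks for.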
A samples.txt file holds information about the samples your fastqs were sequenced from. Your fake_samples.txt is tab-delimited with columns:
- 1 lookup name
- 2 acc (e.g. ERS000116. Can be empty)
- 3 individual name (e.g. CAST/EiJ)
- 4 alias (used for genotyping)
- 5 population name (e.g. CAST_EiJ)
- 6 species name (e.g. Mus musculus castaneus)
- 7 taxon id (e.g. 10091)
- 8 sex (e.g. F)
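As a rough illustration of the eight columns above, the snippet below writes one tab-delimited row and checks its width. The acc, individual, population, species, taxon id and sex values are the examples given in the list; the lookup name and alias values are hypothetical, since the list gives no examples for them.

```sh
#!/bin/sh
# Sketch: write a one-row samples file using the example values from the list.
workdir=$(mktemp -d)
samples="$workdir/fake_samples.txt"

# Columns: lookup name, acc, individual name, alias, population name,
#          species name, taxon id, sex
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    'CAST_EiJ' 'ERS000116' 'CAST/EiJ' 'CAST_EiJ' \
    'CAST_EiJ' 'Mus musculus castaneus' '10091' 'F' > "$samples"

# Count rows that do not have exactly 8 tab-separated columns
bad_rows=$(awk -F'\t' 'NF != 8 { n++ } END { print n + 0 }' "$samples")
echo "rows with wrong column count: $bad_rows"
```

Passing each value as a separate printf argument keeps the space inside the species name from being mistaken for a column break.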
As a one-time operation, you need to create and prepare a MySQL database to hold the meta-information:
echo 'create database vrtrack_new' | mysql -uXX -pXX -hXX -P 3306
perl -MVRTrack::VRTrack -e 'foreach (VRTrack::VRTrack->schema()) { print }' | mysql -uXX -pXX -hXX -P 3306 vrtrack_new
Now, each time you update your fake_sequence.index (e.g. because you did more sequencing and have some new fastq files), update the database with this new meta-information:
update_vrmeta.pl --index fake_sequence.index --database vrtrack_new --samples fake_samples.txt
Each pipeline you want to run will need its own config file. See the module's POD for examples of the contents of these config files. In our example we want to run the HierarchyUpdate pipeline, which will import (copy) the fastqs into a certain directory structure suitable for use with subsequent pipelines. Once imported, we want to run the Mapping pipeline to map the fastqs against the human genome and generate bam alignment files.
Following the pipeline module POD, create hierarchyupdate.conf and mapping.conf files with your desired settings, then list them in a pipeline file:
echo 'VRTrack_HierarchyUpdate hierarchyupdate.conf' > my.pipeline
echo 'VRTrack_Mapping mapping.conf' >> my.pipeline
You now have a single file 'my.pipeline' that can be used to instruct run-pipeline to run both pipelines.
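Before starting run-pipeline, it can help to confirm that every config file named in my.pipeline actually exists. The check below is a sketch of our own, not something run-pipeline provides. It assumes each line of the pipeline file is a pipeline name followed by a config file name, as in the echo commands above, and it uses empty placeholder config files; real ones need the settings described in each module's POD.

```sh
#!/bin/sh
# Sketch: verify that each config file named in my.pipeline is present.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Empty placeholders standing in for real config files
: > hierarchyupdate.conf
: > mapping.conf

echo 'VRTrack_HierarchyUpdate hierarchyupdate.conf' > my.pipeline
echo 'VRTrack_Mapping mapping.conf' >> my.pipeline

missing=0
while read -r name conf; do
    [ -n "$name" ] || continue              # skip blank lines
    if [ -f "$conf" ]; then
        echo "ok: $name -> $conf"
    else
        echo "missing config for $name: $conf" >&2
        missing=$((missing + 1))
    fi
done < my.pipeline

echo "missing config files: $missing"
```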
run-pipeline -c my.pipeline -v -v
Use the -h option to run-pipeline to see other possible arguments. The above command will run each pipeline mentioned in the -c file in succession, sleep for half an hour, then try running the pipelines again. It will keep looping indefinitely, even after both pipelines have ostensibly finished. The idea is that you could leave this running somewhere, and every time you update the database with new fastq files they would be automatically mapped.
Running both these pipelines at once works because the Mapping pipeline will do nothing until the database tells it there are imported but unmapped fastqs; when it finishes mapping a fastq (pair) it notes this in the database and so will not try to map it again. The HierarchyUpdate pipeline will only do something when it sees there are fastq files that have not been imported; once it has imported one it will note that in the database for the Mapping pipeline to pick up on.
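The hand-off described above can be illustrated with a toy model. This is not how VRTrack or the pipelines are implemented: a flat file stands in for the database, each fastq carries a single hypothetical state (new, imported or mapped), and each "pipeline" is a function that only acts on entries in the state it is waiting for.

```sh
#!/bin/sh
# Toy model of the database hand-off between the two pipelines.
workdir=$(mktemp -d)
db="$workdir/state.tsv"                     # stand-in for the database
printf 'a.fastq\tnew\nb.fastq\tnew\n' > "$db"

# advance FROM TO: move every entry in state FROM to state TO and
# print how many entries were moved.
advance() {
    moved=$(awk -F'\t' -v from="$1" '$2 == from' "$db" | wc -l | tr -d ' ')
    awk -F'\t' -v from="$1" -v to="$2" \
        'BEGIN { OFS = "\t" } $2 == from { $2 = to } { print }' \
        "$db" > "$db.tmp" && mv "$db.tmp" "$db"
    echo "$moved"
}

hierarchy_update() { advance new imported; }    # imports only unimported fastqs
mapping()          { advance imported mapped; } # maps only imported, unmapped fastqs

# One pass runs both pipelines in succession, as run-pipeline does.
run_pass() {
    imported=$(hierarchy_update)
    mapped=$(mapping)
    echo "imported=$imported mapped=$mapped"
}

first=$(mapping)                            # nothing imported yet, so nothing to map
pass1=$(run_pass)                           # both fastqs imported, then mapped
pass2=$(run_pass)                           # nothing left to do
printf 'c.fastq\tnew\n' >> "$db"            # new sequencing arrives
pass3=$(run_pass)                           # only the new fastq is processed

echo "mapping before import: $first"        # prints 0
echo "pass 1: $pass1"                       # prints imported=2 mapped=2
echo "pass 2: $pass2"                       # prints imported=0 mapped=0
echo "pass 3: $pass3"                       # prints imported=1 mapped=1
```

Because each step records its result in the shared state, repeating the pass is harmless, which is why the endless loop in run-pipeline never redoes finished work.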