The scripts that UPHL uses to download data from BaseSpace and run the respective workflows.
accession_for_clarity.py is a script that creates an accessioning file for Clarity containing the new samples found in the Labware LIMS system that will be ready for the lab to process. The output of the script is a csv file, which must be saved as an Excel file before Clarity will accept it. To run the script you must have these two servers mounted on your computer: smb://172.16.109.9 and smb://168.180.220.43 (i.e. //LABWARE/ and //DDCP/UPHL/). You also need to log into Clarity and download all sample sheets from projects with a name that starts with COVIDSeq, including the project COVIDSeq_From_TEST_DO_NOT_PUT_IN_WORKFLOW. Do this by going to the 'Projects and Samples' tab, choosing the correct project, and clicking the 'Modify Samples' button; the files will be downloaded when you click it. This ensures we do not place duplicates in Clarity, which causes many issues.
Now run the script:
python accession_for_clarity.py "path/to/files/fromclarity" "path/to/files/fromclarity.etc" ...
The csv file that the script creates will be saved to the directory you are running the script from. Open the file in Excel and save it as an Excel (.xlsx) file. Back in Clarity, click the 'Upload Sample List' button and choose the Excel file.
Now click 'Select Group' for the samples you just uploaded, then click 'Assign to Workflow' and choose 'Covidseq v1.0'. And finally, you are done!
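For reference, the heart of the script is a de-duplication step: it reads the sample sheets downloaded from Clarity, collects the sample IDs already accessioned, and keeps only the Labware samples that are not in that set. A minimal sketch of that logic, assuming hypothetical column names ('Sample/Name' and 'sample_id') and a hypothetical Labware export file:

import sys
import pandas as pd

# sample sheets downloaded from Clarity are passed as arguments
clarity_ids = set()
for path in sys.argv[1:]:
    df = pd.read_csv(path)
    clarity_ids.update(df["Sample/Name"].astype(str))  # assumed column name

# hypothetical export of Labware samples ready for the lab
labware = pd.read_csv("labware_samples.csv")
new = labware[~labware["sample_id"].astype(str).isin(clarity_ids)]

# the accessioning file; open in Excel and save as .xlsx before uploading
new.to_csv("accession_for_clarity.csv", index=False)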
This script is written to run continuously on the Linux workstation in a screen session. Its main purpose is to use bs (the BaseSpace Sequence Hub CLI tool) to look for new sequencing runs starting on any of the sequencers. It uses a text file called 'experiments_done.txt' to record which sequencing runs it has already seen. Once it has identified a new run, it uses the Clarity API to find which species are among the samples. The script then opens a new screen and starts the next script, screen_run.py. If it was able to find the run in Clarity, it provides an argument for the run type: currently either mycosnp or grandeur.
USAGE: Run this on a screen. It is an infinite loop and looks for new runs.
EXAMPLE:
python3 analysis_for_run.py
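A sketch of the loop, to show the moving parts (the bs csv output parsing and the polling interval are assumptions; the Clarity API lookup that picks --grandeur or --mycosnp is omitted):

import subprocess
import time

DONE_FILE = "experiments_done.txt"

while True:
    # runs that have already been handled, one name per line
    with open(DONE_FILE) as f:
        seen = set(f.read().split())

    # ask BaseSpace for runs; assumes the run name is the first csv column
    out = subprocess.run(["bs", "list", "runs", "-f", "csv"],
                         capture_output=True, text=True).stdout
    for row in out.splitlines()[1:]:
        run = row.split(",")[0]
        if run and run not in seen:
            with open(DONE_FILE, "a") as f:
                f.write(run + "\n")
            # hand the run off to screen_run.py in its own named screen
            subprocess.run(["screen", "-dmS", run,
                            "python3", "screen_run.py", "-r", run])
    time.sleep(1200)  # polling interval is illustrative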
This script is written to run on a screen named after a sequencing run and will end the screen once it has completed.
To run effectively, a run name should be supplied. If the ultimate goal is to run through an established workflow, there are flags that can be set that provide additional functionality.
EXAMPLE:
# for grandeur
python3 screen_run.py -r UT-M70330-240131 --grandeur
# for mycosnp
python3 screen_run.py -r UT-M70330-240131 --mycosnp
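A sketch of how the flags might be wired up with argparse (only the interface shown in the examples is reproduced; the workflow launch itself is omitted):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("-r", "--run", required=True, help="sequencing run name")
parser.add_argument("--grandeur", action="store_true", help="run the grandeur workflow")
parser.add_argument("--mycosnp", action="store_true", help="run the mycosnp workflow")
args = parser.parse_args()

if args.grandeur:
    print(f"starting grandeur for {args.run}")
elif args.mycosnp:
    print(f"starting mycosnp for {args.run}")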
This script is used to gather sample metadata from recent sequencing runs of C. auris and to extract any additional data required in C. auris WGS analysis requests.
The C. auris LIMS export consists of two reports:
- 'C_AURIS_Positive_Colony_Daily.csv', which contains patient metadata for detection of colonization with Candida auris.
- 'C_AURIS_Positive_Isolates_Daily.csv', which contains patient metadata for confirmed C. auris isolates.
Both reports are generated daily and can be found on the LABWARE server, currently at the path below; they must be copied to the directory of the merge_c_auris_LIMS_export_files.py script.
PATH:
(smb://172.16.109.9) at '/Volumes/LABWARE/Shared_Files/PHT/C_AURIS_DAILY'
This script:
- Uses the pandas library to read both CSV reports into dataframes.
- Drops redundant columns.
- Merges both reports.
- Exports the merged data into an Excel file.
- Creates a parent directory (C_auris_LIMS_export_date) for the saved Excel file.
- Deletes the copied C. auris LIMS csv reports.
EXAMPLE:
python merge_c_auris_LIMS_export_files.py
The resulting merged Excel file assists in gathering City/State data for each sequenced C. auris sample as well as gathering collection dates and specimen types.
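A minimal sketch of the steps listed above (the redundant column name and the merge behavior are assumptions; pandas needs openpyxl installed for the Excel export):

import os
from datetime import date

import pandas as pd

colony = pd.read_csv("C_AURIS_Positive_Colony_Daily.csv")
isolates = pd.read_csv("C_AURIS_Positive_Isolates_Daily.csv")

# drop redundant columns (placeholder name)
colony = colony.drop(columns=["Redundant_Column"], errors="ignore")

# merge the two reports; an outer join keeps samples found in only one report
merged = pd.merge(colony, isolates, how="outer")

# parent directory named with the export date, then the Excel export
outdir = f"C_auris_LIMS_export_{date.today():%y%m%d}"
os.makedirs(outdir, exist_ok=True)
merged.to_excel(os.path.join(outdir, "C_auris_LIMS_export.xlsx"), index=False)

# delete the copied LIMS csv reports
os.remove("C_AURIS_Positive_Colony_Daily.csv")
os.remove("C_AURIS_Positive_Isolates_Daily.csv")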
This script is used to determine the 'Healthcare_Facility_of_origin' city that is associated with various C. auris samples.
The script expects the following two input files, which must be in the same location/directory as the 'healthcare_facility_of_origin_city.py' script. Currently they are located at:
PATH:
/Volumes/IDGenomics_NAS/pulsenet_and_arln/investigations/C_auris/complete_UPHL_analysis/C_auris_LIMS_export
- 'samples.txt': This file contains data about the C. auris samples that is gathered from the C. auris LIMS export. It must include the following headers: ARLN_Specimen_ID, Healthcare_facility_of_origin_name, Healthcare_facility_of_origin_state.
- 'C_auris_Healthcare_Facility_of_origin_name_city_state.txt': This file contains a list of healthcare facilities along with their corresponding cities and states. As additional cities are determined in the future, they will be added by the user, who can find the city through a Google search. Please note that if the city cannot be determined, then 'NULL' should be used.
This script:
- Uses pandas to read the input files; if there is an error in reading the files, it logs the error and exits the script.
- Merges the two dataframes on the Healthcare_facility_of_origin_name and Healthcare_facility_of_origin_state columns. If merging fails, it logs the error and exits.
- Fills in missing city information with 'NULL'.
- Writes the merged data to an Excel file (facility_city_output.xlsx). If there is an error during this process, the script logs it.
- Uses the logging module for error logging, which helps in debugging and maintaining the script.
- Uses the sys module for system-level operations such as exiting the script upon encountering an error.
EXAMPLE:
python healthcare_facility_of_origin_city.py
The resulting Excel file helps reduce time in determining what city the 'Healthcare_facility_of_origin_name' is located in.
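A sketch of the merge and error handling described in the bullets above (tab-delimited inputs and the output city column name, Healthcare_facility_of_origin_city, are assumptions):

import logging
import sys

import pandas as pd

logging.basicConfig(level=logging.INFO)

try:
    samples = pd.read_csv("samples.txt", sep="\t")
    facilities = pd.read_csv(
        "C_auris_Healthcare_Facility_of_origin_name_city_state.txt", sep="\t")
except Exception as err:
    logging.error("could not read input files: %s", err)
    sys.exit(1)

try:
    merged = samples.merge(
        facilities,
        on=["Healthcare_facility_of_origin_name",
            "Healthcare_facility_of_origin_state"],
        how="left",
    )
except Exception as err:
    logging.error("merge failed: %s", err)
    sys.exit(1)

# fill missing city information with 'NULL'
merged["Healthcare_facility_of_origin_city"] = \
    merged["Healthcare_facility_of_origin_city"].fillna("NULL")

try:
    merged.to_excel("facility_city_output.xlsx", index=False)
except Exception as err:
    logging.error("could not write Excel file: %s", err)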
This script aids the user in gathering sample VCF (Variant Call Format) files from mycoSNP result directories. It is designed to efficiently copy and organize .g.vcf.gz and .g.vcf.gz.tbi files for specific samples and runs, facilitating subsequent WGS analysis. Users must create an empty directory named 'vcf_files' and a file named 'samples.txt' in the same location as this script. The samples.txt should contain two tab-separated columns: the first being the sample IDs and the second the run IDs. Example format: 302****\tUT-M07101-2112**.
This script:
- Copies VCF files: The copy_files function is the core of the script. It searches for VCF files in two specific directories, copying them to 'vcf_files/' if they don't already exist there.
- Search directories: Searches the following two directories for VCF files.
  Primary directory: /Volumes/IDGenomics_NAS/fungal/{run_id}/
  Secondary directory: /Volumes/IDGenomics_NAS/fungal/mycosnp_vcfs_211208-230726
- Error and success handling: Prints a message for each sample, indicating whether copying was successful or failed.
- Parallel processing: Uses GNU Parallel to process multiple entries from samples.txt, enhancing efficiency.
EXAMPLE (ensure that 'samples.txt' is correctly formatted and that both it and the 'vcf_files' directory are present in the same directory as the script):
bash gather_vcfs.sh
The script will populate the vcf_files directory with the desired VCF files. Each outcome (success or failure) is communicated to the user via terminal messages.
This script streamlines the process of collecting and organizing necessary data for mycoSNP analysis. It significantly reduces manual data handling and potential errors.
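The actual script is bash and leans on GNU Parallel, but the per-sample copy logic it runs can be sketched in Python (directory layout and file-name matching are assumptions; parallelism is left out):

import glob
import os
import shutil

PRIMARY = "/Volumes/IDGenomics_NAS/fungal/{run_id}/"
SECONDARY = "/Volumes/IDGenomics_NAS/fungal/mycosnp_vcfs_211208-230726"

def copy_files(sample_id, run_id):
    # copy a sample's .g.vcf.gz and .g.vcf.gz.tbi files into vcf_files/
    found = False
    for directory in (PRIMARY.format(run_id=run_id), SECONDARY):
        pattern = os.path.join(directory, "**", f"{sample_id}*.g.vcf.gz*")
        for path in glob.glob(pattern, recursive=True):
            dest = os.path.join("vcf_files", os.path.basename(path))
            if not os.path.exists(dest):
                shutil.copy(path, dest)
            found = True
    print(f"{sample_id}: {'copied' if found else 'FAILED, no vcf found'}")

with open("samples.txt") as f:
    for line in f:
        if line.strip():
            sample_id, run_id = line.strip().split("\t")
            copy_files(sample_id, run_id)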
This script updates sequence identifiers in the 'vcf-to-fasta.fasta' file, which is an output file of mycoSNP. It ensures that sequence IDs in the 'vcf-to-fasta.fasta' file, used in creating a Newick file, align with the specific naming conventions set by the CDC's Mycotic Disease Branch. Users must place three specific files in the same directory as this script:
- seqid.txt: A user-created, tab-separated file with headers "current_seqid" and "alternate_seqid".
Example:
current_seqid alternate_seqid
3437***_S24 UT-UPHL-CAU-2*_Houston_TX
3437***_S25 UT-UPHL-CAU-3*_Houston_TX
- original.fasta: A copy of the vcf-to-fasta.fasta file that needs ID alteration.
- corrected.fasta: An empty file where the script will write the modified data.
The script:
- Dictionary creation: Reads the seqid.txt file and constructs a dictionary mapping the current sequence IDs (current_seqid) to their corresponding alternate IDs (alternate_seqid).
- File processing: Opens original.fasta for reading and corrected.fasta for writing. Sequence identifiers in original.fasta are replaced with alternate IDs from the dictionary, preserving the integrity of the FASTA file format.
- Output: The resulting corrected.fasta file contains the FASTA sequences with updated sequence identifiers. This file is suitable for generating a Newick file for phylogenetic analysis.
EXAMPLE (before running the script, ensure that 'seqid.txt', 'original.fasta', and 'corrected.fasta' are correctly prepared and located in the same directory as the script):
python changeseqids.py
This script aids in the preparation of C. auris sequence data for phylogenetic analysis. It automates the often tedious and error-prone process of manually updating sequence identifiers via MEGA 11 (Molecular Evolutionary Genetics Analysis version 11).
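A minimal sketch of the ID replacement (this version matches the header text after '>' exactly; the real script's matching may differ):

# build the current_seqid -> alternate_seqid mapping from seqid.txt
ids = {}
with open("seqid.txt") as f:
    next(f)  # skip the header row
    for line in f:
        current, alternate = line.rstrip("\n").split("\t")
        ids[current] = alternate

# rewrite header lines; sequence lines pass through untouched
with open("original.fasta") as src, open("corrected.fasta", "w") as dst:
    for line in src:
        if line.startswith(">"):
            seqid = line[1:].strip()
            dst.write(">" + ids.get(seqid, seqid) + "\n")
        else:
            dst.write(line)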
This script will download fastq files when given a run name with -r and an output directory with -o. This script should only be used if there were no sample sheet issues; otherwise, manual downloading is recommended.
EXAMPLE:
python3 bssh_reads_by_run.py -r UT-M70330-240131 -o /Volumes/IDGenomics_NAS/pulsenet_and_arln/UT-M70330-240131/reads
This script will download the sample sheet of a run when given a run name with -r and an output directory with -o. This script should only be used if there were no sample sheet issues; otherwise, manual downloading is recommended.
EXAMPLE:
python3 bssh_sample_sheet.py -r UT-M70330-240131 -o /Volumes/IDGenomics_NAS/pulsenet_and_arln/UT-M70330-240131/reads
This python script will take a MiSeq sample sheet, look for a specific string that should designate the header, and then create a pandas dataframe for use in other scripts. If used as a standalone script, it will create a csv file without the extra header needed for fastq generation.
EXAMPLE:
python3 samplesheet_to_df.py -s sample_sheet.csv -o new_sample_sheet.csv
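A sketch of the header-detection idea (assuming the '[Data]' marker that MiSeq sample sheets use; the real script's marker string may differ):

import argparse
import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("-s", "--sample_sheet", required=True)
parser.add_argument("-o", "--out", required=True)
args = parser.parse_args()

# find the line that marks the start of the sample table
with open(args.sample_sheet) as f:
    lines = f.readlines()
data_row = next(i for i, line in enumerate(lines) if line.startswith("[Data]"))

# everything after '[Data]' is an ordinary csv table with its own header row
df = pd.read_csv(args.sample_sheet, skiprows=data_row + 1)

# standalone use: write the table back out without the extra header sections
df.to_csv(args.out, index=False)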
This script will take a run name, a sample sheet, and a directory of reads to create a sample sheet compatible with AWS.
EXAMPLE:
aws_samplesheet_grandeur_create.py -r UT-M03999-240627
Afterward, the idea is to just cd into the directory of the reads and upload everything to the AWS S3 bucket:
# set the run name and reads directory
run=UT-M03999-240627
directory=/Volumes/IDGenomics_NAS/pulsenet_and_arln/$run/reads
cd $directory
# upload the AWS sample sheet to the S3 inputs bucket
aws s3 cp --profile 155221691104_dhhs-uphl-biongs-dev --region us-west-2 aws_sample_sheet.csv s3://dhhs-uphl-omics-inputs-dev/$run/aws_sample_sheet.csv
# upload every fastq file in parallel
ls *fastq.gz | parallel aws s3 cp --profile 155221691104_dhhs-uphl-biongs-dev --region us-west-2 {} s3://dhhs-uphl-omics-inputs-dev/$run/{}