Update tutorial.md
subinamehta authored Nov 14, 2024
1 parent ac96caf commit 9df4aa8
Showing 1 changed file with 36 additions and 70 deletions.
@@ -200,9 +200,11 @@ PepQuery2 is a tool used to validate novel peptides and proteins by searching ma
>
{: .question}

## Sub-step with **Query Tabular**
## Filtering Tabular Data with **Query Tabular**

> <hands-on-title> Task description </hands-on-title>
Query Tabular is a tool for querying tabular datasets with SQL-like commands. In this step, it filters and extracts specific data from the results generated by PepQuery2 (specifically the psm_rank_txt dataset). The query selects specific columns (c1 and c4) from the table for the rows that meet a condition (c20 = 'Yes'). This lets the user filter the peptide-spectrum matches (PSMs) by chosen criteria, such as keeping only the PSMs marked "Yes" for validation. The result is a smaller, more relevant table for further analysis.
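
The effect of this query can be tried outside Galaxy with a small sketch. Query Tabular runs its queries in SQLite, so Python's built-in `sqlite3` module is a reasonable stand-in. The three-column table and its values below are invented for illustration; the real psm_rank table has many more columns, and only c1, c4 and c20 are modelled here.

```python
import sqlite3

# Hypothetical stand-in for the PepQuery2 psm_rank table.
# Only the three columns used by the query are modelled: c1, c4 and c20.
rows = [
    ("PEPTIDEA", "spectrum_1", "Yes"),
    ("PEPTIDEB", "spectrum_2", "No"),
    ("PEPTIDEC", "spectrum_3", "Yes"),
]


def filter_validated(rows):
    """Load rows into an in-memory table t1 and run the tutorial's query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t1 (c1 TEXT, c4 TEXT, c20 TEXT)")
    conn.executemany("INSERT INTO t1 VALUES (?, ?, ?)", rows)
    result = conn.execute("SELECT c1, c4 FROM t1 WHERE (c20 = 'Yes')").fetchall()
    conn.close()
    return result


filtered = filter_validated(rows)
for c1, c4 in filtered:
    # Tab-separated, without a header line, like the Galaxy output.
    print(f"{c1}\t{c4}")
```

Only the rows whose c20 value is 'Yes' survive, and each surviving row is reduced to two columns.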

> <hands-on-title> Query Tabular </hands-on-title>
>
> 1. {% tool [Query Tabular](toolshed.g2.bx.psu.edu/repos/iuc/query_tabular/query_tabular/3.3.2) %} with the following parameters:
> - In *"Database Table"*:
@@ -214,72 +216,59 @@ WHERE (c20 = 'Yes')
`
> - *"include query result column headers"*: `No`
>
> ***TODO***: *Check parameter descriptions*
>
> ***TODO***: *Consider adding a comment or tip box*
>
> > <comment-title> short description </comment-title>
> >
> > A comment about the tool or something else. This box can also be in the main text
> {: .comment}
>
{: .hands_on}

***TODO***: *Consider adding a question to test the learners understanding of the previous exercise*

> <question-title></question-title>
>
> 1. Question1?
> 2. Question2?
> 1. Why is the option "include query result column headers" set to 'No'?
> 2. What does the SQL query `SELECT c1, c4 FROM t1 WHERE (c20 = 'Yes')` do?
>
> > <solution-title></solution-title>
> >
> > 1. Answer for question1
> > 2. Answer for question2
> > 1. The option is set to 'No' so that the output contains only data rows. A header row would otherwise be passed to the next step as if it were data, which is undesirable when the output is processed further or fed into another tool.
> > 2. This SQL query selects two columns (c1 and c4) from the table t1 for the rows where the value in column c20 equals 'Yes'. In other words, it filters the dataset to keep only the rows where a condition (here, peptide validation) has been met.
> >
> {: .solution}
>
{: .question}

## Sub-step with **Tabular-to-FASTA**
## Converting Data to FASTA Format with **Tabular-to-FASTA**
Tabular-to-FASTA is a tool that converts tabular data into FASTA format, which is commonly used for storing sequence data. In this step, the tool takes the output of the Query Tabular step, which contains the filtered peptide data, and converts it to FASTA. The parameters specify which columns of the tabular data hold the sequence (c1) and the title (column 2, shown as `c['2']`). The output is a FASTA file in which each peptide sequence is accompanied by its title or identifier. This conversion is needed for downstream sequence analysis such as alignment or database searches. Here, the FASTA file serves as the query for the BLAST-P search in the next step.
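
The conversion itself is simple enough to sketch in a few lines of Python. This is an illustration of the idea, not the Galaxy tool's code: the function name `tabular_to_fasta` and the sample values are made up, and the column numbers are 1-based to match the `c1` / `c['2']` settings.

```python
def tabular_to_fasta(lines, title_col=2, seq_col=1):
    """Turn tab-separated lines into FASTA text.

    Column numbers are 1-based, matching the tool's c1 / c2 convention.
    """
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        title = fields[title_col - 1]
        sequence = fields[seq_col - 1]
        records.append(f">{title}\n{sequence}\n")
    return "".join(records)


# Hypothetical two-column output of the Query Tabular step (sequence, title).
example = ["PEPTIDEA\tspectrum_1\n", "PEPTIDEC\tspectrum_3\n"]
print(tabular_to_fasta(example), end="")
```

Every input line becomes one record, which is also why a header line in the input would end up as a spurious FASTA entry.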

> <hands-on-title> Task description </hands-on-title>
> <hands-on-title> Tabular to FASTA </hands-on-title>
>
> 1. {% tool [Tabular-to-FASTA](toolshed.g2.bx.psu.edu/repos/devteam/tabular_to_fasta/tab2fasta/1.1.1) %} with the following parameters:
> - {% icon param-file %} *"Tab-delimited file"*: `output` (output of **Query Tabular** {% icon tool %})
> - *"Title column(s)"*: `c['2']`
> - *"Sequence column"*: `c1`
>
> ***TODO***: *Check parameter descriptions*
>
> ***TODO***: *Consider adding a comment or tip box*
>
> > <comment-title> short description </comment-title>
> >
> > A comment about the tool or something else. This box can also be in the main text
> {: .comment}
>
{: .hands_on}

***TODO***: *Consider adding a question to test the learners understanding of the previous exercise*

> <question-title></question-title>
>
> 1. Question1?
> 2. Question2?
> 1. Why is there a need to convert the files to FASTA for BLASTP searches?
>
> > <solution-title></solution-title>
> >
> > 1. Answer for question1
> > 2. Answer for question2
> > 1. We need to convert data to FASTA format for BLASTP because it is the required input format for the tool. FASTA provides a standardized way to represent protein sequences with a header and sequence lines, making it compatible with BLASTP’s alignment algorithms. This format allows BLASTP to properly parse, compare, and search the query sequence against protein databases.
> >
> {: .solution}
>
{: .question}

## Sub-step with **NCBI BLAST+ blastp**
## BLAST-P Sequence Alignment
In this step, the NCBI BLAST+ blastp tool performs protein sequence alignment. It compares the query peptide sequences (generated by the Tabular-to-FASTA tool) against a locally installed BLAST database of protein sequences. The blastp-short option optimizes the search for short queries (fewer than 30 residues). The tool uses the PAM30 scoring matrix with specific gap costs and produces the extended 25-column tabular output, so the results can be examined in detail.

> <hands-on-title> Task description </hands-on-title>
**Parameters:**
- Protein query sequence(s): The input protein sequence in FASTA format.
- Subject database/sequences: The BLAST database against which the query sequence will be compared.
- Type of BLAST: Optimized for short protein queries.
- Expectation value cutoff: Specifies the cutoff for statistical significance.
- Advanced options: Includes scoring matrix (PAM30), gap costs, and word size for the BLAST algorithm.

> <hands-on-title> NCBI BLAST+ blastp </hands-on-title>
>
> 1. {% tool [NCBI BLAST+ blastp](toolshed.g2.bx.psu.edu/repos/devteam/ncbi_blast_plus/ncbi_blastp_wrapper/2.14.1+galaxy2) %} with the following parameters:
> - {% icon param-file %} *"Protein query sequence(s)"*: `output` (output of **Tabular-to-FASTA** {% icon tool %})
@@ -299,36 +288,29 @@ WHERE (c20 = 'Yes')
> - *"Restrict search of database to a given set of ID's"*: `Taxonomy identifiers (TaxId's)`
> - {% icon param-file %} *"Restrict search of database to list of TaxId's"*: `output` (Input dataset)
>
> ***TODO***: *Check parameter descriptions*
>
> ***TODO***: *Consider adding a comment or tip box*
>
> > <comment-title> short description </comment-title>
> >
> > A comment about the tool or something else. This box can also be in the main text
> {: .comment}
>
{: .hands_on}

***TODO***: *Consider adding a question to test the learners understanding of the previous exercise*

> <question-title></question-title>
>
> 1. Question1?
> 2. Question2?
> 1. What is the significance of using the blastp-short option in this tool?
> 2. Why is the PAM30 matrix used in this step?
>
> > <solution-title></solution-title>
> >
> > 1. Answer for question1
> > 2. Answer for question2
> > 1. The blastp-short option is used because the query protein sequence is shorter than 30 residues, and this setting optimizes the alignment for such sequences.
> > 2. The PAM30 matrix models short evolutionary distances, so it scores near-exact matches strictly, which suits short peptide queries. It is also the matrix that the blastp-short task uses by default.
> >
> {: .solution}
>
{: .question}

## Sub-step with **Query Tabular**
## Refining Data with **Query Tabular**
The Query Tabular tool is used in this workflow to filter and organize data from previous steps. Here, it processes the output of the NCBI BLAST+ blastp tool, which contains the protein sequence alignment results. The tool lets the user run SQL-like queries on the tabular data, so the results can be filtered and sorted by specific criteria. For example, the query selects distinct peptides from the pep table that are not perfectly aligned (i.e., with a sequence identity below 100%) or that have mismatches or gaps in the alignment. The filtered results are then used for further analysis.

> <hands-on-title> Task description </hands-on-title>
The tool helps to refine the data by removing sequences that meet specific criteria, such as perfect alignment, and ensuring that the remaining sequences meet the necessary quality standards for further exploration.
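
The filtering idea can be sketched with Python's `sqlite3` module. The query below is a simplified version written for illustration; it is not the full query used in the hands-on step, which is only partly shown here. The table layouts (`pep` with `id` and `pep`, `blast` with `qseqid`, `pident` and `gapopen`) and all values are assumptions.

```python
import sqlite3


def imperfect_hit_peptides(peptides, blast_hits):
    """Return peptides that have at least one imperfect BLAST hit.

    Simplified sketch: a hit counts as imperfect when its identity is
    below 100% or when it contains at least one gap opening.
    """
    conn = sqlite3.connect(":memory:")
    # Assumed, minimal table layouts for illustration only.
    conn.execute("CREATE TABLE pep (id TEXT, pep TEXT)")
    conn.execute("CREATE TABLE blast (qseqid TEXT, pident REAL, gapopen INTEGER)")
    conn.executemany("INSERT INTO pep VALUES (?, ?)", peptides)
    conn.executemany("INSERT INTO blast VALUES (?, ?, ?)", blast_hits)
    query = """
        SELECT DISTINCT pep.pep
        FROM pep JOIN blast ON pep.id = blast.qseqid
        WHERE blast.pident < 100 OR blast.gapopen >= 1
        ORDER BY pep.pep
    """
    result = [row[0] for row in conn.execute(query)]
    conn.close()
    return result


# Hypothetical peptides and BLAST hits.
peptides = [("q1", "PEPTIDEA"), ("q2", "PEPTIDEB"), ("q3", "PEPTIDEC")]
blast_hits = [
    ("q1", 100.0, 0),  # perfect hit: PEPTIDEA is dropped
    ("q2", 90.9, 0),   # imperfect hit: PEPTIDEB is kept
    ("q2", 88.0, 0),   # second hit for q2; DISTINCT avoids a duplicate row
    ("q3", 100.0, 1),  # full identity but gapped: PEPTIDEC is kept
]
print(imperfect_hit_peptides(peptides, blast_hits))
```

Note that this row-level filter keeps a peptide that has both a perfect and an imperfect hit. A stricter variant would exclude every peptide that has any perfect hit, for example with a `NOT IN` subquery.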

> <hands-on-title> Query Tabular </hands-on-title>
>
> 1. {% tool [Query Tabular](toolshed.g2.bx.psu.edu/repos/iuc/query_tabular/query_tabular/3.3.2) %} with the following parameters:
> - In *"Database Table"*:
@@ -362,41 +344,25 @@ AND (blast.pident < 100
ORDER BY pep.pep`
> - *"include query result column headers"*: `No`
>
> ***TODO***: *Check parameter descriptions*
>
> ***TODO***: *Consider adding a comment or tip box*
>
> > <comment-title> short description </comment-title>
> >
> > A comment about the tool or something else. This box can also be in the main text
> {: .comment}
>
{: .hands_on}

***TODO***: *Consider adding a question to test the learners understanding of the previous exercise*

> <question-title></question-title>
>
> 1. Question1?
> 2. Question2?
> 1. Why is the query filtering out sequences with 100% identity from the BLAST results?
> 2. What is the significance of using SQL-like queries in Query Tabular?
>
> > <solution-title></solution-title>
> >
> > 1. Answer for question1
> > 2. Answer for question2
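> > 1. A peptide that matches a known protein with 100% identity is already represented in the database, so it is not novel. Filtering these out focuses the analysis on peptides that show variability in the alignment, which might indicate novel or significant biological differences worth exploring.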
> > 2. SQL-like queries allow for efficient data manipulation and filtering, enabling users to extract only the most relevant results based on complex conditions (such as mismatches, gaps, or alignment length).
> >
> {: .solution}
>
{: .question}


## Re-arrange

To create the template, each step of the workflow had its own subsection.

***TODO***: *Re-arrange the generated subsections into sections or other subsections.
Consider merging some hands-on boxes to have a meaningful flow of the analyses*

# Conclusion

Sum up the tutorial and the key takeaways here. We encourage adding an overview image of the