Merge pull request #42 from TheJacksonLaboratory/task/add-analysis-to…
…ols-pages Adding image assets and analysis tool pages
Showing 219 changed files with 1,954 additions and 106 deletions.
@@ -0,0 +1,20 @@

## ABBA

Given a set of interesting genes, do other genes have similar relationships to known sets of genes? For example, given a set of genes known to be related to drug abuse, which other genes share similar expression patterns across drug abuse gene sets? Answering this question makes it possible to surface under-studied or obscured genes that may play a role in complex phenotypes.

We have developed a new GeneWeaver tool to address this question, which we call __Anchored Biclique of Biomolecular Associations (ABBA)__. This tool takes advantage of the large volume of collected data and cross-species integration to find new genes for investigation.

The search begins with a user-provided list of genes of interest, such as highly studied genes with known pathways and relationships. The database then finds every gene set that contains at least N of the genes in the provided list. From the resulting list of gene sets, ABBA isolates the genes that occur in at least M of those gene sets but not in the initial list. These resulting genes share similar gene set overlap with the original input set, but may not have been previously considered in relation to the genes of interest.

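The two filtering steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not GeneWeaver's implementation: the function name `abba`, the dictionary layout of `gene_sets`, and the toy gene symbols are all hypothetical, and cross-species homology mapping is left out.

```python
from collections import Counter


def abba(gene_sets, genes_of_interest, n, m):
    """Hypothetical sketch of the ABBA search.

    gene_sets: dict mapping a gene set name to a set of gene symbols.
    Returns (matching gene sets, candidate genes with their occurrence counts).
    """
    anchor = set(genes_of_interest)

    # Step 1: keep the gene sets that contain at least n of the input genes.
    matching = {
        name: genes
        for name, genes in gene_sets.items()
        if len(anchor & genes) >= n
    }

    # Step 2: for every gene outside the input list, count how many of the
    # matching gene sets contain it.
    counts = Counter(g for genes in matching.values() for g in genes - anchor)

    # Step 3: keep the genes that occur in at least m of the matching sets.
    candidates = {gene: c for gene, c in counts.items() if c >= m}
    return matching, candidates


# Toy data, invented for illustration only.
gene_sets = {
    "GS1": {"A", "B", "X", "Y"},
    "GS2": {"A", "B", "C", "X"},
    "GS3": {"A", "X", "Z"},
    "GS4": {"B", "C", "X", "Y"},
}
matching, candidates = abba(gene_sets, ["A", "B", "C"], n=2, m=2)
print(sorted(matching))  # ['GS1', 'GS2', 'GS4']
print(candidates)        # X occurs in 3 matching sets, Y in 2
```

Raising `n` shrinks the list of matching gene sets, and raising `m` shrinks the candidate list, which is why both thresholds matter for keeping results manageable.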

!["ABBA applied to a set of 4 genes of interest"](../assets/images/abba.png)

In the above figure, lighter nodes indicate less overlap. Using N=2 produces a collection of 37 GeneSets as of 7 July 2010. For brevity, only the top 5 results are shown above. With M=15, the following table lists the genes in the result that have similar relationships to the input set.

![](../assets/images/abba_2.png)

Without reasonable thresholds, the results quickly become overwhelming. As of this writing, a simple set of 4 genes of interest results in 555 GeneSets and over 38,000 genes in the candidate list. Increasing the input set to 7 genes of interest results in 983 GeneSets and almost 40,000 genes. Simply requiring gene sets to contain at least 3 of the genes of interest reduces the search space to 11 and 37 GeneSets, respectively.

![](../assets/images/abba_3.png)
@@ -0,0 +1,91 @@

## Boolean Algebra

The Boolean Algebra Tool performs basic set operations on two or more Gene Sets.
Results are displayed as lists of genes produced by one of three set operations:
Union, Intersection, and Symmetric Difference. The results let users quickly identify
new relationships between Gene Sets and create a new Gene Set based on set-derived
findings.

### Using the Boolean Algebra Tool

Access the Boolean Algebra Tool through
the [Analyze Gene Sets](index.md#analyze-gene-sets-tab) tab, located in the left-hand
column and distinguished by the Venn diagram icon.

![](../assets/images/boolean_algebra_options.png)

To generate Boolean Algebra results, select either a Project containing two or more
Gene Sets or at least two individual Gene Sets from a project. Next, select the
appropriate Boolean Algebra function. These functions are based on basic _Set Algebra_:
**Union**, **Intersection**, and **Symmetric Difference**.

* **Union**: This option generates a set of all genes found in any of the selected Gene
Sets. It removes duplicates by default. The results display which homology mapping was
used to generate each gene entry.

This result shows the union of three Gene Sets, two mouse and one human.

![](../assets/images/boolean_algebra_union.png)

* **Intersection**: This option returns the genes that the selected Gene Sets have in
common. It has an additional option (_"Genes must intersect in at least X"_) that
specifies the minimum number of Gene Sets a gene must occur in to be returned. If the
minimum overlap is set to _3_, for example, only genes found in 3 or more of the
selected Gene Sets are returned. In addition, results are divided into separate groups
based on the number of Gene Sets in which the genes intersect.

These three Gene Sets have 4 genes in common. All of them are homologs between mouse and
human.

![](../assets/images/boolean_algebra_intersect.png)

Changing the overlap to 2 created two sets of results: genes found in all 3 Gene Sets
and genes found in only 2 of the Gene Sets.

![](../assets/images/boolean_algebra_intersect3.png)

* **Symmetric Difference**: This option creates a set of genes that are unique to the
Gene Sets selected as input. It effectively finds the Union of all Gene Sets minus the
intersection of those Gene Sets.

In this example, there is a result set of unique genes for each input Gene Set.

![](../assets/images/boolean_algebra_except.png)

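The three operations above map closely onto Python's built-in `set` type. The sketch below is only an illustration under stated assumptions: the function names and gene symbols are hypothetical, each Gene Set is assumed to be already mapped to a common species (no homology handling), and the Symmetric Difference is modelled as the genes unique to each input set, following the example above. For exactly two sets this is the same as Python's `a ^ b`.

```python
from collections import Counter


def union(sets):
    """All genes found in any input set; duplicates collapse automatically."""
    return set().union(*sets)


def intersect_at_least(sets, x):
    """Group genes by how many input sets contain them, keeping counts >= x."""
    counts = Counter(gene for s in sets for gene in s)
    groups = {}
    for gene, count in counts.items():
        if count >= x:
            groups.setdefault(count, set()).add(gene)
    return groups


def unique_per_set(sets):
    """For each input set, the genes that occur in no other input set."""
    counts = Counter(gene for s in sets for gene in s)
    return [{gene for gene in s if counts[gene] == 1} for s in sets]


# Toy Gene Sets, invented for illustration only.
a = {"Drd2", "Bdnf", "Comt", "Fos"}
b = {"Drd2", "Bdnf", "Oprm1"}
c = {"Drd2", "Comt", "Creb1"}

all_genes = union([a, b, c])
groups = intersect_at_least([a, b, c], x=2)
unique = unique_per_set([a, b, c])

print(len(all_genes))  # 6 distinct genes across the three sets
print(groups)          # one group for genes in 3 sets, one for genes in 2 sets
print(unique)          # one set of unique genes per input set
```

Setting `x` to the number of input sets gives the strict intersection, while lower values produce the additional result groups described above.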

### Managing Results

A table located just below the circle overlap diagram and above the results gives a
broad survey of the genes included in the input Gene Sets, categorized by species. It
lists: _Genes Specific to Species_, _Genes In Common with at Least One Other Species_,
and _Total Number of Genes_. These values are based on the total number of genes in the
input sets, and may not specifically represent results. The table is intended to help
you choose the species to which the results are mapped when new Gene Sets are created.

![](../assets/images/boolean_algebra_table.png)

Genes returned by the Boolean Algebra tool can be added to new Gene Sets. To do this,
click on the **Create New Gene Set From Results** button for the group you want.

Since results can contain genes from a mixed set of species, a species must be selected
for mapping the genes in the new Gene Set.

![](../assets/images/boolean_algebra_select_species.png)

The standard Upload GeneSet page will open. The genes will be listed in the gene
information section. If no species is selected, no genes will be listed. You can now
edit any of the fields to change the Gene Set name, description, etc. Follow
the [Upload GeneSet](#uploading-gene-sets) procedure. Note that very large gene lists
may take a few moments to load, during which time a dimmed 'Loading' screen may appear.

### Circle Overlap Diagram

If the user selects 10 or fewer Gene Sets, a gene overlap diagram will appear near the
top of the results page. The **Circle Overlap** representation is an approximation of
Euler fractional overlaps and shows how the input Gene Sets relate to each other. It
uses the same homology mapping as the Boolean Algebra tool to render the approximate
fractional overlap of the genes shared between each set.

![](../assets/images/bool_image.png)
@@ -0,0 +1,239 @@

**Clustering**
==============

Why Use the Clustering Tool
----------

Clustering is one of the most powerful tools in bioinformatics. Where strict classifications are too rigid to distinguish the data, clustering gives the user a looser evaluation based on similarity.

### Using the Clustering Tool

1. Select the gene sets from your list of projects that you would like
   to analyze.
    - You need a minimum of 3 gene sets in total to run the tool.

2. Select whether homology is to be included or excluded.
    - Homology is included by default.

3. Select the method of clustering.
    - Average is the default method of clustering.
    - There are five methods of clustering. They are listed in the
      Clustering Methods section.

### Understanding your Results

#### Visualization Types

There are two methods for visualizing your clustering results.

**Force Directed Graph**

![](../assets/images/Forced-directed-graph.png "fig:Forced-directed-graph.png")

- Tree representation of each cluster.
- Clear depiction of hierarchy.
- The most opaque node of a tree represents the cluster's root.

- Each node is classified as one of the following:
    - **Cluster** - Grouping of gene sets
        - The opacity of the node is based on the Jaccard Similarity of its children. The more similar the gene sets, the darker the cluster.
        - On Hover: Reveals Jaccard Similarity of its child nodes. Reveals set notation of the containing hierarchy.

          ![](../assets/images/Cluster-onHover.png "fig:Cluster-onHover.png")

        - On Click: Collapses (absorbs its children).

          ![](../assets/images/Cluster-onClick.png "fig:Cluster-onClick.png")

    - **Gene Set** - A set of genes
        - Colored based on the species contained in the gene set study.
        - Sized based on the relative size of the gene set.
        - On Hover: Reveals abbreviated gene set information.
        - On Click: Reveals and cycles through genes in groups of ten.
        - On Double Click: Opens a new page containing extensive gene set information.
    - **Gene**
        - On Hover: Reveals the name of the gene.

- **Edges**
    - Connect each node to its children.
    - The opacity of edges leading from cluster nodes is based on the cluster node's Jaccard Similarity, following the same scale as above.

**Partitioned Sunburst**

![](../assets/images/Partitioned-sunburst.png "fig:Partitioned-sunburst.png")

- Top-down view of each tree.
- Center represents the root.
- Partitioned sub-circles represent a cluster, gene set, or gene.

- **Partition**
    - Partitions are the equivalent of nodes in a tree.
    - Each partition is classified as one of the following:
        - **Cluster** - Grouping of gene sets
            - On Hover: Reveals Jaccard Similarity of its child
              partitions and highlights all nodes within the cluster.
            - On Right Click: Opens a new "View GeneSet Overlap" page
              using all gene sets in the cluster as input.
        - **Gene Set** - A set of genes
            - Colored based on the species contained in the gene
              set study.
            - Drawn arc sizes are based on the relative size of the
              gene set.
            - On Right Click: Opens a new "View GeneSet Details" page for the
              given gene set.
- **Rings**
    - Each ring represents a level in the tree.
    - The outermost levels are gene sets.
    - The levels leading up to a gene set represent the hierarchy of
      the cluster.

Clustering Methods
------------------

Listed below are the six different methods that the user can choose from
while running the tool. The first five are different clustering methods
that will run on the selected gene sets and display a force directed tree
and a partitioned sunburst based on the clustered gene sets.

All five of the given clustering methods are agglomerative hierarchical
clustering methods that start with each gene set belonging to its own
cluster. They then combine the clusters at each iteration based on a
linkage method that determines how the distance between two
clusters is defined. The clusters are combined until there are no more
clusters that are similar to each other (the distance between them is
too large). Throughout this section, the "distance" between two clusters
is expressed as a similarity score, so the pair with the highest score is
the pair that is joined next.

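Every method in this section starts from pairwise Jaccard Similarity scores, that is, the size of the intersection of two gene sets divided by the size of their union. The following sketch shows that starting point only; the function names, the `frozenset`-keyed dictionary, and the toy gene sets are assumptions made for illustration and say nothing about how the tool stores its scores.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard Similarity: |A intersect B| / |A union B| (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_table(gene_sets):
    """Pairwise similarity for every unordered pair of gene set names."""
    return {
        frozenset((i, j)): jaccard(gene_sets[i], gene_sets[j])
        for i, j in combinations(gene_sets, 2)
    }


# Toy gene sets, invented for illustration only.
gene_sets = {
    "GS1": {"A", "B", "C"},
    "GS2": {"B", "C", "D"},
    "GS3": {"X", "Y"},
}
sim = similarity_table(gene_sets)

# The most similar pair is the first candidate for merging.
closest = max(sim, key=sim.get)
print(sorted(closest), sim[closest])  # ['GS1', 'GS2'] 0.5
```

The linkage methods below differ only in how they derive a score for clusters that already contain more than one gene set.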

### McQuitty

The McQuitty clustering method uses a linkage method where distance
depends on the combination of clusters instead of the individual
gene sets within each cluster. When two clusters are joined together, the
distance from the new cluster to any other cluster is calculated as the
average of the distances from each of the two joined clusters to that
other cluster. For example, if clusters 2 and 4 have the greatest
similarity and we are going to combine them into a new cluster called
2+4, then the distance from 2+4 to 1 is the average of the distances
from 2 to 1 and 4 to 1.

- **Algorithm**
    - Each gene set is initialized as its own cluster.
    - The initial similarity between each pair of clusters is the Jaccard
      Similarity of the two gene sets.
    - While we still have similar clusters:
        - The clusters with the highest similarity are clustered together.
        - The similarity between the new cluster and all the rest is
          calculated based on the McQuitty linkage method.
- **Time Complexity**
    - O(n^2^ log n)
    - This method is the most time efficient.

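The McQuitty update for the "2+4" example above can be sketched as follows. This is a hedged illustration, not the tool's code: the nested-dictionary layout of `sim`, the function name `mcquitty_merge`, and the similarity values are all assumptions.

```python
def mcquitty_merge(sim, i, j):
    """Merge clusters i and j in place and return the new cluster's label.

    sim is a symmetric nested dict: sim[a][b] == sim[b][a] is the
    similarity between clusters a and b.
    """
    new = f"{i}+{j}"
    others = [k for k in sim if k not in (i, j)]
    sim[new] = {}
    for k in others:
        # McQuitty linkage: average the two old scores, ignoring how many
        # gene sets each of the merged clusters contains.
        value = (sim[i][k] + sim[j][k]) / 2
        sim[new][k] = value
        sim[k][new] = value
        del sim[k][i], sim[k][j]
    del sim[i], sim[j]
    return new


# Toy similarities for clusters 1, 2 and 4 (values invented for illustration).
sim = {
    "1": {"2": 0.25, "4": 0.75},
    "2": {"1": 0.25, "4": 0.875},
    "4": {"1": 0.75, "2": 0.875},
}
label = mcquitty_merge(sim, "2", "4")
print(label, sim[label]["1"])  # 2+4 0.5
```

Because the update needs only the two previous scores, no gene-level comparison is repeated after the initial similarity table is built.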

### Ward

The Ward clustering method uses a linkage method where the distance
between two clusters is based on the Jaccard Similarity score
between them. When two clusters are joined together, the new cluster
takes the union of the gene sets in the two clusters that are being
joined and uses that as its gene set. It then calculates the new
gene set's similarity score against each of the other clusters' gene sets,
and that score is set as the distance between the new cluster and each of
the other clusters.

- **Algorithm**
    - Each gene set is initialized as its own cluster.
    - The initial distance between clusters is the Jaccard Similarity
      score between each of the clusters' gene sets.
    - While we have clusters that are similar to each other:
        - The clusters with the highest similarity are clustered together.
        - The new cluster contains a gene set which is the union of its
          children's gene sets.
        - The Jaccard Similarity score between the new cluster and all
          the other clusters is recalculated.
- **Time Complexity**
    - O(n^3^)

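The merge step described above, where the new cluster's gene set is the union of its children and the scores are recomputed, can be sketched like this. The sketch follows the description in this section only; the function name `union_merge`, the dictionary of clusters, and the toy data are assumptions.

```python
def jaccard(a, b):
    """Jaccard Similarity: |A intersect B| / |A union B| (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def union_merge(clusters, i, j):
    """Merge clusters i and j in place.

    clusters maps a cluster label to the combined gene set of that cluster.
    Returns the new label and its similarity to every remaining cluster.
    """
    merged = clusters.pop(i) | clusters.pop(j)
    new = f"{i}+{j}"
    # Recompute the similarity of the merged gene set against every other
    # cluster; this recalculation is repeated after each merge.
    sims = {k: jaccard(merged, genes) for k, genes in clusters.items()}
    clusters[new] = merged
    return new, sims


# Toy clusters, invented for illustration only.
clusters = {
    "GS1": {"A", "B"},
    "GS2": {"B", "C"},
    "GS3": {"A", "B", "C", "D"},
}
label, sims = union_merge(clusters, "GS1", "GS2")
print(label, sims)  # GS1+GS2 {'GS3': 0.75}
```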

### Complete

The Complete clustering method uses a linkage method where the distance
between two clusters is the lowest similarity score between any of the
gene sets in one cluster compared to any of the gene sets in the other
cluster. When two clusters are combined, the gene sets within each of the
clusters are put into a new cluster. No new calculations are needed at
each iteration because we are simply reusing the similarity scores of
all the gene sets compared to each other.

- **Algorithm**
    - Each gene set is initialized as its own cluster.
    - The similarity scores of all the gene sets compared to each
      other are saved in a matrix.
    - While we still have clusters that are similar:
        - Determine which two clusters to join:
            - The distance between two clusters is the lowest
              similarity score between a gene set in one cluster and a
              gene set in the other cluster.
            - The highest of these distances determines which two
              clusters will be joined.
        - Combine the two clusters to create a new cluster that has
          all the gene sets that were present in the two child
          clusters.
- **Time Complexity**
    - O(n^3^)

### Average

The Average clustering method uses a linkage method where the distance
between two clusters is the average similarity score between all of the
gene sets in one cluster compared to all of the gene sets in the other
cluster. When two clusters are combined, the gene sets within each of the
clusters are put into a new cluster. No new calculations are needed at
each iteration because we are simply reusing the similarity scores of
all the gene sets compared to each other.

- **Algorithm**
    - Each gene set is initialized as its own cluster.
    - The similarity scores of all the gene sets compared to each
      other are saved in a matrix.
    - While we still have clusters that are similar:
        - Determine which two clusters to join:
            - The distance between two clusters is the average
              similarity score between every gene set in one cluster
              and every gene set in the other cluster.
            - The highest of these distances determines which two
              clusters will be joined.
        - Combine the two clusters to create a new cluster that has
          all the gene sets that were present in the two child
          clusters.
- **Time Complexity**
    - O(n^3^)

### Single

The Single clustering method uses a linkage method where the distance
between two clusters is the highest similarity score between any of the
gene sets in one cluster compared to any of the gene sets in the other
cluster. When two clusters are combined, the gene sets within each of the
clusters are put into a new cluster. No new calculations are needed at
each iteration because we are simply reusing the similarity scores of
all the gene sets compared to each other.

- **Algorithm**
    - Each gene set is initialized as its own cluster.
    - The similarity scores of all the gene sets compared to each
      other are saved in a matrix.
    - While we still have clusters that are similar:
        - Determine which two clusters to join:
            - The distance between two clusters is the highest
              similarity score between any gene set in one cluster and
              any gene set in the other cluster.
            - The highest of these distances determines which two
              clusters will be joined.
        - Combine the two clusters to create a new cluster that has
          all the gene sets that were present in the two child
          clusters.
- **Time Complexity**
    - O(n^3^)
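
The Complete, Average, and Single methods share the same loop and differ only in how the stored gene set scores are aggregated: lowest, mean, or highest. The sketch below illustrates that shared loop under stated assumptions. The names `agglomerate`, `LINKAGES`, and `linkage_score`, the `threshold` stopping parameter, and the toy gene sets are hypothetical and are not taken from the tool.

```python
from itertools import combinations, product
from statistics import mean


def jaccard(a, b):
    """Jaccard Similarity: |A intersect B| / |A union B| (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Linkage = how gene-set-level scores are aggregated into a cluster-level score.
LINKAGES = {"complete": min, "average": mean, "single": max}


def linkage_score(a, b, sim, aggregate):
    """Score two clusters (tuples of gene set names) from the stored matrix."""
    return aggregate([sim[frozenset((x, y))] for x, y in product(a, b)])


def agglomerate(gene_sets, method="average", threshold=0.0):
    """Merge clusters until no pair scores above the (assumed) threshold."""
    aggregate = LINKAGES[method]
    # Pairwise scores are computed once and reused at every iteration.
    sim = {
        frozenset((i, j)): jaccard(gene_sets[i], gene_sets[j])
        for i, j in combinations(gene_sets, 2)
    }
    clusters = [(name,) for name in gene_sets]
    merges = []
    while len(clusters) > 1:
        a, b = max(
            combinations(clusters, 2),
            key=lambda pair: linkage_score(pair[0], pair[1], sim, aggregate),
        )
        score = linkage_score(a, b, sim, aggregate)
        if score <= threshold:
            break  # remaining clusters are not similar enough to join
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b, score))
    return clusters, merges


# Toy gene sets, invented for illustration only.
gene_sets = {
    "GS1": {"A", "B", "C", "D"},
    "GS2": {"A", "B", "C"},
    "GS3": {"C", "D", "E"},
    "GS4": {"X", "Y"},
}
for method in LINKAGES:
    clusters, merges = agglomerate(gene_sets, method=method)
    print(method, clusters, [round(score, 2) for _, _, score in merges])
```

With this toy input all three methods produce the same grouping and differ only in the score recorded for the second merge, which makes the effect of the linkage choice easy to see.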