Michael G. Campana
Smithsonian Conservation Biology Institute
Ruby implementation of CorrSieve
Original Ruby source code (CorrSieve versions <= 1.6-5) copyright (c) Michael G. Campana, 2010-2011 is licensed under the GNU General Public License (version 3 or later). See included LICENSE file for details.
Public domain updates by Michael G. Campana (2019) to the original Ruby source code (CorrSieve versions >= 1.7-0) are United States government works. These modifications are annotated in the modified source code.
CorrSieve is a Ruby and R package that filters Q value output from the programs STRUCTURE (Pritchard et al. 2000) and INSTRUCT (Gao et al. 2007) by correlation values. It outputs matrices showing significant correlations between individual runs for each K. It can also calculate ΔK (Evanno et al. 2005), mean FSTs and ΔFST. These measures help identify meaningful values of K.
rubyCorrSieve is compatible with Windows, Linux, and UNIX (including macOS) operating systems. rubyCorrSieve requires the Ruby interpreter. Installation files are available at www.ruby-lang.org/en/downloads. Install the appropriate interpreter for your operating system.
Clone this repository to your system. On a Linux/UNIX command line, this can be done using git:
git clone https://github.com/campanam/rubyCorrSieve
You may need to make the CorrSieve-1.7-0.rb file executable. On Linux/UNIX, enter:
chmod +x rubyCorrSieve/*.rb
Move the CorrSieve-1.7-0.rb executable and LICENSE file to your chosen execution directory.
-
Prepare input for CorrSieve. CorrSieve reads directly from STRUCTURE and INSTRUCT output files, but requires that all files be in a single folder. Do not place other files in this folder. All file names should end in ‘_f’ without an extension, e.g.:
TEST_11_f
TEST_12_f
TEST_13_f
TEST_21_f
TEST_22_f
TEST_23_f
-
Launch a terminal window (Linux/Unix) or command prompt (Windows). On Windows, ensure that you launch the command prompt with the Ruby interpreter that came with the installed Ruby package.
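Before running CorrSieve, the input folder described above can be sanity-checked from the same terminal. The following is a minimal sketch, not part of CorrSieve: the helper name `check_input_folder` is hypothetical, and it only checks the ‘_f’ file-name ending, not the file contents.

```ruby
# Hypothetical helper (not part of CorrSieve): split the files in a folder
# into those whose names end in "_f" and everything else.
# Dir.children requires Ruby 2.5 or later.
def check_input_folder(folder)
  files = Dir.children(folder).select { |name| File.file?(File.join(folder, name)) }
  valid, stray = files.partition { |name| name.end_with?('_f') }
  { valid: valid.sort, stray: stray.sort }
end

# Example use (the folder name is a placeholder):
#   result = check_input_folder('my_structure_runs')
#   puts "Input files: #{result[:valid].join(', ')}"
#   warn "Other files present: #{result[:stray].join(', ')}" unless result[:stray].empty?
```

Hidden files such as `.DS_Store` are reported as stray files too, which is useful because the folder should contain nothing but the run output.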
-
Execute the CorrSieve script. If CorrSieve is in your $PATH (Linux/Unix), you can omit the ruby command:
ruby CorrSieve-1.7-0.rb
-
The splash screen will load. Enter 'C' to continue with program execution, 'L' to see the licensing information, or 'X' to exit the program, then press ENTER. Capitalization of command choices does not matter for any CorrSieve prompts.
-
Once you continue execution, the program will prompt you for the path to the folder containing the STRUCTURE or INSTRUCT raw data. If the folder is in the same folder as the script (e.g. both are on the desktop), simply type the name of the folder. Otherwise, type the full file path (e.g. C:\Users\<username>\Desktop on Windows or /Users/<username>/Desktop/ on macOS).
-
The program will ask you to enter the name of the run. This is the name of the output files generated by CorrSieve. Type the name and press ENTER.
-
The program will ask you to enter the path to the folder in which to save the output files. Type the folder path and press ENTER. Pressing ENTER without typing a folder name will place the files in the current directory. If the folder does not exist, CorrSieve will create it.
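The output-folder behaviour described in this step can be illustrated as follows. This is a sketch under stated assumptions, not CorrSieve's own code: the helper name `resolve_output_dir` and its `base` parameter are hypothetical.

```ruby
require 'fileutils'

# Hypothetical helper: turn the text typed at the output-folder prompt
# into a usable directory path.
def resolve_output_dir(answer, base = Dir.pwd)
  name = answer.to_s.strip
  return base if name.empty?            # bare ENTER: use the current directory

  path = File.expand_path(name, base)   # relative names resolve against base
  FileUtils.mkdir_p(path)               # create the folder if it is missing
  path
end

# Example use:
#   out_dir = resolve_output_dir($stdin.gets)
```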
-
The program will ask whether you wish to calculate the Q matrix correlations. Type ‘yes’/‘y’ or ‘no’/‘n’ and press ENTER to continue.
-
If the Q matrix correlations are calculated, the program will ask for the minimum Pearson correlation value (r value) to be considered significant. Enter the appropriate value and press ENTER to continue.
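For reference, the Pearson correlation between two Q columns can be computed as in the sketch below. The Q values and the threshold are made up for illustration, the function name `pearson_r` is hypothetical, and comparing r directly against the threshold (rather than its absolute value) is an assumption about how the minimum r value is applied.

```ruby
# Pearson correlation coefficient between two equal-length numeric arrays.
def pearson_r(x, y)
  raise ArgumentError, 'arrays must have equal length' unless x.length == y.length

  n = x.length.to_f
  mean_x = x.sum / n
  mean_y = y.sum / n
  cov = x.zip(y).sum { |a, b| (a - mean_x) * (b - mean_y) }
  var_x = x.sum { |a| (a - mean_x)**2 }
  var_y = y.sum { |b| (b - mean_y)**2 }
  cov / Math.sqrt(var_x * var_y)
end

# Made-up Q columns (membership coefficients for one cluster) from two runs.
run_a = [0.90, 0.80, 0.10, 0.20]
run_b = [0.85, 0.75, 0.15, 0.25]
threshold = 0.99 # example value only

r = pearson_r(run_a, run_b)
puts "r = #{r.round(4)}, passes threshold: #{r >= threshold}"
```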
-
If the Q matrix correlations are calculated, the program will then ask if you also wish to filter the data by the significance level. Type ‘yes’/‘y’ or ‘no’/‘n’ and press ENTER to continue.
NOTE: The average maximum correlation algorithm ignores non-significant values as potential maximum correlations. The columns-and-rows method filters first by correlation and then again by significance.
-
The program will ask if you wish to estimate the p value or calculate an exact p. Selecting yes will estimate p and prompt you for the number of permutations to use. Selecting no will calculate the exact p. Type ‘yes’/‘y’ or ‘no’/‘n’ and press ENTER to continue.
WARNING: For large data sets, calculating the exact p will be EXTREMELY slow. This should only be used if necessary.
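To show why the exact calculation is so much slower, here is a sketch of one common way to obtain both kinds of p value. It is an assumption that CorrSieve works this way: the function names are hypothetical, the test is one-sided, "exact" is taken to mean enumerating every permutation, and the +1 correction in the estimate is a standard convention rather than something the program is known to use.

```ruby
def pearson_r(x, y)
  n = x.length.to_f
  mx = x.sum / n
  my = y.sum / n
  cov = x.zip(y).sum { |a, b| (a - mx) * (b - my) }
  cov / Math.sqrt(x.sum { |a| (a - mx)**2 } * y.sum { |b| (b - my)**2 })
end

TOLERANCE = 1e-12 # guards against floating-point ties

# Estimated p: shuffle y a fixed number of times and count how often the
# permuted correlation is at least as large as the observed one.
def permutation_p(x, y, permutations: 1000, rng: Random.new)
  observed = pearson_r(x, y)
  hits = permutations.times.count do
    pearson_r(x, y.shuffle(random: rng)) >= observed - TOLERANCE
  end
  (hits + 1.0) / (permutations + 1.0)
end

# Exact p (assumed meaning): visit all n! orderings of y.
# The work grows factorially with the number of individuals.
def exact_p(x, y)
  observed = pearson_r(x, y)
  total = 0
  hits = 0
  y.permutation do |perm|
    total += 1
    hits += 1 if pearson_r(x, perm) >= observed - TOLERANCE
  end
  hits.to_f / total
end
```

With only 10 individuals the exact version already visits 3,628,800 orderings, whereas the estimate visits as many as you request.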
-
If the p-value filter was selected, the program will ask for the maximum p value to be considered significant. Enter the appropriate value and press ENTER to continue.
-
The program will prompt you to decide between the average maximum correlation filter method outlined in Cockram et al. (2008) or the columns-and-rows method described in Campana et al. (2011). Type ‘yes’/‘y’ or ‘no’/‘n’ and press ENTER to continue.
-
The program will then ask if you wish to output the unfiltered correlation matrices. If yes, the program will output the raw correlations (and p-values if selected) in a separate file. Type ‘yes’/‘y’ or ‘no’/‘n’ and press ENTER to continue.
-
The program will ask if you wish to summarise Ln P(D) and calculate ΔK. If yes, this will output these statistics in a separate file. Type ‘yes’/‘y’ or ‘no’/‘n’ and press ENTER to continue.
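As a reminder of what ΔK measures, the sketch below follows the Evanno et al. (2005) idea: the absolute second-order rate of change of Ln P(D) with respect to K, divided by the standard deviation of Ln P(D) across runs at that K. The helper names and the Ln P(D) values are made up, and two details are assumptions that may differ from CorrSieve's output: the second difference is taken on the per-K means, and the sample (n - 1) standard deviation is used.

```ruby
def mean(values)
  values.sum.to_f / values.length
end

# Sample standard deviation (n - 1 denominator).
def sample_sd(values)
  m = mean(values)
  Math.sqrt(values.sum { |v| (v - m)**2 } / (values.length - 1))
end

# ln_pd maps each K to the Ln P(D) values of its replicate runs.
# Returns { K => delta K } for every K that has both neighbours K-1 and K+1.
def delta_k(ln_pd)
  means = ln_pd.transform_values { |runs| mean(runs) }
  ln_pd.keys.sort.each_with_object({}) do |k, out|
    next unless means.key?(k - 1) && means.key?(k + 1)

    second_diff = (means[k + 1] - 2 * means[k] + means[k - 1]).abs
    out[k] = second_diff / sample_sd(ln_pd[k])
  end
end

# Made-up Ln P(D) values, three runs per K.
example = {
  1 => [-1000.0, -1002.0, -998.0],
  2 => [-800.0, -801.0, -799.0],
  3 => [-790.0, -795.0, -785.0],
  4 => [-788.0, -780.0, -796.0]
}
delta_k(example).each { |k, dk| puts "K=#{k}: delta K = #{dk.round(2)}" }
```

ΔK is undefined for the smallest and largest K tested, because each lacks one neighbour.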
-
The program will ask if you wish to calculate FST statistics (and ΔFST). If yes, this will output these statistics in a separate file. Type ‘yes’/‘y’ or ‘no’/‘n’ and press ENTER to continue.
NOTE: FST statistics are only available from STRUCTURE data generated under the admixture model. Output generated in INSTRUCT (even under the admixture model) or under other STRUCTURE models will cause an error.
-
If you opted to calculate FST statistics, the program will ask you to choose an optimization procedure for ordering the FST values, since FST output will not necessarily be in the same cluster order in each run. Option 1 will use no optimization procedure. Option 2 will order the raw FSTs by value, while option 3 will order these data using the matrix correlations.
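The reason ordering matters is that cluster labels can switch between runs, so the first FST in one run need not describe the same cluster as the first FST in another. The sketch below shows one plausible reading of option 2, sorting each run's FSTs by value so that they line up by rank. The numbers are made up and this interpretation is an assumption, not a description of CorrSieve's code.

```ruby
# Each inner array holds one run's per-cluster FST values in the order the
# run reported them (made-up numbers with switched cluster labels).
runs = [
  [0.12, 0.30, 0.05],
  [0.29, 0.06, 0.11],
  [0.04, 0.13, 0.31]
]

# Sort within each run so that position 0 is always the smallest FST, etc.
ordered = runs.map(&:sort)

# Mean FST per rank position across runs.
rank_means = ordered.transpose.map { |vals| vals.sum / vals.length }

ordered.each { |run| puts run.join("\t") }
puts "Means: #{rank_means.map { |m| m.round(3) }.join("\t")}"
```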
-
The program will then process and output the data. The names of the files containing the filtered matrices, the ΔK and Ln P(D) values, the FSTs and the unfiltered correlation matrices will end in “-filtered.txt”, “-deltaK.txt”, “-Fst.txt” and “-matrix_correlations.txt” respectively.
NOTE: The filtered matrices, ΔK and FST output files are tab-delimited text files. They can therefore be directly opened in spreadsheet programs such as Microsoft Excel.
NOTE: For K = 1, STRUCTURE will always generate a Q value of 1.0. This causes a divide by zero error (the meaning of ‘NaN’ in the raw matrix correlations), resulting in a non-significant correlation.
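The K = 1 case can be reproduced in a few lines. This is only an illustration with a made-up Q column; the 0.9 threshold is an example value.

```ruby
# With K = 1 every individual has Q = 1.0, so the column has zero variance.
q = [1.0, 1.0, 1.0, 1.0]
q_mean = q.sum / q.length
sum_sq = q.sum { |v| (v - q_mean)**2 } # 0.0

# Correlating the column with an identical column from another run:
# the covariance term is also 0.0, so r is 0.0 / 0.0.
r = sum_sq / Math.sqrt(sum_sq * sum_sq)

puts r.nan?   # true
puts r >= 0.9 # false: NaN never passes a minimum-r threshold
```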
NOTE: In the FSTs output file, the 'Overall Mean' and 'Overall Standard Deviation' are the mean and standard deviation of all FSTs, ignoring cluster assignment. The 'St. Dev. of Means' is the standard deviation of the mean FSTs of the individual clusters. The 'Mean St. Dev.' is the mean of the standard deviations of the FSTs within individual clusters. The 'St. Dev. of St. Devs.' is the standard deviation of the standard deviations of the FSTs within individual clusters.
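The five summary statistics described in the note above can be sketched as follows. The helper names and FST values are made up, and the use of the sample (n - 1) standard deviation is an assumption; CorrSieve's output may use a different denominator.

```ruby
def mean(values)
  values.sum.to_f / values.length
end

# Sample standard deviation (n - 1 denominator); an assumption.
def sample_sd(values)
  m = mean(values)
  Math.sqrt(values.sum { |v| (v - m)**2 } / (values.length - 1))
end

# fsts maps each cluster label to that cluster's FST values across runs.
def fst_summary(fsts)
  all = fsts.values.flatten
  cluster_means = fsts.values.map { |v| mean(v) }
  cluster_sds = fsts.values.map { |v| sample_sd(v) }
  {
    'Overall Mean' => mean(all),
    'Overall Standard Deviation' => sample_sd(all),
    'St. Dev. of Means' => sample_sd(cluster_means),
    'Mean St. Dev.' => mean(cluster_sds),
    'St. Dev. of St. Devs.' => sample_sd(cluster_sds)
  }
end

# Made-up FSTs for two clusters, two runs each.
summary = fst_summary(1 => [0.1, 0.3], 2 => [0.5, 0.7])
summary.each { |label, value| puts "#{label}\t#{value.round(4)}" }
```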
Please report all bugs (and any suggestions for improvements) to Michael G. Campana (campanam@si.edu).
Campana, M.G. et al. 2011. CorrSieve: software for summarizing and evaluating Structure output. Mol. Ecol. Resour. 11:349-352. https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1755-0998.2010.02917.x
Rita Cannas helpfully checked the method for calculating ΔK. Dent Earl and Michał Żmihorski identified bugs in the software.
Campana, M.G. et al. 2011. CorrSieve: software for summarizing and evaluating Structure output. Mol. Ecol. Resour. 11:349-352. https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1755-0998.2010.02917.x
Cockram et al. 2008. Association mapping of partitioning loci in barley. BMC Genet. 9: 16–29. https://bmcgenet.biomedcentral.com/articles/10.1186/1471-2156-9-16.
Evanno et al. 2005. Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Mol. Ecol. 14: 2611-2620. https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1365-294X.2005.02553.x.
Gao et al. 2007. A Markov Chain Monte Carlo approach for joint inference of population structure and inbreeding rates from multilocus genotype data. Genetics. 176: 1635-1651. https://www.genetics.org/content/176/3/1635.
Pritchard et al. 2000. Inference of population structure using multilocus genotype data. Genetics. 155: 945–959. https://www.genetics.org/content/155/2/945.