Summary, inputs and outputs of scripts
Scripts are classified by step (see Table of contents) and programming language (Bash, Python, Julia, R).
- Filter raw data
- Georeferenced sequence alignments by species
- Species sequence pairwise comparison
- Genetic Diversity calculation
- Statistical analysis
- Taxonomy and habitat attributed to each individual sequence
- filter_raw_data.sh : it keeps only the CO1 sequences with lat/lon information.
- input :
- seqbold_data.tsv : georeferenced barcode sequences from the supergroup "actinopterygii" from BOLD
- output :
- co1_ssll_seqbold_data.tsv : table of filtered CO1 sequences with lat/lon and BOLD's taxonomy information
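
As a rough illustration of this filtering step (not the contents of filter_raw_data.sh), the Python sketch below keeps only the rows that carry the CO1 marker and have both latitude and longitude filled in. The column names `markercode`, `lat` and `lon` and the marker label `COI-5P` are assumptions about the BOLD export and may need adjusting to the actual table.

```python
import csv
import io


def filter_co1_with_coords(tsv_text, marker="COI-5P"):
    """Return rows (as dicts) that carry the given marker and non-empty lat/lon.

    Column names 'markercode', 'lat' and 'lon' are assumed, not taken from the script.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        # Both coordinates must be present and non-blank.
        has_coords = (row.get("lat") or "").strip() and (row.get("lon") or "").strip()
        if row.get("markercode") == marker and has_coords:
            kept.append(row)
    return kept
```
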
- get_geonames_coordinates.sh : uses geonames.org to assign lat/lon coordinates to individual sequences from their textual location information when lat/lon is missing.
- lat_long_DMS_DD_converter.py : converts the given coordinates from DMS (degrees, minutes, seconds) format to DD (decimal degrees) format.
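
The DMS to DD conversion follows the usual relation DD = D + M/60 + S/3600, with southern and western coordinates made negative. A minimal Python sketch of that arithmetic is given here; the function name and its argument layout are hypothetical and do not describe the interface of lat_long_DMS_DD_converter.py.

```python
def dms_to_dd(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds plus a hemisphere letter (N, S, E, W) to decimal degrees."""
    hemi = hemisphere.upper()
    if hemi not in {"N", "S", "E", "W"}:
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    dd = abs(degrees) + minutes / 60 + seconds / 3600
    # Southern and western hemispheres are negative in decimal degrees.
    return -dd if hemi in {"S", "W"} else dd
```

For example, `dms_to_dd(2, 30, 0, "W")` gives `-2.5`.
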
- seq_alnt_filtered_data.sh : aligns sequences from the same species with MUSCLE and creates a .coords file with the coordinates of each sequence.
- inputs :
- co1_ssll_seqbold_data.tsv : table of filtered CO1 sequences with lat/lon
- outputs :
- {species}.fasta : alignment file for each {species}
- {species}.coords : lat/lon coordinates of each individual sequence of each {species}
- cluster_freshwater_vs_marine.sh : moves the fasta and coords files into marine and freshwater directories according to a list of marine species.
- inputs :
- 05-species_alnt : species fasta and coordinates files
- marine_actinopterygii_species.txt : list of "actinopterygii" saltwater species according to fishbase
- outputs :
- 06-species_alnt_cluster/freshwater : freshwater species fasta & coordinates files
- 06-species_alnt_cluster/marine : marine species fasta & coordinates files
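
The sorting logic of this step can be sketched in Python as a simple membership test against the marine species list. This is an illustration only: it assumes that any species absent from the marine list is treated as freshwater, and the species names used with it are placeholders.

```python
def split_by_habitat(species_names, marine_species):
    """Partition species names into marine and freshwater lists.

    A species is treated as marine if it appears in marine_species;
    everything else is treated as freshwater (an assumption of this sketch).
    """
    marine_set = set(marine_species)
    groups = {"marine": [], "freshwater": []}
    for name in species_names:
        key = "marine" if name in marine_set else "freshwater"
        groups[key].append(name)
    return groups
```

The resulting lists could then drive the move of each `{species}.fasta` and `{species}.coords` file into the matching directory.
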
- fasta_coords_files_species_generator.py : extracts sequences and associated coordinates from the filtered data.
- input :
- co1_ssll_seqbold_data.tsv : table of filtered CO1 sequences with lat/lon
- outputs :
- {species}.fasta : alignment file for each {species}
- {species}.coords : lat/lon coordinates of each individual sequence of each {species}
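
A minimal Python sketch of the per-species split is shown here. The input tuple layout and the tab-separated layout of the coordinates text are hypothetical, chosen only for the example; they are not taken from fasta_coords_files_species_generator.py.

```python
from collections import defaultdict


def group_by_species(records):
    """Group records into per-species FASTA and coordinate text.

    records: iterable of (species, seq_id, sequence, lat, lon) tuples.
    This tuple layout is hypothetical, chosen only for the example.
    """
    fasta = defaultdict(list)
    coords = defaultdict(list)
    for species, seq_id, sequence, lat, lon in records:
        fasta[species].append(f">{seq_id}\n{sequence}\n")
        coords[species].append(f"{seq_id}\t{lat}\t{lon}\n")
    # One text blob per species, ready to be written to {species}.fasta / {species}.coords.
    return (
        {sp: "".join(chunks) for sp, chunks in fasta.items()},
        {sp: "".join(chunks) for sp, chunks in coords.items()},
    )
```
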
- equalareacoords.R : assigns to each individual sequence, from its coordinates, the ID of the cell that contains it in the world map equal-area grid shapefile.
- inputs :
- 06-species_alnt_cluster/freshwater/{species}.coords : coordinates files by freshwater {species} species
- 06-species_alnt_cluster/marine/{species}.coords : coordinates files by marine {species} species
- grid_equalarea200km : shapefile of worldmap equal area projection epsg:4326 with nested equal area grids (cell sizes of 200km)
- outputs :
- 06-species_alnt_cluster/freshwater/{species}.equalareacoords : cells of freshwater {species} species
- 06-species_alnt_cluster/marine/{species}.equalareacoords : cells of marine {species} species
- Lib_Compare_Pairwise.jl : functions to compute the Genetic Diversity value from a set of sequences.
- Lib_Create_Master_Matrices.jl : functions to create master data matrices that are used to compute genetic diversity.
- master_matrices.jl : generates master data matrices from species sequences alignments.
- input :
- 06-species_alnt_cluster : .fasta, .equalareacoords files for each species
- output :
- 07-master_matrices : individual sequences pairwise comparison data matrices for each species for each cell
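
The exact metric implemented in Lib_Compare_Pairwise.jl is not described here. As an assumption, the Python sketch below uses one common definition of genetic diversity for a set of aligned sequences: the mean proportion of differing sites over all pairs of sequences, skipping sites with a gap or an ambiguous base. It is meant only to illustrate what a pairwise comparison computes, not to reproduce the Julia functions.

```python
from itertools import combinations


def pairwise_diversity(aligned_seqs, skip_chars="-N"):
    """Mean proportion of differing sites over all pairs of aligned sequences.

    Sites where either sequence has a gap or ambiguous base are skipped.
    Returns None when fewer than two sequences (or no comparable sites) are given.
    """
    per_pair = []
    for a, b in combinations(aligned_seqs, 2):
        compared = differing = 0
        for x, y in zip(a.upper(), b.upper()):
            if x in skip_chars or y in skip_chars:
                continue
            compared += 1
            differing += x != y
        if compared:
            per_pair.append(differing / compared)
    return sum(per_pair) / len(per_pair) if per_pair else None
```
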
- gdval_by_cell.sh : generates CSV files with 2 columns : cell ID and mean genetic diversity per species in the cell.
- Lib_GD_summary_functions.jl : functions to calculate genetic diversity at species level and cell level
- equalarea_numbers.jl : assigns a mean genetic diversity to each equal-area grid cell. Genetic diversity is calculated from the master data matrices.
- input :
- 07-master_matrices : individual sequences pairwise comparison data matrices for each species for each cell
- output :
- equalarea_numbers.csv : genetic diversity by cell
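
Averaging per-species values within each cell can be sketched in Python as a group-by followed by a mean. The input layout (one `(cell_id, species, gd_value)` tuple per species and cell) is an assumption made for the example, not the format of the master matrices.

```python
from collections import defaultdict
from statistics import mean


def mean_gd_by_cell(species_cell_values):
    """species_cell_values: iterable of (cell_id, species, gd_value) tuples.

    Returns {cell_id: mean of the per-species values in that cell}.
    """
    by_cell = defaultdict(list)
    for cell_id, _species, gd in species_cell_values:
        by_cell[cell_id].append(gd)
    return {cell: mean(values) for cell, values in by_cell.items()}
```
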
- metrics_by_area_and_species.jl : generates the files used for statistical analysis in the next step : mean genetic diversity per cell, genetic diversity per species per cell, number of individuals per species, number of species per cell, cell coordinates, cell ID...
- input :
- equalarea_numbers.csv : genetic diversity by cell
- gdval_by_area.csv : CSV files with 2 columns : cell ID and mean genetic diversity per species in the cell
- outputs :
- metrics_by_area_marine.csv : table of ID_cell,ISO3,is_sea,cloMeanVal,cloMinVal,cloMaxVal,bathyVal,AP,HDI_2015,fshD information attributed to each cell with marine species
- metrics_by_area_freshwater.csv : table of ID_cell,ISO3,is_sea,cloMeanVal,cloMinVal,cloMaxVal,bathyVal,AP,HDI_2015,fshD information attributed to each cell with freshwater species
- latband_numbers.jl : assigns a mean genetic diversity to each latitudinal band.
- input :
- pairwise_latbands.csv : genetic diversity per species by 10° latitudinal band
- outputs :
- 08-genetic_diversity/marine_latbands_numbers.csv : table of ID_latband,ISO3,is_sea,cloMeanVal,cloMinVal,cloMaxVal,bathyVal,AP,HDI_2015,fshD information attributed to each latitudinal band with marine species
- 08-genetic_diversity/freshwater_latbands_numbers.csv : table of ID_latband,ISO3,is_sea,cloMeanVal,cloMinVal,cloMaxVal,bathyVal,AP,HDI_2015,fshD information attributed to each latitudinal band with freshwater species
- marine_latbands_bootstraps.csv : table of ID_latband,ISO3,is_sea,cloMeanVal,cloMinVal,cloMaxVal,bathyVal,AP,HDI_2015,fshD information attributed to each latitudinal band with marine species
- freshwater_latbands_bootstraps.csv : table of ID_latband,ISO3,is_sea,cloMeanVal,cloMinVal,cloMaxVal,bathyVal,AP,HDI_2015,fshD information attributed to each latitudinal band with freshwater species
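
Assigning a latitude to a 10° band can be sketched in Python as below. The convention used (bands closed on their southern edge, with latitude 90 placed in the last band) is an assumption of this example and may differ from the one used in latband_numbers.jl.

```python
import math


def latitude_band(lat, width=10):
    """Return the (lower, upper) bounds of the latitudinal band containing lat."""
    if not -90 <= lat <= 90:
        raise ValueError(f"latitude out of range: {lat}")
    lower = math.floor(lat / width) * width
    # Put the north pole in the last band instead of opening a new one.
    if lower >= 90:
        lower = 90 - width
    return lower, lower + width
```
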
- Lib_bootstrap.jl : functions to bootstrap latitudinal band genetic diversity by species
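
A bootstrap by species can be sketched in Python as repeated resampling, with replacement, of the per-species diversity values of one band, keeping the mean of each resample. The function name, the number of replicates and the seed are illustrative assumptions and do not describe the functions in Lib_bootstrap.jl.

```python
import random
from statistics import mean


def bootstrap_means(values, n_boot=1000, seed=0):
    """Resample per-species values with replacement and return the mean of each resample."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    return [mean(rng.choices(values, k=n)) for _ in range(n_boot)]
```

The spread of the returned means gives an idea of how sensitive a band's mean genetic diversity is to the set of species sampled.
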
- functions.R : library of R functions required to run the figure scripts.
- descripteurs.R : from genetic, geographic and environmental data, generates a table with cells as rows and genetic, environmental and geographic variables as columns.
- inputs :
- metrics_by_area_freshwater.csv : table of ID_cell,ISO3,is_sea,cloMeanVal,cloMinVal,cloMaxVal,bathyVal,AP,HDI_2015,fshD information attributed to each cell with freshwater species
- metrics_by_area_marine.csv : table of ID_cell,ISO3,is_sea,cloMeanVal,cloMinVal,cloMaxVal,bathyVal,AP,HDI_2015,fshD information attributed to each cell with marine species
- marine_bo_o2dis.asc : spatial layer of marine oxygen concentration [mol/l]
- marine_bo_sst_mean.asc : spatial layer of sea surface temperature
- marine_velocity_mean.asc : spatial layer of velocity (marine)
- freshwater_wc2.0_bio_10m_01.tif : spatial layer of global mean temperature
- freshwater_velocity_mean.tif : spatial layer of velocity (freshwater)
- datatoFigshare : shapefile of drainage basins
- datacell_grid_descriteurs.csv : table of ID_cell,ISO3,is_sea,cloMeanVal,cloMinVal,cloMaxVal,bathyVal,AP,HDI_2015,fshD information attributed to each cell
- output :
- total_data_genetic_diversity_with_all_descripteurs.tsv : table of cell-center (xy) coordinates, cell ID, mean genetic diversity, number of species, mean/sd of the number of individuals per species in each cell, bathymetry, chlorophyll concentration, oxygen concentration, temperature, and drainage basin surface area in each cell
- figure1.R : from the table of genetic, environmental and geographic variables by cell and shapefiles, generates maps of the global distribution of genetic diversity as a TIFF file.
- inputs :
- total_data_genetic_diversity_with_all_descripteurs.tsv : table of cell-center (xy) coordinates, cell ID, mean genetic diversity, number of species, mean/sd of the number of individuals per species in each cell, bathymetry, chlorophyll concentration, oxygen concentration, temperature, and drainage basin surface area in each cell
- grid_equalarea200km : shapefile of worldmap equal area projection epsg:4326 with nested equal area grids (cell sizes of 200km)
- ne_50m_land : shapefile of worldcoast from (http://www.naturalearthdata.com)
- ne_50m_rivers_lake_centerlines_scale_rank : shapefile of riverlines from (http://www.naturalearthdata.com)
- GSHHS_h_L2.shp : shape polygon file of big lakes from (http://www.naturalearthdata.com)
- output :
- figure2.R : from the table of genetic, environmental and geographic variables by cell and species diversity, generates figures of the congruence between fish genetic diversity and species diversity.
- inputs :
- total_data_genetic_diversity_with_all_descripteurs.tsv
- grid_equalarea200km : shapefile of worldmap equal area projection epsg:4326 with nested equal area grids (cell sizes of 200km)
- ne_50m_land : shapefile of worldcoast
- ne_50m_rivers_lake_centerlines_scale_rank : shapefile of riverlines
- GSHHS_h_L2.shp : shape polygon file of big lakes
- equalarea_id_coordsCA_FWRS_MR_RS.csv : Table of species diversity by 200km square cell
- output :
- figure3.R : from the table of genetic, environmental and geographic variables by cell, generates figures of the determinants of the patterns of fish genetic diversity.
- inputs :
- total_data_genetic_diversity_with_all_descripteurs.tsv
- grid_equalarea200km : shapefile of worldmap equal area projection epsg:4326 with nested equal area grids (cell sizes of 200km)
- EnvFreshwater.csv : slope and flow information for each geographical cell with a river
- distanceCote : distance from shore for each cell
- output :
- figureS1.R : from the table of genetic, environmental and geographic variables by cell and shapefiles, generates the spatial autocorrelogram figure based on Moran's I coefficient.
- inputs :
- total_data_genetic_diversity_with_all_descripteurs.tsv : table of cell-center (xy) coordinates, cell ID, mean genetic diversity, number of species, mean/sd of the number of individuals per species in each cell, bathymetry, chlorophyll concentration, oxygen concentration, temperature, and drainage basin surface area in each cell
- grid_equalarea200km : shapefile of worldmap equal area projection epsg:4326 with nested equal area grids (cell sizes of 200km)
- output :
- figureS1.pdf
- figureS2.R : from the table of genetic, environmental and geographic variables by cell, generates a figure of the global distribution of the higher and lower percentiles.
- inputs :
- total_data_genetic_diversity_with_all_descripteurs.tsv : table of cell-center (xy) coordinates, cell ID, mean genetic diversity, number of species, mean/sd of the number of individuals per species in each cell, bathymetry, chlorophyll concentration, oxygen concentration, temperature, and drainage basin surface area in each cell
- grid_equalarea200km : shapefile of worldmap equal area projection epsg:4326 with nested equal area grids (cell sizes of 200km)
- ne_50m_land : shapefile of worldcoast
- ne_50m_rivers_lake_centerlines_scale_rank : shapefile of riverlines
- GSHHS_h_L2.shp : shape polygon file of big lakes
- output :
- figureS2.tiff
- figureS3.R : generates a barplot of the species diversity distribution across 10° latitudinal bands and a boxplot of marine and freshwater species diversity by cell.
- inputs :
- total_data_genetic_diversity_with_all_descripteurs.tsv
- grid_equalarea200km : shapefile of worldmap equal area projection epsg:4326 with nested equal area grids (cell sizes of 200km)
- equalarea_id_coordsCA_FWRS_MR_RS.csv : Table of species diversity by 200km square cell
- output :
- figureS4.R : from the table of genetic, environmental and geographic variables by cell, generates a figure of the regional effect on the global genetic diversity pattern.
- input :
- output :
- figureS5.R : generates maps of the number of species by cell, the number of sequences by cell and the number of sequences by species by cell.
- figureS6.R : barplot of the number of sequences and the number of species by taxonomic family/order
- input :
- output :
- figureS7.R : generates maps of the taxonomic coverage by cell for marine and freshwater species
- inputs :
- total_data_genetic_diversity_with_all_descripteurs.tsv
- grid_equalarea200km : shapefile of worldmap equal area projection epsg:4326 with nested equal area grids (cell sizes of 200km)
- ne_50m_land : shapefile of worldcoast
- ne_50m_rivers_lake_centerlines_scale_rank : shapefile of riverlines
- GSHHS_h_L2.shp : shape polygon file of big lakes
- equalarea_id_coordsCA_FWRS_MR_RS.csv : Table of species diversity by 200km square cell
- output :
- figureS8.R : generates a barplot of the species diversity distribution across 10° latitudinal bands at species level
- inputs :
- marine_latbands_bootstraps.csv : table of ID_latband,ISO3,is_sea,cloMeanVal,cloMinVal,cloMaxVal,bathyVal,AP,HDI_2015,fshD information attributed to each latitudinal band with marine species
- freshwater_latbands_bootstraps.csv : table of ID_latband,ISO3,is_sea,cloMeanVal,cloMinVal,cloMaxVal,bathyVal,AP,HDI_2015,fshD information attributed to each latitudinal band with freshwater species
- output :
- rename_family_bold_to_ncbi.sh : from the curated table of individual sequences, curates the family column by replacing each BOLD family name with its equivalent in the NCBI taxonomy.
- input :
- cured_sequences_withdemerpelag.csv : curated table of individual sequences with habitat column
- output :
- cured_family_sequences_withdemerpelag.csv : curated taxonomy/family table of individual sequences with habitat column
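
The renaming step amounts to a lookup in a BOLD to NCBI family-name mapping. The Python sketch below illustrates the idea under stated assumptions: the column name `family` and the mapping passed in are hypothetical, and families missing from the mapping are left unchanged.

```python
def rename_families(rows, bold_to_ncbi, column="family"):
    """Return copies of rows with the family column mapped through bold_to_ncbi.

    Families absent from the mapping are left unchanged.
    """
    renamed = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are not modified
        new_row[column] = bold_to_ncbi.get(row[column], row[column])
        renamed.append(new_row)
    return renamed
```
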
- sequences_table.py : from the table of genetic, environmental and geographic variables by cell, the names of the fasta files in 06-species_alnt_cluster, and the CO1 sequences with lat/lon information, writes a table of sequences with geographical cell localisation.
- inputs :
- total_data_genetic_diversity_with_all_descripteurs.tsv : table with cells as rows and genetic, environmental and geographic variables as columns
- 06-species_alnt_cluster : folder containing {species}.fasta files with {species} as BOLD's name of the species
- co1_ssll_seqbold_data.tsv : table of filtered CO1 sequences with lat/lon and BOLD's taxonomy information
- output :
- map_marine_sequences.csv : table of individual sequences with geographical cell localisation
- sequences_taxonomy.py : from the curated table of individual sequences and the data tables used for the different models, writes a table of the number of species and number of sequences by taxonomic order/family used for each model.
- inputs :
- cured_family_sequences_withdemerpelag.csv : curated taxonomy/family table of individual sequences with habitat column
- models : folder containing the models' tables of cells
- output :
- watertype_all_modeles_effectives_family.csv : table of number of species/number of sequences by taxonomic order/family used for each model
- check_freshwater_assignation.R : from the model's table of cells and the table of sequences, uses rfishbase to write a list of species whose watertype assignment is wrong according to the model.
- inputs :
- map_marine_sequences.csv : table of individual sequences with geographical cell localisation
- freshwaterDG.txt : freshwater model's table of cells (the models' tables of cells are stored in the models folder)
- output :
- wrong_freshwater_sequences.csv : table of individual sequences with wrong watertype assignment
- sequences_demerpelag.R : from the table of sequences with geographical cell localisation, assigns habitat (demersal, pelagic...) information based on the species name attributed to each sequence and writes a new table with a habitat column.
- input :
- map_marine_sequences.csv : table of individual sequences with geographical cell localisation
- output :
- sequences_withdemerpelag.csv : table of individual sequences with habitat column
- sequences_cure_species_name.R : fixes failed habitat assignments and wrong species names that are not recognized by the FishBase database.
- input :
- sequences_withdemerpelag.csv : table of individual sequences with habitat column
- output :
- cured_sequences_withdemerpelag.csv : curated table of individual sequences with habitat column