
General Background Information

Marie Oestreich edited this page May 12, 2022 · 5 revisions

Count files, annotation files & reference files

Count files

Count files are the files that contain your gene expression tables. You don't have to load your gene expression data from files; you can also load it from objects in an existing R environment. If you do load from files, set the folder that contains them in the init_wd() function using the parameter dir_count_data. The data tables must have samples as columns and genes as rows. When loading from file, there must also be an additional column that contains the gene symbols. When you use existing R data frames instead, the row names must be gene symbols, there must be no columns other than the samples (e.g., no column that contains gene symbols or IDs), and the column names must be sample IDs identical to those in the corresponding annotation file.

Annotation files

The annotation files are the files that contain your metadata tables. Just like with the expression data, you don't have to load them from files; you can also use existing R objects (data frames). If you do load the annotation data from file, samples must be rows and columns must be metadata categories (e.g., age, sex, etc.). There must be one column that contains unique sample IDs. If you are using existing R data frames, the row names must be sample IDs.
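The layout requirements for count and annotation tables can be checked before loading anything. The following is a minimal, language-neutral sketch in Python, not part of hCoCena: the function name check_sample_ids, the example column name "SYMBOL", and the sample IDs are all hypothetical. It assumes that every sample column in the count table should have a matching sample ID in the annotation table and vice versa.

```python
def check_sample_ids(count_columns, annotation_ids, gene_column=None):
    """Compare count-table sample columns with annotation sample IDs.

    Returns a tuple (missing_in_annotation, missing_in_counts).
    gene_column is the name of the gene symbol column (file-based input);
    leave it as None when gene symbols are stored as row names.
    """
    # Annotation sample IDs are required to be unique.
    if len(set(annotation_ids)) != len(annotation_ids):
        raise ValueError("annotation sample IDs must be unique")

    # Every count column except the gene symbol column is a sample.
    samples = [c for c in count_columns if c != gene_column]

    missing_in_annotation = sorted(set(samples) - set(annotation_ids))
    missing_in_counts = sorted(set(annotation_ids) - set(samples))
    return missing_in_annotation, missing_in_counts


# Hypothetical example: one sample is missing on each side.
count_columns = ["SYMBOL", "S1", "S2", "S3"]
annotation_ids = ["S1", "S2", "S4"]
result = check_sample_ids(count_columns, annotation_ids, gene_column="SYMBOL")
```

Both returned lists are empty when the two tables describe exactly the same samples.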

Reference files

Reference files are files that can be used as a reference for gene enrichment analyses or transcription factor analysis. You can find the reference files in the repository in a folder named reference_files. After downloading, provide the path to that folder to the init_wd() function (parameter dir_reference_files) and hCoCena will access them automatically when needed.

Network Integration

The previously constructed networks are now being integrated. Here, the function build_integrated_network() offers the parameter mode, which defines the integration method.

  • Union-based integration - mode = 'u': The function builds an edge list representing a multigraph from the union of all layer-specific networks. Network parts that are unique to some layers are therefore also present in the resulting integrated network, which provides a holistic, cross-dataset view of gene co-expression.
  • Intersection-based integration - mode = 'i': If you are interested in the co-expression network of one dataset and how that particular network changes in the other datasets, you have the option for an intersection-based integration. You define which of your networks is to serve as your reference by setting the with parameter to the respective layer number. The vertices of the resulting network will be identical to the reference network, but the edges connecting them will be greatly impacted by the other datasets.
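The difference between the two modes can be illustrated on toy edge lists. This is a conceptual sketch in Python and not the code behind build_integrated_network(): the function names union_edges and restrict_to_reference are hypothetical, and the intersection variant encodes only one assumed reading (keep edges from any layer whose two endpoints both occur in the reference layer). How hCoCena actually derives the edges in mode 'i' is not specified here.

```python
def union_edges(layers):
    """Concatenate all layer edge lists into one multigraph edge list.

    Each edge becomes (gene_a, gene_b, weight, layer_number); parallel
    edges between the same gene pair are kept.
    """
    return [
        (a, b, w, layer_number)
        for layer_number, edges in enumerate(layers, start=1)
        for a, b, w in edges
    ]


def restrict_to_reference(layers, reference):
    """Assumed reading of the intersection idea: keep only edges whose
    endpoints are both vertices of the reference layer (1-based number)."""
    ref_nodes = {n for a, b, _ in layers[reference - 1] for n in (a, b)}
    return [e for e in union_edges(layers) if e[0] in ref_nodes and e[1] in ref_nodes]


# Two hypothetical layer-specific networks.
layer1 = [("A", "B", 0.9), ("B", "C", 0.8)]
layer2 = [("A", "B", 0.95), ("C", "D", 0.85)]

union = union_edges([layer1, layer2])
inter = restrict_to_reference([layer1, layer2], reference=1)
```

In this toy example the union keeps the C-D edge that exists only in the second layer, while the restricted variant drops it because D is not a vertex of the reference layer.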

The network graph is a multigraph if a pair of nodes is connected in more than one of the layer-specific networks. These multi-edges can be simplified using different settings of the multi_edges parameter:

  • multi_edges = 'min' (default): The smallest correlation found between two genes in any of the datasets is used as the final edge weight.
  • multi_edges = 'mean': The mean correlation found between two genes in all of the datasets is used as the final edge weight.
  • multi_edges = 'max': The highest correlation found between two genes in any of the datasets is used as the final edge weight.
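The three multi_edges settings amount to collapsing all parallel edges between a gene pair into one weight. Here is a minimal Python sketch of that reduction, assuming undirected edges given as (gene_a, gene_b, correlation) tuples. The function simplify_multi_edges is hypothetical and only mirrors the parameter values described above; it is not the hCoCena implementation.

```python
from collections import defaultdict
from statistics import mean


def simplify_multi_edges(edges, multi_edges="min"):
    """Collapse parallel edges into a single weight per gene pair."""
    reducers = {"min": min, "mean": mean, "max": max}
    if multi_edges not in reducers:
        raise ValueError("multi_edges must be 'min', 'mean' or 'max'")

    # Group weights by unordered gene pair so A-B and B-A count as one edge.
    grouped = defaultdict(list)
    for a, b, weight in edges:
        grouped[tuple(sorted((a, b)))].append(weight)

    return {pair: reducers[multi_edges](ws) for pair, ws in grouped.items()}


# Hypothetical multigraph: A-B is connected in two layers, B-C in one.
edges = [("A", "B", 0.75), ("B", "A", 0.5), ("B", "C", 0.875)]

by_min = simplify_multi_edges(edges, "min")
by_mean = simplify_multi_edges(edges, "mean")
by_max = simplify_multi_edges(edges, "max")
```

Edges that occur in only one layer, such as B-C here, keep their weight under all three settings.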