- Docker
- Docker Compose plugin
- A reverse proxy such as NGINX, Trafik, or similar (configuring this is out of scope for this guide)
- A valid HTTPS certificate (configuring this is out of scope for this guide)
The following assay types can be ingested into an EpiVar node:
RNA-seq
ATAC-seq
H3K4me1
H3K4me3
H3K27ac
H3K27me3
-
A metadata file for the bigWig tracks, which can be one of the following:
-
An XLSX file with one or more sheets (see an example for the Aracena et al. dataset), each with the following headers:
file.path
: relative path tobigWig
, without theEPIVAR_TRACKS_DIR
environment variable directory prefixethnicity
: ethnicity / population group ID (not name!)- if set to
Exclude sample
, sample will be skipped
- if set to
condition
: condition / experimental group ID (not name!)sample_name
: Full sample name, uniquely indentifying the sample withinassay
,condition
,donor
, andtrack.view
variablesdonor
: donor ID (i.e., individual ID)track.view
: literal value, one ofsignal_forward
,signal_reverse
, orsignal_unstranded
track.track_type
: must be the literal valuebigWig
assay.name
: one of the available assays
The file may have additional headers, but these will be discarded internally.
-
OR, a JSON file containing a list of objects with the following keys, mapping to the above headers in order:
path
ethnicity
condition
sample_name
donor
view
type
assay
-
-
A dataset configuration file, which takes the form described in the example configuration file. Here, assays available in this node can be specified, as well as experimental conditions, population groups, and functions for interacting with the genotype VCF file.
This file specifies information about the dataset being hosted by the EpiVar node, including dataset title, sample groups and experimental treatments (in both of these, each entry has an ID and a name), assembly ID (
hg19
orhg38
), and how to find samples in the genotype VCF file. -
A human-readable dataset description file, in Markdown format, to show in the
About Dataset
tab in the browser. See an example for the Aracena et al. dataset.
-
A bgzipped, Tabix-indexed VCF containing sample variants, using one of two available reference genomes (
hg19
/hg38
). -
A set of normalized signal matrices: one per assay, each containing columns of samples and rows of features (see an example for ATAC-seq.)
-
A set of bigWigs, one or two (in the case of RNA-seq; forward/reverse view) per sample-assay pair.
-
Peak and gene-peak-link CSV files, respectively containing the following:
-
Peak files are, by default, named according to the template:
<qtls-directory>/QTLs_complete_$ASSAY.csv
, where$ASSAY
is one of the available assays. This template naming can be changed (keeping the$ASSAY
replaceable value) using theEPIVAR_QTLS_TEMPLATE
environment variable.They are CSVs where each row contains data about the SNP, the p-value associations between the genotypes and each treatment, and the corresponding assay. There are a couple truncated example files in
/input-files/qtls
. The required headings are the following:rsID
: The rsID of the SNPsnp
: The SNP in the SNP-peak association; formatted likechr#_######
(UCSC-formatted chromosome name, underscore, position)feature
: The feature name - eitherchr#_startpos_endpos
orGENE_NAME
pvalue.*
where*
is the ID of the condition, as specified in themetadata.json
file (see above.)- These are floating point numbers
feature_type
: The assay the peak is from - e.g.,RNA-seq
As an example, the header row for the Aracena et al. dataset's RNA-seq QTLs file is the following:
rsID,snp,feature,pvalue.NI,pvalue.Flu,feature_type
-
Some peaks are associated with genes, and their links should be provided in a gene-peak-link CSV file. This takes the form of a CSV with header row:
"symbol","peak_ids","feature_type"
wheresymbol
is gene symbol, and should be unique;peak_ids
is a feature string composed of an underscore-separated contig/start position/end position (e.g.,chr1_9998_11177
); andfeature_type
is the name of the assay (see available assays.)See the version of this file for the hg19 Aracena et al. dataset for an example.
-
In order to follow this guide, you should have experience deploying Docker containers, including configuring volumes, networks, and environment variables.
To turn a metadata XLSX file into a JSON file, run the following command:
# the -i is important here; otherwise, the metadata.json file will be blank!
docker run -i ghcr.io/c3g/epivar-server node ./scripts/metadata-to-json.js < path/to/metadata.xlsx > data/metadata.json
Alternatively, generate a metadata JSON matching the required format directly.
In a production instance, you will need the multiple volumes/bind-mounts from the host filesystem to the server Docker container.
- Your dataset's config file should be bound to
/app/config.js
inside the container. - Your dataset's about file (in Markdown format) should be bound to
/app/data/about.md
inside the container. - The genotype
.vcf.gz
and.vcf.gz.tbi
should be bound to/app/data/genotypes.vcf.gz
and/app/data/genotypes.vcf.gz.tbi
, respectively. - Your
metadata.json
file, pre-existing or as created above, should be bound to/app/data/metadata.json
.
- The node must be provided with a tracks folder containing subfolders and bigWig files matching the paths specified
in the
metadata.json
file described above. This folder can be bound as read-only, and should be bound to/tracks
, e.g.:/path/to/tracks-dir:/tracks:ro
. - The node must be provided with a readable/writable volume in which to place merged tracks, computed on-the-fly for
visualization, bound to
/mergedTracks
inside the container, e.g.:/volumes/mergedTracks:/mergedTracks
. - The Redis container, used for caching, does not strictly require a filesystem mount. However, to preserve the cache
across system restarts, a readable/writable volume should be bound to the Redis container's
/data
path, e.g.:/volumes/redis:/data
. - The database must be persisted to the host filesystem, bound to
/var/lib/postgresql/data
inside the container, e.g.:/volumes/db:/var/lib/postgresql/data
.
There are a few required environment variables that do not have default values that must be configured when deploying a node. Some of these required environment variables are secrets, so they should not be shared or made public:
EPIVAR_NODE_BASE_URL=https://my-node-url.example.org
EPIVAR_SESSION_SECRET=some-long-secret-value-do-not-share-me
POSTGRES_PASSWORD=my-secure-password
EPIVAR_PG_CONNECTION=postgresql://postgres:my-secure-password@epivar-db:5432/postgres
-
EPIVAR_NODE_BASE_URL
should not have a trailing slash, nor the/api
suffix even though API endpoints are off of this suffix. -
These environment variables can either be configured in a
.env
file and attached to a Docker Compose container via theenv_file
directive attached to both the EpiVar server container and the Postgres container, or put into theenvironment
directive in the Compose file directly -EPIVAR_SESSION_SECRET
andEPIVAR_PG_CONNECTION
are for the EpiVar server container, andPOSTGRES_PASSWORD
is for the Postgres container. Either way, make sure not to commit them to any public repository. -
Several other configuration options are available, and documented in commented code, in the
/envConfig.js
file. A lot of the default values here match how the Docker container is configured, especially the filesystem path options, so change with caution.
Assuming you have set up a Docker Compose file, similar to the one we provide as an example, you can start the node using the following command:
docker compose up -d
First, import the assembly gene list and gene-peak association data into the database using the following command:
docker compose exec -i epivar-server node ./scripts/import-genes.mjs < ./input-files/flu-infection-gene-peaks.csv
Then, import peaks and pre-computed peak matrix values into the database using the following command:
docker compose exec epivar-server node ./scripts/import-peaks.js
Note: This will take a while.
Then, calculate summary data for the peaks:
# Aggregate data for peaks grouped by SNP and gene, used for autocomplete:
docker compose exec epivar-server node ./scripts/calculate-peak-groups.mjs
# Used to generate Manhattan plots for chromosome/assay pairs, binned by SNP position:
docker compose exec epivar-server node ./scripts/calculate-top-peaks.mjs
Finally, ensure the cache is clear in case any values have been added accidentally during data ingestion, or are remaining from prior data ingestions:
docker compose exec epivar-server node ./scripts/clear-cache.js
In order to connect an EpiVar node to the EpiVar Browser portal, the node must be publicly accessible with a valid HTTPS certificate and a reverse proxy passing traffic to the EpiVar server (the configuration of which is out of scope for this guide.)
Then, contact us at epivar@computationalgenomics.ca, including information about your node, your dataset, and including the domain name + path of your instance, and we will decide whether to include your node in the list of nodes available in the EpiVar Browser.