Skip to content

Latest commit

 

History

History
414 lines (320 loc) · 18.7 KB

toml.md

File metadata and controls

414 lines (320 loc) · 18.7 KB

TOML files

Readfish experiments are configured using TOML files, which are minimal and easy-to-read markup files. Our configuration only uses tables and arrays of tables populated with key-value pairs. The TOML file contains almost all the information required for running readfish, such as information on barcodes, control/analysis regions, what basecaller or aligner to use. There are several example TOMLS, with comments explaining what each field does, as well as the overall purpose of the TOML file here - https://github.com//LooseLab/readfish/tree/main/docs/_static/example_tomls.

Data model

To understand each section of the TOML file it helps to understand the work that readfish does:

  1. Initialise a connection to the MinKNOW API via the Read Until API
  2. Initialise a basecaller
  3. Initialise an aligner with a pre-computed reference file
  4. Begin streaming live data from the live sequencing experiment
  5. For each batch of the Read Until API read chunks:
    1. Basecall the raw signal chunks
    2. Align the basecalled chunks
    3. Parse the alignments using the experiment configuration, determining whether the read is on or off target, then pass the decision to the Read Until API for it to be effected on the sequencer
    4. (Optionally) Check for updated configuration parameters with a new reference or targets
:name: data-model
:alt: Flow diagram showing a MinION, computer, and box labelled 'readfish'. The flow show 'raw data' going from the minion, through the computer, into readfish. Where it passes through a basecaller and an Aligner. The return arrow then passes from readfish, to the computer, to the MinION; this line is labelled unblock commands.
:align: center

Flow diagram of data flow from the sequencer to readfish and back.

As readfish is designed around the need to both basecall and align data we have two TOML sections for configuring these tools. The first, caller_settings, controls the basecaller and the second, mapper_settings controls the aligner. These are expanded on in the {ref}plugin configuration <plugins-config> section.

In addition readfish also manages barcodes and analysis regions on the flow cell. These allow for arbitrary configuration schemas that match your experimental needs. These are expanded on in the {ref}barcode configuration <barcodes-config> and {ref}analysis regions <analysis-regions> sections.

(plugins-config)=

Plugin configuration

Readfish uses "plugins" to configure basecalling and alignment. We provide some default plugins using Guppy (via ont-pyguppy-client-lib) and mappy (via mappy), for basecalling and alignment respectively. The way that you tell readfish what plugin to load is by specifying it after the table name, for example:

[caller_settings.guppy]
               # ^^^^^^
               # Chooses the Guppy plugin

This indicates that readfish should use Guppy for basecalling.

Basecaller

Guppy

This section is specific to the Guppy plugin for basecalling. The caller_settings.guppy [table] specifies the basecalling parameters used by the basecaller. An example table is given below:

[caller_settings.guppy]
config = "dna_r10.4.1_e8.2_400bps_hac"
address = "ipc:///tmp/.guppy/5555"
debug_log = "basecalled_chunks.fq" #optional

The only required keys are address and config these indicate to Guppy what server to connect to and what basecalling model to load. The config parameter must a valid Guppy configuration excluding the file extension; these are found in the Guppy data folder (on linux ls /opt/ont/guppy/data/*.cfg). The address is the path to the IPC socket Guppy is listening on, for Guppy servers run by MinKNOW the default is ipc:///tmp/.guppy/5555. The debug_log is an optional file that the basecalled FASTQ are written to.

Key Type Description Required
config string Name of base calling config True
address string Address of Guppy socket - default ipc:///tmp/.guppy/5555 True
debug_log string Optional - Filepath to write out base-calls to - should end in a FASTQ suffix. False

Any extra key value pairs are passed varbatim to the PyGuppyClient instance upon initialisation as keyword arguments.

For example:

server_file_load_timeout = 60

Will set the server_file_load_timeout parameter to 60 seconds for Guppy.

Example Guppy server parameters
.. automethod:: pyguppy_client_lib.pyclient.PyGuppyClient.set_params
    :noindex:

To see parameters that your Guppy installation has run the following python snippet in your readfish environment

PAGER=cat python -c "import pyguppy_client_lib.pyclient as pgc; help(pgc.PyGuppyClient.set_params)"

Basecaller no operation (no_op)

In addition to the Guppy basecaller we provide a "no operation" basecaller. This plugin does nothing. It will only iterate the live chunks of data and pass through the minimal amount of data that is needed for the next steps. This is added using:

[caller_settings.no_op]

Aligner

minimap2

Currently we provide two mapping plugins, mappy, and mappy-rs. Happily these both take the same fields in the TOML file, so you can switch between them without changing your TOML file parameters.

Currently, these plugins take the same TOML file plugins section, and which Aligner is used is chosen based on the TOML mapper_settings section name. .. warning:: It is worth noting that for PromethION runs, mappy will not be able to keep up and mappy-rs with at least four threads is recommended.

For mappy (single threaded)

[mapper_settings.mappy]
               # ^^^^^^
               # Chooses the mappy plugin

For mappy-rs (multi threaded)

[mapper_settings.mappy_rs]
               # ^^^^^^
               # Chooses the mappy-rs plugin

The fn_idx_in field is the path to the reference, accepting either a precomputed minimap2 index, an .mmi file, or a uncompressed or gzipped fasta file. We strongly prefer an already created MMI. The debug_log field is the file name that the mappings produced by mappy-rs/mappy are written into, in the PAF format.

mappy-rs also takes an optional n_threads field, which determines the number of threads which are used for alignment. n_threads defaults to 1, so if you need more you must set it!

[mapper_settings.mappy_rs]
fn_idx_in = "/Path/To/Reference.mmi"
n_threads = 4
debug_log = "mapped_chunks.paf"
Key Type Description Required
fn_idx_in string Absolute path to the reference FASTA/mmi True
debug_log string File to write PAF records to from alignments seen False
n_threads int Optional - if using mappy_rs, number of threads for alignment False

Any extra key value pairs are passed varbatim to the Aligner instance upon initialisation as keyword arguments.

For example:

best_n = 1

Will set the best_n parameter on the Aligner instance to 1.

All available keyword arguments can be seen on the mappy documentation

Aligner no operation (no_op)

In addition to the minimap2 based aligners we provide a "no operation" aligner. This plugin does nothing. It will only iterate the Results from the basecaller and pass them through to the next steps. This is added using:

[mapper_settings.no_op]

(analysis-regions)=

Analysis regions

Analysis regions are the arrays of tables that define different experimental conditions over the surface of the flow cell. For experiments that do not use barcodes you must have at least one analysis region.

The physical layout of each flow cell constrains how many experimental conditions can be used as the number of channels in each region must be equal. The number of regions subtables determines how many times the flow cell is divided. The maximum number of regions for MinION flow cells is 32 and for PromethION flow cells is 120. The number of conditions must be a factor of the number for the selected combination.

As an example applying two analysis regions to the layout below would split the flow cell into left/right regions.

+------------+     +------+------+
|  1  2  3  4|     |  1  2|  3  4|
|  5  6  7  8| --> |  5  6|  7  8|
|  9 10 11 12|     |  9 10| 11 12|
+------------+     +------+------+

Experiments with one region

When wanting to apply a single targeting strategy over the flow cell a single region can be provided. This can be specified like so:

[[regions]]
name = "condition name"
min_chunks = 1
max_chunks = 4
targets = "path/to/targets.bed"
single_on = "stop_receiving"
multi_on =  "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

A full table of keys that can be given can be found in region sub-table.

Experiments with more than region

Experiments with multiple regions (for example those with an analysis and control condition) regions can be specified for each condition. For example:

[[regions]]
name = "analysis"
control = false
...

[[regions]]
name = "control"
control = true
...

(regions-sub-tables)=

Regions sub-tables

Each Regions sub-table must contain all of the required keys, these are the same between barcoded and non-barcoded toml files.

Key Type Values Description Required
name string N/A The name given to this condition True
control bool N/A Is this a control condition. If true all reads will be ignored in this region, default false False
min_chunks int N/A The minimum number of read chunks to evaluate True
max_chunks int N/A The maximum number of read chunks to evaluate True
targets string or array N/A The genomic targets to accept or reject; see target formats True
single_on string [unblock, stop_receiving, proceed] The action to take when a read has a single on-target mapping True
multi_on string [unblock, stop_receiving, proceed] The action to take when a read has multiple mappings with at least one on-target True
single_off string [unblock, stop_receiving, proceed] The action to take when a read has a single off-target mapping True
multi_off string [unblock, stop_receiving, proceed] The action to take when a read has multiple mappings that are all off-target True
no_seq string [unblock, stop_receiving, proceed] The action to take when a read does not basecall True
no_map string [unblock, stop_receiving, proceed] The action to take when a read does not map to your reference True
below_min_chunks string [unblock, stop_receiving, proceed] Optional. The action to take when a read has not been seen at least min_chunks times, default proceed False
above_max_chunks string [unblock, stop_receiving, proceed] Optional. The action to take when a read has been seen more than max_chunks times, default unblock False

The meanings of unblock, stop_receiving, and proceed are given below.

  1. proceed: Allow one more chunk to be captured, before trying to make another decision, i.e proceed for now
  2. unblock: Unblock the read
  3. stop_receiving: Allow the read to be sequenced, and stop receiving signal chunks for it.

(target-formats)=

Target Formats

The targets field on a region can accept either a string or an array of strings. If a string is provided this should be a fully qualified path to a text file that consists of genomic targets in either as CSV or BED entries. When an array is given all the elements in the array must conform with the non-BED formats. Targets will be collapsed where they overlap.

BED or CSV targets

If the targets entry is a path to a file it must either be a six-column-BED formatted file or a CSV. If a CSV file there should be no header and the coordinates should be given as

contig,start,stop,strand

As a concrete example:

chr1,1000,2000,+
chr2,3000,4000,-

Targets given in this format will only select (for or against) reads where the alignment start position is within the region on the given strand.

TOML array targets

If specifying targets in a TOML array they can be any of contigs or coordinates:

For example contigs:

  • >chr1 Human chromosome 1 becomes chr1

Targets given in this format specify the entire contig as a target to select for or against.

(barcodes-config)=

Barcode specific configuration

If you are using Guppy for basecalling then the additional barcode_kits parameter is required on the caller_settings table.

[caller_settings.guppy]
config = "dna_r10.4.1_e8.2_400bps_fast"
address = "ipc:///tmp/.guppy/5555"
debug_log = "basecalled_chunks.fq"
barcode_kits = "EXP-NBD196"

Note, to specify multiple barcode kits, the format to use is a space separated string, not an array. For example, "KIT1 KIT2".

[[regions]] are not needed when using a barcoded schema. Instead a TOML table named barcodes is required. This table has two required sub-tables barcodes.classified and barcodes.unclassified. These are the fallback configurations for unspecified barcodes. For specific barcodes a full configuration table should be specified using the barcode name, e.g. barcodes.barcode01 for "barcode01".

[barcodes.classified]
name = "classified_reads"
control = false
min_chunks = 0
max_chunks = 2
targets = []
single_on = "unblock"
multi_on = "unblock"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

[barcodes.unclassified]
name = "unclassified_reads"
control = false
min_chunks = 0
max_chunks = 2
targets = []
single_on = "unblock"
multi_on = "unblock"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

[barcodes.barcode01]
name = "barcode01_targets"
control = false
min_chunks = 0
max_chunks = 2
targets = ["chr1", "chr2"]
single_on = "stop_receiving"
multi_on = "stop_receiving"
single_off = "unblock"
multi_off = "unblock"
no_seq = "proceed"
no_map = "proceed"

Validating a TOML

We provide an entry point to validate TOML files - readfish validate

readfish validate experiment_conf.toml

Any errors that occur while loading the configuration will be written to the terminal.

As an example - if the reference is missing.

2023-06-27 16:07:03,041 readfish /home/adoni5/mambaforge/envs/readfish_dev/bin/readfish validate docs/_static/example_tomls/human_chr_selection.toml
2023-06-27 16:07:03,041 readfish check_plugins=True
2023-06-27 16:07:03,041 readfish command='validate'
2023-06-27 16:07:03,042 readfish log_file=None
2023-06-27 16:07:03,042 readfish log_format='%(asctime)s %(name)s %(message)s'
2023-06-27 16:07:03,042 readfish log_level='info'
2023-06-27 16:07:03,042 readfish prom=False
2023-06-27 16:07:03,042 readfish toml='docs/_static/human_chr_selection.toml'
2023-06-27 16:07:03,042 readfish.validate eJxtUNtOxCAUfOcrGp4NLU1N1MSnffcHmoZQOLRkuS0Xo38vdOOqWc8TM8wMc5gFNwYiS5CzdlsiWwnhc0HCO6W37rXD0nEW6UAmQhk8kZFNw7CGxHYuMOJSRkip6XQQL33fZxv6a0j/WAcjCWvZmPFHWIaUWQQulU47UReM0Gx5CL8bNFwb3Pv2YrljYm9aAyJr70jgqmXMEbYK07Igxy00y1VzyEfKxhEjq5u7uHPrS5HlHz9wQpnHDXI7z/jtxIY640AoxQ/dDVPyjBeUak0DzCvV3iluNV6ca3wxWd+x32J3dMo+1PUF6PdK3yz/3jlf17y0mxC9AJAHVT/nD9XmC4XQlII=
2023-06-27 16:07:03,047 readfish.validate Loaded TOML config without error
2023-06-27 16:07:03,047 readfish.validate Initialising Caller
2023-06-27 16:07:03,069 readfish.validate Caller initialised
2023-06-27 16:07:03,069 readfish.validate Initialising Aligner
2023-06-27 16:07:03,070 readfish.validate Aligner could not be initialised, see below for details
2023-06-27 16:07:03,071 readfish.validate Did not create or open an index

This command has a flag --no-check-plugins which will disable the loading and validation of plugins. However this is only recommended if you absolutely must not validate the plugins. For example the above TOML, missing the reference, would pass validation if using the --no-check-plugins flag. The validate command also describes the experiment, and will error if contigs are listed in the targets that are not in the reference. This can be disabled by passing the --no-describe flag.