CoreGx Design Documentation
TODO::
A `TreatmentResponseExperiment` (`TRE`) has list-like and table-like behaviors. For table-like behaviors, rows are defined by one or more key columns which uniquely identify each row of the `data.table` in the `rowData` slot. These columns are referred to as the `rowIDs` and are concatenated together with the ':' character to make pseudo-rownames. The same is true for the `colData` table, with associated `colIDs` and pseudo-colnames.
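The construction of pseudo-rownames can be sketched outside of R. The snippet below is a minimal Python illustration of the idea, not the CoreGx implementation; the column names `treatmentid` and `dose` and the helper `pseudo_dimnames` are hypothetical.

```python
# Illustrative only: CoreGx is an R package; this mimics the idea in plain Python.
# The column names (treatmentid, dose) and pseudo_dimnames() are hypothetical.

def pseudo_dimnames(table, id_columns, sep=":"):
    """Join the identifier columns of each row into a single name."""
    return [sep.join(str(row[col]) for col in id_columns) for row in table]


# A toy stand-in for the rowData table: two ID columns plus an integer key.
row_data = [
    {"treatmentid": "drugA", "dose": 0.1, "rowKey": 1},
    {"treatmentid": "drugA", "dose": 1.0, "rowKey": 2},
    {"treatmentid": "drugB", "dose": 0.1, "rowKey": 3},
]

row_names = pseudo_dimnames(row_data, ["treatmentid", "dose"])
print(row_names)  # ['drugA:0.1', 'drugA:1.0', 'drugB:0.1']
```

Because the ID columns uniquely identify each row, the joined names are unique as well, which is what lets them act as row names.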
Use of such pseudo-dimnames allows a `TRE` to be subset analogously to a base `data.frame` by specifying the dimension names of the "rows" or "columns" of the object. As a result, the `[` method exploits the table-like behaviors of the object. In addition to `data.frame`-like subsets, two additional mechanisms for subsetting have been implemented. Firstly, pseudo-dimnames can be specified using glob or regex patterns, which are matched against the pseudo-dimnames before returning the subset. Secondly, the `[` method allows use of `data.table`-style subsets using expressions, with the caveat that any expression subset query needs to be wrapped in the `.()` function to protect calls from early evaluation during S4-method dispatch. These protected expressions are then passed through to the `i` argument of the `rowData` or `colData` `data.table`s.
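To make the pattern-based mechanism concrete, here is a small Python sketch of matching glob and regex patterns against a list of pseudo-rownames. It only illustrates the matching step; how CoreGx itself distinguishes a glob from a regex is not described here, so the two helpers `match_glob` and `match_regex` are hypothetical and kept separate.

```python
# Illustrative only: pattern matching against pseudo-dimnames in plain Python.
# match_glob() and match_regex() are hypothetical helpers, not CoreGx functions.
import fnmatch
import re

row_names = ["drugA:0.1", "drugA:1.0", "drugB:0.1"]


def match_glob(names, pattern):
    """Keep the names matching a shell-style glob (case-sensitive)."""
    return [name for name in names if fnmatch.fnmatchcase(name, pattern)]


def match_regex(names, pattern):
    """Keep the names in which the regular expression is found."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.search(name)]


glob_hits = match_glob(row_names, "drugA:*")    # every dose of drugA
regex_hits = match_regex(row_names, r":0\.1$")  # every treatment at dose 0.1

print(glob_hits)   # ['drugA:0.1', 'drugA:1.0']
print(regex_hits)  # ['drugA:0.1', 'drugB:0.1']
```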
The `assays` slot of a `TRE` contains the measurements of interest in the object and possesses list-like behaviors. You can access and assign an assay via the `$` and `[[` methods. However, table-like subsets on the object via `[` or `subset` do the necessary internal work to subset each item in the `assays` list as well.
The assay index table was introduced to allow aggregation operations over `rowKey` and `colKey` values to be stored inside a `TreatmentResponseExperiment`. Previously, assays were keyed directly by the values of `rowKey` and `colKey`, and thus no assay could store a summary over the `rowID` or `colID` columns. This effectively made it impossible to store interesting aggregations, for example summaries over dose or replicates, inside a `TreatmentResponseExperiment` object.
To resolve this issue, two additional pieces of structural metadata have been added to the `.intern` slot. The `assayIndex` is a table which maps from `rowKey` and `colKey` combinations to an integer key for each assay table. The `assayKeys` are a list of `rowIDs` and `colIDs` which are required to uniquely identify a measurement in an assay. The `assayKeys` are used to define an integer assay key column in each assay `data.table`. This prevents unnecessary repetition of character metadata columns inside the `assays` of a `TRE` and acts as a form of compression vs storing the data in a single, long-format `data.table`. Initial tests indicate about a 50% reduction in object size vs the long-format `data.table`. This reduction will increase with the number of `rowData` and `colData` columns, but decrease slightly with the number of `assays` in a `TRE`.
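The relationship between the index and the other tables can be sketched with plain dictionaries. This Python snippet is an illustration under assumed names (`sensitivity`, `viability`, `treatmentid`, `sampleid` and `expand_assay` are all hypothetical), not the internal CoreGx layout: the assay stores only an integer key plus measurements, and the index is used to join the metadata back on demand.

```python
# Illustrative only: a toy assay index in plain Python, not the CoreGx internals.
# All table, column and function names here are hypothetical.

# Metadata tables keyed by integer rowKey / colKey.
row_data = {
    1: {"treatmentid": "drugA", "dose": 0.1},
    2: {"treatmentid": "drugA", "dose": 1.0},
}
col_data = {1: {"sampleid": "cellX"}, 2: {"sampleid": "cellY"}}

# The assay holds measurements only, keyed by its own integer assay key.
assays = {
    "sensitivity": {
        1: {"viability": 98.0},
        2: {"viability": 55.0},
        3: {"viability": 91.0},
    }
}

# The index maps each rowKey/colKey combination to a key in each assay.
assay_index = [
    {"rowKey": 1, "colKey": 1, "sensitivity": 1},
    {"rowKey": 2, "colKey": 1, "sensitivity": 2},
    {"rowKey": 1, "colKey": 2, "sensitivity": 3},
]


def expand_assay(name):
    """Rebuild the long-format records for one assay via the index."""
    records = []
    for entry in assay_index:
        record = {}
        record.update(row_data[entry["rowKey"]])
        record.update(col_data[entry["colKey"]])
        record.update(assays[name][entry[name]])
        records.append(record)
    return records


long_format = expand_assay("sensitivity")
print(long_format[1])
```

The character metadata lives once in the metadata tables rather than once per measurement, which is the source of the size reduction described above.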
Summaries inside of a specific assay can be stored by repeating the value of the associated `assayKey` in the corresponding column of the `assayIndex`. This ensures that the data which has been aggregated over can still be retrieved while also allowing storage of summaries over some subset of `rowKey` and `colKey` values. For now, the `assayIndex` will contain a column for each `assay` in the `TRE`, even if an assay is "parallel" to other assays (i.e., keyed by the same columns). While this does slightly increase the size of the object due to storing repeated information, it greatly simplifies the logic required for subsets, as well as for assigning new assays or computing summaries over an existing assay. The cost of this is on the order of 3.3 MB per million assay rows per assay.
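As a rough illustration of storing a summary by repeating a key, the Python sketch below averages a toy assay over dose. Every name in it (`sensitivity`, `mean_viability`, the tuple layout of `row_data`) is hypothetical; it shows the indexing idea only, not the CoreGx aggregation code.

```python
# Illustrative only: a summary assay stored by repeating its key in the index.
# All names are hypothetical; this is not the CoreGx aggregation code.
from collections import defaultdict

# rowKey -> (treatmentid, dose); rows 1 and 2 differ only by dose.
row_data = {1: ("drugA", 0.1), 2: ("drugA", 1.0), 3: ("drugB", 0.1)}

# Raw assay: one value per assay key.
sensitivity = {1: 90.0, 2: 50.0, 3: 80.0}

assay_index = [
    {"rowKey": 1, "colKey": 1, "sensitivity": 1},
    {"rowKey": 2, "colKey": 1, "sensitivity": 2},
    {"rowKey": 3, "colKey": 1, "sensitivity": 3},
]

# Summarise over dose: group by (treatmentid, colKey) and give each group one key.
group_keys = {}
grouped_values = defaultdict(list)
for entry in assay_index:
    group = (row_data[entry["rowKey"]][0], entry["colKey"])
    summary_key = group_keys.setdefault(group, len(group_keys) + 1)
    # The same summary key is repeated for every index row in the group.
    entry["mean_viability"] = summary_key
    grouped_values[summary_key].append(sensitivity[entry["sensitivity"]])

mean_viability = {key: sum(vals) / len(vals) for key, vals in grouped_values.items()}

print([entry["mean_viability"] for entry in assay_index])  # [1, 1, 2]
print(mean_viability)  # {1: 70.0, 2: 80.0}
```

Both index rows for `drugA` point at summary key 1, so the raw measurements remain reachable through the `sensitivity` column while the summary is stored once.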
To prevent the `assayIndex` from becoming convoluted, we have implemented the `reindex` method. This method takes in a `TRE` object and updates the `rowKey`, `colKey` and `assayKeys` such that they are the smallest possible set of consecutive integers. To maintain referential integrity, these keys need to be updated both in the `assayIndex` and in each slot of the object. To make comparison of objects after reindexing simple, a default ordering needs to be implemented for each of the internal slots, such that reordering the `assayIndex` `data.table` does not change the results returned by the `TRE` accessor methods.
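The key-compaction step can be sketched as follows. This Python snippet is a simplified illustration with hypothetical names (`compact`, `reindex_tables`, a single assay called `sensitivity`), not the CoreGx `reindex` method: each set of keys is mapped onto 1..n, the same mapping is applied to the index so references stay valid, and the index is sorted by the assay key.

```python
# Illustrative only: compacting integer keys while keeping references consistent.
# compact() and reindex_tables() are hypothetical, not the CoreGx reindex method.

def compact(keys):
    """Map each distinct key to its rank (1..n) in sorted order."""
    return {old: new for new, old in enumerate(sorted(set(keys)), start=1)}


def reindex_tables(row_data, col_data, assay, assay_index, assay_name):
    row_map = compact(row_data)
    col_map = compact(col_data)
    assay_map = compact(assay)
    # Apply the same mappings to the index, then sort it by the assay key.
    new_index = sorted(
        (
            {
                "rowKey": row_map[entry["rowKey"]],
                "colKey": col_map[entry["colKey"]],
                assay_name: assay_map[entry[assay_name]],
            }
            for entry in assay_index
        ),
        key=lambda entry: entry[assay_name],
    )
    return (
        {row_map[key]: value for key, value in row_data.items()},
        {col_map[key]: value for key, value in col_data.items()},
        {assay_map[key]: value for key, value in assay.items()},
        new_index,
    )


# Keys with gaps, as might be left behind after subsetting.
row_data = {2: "drugA:0.1", 5: "drugA:1.0", 9: "drugB:0.1"}
col_data = {3: "cellX", 7: "cellY"}
sensitivity = {12: 55.0, 40: 98.0, 77: 91.0}
assay_index = [
    {"rowKey": 5, "colKey": 3, "sensitivity": 40},
    {"rowKey": 2, "colKey": 3, "sensitivity": 12},
    {"rowKey": 9, "colKey": 7, "sensitivity": 77},
]

new_rows, new_cols, new_assay, new_index = reindex_tables(
    row_data, col_data, sensitivity, assay_index, "sensitivity"
)
print(new_index)
```

After the call every key set is `1..n`, and each index row still points at the same row, column and measurement as before.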
The default ordering in various conditions is outlined below:
Slot | Condition | Keys |
---|---|---|
rowData | internal | rowKey |
colData | internal | colKey |
assays | internal | assayKey |
rowData | accessed; withDimnames=TRUE | rowIDs |
rowData | accessed; withDimnames=FALSE | rowKey |
colData | accessed | colIDs |
assays | accessed; withDimnames=TRUE | rowIDs, colIDs |
assays | accessed; withDimnames=FALSE | rowKey, colKey |
assays | accessed; withDimnames=FALSE & key=FALSE | assayKey |
assayIndex | always | assayKeys |
Potential issues: sorting is a (relatively) expensive operation. While `data.table` uses radix sort and is very efficient, it has the potential to slow down accessors. To avoid these sorts, you can use the secret argument `raw=TRUE`, which can be passed to the `rowData`, `colData`, `assay` and `assays` accessor methods to short-circuit and return the result of `@<slotName>`. If you do this, make sure your method applies the appropriate sort before the final return statement. If your function may be used inside other CoreGx accessors, also make sure to add code for the `raw=TRUE` secret argument to ensure wasteful sorting can be avoided.
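The short-circuit pattern can be illustrated in a language-neutral way. The Python sketch below uses a hypothetical `TinyExperiment` class and accessor; it is not the CoreGx API, only a picture of how a `raw` flag lets internal callers skip a sort they do not need.

```python
# Illustrative only: an accessor with a "raw" short circuit, in plain Python.
# TinyExperiment, row_data() and n_treatments() are hypothetical names.

class TinyExperiment:
    def __init__(self, rows):
        # Internal storage, kept in whatever order the rows were supplied.
        self._row_data = rows


def row_data(obj, raw=False):
    """Return the row metadata, sorted by its ID columns unless raw=True."""
    if raw:
        # Short circuit: hand back the slot contents without sorting.
        return obj._row_data
    return sorted(obj._row_data, key=lambda row: (row["treatmentid"], row["dose"]))


def n_treatments(obj):
    """Order does not matter here, so the sort is skipped with raw=True."""
    return len({row["treatmentid"] for row in row_data(obj, raw=True)})


experiment = TinyExperiment(
    [
        {"treatmentid": "drugB", "dose": 0.1},
        {"treatmentid": "drugA", "dose": 1.0},
        {"treatmentid": "drugA", "dose": 0.1},
    ]
)

print(row_data(experiment)[0])  # {'treatmentid': 'drugA', 'dose': 0.1}
print(n_treatments(experiment))  # 2
```

A caller that returns rows to the user would still apply the default sort before returning, as described above.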
Introduction of the assay index to a `TRE` has implications for the way subset operations will work. This section will define the requisite operations to subset along different `TRE` dimensions.
Subsetting by one of the table-like dimensions