Dataset and curation
The dataset module provides functions and classes to retrieve, transform, and store QM datasets from QCArchive, and delivers them as a PyTorch `Dataset` or a Lightning `DataModule` to train NNPs.
The dataset module implements actions associated with data storage, caching, and retrieval, as well as the pipeline from the stored HDF5 files to the PyTorch dataset class that can be used for training.
The general workflow to interact with public datasets is as follows:

- obtaining the dataset
- processing the dataset and storing it in an HDF5 file with standard naming and units
- uploading the file to Zenodo and updating the retrieval link in the dataset implementation
The specific dataset classes like `QM9Dataset` or `SPICEDataset` download an HDF5 file with defined key names and values in a particular format from Zenodo and load the data into memory. The values in the dataset need to be specified in the [openMM unit system](http://docs.openmm.org/6.2.0/userguide/theory.html#units).
For each uploaded Zenodo dataset (in hdf5 format) we will generate a README.md that contains all labels and their respective units.
The public API for creating a `TorchDataset` is implemented in the specific data classes (e.g., `QM9Dataset`) and the `DatasetFactory`. The resulting `TorchDataset` can be loaded in a PyTorch `DataLoader` (a usage sketch follows the module overview below).
```
modelforge
  datasets/   # defines the interaction with public datasets
    dataset.py
      TorchDataset(torch.utils.data.Dataset)   # A custom dataset class to wrap numpy datasets for PyTorch.
        * __init__(self, dataset: np.ndarray, property_name: PropertyNames)
        * __len__()
        * __getitem__(self, idx: int) -> Dict[str, torch.Tensor]
      HDF5Dataset()   # Base class for data stored in HDF5 format.
        * _{to|from}_file_cache()   # write/read high-performance numpy cache file (can change a lot)
        * _from_hdf5()              # read our HDF5 format (reproducible and archival; also supports gzipped files)
        * _perform_transformations(label_transform: Optional[Dict[str, Callable]], transforms: Dict[str, Callable])   # transform any entry of the dataset using a custom function
      DatasetFactory   # Factory class for creating Dataset instances.
        * create_dataset(data: HDF5Dataset, label_transform: Optional[Dict[str, Callable]], transform: Optional[Dict[str, Callable]]) -> TorchDataset   # Creates a TorchDataset instance given an HDF5Dataset.
      TorchDataModule(pl.LightningDataModule)   # A custom data module class to handle data loading and preparation for PyTorch Lightning training.
        * __init__(self, data: HDF5Dataset, SplittingStrategy: SplittingStrategy, batch_size)
        * prepare_data()
        * setup()
        * {train|val|test}_dataloader() -> DataLoader
    {qm9|spice}.py
      QM9Dataset(HDF5Dataset)   # Data class for handling QM9 data.
        * properties_of_interest() -> List[str]   # [getter|setter], entries in dataset that are retrieved
        * available_properties() -> List[str]     # list of available properties in the dataset
        * _download()   # Download the hdf5 file containing the data from source.
    transformation.py   # transformation functions applied to entries in dataset
      * default_transformations
    utils.py
      RandomSplittingStrategy(SplittingStrategy)
        * split(dataset: TorchDataset) -> Tuple[Subset, Subset, Subset]   # Splits the provided dataset into training, validation, and testing subsets
```
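Putting these pieces together, the following is a minimal usage sketch of the public API described above. The import paths, the `DatasetFactory` instantiation, the `QM9Dataset` constructor arguments, and the `TorchDataModule` call are assumptions based on the class sketch, not verified calls:

```python
# Minimal sketch, assuming the module layout shown above; import paths and
# constructor arguments are illustrative and may differ from the actual API.
from torch.utils.data import DataLoader

from modelforge.datasets.qm9 import QM9Dataset                            # assumed path
from modelforge.datasets.dataset import DatasetFactory, TorchDataModule   # assumed path
from modelforge.datasets.utils import RandomSplittingStrategy             # assumed path

data = QM9Dataset()                                     # downloads/caches the curated HDF5 file
torch_dataset = DatasetFactory().create_dataset(data)   # -> TorchDataset

sample = torch_dataset[0]                               # Dict[str, torch.Tensor] for one entry
train, val, test = RandomSplittingStrategy().split(torch_dataset)

# A plain DataLoader works as well; batching molecules of different sizes
# may require a custom collate_fn.
loader = DataLoader(train, batch_size=64, shuffle=True)

# Alternatively, the Lightning data module bundles splitting and dataloaders
# (whether it expects a strategy class or instance is an assumption here).
dm = TorchDataModule(data, SplittingStrategy=RandomSplittingStrategy(), batch_size=64)
dm.prepare_data()
dm.setup()
train_loader = dm.train_dataloader()
```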
The curation module provides functionality to retrieve source datasets and generate HDF5 datafiles with a consistent format and units, to be loaded by the dataset module.
The purpose of including this module in the package is to encapsulate all routines used to generate the input datafile, including any and all manipulation of the underlying data (e.g., unit conversion, summing of quantities, calculation of reference energy, etc.), to ensure transparency and reproducibility.
The HDF5 files generated by the curation module have units (with openff-units compatible names) defined in the "u" attribute of each quantity. For efficient data writing/reading, conformers are grouped together into a single entry.
Furthermore, a descriptor is provided for each quantity in a given record that informs the dataset module how to parse the underlying arrays. This description tells us what the axes of each loaded quantity represent, rather than attempting to infer this information or hard-code it, and thus allows the dataloader to be more general and work with different datasets where the names of the underlying quantities may vary. The descriptor is simply two strings concatenated together. The first string tells us how to handle axis=0, with options `series` or `single`; `series` indicates that we will loop over axis=0 to retrieve information for each conformer, whereas `single` tells us this quantity applies to all conformers. The second string tells us about axis=1 (if available), with options `rec`, `mol`, or `atom`: `rec` tells us that the information is a descriptor for the entire record and the quantity is not stored as an array (e.g., SMILES string or molecular formula); `mol` tells us that the quantity is calculated on a per-molecule basis (e.g., energy); and `atom` tells us that the underlying quantity is a per-atom property (e.g., partial charge).
The possible combinations, with examples:

- "single_rec": the quantity encodes a single value that is applicable to all conformers in the record, e.g., molecular formula or SMILES string.
- "single_mol": the quantity applies to all conformers, and the underlying value is per-molecule, e.g., reference energy.
- "single_atom": the quantity applies to all conformers, with atom-wise values encoded, e.g., the atomic numbers (for methane the underlying array would be `[[6],[1],[1],[1],[1]]` with shape `[n_atoms, 1]`).
- "series_mol": the quantity of interest depends on the conformer (i.e., axis=0 will allow us to index into different conformers) and the values are per-molecule, e.g., energy. This will be of shape `[n_configs, x]`, where n_configs is the number of conformers and x denotes a variable dimension (e.g., a quantity such as energy would have x=1, but a rotational constant would have x=3).
- "series_atom": the quantity of interest depends on the conformer, and the values are per-atom, e.g., partial charges. This will be of shape `[n_configs, n_atoms, x]`, where again x denotes a variable size.
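To make the parsing rule concrete, here is a minimal sketch of how such a descriptor string could be interpreted; the helper function is hypothetical and not part of modelforge:

```python
# Hypothetical helper, not part of modelforge: splits the two-part descriptor
# described above into its axis=0 and axis=1 meanings.
def parse_format(descriptor: str) -> tuple[bool, str]:
    axis0, axis1 = descriptor.split("_")   # e.g., "series_atom" -> ("series", "atom")
    per_conformer = axis0 == "series"      # loop over axis=0, one entry per conformer
    return per_conformer, axis1            # axis1 is one of "rec", "mol", or "atom"


assert parse_format("single_rec") == (False, "rec")
assert parse_format("series_atom") == (True, "atom")
```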
As an example, let us load the first record of the QM9 dataset:
```python
from modelforge.curation.qm9_curation import QM9Curation

qm9_dataset = QM9Curation(
    hdf5_file_name="qm9_dataset.hdf5",
    output_file_dir="datasets/hdf5_files",
    local_cache_dir="datasets/qm9_dataset_raw",
)
qm9_dataset.process(max_records=1)
```
In all the curated datasets, a list named `data` is generated. Each entry in the list corresponds to a specific molecule, where the molecule information is stored as a dictionary.
For example, we can access all of the properties stored in the dataset as follows:
```python
for data_point in qm9_dataset.data:
    for key, val in data_point.items():
        print(f"{key} : {val} : {qm9_dataset._record_entries_series[key]}")
```
Note this also accesses the `_record_entries_series` dictionary in the dataset, which stores the descriptor discussed above.
Let us examine a small selection of the stored data to discuss the specific format and common elements between all datasets.
In all datasets, each entry in the `data` list will contain several keys:

- `name` -- unique identifying string for the molecule, typically taken from the original dataset
- `n_configs` -- number of configurations/conformers for the molecule
- `atomic_numbers` -- array of the atomic numbers (in order) of the molecule
- `geometry` -- array of the atomic positions of the conformers

`name` and `n_configs` are both considered to be of format `single_rec` (see above), as these values apply to all data in the molecule and are not conformer dependent.
```
name : dsgdb9nsd_000001 : <class 'str'> : single_rec
n_configs : 1 : <class 'int'> : single_rec
```
`atomic_numbers` is marked as `single_atom`, as this array applies to all conformers (the order of the atomic indices cannot change) but is also a per-atom property, which is why we consider it `single_atom` as opposed to `single_rec`.
Note, as can be seen below, the shape of `atomic_numbers` is (n_atoms, 1), where in this case n_atoms=5. We defined this as (n_atoms, 1) instead of (n_atoms,) for consistency with other per-atom properties:
```
atomic_numbers :
[[6]
 [1]
 [1]
 [1]
 [1]]
<class 'numpy.ndarray'>
single_atom
```
The `geometry` entry is of format `series_atom`, as we will have a unique set of coordinates for each conformer. This is of shape (n_configs, n_atoms, 3), which, since n_configs=1, is (1, 5, 3) here. Note that this is a numpy.ndarray with units attached (using openff-units, which is based on pint).
```
geometry :
[[[-0.0012698135899999999 0.10858041577999998 0.00080009958]
  [0.00021504159999999998 -0.0006031317599999999 0.00019761204]
  [0.10117308433 0.14637511618 2.7657479999999996e-05]
  [-0.05408150689999999 0.14475266137999998 -0.08766437151999999]
  [-0.05238136344999999 0.14379326442999998 0.09063972942]]] nanometer :
<class 'pint.util.Quantity'> : <class 'numpy.ndarray'> :
series_atom
```
Note, for data of format `series_atom`, the final dimension is variable. For example, the charges in this dataset are `series_atom`, but only a single charge is associated with each atom, rather than a vector of shape 3. Hence, we have an entry of shape (n_configs, n_atoms, 1).
```
charges :
[[[-0.535689]
  [0.133921]
  [0.133922]
  [0.133923]
  [0.133923]]] elementary_charge :
<class 'pint.util.Quantity'> : <class 'numpy.ndarray'> :
series_atom
```
Datasets will also contain information about the energy, although the name of this quantity will depend on the dataset itself. For example, in QM9 we have `internal_energy_at_0K`, which is of format `series_mol`, meaning there is a single unique value for each conformer, hence of shape (n_configs, 1) in this case.
```
internal_energy_at_0K :
[[-106277.4161215308]] kilojoule_per_mole :
<class 'pint.util.Quantity'> : <class 'numpy.ndarray'> :
series_mol
```
Again, the last dimension of the shape of `series_mol` entries is variable (and will be inferred during data loading) and can represent not just a single float value per molecule, but also a vector. For example, the harmonic vibrational frequencies have shape (n_configs, 9) in this case:
```
harmonic_vibrational_frequencies :
[[1341.307 1341.3284 1341.365 1562.6731 1562.7453 3038.3205 3151.6034 3151.6788 3151.7078]] / centimeter :
<class 'pint.util.Quantity'> : <class 'numpy.ndarray'> :
series_mol
```
This data array, along with the "format" information, is written to an HDF5 file with roughly the same general structure. HDF5 files can be accessed in a very similar fashion to dictionaries using h5py. The key differences in the data structure are as follows: the `name` field is used to create a top-level key in the HDF5 file, with the properties stored one level below; units are no longer attached to values/arrays, but are instead stored in the attributes (`attrs`) associated with each property; and the format (e.g., `series_mol`) is also stored as an attribute. A sketch of the hierarchy is as follows:
```
1- name
2-- property
3--- attrs: units as "u", format
```
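To make this hierarchy concrete, the following is a minimal sketch of how such a file could be written with h5py. It is illustrative only and not the actual curation code; the record name and shapes are taken from the QM9 example above, and the file name is arbitrary.

```python
import h5py
import numpy as np

# Illustrative sketch only, not the curation module itself: one top-level group
# per record, one dataset per property, units and format stored as attributes.
with h5py.File("example.hdf5", "w") as h5:
    record = h5.create_group("dsgdb9nsd_000001")          # "name" becomes the top-level key

    geometry = record.create_dataset("geometry", data=np.zeros((1, 5, 3)))  # (n_configs, n_atoms, 3)
    geometry.attrs["u"] = "nanometer"                      # openff-units compatible unit string
    geometry.attrs["format"] = "series_atom"               # descriptor discussed above

    n_configs = record.create_dataset("n_configs", data=1)
    n_configs.attrs["format"] = "single_rec"
```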
The following script demonstrates how to access the data (although in general, users will not need to directly access files, as these will be automatically loaded in the dataset classes).
```python
import h5py

filename = "datasets/hdf5_files/qm9_dataset.hdf5"

with h5py.File(filename) as h5:
    for molecule_name in h5.keys():
        print("molecule_name:", molecule_name)
        for property in h5[molecule_name].keys():
            print("-Property:", property)
            print(h5[molecule_name][property].attrs["format"])
            if "rec" not in h5[molecule_name][property].attrs["format"]:
                print(h5[molecule_name][property].shape)
                print(h5[molecule_name][property][()])
            if "u" in h5[molecule_name][property].attrs:
                print(h5[molecule_name][property].attrs["u"])
```
The first few outputs are as follows:
```
molecule_name: dsgdb9nsd_000001
-Property: atomic_numbers
single_atom
(5, 1)
[[6]
 [1]
 [1]
 [1]
 [1]]
-Property: charges
series_atom
(1, 5, 1)
[[[-0.535689]
  [ 0.133921]
  [ 0.133922]
  [ 0.133923]
  [ 0.133923]]]
elementary_charge
```
Note, in this format, units are written as strings; openff-units allows these to be easily reattached to the quantity of interest, simply by passing the string to `Quantity`.
```python
from openff.units import Quantity

value_without_units = h5[molecule_name][property][()]
units_string = h5[molecule_name][property].attrs["u"]
value_with_units = value_without_units * Quantity(units_string)
```
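Once the units are reattached, the value behaves like any pint/openff-units quantity and can, for example, be converted to other units. A small illustrative sketch, assuming `value_with_units` holds a length quantity such as the geometry from the example above:

```python
# Illustrative only: convert a length quantity stored in nanometer to angstrom.
geometry_in_angstrom = value_with_units.to("angstrom")
print(geometry_in_angstrom.magnitude)  # plain numpy array in the new units
print(geometry_in_angstrom.units)      # angstrom
```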