PyTorch-based Neural Graph Fingerprint for Organic Molecule Representations
This repository is a PyTorch implementation of Convolutional Networks on Graphs for Learning Molecular Fingerprints.
It includes a preprocessing function that converts molecules in SMILES representation into molecule tensors.
There are several implementations of this paper publicly available:
- by HIPS using autograd
- by debbiemarkslab using theano
- by GUR9000 using Keras
- by ericmjl using autograd
- by DeepChem using tensorflow
- by keiserlab using Keras
The closest implementations are those by GUR9000 and keiserlab in Keras. However, this repository represents molecules in a fundamentally different way. The consequences are described in the sections below.
This codebase uses tensor matrices to represent molecules. Each molecule is described by a combination of the following three tensors:
- atom matrix, size: (max_atoms, num_atom_features)
  This matrix defines the atom features. Each row in the atom matrix is the feature vector for the atom at that row index.
- edge matrix, size: (max_atoms, max_degree)
  This matrix defines the connectivity between atoms. Each row in the edge matrix lists the neighbours of one atom. The neighbours are encoded by an integer representing the index of their feature vector in the atom matrix.
  As atoms can have a variable number of neighbours, not every entry in a row will have a neighbour index defined. These entries are filled with the masking value of -1. (This explicit edge matrix masking value is important for the layers to work.)
- bond tensor, size: (max_atoms, max_degree, num_bond_features)
  This tensor defines the bond features. The first two dimensions of this tensor represent the bonds defined in the edge matrix. The vector in the bond tensor at the position of a neighbour index in the edge matrix holds the features of that bond.
  Bonds that are unused are masked with 0 vectors.
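The three tensors described above can be sketched by hand for a small molecule. The example below is only an illustration in NumPy, not the repository's preprocessing code: the sizes, the toy one-hot atom and bond features, and all variable names are assumptions made for this sketch. It encodes the heavy atoms of ethanol (C-C-O) and then shows one possible way a layer could use the -1 mask to sum neighbour features. Without the mask, an index of -1 would silently select the last row of the atom matrix.

```python
import numpy as np

# Hypothetical sizes chosen for this illustration only.
MAX_ATOMS = 4          # padding size; ethanol has 3 heavy atoms
MAX_DEGREE = 2         # maximum number of neighbours per atom
NUM_ATOM_FEATURES = 3  # toy one-hot element encoding: [C, O, N]
NUM_BOND_FEATURES = 2  # toy one-hot bond encoding: [single, double]

# Atom matrix: one feature vector per atom, C(0) - C(1) - O(2).
atoms = np.zeros((MAX_ATOMS, NUM_ATOM_FEATURES), dtype=np.float32)
atoms[0] = [1, 0, 0]
atoms[1] = [1, 0, 0]
atoms[2] = [0, 1, 0]
# Entry 3 stays all-zero: it is padding for an unused atom.

# Edge matrix: neighbour indices per atom, -1 where there is no neighbour.
edges = np.full((MAX_ATOMS, MAX_DEGREE), -1, dtype=np.int64)
edges[0, 0] = 1
edges[1, :2] = [0, 2]
edges[2, 0] = 1

# Bond tensor: features of the bond at the same position as in the edge matrix.
bonds = np.zeros((MAX_ATOMS, MAX_DEGREE, NUM_BOND_FEATURES), dtype=np.float32)
single = [1, 0]
bonds[0, 0] = single
bonds[1, 0] = single
bonds[1, 1] = single
bonds[2, 0] = single

# Masked neighbour lookup: sum the feature vectors of each atom's neighbours.
mask = edges != -1                             # (max_atoms, max_degree)
safe_index = np.where(mask, edges, 0)          # replace -1 by a valid index
neighbour_features = atoms[safe_index]         # (max_atoms, max_degree, features)
neighbour_features = neighbour_features * mask[..., None]  # zero out masked slots
neighbour_sum = neighbour_features.sum(axis=1) # (max_atoms, features)
degrees = mask.sum(axis=1)                     # number of real neighbours per atom

print(neighbour_sum)
print(degrees)
```

The same indexing pattern carries over to PyTorch tensors, although the layers in this repository may implement the lookup differently.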
This code handles molecules in batches. An extra dimension is added at the first index of each of the three tensors. Their respective sizes become:
- atom matrix, size: (num_molecules, max_atoms, num_atom_features)
- edge matrix, size: (num_molecules, max_atoms, max_degree)
- bond tensor, size: (num_molecules, max_atoms, max_degree, num_bond_features)
As molecules have different numbers of atoms, max_atoms needs to be defined for the entire dataset. Unused atom rows are masked by 0 vectors.
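The padding and batching step can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the repository's preprocessing function: `pad_molecule`, `make_batch` and `toy_molecule` are hypothetical helpers, max_atoms is taken as the largest molecule in the given list, and all molecules are assumed to share the same max_degree and feature counts. Atom and bond padding uses 0 and edge padding uses -1, following the masking conventions described above.

```python
import numpy as np


def pad_molecule(atoms, edges, bonds, max_atoms):
    """Pad one molecule's tensors along the atom axis up to max_atoms."""
    num_atoms = atoms.shape[0]
    if num_atoms > max_atoms:
        raise ValueError("molecule has more atoms than max_atoms")
    pad = max_atoms - num_atoms
    atoms = np.pad(atoms, ((0, pad), (0, 0)), constant_values=0)
    edges = np.pad(edges, ((0, pad), (0, 0)), constant_values=-1)
    bonds = np.pad(bonds, ((0, pad), (0, 0), (0, 0)), constant_values=0)
    return atoms, edges, bonds


def make_batch(molecules):
    """Stack (atoms, edges, bonds) triples into three batched tensors."""
    max_atoms = max(atoms.shape[0] for atoms, _, _ in molecules)
    padded = [pad_molecule(a, e, b, max_atoms) for a, e, b in molecules]
    atoms, edges, bonds = (np.stack(tensors) for tensors in zip(*padded))
    return atoms, edges, bonds


def toy_molecule(num_atoms, max_degree=2, num_atom_features=3, num_bond_features=2):
    """Build a linear chain of identical atoms with toy features (illustration only)."""
    atoms = np.ones((num_atoms, num_atom_features), dtype=np.float32)
    edges = np.full((num_atoms, max_degree), -1, dtype=np.int64)
    bonds = np.zeros((num_atoms, max_degree, num_bond_features), dtype=np.float32)
    for i in range(num_atoms):
        neighbours = [j for j in (i - 1, i + 1) if 0 <= j < num_atoms]
        edges[i, :len(neighbours)] = neighbours
        bonds[i, :len(neighbours), 0] = 1.0
    return atoms, edges, bonds


batch_atoms, batch_edges, batch_bonds = make_batch([toy_molecule(2), toy_molecule(3)])
print(batch_atoms.shape)  # (num_molecules, max_atoms, num_atom_features)
print(batch_edges.shape)  # (num_molecules, max_atoms, max_degree)
print(batch_bonds.shape)  # (num_molecules, max_atoms, max_degree, num_bond_features)
```

In this sketch the two-atom molecule receives one padded atom entry, so its last atom vector is all zeros and its last edge entries are all -1.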
- RDKit: needed to convert molecules into tensor representations. Once this step has been carried out, the converted data can be stored and RDKit is no longer required.
- PyTorch: requires PyTorch >= 1.0
- NumPy: requires NumPy >= 0.19
- Pandas: optional, used in the examples
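Storing the converted tensors so that later runs do not need RDKit could look like the sketch below. It is only one option, assuming NumPy's `.npz` format; the file name and the placeholder arrays are hypothetical and stand in for real preprocessed data.

```python
import os
import tempfile

import numpy as np

# Placeholder tensors standing in for the output of the preprocessing step.
atoms = np.zeros((2, 4, 3), dtype=np.float32)
edges = np.full((2, 4, 2), -1, dtype=np.int64)
bonds = np.zeros((2, 4, 2, 2), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "molecule_tensors.npz")  # hypothetical file name

    # Save all three tensors in a single compressed archive.
    np.savez_compressed(path, atoms=atoms, edges=edges, bonds=bonds)

    # Load them back; no RDKit import is involved at this point.
    with np.load(path) as data:
        loaded_atoms = data["atoms"]
        loaded_edges = data["edges"]
        loaded_bonds = data["bonds"]

print(loaded_atoms.shape, loaded_edges.shape, loaded_bonds.shape)
```

The loaded arrays can then be converted to PyTorch tensors, for example with `torch.from_numpy`, before being passed to a model.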
- Implementation is based on Duvenaud et al., 2015.
- Feature extraction scripts are based on the original implementation
- Data preprocessing scripts were rewritten from keiserlab
- Graphpool layer adopted from Han, et al., 2016