Convolutional Neural Graph Fingerprint

PyTorch-based Neural Graph Fingerprint for Organic Molecule Representations

This repository is an implementation of Convolutional Networks on Graphs for Learning Molecular Fingerprints in PyTorch.

It includes a preprocessing function that converts molecules from their SMILES representation into molecule tensors.

Related work

There are several implementations of this paper publicly available:

by HIPS using autograd
by debbiemarkslab using Theano
by GUR9000 using Keras
by ericmjl using autograd
by DeepChem using TensorFlow
by keiserlab using Keras

The closest implementations are those by GUR9000 and keiserlab, both in Keras. However, this repository represents molecules in a fundamentally different way. The consequences are described in the sections below.

Molecule Representation

Atom, bond and edge tensors

This codebase uses tensors to represent molecules. Each molecule is described by a combination of the following three tensors:

atom matrix, size: (max_atoms, num_atom_features) This matrix defines the atom features.

Each row in the atom matrix is the feature vector for the atom at that row index.
edge matrix, size: (max_atoms, max_degree) This matrix defines the connectivity between atoms.

Each row in the edge matrix lists the neighbours of one atom. The neighbours are encoded by an integer representing the index of their feature vector in the atom matrix.

As atoms can have a variable number of neighbours, not every entry in a row will hold a neighbour index. These entries are filled with the masking value -1. (This explicit masking value in the edge matrix is required for the layers to work.)
bond tensor, size: (max_atoms, max_degree, num_bond_features) This tensor defines the bond features.

The first two dimensions of this tensor mirror the edge matrix. The vector at position (i, j) in the bond tensor holds the features of the bond between atom i and the neighbour stored at position (i, j) of the edge matrix.

Bonds that are unused are masked with 0 vectors.
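The layout described above can be illustrated with a small hand-built example. This is a minimal NumPy sketch, not the repository's preprocessing code: the toy one-hot features, the constants MAX_ATOMS and MAX_DEGREE, and the neighbour-sum step are all assumptions chosen for illustration. It builds the three tensors for the heavy atoms of ethanol (C-C-O) and shows why the -1 mask matters when gathering neighbour features.

```python
import numpy as np

# Hypothetical sizes and toy feature sets (not the repository's real features).
MAX_ATOMS = 4
MAX_DEGREE = 2
NUM_ATOM_FEATURES = 3   # one-hot over (C, O, N)
NUM_BOND_FEATURES = 2   # one-hot over (single, double)

atom_index = {"C": 0, "O": 1, "N": 2}
bond_index = {"single": 0, "double": 1}

# Ethanol heavy atoms: C0 - C1 - O2
symbols = ["C", "C", "O"]
bonds = [(0, 1, "single"), (1, 2, "single")]

atoms = np.zeros((MAX_ATOMS, NUM_ATOM_FEATURES), dtype=np.float32)
edges = np.full((MAX_ATOMS, MAX_DEGREE), -1, dtype=np.int64)  # -1 = no neighbour
bond_feats = np.zeros((MAX_ATOMS, MAX_DEGREE, NUM_BOND_FEATURES), dtype=np.float32)

# One row of atom features per atom.
for i, symbol in enumerate(symbols):
    atoms[i, atom_index[symbol]] = 1.0

# Each bond is stored twice, once in the row of each atom it connects.
degree = [0] * MAX_ATOMS
for a, b, kind in bonds:
    for src, dst in ((a, b), (b, a)):
        slot = degree[src]
        edges[src, slot] = dst
        bond_feats[src, slot, bond_index[kind]] = 1.0
        degree[src] += 1

# Sum the feature vectors of each atom's neighbours, ignoring masked slots.
mask = edges != -1                       # (MAX_ATOMS, MAX_DEGREE)
safe_edges = np.where(mask, edges, 0)    # replace -1 so the lookup stays in range
gathered = atoms[safe_edges]             # (MAX_ATOMS, MAX_DEGREE, NUM_ATOM_FEATURES)
neighbour_sum = (gathered * mask[..., None]).sum(axis=1)

print(edges)
print(neighbour_sum)
```

Without the mask, an entry of -1 would silently index the last row of the atom matrix, so the masked slots have to be zeroed out explicitly after the lookup.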

Batch representations

This code handles molecules in batches. An extra dimension is added at the first index of each of the three tensors. Their respective sizes become:

atom matrix, size: (num_molecules, max_atoms, num_atom_features)
edge matrix, size: (num_molecules, max_atoms, max_degree)
bond tensor, size: (num_molecules, max_atoms, max_degree, num_bond_features)

As molecules have different numbers of atoms, max_atoms needs to be defined for the entire dataset. Unused atom rows are masked with 0 vectors.
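The padding and stacking step can be sketched as follows. This is an illustrative NumPy example under stated assumptions, not the repository's implementation: the helpers toy_chain, pad_molecule and make_batch are hypothetical names, and filling the edge rows of padded atoms with -1 is an assumption that follows from the masking convention for missing neighbours.

```python
import numpy as np

def toy_chain(num_atoms, max_degree=2, num_atom_features=3, num_bond_features=2):
    """Build tensors for a hypothetical linear chain of identical atoms."""
    atoms = np.ones((num_atoms, num_atom_features), dtype=np.float32)
    edges = np.full((num_atoms, max_degree), -1, dtype=np.int64)
    bonds = np.zeros((num_atoms, max_degree, num_bond_features), dtype=np.float32)
    for i in range(num_atoms - 1):
        for src, dst in ((i, i + 1), (i + 1, i)):
            slot = int((edges[src] != -1).sum())  # next free neighbour slot
            edges[src, slot] = dst
            bonds[src, slot, 0] = 1.0
    return atoms, edges, bonds

def pad_molecule(atoms, edges, bonds, max_atoms):
    """Pad one molecule along the atom axis up to max_atoms."""
    extra = max_atoms - atoms.shape[0]
    if extra < 0:
        raise ValueError("molecule has more atoms than max_atoms")
    atoms = np.pad(atoms, ((0, extra), (0, 0)), constant_values=0)
    edges = np.pad(edges, ((0, extra), (0, 0)), constant_values=-1)
    bonds = np.pad(bonds, ((0, extra), (0, 0), (0, 0)), constant_values=0)
    return atoms, edges, bonds

def make_batch(molecules, max_atoms):
    """Stack padded molecules so that the first axis indexes molecules."""
    padded = [pad_molecule(a, e, b, max_atoms) for a, e, b in molecules]
    atoms, edges, bonds = (np.stack(parts) for parts in zip(*padded))
    return atoms, edges, bonds

batch_atoms, batch_edges, batch_bonds = make_batch(
    [toy_chain(2), toy_chain(3)], max_atoms=4
)
print(batch_atoms.shape, batch_edges.shape, batch_bonds.shape)
```

All molecules in a batch must share the same max_degree and feature sizes, otherwise the stacking step fails.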

Dependencies

RDKit This dependency is necessary to convert molecules into tensor representations. Once this step is done, the resulting data can be stored, and RDKit is no longer a dependency.
PyTorch Requires PyTorch >= 1.0
NumPy Requires NumPy >= 0.19
Pandas Optional for examples
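
As noted for RDKit above, the converted tensors can be stored once and reused without RDKit. One possible way to do this is sketched below with NumPy's .npz format. The helper names save_tensors and load_tensors and the file name molecules.npz are hypothetical, and the repository may store its data differently.

```python
import os
import tempfile

import numpy as np

def save_tensors(path, atoms, edges, bonds):
    """Write the three molecule tensors to a single compressed .npz file."""
    np.savez_compressed(path, atoms=atoms, edges=edges, bonds=bonds)

def load_tensors(path):
    """Read the three molecule tensors back; RDKit is not needed here."""
    with np.load(path) as data:
        return data["atoms"], data["edges"], data["bonds"]

# Placeholder batch with the shapes described in "Batch representations".
atoms = np.zeros((2, 4, 3), dtype=np.float32)
edges = np.full((2, 4, 2), -1, dtype=np.int64)
bonds = np.zeros((2, 4, 2, 2), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "molecules.npz")
    save_tensors(path, atoms, edges, bonds)
    loaded_atoms, loaded_edges, loaded_bonds = load_tensors(path)

print(loaded_atoms.shape, loaded_edges.shape, loaded_bonds.shape)
```

The loaded arrays can then be converted to PyTorch tensors with torch.from_numpy before being passed to a model.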

Acknowledgements

Implementation is based on Duvenaud et al., 2015.
Feature extraction scripts were adapted from the original implementation.
Data preprocessing scripts were rewritten from keiserlab.
Graphpool layer adopted from Han et al., 2016.
