PyTorch-based Neural Graph Fingerprint for Organic Molecule Representations

XuhanLiu/NGFP

Convolutional Neural Graph Fingerprint

This repository is an implementation of Convolutional Networks on Graphs for Learning Molecular Fingerprints in PyTorch.

It includes a preprocessing function that converts molecules given as SMILES strings into molecule tensors.

Related work

There are several implementations of this paper publicly available:

The closest implementation is the Keras one by GUR9000 and keiserlab. However, this repository represents molecules in a fundamentally different way. The consequences are described in the sections below.

Molecule Representation

Atom, bond and edge tensors

This codebase uses tensors to represent molecules. Each molecule is described by a combination of the following three tensors:

  • atom matrix, size: (max_atoms, num_atom_features) This matrix defines the atom features.

    Each row in the atom matrix is the feature vector of the atom at that row index.

  • edge matrix, size: (max_atoms, max_degree) This matrix defines the connectivity between atoms.

    Each row in the edge matrix lists the neighbours of one atom. A neighbour is encoded by an integer: the index of its feature vector in the atom matrix.

    As atoms can have a variable number of neighbours, not every entry in a row will hold a neighbour index. Unused entries are filled with the masking value -1. (This explicit edge matrix masking value is important for the layers to work.)

  • bond tensor, size: (max_atoms, max_degree, num_bond_features) This tensor defines the bond features.

    The first two dimensions of this tensor mirror the edge matrix. The vector at position (atom, neighbour slot) in the bond tensor holds the features of the bond stored at the same position in the edge matrix.

    Bonds that are unused are masked with 0 vectors.
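
The three tensors described above can be sketched for a single molecule with plain NumPy. This is only an illustration under stated assumptions, not the repository's preprocessing code: the function name `build_molecule_tensors`, the toy atom features `[is_C, is_O]` and the toy bond feature `[is_single]` are all hypothetical.

```python
import numpy as np

def build_molecule_tensors(atom_features, bond_list, max_atoms, max_degree):
    """Pack one molecule into (atoms, edges, bonds) arrays (hypothetical helper).

    atom_features: list of per-atom feature vectors.
    bond_list: list of (i, j, bond_feature_vector), one entry per undirected bond.
    """
    num_atom_features = len(atom_features[0])
    num_bond_features = len(bond_list[0][2])

    atoms = np.zeros((max_atoms, num_atom_features), dtype=np.float32)
    edges = np.full((max_atoms, max_degree), -1, dtype=np.int64)  # -1 = no neighbour
    bonds = np.zeros((max_atoms, max_degree, num_bond_features), dtype=np.float32)

    atoms[:len(atom_features)] = atom_features

    degree = [0] * max_atoms  # next free neighbour slot for each atom
    for i, j, feat in bond_list:
        for a, b in ((i, j), (j, i)):  # store each bond in both directions
            slot = degree[a]
            edges[a, slot] = b
            bonds[a, slot] = feat
            degree[a] += 1
    return atoms, edges, bonds

# Heavy atoms of ethanol: C0 - C1 - O2 (toy features, assumed for illustration).
atoms, edges, bonds = build_molecule_tensors(
    atom_features=[[1, 0], [1, 0], [0, 1]],
    bond_list=[(0, 1, [1]), (1, 2, [1])],
    max_atoms=4,
    max_degree=3,
)
print(atoms.shape, edges.shape, bonds.shape)
print(edges)
```

Row 3 of every array belongs to no atom here, so it keeps its masking values: zeros in the atom matrix and bond tensor, -1 in the edge matrix.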

Batch representations

This code handles molecules in batches. An extra dimension is added at the first index of each of the three tensors. Their respective sizes become:

  • atom matrix, size: (num_molecules, max_atoms, num_atom_features)
  • edge matrix, size: (num_molecules, max_atoms, max_degree)
  • bond tensor size: (num_molecules, max_atoms, max_degree, num_bond_features)

As molecules have different numbers of atoms, max_atoms needs to be defined for the entire dataset. Unused atom rows are masked with 0 vectors.
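
As a rough sketch of the batch layout, the snippet below pads two molecules of different size to a common max_atoms, stacks them, and then sums the neighbour features of each atom while skipping the -1 entries. The helper names `pad_molecule` and `neighbour_feature_sum` and the toy features are assumptions made for this example; the repository's own layers may handle the mask differently.

```python
import numpy as np

def pad_molecule(atoms, edges, bonds, max_atoms):
    """Pad one molecule's arrays along the atom axis (hypothetical helper)."""
    extra = max_atoms - atoms.shape[0]
    atoms = np.pad(atoms, ((0, extra), (0, 0)), constant_values=0)
    edges = np.pad(edges, ((0, extra), (0, 0)), constant_values=-1)
    bonds = np.pad(bonds, ((0, extra), (0, 0), (0, 0)), constant_values=0)
    return atoms, edges, bonds

def neighbour_feature_sum(atoms, edges):
    """Sum the atom features of each atom's neighbours, ignoring -1 entries."""
    mask = edges != -1                       # (batch, max_atoms, max_degree)
    safe = np.where(mask, edges, 0)          # replace -1 by a valid dummy index
    batch_idx = np.arange(atoms.shape[0])[:, None, None]
    gathered = atoms[batch_idx, safe]        # (batch, max_atoms, max_degree, features)
    return (gathered * mask[..., None]).sum(axis=2)

# Molecule A: C0 - O1 (2 atoms). Molecule B: C0 - C1 - O2 (3 atoms).
# Toy atom features [is_C, is_O] and bond feature [is_single], max_degree = 2.
mol_a = (
    np.array([[1, 0], [0, 1]], dtype=np.float32),
    np.array([[1, -1], [0, -1]], dtype=np.int64),
    np.array([[[1], [0]], [[1], [0]]], dtype=np.float32),
)
mol_b = (
    np.array([[1, 0], [1, 0], [0, 1]], dtype=np.float32),
    np.array([[1, -1], [0, 2], [1, -1]], dtype=np.int64),
    np.array([[[1], [0]], [[1], [1]], [[1], [0]]], dtype=np.float32),
)

max_atoms = 3  # must cover the largest molecule in the dataset
padded = [pad_molecule(*mol, max_atoms) for mol in (mol_a, mol_b)]
batch_atoms = np.stack([p[0] for p in padded])
batch_edges = np.stack([p[1] for p in padded])
batch_bonds = np.stack([p[2] for p in padded])

summed = neighbour_feature_sum(batch_atoms, batch_edges)
print(batch_atoms.shape, batch_edges.shape, batch_bonds.shape)
print(summed)
```

Replacing -1 by a dummy index before the lookup and multiplying by the mask afterwards keeps the gather valid while ensuring that masked slots contribute nothing to the sum.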

Dependencies

  • RDKit This dependency is only needed to convert molecules into tensor representations. Once this step is done, the converted data can be stored and RDKit is no longer required.
  • PyTorch Requires PyTorch >= 1.0
  • NumPy Requires NumPy >= 0.19
  • Pandas Optional for examples

Acknowledgements
