Datasets

Stefan Verhoeven edited this page Aug 31, 2016 · 14 revisions

Chembl

The Chembl PostgreSQL database can be created in the virtual machine by running the following script:

chembldb_create <chembl_version>

The script performs the following steps:

  1. Downloads the Chembl dump.
  2. Imports the dump.
  3. Creates RDKit tables with molecules (mols_rdkit) and fingerprints (fps_rdkit).

It takes several hours for the script to finish.

Connect to the database from the command line with the following command, replacing <chembl_version> with the version printed by the script:

PGPASSWORD=chembl psql -h localhost -U chembl chembl_<chembl_version>
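
Client libraries that accept a PostgreSQL connection URI can use the same settings as the psql command above. The following is a minimal sketch that builds such a URI. The helper name chembl_uri and the example version "22" are hypothetical; the user, password, host and database naming are taken from the psql command, and the default PostgreSQL port is assumed.

```python
from urllib.parse import quote


def chembl_uri(version, user="chembl", password="chembl", host="localhost"):
    """Build a PostgreSQL connection URI for the Chembl database.

    `version` is the version printed by chembldb_create; the database
    is assumed to be named chembl_<version>, as in the psql command.
    """
    database = f"chembl_{version}"
    # Percent-encode the credentials in case they contain reserved characters.
    return (
        f"postgresql://{quote(user, safe='')}:{quote(password, safe='')}"
        f"@{host}/{database}"
    )


# "22" is a placeholder version, not necessarily the one installed.
print(chembl_uri("22"))
```

The resulting string can be passed to any client that understands libpq-style URIs, including psql itself.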

Kripo

The data sets can be queried using the kripodb command-line utility; see https://github.com/3D-e-Chem/kripodb for more information.

Tiny

A tiny data set is available in the /data/kripo/tiny directory.

  • fragments.sqlite - Fragments SQLite database containing a small number of fragments with their SMILES string and molblock.
  • fingerprints.sqlite - Fingerprints SQLite database with each fingerprint stored as a fastdumped intbitset.
  • distances.h5 - HDF5 file with the distance matrix of the fingerprints, computed using a modified Tanimoto coefficient.
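
The SQLite files can also be inspected directly, for example to see which tables they contain before querying them. The sketch below uses only the Python standard library and makes no assumption about the schema; the function name list_tables is hypothetical, and kripodb remains the intended way to query the data.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return the names of all tables in a SQLite file, sorted by name."""
    # Open the file read-only so the data set cannot be modified by accident.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]
```

Inside the virtual machine, calling list_tables("/data/kripo/tiny/fragments.sqlite") would list the tables of the fragments database.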

GPCR

A GPCR data set is available in the /data/kripo/gpcr directory. It contains all fragments based on GPCR proteins, compared with all proteins in the PDB.

  • kripo.gpcrandhits.sqlite - Fragments sqlite database
  • kripo.gpcr.h5 - HDF5 file with distance matrix

The data set has been published at DOI

PDB

The PDB fragment data set is available in the /data/kripo/pdb directory. It contains all fragments from all proteins in the PDB, compared all against all.
