The intended usage is to add molecular vectorization directly into scikit-learn pipelines, so that the final model predicts directly on RDKit molecules or SMILES strings.
As an example, with the needed scikit-learn and scikit-mol imports and with RDKit mol objects in the mol_list_train and mol_list_test lists:
pipe = Pipeline([('mol_transformer', MorganFingerprintTransformer()), ('Regressor', Ridge())])
pipe.fit(mol_list_train, y_train)
pipe.score(mol_list_test, y_test)
pipe.predict([Chem.MolFromSmiles('c1ccccc1C(=O)C')])
>>> array([4.93858815])
The scikit-learn compatibility should also make it easier to include the fingerprinting step in hyperparameter tuning with scikit-learn's utilities.
The first draft of the project was created at the RDKit UGM 2022 hackathon on 14 October 2022.
- Descriptors
    - MolecularDescriptorTransformer
- Fingerprints
    - MorganFingerprintTransformer
    - MACCSKeysFingerprintTransformer
    - RDKitFingerprintTransformer
    - AtomPairFingerprintTransformer
    - TopologicalTorsionFingerprintTransformer
    - MHFingerprintTransformer
    - SECFingerprintTransformer
    - AvalonFingerprintTransformer
- Conversions
    - SmilesToMol
- Standardizer
    - Standardizer
- Utilities
    - CheckSmilesSanitazion
Users can install the latest tagged release from PyPI with pip:
pip install scikit-mol
The bleeding-edge version can be installed directly from GitHub:
pip install git+https://github.com/EBjerrum/scikit-mol.git
There is a collection of notebooks in the notebooks directory that demonstrate different aspects and use cases:
- Basic Usage and fingerprint transformers
- Descriptor transformer
- Pipelining with Scikit-Learn classes
- Molecular standardization
- Sanitizing SMILES input
- Integrated hyperparameter tuning of Scikit-Learn estimator and Scikit-Mol transformer
- Using parallel execution to speed up descriptor and fingerprint calculations
- Testing different fingerprints as part of the hyperparameter optimization
There is more information about how to contribute to the project in CONTRIBUTION.md.
There are probably still bugs. Please check the issues on GitHub and report any new ones there.
- Esben Jannik Bjerrum @ebjerrum, esbenbjerrum+scikit_mol@gmail.com
- Carmen Esposito @cespos
- Son Ha, sonha@uni-mainz.de
- Oh-hyeon Choung, ohhyeon.choung@gmail.com
- Andreas Poehlmann, @ap--
- Ya Chen, @anya-chen
- Rafał Bachorz @rafalbachorz
- Adrien Chaton @adrienchaton
- @VincentAlexanderScholz
- @RiesBen