deploy: 65e9809
SGenheden committed Dec 18, 2024
0 parents commit 31fc6c7
Showing 62 changed files with 16,134 additions and 0 deletions.
4 changes: 4 additions & 0 deletions .buildinfo
@@ -0,0 +1,4 @@
# Sphinx build info version 1
# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
config: 5050767e857502e1f711c6d56b459cae
tags: 645f666f9bcd5a90fca523b33c5a78b7
Empty file added .nojekyll
Empty file.
Binary file added _images/sample_reaction.PNG
65 changes: 65 additions & 0 deletions _sources/index.rst.txt
@@ -0,0 +1,65 @@
rxnutils documentation
============================

rxnutils is a collection of routines for working with reactions, reaction templates and template extraction

Introduction
------------

The package is divided into (currently) four sub-packages:

* `chem` - chemistry routines like template extraction or reaction cleaning
* `data` - routines for manipulating various reaction data sources
* `pipeline` - routines for building and executing simple pipelines for modifying and analyzing reactions
* `routes` - routines for handling synthesis routes

Auto-generated API documentation is available, as well as guides for common tasks. See the menu to the left.

Installation
------------

For most users it is as simple as

.. code-block::

    pip install reaction-utils

`For developers`, first clone the repository using Git.

Then execute the following commands in the root of the repository

.. code-block::

    conda env create -f env-dev.yml
    conda activate rxn-env
    poetry install

The `rxnutils` package is now installed in editable mode.

Lastly, make sure to install pre-commits that are run on every commit

.. code-block::

    pre-commit install

Limitations
-----------

* Some old RDKit wheels on PyPI did not include the `Contrib` folder, preventing the use of the `rdkit_RxnRoleAssignment` action
* The pipeline for the Open reaction database requires some additional dependencies, see the documentation for this pipeline
* Using the data pipelines for the USPTO and Open reaction database requires you to set up a second Python environment
* The RInChI capabilities are not supported on macOS


.. toctree::
    :hidden:

    templates
    uspto
    ord
    pipeline
    routes
    rxnutils
7 changes: 7 additions & 0 deletions _sources/modules.rst.txt
@@ -0,0 +1,7 @@
rxnutils
========

.. toctree::
    :maxdepth: 4

    rxnutils
106 changes: 106 additions & 0 deletions _sources/ord.rst.txt
@@ -0,0 +1,106 @@
Open reaction database
=======================

``rxnutils`` contains two pipelines that together import and prepare the reaction data from the `Open reaction database <https://open-reaction-database.org/>`_ so that it can be used for modelling.

It is a complete end-to-end pipeline that is designed to be transparent and reproducible.

Pre-requisites
--------------

The pipeline is divided into two blocks because the dependencies of the atom-mapper package (``rxnmapper``) are incompatible with
the dependencies of the ``rxnutils`` package. Therefore, to be able to use the full pipeline, you need to set up two Python environments.

1. Install ``rxnutils`` according to the instructions in the `README`-file

2. Install the ``ord-schema`` package in the ``rxnutils`` environment

   .. code-block::

       conda activate rxn-env
       python -m pip install ord-schema

3. Download/Clone the ``ord-data`` repository according to the instructions here: https://github.com/Open-Reaction-Database/ord-data

   .. code-block::

       git clone https://github.com/open-reaction-database/ord-data.git .

   Note down the path to the repository as this needs to be given to the preparation pipeline

4. Install ``rxnmapper`` according to the instructions in the repo: https://github.com/rxn4chemistry/rxnmapper


.. code-block::

    conda create -n rxnmapper python=3.6 -y
    conda activate rxnmapper
    conda install -c rdkit rdkit=2020.03.3.0
    python -m pip install rxnmapper

5. Install ``Metaflow`` and ``rxnutils`` in the new environment


.. code-block::

    python -m pip install metaflow
    python -m pip install --no-deps --ignore-requires-python .

Usage
-----

Create a folder for the ORD data and in that folder execute this command in the ``rxnutils`` environment


.. code-block::

    conda activate rxn-env
    python -m rxnutils.data.ord.preparation_pipeline run --nbatches 200 --max-workers 8 --max-num-splits 200 --ord-data ORD_DATA_REPO_PATH

and then, in the environment with ``rxnmapper``, run


.. code-block::

    conda activate rxnmapper
    python -m rxnutils.data.mapping_pipeline run --data-prefix ord --nbatches 200 --max-workers 8 --max-num-splits 200

The ``--max-workers`` flag should be set to the number of CPUs available.

On 8 CPUs and 1 GPU the pipeline takes a couple of hours.
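As a small illustration of the ``--max-workers`` advice, the sketch below builds the mapping command with the worker count taken from ``os.cpu_count()``. Only flags shown above are used. Falling back to one worker and printing the command instead of running it are choices made for this example.

```python
import os
import shlex

# Number of CPUs visible to Python; fall back to 1 if it cannot be determined
workers = os.cpu_count() or 1

# Same command as above, with --max-workers filled in from the CPU count
command = [
    "python", "-m", "rxnutils.data.mapping_pipeline", "run",
    "--data-prefix", "ord",
    "--nbatches", "200",
    "--max-workers", str(workers),
    "--max-num-splits", "200",
]

# Print the command for inspection; it could be passed to subprocess.run instead
print(shlex.join(command))
```

Remember to activate the ``rxnmapper`` environment before running the printed command.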


Artifacts
---------

The pipelines create a number of `tab-separated` CSV files:

* `ord_data.csv` is the imported ORD data
* `ord_data_cleaned.csv` is the cleaned and filtered data
* `ord_data_mapped.csv` is the atom-mapped, modelling-ready data


The cleaning is done so that the reactions can be atom-mapped, and it performs the following tasks:

* Ignore extended SMILES information in the SMILES strings
* Remove molecules not sanitizable by RDKit
* Remove reactions without any reactants or products
* Move all reagents to reactants
* Remove the existing atom-mapping
* Remove reactions with more than 200 atoms when summing reactants and products

(the last is a requirement of ``rxnmapper``, which was trained on a maximum token size roughly corresponding to 200 atoms)
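Some of the cleaning steps are plain string operations on the reaction SMILES. The sketch below illustrates the idea of moving reagents to reactants for a reaction SMILES of the form ``reactants>reagents>products``. The function name ``move_reagents_to_reactants`` is hypothetical, the input is assumed to contain no extended SMILES information, and this is not the implementation used by ``rxnutils``.

```python
def move_reagents_to_reactants(reaction_smiles: str) -> str:
    """Merge the reagent part of a reaction SMILES into the reactant part."""
    reactants, reagents, products = reaction_smiles.split(">")
    # Join the non-empty parts with ".", the SMILES separator between molecules
    merged = ".".join(part for part in (reactants, reagents) if part)
    return f"{merged}>>{products}"


print(move_reagents_to_reactants("CCO.CC(=O)O>OS(=O)(=O)O>CC(=O)OCC"))
# CCO.CC(=O)O.OS(=O)(=O)O>>CC(=O)OCC
```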


The ``ord_data_mapped.csv`` file will have the following columns:

* ID - unique ID from the original database
* Dataset - the name of the dataset from which this reaction is taken
* Date - the date of the experiment as given in the database
* ReactionSmiles - the original reaction SMILES
* Yield - the yield of the first product of the first outcome, if provided
* ReactionSmilesClean - the reaction SMILES after cleaning
* BadMolecules - molecules not sanitizable by RDKit
* ReactantSize - number of atoms in reactants
* ProductSize - number of atoms in products
* mapped_rxn - the mapped reaction SMILES
* confidence - the confidence of the mapping as provided by ``rxnmapper``
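Because the files are tab-separated, they have to be read with an explicit delimiter. The sketch below uses only the standard library and a few made-up rows to show how the ``confidence`` column could be used to keep well-mapped reactions. The threshold of 0.5 and the sample values are arbitrary assumptions for illustration.

```python
import csv
import io

# Made-up rows for illustration only; a real file would be opened with open(...)
sample = io.StringIO(
    "ID\tmapped_rxn\tconfidence\n"
    "rxn-1\t[CH3:1][OH:2]>>[CH3:1][O-:2]\t0.95\n"
    "rxn-2\t[CH3:1]Br>>[CH3:1]O\t0.40\n"
)

# The files are tab-separated despite the .csv extension
reader = csv.DictReader(sample, delimiter="\t")

# Keep reactions whose mapping confidence is at or above an arbitrary threshold
threshold = 0.5
confident = [row for row in reader if float(row["confidence"]) >= threshold]

print([row["ID"] for row in confident])
```

With Pandas, the equivalent read would be ``pd.read_csv(path, sep="\t")``.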
140 changes: 140 additions & 0 deletions _sources/pipeline.rst.txt
@@ -0,0 +1,140 @@
Pipeline
========

``rxnutils`` provides a simple pipeline to perform simple tasks on reaction SMILES and templates in a CSV file.


The pipeline works on `tab-separated` CSV files (TSV files)


Usage
-----

To exemplify the pipeline capabilities, we will have a look at the pipeline used to clean the USPTO data.

The input to the pipeline is a simple YAML-file that specifies each action to take. The actions will be executed
sequentially, one after the other and each action takes a number of input arguments.

This is the YAML-file used to clean the USPTO data:

.. code-block:: yaml

    trim_rxn_smiles:
        in_column: ReactionSmiles
        out_column: ReactionSmilesClean
    remove_unsanitizable:
        in_column: ReactionSmilesClean
        out_column: ReactionSmilesClean
    reagents2reactants:
        in_column: ReactionSmilesClean
        out_column: ReactionSmilesClean
    remove_atom_mapping:
        in_column: ReactionSmilesClean
        out_column: ReactionSmilesClean
    reactantsize:
        in_column: ReactionSmilesClean
    productsize:
        in_column: ReactionSmilesClean
    query_dataframe1:
        query: "ReactantSize>0"
    query_dataframe2:
        query: "ProductSize>0"
    query_dataframe3:
        query: "ReactantSize+ProductSize<200"

The first action is called ``trim_rxn_smiles`` and two arguments are given: ``in_column`` specifying which column to use as input and ``out_column`` specifying which column
to use as output.

The following actions ``remove_unsanitizable``, ``reagents2reactants``, ``remove_atom_mapping``, ``reactantsize`` and ``productsize`` work the same way, but might use other columns for output.

The last three actions are actually the same action executed with different arguments. They therefore have to be postfixed with 1, 2 and 3.
The action ``query_dataframe`` takes a ``query`` argument and removes the rows that do not match the query.
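To make the effect of the three ``query_dataframe`` actions concrete, the sketch below applies the same three conditions, one after the other, to a few made-up rows using plain Python. It only illustrates the filtering logic and is not how the pipeline itself evaluates the queries.

```python
# Made-up size values for illustration only
rows = [
    {"ReactantSize": 12, "ProductSize": 10},   # kept
    {"ReactantSize": 0, "ProductSize": 8},     # no reactants
    {"ReactantSize": 15, "ProductSize": 0},    # no products
    {"ReactantSize": 150, "ProductSize": 90},  # too many atoms in total
]

# The three conditions from the YAML file, in the order they are executed
conditions = [
    lambda row: row["ReactantSize"] > 0,
    lambda row: row["ProductSize"] > 0,
    lambda row: row["ReactantSize"] + row["ProductSize"] < 200,
]

# Each step keeps only the rows matching its condition
for condition in conditions:
    rows = [row for row in rows if condition(row)]

print(rows)
```

Only the first row survives all three steps, which mirrors how each action in the pipeline receives the output of the previous one.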

If we save this to ``clean_pipeline.yml`` and given that we have a tab-separated file with USPTO data called ``uspto_data.csv`` we can run the following command

.. code-block::

    python -m rxnutils.pipeline.runner --pipeline clean_pipeline.yml --data uspto_data.csv --output uspto_cleaned.csv

or we can alternatively run it from a python method like this

.. code-block::

    from rxnutils.pipeline.runner import main as validation_runner

    validation_runner(
        [
            "--pipeline",
            "clean_pipeline.yml",
            "--data",
            "uspto_data.csv",
            "--output",
            "uspto_cleaned.csv",
        ]
    )

Actions
-------

To find out what actions are available, you can type

.. code-block::

    python -m rxnutils.pipeline.runner --list

Development
-----------

New actions can easily be added to the pipeline framework. All of the actions are implemented in one of four modules


* ``rxnutils.pipeline.actions.dataframe_mod`` - actions that modify the dataframe, e.g., removing rows or columns
* ``rxnutils.pipeline.actions.reaction_mod`` - actions that modify reaction SMILES
* ``rxnutils.pipeline.actions.dataframe_props`` - actions that compute properties from reaction SMILES
* ``rxnutils.pipeline.actions.templates`` - actions that process reaction templates


To exemplify, let's have a look at the ``productsize`` action


.. code-block:: python

    @action
    @dataclass
    class ProductSize:
        """Action for counting product size"""

        pretty_name: ClassVar[str] = "productsize"
        in_column: str
        out_column: str = "ProductSize"

        def __call__(self, data: pd.DataFrame) -> pd.DataFrame:
            smiles_col = global_apply(data, self._row_action, axis=1)
            return data.assign(**{self.out_column: smiles_col})

        def __str__(self) -> str:
            return f"{self.pretty_name} (number of heavy atoms in product)"

        def _row_action(self, row: pd.Series) -> int:
            _, _, products = row[self.in_column].split(">")
            products_mol = Chem.MolFromSmiles(products)
            if products_mol:
                product_atom_count = products_mol.GetNumHeavyAtoms()
            else:
                product_atom_count = 0
            return product_atom_count

The action is defined as a class ``ProductSize`` that has two class decorators.
The first, ``@action``, registers the action in a global action list and the second, ``@dataclass``, is the dataclass decorator from the standard library.
The ``pretty_name`` class variable is used to identify the action in the pipeline; it is the name you specify in the YAML file.
The other two, ``in_column`` and ``out_column``, are the arguments you can specify in the YAML file for executing the action. They can have default
values in case they don't need to be specified in the YAML file.

When the action is executed by the pipeline the ``__call__`` method is invoked with the current Pandas dataframe as the only argument. This method
should return the modified dataframe.

Lastly, it is nice to implement a ``__str__`` method which is used by the pipeline to print useful information about the action that is executed.
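The registration pattern described above can be sketched with the standard library alone. Everything below is a toy: ``ACTION_REGISTRY``, ``register_action`` and ``RowCount`` are hypothetical names, the data is a list of dictionaries rather than a Pandas dataframe, and the real ``@action`` decorator in ``rxnutils`` may work differently.

```python
from dataclasses import dataclass
from typing import ClassVar, Dict, List

# Toy registry mapping pretty names to action classes (hypothetical)
ACTION_REGISTRY: Dict[str, type] = {}


def register_action(cls: type) -> type:
    """Store the class under its pretty_name and return it unchanged."""
    ACTION_REGISTRY[cls.pretty_name] = cls
    return cls


@register_action
@dataclass
class RowCount:
    """Toy action that prints the number of rows and passes the data on."""

    pretty_name: ClassVar[str] = "rowcount"
    label: str = "rows"

    def __call__(self, data: List[dict]) -> List[dict]:
        print(f"{len(data)} {self.label}")
        return data

    def __str__(self) -> str:
        return f"{self.pretty_name} (prints the number of rows)"


# Look the action up by name and build it from YAML-like keyword arguments
action_obj = ACTION_REGISTRY["rowcount"](label="reactions")
result = action_obj([{"id": 1}, {"id": 2}])
```

Because ``pretty_name`` is annotated as a ``ClassVar``, the dataclass does not treat it as a constructor argument, so only ``label`` can be set when the action is created.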
68 changes: 68 additions & 0 deletions _sources/routes.rst.txt
@@ -0,0 +1,68 @@
Routes
======

``rxnutils`` contains routines to analyse synthesis routes. There are a number of readers that can be used to read routes from a number of
formats, and there are routines to score the different routes.

Reading
-------

The simplest route format supported is a text file, where each reaction is written as a reaction SMILES on its own line.
Routes are separated by an empty line.

For instance:

.. code-block::

    CC(C)N.Clc1cccc(Nc2ccoc2)n1>>CC(C)Nc1cccc(Nc2ccoc2)n1
    Brc1ccoc1.Nc1cccc(Cl)n1>>Clc1cccc(Nc2ccoc2)n1

    Nc1cccc(NC(C)C)n1.Brc1ccoc1>>CC(C)Nc1cccc(Nc2ccoc2)n1
    CC(C)N.Nc1cccc(Cl)n1>>Nc1cccc(NC(C)C)n1

If this is saved to ``routes.txt``, these can be read into route objects with

.. code-block::

    from rxnutils.routes.readers import read_reaction_lists

    routes = read_reaction_lists("routes.txt")

If you have an environment with ``rxnmapper`` installed and the NextMove software ``namerxn`` in your PATH then you can
add atom-mapping and reaction classes to these routes with

.. code-block::

    # This can be set on the command-line as well
    import os
    os.environ["RXNMAPPER_ENV_PATH"] = "/home/username/miniconda/envs/rxnmapper/"

    for route in routes:
        route.assign_atom_mapping(only_rxnmapper=True)
    routes[1].remap(routes[0])

The last line of code also makes sure that the second route shares its atom-mapping with the first route.
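To make the plain-text route format from the Reading section concrete, the sketch below splits such a text into routes by hand, assuming that routes are separated by an empty line. The helper ``split_routes`` is hypothetical and only produces lists of reaction SMILES; it is not a replacement for ``read_reaction_lists``, which builds route objects.

```python
from typing import List


def split_routes(text: str) -> List[List[str]]:
    """Split text into routes, each a list of reaction SMILES."""
    routes: List[List[str]] = []
    current: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if line:
            current.append(line)
        elif current:
            # An empty line closes the current route
            routes.append(current)
            current = []
    if current:
        routes.append(current)
    return routes


text = """CC(C)N.Clc1cccc(Nc2ccoc2)n1>>CC(C)Nc1cccc(Nc2ccoc2)n1
Brc1ccoc1.Nc1cccc(Cl)n1>>Clc1cccc(Nc2ccoc2)n1

Nc1cccc(NC(C)C)n1.Brc1ccoc1>>CC(C)Nc1cccc(Nc2ccoc2)n1
CC(C)N.Nc1cccc(Cl)n1>>Nc1cccc(NC(C)C)n1
"""
print([len(route) for route in split_routes(text)])
```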


Other readers are available

* ``read_aizynthcli_dataframe`` - for reading routes from aizynthcli output dataframe
* ``read_reactions_dataframe`` - for reading routes stored as reactions in a dataframe


For instance, to read routes from a dataframe with reactions, you can do something like the following.
The dataframe has a column ``reaction_smiles`` that holds the reaction SMILES, and the individual routes
are identified by a ``target_smiles`` and ``route_id`` column. The dataframe also has a column ``classification``,
holding the NextMove classification. The dataframe is called ``data``.

.. code-block::

    from rxnutils.routes.readers import read_reactions_dataframe

    routes = read_reactions_dataframe(
        data,
        "reaction_smiles",
        group_by=["target_smiles", "route_id"],
        metadata_columns=["classification"]
    )
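Conceptually, grouping by ``target_smiles`` and ``route_id`` means that all reactions sharing the same pair of values belong to one route. The sketch below shows that idea on made-up rows with plain Python. Only the column names come from the text above, and this is not the implementation of ``read_reactions_dataframe``.

```python
from collections import defaultdict

# Made-up rows for illustration; only the column names come from the text above
rows = [
    {"target_smiles": "CCO", "route_id": 0, "reaction_smiles": "CC=O>>CCO"},
    {"target_smiles": "CCO", "route_id": 1, "reaction_smiles": "C=C.O>>CCO"},
    {"target_smiles": "CCO", "route_id": 1, "reaction_smiles": "CC>>C=C"},
]

# Collect the reaction SMILES of each route under its (target, route id) key
reactions_by_route = defaultdict(list)
for row in rows:
    key = (row["target_smiles"], row["route_id"])
    reactions_by_route[key].append(row["reaction_smiles"])

for key, reactions in reactions_by_route.items():
    print(key, reactions)
```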