Release single-step model environments #15

Merged: 13 commits, Aug 2, 2023
3 changes: 3 additions & 0 deletions .gitignore
@@ -32,3 +32,6 @@ MANIFEST
# Unit test / coverage reports
.coverage
.coverage.*

# Cloned single-step model repositories
syntheseus/reaction_prediction/environments/external/
2 changes: 1 addition & 1 deletion CHANGELOG.md
@@ -9,7 +9,7 @@ and the project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.

### Added

- Release single-step evaluation framework and wrappers for several model types ([#14](https://github.com/microsoft/syntheseus/pull/14)) ([@kmaziarz])
- Release single-step evaluation framework and wrappers for several model types ([#14](https://github.com/microsoft/syntheseus/pull/14), [#15](https://github.com/microsoft/syntheseus/pull/15)) ([@kmaziarz])
- Add option to terminate search when the first solution is found ([#13](https://github.com/microsoft/syntheseus/pull/13)) ([@austint])
- Add code to extract routes in order found instead of by minimum cost ([#9](https://github.com/microsoft/syntheseus/pull/9)) ([@austint])
- Declare support for type checking ([#4](https://github.com/microsoft/syntheseus/pull/4)) ([@kmaziarz])
17 changes: 10 additions & 7 deletions README.md
@@ -6,17 +6,20 @@
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)

Syntheseus is a package for retrosynthetic planning.
It contains implementations of common search algorithms
and a simple API to wrap custom reaction models and write
custom algorithms.
It contains implementations of common search algorithms, a simple API to wrap custom reaction models and write
custom algorithms, and wrappers for many state-of-the-art reaction models from the literature.
It is meant to allow for simple benchmarking of the components
of retrosynthesis algorithms.

## Installation
## Setup

Currently `syntheseus` is not hosted on PyPI
(although this will likely change in the future).
To install, please run:
We support two installation modes:
- *core installation*, not tied to a specific reaction model, allows you to build and benchmark your own models or search algorithms
- *full installation*, backed by one of the supported models, allows you to perform end-to-end retrosynthetic search

For full installation we currently support the following reaction models: Chemformer, LocalRetro, MEGAN, MHNreact, RetroKNN and RootAligned SMILES; see [here](syntheseus/reaction_prediction/environments/README.md) for detailed setup instructions.

For core installation, simply run:

```bash
# Clone and cd into the repository.
# ...
```
31 changes: 31 additions & 0 deletions syntheseus/reaction_prediction/environments/README.md
@@ -0,0 +1,31 @@
# Single-step Model Environments

Every single-step model may require a different environment and set of dependencies.
Here we outline the steps to set up an environment for each of the supported models, which can then be used to run single-step model evaluation or multi-step search.

## Basic setup

All models apart from GLN can be set up using a shared base `conda` environment extended with a few model-specific dependencies. The general workflow is:

```bash
conda env create -f environment_shared.yml # Create the shared environment.
conda activate syntheseus-single-step # Activate the environment.
pip install -e ../../../ # Install `syntheseus`.
source setup_[MODEL_NAME].sh # Run the extra setup commands.
```

If you wish to use several models, it is enough to create the shared environment once and then run all the corresponding setup scripts. Note that RetroKNN depends on LocalRetro, so if you want to use both, running just `setup_retro_knn.sh` is sufficient.
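
For example, a plausible sequence for preparing both Chemformer and RetroKNN within a single shared environment is sketched below (the model choice is only an illustration; the individual commands come from this README and the setup scripts in this directory):

```bash
# Create and activate the shared environment once.
conda env create -f environment_shared.yml
conda activate syntheseus-single-step
pip install -e ../../../

# Run the per-model setup scripts; RetroKNN's script also sets up LocalRetro.
source setup_chemformer.sh
source setup_retro_knn.sh
```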

In `environment_shared.yml` and `setup_local_retro.sh` we pinned the CUDA version (to 11.3) for reproducibility.
If you want to use a different one, make sure to edit these two files accordingly.
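
As a sketch, switching to a different CUDA version (say, 11.6) would mean adjusting roughly the following; the exact package builds below are assumptions and should be verified against the respective `conda` channels:

```bash
# In environment_shared.yml, swap the pinned PyTorch build for one matching your CUDA version, e.g.
#   pytorch=1.10.2=py3.9_cuda11.3_cudnn8.2.0_0  ->  a py3.9_cuda11.6 build (if available)
# In setup_local_retro.sh, install the matching DGL package, e.g.
conda install dgl-cuda11.6 -c dglteam -y  # hypothetical package name; verify it exists for your CUDA version
```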

The GLN model is not compatible with the others: it currently requires a specialized environment whose creation includes building `rdkit` from source.
We packaged all the necessary steps into a Docker environment defined in `gln/Dockerfile`.
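
For reference, building and entering that environment could look roughly like this (the image tag is arbitrary, and the GPU flags depend on your Docker setup):

```bash
# Build the GLN image from the `gln/` directory (which contains the Dockerfile and environment.yml).
docker build -t syntheseus-gln gln/
# Start an interactive container with GPU access.
docker run --gpus all -it syntheseus-gln /bin/bash
```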

## Back-translation

In `reaction_prediction/cli/eval.py` a forward model may be used for computing back-translation (round-trip) accuracy.
Currently, Chemformer is the only supported forward model.

To evaluate a particular model with back-translation computed using Chemformer, simply set up an environment for that model and then run `setup_chemformer.sh` on top.
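
For instance, to evaluate RootAligned with Chemformer-based back-translation, a plausible setup sequence is shown below (the model choice is just an example; the exact `eval.py` flags for enabling back-translation are not covered here):

```bash
# Set up the model under evaluation, then layer Chemformer on top for back-translation.
source setup_root_aligned.sh
source setup_chemformer.sh
```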
12 changes: 12 additions & 0 deletions syntheseus/reaction_prediction/environments/environment_shared.yml
@@ -0,0 +1,12 @@
name: syntheseus-single-step
channels:
- defaults
- conda-forge
- pytorch
dependencies:
- numpy
- pandas
- pip
- python==3.9.7
- pytorch=1.10.2=py3.9_cuda11.3_cudnn8.2.0_0
- rdkit=2021.09.4
38 changes: 38 additions & 0 deletions syntheseus/reaction_prediction/environments/gln/Dockerfile
@@ -0,0 +1,38 @@
FROM mcr.microsoft.com/azureml/openmpi3.1.2-cuda10.0-cudnn7-ubuntu18.04
MAINTAINER krmaziar@microsoft.com

# Set bash, as conda doesn't like dash
SHELL [ "/bin/bash", "--login", "-c" ]

# Make bash aware of conda
RUN echo ". /opt/miniconda/etc/profile.d/conda.sh" >> ~/.profile

# Turn off caching in pip
ENV PIP_NO_CACHE_DIR=1

# Install the dependencies into conda's default environment
COPY ./environment.yml /tmp/
RUN conda install mamba -n base -c conda-forge
RUN mamba env update -p /opt/miniconda -f /tmp/environment.yml && conda clean -ay

# Install RDKit from source
RUN git clone https://github.com/rdkit/rdkit.git
WORKDIR /rdkit
RUN git checkout 7ad9e0d161110f758350ca080be0fc05530bee1e
RUN mkdir build && cd build && cmake -DPy_ENABLE_SHARED=1 \
-DRDK_INSTALL_INTREE=ON \
-DRDK_INSTALL_STATIC_LIBS=OFF \
-DRDK_BUILD_CPP_TESTS=ON \
-DPYTHON_NUMPY_INCLUDE_PATH="$(python -c 'import numpy ; print(numpy.get_include())')" \
-DBOOST_ROOT="$CONDA_PREFIX" \
.. && make && make install
WORKDIR /

# Install GLN (this relies on `CUDA_HOME` being set correctly).
RUN git clone https://github.com/Hanjun-Dai/GLN.git
WORKDIR /GLN
RUN git checkout b5bd7b181a61a8289cc1d1a33825b2c417bed0ef
RUN pip install -e .

ENV PYTHONPATH=$PYTHONPATH:/rdkit:/GLN
ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/rdkit/lib
26 changes: 26 additions & 0 deletions syntheseus/reaction_prediction/environments/gln/environment.yml
@@ -0,0 +1,26 @@
name: gln-env
channels:
- conda-forge
- pytorch
dependencies:
- cudatoolkit=10.0
- cudatoolkit-dev=10
- python=3.7
- pytorch==1.2.0
- scipy
- tqdm
# Dependencies below are needed to build `rdkit` from source:
- boost
- boost-cpp
- cairo
- cmake
- eigen
- gxx_linux-64
- pillow
- pkg-config
- py-boost
- pip:
- torch-cluster==1.4.5
- torch-geometric==1.3.2
- torch-scatter==1.4.0
- torch-sparse==0.4.3
11 changes: 11 additions & 0 deletions syntheseus/reaction_prediction/environments/setup_chemformer.sh
@@ -0,0 +1,11 @@
#!/bin/bash

# Install extra dependencies specific to Chemformer.
pip install pytorch-lightning==1.9.4 git+https://github.com/MolecularAI/pysmilesutils.git

export GITHUB_ORG_NAME=MolecularAI
export GITHUB_REPO_NAME=Chemformer
export GITHUB_REPO_DIR=chemformer
export GITHUB_COMMIT_ID=6333badcd4e1d92891d167426c96c70f5712ecc3

source setup_shared.sh
12 changes: 12 additions & 0 deletions syntheseus/reaction_prediction/environments/setup_local_retro.sh
@@ -0,0 +1,12 @@
#!/bin/bash

# Install extra dependencies specific to LocalRetro.
conda install dgl-cuda11.3 -c dglteam -y
pip install dgllife chardet

export GITHUB_ORG_NAME=kaist-amsg
export GITHUB_REPO_NAME=LocalRetro
export GITHUB_REPO_DIR=local_retro
export GITHUB_COMMIT_ID=7dab59f7f85eca8b1c04c18fe8575fb1568ff7ae

source setup_shared.sh
11 changes: 11 additions & 0 deletions syntheseus/reaction_prediction/environments/setup_megan.sh
@@ -0,0 +1,11 @@
#!/bin/bash

# Install extra dependencies specific to MEGAN.
pip install gin-config==0.3.0 tensorflow==2.13.0 torchtext==0.13.1

export GITHUB_ORG_NAME=molecule-one
export GITHUB_REPO_NAME=megan
export GITHUB_REPO_DIR=$GITHUB_REPO_NAME
export GITHUB_COMMIT_ID=bd6179e42052521e46728adb2bb80dea6905bf40

source setup_shared.sh
@@ -0,0 +1,8 @@
#!/bin/bash

# Install extra dependencies specific to MHNreact.
conda install rdchiral_cpp -c conda-forge -y
pip install scikit-learn scipy swifter tqdm wandb

# Install our fork of the open-source MHNreact code, which includes some efficiency improvements.
pip install git+https://github.com/kmaziarz/mhn-react.git
8 changes: 8 additions & 0 deletions syntheseus/reaction_prediction/environments/setup_retro_knn.sh
@@ -0,0 +1,8 @@
#!/bin/bash

# Set up LocalRetro first, which RetroKNN depends on.
source setup_local_retro.sh

# Install extra dependencies specific to RetroKNN.
conda install faiss-gpu -c pytorch -y
pip install torch-scatter -f https://data.pyg.org/whl/torch-1.10.0+cu113.html
11 changes: 11 additions & 0 deletions syntheseus/reaction_prediction/environments/setup_root_aligned.sh
@@ -0,0 +1,11 @@
#!/bin/bash

# Install extra dependencies specific to RootAligned.
pip install OpenNMT-py==2.2.0 textdistance==4.2.2

export GITHUB_ORG_NAME=otori-bird
export GITHUB_REPO_NAME=retrosynthesis
export GITHUB_REPO_DIR=root_aligned # Override the repository name to make it less ambiguous.
export GITHUB_COMMIT_ID=ea3b5729752fdc319b18ea4c65c1a573e24d7320

source setup_shared.sh
16 changes: 16 additions & 0 deletions syntheseus/reaction_prediction/environments/setup_shared.sh
@@ -0,0 +1,16 @@
#!/bin/bash

# Make a subdirectory for storing downloaded external repositories.
mkdir -p external

# Add the `external/` directory to `PYTHONPATH` when the environment is activated.
mkdir -p $CONDA_PREFIX/etc/conda/activate.d
echo "export PYTHONPATH=$PWD/external:.:$PYTHONPATH" >> $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh
source $CONDA_PREFIX/etc/conda/activate.d/env_vars.sh

export GITHUB_NAME="$GITHUB_ORG_NAME/$GITHUB_REPO_NAME"
export MODEL_DIR="external/$GITHUB_REPO_DIR"

echo "Setting up $GITHUB_NAME under $MODEL_DIR"
git -C external clone "https://github.com/$GITHUB_NAME.git" $GITHUB_REPO_DIR
git -C $MODEL_DIR checkout $GITHUB_COMMIT_ID
10 changes: 5 additions & 5 deletions syntheseus/reaction_prediction/inference/chemformer.py
@@ -34,15 +34,15 @@ def __init__(
# There should be exactly one `*.ckpt` file under `model_dir`.
chkpt_path = get_unique_file_in_dir(model_dir, pattern="*.ckpt")

import Chemformer
import chemformer

# Fix for Chemformer's relative imports.
chemformer_root_dir = get_module_path(Chemformer)
chemformer_root_dir = get_module_path(chemformer)
sys.path.insert(0, chemformer_root_dir)

import Chemformer.molbart.util as util
from Chemformer.molbart.decoder import DecodeSampler
from Chemformer.molbart.models.pre_train import BARTModel
import chemformer.molbart.util as util
from chemformer.molbart.decoder import DecodeSampler
from chemformer.molbart.models.pre_train import BARTModel

self._is_forward = is_forward
self.device = device
18 changes: 9 additions & 9 deletions syntheseus/reaction_prediction/inference/local_retro.py
@@ -31,15 +31,15 @@ def __init__(self, model_dir: Union[str, Path], device: str = "cuda:0") -> None:
- `model_dir/data` contains `*.csv` data files needed by LocalRetro
"""

import LocalRetro
from LocalRetro import scripts
import local_retro
from local_retro import scripts

# We need to hack `sys.path` because LocalRetro uses relative imports.
sys.path.insert(0, get_module_path(LocalRetro))
sys.path.insert(0, get_module_path(local_retro))
sys.path.insert(0, get_module_path(scripts))

from LocalRetro.Retrosynthesis import load_templates
from LocalRetro.scripts.utils import init_featurizer, load_model
from local_retro.Retrosynthesis import load_templates
from local_retro.scripts.utils import init_featurizer, load_model

data_dir = Path(model_dir) / "data"
self.args = init_featurizer(
@@ -67,7 +67,7 @@ def get_parameters(self):

def _mols_to_batch(self, mols: List[Molecule]) -> Any:
from dgllife.utils import smiles_to_bigraph
from LocalRetro.scripts.utils import collate_molgraphs_test
from local_retro.scripts.utils import collate_molgraphs_test

graphs = [
smiles_to_bigraph(
@@ -85,8 +85,8 @@ def _mols_to_batch(self, mols: List[Molecule]) -> Any:
def _build_batch_predictions(
self, batch, num_results, inputs, batch_atom_logits, batch_bond_logits
):
from LocalRetro.scripts.Decode_predictions import get_k_predictions
from LocalRetro.scripts.get_edit import combined_edit, get_bg_partition
from local_retro.scripts.Decode_predictions import get_k_predictions
from local_retro.scripts.get_edit import combined_edit, get_bg_partition

graphs, nodes_sep, edges_sep = get_bg_partition(batch)
start_node = 0
@@ -135,7 +135,7 @@ def _build_batch_predictions(

def __call__(self, inputs: List[Molecule], num_results: int) -> List[BackwardPredictionList]:
import torch
from LocalRetro.scripts.utils import predict
from local_retro.scripts.utils import predict

batch = self._mols_to_batch(inputs)
batch_atom_logits, batch_bond_logits, _ = predict(self.args, self.model, batch)
2 changes: 1 addition & 1 deletion syntheseus/reaction_prediction/inference/retro_knn.py
@@ -53,7 +53,7 @@ def load_data_store(path: Path):
self.adapter.eval()

def _forward_localretro(self, bg):
from LocalRetro.scripts.model_utils import pair_atom_feats, unbatch_feats, unbatch_mask
from local_retro.scripts.model_utils import pair_atom_feats, unbatch_feats, unbatch_mask

bg = bg.to(self.args["device"])
node_feats = bg.ndata.pop("h").to(self.args["device"])
11 changes: 6 additions & 5 deletions syntheseus/reaction_prediction/inference/root_aligned.py
@@ -50,9 +50,10 @@
for key, value in opt_from_config.items():
setattr(opt, key, value)
opt.models = [get_unique_file_in_dir(model_dir, pattern="*.pt")]
opt.output = "/dev/null"
setattr(opt, "synthon", False)

import score
from root_aligned import score

score.opt = opt

@@ -79,7 +80,7 @@ def get_parameters(self):

def _mols_to_batch(self, inputs) -> List[bytes]:
"""Map `Molecule`s into SMILES bytes."""
from score import smi_tokenizer
from root_aligned.score import smi_tokenizer

# Example outcome: b'C C ( = O ) c 1 c c c 2 c ( c c n 2 C ( = O ) O C ( C ) ( C ) C ) c 1\n'.
return [bytes(smi_tokenizer(input.smiles) + "\n", "utf-8") for input in inputs]
@@ -151,7 +152,7 @@ def __call__(self, inputs, num_results: int, random_augmentation=False) -> List[
randomized_mol = Molecule(smiles=randomized_smi, canonicalize=False)
augmented_inputs.append(randomized_mol)
else:
from preprocessing.generate_PtoR_data import clear_map_canonical_smiles
from root_aligned.preprocessing.generate_PtoR_data import clear_map_canonical_smiles

for input in inputs:
product_atom_map_numbers = [i + 1 for i in range(input.rdkit_mol.GetNumAtoms())]
@@ -203,7 +204,7 @@ def __call__(self, inputs, num_results: int, random_augmentation=False) -> List[
for j in range(len(augmented_predictions[i])):
lines.append(augmented_predictions[i][j].replace(" ", ""))

from score import canonicalize_smiles_clear_map
from root_aligned.score import canonicalize_smiles_clear_map

raw_predictions = []
pool = multiprocessing.Pool(multiprocessing.cpu_count())
@@ -227,7 +228,7 @@
ranked_results = [] # shape: `[data_size, augmentation_size x beam_size]`
ranked_scores = []

from score import compute_rank
from root_aligned.score import compute_rank

for i in range(len(predictions)):
rank, _ = compute_rank(predictions[i])
2 changes: 1 addition & 1 deletion syntheseus/reaction_prediction/models/retro_knn.py
@@ -59,7 +59,7 @@ def __init__(self, dim, k=32):
nn.init.constant_(self.edge_proj.bias[0], 10.0)

def forward(self, g, nfeat, efeat, ndist, edist):
from LocalRetro.scripts.model_utils import pair_atom_feats
from local_retro.scripts.model_utils import pair_atom_feats

efeat = reorder_efeat(g, efeat)
x = self.gnn(g, nfeat, efeat)