PaRoutes is a framework for benchmarking multi-step retrosynthesis methods, i.e. route predictions.
It provides:
- A curated reaction dataset for building one-step retrosynthesis models
- Two sets of 10,000 routes
- Two sets of stock molecules to use as stop-criterion for the search
- Scripts to compute route quality and route diversity metrics
Before you begin, ensure you have met the following requirements:
- Linux, Windows or macOS platforms are supported, as long as the dependencies are supported on these platforms.
- You have installed anaconda or miniconda with python 3.7 - 3.9
The tool has been developed on a Linux platform.
First clone the repository using Git.
Then execute the following commands in the root of the repository
conda env create -f env.yml
conda activate paroutes-env
python data/download_data.py
Now all the dependencies and datasets are set up.
PaRoutes provides a list of targets and a list of stock molecules, both in SMILES format, for each of the two sets n1 and n5.
For n1 you find in the data/ folder of the repository
- n1-targets.txt - the target molecules
- n1-stock.txt - the stock molecules
For n5 you find in the data/ folder of the repository
- n5-targets.txt - the target molecules
- n5-stock.txt - the stock molecules
For more information on the files in the data/ folder, please read the README file for that folder.
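As a minimal sketch of how these files could be read, the snippet below assumes that each file holds one SMILES string per line with no header row. That layout is an assumption, so check the README in the data/ folder before relying on it. The helper name `read_smiles_file` is hypothetical and not part of PaRoutes.

```python
from pathlib import Path
from typing import List


def read_smiles_file(path: str) -> List[str]:
    """Return the non-empty lines of a text file, stripped of whitespace.

    Assumes one SMILES string per line (an assumption, not confirmed here).
    """
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    # Paths are relative to the root of the repository.
    targets_path = Path("data/n1-targets.txt")
    stock_path = Path("data/n1-stock.txt")
    if targets_path.exists() and stock_path.exists():
        targets = read_smiles_file(str(targets_path))
        # A set gives fast membership tests when checking the stop criterion.
        stock = set(read_smiles_file(str(stock_path)))
        print(f"{len(targets)} targets, {len(stock)} stock molecules")
```

Storing the stock as a set is a design choice for quick lookups. Note that a plain string comparison only works if your software writes SMILES in the same canonical form as the stock file.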
The predicted routes exported by your software need to be converted to a format
that can be read by the analysis tool. This format is outlined in analysis/README.md
The following command for analysis assumes:
- The current directory is the root of the paroutes repo
- Your route predictions for the n1 targets, in JSON format, are located at ~/output_routes.json
Then you can type
python analysis/route_quality.py --routes ~/output_routes.json --references data/n1-routes.json --output ~/route_analyses.csv
to calculate the route quality metrics. By default, it prints how many of the targets were solved and the top-1, top-5 and top-10 accuracies. For further details, have a look at the data/README.md file.
To perform clustering on the same dataset, you can type
python analysis/route_clusters.py --routes ~/output_routes.json --model data/chembl_10k_route_distance_model.ckpt --min_density 2 --output ~/cluster_analyses.json
The script will print out the average number of clusters formed for each target. For further details, have a look at the data/README.md file.
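If you want to run both analyses from Python rather than typing them by hand, the two commands above can be wrapped as in the following sketch. The script paths and flags are copied from the commands above; the function names `build_analysis_commands` and `run_analyses` are hypothetical helpers, not part of PaRoutes. The sketch assumes that the current directory is the root of the repository and that the n1 set is used.

```python
import subprocess
import sys
from pathlib import Path
from typing import List


def build_analysis_commands(routes: str, output_dir: str) -> List[List[str]]:
    """Build the argument lists for the quality and clustering commands."""
    out = Path(output_dir)
    quality = [
        sys.executable, "analysis/route_quality.py",
        "--routes", routes,
        "--references", "data/n1-routes.json",
        "--output", str(out / "route_analyses.csv"),
    ]
    clusters = [
        sys.executable, "analysis/route_clusters.py",
        "--routes", routes,
        "--model", "data/chembl_10k_route_distance_model.ckpt",
        "--min_density", "2",
        "--output", str(out / "cluster_analyses.json"),
    ]
    return [quality, clusters]


def run_analyses(routes: str, output_dir: str) -> None:
    """Run both commands in order, stopping at the first failure."""
    for command in build_analysis_commands(routes, output_dir):
        subprocess.run(command, check=True)
```

Because no shell is involved, `~` is not expanded automatically. Expand it first, for example with `run_analyses(str(Path("~/output_routes.json").expanduser()), str(Path.home()))`.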
Search method | Route set | Solved targets | Top-1 | Top-5 | Top-10 | Routes extracted | Number of clusters |
---|---|---|---|---|---|---|---|
Mcts | set-n1 | 9714 | 0.20 | 0.55 | 0.61 | 273 | 68 |
Mcts | set-n5 | 9676 | 0.09 | 0.34 | 0.42 | 272 | 77 |
Retro* | set-n1 | 9726 | 0.17 | 0.48 | 0.54 | 264 | 68 |
Retro* | set-n5 | 9703 | 0.08 | 0.30 | 0.38 | 149 | 39 |
DFPN | set-n1 | 8475 | 0.19 | 0.33 | 0.33 | 6 | 2 |
DFPN | set-n5 | 7382 | 0.08 | 0.14 | 0.14 | 6 | 2 |
Search method | Route set | Solved targets | Top-1 | Top-5 | Top-10 | Routes extracted | Number of clusters |
---|---|---|---|---|---|---|---|
Mcts | set-n1 | 9716 | 0.2372 | 0.5107 | 0.5414 | 306 | 109 |
Mcts | set-n5 | 9689 | 0.1237 | 0.3584 | 0.4056 | 311 | 113 |
Retro* | set-n1 | 9728 | 0.2027 | 0.4516 | 0.4847 | 154 | 31 |
Retro* | set-n5 | 9729 | 0.1143 | 0.3365 | 0.3897 | 138 | 26 |
DFPN | set-n1 | 7786 | 0.1705 | 0.2456 | 0.246 | 5 | 2 |
DFPN | set-n5 | 6730 | 0.0753 | 0.1146 | 0.1151 | 5 | 2 |
Notes
- "Top-N" refers to the accuracy, i.e. the capability to recover the reference route among the top-N ranked routes
- "Routes extracted" and "Number of clusters" are medians over all targets
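To make these definitions concrete, here is a minimal sketch of how a top-N accuracy and a median could be computed. It is not the code used by the analysis scripts. The input representation (the 1-based rank at which the reference route was found for each target, or `None` if it was not found) is an assumption made for illustration, as is dividing by the total number of targets. All numbers in the example are made up.

```python
from statistics import median
from typing import Optional, Sequence


def top_n_accuracy(ranks: Sequence[Optional[int]], n: int) -> float:
    """Fraction of targets whose reference route is ranked within the top n.

    Each entry of `ranks` is the 1-based rank of the reference route for one
    target, or None if the reference route was not recovered (assumed format).
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    hits = sum(1 for rank in ranks if rank is not None and rank <= n)
    return hits / len(ranks)


# Made-up ranks for five targets; None means the reference was not recovered.
example_ranks = [1, 3, None, 7, 12]
for n in (1, 5, 10):
    print(f"top-{n}: {top_n_accuracy(example_ranks, n):.2f}")

# Made-up per-target counts; the tables report the median of such values.
example_routes_extracted = [4, 10, 7]
print("median routes extracted:", median(example_routes_extracted))
```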
We welcome contributions, in the form of issues or pull requests.
If you have a question or want to report a bug, please submit an issue.
To contribute with code to the project, follow these steps:
- Fork this repository.
- Create a branch: git checkout -b <branch_name>
- Make your changes and commit them: git commit -m '<commit_message>'
- Push to the remote branch: git push
- Create the pull request.
Please use the black package for formatting, and follow the PEP 8 style guide.
Yasmine Nahal is acknowledged for the creation of the PaRoutes logo.
The contributors have limited time for support questions, but please do not hesitate to submit an issue (see above).
The software is licensed under the Apache 2.0 license (see LICENSE file), and is free and provided as-is.
Genheden, S.; Bjerrum, E. PaRoutes: Towards a Framework for Benchmarking Retrosynthesis Route Predictions. Digit. Discov. 2022, 1 (4), 527–539. https://doi.org/10.1039/D2DD00015F.