Home of the PaRoutes framework for benchmarking multi-step retrosynthesis predictions.

Overview

PaRoutes is a framework for benchmarking multi-step retrosynthesis methods, i.e. route predictions.

It provides:

  • A curated reaction dataset for building one-step retrosynthesis models
  • Two sets of 10,000 routes
  • Two sets of stock molecules to use as the stop criterion for the search
  • Scripts to compute route quality and route diversity metrics

Prerequisites

Before you begin, ensure you have met the following requirements:

  • Linux, Windows, and macOS are supported, as long as the dependencies are available on your platform.

  • You have installed Anaconda or Miniconda with Python 3.7 - 3.9

The tool has been developed on a Linux platform.

Installation

First clone the repository using Git.

Then execute the following commands in the root of the repository:

conda env create -f env.yml
conda activate paroutes-env
python data/download_data.py

Now all the dependencies and datasets are set up.
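
As a quick sanity check, here is a minimal sketch (assuming the target and stock files described below end up directly in the data/ folder after the download step):

from pathlib import Path

# Hypothetical check: these file names are the ones referenced later in this README
for name in ["n1-targets.txt", "n1-stock.txt", "n5-targets.txt", "n5-stock.txt"]:
    path = Path("data") / name
    print(f"{path}: {'found' if path.exists() else 'missing'}")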

Usage

Performing route predictions

PaRoutes provides lists of target and stock molecules in SMILES format for two sets, n1 and n5.

For n1, you will find the following files in the data/ folder of the repository:

  • n1-targets.txt - the target molecules
  • n1-stock.txt - the stock molecules

For n5, you will find the following files in the data/ folder of the repository:

  • n5-targets.txt - the target molecules
  • n5-stock.txt - the stock molecules

For more information on the files in the data/ folder, please read the README file for that folder.
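
As a minimal sketch (assuming the files contain one SMILES string per line), the n1 targets and stock could be loaded like this:

# Sketch: load the n1 targets as a list and the stock as a set for fast lookup,
# assuming plain text files with one SMILES string per line
with open("data/n1-targets.txt") as fileobj:
    targets = [line.strip() for line in fileobj if line.strip()]

with open("data/n1-stock.txt") as fileobj:
    stock = {line.strip() for line in fileobj if line.strip()}

print(f"{len(targets)} targets, {len(stock)} stock molecules")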

Analysing predictions

The predicted routes exported by your software need to be converted to a format that can be read by the analysis tool. This format is outlined in analysis/README.md.
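
The schema itself is defined in analysis/README.md and is not repeated here. Purely as a hypothetical illustration of what such a conversion step might produce, a predicted route could be serialized as a nested tree of molecule and reaction nodes and written to a single JSON file:

import json

# Hypothetical illustration only: the authoritative route format is defined in analysis/README.md.
# Here a route is sketched as a nested tree of molecule ("mol") and reaction nodes.
example_route = {
    "type": "mol",
    "smiles": "O=C(O)COCCOCCOCCOCCOCCOCCOCC(F)(F)F",
    "children": [
        {
            "type": "reaction",
            "children": [
                {"type": "mol", "smiles": "CCOC(=O)CBr"},
                {"type": "mol", "smiles": "OCCOCCOCCOCCOCCOCCOCC(F)(F)F"},
            ],
        }
    ],
}

# One list of ranked route predictions per target (layout assumed here for illustration)
with open("output_routes.json", "w") as fileobj:
    json.dump([[example_route]], fileobj, indent=2)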

The following command for analysis assumes:

  1. The current directory is the root of the paroutes repo
  2. Your route predictions for the n1 targets, in JSON format, are located at ~/output_routes.json

Then you can type

python analysis/route_quality.py --routes ~/output_routes.json --references data/n1-routes.json --output ~/route_analyses.csv

to calculate the route quality metrics. It will print out how many of the targets were solved and the top-1, top-5 and top-10 accuracies (by default). For further details have a look in the data/README.md file.
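
The per-target results in the output CSV can then be inspected with, for example, pandas (a minimal sketch; the exact column names are described in the analysis documentation):

import os
import pandas as pd

# Sketch: load the per-target quality metrics written by route_quality.py
route_analyses = pd.read_csv(os.path.expanduser("~/route_analyses.csv"))
print(route_analyses.head())
print(f"{len(route_analyses)} targets analysed")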

To perform clustering on the same dataset, you can type

python analysis/route_clusters.py --routes ~/output_routes.json --model data/chembl_10k_route_distance_model.ckpt --min_density 2 --output ~/cluster_analyses.json

The script will print out the average number of clusters formed for each target. For further details have a look in the data/README.md file.
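
Likewise, the clustering results written to the JSON output can be loaded for further processing (again a minimal sketch; the exact structure of each entry is described in the analysis documentation):

import json
import os

# Sketch: load the clustering output written by route_clusters.py
with open(os.path.expanduser("~/cluster_analyses.json")) as fileobj:
    cluster_analyses = json.load(fileobj)

# Assuming one entry per target
print(f"Loaded clustering results for {len(cluster_analyses)} targets")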

Contributing

We welcome contributions, in the form of issues or pull requests.

If you have a question or want to report a bug, please submit an issue.

To contribute with code to the project, follow these steps:

  1. Fork this repository.
  2. Create a branch: git checkout -b <branch_name>.
  3. Make your changes and commit them: git commit -m '<commit_message>'
  4. Push to the remote branch: git push
  5. Create the pull request.

Please use the black package for formatting, and follow the PEP 8 style guide.

Contributors

Yasmine Nahal is acknowledged for the creation of the PaRoutes logo.

The contributors have limited time for support questions, but please do not hesitate to submit an issue (see above).

License

The software is licensed under the Apache 2.0 license (see LICENSE file), and is free and provided as-is.

Comments
  • How to correctly load PaRoutes models?

    Your benchmark is very interesting and I would like to do some experiments with it, but I haven't found instructions on how to use your pre-trained models. Would you mind telling me whether the following code correctly loads and uses your models?

    import numpy as np
    import pandas as pd
    import h5py
    from tensorflow import keras
    from rdkit.Chem import AllChem
    from rdchiral.main import rdchiralRunText
    
    def get_fingerprint(smiles: str) -> np.ndarray:
        mol = AllChem.MolFromSmiles(smiles)
        assert mol is not None
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)  # QUESTION: is this the right fingerprint?
        return np.array(fp, dtype=float)
    
    # Load templates
    df_templates = pd.read_hdf("./data/uspto_rxn_n5_unique_templates.hdf5", key="table")
    
    # Load model, defining custom metrics because without these it gave an error...
    model = keras.models.load_model(
        "./data/uspto_rxn_n5_keras_model.hdf5", 
        custom_objects={
            "top10_acc": keras.metrics.TopKCategoricalAccuracy(k=10, name="top10_acc"),
            "top50_acc": keras.metrics.TopKCategoricalAccuracy(k=10, name="top50_acc"),
        }
    )
    
    # Example use case: run the best reaction for the first 2 targets
    test_smiles = ["O=C(O)COCCOCCOCCOCCOCCOCCOCC(F)(F)F", "COc1cc(N)c(Cl)cc1C(=O)NCCCC1CN(Cc2ccccc2)CCO1"]
    x = np.stack([get_fingerprint(s) for s in test_smiles])
    template_probs = model(x).numpy()
    most_likely_reactions = template_probs.argmax(axis=1)
    
    for i, sm in enumerate(test_smiles):
        reactants = rdchiralRunText(df_templates["retro_template"].values[most_likely_reactions[i]], sm)
        print(f"{i}: {reactants} >> {sm}")
    

    This code runs and produces the following output (in particular, the second reaction fails). Is this the output that you would expect?

    0: ['CCOC(=O)CBr.OCCOCCOCCOCCOCCOCCOCC(F)(F)F'] >> O=C(O)COCCOCCOCCOCCOCCOCCOCC(F)(F)F
    1: [] >> COc1cc(N)c(Cl)cc1C(=O)NCCCC1CN(Cc2ccccc2)CCO1
    

    Thank you in advance for answering my question. Great manuscript and keep up the good open source work! 💯

    documentation 
    opened by AustinT 1
Releases

The latest release is v1.0.0.

Owner

AstraZeneca - Molecular AI: software from the Molecular AI department at AstraZeneca R&D.