Synthetically Accessible Virtual Inventory (SAVI) Database

1st Alpha Release

Alpha 1 File Series - July 2015

About 610,000 products generated in the first, early alpha, phase of the SAVI project

The SAVI project is an international collaboration aimed at computationally generating a very large database of screening sample structures that can be synthesized reliably and inexpensively and that have properties desirable for the drug development process.

It is based on:
(a) a set of transforms with rich chemical context annotation including functional group reactivity data (LHASA, LLC, U.S.; and Lhasa Limited, UK)
(b) a set of highly annotated building blocks (Sigma-Aldrich, Global Strategic Services)
(c) the chemoinformatics toolkit CACTVS with custom development (Xemistry GmbH, Germany)

The transforms are a set of more than 1,500 rules described in the CHMTRN/PATRAN language for encoding chemical transformations with chemical context and quality criteria added, based ultimately on the pioneering work of E. J. Corey.

These rules, in contrast to simple SMIRKS transforms, provide:
- A computed decision on whether a reaction will work at all, given the overall structural features of the target.
- Scoring: if the reaction works, how robust it is, taking into account overall structural features.
- A determination of whether interfering groups need protection; the required protection can then be built into the final starting material queries to prioritize pre-protected starting materials.
- Proposals of suitable context-dependent reaction conditions.
- Textual warnings in specific circumstances, such as the potential for multiple products, borderline conditions, etc.

The rules are accompanied by a set of functional group reactivity data, i.e. a table describing whether each of the standard functional groups in the rule set is unstable under each of the standard conditions.
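As a minimal sketch of how such a reactivity table can be consulted, the lookup below flags the functional groups in a molecule that would not survive a given set of conditions and would therefore need protection. The group names, condition names, and stability values are hypothetical placeholders chosen for illustration; they are not taken from the actual LHASA data.

```python
# Hypothetical entries for illustration only; the real reactivity table
# used by SAVI is not reproduced here.
UNSTABLE = {
    ("aldehyde", "strong_base"): True,
    ("aldehyde", "mild_acid"): False,
    ("ester", "strong_base"): True,
    ("ester", "mild_acid"): False,
    ("aryl_halide", "strong_base"): False,
}


def interfering_groups(groups_present, condition, table=UNSTABLE):
    """Return the groups that the table marks as unstable under `condition`.

    Pairs missing from the table are treated as stable, which is an
    assumption of this sketch rather than documented SAVI behavior.
    """
    return sorted(
        group
        for group in set(groups_present)
        if table.get((group, condition), False)
    )


if __name__ == "__main__":
    print(interfering_groups(["ester", "aryl_halide", "aldehyde"], "strong_base"))
```

A non-empty result would correspond to the case where a rule reports that protection of interfering groups is required.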

The building blocks are a set of several hundred thousand compounds that are reliably available in gram quantities from or through Sigma-Aldrich. This set has been annotated with pricing information and other business-intelligence data useful for this project.

The chemoinformatics toolkit CACTVS has been expanded in various ways, e.g. with the capability to read the CHMTRN/PATRAN transforms. An important feature that needed to be implemented was the reversal of the original LHASA transform direction, without rewriting rules, for the strictly forward-synthetic SAVI project. Another important capability was the handling of initial and final starting material (SM) queries, which proceeds in four steps: initial SM query extraction from the 2D patterns in the rules; forward reaction from the 2D patterns; scoring (the only original LHASA functionality); and final SM query expansion (R-groups, protecting groups, etc.).

To filter out structures with less-than-desirable attributes in the drug development context, several additional computed properties regarded as important in current drug design have been implemented, such as the demerit scores based on 275 rules for identifying potentially reactive or promiscuous compounds, published by Bruns and Watson (J. Med. Chem. 2012, 55, 9763–9772).
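A minimal sketch of how demerit-based filtering could work once it is applied: each matched rule contributes a number of demerits, and a structure is rejected when the total reaches a cutoff. The rule names and demerit values below are hypothetical placeholders, not the published rules, and the cutoff of 100 is an assumption exposed as a parameter. Substructure matching itself is not shown; the functions take the list of rules already matched for a structure.

```python
# Hypothetical rule names and demerit values, for illustration only.
DEMERITS = {
    "rule_a": 30,
    "rule_b": 50,
    "rule_c": 100,
}


def demerit_score(matched_rules, demerits=DEMERITS):
    """Sum the demerits of all rules matched by one structure."""
    return sum(demerits[rule] for rule in matched_rules)


def passes_filter(matched_rules, cutoff=100, demerits=DEMERITS):
    """Keep a structure only if its total demerits stay below `cutoff`.

    The default cutoff is an assumption of this sketch.
    """
    return demerit_score(matched_rules, demerits) < cutoff


if __name__ == "__main__":
    for rules in (["rule_a"], ["rule_a", "rule_b"], ["rule_c"]):
        print(rules, demerit_score(rules), passes_filter(rules))
```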

In the current, very early alpha stage of this project, and for the file downloadable below, only 11 transforms were used, applied to approx. 230,000 building blocks in one-step reactions only. The ~610,000 resulting products have been annotated but not yet filtered by any of the computed or associated molecular properties. To limit the file size, only on the order of one percent of the theoretically possible products (of one-step reactions) have been sampled. A set of very schematic graphical representations of the transforms implemented so far (two of them were not used for product generation) can be downloaded here.
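The sampling figure allows a rough estimate of the size of the full one-step product space. The sketch below scales the quoted product count by the sampling fraction; since the text only says "on the order of one percent", the range of fractions tried here is an assumption. For a fraction of exactly one percent the estimate is roughly 61 million products.

```python
def estimate_total_products(sampled, sampling_fraction):
    """Scale a sampled product count up to the full theoretical product space."""
    if not 0 < sampling_fraction <= 1:
        raise ValueError("sampling_fraction must be in (0, 1]")
    return sampled / sampling_fraction


SAMPLED = 610_000  # approximate product count quoted in the text

if __name__ == "__main__":
    # Assumed range around "one percent"; the exact fraction is not stated.
    for fraction in (0.005, 0.01, 0.02):
        total = estimate_total_products(SAMPLED, fraction)
        print(f"fraction {fraction}: about {total / 1e6:.1f} million products")
```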

We are ultimately aiming at creating a database of one billion high-quality screening samples that should be easily and cheaply synthesizable. These novel molecules will all be annotated with a proposed simple and high-yield synthetic route, and will have been filtered by all the molecular properties generally recognized as important in cutting-edge drug design that we will have implemented by then. A web GUI is planned that will allow users free access to this database via searches by various criteria including substructure searches. It will also present links to pages where users can place requests for having the molecule(s) synthesized by commercial entities.

The following individuals have so far been contributing to this project:

Lhasa Limited, Leeds, UK:

LHASA, LLC, Newton, MA, U.S.:

Sigma-Aldrich:

Xemistry GmbH, Königstein, Germany:

Novartis:

NCI CADD Group:

610,492 SAVI-generated products in SD format. This is a 374 MB .gz file that uncompresses to 4.4 GB.


The downloadable SD file is a very early alpha version of the set of generated products. The structures in this file may or may not be part of the final SAVI database. They are meant to be looked at, and commented on, by early users. Any feedback about individual structures or the entire set, and the data associated with them, is welcome.
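Because the file is several gigabytes when uncompressed, it can be convenient to stream it directly from the .gz archive instead of unpacking it first. The following is a minimal sketch using only the Python standard library. It relies on two general SD format conventions: each record ends with a line containing `$$$$`, and data items after the `M  END` line start with a `> <NAME>` header and end with a blank line. The function names are illustrative, and the names of the data fields in the SAVI file are not assumed here.

```python
import gzip


def iter_sd_records(path):
    """Yield each record of a gzipped SD file as a list of lines."""
    record = []
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if line.strip() == "$$$$":  # record terminator
                yield record
                record = []
            else:
                record.append(line)
    # Trailing lines without a closing $$$$ are ignored in this sketch.


def data_items(record):
    """Collect the '> <NAME>' data items of one record into a dict."""
    items = {}
    name = None
    in_data = False
    for line in record:
        if not in_data:
            # Data items only begin after the end of the connection table.
            in_data = line.startswith("M  END")
            continue
        if line.startswith(">") and "<" in line and ">" in line[1:]:
            name = line[line.index("<") + 1 : line.rindex(">")]
            items[name] = []
        elif name is not None:
            if line == "":
                name = None  # a blank line closes the current data item
            else:
                items[name].append(line)
    return {key: "\n".join(value) for key, value in items.items()}
```

For example, `sum(1 for _ in iter_sd_records("savi_alpha1.sdf.gz"))` would count the records without holding the file in memory; the filename here is a placeholder for wherever the download was saved.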

If you have any questions regarding potential availability of the generated molecules including access to the synthetic starting materials, please contact Scott Hutton.

M. C. Nicklaus

Last Update: 2016-08-27