PubChem fingerprint revisited

A few years ago we made available our implementation of the PubChem substructure keys fingerprint. Since it was our first attempt, the implementation had a number of shortcomings in terms of performance and accuracy. Recently, we had the opportunity to revisit the code and address some of these shortcomings. In this post, we briefly describe some of the improvements. (For the impatient, the latest code is available here.) We should note that since our original implementation, at least two other implementations have appeared that we’re aware of. It would certainly be interesting to compare the different implementations; if anything, the comparison should reveal fundamental differences in aromaticity and ring perception across the various cheminformatics toolkits.

As with most substructure keys-based fingerprints, perhaps the biggest performance bottleneck is the repeated subgraph isomorphism searching of a fixed set of substructure keys (encoded as either SMILES or SMARTS) against the target molecule. Here, a naïve approach that sequentially searches each key is unlikely to perform well for all but small key sets (e.g., 100–200). Given that roughly 618 of the 881 total keys defined for the PubChem fingerprint are substructure keys, this approach is unlikely to scale, as demonstrated by our original implementation. Efficiently addressing this bottleneck requires an indexing data structure, much as a suffix tree or suffix array does in stringology (though, at the moment, the approach described here is clearly the most effective). In this spirit, we’ve built upon our recent work on large scale structure searching as the indexing engine for the substructure keys, thereby reducing the problem of multiple substructure searches to that of a single superstructure search.

Another shortcoming of our original implementation centered on ring perception, a topic that has its share of ambiguities in the cheminformatics literature. Our use of the smallest set of smallest rings (SSSR) in place of the extended set of smallest rings (ESSR) was clearly not adequate. As such, we’ve modified an algorithm for finding a shortest cycle basis of a graph to handle ESSR effectively.

While we’ve strived to be diligent with the implementation, it’s just not possible to have an exact replica of PubChem’s, for obvious reasons. Nonetheless, depending on the test set, the current implementation averages less than 1 bit of error per molecule. The complete code and example usages are available here. As always, we’d love to receive feedback on how to further improve it.

An online version of the Lilly reactivity & promiscuity filters

Recently Bruns & Watson published a paper describing a set of rules to filter out (potentially) reactive and promiscuous small molecules. There are 275 rules addressing a variety of possible reasons to reject a molecule: general reactivity, assay interference (such as fluorescence), instability, and so on. Their filtering protocol also includes a set of explicit compounds specifically marked as interfering (even though they may have passed the structural rules).

The authors have also made their code available on GitHub; it’s C++ code and compiles easily on Unix systems. While it’s quite straightforward to use, we’ve put up a simple web page where you can paste a collection of SMILES (which should include molecule titles) and retrieve a summary report (in HTML, plain text, or Excel formats). You can also interact with the page via POST requests; an example of doing this in Python is shown below.

import urllib, urllib2
format = 'text' # can be 'excel' or 'html'
url = ''
values = {'format':format, 'queryids':'''[NH2+](C(C)c1ccccc1)CCC(c1ccccc1)c1ccccc1       85273758
O(C(=O)C(C1CCCCC1)c1ccccc1)CC[NH+](CC)CC        85273739
O(C(c1ccccc1)c1ccccc1)CC[NH+](C)C       85273734
n1c(N)c(N=Nc2ccccc2)ccc1N       85273730
S(=O)(=O)(N1CC[NH+](CC1)CC(=O)Nc1c(n[nH]c1C)C)c1cc(OC)c(OC)cc1  85273684
S1(=O)(=O)CC([NH+](CC(=O)Nc2cc(cc(c2)C(OC)=O)C(OC)=O)C)CC1      85273683
FC(F)(F)c1ccc(NC(=O)C([NH+](CC(=O)N2CC[NH+](CC2)Cc2ccccc2)C)C)cc1       85273682'''}
data = urllib.urlencode(values)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
page = response.read()
print page
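The snippet above uses the Python 2 urllib/urllib2 modules. On Python 3, the same POST can be sketched with urllib.request; note that the endpoint URL is not shown in the original post, so the one below is a placeholder:

```python
from urllib import parse, request

fmt = 'text'  # can be 'excel' or 'html'
url = 'http://example.com/endpoint'  # placeholder; the real endpoint is not shown above
values = {'format': fmt,
          'queryids': 'O(C(c1ccccc1)c1ccccc1)CC[NH+](C)C\t85273734'}

data = parse.urlencode(values).encode('utf-8')  # POST bodies must be bytes
req = request.Request(url, data)                # a non-None data makes this a POST
# response = request.urlopen(req)               # uncomment against a real endpoint
# page = response.read().decode('utf-8')
```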

Charting chemical space

Charting, in a technical sense, is the process of mapping objects in high dimensional space to lower dimensional subspace, something which we have previously discussed using the FastMap algorithm. In this post, we consider a more literal interpretation of “charting,” namely, that of simply visualizing molecular data on an X-Y plot. While there are numerous plotting packages available, only a few actually understand chemical structures. Such a tool can be quite useful in elucidating structure activity/property relationships.

MolPlot is a tool that we recently developed for “charting” chemical structures. Its only requirement is a specification of where to extract the coordinates from, which in turn can be any numerical data (e.g., principal axes from principal component analysis, biological readouts, etc.). The tool comes bundled with a kinase panel dataset, so feel free to take it out for a spin. We’d love to get feedback on how to further improve it.

Library synthesizer

Library enumeration and profiling are an integral part of a medicinal chemist’s repertoire during early-stage discovery. Whether it’s synthesis planning for lead optimization or library design for lead identification, having an effective tool to enumerate and profile virtual compounds is ultimately critical to the success of a therapeutic campaign. Library synthesizer is a tool that we recently developed, working together with a medicinal chemist, to address such needs. While the tool is still a work in progress, it’s functional to the point that we thought others might find it useful. Clearly, the potential of this tool is limited only by the imagination (and, of course, the implementation), so we’d love to get feedback.

(A note on the implementation: Due to technical difficulties, we weren’t able to deploy library synthesizer within the Java web start sandbox like the other tools available on this site. This means that, upon launching the tool, you’ll be asked to allow an untrusted application access to your system. Here, please take our word that it’s safe to say yes!)

Large scale substructure searching

Motivated by the recent renewed interest in substructure searching in the literature, we recently developed a proof-of-concept, self-contained substructure searching engine that can scale to large databases with modest hardware requirements. The current prototype is able to handle the PubChem database (>30M structures) with reasonable performance on any modest server with sufficient (>12 GB) RAM. This work is an extension of our recent work on improving fingerprint screening. While it’s tempting to throw out qualitative (and/or unverifiable) performance numbers, we’ll let you be the judge. The prototype hosting the entire PubChem (snapshot taken in September 2011) is available here. Please bear in mind this is a prototype, so it might not handle DoS-type queries (e.g., c1ccccc1) gracefully. The binary and source code for the entire prototype are also available. We’d love to help you deploy it in-house, so feel free to contact us.

MLBD browser

To date, the NIH Molecular Libraries Program (MLP)—through its comprehensive and specialized centers—has produced an unprecedented amount of experimental data characterizing various levels of interactions of small molecules across a broad range of biological targets. While there has been much enthusiasm for the prospect of new discoveries, the fact remains that there is currently no adequate tool available to effectively distill the content-rich information afforded by the MLP data so as to facilitate new discoveries, serendipitously or otherwise. Herein, we present the Molecular Libraries Biological Database (MLBD) browser, our vision of how the MLP data can be organized and distilled in such a way that directly captures the essential information of a therapeutic project:

  • What’s the current status of the project?
  • How much effort, in terms of chemistry and follow-up screens, has gone into the project?
  • How many different series are there in the project?
  • How much optimization has been performed for a particular series?
  • etc.

By simply (i) using data from PubChem as-is, (ii) applying a few simple heuristics, and (iii) performing rudimentary analysis based on tools available elsewhere on this site, we have thus transformed analysis into browsing.

On faster substructure searching

With the rapid growth of chemical databases such as vendor catalogs and in-house screening collections, the constant desire to improve (sub-) structure searching performance has never been more justified. Recent developments in this area have focused primarily on advances in hardware architecture, i.e., taking advantage of the many cores available on general-purpose graphics processing units (GPUs). While the reported performance can be quite impressive (i.e., an order of magnitude over CPU), such approaches often require specific hardware configurations. Herein we describe a practical improvement that can achieve, depending on the query structure, as much as a twofold speed-up over traditional substructure searching.

Before describing the proposed improvement, let’s recall that substructure searching consists of two basic steps: (i) a fast screening step followed by (ii) a relatively slow (sub-) graph isomorphism matching. In (i), molecular graphs are typically encoded as bit strings (or fingerprints) whose length is a multiple of the hardware word size (e.g., 512 or 1024 bits). Such bit strings (where each bit encodes a specific feature of the molecular graph) are formally known as Bloom filters. With some probability of false positives, Bloom filters are extremely efficient data structures for set membership testing. Let p and q denote the Bloom filters of the target and query (molecular) graphs, respectively. Then determining whether q is a subgraph of p amounts to the following binary logic:

(p & q) == q

Only when the above expression is true do p and q move on to a more rigorous subgraph isomorphism matching.
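As a toy illustration (our own variable widths; production fingerprints are 512 bits or more), the screen is one AND plus one comparison:

```python
# Toy 16-bit fingerprints (real ones are 512-1024 bits wide).
p = 0b1011_0110_0101_1101   # target: a superset of the query's bits
q = 0b0011_0100_0100_1000   # query

def may_contain(p, q):
    """Screen: True iff every bit set in q is also set in p."""
    return (p & q) == q

assert may_contain(p, q)      # q passes the screen against p
assert not may_contain(q, p)  # the test is directional
```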

Although Bloom filters are quite powerful, screening through a large database in the millions requires a linear scan on the order of the database size:

foreach pi in db
  if (pi & q) == q then
    do graph isomorphism

When the Bloom filters for the entire database can fit into main memory, the linear scan above should still be manageable on modern server hardware. In fact, with the rapidly falling cost of memory hardware, we posit that in-memory searching is the only effective solution that scales.
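The pseudocode above might look as follows in Python (toy 4-bit fingerprints of our own; in practice the survivors would then go to the graph isomorphism step):

```python
def screen(db, q):
    """Linear-scan screening: yield indices whose fingerprints pass the
    Bloom-filter subset test; survivors still need graph isomorphism."""
    for i, p in enumerate(db):
        if (p & q) == q:
            yield i

# Toy 4-bit fingerprints; a real database holds millions of 512+ bit strings.
db = [0b1111, 0b0101, 0b1010, 0b0111]
q = 0b0101
hits = list(screen(db, q))
print(hits)  # [0, 1, 3]
```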

Instead of working with bit strings, let’s convert them into unsigned integers, which are more natural to work with. Let |p| and |q| be the unsigned decimal values corresponding to p and q, respectively. Now the logical expression above can be relaxed as follows:

|p| >= |q|
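Why does this relaxation hold? If every bit of the query is set in the target, the target’s unsigned value is at least the query’s; the converse fails, so the ordering is a necessary but not sufficient screen. A quick sanity check (variable names are ours):

```python
import random

random.seed(0)
for _ in range(1000):
    p = random.getrandbits(64)
    q = p & random.getrandbits(64)   # force q's bits to be a subset of p's
    assert (p & q) == q              # ... so q passes the exact screen
    assert p >= q                    # ... and the ordering always holds

# The converse fails: a larger value does not imply the subset relation.
p, q = 0b100, 0b011
assert p >= q and (p & q) != q
```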

The conversion effectively endows bit strings with an ordering that is consistent with the basic properties of the Bloom filters. This forms the basis of our proposed improvement, which we summarize as follows:

  • Let P = {p1, …, pN} denote the entire database of N bit strings.
  • A bit string pi is defined as a K-dimensional vector of 32-bit (signed or unsigned) integers; e.g., K = 16 and 32 for 512- and 1024-bit strings, respectively.
  • In a preprocessing step, we construct a k-d tree data structure for P.  A depth-first traversal of the tree guarantees that, for any two given leaf nodes, the leftmost node is never contained within the rightmost node.
  • Now the screening step is reduced to (i) locating the appropriate leaf node in the tree and (ii) marching to the right in depth-first order, reporting all matches.
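As a rough one-dimensional illustration of the pruning idea (our own simplification: sorting by unsigned value stands in for the k-d tree over K-word vectors), step (i) becomes a lower-bound lookup and step (ii) a rightward march:

```python
import bisect

def build_index(db):
    """Preprocess: order fingerprints by unsigned value, so that nothing
    to the left of a position can contain a fingerprint at or beyond it."""
    order = sorted(range(len(db)), key=lambda i: db[i])
    return order, [db[i] for i in order]

def indexed_screen(index, q):
    order, values = index
    # (i) jump to the first fingerprint whose value is >= that of q ...
    start = bisect.bisect_left(values, q)
    # (ii) ... then march right, applying the exact subset test.
    return [order[i] for i in range(start, len(values))
            if (values[i] & q) == q]

db = [0b1111, 0b0101, 0b1010, 0b0111]
index = build_index(db)
hits = sorted(indexed_screen(index, 0b0101))
print(hits)  # [0, 1, 3] -- same candidates as the linear scan, fewer tests
```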

A self-contained reference implementation in Java is available here. It certainly has lots of room for improvement, so we’d love to get feedback. Based on our not-so-exhaustive analysis, we have observed the following key properties:

  • The more specific the query (i.e., the larger the query graph), the faster the search. Because the query bit string tends to contain a large number of on bits, its starting leaf node is typically toward the right, thereby pruning a significant portion of the search space.
  • Bit strings with “structure” perform better than those without. By “structure” we simply mean that the distribution of the bits is not uniform. We can, for example, reorganize the split values during the tree construction based on the bit distribution. This also implies that structural keys perform better than hashed fingerprints.
  • The speed-up factor over linear scanning improves as the dimension K increases.


After a somewhat long hiatus (mostly spent tying up loose ends for the NPC browser), we’re finally back! Motivated by the feedback that we’ve received on the NPC browser, for the past few weeks we’ve been working on a tool—Siphonify—to help make managing chemical structures a bit easier. Now one can organize, search, and browse chemical structures in much the same way as music files and documents. Please check out these slides for a quick tour of its features.  The tool is still very much in alpha stage, so we appreciate feedback and suggestions.

Below are various screenshots of the current version.

NPC paper

The paper describing the NCGC pharmaceutical collection (NPC) resource has just been published in the current issue of Science Translational Medicine. Among other things, the paper definitively resolves the seemingly simple question of how many drugs there are. For the impatient, to the right is a screen capture of the NPC browser showing various numbers from the paper.

We also would like to take this opportunity to give a plug for the NPC browser. In its current form, the browser can be used to answer other similar “simple” questions such as

  • How many kinase inhibitors are in the FDA orange book?
  • Can I perform structure searching of interventions? Likewise, what about drug product labels from DailyMed?
  • Which drugs in the drugs@FDA list are not approved by the UK NHS, and vice versa?
  • And many more…

So if any one of these “simple” questions rouses your interest, please feel free to take the NPC browser out for a spin. As with any of the tools available here, we’d love to receive suggestions on how to improve the software.

The kinome navigator

In a recent paper, Metz et al. describe a statistical framework for constructing the kinome polypharmacology network. Using a large kinase panel (i.e., more than 150,000 activity values across 172 kinases for 3,858 compounds), they propose a set of statistical parameters (based on some measures of reliability and relevance) that can be used to establish relationships between kinases. Although such networks are useful in analyzing polypharmacology trends, the authors subsequently provide the following insight:

In order to impact medicinal chemistry design decisions, what is required is an understanding of specific chemotypes or functional groups that either create or destroy a specific connection, as defined by the therapeutic endpoint.

To that end, we’ve recently put together a tool, the kinome navigator, that provides chemotype-centric views of kinase data. It’s based on an improved version of our automated R-group analysis tool. The input requirement is the same as that of the kinome viewer, namely, kinase activity values expressed in nM. Here is the properly formatted input for the above dataset (which is also bundled with the tool). Below are some sample screenshots.

Click on the button below to launch it. Please note that the tool is memory intensive, so it’s best to run it on a machine with at least 1 GB of memory.  As always, we welcome comments and/or suggestions.

The human kinome Java component

Recently, we received a request to make the human kinome dendrogram accessible programmatically.  We thought this might also be useful for others with similar needs.  Here is the kinome dendrogram as a self-contained Java package, kinome.jar.  The only requirement is Java 1.5 or above.  This sample program provides the basic usage of the component.  To compile it, simply type:

javac -cp kinome.jar:. KinomeTest.java

And to run it

java -cp kinome.jar:. KinomeTest

Here is a sample image generated by the program.

As always, we welcome comments and feedback.

How many drugs are there?

Update May 29, 2011: Please see this post for the answer to the above question.

The NCGC Pharmaceutical Collection (NPC), as we noted earlier, is the result of our ongoing efforts to construct a definitive informatics and screening resource for drug repurposing.  Much of our recent effort has been dedicated to making this resource easily and readily accessible.  To that end, we’re now releasing an updated version of the NPC browser that contains the latest version of the NPC dataset in its entirety.  To the best of our knowledge, this dataset constitutes the most up-to-date record of all compounds that have been approved for human and veterinary use.  The dataset is by no means complete or bug-free, and we welcome comments and feedback to help improve the data quality and coverage.  The NPC resource is available here.

The human kinome (part 1)

Update Nov 27, 2012: The entire source code for this tool is available here.

Kinases are an important class of enzymes that are therapeutically relevant for a wide range of indications such as cancer and neurodegenerative diseases. Given that there are roughly 500 different kinase proteins, selectivity—and in some cases polypharmacology—is perhaps one of the most important parameters considered during lead optimization. Here the lead compound is subjected to a panel of kinase assays, and the results are often projected onto the kinome dendrogram. Such visualizations provide an effective means to evaluate the selectivity profile of lead compounds. Recently, we developed a self-contained kinome prototype tool for this very purpose. At the moment the tool assumes that kinase activities are expressed in μM. Here is an example of the input format for the tool. As always, we welcome suggestions and/or feedback.

Although being able to robustly project data onto the kinome dendrogram is an important step in evaluating selectivity, a more useful scenario is to consider the underlying kinase profile in a broader context of  known/published results. In part 2 of this post, we discuss the integration of our kinome viewer with the kinase subset of the ChEMBL database.  Please stay tuned.

Fragment activity profiler

A challenging task in early stage lead discovery is the triaging of chemical series from high throughput screening (HTS) assays.  Here triaging is a multiplex problem that seeks to find the balance between a number of often competing goals such as potency, tractability, selectivity, novelty, etc.  Depending on the type of assay used (e.g., biochemical, cell-based), the number of identified chemical series can easily be in the hundreds.  Sifting through that much data to identify only a handful of promising series for follow-up can be quite overwhelming.

One effective strategy toward triaging is to utilize well-curated databases to provide the necessary context around each chemical series. In particular, for a given chemical series we’d often like to address the following two basic questions:

  • What is known, if any, about this chemical series with respect to the intended and/or related target(s)?
  • What compound classes are known to modulate the intended and/or related target(s) and how similar are they to the underlying chemical series?

Recently, together with our colleague Rajarshi Guha, we developed a profiling tool to help with the triaging task.  The tool is built around our molecular framework and uses a subset (namely, the IC50 activity type) of the ChEMBL database (release 5) as the backend.  To facilitate activity comparisons across assays, activity values are normalized based on a robust variant of the Z-score (i.e., median and MAD are used in place of mean and standard deviation, respectively).  We hope the tool also serves as an effective way to browse ChEMBL.  Please feel free to let us know how we can make the tool more useful.  If you’d like to have the tool available in-house with your own version of ChEMBL, please drop us a note.  We’ll be happy to help you set it up.
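A minimal sketch of this normalization (the 1.4826 consistency factor, which rescales MAD to a standard-deviation equivalent for normal data, is our assumption; the post does not specify whether it is applied):

```python
import statistics

def robust_zscores(values):
    """Robust Z-scores: the median replaces the mean and the MAD replaces
    the standard deviation (1.4826 * MAD estimates sigma for normal data)."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    return [(v - med) / (1.4826 * mad) for v in values]

# Hypothetical pIC50-like readouts with one outlier assay value.
scores = robust_zscores([6.1, 6.3, 6.2, 6.0, 9.5])
```

Unlike the mean/standard-deviation form, the outlier barely shifts the center or the scale, so the remaining values keep modest scores while the outlier stands out.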

A fragment-based approach to (multiple) MCS

Maximum common subgraph (MCS) is considered one of the most challenging problems in combinatorial optimization.  As a generalization of (sub-) graph isomorphism, MCS has deep roots within cheminformatics.  From reaction mapping to enabling molecular graphs as first-class objects in kernel-based QSAR methods, MCS has become a basic necessity in any cheminformatics toolbox.  However, the utility of MCS (or lack thereof) is currently limited by our inability to solve MCS efficiently.  Pending a significant development in computational complexity theory, this fact will likely remain true for the foreseeable future.

Despite the inherent “hardness” in solving MCS, there exist a number of algorithms for solving MCS within reasonable time on modern hardware.  For a good survey of MCS algorithms, please see this paper.  The purpose of this post, as such, is to briefly describe our approach to solving MCS and, more generally, to the problem of solving MCS for a set of molecular graphs.  Multiple MCS (as the latter problem is often known) has numerous applications in cheminformatics such as Markush structure extraction (e.g., see our related R-group decomposition tool), generalizing bioisostere replacement beyond pairwise, (2D) pharmacophore elucidation, and so forth.

Clique-based approaches are arguably the current state of the art for solving exact MCS.  Here the MCS problem is transformed into one of finding the maximum clique of the corresponding association graph.  This approach has a number of desirable properties, including (but not limited to) the following:

  • Algorithmic elegance.  The clique finding problem has been well studied in computer science, so there are many well-known algorithms available (e.g., see the second DIMACS implementation challenge).  Most of the published maximum clique algorithms are quite easy to understand.  In fact, the most problematic aspects of the clique-based approach, in terms of code, are likely to be the construction of the association graph and the bounding function.  The actual clique finding routine can be expressed in a few lines of code in any modern programming language.
  • Optimized implementation.  Since most efficient clique solvers are based on the branch-and-bound algorithm, they can easily be parallelized on modern multi-core CPUs.  Moreover, with some care the implementation of a clique solver can be done entirely with bit twiddling.
  • Built-in support for disconnected MCS.  Due to the nature of the association graph, the resulting maximum clique often yields disconnected MCS’s.  This can be quite useful, for example, in the identification and visualization of bioisostere replacements.

Notwithstanding these properties, however, clique-based approaches have a few drawbacks.  Given two graphs G1 and G2 with m and n nodes, respectively, the number of nodes in the corresponding association graph is on the order of mn.  Even with a modest-sized molecule of 20 atoms, this can result in an association graph of 400 nodes, on which a fast clique solver will likely spend a few CPU seconds finding the maximum clique.

Thus far we’ve considered MCS as an optimization problem, the premise being that the best solution is also the right one.  This is certainly true in most general settings, but in cheminformatics it’s not quite so obvious.  Consider, for example, the following two molecules:

There are at least 26 different maximum common subgraph isomorphisms here.  Let’s look at one specific MCS below.

The numbering denotes the isomorphism between the two molecules.  (This example also highlights the fact that we only consider maximum common edge subgraph isomorphism.)  Depending on the context, the MCS above might be desirable in that it captures the carboxyl functional group.  In other contexts, the less optimal subgraph isomorphism corresponding to the indole scaffold might be more desirable.

To remedy this shortcoming, a different approach is to consider MCS as a clique enumeration problem.  That is, instead of finding the maximum clique, we enumerate all maximal cliques in the association graph.  Each maximal clique, in turn, corresponds to a subgraph isomorphism, which may or may not be the actual MCS.  The distinction between this approach and the maximum clique approach is that, effectively, no bounds are defined, so the entire search space is evaluated.  Given that there can be as many as 3^(n/3) maximal cliques in a graph of n nodes, such exhaustive enumeration can be prohibitively expensive.
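The enumeration at the heart of this approach can be sketched with the textbook Bron–Kerbosch recursion (our own illustration, shown without pivoting and without the association-graph construction; in the MCS setting each node of adj would be a pair of compatible atoms):

```python
def maximal_cliques(adj):
    """Enumerate all maximal cliques of an undirected graph given as
    {node: set_of_neighbors}, via the basic Bron-Kerbosch recursion."""
    found = []

    def expand(R, P, X):
        # R: current clique; P: candidates; X: already-processed nodes.
        if not P and not X:
            found.append(R)              # nothing can extend R: maximal
            return
        for v in list(P):
            expand(R | {v}, P & adj[v], X & adj[v])
            P = P - {v}
            X = X | {v}

    expand(set(), set(adj), set())
    return found

# A 4-cycle: its maximal cliques are exactly its four edges.
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
cliques = sorted(map(sorted, maximal_cliques(adj)))
print(cliques)  # [[0, 1], [0, 3], [1, 2], [2, 3]]
```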

An effective strategy to speed up the enumeration is to reduce the size of the association graph.  This is precisely the approach behind the C-clique algorithm (and improvements thereof).  By considering only connected subgraph isomorphisms, the C-clique algorithm significantly reduces the size of the search space, making it a very effective algorithm for MCS.  This is the algorithm that motivates our fragment-based (multiple) MCS approach, which we briefly discuss below.  (For the impatient, please click here to take the multiple MCS tool for a spin.)

The main idea behind our fragment-based approach, as we hinted at previously, is simply to use fragments as seeds for the C-clique algorithm.  Effectively, this seeding serves to (i) further reduce the search space, (ii) focus the search in a specific direction, and (iii) move a good portion of the burden of dealing with the internal symmetry of the MCS out of C-clique.  This last point is significant in that it allows us to use other algorithms that are much more effective for automorphism detection.  (Clique-based MCS algorithms are notoriously bad at dealing with symmetry.)

Multiple MCS is simply an extension of the fragment-based MCS.  Here an advantage of our fragment-based approach is that, given a set of molecules, we know the fragments that span the entire set (if any exist).  This allows us to perform pairwise MCS with appropriately seeded fragments to eventually arrive at the multiple MCS.  Clearly, pairwise MCS without seeded fragments will likely yield a non-optimal solution (if any at all).

The tool to try out multiple MCS is available here.  We believe the algorithm is fundamentally correct.  Any unexpected result is likely due to bugs in our implementation, in which case we’d love to hear about it.

The NPC browser

February 19, 2011:  Please see an updated version here.

In preparation for the release of the NCGC pharmaceutical collection (NPC) dataset, we’ve prepared a trimmed down version of Tripod focusing only on compounds.  The software can be accessed from here.  We hope the available features, as limited as they are at the moment, provide a concrete feel for what Tripod will ultimately become.  As always, we would love to get feedback either privately (please see our contact details here) or via this forum.

Some notes about the application.  Like all tools available here, it’s deployed via Java webstart.  If you’re having problems launching the application, please refer to this page for instructions on how to install and set up Java webstart.  Due to the scope of Tripod, however, the deployment requires advanced permissions that are available only to a signed (i.e., trusted) application.  Here we’ve used a self-signed certificate, so if you get a warning about an untrusted certificate, please feel free to accept it.  The deployed dataset is currently a DrugBank subset of the NPC.  We’ll release the full dataset in a couple of months.

October 1, 2010 Update:  Please note that Java 1.6 is required.  The deployment is a self-contained application (browser and embedded database) running locally on your machine with the entire content of DrugBank (except for target information) available for browsing and searching.

The NCGC pharmaceutical collection

The best way to make a drug is to start with a drug.

This is the mantra that an experienced medicinal chemist at the NCGC often preached to us.  Indeed, drug repurposing has become a much needed discovery strategy to help replenish the rapidly depleting drug pipelines.

The first necessary step in any repurposing effort is to figure out what all the known drugs are.  As simple as this might sound, obtaining a comprehensive collection of approved drugs is a non-trivial informatics archeological expedition that requires many laborious hours of (i) “digging” for reliable data sources, (ii) (manually) resolving discrepancies in registry identifiers, synonyms, and/or structures, and (iii) identifying the appropriate vendors/suppliers for sample acquisition and custom synthesis.  The NCGC Pharmaceutical Collection (NPC)—under the guidance of our colleagues Ruili Huang and Noel Southall—is NCGC’s ongoing effort to build such a collection for the purpose of systematic serendipity.  The coverage of the NPC is, to the best of our knowledge, the most complete reported to date.

The development of Tripod is currently focused around the NPC dataset. In particular, we’re working on building various relevant networks—e.g., biological, pharmacology, disease—around all known drugs.  It’s our goal that the initial release of Tripod will come bundled with the NPC dataset and at least one pre-built network.

Molecular framework

Molecular framework forms the basis on which compounds are organized, indexed, and visualized within Tripod.  There are two types of framework currently generated when compounds are registered in Tripod: fragment and topology.  Here molecular fragments are derived from an exhaustive enumeration (up to some imposed limits) of all possible combinations of the smallest set of smallest rings (SSSR).  For a molecule that contains k SSSR rings, the maximum number of possible fragments generated is on the order of 2^k − 1.  In practice, however, the actual number of fragments generated for a typical small molecule is much lower due to symmetries and/or any number of additional constraints, e.g., synthetic accessibility/availability, reactivity filters, etc.
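The 2^k − 1 bound is simply the number of non-empty subsets of the k SSSR rings, e.g.:

```python
from itertools import combinations

def ring_combinations(rings):
    """All non-empty subsets of the SSSR rings; each is a candidate
    fragment before symmetry and other constraints prune the list."""
    out = []
    for r in range(1, len(rings) + 1):
        out.extend(combinations(rings, r))
    return out

rings = ['A', 'B', 'C']              # a hypothetical molecule with k = 3 rings
n = len(ring_combinations(rings))
assert n == 2**3 - 1                 # the 2^k - 1 upper bound
```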

Consider, for example, the following molecule with its associated atom numbering.

The generated fragments are as follows:

Note that the so-called Murcko fragment (2) is a byproduct of the enumeration.  Even though there are only 5 unique fragments, we explicitly retain the atom mapping of the parent molecule as distinct fragments.  This provides much flexibility for downstream analysis.

Molecular topologies are simply carbon skeletons of the molecular fragments. Below are the generated topologies for the given molecule.

Once generated, molecular fragments are extremely useful for many common tasks in cheminformatics.  Our automated R-group analysis tool is one example of such a task.  Another interesting use of fragments is speeding up maximum common substructure (MCS) searches.  That is, using fragments as initial seeds, we were able to significantly improve (by an order of magnitude) on a branch-and-bound MCS algorithm.  From clustering to 2D multi-graph alignment, we see no limits to the utility of fragment-based approaches in cheminformatics.  Coupling this sentiment with the recent progress in fragment-based screening, perhaps fragments are the future of drug discovery.

The tool to generate molecular fragments and topologies is available here.  Please feel free to let us know if you have any problems running it.  As always, we appreciate any feedback.

Scaffold activity diagram

Understanding structure–activity relationships (SAR) is fundamental to early-stage drug discovery.  With limited resources, the fate of a therapeutics campaign often rests upon the project team’s ability to rationalize the SAR of a few selected chemotypes during hit-to-lead and/or lead optimization.  Here R-group tables have predominantly been used as communication devices throughout the decision-making process.  Although very effective, R-group tables can quickly become unwieldy to manage for even a reasonably sized dataset.  Recently, a number of visualization tools (e.g., Scaffold Hunter and Scaffold Explorer, just to name a couple) were introduced to address the limitations of R-group tables.  These highly interactive tools use advanced visualization techniques to display activity data, allowing the user to quickly navigate the SAR landscape.

In a similar spirit, we recently put together a quick prototype for scaffold visualization.  Here we use a graph layout instead of the typical tree hierarchy to organize the scaffolds.  The prototype is flexible in that it will initially discover all “relevant” scaffolds for a dataset.  Once the scaffold graph is generated, the user can edit (hide or add) scaffolds at will.  There are no restrictions on the user-defined scaffolds; Markush scaffolds are fully supported, though it’s recommended that a scaffold be properly aromatized prior to defining the query atoms.  Below are a few snapshots of the tool.  It comes bundled with a number of example datasets, so feel free to give it a try.

The navigation controls are as follows.  Panning is performed through the standard mouse drag.  Zooming is achieved via (i) right-clicking and moving up/down with a three-button mouse, or (ii) holding down the Command key and the mouse button while moving up/down on the Mac.

We would like to take this opportunity to make a few remarks about the tools available here.  First, we’ve gone the extra mile to make sure the tools run within the Java Web Start sandbox.  What this means is that the tools run in a very restricted environment, so it’s not possible for them to do harm to your computer without your explicit knowledge.  Secondly, the only network communication that occurs between the tools and our server is to check for updates; this is part of the Java Web Start launching protocol and is not something we have control over.  And lastly, we use Java 1.5 as the lowest common denominator so as to support as many platforms as possible.  If you get a security warning from your operating system when launching the tools, you can safely ignore it.  We hope these measures are enough to convince you that you can safely try the tools on proprietary data without reservations.

Tautomer generator

The ability of atoms within a molecule to interconvert between hydrogen-donor and -acceptor roles due to tautomerism plays an important role in drug discovery.  This interconversion not only affects the molecule’s interaction with its biological surroundings but also creates a number of challenges for cheminformatics in terms of (i) chemical registration, (ii) structure/similarity searching, and (iii) QSAR modeling.  Our focus in this post is on (i); we hope to have the opportunity to address (ii) and (iii) in future posts.

The difficulties associated with tautomerism in chemical registration can be boiled down to the following:

  • Given a molecule, can we identify all of its tautomeric forms?
  • Given a list of tautomeric forms, how do we identify the most “preferable” form?

Both questions have been well studied in the literature (e.g., see the recent perspective by Roger Sayle and references therein).  Here we describe our take on the problems and make available our implementation as a self-contained tool for experimentation.  Note that we only consider prototropic tautomerism in the sequel, i.e., 1,n-shift tautomerism where n ∈ {3, 5, 7, …}.

Approaches to tautomer enumeration can be broadly classified as either local or global.  (Technically speaking, the term “enumeration” in this context is not correct; “generation” is more appropriate.)  In the local case, a set of well-known structural patterns is applied to transform the input molecule into its different tautomeric forms.  The patterns are typically localized to 1,3-shifts.

Though few details are available for the global approach, the basic outline is as follows:

  1. All hydrogen-donor and -acceptor (hetero-) atoms are first identified.
  2. Next, a tautomeric form is generated for each unique alternating path of double-single bonds between a hydrogen donor-acceptor atom pair by considering the parity of the path.  Care should be taken to handle overlapping paths.
  3. Repeat step (2) for all combinations of unique paths.
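
The steps above can be sketched in a few lines of Python, assuming a toy molecule representation (atoms as indices, bonds as an order map, explicit hydrogen counts); a real implementation must additionally restrict donors/acceptors to heteroatoms and handle overlapping paths:

```python
def bond_order(bonds, a, b):
    return bonds.get((min(a, b), max(a, b)))

def alternating_paths(adj, bonds, donor):
    """Step 2: DFS for paths whose bonds alternate single (1) and
    double (2), starting with a single bond at the hydrogen donor and
    ending on a double bond (giving 1,3-, 1,5-, 1,7-, ... shifts)."""
    paths = []
    def dfs(path, want):
        for b in adj[path[-1]]:
            if b in path or bond_order(bonds, path[-1], b) != want:
                continue
            new = path + [b]
            if want == 2:          # path currently ends on a double bond
                paths.append(new)
            dfs(new, 3 - want)     # alternate between orders 1 and 2
    dfs([donor], 1)
    return paths

def shift(bonds, hydrogens, path):
    """Step 3 (for one path): flip every bond order along the path
    and move one hydrogen from the donor to the acceptor."""
    new_bonds = dict(bonds)
    for a, b in zip(path, path[1:]):
        key = (min(a, b), max(a, b))
        new_bonds[key] = 3 - new_bonds[key]
    new_h = dict(hydrogens)
    new_h[path[0]] -= 1
    new_h[path[-1]] = new_h.get(path[-1], 0) + 1
    return new_bonds, new_h

# Toy enol H-O-CH=CH2: atoms 0=O (donor), 1=C, 2=C (acceptor)
bonds = {(0, 1): 1, (1, 2): 2}
hydrogens = {0: 1, 1: 1, 2: 2}
adj = {}
for (i, j) in bonds:
    adj.setdefault(i, []).append(j)
    adj.setdefault(j, []).append(i)

paths = alternating_paths(adj, bonds, donor=0)
new_bonds, new_h = shift(bonds, hydrogens, paths[0])
print(new_bonds, new_h)  # keto form: O=CH-CH3
```

Because the DFS imposes no length cap, the same machinery finds a 1,11-shift as readily as a 1,3-shift, which is the point of the example below.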

Since this approach does not follow any specific patterns, there is no limit to the distance of the shift.  For example, consider the following two molecules (taken from Fig. 15 of this paper):

Example of 1,11-tautomeric form

Since what we have is a 1,11-shift, it’s unlikely that a typical local approach will be able to resolve 1 and 2 as two different tautomeric forms of the same molecule.  The global approach, on the other hand, is able to handle this “hidden” tautomerism without a problem.

To address the most “preferable” tautomeric form, we extend the tautomer scoring scheme described here.  This scoring scheme, in turn, is used to rank each generated tautomer; the one with the highest score is selected as the canonical tautomer.
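
A toy illustration of the ranking step (the feature names and weights below are entirely hypothetical and stand in for the actual scoring scheme):

```python
def score(features):
    """Hypothetical scoring terms, loosely in the spirit of published
    tautomer-preference rules: reward aromatic rings and carbonyls,
    penalize enols.  The weights are illustrative only."""
    return (100 * features.count("aromatic_ring")
            + 2 * features.count("carbonyl")
            - 5 * features.count("enol"))

# Feature lists standing in for two generated tautomers of one molecule
tautomers = {"keto": ["carbonyl"], "enol": ["enol"]}
canonical = max(tautomers, key=lambda t: score(tautomers[t]))
print(canonical)  # keto
```

Whatever the actual terms, the selection itself is just an argmax over the generated forms, which keeps canonicalization deterministic.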

Below is a quick look at the tautomer generation tool.  Currently, the maximum number of tautomers generated is limited to 1000.  In the snapshot, the canonical tautomer is highlighted in pink.  As always, we appreciate any feedback that would help us improve our implementation.