We’re hiring…

Development of Tripod is progressing steadily.  With limited bandwidth, it’s difficult for us to keep up the pace while also providing support for internal software projects.  To help speed up Tripod development, we now have openings for one or possibly two software engineering positions in our group.  Please click here to see our hiring announcement.

Visualizing structural diversity

Assessing chemical diversity is an integral part of library design.  Depending on the specific requirements (e.g., lead discovery, target-focused), numerous approaches are available for assessing diversity.  A common approach, however, is to calculate a (possibly large) number of relevant properties and/or descriptors for each compound in the library and use principal component analysis (PCA) to project the data as either a 2- or 3-dimensional diversity plot.  This approach affords an effective way of visually assessing property and/or activity diversity of chemical libraries, but it’s not clear how structural diversity can similarly be visualized.  Here, part of the difficulty is due to the lack of an effective method for encoding structural features as Euclidean vectors (which are necessary for linear dimensionality reduction methods such as PCA to work properly).  Instead, structural features are most effective when encoded as either fixed-length binary vectors or hologram fingerprints (i.e., sparse vectors consisting of structural feature counts) with appropriate proximity metrics defined.
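To make "appropriate proximity metrics" concrete, here is a minimal sketch of two widely used choices: the Tanimoto coefficient for binary fingerprints and its min/max generalization (also known as the Ruzicka or weighted Jaccard similarity) for count fingerprints.  The post doesn’t say which metrics Tripod uses, so the function names and the choice of metrics here are assumptions for illustration only.

```python
from collections import Counter


def tanimoto_binary(a, b):
    """Tanimoto similarity of two binary fingerprints given as sets of 'on' bit positions."""
    a, b = set(a), set(b)
    union = len(a | b)
    if union == 0:
        return 1.0  # convention: two empty fingerprints are treated as identical
    return len(a & b) / union


def tanimoto_counts(a, b):
    """Min/max Tanimoto of two count fingerprints given as {feature: count} mappings."""
    a, b = Counter(a), Counter(b)  # missing features read as a count of 0
    keys = set(a) | set(b)
    denominator = sum(max(a[k], b[k]) for k in keys)
    if denominator == 0:
        return 1.0
    return sum(min(a[k], b[k]) for k in keys) / denominator


def to_distance(similarity):
    """Turn a similarity in [0, 1] into a dissimilarity usable by an embedding method."""
    return 1.0 - similarity
```

Neither function needs the fingerprints to live in a Euclidean space, which is exactly why a method that works from pairwise proximities alone is attractive.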

An obvious alternative to PCA is multidimensional scaling (MDS).  MDS is a non-linear dimensionality reduction technique that can be effective at revealing the “natural” manifold of the underlying dataset.  An interesting aspect of data embedding with MDS is that only pairwise proximity values (e.g., distances) are required (and moreover, the proximity measure used doesn’t have to strictly satisfy the standard definition of a metric).  This is quite convenient in that we no longer have to construct an explicit encoding of the molecular graph for embedding.  For example, a reasonable proximity metric can be defined in terms of maximum common subgraph (MCS), whereby the molecular graphs are used directly in the calculation of the metric values (though doing so is only practical for small datasets because of the combinatorial nature of existing MCS algorithms).

A drawback of the classical MDS algorithm is that the number of computed proximity values (and hence the memory requirement) is quadratic in the size of the library.  This is impractical for all but small libraries.  There exist a number of scalable MDS algorithms in the literature.  Here we have implemented one such algorithm known as FastMap.  This algorithm can scale to libraries of hundreds of thousands of compounds, though the implementation below is limited by the physical memory allotted by Java webstart.  (Please let us know if you would like to run FastMap on large libraries.)
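For readers unfamiliar with FastMap, the following is a minimal Python sketch of the general algorithm as published by Faloutsos and Lin, not the code behind the webstart tool.  The function name, its parameters (`pivot_iters`, `seed`) and the clamping of negative residuals are our assumptions for this illustration.  The point to notice is that distances are computed on demand through `dist(i, j)` and never stored as a full matrix, so memory grows linearly with the number of objects.

```python
import math
import random


def fastmap(n, dist, k=2, pivot_iters=5, seed=0):
    """Embed n objects into k dimensions using only calls to dist(i, j).

    Returns a list of n coordinate lists, each of length k.
    """
    if n == 0:
        return []
    rng = random.Random(seed)
    coords = [[0.0] * k for _ in range(n)]

    def residual_sq(i, j, dim):
        # Squared distance that remains after removing the first `dim` coordinates.
        d2 = dist(i, j) ** 2
        for c in range(dim):
            d2 -= (coords[i][c] - coords[j][c]) ** 2
        return max(d2, 0.0)  # assumption: clamp, since non-Euclidean input can go negative

    def farthest_from(i, dim):
        return max(range(n), key=lambda j: residual_sq(i, j, dim))

    for dim in range(k):
        # Pivot heuristic: start anywhere, then repeatedly hop to the farthest object.
        a = rng.randrange(n)
        for _ in range(pivot_iters):
            b = farthest_from(a, dim)
            a = farthest_from(b, dim)
        dab_sq = residual_sq(a, b, dim)
        if dab_sq == 0.0:
            break  # nothing left to separate; remaining coordinates stay at 0
        dab = math.sqrt(dab_sq)
        # Project every object onto the line through the two pivots (cosine law).
        for i in range(n):
            coords[i][dim] = (
                residual_sq(a, i, dim) + dab_sq - residual_sq(b, i, dim)
            ) / (2.0 * dab)
    return coords
```

Each dimension costs a number of distance evaluations proportional to `n * pivot_iters`, so `dist` can be an expensive function (a fingerprint dissimilarity, say) without the quadratic blow-up of classical MDS.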

A Java webstart tool to experiment with FastMap is available here.  Below is an example of a FastMap embedding of a CRE dataset (partitioned by the different active scaffolds).

Structure standardizer

Although much of our recent effort has been toward the user interface, the backend of Tripod is by far where we’ve spent most of our development effort.  We hope to provide a more detailed description of the overall architecture of Tripod in a future post, but suffice it to say Tripod is a self-hosted application with persistent data storage for biological and chemical entities such as compounds, assays, targets, genes, documents, etc.  To make effective use of the entities for downstream analyses (e.g., building polypharmacology networks), the entities themselves must be uniquely registered within Tripod’s persistent data store.  For the most part this is straightforward, since each of the entity types usually has some form of well-known registry identifier (e.g., UniProt accession for protein targets, locus ID for genes, DOI for publications).  The exception, of course, is the compound entity.  Although there are well-known registries available, using any such registry would severely limit the utility of Tripod, especially within a corporate setting.

Compound registration within Tripod is not quite like a traditional chemical registration system.  In addition to assigning unique registry identifiers, Tripod also performs additional processing (e.g., generating fragments and structural indexes) to speed up downstream analyses such as R-group decomposition.  (It’s our hope that any useful analysis task in Tripod will ultimately be reduced to a simple step of browsing.)  In terms of structure standardization, Tripod is aggressive at (i) stripping out salt/solvent, (ii) (de-)protonation, and (iii) removal of “spurious” sp2 stereochemistry.  What we mean by “spurious” here is effectively any E/Z configuration that might be induced by tautomer/mesomer enumeration and/or by an alternating path during proton removal (see Section 5 of the InChI technical manual).
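As a rough illustration of step (i) only, one common heuristic is to keep the largest disconnected component of a structure.  The sketch below applies that heuristic to a SMILES string; it is not Tripod’s standardizer, and the function names and the crude regex-based atom count are assumptions made for this example.  A real implementation would operate on parsed molecular graphs with a cheminformatics toolkit.

```python
import re

# One match per atom: a bracket atom, a two-letter halogen, or an organic-subset symbol.
_ATOM = re.compile(r"\[[^\]]*\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def atom_count(fragment):
    """Rough atom count of a SMILES fragment (every bracket atom counts once)."""
    return len(_ATOM.findall(fragment))


def strip_salts(smiles):
    """Keep the largest dot-separated component of a SMILES string.

    Limitations of this sketch: ties go to the first component, and ring-closure
    digits that bond atoms across a '.' are not handled.
    """
    fragments = [f for f in smiles.split(".") if f]
    if not fragments:
        return smiles
    return max(fragments, key=atom_count)
```

For instance, `strip_salts("CC(=O)[O-].[Na+]")` keeps the acetate component and drops the sodium counterion; neutralizing the remaining charge would belong to step (ii), which this sketch does not attempt.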

We’re now wrapping the standardizer engine used in Tripod’s registration system as a self-contained tool in hopes of getting feedback on how it can be improved.  While we’ve done our best to consult InChI and PubChem’s standardizer during its development, there are just too many possibilities for us, with limited bandwidth, to have any hope of getting it right (for example, the different types of tautomeric forms discussed in this paper should provide a good workout for any chemical registration system).

The standardizer tool is available as a Java webstart application here.  It requires at least Java 1.5.  The tool can be used interactively or in batch mode.  Please let us know if you have problems running it.  Below is a quick look at it.

Automated R-group analysis

Generating R-group tables is a tedious task that we often find ourselves performing repeatedly for every project.  For a typical HTS campaign, this task can be as simple as taking the most potent compounds and performing substructure searches around the identified scaffolds.  For most cases, however, the task is quite involved, requiring a combination of clustering, substructure searches, and pairwise MCS searches.  Two issues with this approach should be noted.  First, clustering results are very sensitive not only to the underlying algorithm, but also to the metric used to compute the distance/similarity values between molecular fingerprints.  We have seen commercial molecular clustering software fail repeatedly on fairly obvious cases.  Second, it doesn’t take much imagination to come up with examples where scaffolds derived from maximum common substructure searches are clearly not the “correct” scaffolds.  For this very reason we’re often puzzled as to why most cheminformatics toolkits (we’re only aware of a couple of exceptions) don’t provide an (exact) implementation for generating/enumerating all maximal common substructures.

Recently, we developed a prototype that can automatically generate R-group tables for any reasonably sized collection (limited by the available CPU cores and physical memory) of med-chem friendly compounds.  Our implementation, in contrast to the one described here, doesn’t make any assumptions as to what a scaffold should look like.  Instead, through exhaustive enumeration, we defer the scaffold recognition task to the user.  We feel this approach is reasonable since, after all, even medicinal chemists don’t often agree on the definition of “scaffold.”

The prototype is available here.  Below is a quick look at it.

Since this prototype was put together quickly, its user interface is a bit clunky but hopefully still functional.  This is the first time it’s been exposed outside of our group, so we’d be very interested in comments and feedback on how we can make it better.

Categories and entity grouping

Over the past couple of weeks we’ve been making steady progress.  Since our first priority is to provide a usable tool for practicing scientists, it’s important that we take the time to make a good first impression (after all, the tolerance level of our intended users is likely to be lower than that of users of a traditional open-source project).  To put things in perspective, we had a working prototype back in 2008, but because it was not deemed to be user-friendly it never got released.  Tripod is a complete rewrite based on our experience developing the prototype.

Our effort over much of the past few weeks has been to determine the right level of categories that are useful for grouping entities.  Take the Compound entity as an example.  Here we can group compounds based on a number of categories (e.g., by molecular framework, target, mode of action, etc.).  A laundry list of categories certainly would not fit well within the iTunes group/album view.  So instead we organize the categories into “themes.”  Here are examples of the categories (Topology and Framework) for the molecular theme.
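The theme/category organization can be pictured as a two-level lookup followed by a grouping step.  The sketch below is only an illustration of that idea, not Tripod’s data model: the `THEMES` layout, the `group_entities` function, and the compound records with their labels are all made up for the example.

```python
from collections import defaultdict

# Hypothetical layout: theme -> category -> function that maps an entity to a group label.
THEMES = {
    "molecular": {
        "Topology": lambda e: e["topology"],
        "Framework": lambda e: e["framework"],
    },
}

# Made-up compound records, for illustration only.
COMPOUNDS = [
    {"id": "cmpd-1", "topology": "monocyclic", "framework": "benzene"},
    {"id": "cmpd-2", "topology": "bicyclic", "framework": "indole"},
    {"id": "cmpd-3", "topology": "monocyclic", "framework": "pyridine"},
    {"id": "cmpd-4", "topology": "bicyclic", "framework": "indole"},
]


def group_entities(entities, theme, category):
    """Group entity ids by the label that the chosen category assigns to each entity."""
    label_of = THEMES[theme][category]
    groups = defaultdict(list)
    for entity in entities:
        groups[label_of(entity)].append(entity["id"])
    return dict(groups)
```

Calling `group_entities(COMPOUNDS, "molecular", "Framework")` yields one group per framework label, which is the kind of grouping a group/album style view could then display; adding a new theme or category only means adding an entry to the lookup.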

We have by no means enumerated all useful themes and categories for each entity and would welcome comments and feedback.

Welcome…

The purpose of this site is to allow us to provide updates on the latest development of Tripod, our iTunes-inspired tool for browsing and managing biological and chemical data.  We are currently working hard toward an alpha version.  Please feel free to subscribe to the RSS feed to receive progress updates.
