Elaborate Graph Mining: Exploiting Structural Invariants and Latent Information in Graph Databases to predict REACH-relevant Endpoints.

Andreas Maunz, Institute for Physics, Freiburg University, Germany

In computational chemistry, Frequent Subgraph Mining has been widely applied to databases of pharmacological compounds to identify functional groups for drug design or hazard detection. However, the result set is typically much too large to be of actual use to most statistical learners, let alone human experts. Moreover, many very similar fragments are retrieved this way.

I will re-frame the frequent subgraph mining problem as a selection task in a modified hypothesis space. In this learning setting, two methods are discussed that a) reduce the result set of substructures by structural compression and correlation to the endpoint under investigation and b) reduce the result set by extracting latent (hidden) motifs from the complete set of substructures, incorporating correlation as well. The key results include: Tremendous speedup in computation, very high compression while retaining good coverage of the database, predictive accuracy using the obtained descriptors on par (a) or significantly better (b) than the complete set of subgraphs.

The first method, called Backbone Refinement Class Mining (BBRC), has high compression potential and handles large chemical datasets very effectively. It partitions the complete search space of subgraphs (chemical fragments) by maintaining a structural invariant (the backbone) and selects just one representative (the most significant one) from inside each partition. This modified hypothesis space has been shown to yield a robust collection of structurally diverse descriptors which cover the compounds very well. For example, datasets of > 20,000 compounds can be processed in just a few minutes using this approach.

The second method, termed Latent Structure Pattern Mining (LAST-PM), extracts latent (hidden) information from the set of frequent and correlated subgraphs. In contrast to other methods however, it is designed to be integrated into the actual graph mining step, not as a separate post-processing step, making it much more efficient. The engine produces elaborate patterns, integrating structural ambiguities of various size into the patterns, which are found by aligning and stacking subgraphs, followed by a spectral analysis step applied to this weighted graph. The resulting descriptors have been used in computational models for complex biological endpoints (bioavailability, blood-brain-barrier) and have been compared favorably to or could improve on highly optimized physico-chemical descriptors. Many of the descriptors are readily interpretable by experts.

Several classification models for mutagenicity and carcinogenicity have been produced for REACH-relevant endpoints with BBRC and LAST-PM, using the Lazar framework (regression models for Fish Toxicity are currently being produced and improved). Predictions for these endpoints can be derived using the well-defined interface of the OpenTox framework. The models are also available in the web applications ToxCreate and ToxPredict.

A particularly important aspect for REACH is that, in contrast to most models, Lazar routinely provides an estimation of applicability domain (AD) based on structural similarity to the neighbors, re-using the same proposed descriptors as for the prediction itself. Small sets of descriptors are most profitable in this setting, allowing for efficient model derivation, which is often identified as performance bottleneck in nearest-neighbor predictions, where models have to be derived dynamically at prediction time. Several SVM kernel formulations are currently available as learning algorithms.

Model validation for the REACH-relevant endpoints, derived by ten-fold crossvalidation, shows a clear dependency between prediction quality and AD estimation, thus allowing to tighten the AD bound for an optimal tradeoff between sensitivity and specificity. This highlights the central role of the proposed descriptors, being directly involved in similarity search, AD estimation and predictions. Moreover, those inner functions are particularly easy accessible in Lazar, being characterized only by a small quantity of descriptors, which makes models intuitive and interpretable, a central demand in REACH.

The proposed subgraph mining methods (BBRC and LAST-PM) are publicly available as stand-alome web services within the OpenTox Framework. The algorithms are released as open source libraries for several technological platforms. Both methods employ the popular SMILES / SMARTS notation to characterize compounds and fragments, respectively, which makes them applicable to filtering and screening workflows. More information is available on http://cs.maunz.de.

Sections

Elaborate Graph Mining: Exploiting Structural Invariants and Latent Information in Graph Databases to predict REACH-relevant Endpoints.

Document Actions