Parallel Structural Graph Clustering

Madeleine Seeland, Technical University Munich, Germany
Simon A. Berger, Heidelberg Institute for Theoretical Studies, Germany
Tobias Girschick, Technical University Munich, Germany
Alexandros Stamatakis, Heidelberg Institute for Theoretical Studies, Germany
Stefan Kramer, Technical University Munich, Germany

The goal of clustering a graph database of molecular structures is to identify groups of similar structures, such that intra-group similarity is high and inter-group similarity is low. This can serve to structure the chemical space and to improve the understanding of the data.

In previous work we proposed a graph-based clustering approach based on the frequent graph mining algorithm gSpan. In the proposed approach, clusters encompass all molecules that share a sufficiently large common substructure. The size of the common substructure of a compound in a cluster has to take at least a user-specified fraction of its overall size. The algorithm processes the instances in one defined order, one after the other, and produces overlapping (non-disjoint) and non-exhaustive clusters. Several experiments were designed to evaluate the effectiveness and efficiency of the structural clustering algorithm on various real-world data sets of molecular graphs. We showed that the approach is able to rediscover known structure classes in the NCI standard anti-cancer agents. Moreover a baseline comparison with a PubChem Tanimoto fingerprint-based clustering was presented.

In recent work, we addressed the problem of clustering large graph databases of molecular structures. We parallelized the structural clustering algorithm to take advantage of high-performance parallel hardware and further improve the algorithm in three ways: a refined cluster membership test based on a set abstraction of graphs, sorting graphs according to size, to avoid cluster membership tests in the first place, and the definition of a cluster representative once the cluster scaffold is unique, to avoid cluster comparisons with all cluster members. In experiments on a large database of chemical structures, we showed that running times can be reduced by a large factor for one parameter setting used in previous work. For harder parameter settings, it was possible to obtain results within reasonable time for 300,000 structures, compared to 10,000 structures in previous work. This shows that structural, scaffold-based clustering of smaller libraries for virtual screening is already feasible.

Sections

Parallel Structural Graph Clustering

Document Actions