gSpan'
Contact: Stefan Kramer
Categories: Descriptor calculation
Exposed methods:
gSpan | |
---|---|
Input: | 2D chemical structure information |
Output: | Frequent substructures |
Input format: | gSpan' specific; [SDF with existing converters] |
Output format: | DFScode file, relabel.txt file (plain text files) |
User-specified parameters: | - Restriction choices for fragments - Minimum support |
Reporting information: | DFS codes for each frequent linear/acyclic fragment, (number of) instances that posses the fragment |
Description:
The gSpan' algorithm [JK05] implements two optimizations of the widely known gSpan algorithm [HAN02] for
mining molecular databases. Both optimizations apply to the enumeration of subgraph occurrences in a graph
database, which is, also according to our profiling, the most expensive operation of gSpan. The first
optimization reduces the number of subgraph isomorphisms that need to be accessed for proper support
computation in considering the symmetries inherent in many chemical molecules, and the second speeds up
subgraph isomorphism tests by making use of the non-uniform frequency distribution of atom and bond
types.
The software is implemented in the programming language C and was developed for the Linux operating
system. The gSpan‟ implementation has no dependencies on other software packages. The gSpan algorithm
has a specific input format, but we already have conversion scripts for the widely used MDL Molfiles
(sometimes called SD file or SDF; specification URL: http://www.mdl.com/downloads/public/ctfile/ctfile.jsp)
available at TUM. There exists no graphical user interface (GUI) and the program is executed via the command
line. gSpan‟ s output consists of program specific plain text files.
For further information, we refer to the original publication [JK05] and the website
http://wwwkramer.in.tum.de/research/data_mining/pattern_mining/graph_mining
Background (publication date, popularity/level of familiarity, rationale of approach, further comments)
Published in 2005. An optimization of the gSpan algorithm for molecular graphs.
Type of Descriptor:
Substructural descriptors, currently no wildcards used or other more advanced
features of the SMARTS language, results can be used in all fingerprint-based
similarity and distance measures. The user can restrict the search to acyclic and/or
linear fragments and/or fragments with a maximum number of edges (bonds).
Interfaces: Standalone application
Priority: Medium
Development status:
Homepage: http://wwwkramer.in.tum.de/research/data_mining/pattern_mining/graph_mining
Dependencies:
Technical details
Data: No
Software: Yes
Programming language(s): C
Operating system(s): Linux; Windows planned
Input format: gsp (gSpan format)
Output format: DFScode file, relabel.txt file (plain text files)
License: GPL
References
References:
[JK05] Jahn, K. and Kramer, S. (2005). Optimizing gSpan for Molecular Datasets In: Proceedings of the Third International Workshop on Mining Graphs, Trees and Sequences (MGTS-2005).
[HAN02] X. Yan, J. Han: gSpan: Graph-based substructure pattern mining. Proc. 2nd IEEE Int. Conf. on Data Mining (ICDM 2002, Maebashi, Japan), 721-724. IEEE Press, Piscataway, NJ, USA 2002.