gSpan'

Categories: Descriptor calculation

Exposed methods:

gSpan
Input:	2D chemical structure information
Output:	Frequent substructures
Input format:	gSpan' specific; [SDF with existing converters]
Output format:	DFScode file, relabel.txt file (plain text files)
User-specified parameters:	- Restriction choices for fragments - Minimum support
Reporting information:	DFS codes for each frequent linear/acyclic fragment, (number of) instances that posses the fragment

Description:

The gSpan' algorithm [JK05] implements two optimizations of the widely known gSpan algorithm [HAN02] for
mining molecular databases. Both optimizations apply to the enumeration of subgraph occurrences in a graph
database, which is, also according to our profiling, the most expensive operation of gSpan. The first
optimization reduces the number of subgraph isomorphisms that need to be accessed for proper support
computation in considering the symmetries inherent in many chemical molecules, and the second speeds up
subgraph isomorphism tests by making use of the non-uniform frequency distribution of atom and bond
types.
The software is implemented in the programming language C and was developed for the Linux operating
system. The gSpan‟ implementation has no dependencies on other software packages. The gSpan algorithm
has a specific input format, but we already have conversion scripts for the widely used MDL Molfiles
(sometimes called SD file or SDF; specification URL: http://www.mdl.com/downloads/public/ctfile/ctfile.jsp)
available at TUM. There exists no graphical user interface (GUI) and the program is executed via the command
line. gSpan‟ s output consists of program specific plain text files.
For further information, we refer to the original publication [JK05] and the website
http://wwwkramer.in.tum.de/research/data_mining/pattern_mining/graph_mining

Background (publication date, popularity/level of familiarity, rationale of approach, further comments)
Published in 2005. An optimization of the gSpan algorithm for molecular graphs.

Type of Descriptor:

Substructural descriptors, currently no wildcards used or other more advanced
features of the SMARTS language, results can be used in all fingerprint-based
similarity and distance measures. The user can restrict the search to acyclic and/or
linear fragments and/or fragments with a maximum number of edges (bonds).

Interfaces: Standalone application

Priority: Medium

Development status:

Homepage: http://wwwkramer.in.tum.de/research/data_mining/pattern_mining/graph_mining

Dependencies:

Technical details

Data: No

Software: Yes

Programming language(s): C

Operating system(s): Linux; Windows planned

Input format: gsp (gSpan format)

Output format: DFScode file, relabel.txt file (plain text files)

License: GPL

References

References:
[JK05] Jahn, K. and Kramer, S. (2005). Optimizing gSpan for Molecular Datasets In: Proceedings of the Third International Workshop on Mining Graphs, Trees and Sequences (MGTS-2005).
[HAN02] X. Yan, J. Han: gSpan: Graph-based substructure pattern mining. Proc. 2nd IEEE Int. Conf. on Data Mining (ICDM 2002, Maebashi, Japan), 721-724. IEEE Press, Piscataway, NJ, USA 2002.

Sections

gSpan'

Technical details

References

Document Actions