FreeTreeMiner
Contact: Stefan Kramer
Categories: Descriptor calculation
Exposed methods:
ftm | |
---|---|
Input: | 2D chemical structure information |
Output: | Frequent substructures |
Input format: | SD file (MDL Mol) |
Output format: | Program-specific text files and/or Weka's ARFF format |
User-specified parameters: | Minimum support |
Reporting information: | Frequent free trees (SMARTS) with occurrence maps, border elements |
Description:
The FreeTreeMiner (FTM) software [RUE04] computes all acyclic substructures (in mathematical terms: free or
unrooted trees) occurring at a given minimum frequency in a set of molecules. The substructures are computed
by a depth-first search. In addition to the minimum frequency (support) constraint, a maximum frequency constraint can
be set. This constraint can refer either to the same database/set or to a second one; in the latter case, FTM returns all
substructures that are frequent in the first set and infrequent in the second. The frequent
substructures are returned as SMARTS strings together with their occurrences in the given set of structures.
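The interplay of the two frequency constraints can be illustrated with a minimal Python sketch. It is not FTM's mining algorithm and does not use FTM's file formats: it assumes the occurrence maps are already available as dictionaries from a SMARTS string to the set of molecule indices containing it, and it assumes relative frequencies as thresholds. The function names and the toy patterns are hypothetical.

```python
from typing import Dict, List, Set


def support(occurrences: Set[int], n_molecules: int) -> float:
    """Fraction of molecules in a set that contain the pattern."""
    return len(occurrences) / n_molecules


def filter_patterns(
    occ_a: Dict[str, Set[int]], n_a: int,
    occ_b: Dict[str, Set[int]], n_b: int,
    min_support: float, max_support: float,
) -> List[str]:
    """Keep patterns frequent in set A and infrequent in set B.

    Assumption: thresholds are relative frequencies in [0, 1].
    """
    selected = []
    for smarts, hits_a in occ_a.items():
        # A pattern missing from B's map simply has no occurrences there.
        hits_b = occ_b.get(smarts, set())
        frequent_in_a = support(hits_a, n_a) >= min_support
        infrequent_in_b = support(hits_b, n_b) <= max_support
        if frequent_in_a and infrequent_in_b:
            selected.append(smarts)
    return sorted(selected)


# Toy occurrence maps (hypothetical data): pattern -> molecule indices.
set_a = {"C-C-O": {0, 1, 2}, "C-N": {0, 1}, "C-C-C": {0, 1, 2, 3}}
set_b = {"C-C-O": {0}, "C-N": {0, 1, 2}, "C-C-C": {0, 1, 2, 3}}

result = filter_patterns(set_a, 4, set_b, 4, min_support=0.5, max_support=0.25)
print(result)  # ['C-C-O']
```

Only `C-C-O` passes both constraints here: it occurs in 3 of 4 molecules of the first set and in 1 of 4 of the second.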
The software is implemented in C++ and was developed for the Linux and Mac OS
X operating systems. FTM depends on the open source chemistry toolbox OpenBabel
(http://www.openbabel.org). FTM itself provides no graphical user interface (GUI) and is executed from the
command line. The input format accepted by FTM is the widely used MDL SD file (SDF), which bundles one or more MDL Molfile
records (specification URL: http://www.mdl.com/downloads/public/ctfile/ctfile.jsp). FTM's output formats are
program-specific plain text files and/or Weka's [WIT99] ARFF format. For further information, we refer to the
original publication [RUE04] and the website
http://wwwkramer.in.tum.de/research/data_mining/pattern_mining/graph_mining
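To give an impression of what substructure occurrences look like in ARFF, the following Python sketch writes one binary attribute per pattern and one data row per molecule. The exact layout of FTM's ARFF output is not documented here, so the attribute naming, the `{0,1}` encoding, and the `to_arff` helper are assumptions made for illustration only.

```python
from typing import Dict, Set


def to_arff(relation: str, occurrences: Dict[str, Set[int]], n_molecules: int) -> str:
    """Render occurrence maps as a binary ARFF table (illustrative layout)."""
    patterns = sorted(occurrences)
    lines = [f"@relation {relation}", ""]
    for smarts in patterns:
        # Quote the attribute name, since SMARTS can contain characters
        # that are special in ARFF. Assumes the pattern has no single quote.
        lines.append(f"@attribute '{smarts}' {{0,1}}")
    lines += ["", "@data"]
    for mol in range(n_molecules):
        # 1 if the molecule contains the pattern, 0 otherwise.
        row = ["1" if mol in occurrences[p] else "0" for p in patterns]
        lines.append(",".join(row))
    return "\n".join(lines) + "\n"


# Hypothetical occurrence maps for three molecules.
arff = to_arff("ftm", {"C-C-O": {0, 2}, "C-N": {1}}, 3)
print(arff)
```

A file of this shape can be loaded by Weka as a data set with one nominal attribute per substructure.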
Background (publication date, popularity/level of familiarity, rationale of approach, further comments)
Published in 2004. FTM is a further development of the MolFea approach for acyclic substructures. Acyclic substructures were chosen because they still allow advanced computations such as the calculation of borders. On typical structure databases, the number of frequent acyclic substructures is not much smaller than the number of frequent unconstrained (i.e., also including cyclic) substructures.
Type of Descriptor:
Substructural descriptors (acyclic substructures). Currently, no wildcards or other
more advanced features of the SMARTS language are used. The results can be used in all
fingerprint-based similarity and distance measures.
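As one example of a fingerprint-based measure, the Tanimoto (Jaccard) coefficient can be computed directly on the sets of substructures found in two molecules. The sketch below is a generic illustration, not part of FTM; the `tanimoto` function and the example patterns are hypothetical.

```python
from typing import Iterable


def tanimoto(patterns_a: Iterable[str], patterns_b: Iterable[str]) -> float:
    """Tanimoto coefficient of two molecules given as sets of substructures."""
    a, b = set(patterns_a), set(patterns_b)
    union = a | b
    if not union:
        # Convention chosen here: two empty fingerprints have similarity 0.
        return 0.0
    return len(a & b) / len(union)


# Substructures present in two hypothetical molecules.
mol_1 = {"C-C-O", "C-N"}
mol_2 = {"C-N", "C-C-C"}

similarity = tanimoto(mol_1, mol_2)  # 1 shared pattern out of 3 distinct ones
distance = 1.0 - similarity
print(similarity, distance)
```

The corresponding distance is taken as one minus the similarity, which is a common choice for binary fingerprints.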
Interfaces: Standalone application
Priority: High
Development status:
Homepage: http://wwwkramer.in.tum.de/research/data_mining/pattern_mining/graph_mining
Dependencies:
External components: OpenBabel
Technical details
Data: No
Software: Yes
Programming language(s): C++
Operating system(s): Linux, Windows
Input format: SDF
Output format: txt, ARFF
License: GPL
References
[RUE04] Rückert, U. and Kramer, S., Frequent Free Tree Discovery in Graph Data, in: SAC '04: Proceedings of the 2004 ACM Symposium on Applied Computing, pp. 564-570 (New York, NY, USA: ACM Press, 2004).
[WIT99] Witten, I.H. and Frank, E., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations (Morgan Kaufmann, 1999).