
iSAR

Contact: Stefan Kramer

Categories: Prediction

Exposed methods:

predict
Input: Instances, chemical substructure feature vectors, class values
Output: Classification model (actually training instances are stored; lazy learning method)
Input format: lazySAR internal format (.ibf, .fbi, .count)
Output format: Program specific plain text (.result, .info)
User-specified parameters: The user can choose to use all features, to impose an upper limit on the number of features, or to use only closed features. Furthermore, the user can choose the number of non-occurring substructures to add as features to the test instance's feature set; only the most significant non-occurring features (with respect to the class) are added. The significance measure can be chosen among Chi Square, Cole, G-index, and Information Gain.
Reporting information: (Combined) prediction for each instance, overall statistics (confusion matrix)
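The significance-based selection of non-occurring substructures described above can be sketched as follows. This is a minimal Python illustration, not part of the Perl implementation: Information Gain is one of the four selectable measures, and the count-based interface (per-feature contingency counts) is an assumption made for the sketch.

```python
from math import log2

def entropy(pos, neg):
    """Binary class entropy for pos/neg instance counts."""
    total = pos + neg
    if total == 0:
        return 0.0
    h = 0.0
    for c in (pos, neg):
        if c:
            p = c / total
            h -= p * log2(p)
    return h

def information_gain(pos_with, neg_with, pos_without, neg_without):
    """Information gain of splitting the training set on one substructure feature.
    Arguments are class counts among compounds that do / do not contain the feature."""
    n_with = pos_with + neg_with
    n_without = pos_without + neg_without
    n = n_with + n_without
    h_before = entropy(pos_with + pos_without, neg_with + neg_without)
    h_after = (n_with / n) * entropy(pos_with, neg_with) \
            + (n_without / n) * entropy(pos_without, neg_without)
    return h_before - h_after

def top_significant(features, counts, k):
    """Rank candidate non-occurring substructures by significance and keep the top k.
    counts maps feature name -> (pos_with, neg_with, pos_without, neg_without)."""
    ranked = sorted(features, key=lambda f: information_gain(*counts[f]), reverse=True)
    return ranked[:k]
```

A feature that perfectly separates the classes scores 1.0 on a balanced set; a feature independent of the class scores 0.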

Description:

iSAR (instance-based structure-activity relationships) is an implementation of a lazy SAR algorithm. In lazy SARs, classifications are tailored to each individual test compound, so the structure of the test compound can be exploited to the fullest. iSAR uses subgraphs and paths generated by, e.g., gSpan' [JK05], or unrooted trees generated by, e.g., Free Tree Miner [RUE04], as features for the classification task. These substructures are derived from a test compound to determine similar structures. To obtain a well-balanced and representative set of structural descriptors, this set can be enriched with strongly activating or deactivating fragments from the training set, and redundant fragments can subsequently be removed (by using only closed features). Finally, a k-nearest-neighbor classification is performed for a single k or for several values of k, and a vote is taken among the resulting predictions. Validation is performed via leave-one-out cross-validation (LOOCV).
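The lazy classification step can be sketched roughly as follows. This is Python for illustration only (the tool itself is written in Perl), and the Tanimoto similarity over substructure feature sets is an assumed choice of similarity measure, not taken from the publication.

```python
from collections import Counter

def tanimoto(a, b):
    """Tanimoto similarity of two substructure feature sets (assumed measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def knn_predict(test_features, training, k):
    """Predict the majority class among the k most similar training compounds.
    training is a list of (feature_set, class_label) pairs."""
    neighbours = sorted(training,
                        key=lambda t: tanimoto(test_features, t[0]),
                        reverse=True)[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

def voted_prediction(test_features, training, ks=(1, 3, 5)):
    """Build one kNN prediction per value of k and take a majority vote over them."""
    preds = [knn_predict(test_features, training, k) for k in ks]
    return Counter(preds).most_common(1)[0][0]
```

Because the neighbours (and hence the effective model) are computed per test compound, nothing is trained in advance beyond storing the training instances, which is what the "lazy" in lazy SAR refers to.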
iSAR is implemented in the Perl programming language. The iSAR software depends on a substructural feature generator, e.g., gSpan' or Free Tree Miner (FTM), as well as on JOELIB [JOELIB] and Weka [WIT99]. iSAR provides no graphical user interface and is executed via the command line. The accepted input format is an internal iSAR format; Perl scripts that convert the output of FTM or gSpan' to this format are provided. iSAR's output is program-specific plain text.
For further information, we refer to the original publication [SOM07] and the website
http://wwwkramer.in.tum.de/research/pubs/articlereference.2008-03-17.2708343675

Background (publication date, popularity/level of familiarity, rationale of approach, further comments)
Published 2007. Follows the concept of lazy instance-based learning; similar to lazar. Extends simple instance-based learners by three techniques: enrichment (use of strongly activating or deactivating fragments from the training set), redundancy removal (use of only closed features), and voting (building several kNN classifiers and voting among their predictions). A useful tool for SAR datasets with congeneric and non-congeneric compounds.
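The redundancy-removal idea (closed features) can be illustrated with a simplified sketch. Properly, a fragment is closed if no super-fragment occurs in exactly the same training compounds; the Python below (for illustration only) approximates the substructure relation by grouping fragments with identical occurrence sets and keeping the longest fragment string per group, which is an assumption of this sketch rather than the tool's actual procedure.

```python
def closed_features(occurrences):
    """occurrences maps fragment -> frozenset of ids of compounds containing it.
    Per distinct occurrence set, keep only the largest fragment; smaller
    fragments with the same occurrence set carry no extra discriminative
    information for classification."""
    by_support = {}
    for frag, occ in occurrences.items():
        best = by_support.get(occ)
        if best is None or len(frag) > len(best):
            by_support[occ] = frag
    return set(by_support.values())
```

For example, if 'C-C' occurs in exactly the same compounds as its super-fragment 'C-C-O', only 'C-C-O' is kept.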

Bias (instance-selection bias, feature-selection bias, combined instance-selection/feature-selection bias, independence assumptions?, ...)
Instance-selection bias

Lazy learning/eager learning
Lazy learning

Interpretability of models (black box model?, ...)
Good (kNN classifier)

Type of Descriptor:

Interfaces: Standalone application

Priority: Low

Development status:

Homepage: http://wwwkramer.in.tum.de/research/pubs/articlereference.2008-03-17.2708343675

Dependencies:
OpenTox components: FreeTreeMiner, gSpan'
External components: WEKA, JOELib


Technical details

Data: No

Software: Yes

Programming language(s): Perl

Operating system(s): Linux

Input format: lazySAR internal format (.ibf, .fbi, .count)

Output format: .result, .info (plain text files)

License:


References

References:
[SOM07] Sommer, S. and Kramer, S. (2007). Three Data Mining Techniques to Improve Lazy Structure-Activity Relationships for Non-Congeneric Compounds. Journal of Chemical Information and Modeling 47(6):2035-2043.
[JK05] Jahn, K. and Kramer, S. (2005). Optimizing gSpan for Molecular Datasets. In: Proceedings of the Third International Workshop on Mining Graphs, Trees and Sequences (MGTS-2005).
[RUE04] Rückert, U. and Kramer, S. (2004). Frequent Free Tree Discovery in Graph Data. In: SAC '04: Proceedings of the 2004 ACM Symposium on Applied Computing, pp. 564-570. New York, NY, USA: ACM Press.
[JOELIB] http://www.ra.cs.uni-tuebingen.de/software/joelib/index.html
[WIT99] Witten, I.H. and Frank, E. (1999). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.
