
Prediction

This section provides a description of current OpenTox Prediction components.

Prediction Components

Fuzzy Means

Fuzzy-means is a training method for Radial Basis Function (RBF) neural networks and is based on the fuzzy
partition of the input space, which is produced by defining a number of triangular fuzzy sets in the domain of
each input variable. The centers of these fuzzy sets form a multidimensional grid on the input space. A
rigorous selection algorithm chooses the most appropriate vertices on the grid, which are then used as the
hidden node centers in the resulting RBF network model. The so-called “fuzzy-means” training method does
not require the number of centers to be fixed before the method is executed. Because it is a one-pass
algorithm, it is extremely fast, even in the case of a large database of input-output training data. The
method was originally developed for solving nonlinear regression problems. A variant of the method for solving
classification problems has also been developed.
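The grid construction described above can be sketched in Python; note that the vertex-selection rule here (keep any vertex that is the nearest one to some training point) is a deliberately simplified stand-in for the rigorous selection algorithm, and the number of fuzzy sets is an arbitrary illustrative choice:

```python
import numpy as np

def candidate_centers(X, n_sets=3):
    """Place n_sets evenly spaced triangular-fuzzy-set centers per input
    dimension and form the multidimensional grid of their combinations."""
    axes = [np.linspace(X[:, j].min(), X[:, j].max(), n_sets)
            for j in range(X.shape[1])]
    return np.stack(np.meshgrid(*axes, indexing="ij"), -1).reshape(-1, X.shape[1])

def select_centers(X, grid):
    """Keep only grid vertices that are the nearest vertex to at least one
    training point (a simplified stand-in for the fuzzy-means selection rule)."""
    nearest = np.argmin(np.linalg.norm(X[:, None] - grid[None], axis=2), axis=1)
    return grid[np.unique(nearest)]

X = np.array([[0.0, 0.0], [0.1, 0.1], [1.0, 1.0]])
centers = select_centers(X, candidate_centers(X))  # surviving RBF centers
```

Because only occupied regions of the grid contribute centers, the number of hidden nodes adapts to the data rather than being fixed in advance.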

Gaussian Processes for Regression

GPR (Gaussian Processes for Regression) is a supervised learning method. A Gaussian process is a generalization
of the Gaussian probability distribution. Whereas a probability distribution describes random variables which
are scalars or vectors (for multivariate distributions), a stochastic process governs the properties of functions.
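A minimal NumPy sketch of the GP regression posterior mean with a squared-exponential covariance (the length scale and noise level are illustrative assumptions, not values any OpenTox service uses):

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance between two sets of 1-D inputs."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior_mean(x_train, y_train, x_test, noise=1e-6):
    """Posterior mean of a zero-mean GP conditioned on the training data."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_test, x_train)
    return K_s @ np.linalg.solve(K, y_train)

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.sin(x)                                  # toy targets
pred = gp_posterior_mean(x, y, np.array([1.5]))
```

The distribution over functions is encoded entirely in the kernel: nearby inputs get strongly correlated outputs, so the posterior mean interpolates the training data smoothly.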

iSAR

iSAR (instance-based structure-activity relationships) is an implementation of a lazy SAR algorithm. In lazy
SARs, classifications are particularly tailored for each test compound. Therefore, it is possible to make the most
of the structure of a test compound. iSAR uses subgraphs and paths generated by, e.g., gSpan, or unrooted trees generated by, e.g., Free Tree Miner, as features for the classification task.

J48

J48 implements Quinlan's C4.5 algorithm for generating a pruned or unpruned C4.5 decision
tree. C4.5 is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by J48 can be used
for classification. J48 builds decision trees from a set of labeled training data using the concept of information
entropy. It uses the fact that each attribute of the data can be used to make a decision by splitting the data into
smaller subsets. J48 examines the normalized information gain (difference in entropy) that results from
choosing an attribute for splitting the data. To make the decision, the attribute with the highest normalized
information gain is used. The algorithm then recurs on the smaller subsets. The splitting procedure stops when all
instances in a subset belong to the same class; a leaf node is then created in the decision tree that selects
that class. It can also happen that none of the features gives any information gain; in this case, J48
creates a decision node higher up in the tree using the expected value of the class.
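The splitting criterion can be sketched in a few lines of Python (a minimal illustration of entropy and information gain, not Weka's actual J48 implementation; labels are made up):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction achieved by splitting `labels` into `groups`
    according to some attribute's values."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

# A binary attribute that perfectly separates the two classes:
labels = ["toxic", "toxic", "safe", "safe"]
gain = information_gain(labels, [["toxic", "toxic"], ["safe", "safe"]])
```

A perfect split removes all uncertainty, so the gain equals the original entropy of one bit; J48 would pick this attribute over any that leaves the subsets mixed.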

kNN

The k-nearest neighbors algorithm (kNN) is a method for classifying objects based on closest training
examples in the feature space. It is a type of instance-based learning, or lazy learning, where the function is
only approximated locally and all computation is delayed until classification. A majority vote of an object's
neighbors is used for classification, with the object being assigned to the class most common amongst its k
(positive integer, typically small) nearest neighbors. If k is set to 1, then the object is simply assigned to the
class of its nearest neighbor. The kNN algorithm can also be applied to regression in the same way, by simply
assigning the property value for the object to be the average of the values of its k nearest neighbors.
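Both uses of kNN described above can be sketched with a brute-force search (a minimal illustration; the data and labels are made up):

```python
import math
from collections import Counter

def knn_predict(train, query, k, mode="classify"):
    """Predict for `query` from its k nearest points in `train`,
    where `train` is a list of (feature_vector, target) pairs."""
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    targets = [t for _, t in neighbors]
    if mode == "classify":
        return Counter(targets).most_common(1)[0][0]  # majority vote
    return sum(targets) / k                           # average of neighbor values

train = [([0.0, 0.0], "inactive"), ([0.1, 0.2], "inactive"), ([1.0, 1.0], "active")]
label = knn_predict(train, [0.05, 0.1], k=3)
```

Nothing is learned at training time; the entire cost is paid at prediction, which is exactly the "lazy" behavior described above.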

lazar

Lazar is a k-nearest-neighbor approach to predict chemical endpoints from a training set based on structural
fragments. It uses a SMILES file and precomputed fragments with occurrences as well as target class
information for each compound as training input. It also supports regression, in which case the target activities
are continuous values. Lazar uses activity-specific similarity (i.e., each fragment contributes according to its
significance for the target activity) as the basis for predictions, and computes a confidence index for every
single prediction.
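The activity-specific similarity idea can be sketched like this; the fragments and significance weights below are entirely hypothetical, and lazar's actual measure differs in detail:

```python
def weighted_tanimoto(frags_a, frags_b, significance):
    """Tanimoto similarity over structural fragments, with each fragment
    weighted by its (hypothetical) significance for the target activity."""
    inter = sum(significance[f] for f in frags_a & frags_b)
    union = sum(significance[f] for f in frags_a | frags_b)
    return inter / union if union else 0.0

# Hypothetical fragment significances for some endpoint:
significance = {"c1ccccc1": 0.9, "C=O": 0.5, "CCO": 0.1}
sim = weighted_tanimoto({"c1ccccc1", "CCO"}, {"c1ccccc1", "C=O"}, significance)
```

Because weights depend on the endpoint, two compounds can be "similar" for one activity and dissimilar for another, which is the point of activity-specific similarity.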

lazar web interface

A Ruby-based web interface for lazar.

M5P

M5P is a reconstruction of Quinlan's M5 algorithm for inducing trees of regression models.
M5P combines a conventional decision tree with the possibility of linear regression functions at the nodes.
First, a decision-tree induction algorithm is used to build a tree, but instead of maximizing the information
gain at each inner node, a splitting criterion is used that minimizes the intra-subset variation in the class
values down each branch. The splitting procedure in M5P stops if the class values of all instances that reach a
node vary only slightly, or when only a few instances remain.
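The variance-minimizing split can be sketched as follows (a simplified standard-deviation-reduction criterion over one attribute; M5P's actual implementation adds model trees, smoothing, and pruning):

```python
import numpy as np

def best_split(x, y):
    """Pick the threshold on attribute x that most reduces the standard
    deviation of the target values y."""
    best = (None, -1.0)
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        # standard deviation reduction: parent spread minus weighted child spread
        sdr = y.std() - (len(left) * left.std() + len(right) * right.std()) / len(y)
        if sdr > best[1]:
            best = (t, sdr)
    return best

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([1.1, 0.9, 1.0, 5.2, 4.9, 5.1])
threshold, sdr = best_split(x, y)
```

The chosen threshold separates the two clusters of target values, leaving each branch nearly constant, which is exactly the stopping condition described above.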

MaxTox

MaxTox compares the query molecule to each cluster (EP-based) and finds a maximum common substructure (MCS)
score with respect to the molecules of each cluster. The MCS score(s) are then used in a machine learning
algorithm to generate predictive models. The software is primarily implemented in Java for Linux-based systems.
MaxTox depends on the open source Chemistry Development Kit (CDK) and OpenBabel.

Multiple Linear Regression

MLR (Multiple Linear Regression) is a simple and popular statistical technique that uses several explanatory
(independent) variables to predict the outcome of a response (dependent) variable. The model creates a
relationship in the form of a straight line (linear) that best approximates all the individual data points.
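A minimal NumPy illustration of fitting such a linear relationship by ordinary least squares (the data are made up and noise-free for the example):

```python
import numpy as np

# Two explanatory variables and a response generated as y = 2 + 1*x1 + 3*x2:
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = 2.0 + 1.0 * X[:, 0] + 3.0 * X[:, 1]

A = np.column_stack([np.ones(len(X)), X])     # prepend an intercept column
coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # [intercept, b1, b2]
```

Least squares picks the coefficients that minimize the summed squared distance between the fitted hyperplane and the individual data points.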

makeSCR - Self-consistent Regression

Delphi implementation of a self-consistent regression (SCR) algorithm. Using self-consistent regression one can
obtain the best QSAR/QSPR model for the training set with a large number of descriptors. SCR is based on
a regularized least-squares method.

Partial-least Squares Regression

One way to understand Partial-least squares regression (PLS) is that it simultaneously projects the x and y
variables onto the same subspace in such a way that there is a good relationship between the predictor and
response data. Another way to see PLS is that it forms “new” x variables as linear combinations of the old ones,
and subsequently uses these new linear combinations as predictors of y. Hence, as opposed to MLR, PLS can handle correlated variables, which may be noisy and possibly incomplete.
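The "new x variables" view can be sketched with one round of a NIPALS-style computation (a single-component sketch under simplifying assumptions, not a full PLS implementation):

```python
import numpy as np

def pls1_component(X, y):
    """Extract the first PLS component: a weight vector w such that the
    score t = X @ w has maximal covariance with y."""
    w = X.T @ y
    w /= np.linalg.norm(w)    # weights: direction of maximal covariance with y
    t = X @ w                 # scores: the "new" x variable
    b = (t @ y) / (t @ t)     # regress y on the score
    return w, t, b

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, 0.5, 0.0])             # toy linear response
w, t, b = pls1_component(X - X.mean(0), y - y.mean())
pred = b * t + y.mean()
```

Because the score t is built from covariance with y rather than from inverting X'X as MLR does, correlated x variables pose no numerical problem.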

RUMBLE

RUMBLE (RUle and Margin Based LEarner) is a statistically motivated rule learning system based on the Margin
Minus Variance (MMV) optimization criterion.

SMIREP/SMIPPER

SMIREP/SMIPPER is based on combining feature generation and rule learning into one integrated
package. It constructs features, or sub graphs, by defragmenting the SMILES representations of the training
data, and refining these on the fly during the learning process. The underlying learning algorithm is similar to
that of the IREP rule learner, employing a reduced error pruning approach. SMIREP is able to incorporate
external, predefined SMARTS patterns – like functional groups – as well as physico-chemical
properties during rule construction. The resulting models learned by SMIREP are sets of rules. SMIPPER employs
an essentially similar approach, repeatedly refining the rule set it finds. The system can be run in three modes:
train/test, k-fold cross-validation, or leave-one-out cross-validation. Optionally, for each test set or fold,
receiver operating characteristic (ROC) curves are constructed for visualization purposes.

Support Vector Machines

Support vector machines (SVM) are a set of supervised learning methods used for classification and regression.
In the most widely used two-class SVM classification method, input data are viewed as two sets of vectors in
the multi-dimensional input space. The SVM classifier constructs a separating hyperplane in that space, one
which maximizes the margin between the two data sets. The method is extended to multi-class and nonlinear
classification problems by using a nonlinear kernel function. To obtain an optimum classifier for nonseparable
data, a penalty is introduced for misclassified data. This penalty is zero for patterns classified correctly, and
has a positive value that increases with the distance from the corresponding hyperplane for patterns that are
not situated on the correct side of the classifier. Similar concepts are used in the SVM regression problem,
where the objective is to identify a function that deviates from the target (experimental) values by at most ε
for all training patterns.
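For the linear, separable case, the margin-maximizing classifier can be sketched with sub-gradient descent on the hinge loss (learning rate, epoch count, and data are illustrative choices, not the solver any OpenTox service uses):

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=1000):
    """Linear SVM via sub-gradient descent on the regularized hinge loss;
    labels y must be in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) < 1:            # inside the margin: penalized
                w += lr * (C * yi * xi - w / len(X))
                b += lr * C * yi
            else:                                 # correct side: only regularize
                w -= lr * w / len(X)
    return w, b

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_linear_svm(X, y)
```

The regularization term shrinks w, which widens the margin 2/||w||, while the hinge penalty (scaled by C) grows with a pattern's distance past the margin on the wrong side, mirroring the penalty described above.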

Toxmatch

Toxmatch (Java, GPL) provides the means to compare a chemical or a set of chemicals to a toxicity dataset through the use of similarity indices.
Its intended use is one-to-many or many-to-many quantitative read-across. To help in the systematic formation of groups and read-across, it emphasizes endpoint-dependent similarity, and it includes datasets for four toxicity endpoints to facilitate endpoint-specific read-across.

Toxtree

Toxtree is a fully featured, flexible, user-friendly open source application that estimates toxic
hazard by applying a decision tree approach. Toxtree can be applied to datasets in various compatible file formats. User-defined molecular structures are also supported; they can be entered as SMILES or drawn with the built-in 2D structure diagram editor.
Toxtree has been designed with flexible capabilities for future extensions in mind (e.g., other classification schemes that could be developed at a future date). New decision trees with arbitrary rules can be built with the help of the graphical user interface or by developing new plug-ins.
