Descriptor Calculation

This section provides a summary of current OpenTox descriptor calculation components.

Descriptor Calculation Components

AMBIT

AMBIT is a software package for chemoinformatic data management. Its descriptor calculation relies on the CDK library and also implements several additional descriptors. Descriptor calculation is a separate module, packaged in ambit2-descriptors.jar, which depends only on the CDK library, the core AMBIT module (ambit2-core.jar) and the AMBIT SMARTS implementation (ambit2-smarts.jar).

Chemistry Development Kit

The Chemistry Development Kit (CDK) is a Java library for structural chemo- and bioinformatics. A number of descriptor implementations are available.

FMiner

Fminer is a novel method for efficiently mining relevant tree-shaped subgraph descriptors with minimum
frequency and correlation constraints, each representing a set of fragments sharing a common core structure
(backbone), thereby reducing feature set size and runtime. The approach is able to optimize structural inter-
feature entropy as opposed to occurrences, which is characteristic for open or closed fragment mining. In the
experiments, the proposed method reduces feature set sizes by >90% and >30% compared to complete tree
mining and open tree mining, respectively. Evaluation using crossvalidation runs shows that their classification
accuracy is similar to the complete set of trees but significantly better than that of open trees. Compared to
open or closed fragment mining, a large part of the search space can be pruned due to an improved statistical
constraint (dynamic upper bound adjustment), which is also confirmed in the experiments in lower runtimes
compared to ordinary (static) upper bound pruning. Further analysis using large-scale datasets yields insight
into important properties of the proposed descriptors, such as dataset coverage and class size represented by
each descriptor. A final crossvalidation run confirms that the novel descriptors render large training sets
feasible which previously might have been intractable for computational models.
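
The two constraints named above (minimum frequency and correlation) can be illustrated with a small sketch. It assumes that correlation is measured with a chi-square statistic on a 2x2 contingency table of fragment occurrence versus class; that choice, the function names and the toy data are illustrative assumptions, not Fminer's actual interface. The sketch only filters precomputed fragments; it does not perform tree mining, backbone grouping or upper bound pruning.

```python
from typing import Dict, List, Set


def chi_square(occurrences: Set[int], labels: List[bool]) -> float:
    """Chi-square statistic of fragment presence vs. class (2x2 table)."""
    n = len(labels)
    a = sum(1 for i in occurrences if labels[i])  # present, active
    b = len(occurrences) - a                      # present, inactive
    c = sum(labels) - a                           # absent, active
    d = n - a - b - c                             # absent, inactive
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # degenerate table: no measurable correlation
    return n * (a * d - b * c) ** 2 / denom


def select_fragments(occ: Dict[str, Set[int]], labels: List[bool],
                     min_freq: int, min_chi2: float) -> List[str]:
    """Keep fragments meeting both the frequency and correlation thresholds."""
    return sorted(
        frag for frag, mols in occ.items()
        if len(mols) >= min_freq and chi_square(mols, labels) >= min_chi2
    )


# Hypothetical toy data: six molecules, three active and three inactive.
labels = [True, True, True, False, False, False]
occ = {"X": {0, 1, 2}, "Y": {0, 3}, "Z": {0}}

# 3.84 is the 95% critical value of chi-square with one degree of freedom.
selected = select_fragments(occ, labels, min_freq=2, min_chi2=3.84)
print(selected)
```

In this toy data only fragment X passes both thresholds: Y is frequent enough but uncorrelated with the class, and Z occurs in a single molecule.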

FreeTreeMiner

The FreeTreeMiner (FTM) software computes all acyclic substructures (in mathematical terms: free or
unrooted trees) occurring at a given minimum frequency in a set of molecules. The substructures are computed
by a depth-first search. In addition to the minimum frequency support, a maximum frequency constraint can
be set. This constraint can either refer to the same database/set or to a second one, meaning that all
substructures frequent in the first and infrequent in the second are returned by FTM. The frequent
substructures are returned as SMARTS strings together with their occurrences in the given set of structures.
The software is implemented in the programming language C++ and was developed for the Linux and Mac OS
X operating systems. The FTM software is dependent on the open source chemistry toolbox OpenBabel.
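
The combination of a minimum frequency in one set and a maximum frequency in a second set can be sketched as a simple filter. This is a minimal illustration, not FTM's interface: it assumes the substructure occurrences have already been computed, that frequencies are absolute molecule counts, and all names and data are hypothetical.

```python
from typing import Dict, Set


def filter_by_frequency(occ_first: Dict[str, Set[int]],
                        occ_second: Dict[str, Set[int]],
                        min_freq: int,
                        max_freq: int) -> Dict[str, Set[int]]:
    """Return substructures frequent in the first set and infrequent in the second.

    Keys are SMARTS strings; values are the indices of the molecules in which
    the substructure occurs.
    """
    selected = {}
    for smarts, molecules in occ_first.items():
        support_first = len(molecules)
        support_second = len(occ_second.get(smarts, set()))
        if support_first >= min_freq and support_second <= max_freq:
            selected[smarts] = molecules
    return selected


# Hypothetical precomputed occurrences in two sets of structures.
occ_first = {"CC": {0, 1, 2}, "CO": {0, 2}, "CN": {1}}
occ_second = {"CC": {0, 1}, "CO": set()}

result = filter_by_frequency(occ_first, occ_second, min_freq=2, max_freq=1)
print(result)
```

Passing the same dictionary for both arguments corresponds to the case where the maximum frequency constraint refers to the same set.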

gSpan'

The gSpan' algorithm implements two optimizations of the widely known gSpan algorithm for
mining molecular databases. Both optimizations apply to the enumeration of subgraph occurrences in a graph
database, which is, also according to our profiling, the most expensive operation of gSpan. The first
optimization reduces the number of subgraph isomorphisms that need to be accessed for proper support
computation by considering the symmetries inherent in many chemical molecules, and the second speeds up
subgraph isomorphism tests by making use of the non-uniform frequency distribution of atom and bond
types.

JOELib2

JOELib2 is a platform independent open source computational chemistry package written in Java. JOELib2
consists of an algorithm library that was designed for prototyping, data mining and graph mining of chemical
compounds. JOELib2 is the Java successor of the OELib library from OpenEye.
The software was developed for the Linux and Windows operating systems. The JOELib2 implementation has no
dependencies on other software packages. There exists no graphical user interface (GUI) and the program is
executed via the command line or via Java code integration.

lazar

Lazar is a k-nearest-neighbor approach to predict chemical endpoints from a training set based on structural
fragments. It uses a SMILES file and precomputed fragments with occurrences as well as target class
information for each compound as training input. It also supports regression, in which case the target activities
consist of continuous values. Lazar uses activity-specific similarity (i.e. each fragment contributes with its
significance for the target activity), which is the basis both for predictions and for the confidence index
reported with every single prediction.
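
The idea of an activity-specific similarity feeding a nearest-neighbour prediction can be sketched as follows. This is an illustration under stated assumptions, not lazar's actual formulas: similarity is taken to be a significance-weighted Tanimoto index, the prediction a similarity-weighted vote among the k most similar training compounds, and the confidence the normalised margin of that vote. All names, weights and data are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def weighted_similarity(a: Set[str], b: Set[str],
                        significance: Dict[str, float]) -> float:
    """Tanimoto index in which each fragment counts with its significance."""
    union_weight = sum(significance.get(f, 0.0) for f in a | b)
    if union_weight == 0.0:
        return 0.0
    shared_weight = sum(significance.get(f, 0.0) for f in a & b)
    return shared_weight / union_weight


def predict(query: Set[str],
            training: List[Tuple[Set[str], bool]],
            significance: Dict[str, float],
            k: int = 3) -> Tuple[Optional[bool], float]:
    """Similarity-weighted vote of the k nearest neighbours.

    Returns (predicted class, confidence); the class is None when no
    training compound shares a significant fragment with the query.
    """
    scored = sorted(
        ((weighted_similarity(query, frags, significance), active)
         for frags, active in training),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in scored)
    if total == 0.0:
        return None, 0.0
    vote = sum(sim if active else -sim for sim, active in scored)
    return vote > 0, abs(vote) / total


# Hypothetical fragment significances and training compounds.
significance = {"f1": 0.9, "f2": 0.5, "f3": 0.1}
training = [({"f1", "f2"}, True), ({"f1"}, True), ({"f3"}, False)]

label, confidence = predict({"f1", "f3"}, training, significance)
print(label, round(confidence, 3))
```

For regression, the vote would be replaced by a similarity-weighted average of the neighbours' continuous activities.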

MakeMNA

MakeMNA is a software product for generating MNA (Multilevel Neighbourhoods of Atoms) descriptors.
These descriptors are based on a molecular structure representation that includes the hydrogens
according to the valences and partial charges of other atoms and does not specify the types of bonds. MNA
descriptors are generated as a recursively defined sequence, starting from zero-level atomic descriptors and building each next level from the descriptors of the nearest neighbours.
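
The recursive construction can be illustrated with a short sketch. It assumes that the zero-level descriptor of an atom is its element symbol and that each next level wraps the sorted previous-level descriptors of the atom's immediate neighbours in parentheses. This is a simplified reading and not MakeMNA's output format: partial charges and any additional atom marks are ignored, and the function name and molecule encoding are hypothetical.

```python
from typing import Dict, List


def mna(atoms: List[str], neighbours: Dict[int, List[int]],
        index: int, level: int) -> str:
    """Recursively build a neighbourhood descriptor for one atom.

    Level 0 is the atom symbol; level n is the symbol followed by the
    sorted level n-1 descriptors of its neighbours. Bond types are not used.
    """
    symbol = atoms[index]
    if level == 0:
        return symbol
    parts = sorted(mna(atoms, neighbours, j, level - 1)
                   for j in neighbours[index])
    return symbol + "(" + "".join(parts) + ")"


# Ethanol with explicit hydrogens: CH3 (0), CH2 (1), O (2), H (3-8).
atoms = ["C", "C", "O", "H", "H", "H", "H", "H", "H"]
neighbours = {
    0: [1, 3, 4, 5],
    1: [0, 2, 6, 7],
    2: [1, 8],
    3: [0], 4: [0], 5: [0],
    6: [1], 7: [1],
    8: [2],
}

for i in range(3):
    print(mna(atoms, neighbours, i, 1), mna(atoms, neighbours, i, 2))
```

Sorting the neighbour descriptors makes the result independent of the order in which the atoms are listed.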

MakeQNA

MakeQNA calculates Quantitative Neighbourhoods of Atoms (QNA) descriptors based on quantities of ionization potential (IP) and electron affinity (EA) of each atom of the molecule.

MaxTox

MaxTox compares the query molecule to each cluster (EP based) and computes an MCS score with respect to the
molecules of each cluster. The MCS scores are then used in a machine learning algorithm to generate predictive models.
The software is primarily implemented in Java for Linux-based systems. MaxTox
depends on the open source Chemistry Development Kit (CDK) and OpenBabel.

MOPAC

MOPAC (Molecular Orbital PACkage) was started in 1981, and has been under
continuous development since then. MOPAC 7.1 is a FORTRAN 90 version of MOPAC 7.
It supports the methods: MNDO, AM1, and PM3, as well as Sparkle/AM1 for the
lanthanides. All published NDDO parameter sets are supported.

OpenBabel

Open Babel is a chemical toolbox designed to speak the many languages of chemical data. It's an open,
collaborative project allowing anyone to search, convert, analyze, or store data from molecular modeling,
chemistry, solid-state materials, biochemistry, or related areas.
Open Babel is an open source computational chemistry package written in C++.
The software is available for the Linux, Windows and Mac operating systems. The Open Babel implementation
has no dependencies on other software packages.

ToxTree

Toxtree is a full-featured, flexible and user-friendly open source application that estimates toxic
hazard by applying a decision tree approach. Toxtree can be applied to datasets from various compatible file types. User-defined molecular structures are also supported; they can be entered as SMILES or drawn with the built-in 2D structure diagram editor.
Toxtree has been designed with future extensions in mind (e.g. other classification schemes that could be developed at a later date). New decision trees with arbitrary rules can be built with the help of the graphical user interface or by developing new plug-ins.
