OpenTox 2011 Abstracts

Modeling Ames Mutagenicity Using Machine Learning Methods – A Comparative Study

Dr. U C Abdul Jaleel, Malabar Christian College, Department of Cheminformatics, Kerala, India
Abhik Seal, DOEACC Society Kolkata (JU Campus), India
Anurag Passi, Open Source Drug Discovery, Council of Scientific and Industrial Research (CSIR), India

Toxicity and mutagenicity are among the major endpoints in ADMET studies, which are key components of any drug discovery process. The ultimate aim of our work was to build structure-activity-based in silico predictive mutagenicity models, using a machine learning approach, that could largely replace in vitro experiments. We used the Bursi mutagenicity dataset containing 4337 compounds (Set 1) and a benchmark dataset of 6512 compounds (Set 2), available at http://doc.ml.tu-berlin.de/toxbenchmark/, as modeling data. A third dataset (Set 3) was prepared by combining the two sets. A total of 179 descriptors (attributes) were calculated using the PowerMV descriptor calculation tool. The Weka machine learning and data mining tool was used to classify the compounds. Useless descriptors were removed, reducing the attribute set to 156 descriptors. Redundant records were removed, leaving 8292 compounds in Set 3 with a minority ratio of 1.84. Classification models were developed using the 156 descriptors and the classifier algorithms Naive Bayes, Random Forest, J48 and SMO, with 10-fold cross-validation. Each model was validated with one internal test set and two external high-quality test sets from PubChem. For each set, Random Forest outperformed the other classifiers, and the models for Set 3 were considerably more accurate, with 89.27% accuracy, 0.89 precision and an ROC area of 95.3%. The two external sets with mutagenicity data, AID 1189 and AID 1194, gave 62% accuracy, 0.67 precision and 65% ROC area, and 91% accuracy, 0.91 precision and 96.3% ROC area, respectively. The dataset used here will help in developing high-quality mutagenicity QSAR models. For large-scale screening of molecules with unknown mutagenicity, Random Forest-based classification models could be developed to predict the mutagenicity of molecules with high accuracy.
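The evaluation quantities quoted above (accuracy, precision, ROC area, 10-fold cross-validation) can be illustrated with a minimal Python sketch. This is not the Weka workflow used in the study: the function names, the toy labels and scores, and the 0.5 decision threshold are all assumptions made for illustration, and the abstract does not state which class or averaging scheme its precision figures refer to.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign sample indices to k folds so each fold keeps roughly the
    same class proportions as the full dataset (hypothetical helper)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin into the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision(y_true, y_pred, positive=1):
    """True positives divided by all predicted positives, for one class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fp) if (tp + fp) else 0.0


def roc_auc(y_true, scores, positive=1):
    """ROC area as the probability that a random positive outranks a
    random negative; tied scores count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (assumed): 1 = mutagen, 0 = non-mutagen, with classifier scores.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.7, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(accuracy(y_true, y_pred))   # 0.75
print(precision(y_true, y_pred))  # 0.75
print(roc_auc(y_true, scores))    # 0.9375
```

In a cross-validation run, each fold returned by `stratified_folds` would serve once as the held-out test set while a model is trained on the remaining folds, and the metrics would be computed on the pooled or averaged held-out predictions.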
