Assessment of the complementarity of machine learning methods in QSAR modeling using AZOrange

Jonna Stålring, AstraZeneca, Sweden
Pedro Almeida, EngInMotion Ltd/AstraZeneca, Portugal
Scott Boyer, AstraZeneca, Sweden

A multitude of machine learning (ML) algorithms have been developed for different applications, and they rely on various conceptual foundations. General guidelines for the selection of an ML algorithm do not exist, and it is acknowledged within the machine learning community that a data-set-specific choice of algorithm results in the most accurate models. AZOrange has been developed to provide multiple high-performance ML algorithms within the same package. The implementations are customized such that users without theoretical knowledge of the algorithms can still use them correctly. Furthermore, automated selection of model hyper-parameters, of particular importance for non-linear algorithms, reduces the need for manual tweaking of model parameters and increases the efficiency of the model development process. The customization and the automated model parameter selection provide the tools necessary for batch generation of models and the assessment of multiple model hypotheses.

AZOrange has been complemented with a process for automated building, selection and validation of all its ML algorithms. To avoid overfitting and overestimation of the generalization accuracy, potentially resulting from any elaborate selection process, extensive re-sampling is used and no selection of algorithm or model hyper-parameters has been performed based on the validation sets. In the assessment of model quality, the accuracy is complemented by its variance between the folds. The process includes automated construction of a consensus model, weighting the constituent models by their global accuracy.
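As an illustration of the two quantities named above, the following minimal Python sketch summarizes per-fold accuracy by its mean and between-fold variance, and combines model predictions with weights proportional to each model's accuracy. It is not the AZOrange implementation: the function names `summarize_folds` and `consensus_predict`, the proportional weighting scheme, and the restriction to regression-style numeric predictions are all assumptions made for this example.

```python
from statistics import mean, pvariance


def summarize_folds(fold_scores):
    """Return (mean accuracy, between-fold variance) for one model.

    Hypothetical helper: the population variance is used here; the abstract
    does not state which variance estimator the process applies.
    """
    return mean(fold_scores), pvariance(fold_scores)


def consensus_predict(predictions, accuracies):
    """Combine one prediction per model into a consensus value.

    Assumption: each model's weight is its accuracy divided by the sum of
    all accuracies, so more accurate models contribute more.
    """
    total = sum(accuracies)
    weights = [a / total for a in accuracies]
    return sum(w * p for w, p in zip(weights, predictions))
```

For example, two models predicting 1.0 and 3.0 with accuracies 0.75 and 0.25 would give a consensus prediction of 1.5 under this weighting.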

The process is empirically evaluated on a suite of regression and binary classification data containing ten SARs for toxicologically associated assays obtained from PubChem, several data sets from the chemoinformatics web page, as well as data from US-EPA, FDA and NTP. Chemical structure was represented using either the set of 177 physico-chemical descriptors of RDKit or circular fingerprints as implemented in RDKit with radius 1.

The accuracy of the different ML methods is compared based on their rank sums over the data suites, while the significance of the difference between the rank sums is assessed with the Bonferroni-Dunn test. The results for the small congeneric data sets and the large global QSAR data sets are analyzed separately. No ML method can be identified as superior, but several methods are frequently selected as the most accurate. The automatically constructed consensus learner based on all stable algorithms is almost exclusively selected for the global regression data sets. However, for the global classification sets, the Boost algorithm is identified as significantly more accurate than any other method. This recently implemented automated model building, validation and selection process is an efficient tool to select the statistically most accurate method for a specific data set.

(presenting author: Jonna Stålring)
