An Extensive Multi-label Analysis of the ToxCast Data Set
Joerg Wicker, Julian Lemke and Stefan Kramer, TU München, Germany
The ToxCast data set was released by the EPA at the beginning of 2009 [1]. In August 2009, the data of phase one became publicly available. The data set contains 320 chemical structures and 1633 endpoints. The chemicals are tested against ToxRefDB, which contains 424 toxicological in vivo endpoints. The remaining features are mainly in vitro data. The main goal of the ToxCast data set is to break the in vitro / in vivo border and use the in vitro features to predict in vivo endpoints. Nevertheless, the data set is not easy to handle. Both the in vitro and in vivo data are of unknown and presumably varying quality, many instances have missing values (in both the in vitro and the in vivo data), the structures are heterogeneous, and there is a slight skew in the class distributions. At the ToxCast™ Data Analysis Summit in May 2009, partners in the project presented first analysis strategies. These first analyses showed that it is hard to find correlations between the in vitro data and individual in vivo endpoints.
We will present the results of a comprehensive multi-label classification analysis [3] of the data set. Multi-label classifiers do not predict only one class but take into account interdependencies among several classes or labels to improve the prediction. In contrast to a previous multi-label approach presented by Jeliazkova et al. [2], we use all 320 structures and not just the 160 structures for which all in vivo data are available. We used the Mulan multi-label library [3], as it provides several multi-label classifiers and analysis methods. The ToxCast data set contains many missing values, as not all structures are tested in all assays. While many learning algorithms can cope with missing values in general, the performance of many multi-label schemes should be expected to suffer, as less information on the dependencies among labels can be exploited.
To alleviate this problem, we studied methods for data imputation, i.e., methods to fill in missing values in the data. Most of these methods focus on missing numeric values. For multi-label classification, a method imputing binary data is required. We applied a new method developed on the basis of multi-label classification to fill missing labels in a multi-label data set. This improves the performance and allows using multi-label classifiers which cannot handle missing values in the labels.
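To make the idea of imputing binary labels concrete, the following is a minimal sketch in Python. It is not the method applied in this work, whose details are not given here; it swaps in a simple nearest-neighbour vote as a stand-in. Each missing label is filled from the rows that agree most closely on the labels both rows have in common. The names `hamming_on_known` and `impute_labels`, the parameter `k`, and the toy data are all hypothetical.

```python
from typing import List, Optional

Label = Optional[int]  # 1, 0, or None for a missing measurement


def hamming_on_known(a: List[Label], b: List[Label], skip: int) -> Optional[float]:
    """Fraction of disagreeing labels over positions known in both rows,
    ignoring column `skip` (the label currently being imputed)."""
    shared = [(x, y) for j, (x, y) in enumerate(zip(a, b))
              if j != skip and x is not None and y is not None]
    if not shared:
        return None  # no overlap, so the two rows cannot be compared
    return sum(x != y for x, y in shared) / len(shared)


def impute_labels(rows: List[List[Label]], k: int = 3) -> List[List[Label]]:
    """Fill each missing label by majority vote among the k most similar
    rows in which that label is known. Ties are resolved to 0.
    Illustrative stand-in only, not the method described in the text."""
    filled = [list(r) for r in rows]  # copy so the input is not modified
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                if m == i or other[j] is None:
                    continue
                d = hamming_on_known(row, other, skip=j)
                if d is not None:
                    candidates.append((d, other[j]))
            if not candidates:
                continue  # leave the gap if no comparable row exists
            candidates.sort(key=lambda t: t[0])
            votes = [label for _, label in candidates[:k]]
            filled[i][j] = 1 if 2 * sum(votes) > len(votes) else 0
    return filled


# Toy label matrix: rows are structures, columns are binary endpoints.
toy = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
    [1, None, 0],
    [0, 0, None],
]
completed = impute_labels(toy, k=2)
```

Because similarity is computed from the other labels, the sketch exploits label dependencies in the same spirit as a multi-label classifier, and the completed matrix can then be passed to learners that require fully observed labels.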
References
[1] Judson et al. (2010) "In Vitro Screening of Environmental Chemicals for Targeted Testing Prioritization - The ToxCast Project", Environmental Health Perspectives, in press (published online in December 2009).
[2] Jeliazkova, N., Jeliazkov, V. (2009) "Hierarchical Multi-Label Classification of ToxCast Datasets", ToxCast Data Analysis Summit, US EPA, Research Triangle Park, NC, May 14-15, 2009.
[3] Tsoumakas, G., Katakis, I., Vlahavas, I. (2010) "Mining Multi-Label Data", Data Mining and Knowledge Discovery Handbook, O. Maimon, L. Rokach (Eds.), Springer, 2nd edition, 2010.
Acknowledgements
This research was funded by the EU FP7 project OpenTox - An Open Source Predictive Toxicology Framework (Health-F5-2008-200787).