Authors: Sun X, Zhang X, Wang L, Li Y, Muir DCG, Zeng EY
Deep convolutional neural networks (DCNNs) have proved to be a promising tool for identifying organic chemicals of environmental concern. However, the uncertainty associated with DCNN predictions remains to be quantified. The training process involves many random configurations, including dataset segmentation, input sequence, and initial weights. Moreover, the DCNN working mechanism is non-linear and opaque. To increase confidence in using this novel approach, persistent, bioaccumulative, and toxic substances (PBTs) were used as representative chemicals of environmental concern to estimate the prediction uncertainty under five distinct datasets and ten different molecular descriptor (MD) arrangements, with 111,852 chemicals and 2424 available MDs. An internal correlation coefficient test indicated that the prediction confidence reached 0.98 when the mean of 50 DCNNs' predictions was used instead of a single DCNN prediction. A threshold for PBT categorization was determined by considering the costs of false-negative and false-positive predictions. As revealed by the guided backpropagation-class activation mapping (GBP-CAM) saliency images, only 12% of all selected MDs were activated by the DCNN and influenced the decision-making process. However, the activated MDs not only varied among chemical classes but also shifted among different DCNNs. Principal component analysis indicated that the 2424 MDs could be transformed into 370 orthogonal variables. Both results suggest that redundancy exists among the selected MDs. Yet, the DCNN was found to adapt to redundant data by focusing on the most important information for better prediction performance.
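The two decision steps described in the abstract, averaging the predictions of many independently trained DCNNs and then choosing a categorization threshold that weighs false negatives against false positives, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state how the costs were defined or how the threshold was searched, so the function names (`ensemble_mean`, `pick_threshold`), the cost weights, and the synthetic scores below are all assumptions made for illustration.

```python
import numpy as np


def ensemble_mean(predictions):
    """Average scores over models; input shape is (n_models, n_chemicals)."""
    return np.asarray(predictions, dtype=float).mean(axis=0)


def pick_threshold(scores, labels, cost_fn=1.0, cost_fp=1.0):
    """Return (threshold, cost) minimizing cost_fn * FN + cost_fp * FP.

    A chemical is predicted positive (PBT) when its score >= threshold.
    Candidate thresholds are the observed scores plus +inf (predict none).
    The cost weights are hypothetical, not values from the paper.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    candidates = np.append(np.unique(scores), np.inf)
    best_t, best_cost = None, np.inf
    for t in candidates:
        predicted = scores >= t
        false_neg = np.sum(labels & ~predicted)
        false_pos = np.sum(~labels & predicted)
        cost = cost_fn * false_neg + cost_fp * false_pos
        if cost < best_cost:  # ties keep the lowest threshold
            best_t, best_cost = float(t), float(cost)
    return best_t, best_cost


if __name__ == "__main__":
    # Synthetic stand-in for 50 model outputs on 200 chemicals (assumed data).
    rng = np.random.default_rng(0)
    labels = rng.random(200) < 0.3
    base = np.where(labels, 0.7, 0.3)
    per_model = np.clip(base + rng.normal(0.0, 0.2, size=(50, 200)), 0.0, 1.0)

    mean_scores = ensemble_mean(per_model)
    threshold, cost = pick_threshold(mean_scores, labels, cost_fn=5.0, cost_fp=1.0)
    print("threshold:", threshold, "total cost:", cost)
```

Raising `cost_fn` relative to `cost_fp` pushes the chosen threshold down, so fewer true PBTs are missed at the expense of more false alarms; the appropriate ratio depends on the screening context and is not given in the abstract.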
Keywords: Gradient-weighted class activation mapping; Guided backpropagation; Organic contaminants; Prediction uncertainty; Redundancy
PubMed: https://pubmed.ncbi.nlm.nih.gov/34388923/
DOI: 10.1016/j.jhazmat.2021.126746