Although recent progress is substantial, deep learning methods can be vulnerable to maliciously generated adversarial examples. In this paper, we present a novel training procedure and a thresholding test strategy towards robust detection of adversarial examples. In training, we propose to minimize the reverse cross-entropy (RCE), which encourages a deep network to learn latent representations that better distinguish adversarial examples from normal ones. In testing, we propose a thresholding strategy as the detector to filter out adversarial examples for reliable predictions. Our method is simple to implement with standard algorithms, at little extra training cost compared to common cross-entropy minimization. We apply our method to defend against various attacks on the widely used MNIST and CIFAR-10 datasets, and achieve significant improvements in robust prediction under all threat models in the adversarial setting.
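To make the RCE idea concrete, here is a minimal sketch of one plausible form of a reverse cross-entropy loss: the "reversed" label places uniform mass 1/(K-1) on every wrong class and zero on the true class, so minimizing the loss pushes the network to spread probability uniformly over the non-true classes. The function name and the exact formulation below are illustrative assumptions, not the paper's verified implementation.

```python
import numpy as np

def reverse_cross_entropy(probs, y):
    """Illustrative RCE loss for a single example (an assumption, not
    the paper's exact code).

    probs : (K,) array of predicted class probabilities (sums to 1)
    y     : index of the true class

    The reversed label r has r[j] = 1/(K-1) for j != y and r[y] = 0,
    so the loss is the cross-entropy between r and the prediction.
    Minimizing it encourages uniform probabilities on wrong classes.
    """
    K = probs.shape[0]
    reversed_label = np.full(K, 1.0 / (K - 1))
    reversed_label[y] = 0.0
    # Small epsilon avoids log(0) for classes assigned zero probability.
    return -np.sum(reversed_label * np.log(probs + 1e-12))

# Two predictions with the same true-class confidence: one spreads the
# remaining mass uniformly over wrong classes, one concentrates it.
uniform_wrong = np.array([0.9, 0.05, 0.05])
skewed_wrong = np.array([0.9, 0.09, 0.01])
rce_uniform = reverse_cross_entropy(uniform_wrong, y=0)
rce_skewed = reverse_cross_entropy(skewed_wrong, y=0)
```

Under this sketch, the prediction whose wrong-class probabilities are uniform attains the lower RCE, matching the intuition that RCE training shapes the non-maximal entries of the softmax output.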