In adversarial machine learning, new defenses against attacks on deep
learning systems are routinely broken soon after their release by more powerful
attacks. In this context, forensic tools can offer a valuable complement to
existing defenses, by tracing back a successful attack to its root cause, and
offering a path forward for mitigation to prevent similar attacks in the future.

In this paper, we describe our efforts in developing a forensic traceback
tool for poison attacks on deep neural networks. We propose a novel iterative
clustering and pruning solution that trims “innocent” training samples until
all that remains is the set of poisoned data responsible for the attack. Our
method clusters training samples based on their impact on model parameters,
then uses an efficient data unlearning method to prune innocent clusters. We
empirically demonstrate the efficacy of our system on three types of
dirty-label (backdoor) poison attacks and three types of clean-label poison
attacks, across domains of computer vision and malware classification. Our
system achieves over 98.4% precision and 96.8% recall across all attacks. We
also show that our system is robust against four anti-forensics measures
specifically designed to attack it.
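The iterative clustering and pruning loop described above can be illustrated with a minimal toy sketch. It is not the authors' implementation, and every name in it (`two_means`, `traceback`, `is_innocent`, `FEATS`, `POISONED`) is hypothetical. It assumes each training sample has already been mapped to a feature vector standing in for its "impact on model parameters". It also assumes the unlearning-based test ("does the attack survive once this cluster is unlearned?") can be replaced by a ground-truth oracle, and it prunes only when exactly one of the two clusters looks innocent. Each round splits the remaining samples with a deterministic 2-means, drops the innocent cluster, and stops once no cluster can be pruned.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def _sq_dist(u: Vector, v: Vector) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def _mean(points: List[Vector]) -> List[float]:
    return [sum(col) / len(points) for col in zip(*points)]


def two_means(ids: List[int], feats: Dict[int, Vector],
              iters: int = 20) -> Tuple[List[int], List[int]]:
    """Deterministic 2-means: seed with the first point and the point farthest from it."""
    seed_a = feats[ids[0]]
    seed_b = feats[max(ids, key=lambda i: _sq_dist(feats[i], seed_a))]
    centroids = [list(seed_a), list(seed_b)]
    groups: Tuple[List[int], List[int]] = ([], [])
    for _ in range(iters):
        groups = ([], [])
        for i in ids:
            d0 = _sq_dist(feats[i], centroids[0])
            d1 = _sq_dist(feats[i], centroids[1])
            groups[0 if d0 <= d1 else 1].append(i)
        if not groups[0] or not groups[1]:
            break  # degenerate split (e.g. identical points)
        new = [_mean([feats[i] for i in g]) for g in groups]
        if new == centroids:
            break  # assignments have converged
        centroids = new
    return groups


def traceback(ids: List[int], feats: Dict[int, Vector],
              is_innocent: Callable[[List[int]], bool]) -> List[int]:
    """Iteratively split the remaining samples and prune the cluster judged innocent.

    `is_innocent` is a hypothetical stand-in for the unlearning check:
    unlearn the cluster and see whether the attack still succeeds.
    """
    remaining = list(ids)
    while len(remaining) > 1:
        a, b = two_means(remaining, feats)
        if not a or not b:
            break
        innocent = [c for c in (a, b) if is_innocent(c)]
        if len(innocent) != 1:
            break  # nothing (or, ambiguously, everything) to prune: stop
        drop = set(innocent[0])
        remaining = [i for i in remaining if i not in drop]
    return remaining


# Toy data (hypothetical): ids 0-5 are innocent, ids 6-8 are poisoned.
FEATS: Dict[int, Vector] = {
    0: (0.0, 0.0), 1: (0.0, 1.0), 2: (1.0, 0.0), 3: (1.0, 1.0),
    4: (0.5, 0.5), 5: (0.2, 0.8),
    6: (5.0, 5.0), 7: (5.0, 6.0), 8: (6.0, 5.0),
}
POISONED = {6, 7, 8}

result = traceback(list(FEATS), FEATS, lambda c: not (set(c) & POISONED))
print(result)
```

In this toy run, the first split separates the innocent samples from the poisoned ones and prunes the innocent cluster. The next split divides the poisoned samples among themselves; since neither half looks innocent, the loop stops and leaves only the poisoned samples.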
