The recent success of deep learning has raised concerns about adversarial examples, which allow attackers to force deep neural networks to output a specified target. Although a method for generating audio adversarial examples that target a state-of-the-art speech recognition model has been proposed, it cannot fool the model when the audio is played over the air, so the threat was considered limited. In this paper, we propose a method for generating adversarial examples that remain effective when played over the air in the physical world, by simulating the transformations caused by playback and recording and incorporating them into the generation process. An evaluation and a listening experiment demonstrated that audio adversarial examples generated by the proposed method may become a real threat.
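The idea of simulating playback and recording can be illustrated with a rough sketch. The abstract does not specify which transformations are simulated, so the choices here (convolution with a room impulse response, followed by additive white Gaussian noise) are assumptions made for illustration, and the helper names `simulate_over_the_air` and `synthetic_impulse_response` are hypothetical. The sine wave and random perturbation are placeholders for real audio, and no speech recognition model is involved.

```python
import numpy as np


def simulate_over_the_air(audio, impulse_response, noise_snr_db, rng):
    """Roughly mimic playback and recording (assumed model: reverb + noise)."""
    # Reverberation: convolve with an impulse response, keep the original length.
    reverbed = np.convolve(audio, impulse_response)[: len(audio)]
    # Additive white Gaussian noise scaled to the requested signal-to-noise ratio.
    signal_power = np.mean(reverbed ** 2)
    noise_power = signal_power / (10.0 ** (noise_snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=len(audio))
    return reverbed + noise


def synthetic_impulse_response(length, decay, rng):
    """Hypothetical stand-in for a measured impulse response: decaying noise."""
    ir = rng.normal(size=length) * np.exp(-decay * np.arange(length))
    ir[0] = 1.0  # direct path from speaker to microphone
    return ir / np.sqrt(np.sum(ir ** 2))  # normalize to unit energy


rng = np.random.default_rng(0)
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
audio = 0.5 * np.sin(2 * np.pi * 440.0 * t)          # placeholder input audio
perturbation = 0.01 * rng.normal(size=audio.shape)   # placeholder perturbation

# During generation, the attack loss would be averaged over many random
# simulations like these, so the perturbation survives all of them.
simulated = [
    simulate_over_the_air(
        audio + perturbation,
        synthetic_impulse_response(800, 0.01, rng),
        noise_snr_db=20.0,
        rng=rng,
    )
    for _ in range(4)
]
```

In an actual attack, each simulated signal would be fed to the recognition model and the perturbation updated to minimize the average loss toward the target phrase; that optimization loop is omitted here.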
We played and recorded each adversarial example 10 times using a JBL CLIP2 speaker and a Sony ECM-PCV80U microphone, and evaluated the transcriptions produced by the pretrained DeepSpeech model.
|Sample||Target phrase||SNR||Success rate (10 trials)||Average edit distance|
|(A)||"hello world"||9.3 dB||100%||0.0|
|(B)||"open the door"||-2.7 dB||100%||0.0|
|(C)||"ok google"||7.5 dB||0%||4.2|
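The three metrics in the table above can be computed along the following lines. This is a minimal sketch under stated assumptions: SNR is taken as the power ratio of the original audio to the perturbation in decibels, edit distance is character-level Levenshtein distance, and a trial counts as a success only when the transcription matches the target exactly. The function names and the sample transcriptions are hypothetical, not taken from the experiment.

```python
import math


def snr_db(signal, perturbation):
    """SNR in dB, assuming the perturbation is treated as the noise."""
    signal_power = sum(x * x for x in signal)
    noise_power = sum(x * x for x in perturbation)
    return 10.0 * math.log10(signal_power / noise_power)


def edit_distance(a, b):
    """Character-level Levenshtein distance (assumed metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]


def evaluate(target, transcriptions):
    """Return (success rate, average edit distance) over recorded trials."""
    distances = [edit_distance(target, t) for t in transcriptions]
    success_rate = sum(d == 0 for d in distances) / len(distances)
    return success_rate, sum(distances) / len(distances)


# Hypothetical transcriptions of 10 recordings, for illustration only.
trials = ["hello world"] * 8 + ["hello word", "hollow world"]
success_rate, avg_distance = evaluate("hello world", trials)
```

Under this SNR definition, a negative value such as the one in row (B) means the perturbation carries more power than the original audio.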
|Method||Attack on recurrent models||Attack over the air|
|Carlini et al. (2018)||✔||-|
|Yuan et al. (2018)||-||✔|
Hiromu Yakura, Jun Sakuma.
Robust Audio Adversarial Example for a Physical Attack.
In arXiv:1810.11793, 2018.
This study was supported by JST CREST JPMJCR1302 and KAKENHI 16H02864.