DolphinAttack: Inaudible Voice Commands

Authors:
Guoming Zhang, Zhejiang University, Hangzhou, China
Chen Yan, Zhejiang University, Hangzhou, China
Xiaoyu Ji, Zhejiang University, Hangzhou, China
Tianchen Zhang, Zhejiang University, Hangzhou, China
Taimin Zhang, Zhejiang University, Hangzhou, China
Wenyuan Xu, Zhejiang University, Hangzhou, China

CCS '17: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, October 2017, Pages 103–117. https://doi.org/10.1145/3133956.3134052

Published: 30 October 2017

ABSTRACT

Speech recognition (SR) systems such as Siri or Google Now have become an increasingly popular human-computer interaction method, and have turned various systems into voice controllable systems (VCS). Prior work on attacking VCS shows that hidden voice commands that are incomprehensible to people can control these systems. Hidden voice commands, though "hidden", are nonetheless audible. In this work, we design a completely inaudible attack, DolphinAttack, that modulates voice commands on ultrasonic carriers (e.g., f > 20 kHz) to achieve inaudibility. By leveraging the nonlinearity of the microphone circuits, the modulated low-frequency audio commands can be successfully demodulated, recovered, and, more importantly, interpreted by the speech recognition systems. We validated DolphinAttack on popular speech recognition systems, including Siri, Google Now, Samsung S Voice, Huawei HiVoice, Cortana, and Alexa. By injecting a sequence of inaudible voice commands, we demonstrate a few proof-of-concept attacks, which include activating Siri to initiate a FaceTime call on an iPhone, activating Google Now to switch the phone to airplane mode, and even manipulating the navigation system in an Audi automobile. We propose hardware and software defense solutions, and suggest redesigning voice controllable systems to be resilient to inaudible voice command attacks.
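
The demodulation effect described in the abstract can be illustrated with a small numerical sketch. It is not the authors' implementation. It assumes a simple quadratic model of the microphone nonlinearity, out = a*in + b*in^2, and stands in a single 1 kHz tone for the voice command. The sample rate, carrier frequency, coefficients `a` and `b`, and the helper names `band_amplitude` and `simulate` are all hypothetical choices made for illustration.

```python
import numpy as np

FS = 192_000        # sample rate in Hz (assumed; high enough to represent the carrier)
F_CARRIER = 30_000  # ultrasonic carrier in Hz (assumed; any value above 20 kHz)
F_VOICE = 1_000     # single tone standing in for a voice command (assumed)


def band_amplitude(signal, fs, freq):
    """Return the amplitude of the sinusoidal component of `signal` at `freq` Hz."""
    spectrum = np.fft.rfft(signal)
    bin_index = int(round(freq * len(signal) / fs))
    return 2.0 * np.abs(spectrum[bin_index]) / len(signal)


def simulate(a=1.0, b=0.1):
    """Modulate a tone onto the carrier, then pass it through a quadratic nonlinearity.

    The model received = a*x + b*x**2 is an illustrative assumption, not a
    measured microphone characteristic.
    """
    n_samples = FS // 10  # 0.1 s of signal, so every tone falls on an exact FFT bin
    t = np.arange(n_samples) / FS
    baseband = np.cos(2 * np.pi * F_VOICE * t)
    carrier = np.cos(2 * np.pi * F_CARRIER * t)
    # Amplitude modulation: all transmitted energy sits around 30 kHz.
    transmitted = (1.0 + baseband) * carrier
    # The squared term mixes the carrier with its sidebands and recreates
    # a component at the original 1 kHz baseband frequency.
    received = a * transmitted + b * transmitted ** 2
    return transmitted, received


transmitted, received = simulate()
print("1 kHz amplitude before nonlinearity:", band_amplitude(transmitted, FS, F_VOICE))
print("1 kHz amplitude after nonlinearity: ", band_amplitude(received, FS, F_VOICE))
```

Under this model the transmitted signal has essentially no energy at 1 kHz, while the signal after the nonlinearity contains a 1 kHz component whose amplitude is approximately `b`. A real attack would additionally depend on speaker and microphone frequency responses and on the low-pass filtering in the audio chain, none of which this sketch models.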

Index Terms

DolphinAttack: Inaudible Voice Commands
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Speech recognition
2. Security and privacy
  1. Security in hardware
    1. Embedded systems security
  2. Systems security
    1. Operating systems security
      1. Mobile platform security


Published in
CCS '17: Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security
October 2017
2682 pages
ISBN:9781450349468
DOI:10.1145/3133956
General Chair: Bhavani Thuraisingham (The University of Texas at Dallas, USA)
Program Chairs: David Evans (University of Virginia), Tal Malkin (Columbia University), Dongyan Xu (Purdue University)
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Publisher
Association for Computing Machinery
New York, NY, United States
Badges
- Best Paper
Author Tags
defense
mems microphones
security analysis
speech recognition
voice controllable systems
Qualifiers
- research-article
Acceptance Rates
CCS '17 Paper Acceptance Rate: 151 of 836 submissions, 18%
Overall Acceptance Rate: 1,261 of 6,999 submissions, 18%