VoiceCloak: Adversarial Example Enabled Voice De-Identification with Balanced Privacy and Utility

Meng Chen
ACM UbiComp/ISWC 2023
Aug 20, 2023

Co-author: Li Lu@MUSLab

The privacy-utility dilemma in voice services

Recent decades have witnessed voice input become one of the most prevalent interaction methods, widely deployed across various services. Richer functional utility, including automatic speech transcription, efficient voice search, and live language translation, gradually enables a natural yet much more intelligent user experience.

However, behind the powerful utility of voice services, the privacy risks of voice data publishing raise extensive public concerns. Many leading tech giants collect and store users’ voices in practice, or even eavesdrop on users’ conversations without any consent. This exposes users to the risk of identity leakage by specialized Automatic Speaker Identification (ASI) tools, which can extract voiceprints after listening to only 8∼10 words. Such voiceprints may be used to disclose Personally Identifiable Information (PII) for user profiling and targeted advertising, or even for malicious impersonation.

Existing voice de-identification schemes

To address this privacy-utility dilemma, voice de-identification schemes have been proposed to eliminate individual traits while maintaining the linguistic content for downstream tasks (e.g., Automatic Speech Recognition, ASR). These schemes include voice transformation, voice conversion, and speech synthesis.

Generally, these methods are designed for machine-centric tasks, i.e., protecting user identity against ASI while keeping speech transcripts correct for ASR. However, they ignore the human-centric experience: the perceptual quality of de-identified voices declines significantly due to inconsistent voiceprints and severe distortion, leading to an imbalance between voice privacy and utility.

Adversarial example as a new de-identification tool

We take a different viewpoint to balance the speech utility and identity privacy of voice services. Inspired by their strong threat to learning-based automatic systems and their excellent imperceptibility to humans, we introduce adversarial examples as a new de-identification tool. By imposing subtle perturbations on the voice, we can conceal the speaker identity while maintaining speech utility in terms of voiceprint consistency, speech integrity, and audio quality.

Based on this, we propose a user-centric voice de-identification system, VoiceCloak, which is implemented as a guard app on the user side before voices are uploaded to the cloud server. For each raw voice, the Convolutional Perturbation Injector convolves it with an impulse response-like adversarial perturbation to construct an adversarial example. This modulates the perturbation into natural-sounding reverberation, thus maintaining the perceptual voice characteristics. Then the Pseudo Target Sampler generates a pseudo speaker embedding as the target identity through a pre-trained embedding-level conditional variational auto-encoder. Finally, the Marginal Triplet Optimizer adopts a triplet loss architecture to optimize adversarial examples in an input-specific manner. This enables any source user to disguise as a large group of target speakers across different utterances, further improving voice unlinkability.
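To make the injector concrete, here is a minimal NumPy sketch of the convolutive perturbation idea: the raw waveform is convolved with a small impulse-response-like kernel so the perturbation is heard as natural reverberation. The function name, kernel shape, and scale are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def inject_convolutive_perturbation(voice, perturbation):
    """Convolve a 1-D waveform with an impulse-response-like adversarial
    perturbation, so the change is perceived as room reverberation."""
    # Full convolution, then truncate to the original length so the
    # de-identified voice aligns sample-for-sample with the input.
    adversarial = np.convolve(voice, perturbation, mode="full")[: len(voice)]
    # Keep the result in a valid waveform range.
    return np.clip(adversarial, -1.0, 1.0)

# Toy usage: a 1-second 440 Hz "voice" and a decaying, reverb-like kernel.
voice = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000).astype(np.float32)
ir = np.exp(-np.arange(128) / 64.0) * 0.002  # small exponential tail
ir[0] = 1.0  # direct path: the original signal stays dominant
adv = inject_convolutive_perturbation(voice, ir)
```

In a real attack the kernel would be optimized (e.g., by gradient descent against a speaker embedding model) rather than fixed; the sketch only shows why a convolutive perturbation stays subtle to listeners.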

As a result, the adversarial examples are identified as different target speakers by ASIs while preserving correct transcription by ASRs.
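The triplet-style objective behind this behavior can be sketched as follows: push the adversarial example's speaker embedding toward the sampled pseudo target and away from the source speaker, up to a margin. This is a simplified illustration, not the paper's exact loss; the embeddings are assumed to come from a pre-trained speaker encoder.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def marginal_triplet_loss(adv_emb, pseudo_target_emb, source_emb, margin=0.2):
    """Triplet-style objective (a sketch): minimize the distance from the
    adversarial embedding to the pseudo target while maximizing its distance
    from the source speaker, until the margin is satisfied."""
    d_pos = 1.0 - cosine(adv_emb, pseudo_target_emb)  # distance to target
    d_neg = 1.0 - cosine(adv_emb, source_emb)         # distance to source
    return max(d_pos - d_neg + margin, 0.0)

# Toy 2-D embeddings: once the adversarial example already resembles the
# pseudo target and not the source, the loss drops to zero.
target = np.array([1.0, 0.0])
source = np.array([0.0, 1.0])
loss = marginal_triplet_loss(np.array([1.0, 0.1]), target, source)  # → 0.0
```

In practice this loss would be minimized over the convolutive perturbation by backpropagating through the speaker encoder; a fresh pseudo target is sampled per utterance, which is what yields unlinkability across recordings.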

Evaluation results and audio examples

We conduct experiments against four mainstream and commercial ASIs on two voice datasets, involving both objective and subjective evaluations:

  • VoiceCloak achieves high de-identification success rates of 92% and 84% on mainstream and commercial ASIs, respectively, with a word accuracy drop of less than 10%.
  • Compared to existing methods, VoiceCloak exhibits the best speech quality and intelligibility, with a Mel cepstral distortion of 5.13 dB and a short-time objective intelligibility (STOI) of over 0.81.
  • In the subjective test, VoiceCloak yields excellent voiceprint consistency, speech integrity, and audio quality, with a mean opinion score over 4. We also found that nearly 50% of volunteers could not distinguish the de-identified voices from the original ones.
  • Moreover, the evaluation validates the adaptability of VoiceCloak to various human voice characteristics, such as different genders, ages, and accents.

As shown in the spectrograms above, the de-identified voice retains acoustic structures similar to the original voice, with added reverb trails. It is highly similar to the naturally reverberated one, indicating that our perturbations successfully approximate natural reverberation. Hence, the convolutive perturbations maintain natural audibility and provide a better experience for human listeners.

VoiceCloak turns convolutive adversarial examples into a defense tool against automatic speaker identification, balancing the privacy and utility of voice services. It can be applied in various practical scenarios to help users preserve voice privacy while maintaining perceptual quality, including online voice publishing and offline audio post-processing.

All technical details and evaluation results are presented in our IMWUT paper, and more audio samples can be found on our project website.
