Reducing Speaker Residual by Considering Pinhole Effect
in Voice Anonymization

Anonymous submission to Interspeech 2026

Abstract

De-identification is a critical objective of voice anonymization: the anonymized speech should contain no residual information about the original speaker. State-of-the-art disentanglement-based frameworks pursue this by separating original speaker attributes from non-speaker attributes, representing the two separately, and substituting the speaker representation with that of a pseudo-speaker. However, speaker attributes that leak into the non-speaker representations leave residual original speaker information in the anonymized speech. To address this leakage, this paper proposes a strategy for fine-tuning the content and prosody extractors with a pinhole-based loss function applied to the anonymized speech. Fine-tuning reduces the linkability of anonymized utterances, thereby reducing the residual original speaker information they carry. Experiments across multiple anonymization frameworks validate the effectiveness of the proposed pinhole-based fine-tuning strategy for voice privacy protection.

Index Terms: voice anonymization, residual speaker attributes, linkability, pinhole-based loss
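
To make the leakage problem described in the abstract concrete, the following toy sketch may help. It is an illustration under strong assumptions, not the method proposed here: the pinhole-based loss is not defined in this section and is not implemented below. All names (`extract_content`, `anonymize`, `residual_speaker_score`, the `leak` parameter) and the four-dimensional vectors are hypothetical stand-ins for learned extractors and embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def extract_content(content, speaker, leak):
    """Hypothetical content extractor.

    Assumption: leakage is modeled as a fraction `leak` of the speaker
    vector added to the content vector. Real extractors leak in far
    less structured ways.
    """
    return [c + leak * s for c, s in zip(content, speaker)]


def anonymize(content_repr, pseudo_speaker):
    """Substitute the speaker representation with a pseudo-speaker.

    The non-speaker representation is passed through unchanged, so any
    speaker information it contains survives anonymization.
    """
    return {"content": list(content_repr), "speaker": list(pseudo_speaker)}


def residual_speaker_score(anonymized, original_speaker):
    """Toy proxy for residual speaker information (not a linkability metric)."""
    return cosine(anonymized["content"], original_speaker)


# Hypothetical, hand-picked vectors chosen so the arithmetic is easy to follow.
original_speaker = [1.0, 0.0, 0.0, 0.0]
pseudo_speaker = [0.0, 0.0, 0.0, 1.0]
content = [0.0, 1.0, 1.0, 0.0]

clean = anonymize(extract_content(content, original_speaker, leak=0.0), pseudo_speaker)
leaky = anonymize(extract_content(content, original_speaker, leak=0.5), pseudo_speaker)

print(residual_speaker_score(clean, original_speaker))
print(residual_speaker_score(leaky, original_speaker))
```

In this toy setting the perfectly disentangled case scores 0, while the leaky case scores 1/3 even though the speaker representation has been fully replaced. This mirrors the abstract's point that substitution alone does not remove speaker information carried by the non-speaker representations.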

Audio samples are organized by system and method. w/o denotes the original method without pinhole-based fine-tuning, and w/ denotes the corresponding method with pinhole-based fine-tuning.