Reducing Speaker Residual by Considering Pinhole Effect in Voice Anonymization
Abstract
De-identification is a critical objective of voice anonymization: the anonymized speech should retain no residual information about the original speaker. State-of-the-art disentanglement-based frameworks pursue this by separating original speaker attributes from non-speaker attributes into distinct representations and replacing the speaker representation with that of a pseudo-speaker. However, any speaker attributes that leak into the non-speaker representations leave residual original speaker information in the anonymized speech. To address this leakage, this paper proposes a strategy for fine-tuning the content and prosody extractors with a pinhole-based loss function applied to the anonymized speech. The fine-tuning reduces the linkability of anonymized utterances and thereby the residual original speaker information they carry. Experiments across multiple anonymization frameworks validate the effectiveness of the proposed pinhole-based fine-tuning strategy for voice privacy protection.
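The leakage problem described in the abstract can be illustrated with a toy numerical sketch. It is not the paper's method and does not implement the pinhole-based loss. It assumes that representations are plain vectors, that a hypothetical extractor leaks a fraction `leak` of the speaker vector into the non-speaker representation, and that a hypothetical synthesizer simply adds the pseudo-speaker vector. All names (`anonymize`, `leak`, `cosine`) are illustrative assumptions.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def anonymize(content, speaker, pseudo_speaker, leak):
    """Toy disentangle-and-substitute pipeline (assumed, not from the paper).

    The non-speaker representation carries a fraction `leak` of the
    original speaker vector; the output adds the pseudo-speaker vector.
    """
    non_speaker = [c + leak * s for c, s in zip(content, speaker)]
    return [n + p for n, p in zip(non_speaker, pseudo_speaker)]


rng = random.Random(0)
dim = 256
content = [rng.gauss(0.0, 1.0) for _ in range(dim)]
speaker = [rng.gauss(0.0, 1.0) for _ in range(dim)]
pseudo_speaker = [rng.gauss(0.0, 1.0) for _ in range(dim)]

# Similarity of the anonymized output to the ORIGINAL speaker vector,
# used here as a crude stand-in for residual speaker information.
residual_leaky = cosine(anonymize(content, speaker, pseudo_speaker, 0.5), speaker)
residual_clean = cosine(anonymize(content, speaker, pseudo_speaker, 0.0), speaker)

print(f"residual with leakage:    {residual_leaky:.3f}")
print(f"residual without leakage: {residual_clean:.3f}")
```

In this toy setting, the output stays measurably closer to the original speaker whenever the non-speaker representation leaks speaker content, even though the speaker representation itself was fully replaced. This is the kind of residual that fine-tuning the content and prosody extractors is meant to reduce.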