Publication

International Conference
Self-supervised speaker verification with relational mask prediction

Interspeech 2024
2024-09-01

Self-supervised speaker verification with relational mask prediction[link]

Ju-ho Kim, Hee-Soo Heo, Bong-Jin Lee, Youngki Kwon, Minjae Lee, Ha-Jin Yu


Abstract 

Recently, self-supervised learning (SSL) has emerged as a promising strategy for constructing speaker verification (SV) systems, effectively mitigating the cost and privacy issues associated with the labeling process. The majority of SSL-based SV systems tend to focus on utterance-level features, potentially overlooking the inherent inter-frame structure of speech. To bridge this gap, we propose relational mask prediction (RMP), a novel loss function that encourages models to understand the relationships between frames. Additionally, we introduce a block aggregation Transformer (BATransformer) to enrich frame-level features. Models were trained without labels using the VoxCeleb2 development set and comprehensively evaluated on various test sets. Experimental results demonstrate that the proposed framework outperforms recent SSL-based SV systems, achieving an average performance improvement of 22.39% over the baseline across the entire evaluation dataset.
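To illustrate the general idea behind a frame-relation prediction objective, the following is a minimal NumPy sketch, not the paper's actual formulation: it assumes the target relations are pairwise cosine similarities between frame features, and that the loss is a mean-squared error between predicted and target relations restricted to the rows of masked frames. All function names and the choice of similarity and error measures here are illustrative assumptions.

```python
import numpy as np

def relation_matrix(frames):
    # Cosine similarity between every pair of frames: (T, D) -> (T, T).
    norms = np.linalg.norm(frames, axis=1, keepdims=True)
    unit = frames / np.clip(norms, 1e-8, None)
    return unit @ unit.T

def relational_mask_loss(pred_frames, target_frames, masked_idx):
    """Illustrative RMP-style loss (assumption, not the paper's exact
    definition): MSE between predicted and target frame-frame relations,
    evaluated only on the rows belonging to masked frames."""
    pred = relation_matrix(pred_frames)
    target = relation_matrix(target_frames)
    return float(np.mean((pred[masked_idx] - target[masked_idx]) ** 2))

# Toy usage: a "student" prediction is a noisy copy of the target frames.
rng = np.random.default_rng(0)
T, D = 6, 4
target = rng.normal(size=(T, D))
student = target + 0.1 * rng.normal(size=(T, D))
masked_idx = np.array([0, 3])  # indices of the masked frames
loss = relational_mask_loss(student, target, masked_idx)
```

The loss is zero when the predicted relations match the targets exactly and grows as the inter-frame structure diverges, which is the property the abstract attributes to RMP.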


All works on this site are copyrighted by IRLab; unauthorized reproduction or misappropriation is prohibited under the Copyright Act (Article 96).
