Self-supervised speaker verification with relational mask prediction

Ju-ho Kim, Hee-Soo Heo, Bong-Jin Lee, Youngki Kwon, Minjae Lee, Ha-Jin Yu
Abstract
Recently, self-supervised learning (SSL) has emerged as a promising strategy for constructing speaker verification (SV) systems, effectively mitigating the cost and privacy issues associated with the labeling process. The majority of SSL-based SV systems tend to focus on utterance-level features, potentially overlooking the inherent inter-frame structure of speech. To bridge this gap, we propose relational mask prediction (RMP), a novel loss function that encourages models to understand the relationships between frames. Additionally, we introduce a block aggregation Transformer (BA-Transformer) to enrich frame-level features. Models were trained without labels using the VoxCeleb2 development set and comprehensively evaluated using various test sets. Experimental results demonstrate that the proposed framework outperforms recent SSL-based SV systems, achieving an average performance improvement of 22.39% over the baseline across the entire evaluation dataset.