Title | Multi-View Self-Attention Based Transformer for Speaker Recognition |
Author | |
DOI | |
Publication Years | 2022 |
Conference Name | 47th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
ISSN | 1520-6149 |
ISBN | 978-1-6654-0541-6 |
Source Title | |
Volume | 2022-May |
Pages | 6732-6736 |
Conference Date | 23-27 May 2022 |
Conference Place | Singapore, Singapore |
Publication Place | New York, NY, USA |
Publisher | IEEE |
Abstract | Initially developed for natural language processing (NLP), the Transformer model is now widely used for speech processing tasks such as speaker recognition, owing to its powerful sequence modeling capabilities. However, conventional self-attention mechanisms were originally designed for modeling textual sequences, without considering the characteristics of speech and speaker modeling. Moreover, different Transformer variants for speaker recognition have not been well studied. In this work, we propose a novel multi-view self-attention mechanism and present an empirical study of different Transformer variants, with or without the proposed attention mechanism, for speaker recognition. Specifically, to balance the capability to capture global dependencies against the ability to model locality, we propose a multi-view self-attention mechanism for the speaker Transformer, in which different attention heads can attend to different ranges of the receptive field. Furthermore, we introduce and compare five Transformer variants with different network architectures, embedding locations, and pooling methods to learn speaker embeddings. Experimental results on the VoxCeleb1 and VoxCeleb2 datasets show that the proposed multi-view self-attention mechanism improves speaker recognition performance, and that the proposed speaker Transformer network attains excellent results compared with state-of-the-art models. |
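The abstract describes multi-view self-attention only at a high level: different attention heads attend to different ranges of the receptive field, trading off global dependencies against locality. A minimal numpy sketch of that idea is shown below; the function name, the per-head window sizes, and the band-masking scheme are illustrative assumptions for exposition, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_view_attention(q, k, v, windows):
    """Scaled dot-product attention where each head sees a different view.

    q, k, v : arrays of shape (heads, seq_len, dim)
    windows : per-head receptive-field radius in frames;
              None means an unrestricted (global) view.
    """
    heads, n, dim = q.shape
    out = np.empty_like(v)
    idx = np.arange(n)
    for h, w in enumerate(windows):
        scores = q[h] @ k[h].T / np.sqrt(dim)
        if w is not None:
            # Mask out positions farther than w frames, so this head
            # models only local context (a restricted "view").
            far = np.abs(idx[:, None] - idx[None, :]) > w
            scores = np.where(far, -1e9, scores)
        out[h] = softmax(scores) @ v[h]
    return out
```

For example, `windows=[None, 1, 0]` gives one fully global head, one head restricted to adjacent frames, and one degenerate head that attends only to itself (so its output equals its value input); in practice the window sizes would be hyperparameters chosen per head.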
Keywords | |
SUSTech Authorship | Others |
Language | English |
URL | [Source Record] |
Indexed By | |
Funding Project | National Natural Science Foundation of China [61976160, 62076182, 61906137]; Technology Research Plan Project of the Ministry of Public Security [2020JSYJD01]; Shanghai Science and Technology Plan Project [21DZ1204800] |
WOS Research Area | Acoustics; Computer Science; Engineering |
WOS Subject | Acoustics; Computer Science, Artificial Intelligence; Engineering, Electrical & Electronic |
WOS Accession No | WOS:000864187907007 |
EI Accession Number | 20222312199281 |
Data Source | IEEE |
PDF url | https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9746639 |
Citation statistics | Cited Times [WOS]: 1 |
Document Type | Conference paper |
Identifier | http://kc.sustech.edu.cn/handle/2SGJ60CL/347982 |
Department | Department of Computer Science and Engineering |
Affiliation | 1.Tongji University,Department of Computer Science and Technology 2.Southern University of Science and Technology,Department of Computer Science and Engineering 3.Microsoft Research Asia 4.The Hong Kong Polytechnic University,Department of Computing |
Recommended Citation GB/T 7714 | Rui Wang, Junyi Ao, Long Zhou, et al. Multi-View Self-Attention Based Transformer for Speaker Recognition[C]. New York, NY, USA: IEEE, 2022: 6732-6736. |