Title | SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing |
Author | Ao, Junyi; Wang, Rui; Zhou, Long; et al. |
Corresponding Author | Zhou, Long |
Publication Years | 2022 |
Conference Name | 60th Annual Meeting of the Association for Computational Linguistics (ACL) |
Source Title | Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL) |
Conference Date | May 22-27, 2022 |
Conference Place | Dublin, Ireland |
Publication Place | Stroudsburg, PA, USA |
Publisher | Association for Computational Linguistics (ACL) |
Abstract | Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification. |
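The abstract outlines the architecture concretely enough to sketch: modal-specific pre-nets feed a shared encoder-decoder, encoder states are mixed with discrete latent units at the encoder-decoder interface, and modal-specific post-nets emit speech or text. The PyTorch sketch below illustrates only that data flow; every module name, dimension, and the mixing probability are assumptions for illustration, not the authors' released implementation.

# Minimal structural sketch of the SpeechT5 layout described in the abstract.
# All names, sizes, and the mixing rule are illustrative assumptions.
import torch
import torch.nn as nn

class SpeechT5Sketch(nn.Module):
    def __init__(self, d_model=768, n_mels=80, vocab_size=10000, n_units=100):
        super().__init__()
        # Modal-specific pre-nets map raw inputs into the shared hidden space:
        # log-Mel frames for speech, token ids for text.
        self.speech_pre_net = nn.Linear(n_mels, d_model)
        self.text_pre_net = nn.Embedding(vocab_size, d_model)
        # One shared encoder-decoder backbone serves every task.
        self.backbone = nn.Transformer(d_model=d_model, batch_first=True)
        # Modal-specific post-nets map decoder states back to each modality.
        self.speech_post_net = nn.Linear(d_model, n_mels)
        self.text_post_net = nn.Linear(d_model, vocab_size)
        # Shared codebook of latent units between encoder and decoder.
        self.codebook = nn.Parameter(torch.randn(n_units, d_model))

    def mix_with_units(self, states, p=0.5):
        # Cross-modal vector quantization, sketched as: snap each encoder
        # state to its nearest latent unit, then randomly mix quantized and
        # continuous states so speech and text share one discrete space.
        dists = ((states.unsqueeze(2) - self.codebook) ** 2).sum(-1)  # (B, T, K)
        quantized = self.codebook[dists.argmin(-1)]                   # (B, T, D)
        mask = (torch.rand(states.shape[:2], device=states.device) < p).unsqueeze(-1)
        return torch.where(mask, quantized, states)

    def forward(self, src, tgt, src_mod="speech", tgt_mod="text"):
        h = self.speech_pre_net(src) if src_mod == "speech" else self.text_pre_net(src)
        g = self.speech_pre_net(tgt) if tgt_mod == "speech" else self.text_pre_net(tgt)
        enc = self.backbone.encoder(h)
        enc = self.mix_with_units(enc)
        dec = self.backbone.decoder(g, enc)
        return self.speech_post_net(dec) if tgt_mod == "speech" else self.text_post_net(dec)

# ASR-style usage: speech frames in, text logits out.
model = SpeechT5Sketch()
speech = torch.randn(2, 100, 80)                  # (batch, frames, mel bins)
tokens = torch.randint(0, 10000, (2, 20))         # (batch, target tokens)
logits = model(speech, tokens, "speech", "text")  # -> (2, 20, 10000)

The same model instance would serve text-to-speech by swapping the modality arguments, which is the point of sharing one backbone across six pre/post-nets.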
SUSTech Authorship | First |
Language | English |
URL | [Source Record] |
Indexed By | |
WOS Research Area | Computer Science; Linguistics |
WOS Subject | Computer Science, Artificial Intelligence; Computer Science, Interdisciplinary Applications; Linguistics |
WOS Accession No | WOS:000828702305058 |
Data Source | Web of Science |
Citation Statistics | Cited Times [WOS]: 3 |
Document Type | Conference paper |
Identifier | http://kc.sustech.edu.cn/handle/2SGJ60CL/401486 |
Department | Department of Computer Science and Engineering |
Affiliation | 1. Southern University of Science and Technology, Department of Computer Science and Engineering, Shenzhen, China; 2. The Hong Kong Polytechnic University, Department of Computing, Hong Kong, China; 3. Tongji University, Department of Computer Science and Technology, Shanghai, China; 4. Microsoft, Redmond, WA 98052, USA; 5. Peng Cheng Laboratory, Shenzhen, China |
First Author Affiliation | Department of Computer Science and Engineering |
First Author's First Affiliation | Department of Computer Science and Engineering |
Recommended Citation GB/T 7714 | Ao, Junyi, Wang, Rui, Zhou, Long, et al. SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing[C]. Stroudsburg, PA: Association for Computational Linguistics, 2022. |
Files in This Item: | There are no files associated with this item. |
Items in the repository are protected by copyright, with all rights reserved, unless otherwise indicated.