SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
Information
Type:
misc
Authors:
Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, Xipeng Qiu
Relevance:
Medium
Reference:
zhang2023speechgpt
DOI:
Keywords:
URL:
https://arxiv.org/abs/2305.11000
Publication date:
05/2023
Summary:
A GPT-based speech dialogue system (spoken input and output).
Abstract:
Multi-modal large language models are regarded as a crucial step towards Artificial General Intelligence (AGI) and have garnered significant interest with the emergence of ChatGPT. However, current speech-language models typically adopt the cascade paradigm, preventing inter-modal knowledge transfer. In this paper, we propose SpeechGPT, a large language model with intrinsic cross-modal conversational abilities, capable of perceiving and generating multi-modal content. With discrete speech representations, we first construct SpeechInstruct, a large-scale cross-modal speech instruction dataset. Additionally, we employ a three-stage training strategy that includes modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning. The experimental results demonstrate that SpeechGPT has an impressive capacity to follow multi-modal human instructions and highlight the potential of handling multiple modalities with one model. Demos are shown in https://0nutation.github.io/SpeechGPT.github.io/.
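Note: the abstract's central idea is that speech, once discretized into unit tokens, can be handled as just another token stream by a single language model. The sketch below illustrates that idea, assuming a fixed inventory of speech units (e.g., obtained via HuBERT features plus k-means clustering, a common practice) and a Hugging Face-style tokenizer/model interface; the unit count, token names, and helper functions are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch (not the authors' implementation) of discrete speech
# representations for a text LLM: speech is quantized into unit tokens,
# and those tokens are appended to the text vocabulary so one decoder-only
# model can perceive and generate both modalities.

NUM_SPEECH_UNITS = 1000  # assumed size of the discrete unit inventory


def speech_unit_tokens(num_units: int = NUM_SPEECH_UNITS) -> list[str]:
    """Build pseudo-tokens: boundary markers plus one token per unit id."""
    return ["<sosp>", "<eosp>"] + [f"<unit_{i}>" for i in range(num_units)]


def extend_vocabulary(tokenizer, model):
    """Add speech-unit tokens to an existing text LLM.

    Assumes a Hugging Face-style interface (tokenizer.add_tokens,
    model.resize_token_embeddings); helper is hypothetical.
    """
    tokenizer.add_tokens(speech_unit_tokens())
    model.resize_token_embeddings(len(tokenizer))
    return tokenizer, model


def units_to_text(units: list[int]) -> str:
    """Render a discretized utterance as a token string the LLM can consume."""
    return "<sosp>" + "".join(f"<unit_{u}>" for u in units) + "<eosp>"


if __name__ == "__main__":
    # Example: a short, made-up unit sequence wrapped for cross-modal prompting.
    print(units_to_text([512, 17, 983, 40]))
```

In this framing, the three-stage strategy described in the abstract amounts to (1) continuing pre-training on unit sequences so the model adapts to the new tokens, then (2) and (3) instruction-tuning on paired text/speech data, including chain-of-modality prompts that reason in text before emitting speech units.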