SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities
Information
Type:
misc
Authors:
Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, Xipeng Qiu
Relevance:
Medium
Reference:
zhang2023speechgpt
DOI:
Keywords:
URL:
https://arxiv.org/abs/2305.11000
Publication date:
05/2023
Summary:
A GPT-based speech dialogue system (spoken input and output).
Abstract:
Multi-modal large language models are regarded as a crucial step towards Artificial General Intelligence (AGI) and have garnered significant interest with the emergence of ChatGPT. However, current speech-language models typically adopt the cascade paradigm, preventing inter-modal knowledge transfer. In this paper, we propose SpeechGPT, a large language model with intrinsic cross-modal conversational abilities, capable of perceiving and generating multi-modal content. With discrete speech representations, we first construct SpeechInstruct, a large-scale cross-modal speech instruction dataset. Additionally, we employ a three-stage training strategy that includes modality-adaptation pre-training, cross-modal instruction fine-tuning, and chain-of-modality instruction fine-tuning. The experimental results demonstrate that SpeechGPT has an impressive capacity to follow multi-modal human instructions and highlight the potential of handling multiple modalities with one model. Demos are shown in https://0nutation.github.io/SpeechGPT.github.io/.
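Note: the abstract's central idea is that speech, once discretized into unit tokens, can be handled as just another token stream by a single language model. The sketch below illustrates that idea, assuming a fixed inventory of speech units (e.g., obtained via HuBERT features plus k-means clustering, a common practice) and a Hugging Face-style tokenizer/model interface; the unit count, token names, and helper functions are illustrative assumptions, not the paper's actual code.

```python
# Minimal sketch (not the authors' implementation) of discrete speech
# representations for a text LLM: speech is quantized into unit tokens,
# and those tokens are appended to the text vocabulary so one decoder-only
# model can perceive and generate both modalities.

NUM_SPEECH_UNITS = 1000  # assumed size of the discrete unit inventory


def speech_unit_tokens(num_units: int = NUM_SPEECH_UNITS) -> list[str]:
    """Build pseudo-tokens: boundary markers plus one token per unit id."""
    return ["<sosp>", "<eosp>"] + [f"<unit_{i}>" for i in range(num_units)]


def extend_vocabulary(tokenizer, model):
    """Add speech-unit tokens to an existing text LLM.

    Assumes a Hugging Face-style interface (tokenizer.add_tokens,
    model.resize_token_embeddings); helper is hypothetical.
    """
    tokenizer.add_tokens(speech_unit_tokens())
    model.resize_token_embeddings(len(tokenizer))
    return tokenizer, model


def units_to_text(units: list[int]) -> str:
    """Render a discretized utterance as a token string the LLM can consume."""
    return "<sosp>" + "".join(f"<unit_{u}>" for u in units) + "<eosp>"


if __name__ == "__main__":
    # Example: a short, made-up unit sequence wrapped for cross-modal prompting.
    print(units_to_text([512, 17, 983, 40]))
```

In this framing, the three-stage strategy described in the abstract amounts to (1) continuing pre-training on unit sequences so the model adapts to the new tokens, then (2) and (3) instruction-tuning on paired text/speech data, including chain-of-modality prompts that reason in text before emitting speech units.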