OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia

Xuelong Geng, Kun Wei, Qijie Shao, Shuiyun Liu*, Zhennan Lin*, Zhixian Zhao*, Guojian Li*, Wenjie Tian*, Peikun Chen, Yangze Li, Pengcheng Guo, Mingchen Shao, Shuiyuan Wang, Yuang Cao, Chengyou Wang, Tianyi Xu, Yuhang Dai, Xinfa Zhu, Yue Li, Li Zhang, Lei Xie†

Test Page
📑 Paper | 📑 Demo | 💬 WeChat (微信)

Large Language Models (LLMs) have made significant progress in various downstream tasks, inspiring the development of Speech Understanding Language Models (SULMs) to enable comprehensive speech-based interactions. However, most advanced SULMs are developed by the industry, leveraging large-scale datasets and computational resources that are not readily available to the academic community. Moreover, the lack of transparency in training details creates additional barriers to further innovation. In this study, we present OSUM, an Open Speech Understanding Model designed to explore the potential of training SLUMs under constrained academic resources. The OSUM model combines a Whisper encoder with a Qwen2 LLM and supports a wide range of speech tasks, including speech recognition (ASR), speech recognition with timestamps (SRWT), vocal event detection (VED), speech emotion recognition (SER), speaking style recognition (SSR), speaker gender classification (SGC), speaker age prediction (SAP), and speech-to-text chat (STTC). By employing an ASR+X training strategy, OSUM achieves efficient and stable multi-task training by simultaneously optimizing ASR alongside target tasks. Beyond delivering strong performance, OSUM emphasizes transparency by providing openly available data preparation and training methodologies, offering valuable insights and practical guidance for the academic community. By doing so, we aim to accelerate research and innovation in advanced SULM technologies.

Architecture

The overview of the architecture and tasks of OSUM.

Name		Name	Last commit message	Last commit date
Latest commit History 101 Commits
.github		.github
raw		raw
README.md		README.md
index.html		index.html
style.css		style.css

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia

Architecture

About

Releases

Sponsor this project

Packages

Languages

ASLP-lab/OSUM.github.io

Folders and files

Latest commit

History

Repository files navigation

OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia

Architecture

About

Resources

Stars

Watchers

Forks

Releases

Sponsor this project

Packages 0

Languages

Packages