Nexus-O:

An Omni-Perceptive And -Interactive Model for Language, Audio, And Vision

Che Liu1,2* , Yingji Zhang3*, Dong Zhang1*, Weijie Zhang1*, Chenggong Gong1*, Haohan Li1*, Yu Lu1*,
Shilin Zhou1,6, Yue Lu1, Ziliang Gan1, Ziao Wang7, Junwei Liao8, Haipang Wu1, Ji Liu1, André Freitas3,10, Zenglin Xu5, Rongjunchen Zhang1,4,♠, Yong Dai1,5,♠,†
1Hithink Research, 2Imperial College London, 3University of Manchester, 4Zhejiang University, 5Fudan University, 6Soochow University, 7Baptist University, 8Microsoft, 9Meta AI, 10Idiap Research Institute

Overview


Nexus-O supports any combination of audio, image/video, and text inputs. Unlike existing approaches that treat audio as an auxiliary modality, our model enables joint understanding across these modalities. Nexus-O handles a wide range of tasks and demonstrates strong performance across various benchmarks. For Nexus-O-audio (EN) and (CN), scores are reported as reciprocals (1/score).
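The reciprocal transformation mentioned above can be sketched in a few lines. This is an illustrative helper (the function name is ours, not from the paper) for converting lower-is-better metrics so they plot on the same higher-is-better scale as the other benchmarks:

```python
def reciprocal_scores(scores):
    """Convert lower-is-better metrics (e.g. error rates) to 1/score,
    so higher values indicate better performance on a shared chart.
    Illustrative helper; not part of the Nexus-O codebase."""
    return [1.0 / s for s in scores]

# Example: three lower-is-better scores become higher-is-better values.
converted = reciprocal_scores([0.5, 0.25, 2.0])
print(converted)  # -> [2.0, 4.0, 0.5]
```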


• Architecture of Nexus-O

Architecture of Nexus-O, which accepts any combination of input modalities and generates output in either the language or audio modality. The Auto-Regressive (AR) audio decoder takes a special start-token embedding and the last language embedding as input and generates hierarchical discrete audio codes auto-regressively. These codes are then fed into a pretrained audio generator to produce the final audio output.
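The AR generation loop described above can be sketched as follows. This is a minimal stand-in, not the actual Nexus-O implementation: the codebook size, step count, embedding dimension, and `decode_step` logic are all assumptions, with a random draw standing in for a real transformer decoder step.

```python
import random

CODEBOOK_SIZE = 1024  # assumed size of the discrete audio codebook
NUM_STEPS = 8         # assumed number of codes to generate

def decode_step(context, rng):
    """Stand-in for one AR decoder step: a real model would run a
    transformer over the context; here we draw a dummy code and
    produce a dummy next-step context of the same dimension."""
    code = rng.randrange(CODEBOOK_SIZE)
    new_context = [rng.random() for _ in range(len(context))]
    return code, new_context

def generate_audio_codes(start_token, last_language_embedding, seed=0):
    """Auto-regressively emit discrete audio codes, conditioned on the
    special start-token embedding and the final language embedding."""
    rng = random.Random(seed)
    # Simplistic fusion of the two conditioning embeddings (assumption).
    context = [s + l for s, l in zip(start_token, last_language_embedding)]
    codes = []
    for _ in range(NUM_STEPS):
        code, context = decode_step(context, rng)
        codes.append(code)
    return codes  # these would be fed to the pretrained audio generator

codes = generate_audio_codes([0.0] * 16, [1.0] * 16)
print(len(codes))  # -> 8
```

In the real model the emitted codes index a learned codebook and are rendered to a waveform by the pretrained audio generator; the loop structure (condition, emit one code, feed it back) is the part this sketch preserves.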

• Audio Dataset Synthesis Pipeline

Audio dataset synthesis pipeline. The current version of our testbed supports only the ASR task; we will further incorporate additional audio tasks, including AQA and AVQA, both of which are under development.
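The ASR portion of the pipeline can be sketched as a text-to-speech pairing step. Everything here is hypothetical: `synthesize_speech` is a placeholder for whatever TTS system the pipeline actually calls, and the output record layout is our assumption, not the paper's format.

```python
def synthesize_speech(text):
    """Placeholder TTS: a real pipeline would call a TTS model and
    return a waveform; here we return a dummy sample list whose length
    just mirrors the text length."""
    return [0.0] * len(text)

def build_asr_examples(sentences):
    """Pair each synthesized waveform with its ground-truth transcript,
    producing (audio, text) records for ASR training or evaluation."""
    return [{"audio": synthesize_speech(s), "text": s} for s in sentences]

examples = build_asr_examples(["hello world", "nexus omni model"])
print(len(examples))  # -> 2
```

Because the transcript is the synthesis input, each example carries an exact ground-truth label for free, which is what makes TTS-based synthesis attractive for ASR benchmarking.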

• Overview of Training Stages

Overview of the training stages in Nexus-O. The first stage maps speech features into the semantic space, the second enables audio instruction-following capability, and the last enables speech generation capability.

Experimental Results


Evaluation on Vision Understanding Benchmarks.

Evaluation on Audio English QA Benchmarks.

Evaluation on ASR Benchmarks.

Evaluation on Speech-to-Text Translation Benchmarks.

Citation


@article{liu2025nexus,
  title={Nexus-O: An Omni-Perceptive And-Interactive Model for Language, Audio, And Vision},
  author={Liu, Che and Zhang, Yingji and Zhang, Dong and Zhang, Weijie and Gong, Chenggong and Li, Haohan and Lu, Yu and Zhou, Shilin and Lu, Yue and Gan, Ziliang and others},
  journal={arXiv preprint arXiv:2503.01879},
  year={2025}
}