I am a researcher on the ERNIE Team at Baidu, supervised by Prof. Rui Liu, focusing on audio-visual understanding and speech generation. I have published at top-tier conferences such as AAAI and ICASSP, and contributed to ERNIE 5.0. My publications are listed on Google Scholar.
I am always open to research collaborations and new opportunities. If you are interested in working together or have any exciting prospects, feel free to reach out at shuwei_he@163.com.
🔥 News
- 2026.02: 🚀 Open-sourced Eureka-Audio, a lightweight audio understanding model. With only 1.7B parameters, it outperforms several significantly larger models. The preprint is now available on arXiv.
- 2026.01: ⭐ Participated in the core development of Baidu's ERNIE 5.0 and was listed as a contributor in the official technical report.
- 2025.08: 🎊 Received an official offer from the Baidu ERNIE Team.
- 2025.02: 💼 Joined the Baidu ERNIE Foundation Model Team as a Large Model Algorithm Intern.
- 2024.12: 🎉 Our paper MS$^2$KU-VTTS was accepted by ICASSP 2025.
- 2024.12: 🎉 Our paper M$^2$SE-VTTS was accepted by AAAI 2025.
📝 Publications
Representative Work

Eureka-Audio: Triggering Audio Intelligence in Compact Language Models
Dan Zhang*, Yishu Lei*, Jing Hu*, Shuwei He*, Songhe Deng, Xianlong Luo, Danxiang Zhu, Shikun Feng, Rui Liu, Jingzhou He, Yu Sun, Hua Wu, Haifeng Wang (* Equal Contribution)
- This paper introduces Eureka-Audio, a compact 1.7-billion-parameter audio language model that pairs a sparsely activated Mixture-of-Experts adapter with DataFlux, a novel data synthesis pipeline, to achieve competitive audio understanding and significantly faster inference than much larger baseline models.

MoE Adapter for Large Audio Language Models: Sparsity, Disentanglement, and Gradient-Conflict-Free
Yishu Lei*, Shuwei He*, Jing Hu, Dan Zhang, Xianlong Luo, Danxiang Zhu, Shikun Feng, Rui Liu, Jingzhou He, Yu Sun, Hua Wu, Haifeng Wang (* Equal Contribution)
- This paper introduces the MoE-Adapter, a sparse Mixture-of-Experts architecture that mitigates gradient conflicts in Large Audio Language Models by dynamically routing heterogeneous acoustic inputs to specialized experts, disentangling speech, music, and sound for stronger cross-modal alignment.

Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech
Rui Liu†, Shuwei He, Yifan Hu, Haizhou Li († Corresponding Author; Shuwei He is the first student author)
- This paper introduces M$^2$SE-VTTS, a multi-modal and multi-scale framework that integrates RGB and depth images with environment captions to model both local and global spatial context for synthesizing immersive, reverberant speech.
More Publications
- arXiv 2026 ERNIE 5.0 Technical Report, ERNIE Team, Baidu (Shuwei He as Core Contributor)
- arXiv 2025 CORD: Bridging the Audio-Text Reasoning Gap via Weighted On-policy Cross-modal Distillation, Jing Hu, Danxiang Zhu, Xianlong Luo, Dan Zhang, Shuwei He, Yishu Lei, Haitao Zheng, Shikun Feng, Jingzhou He, Yu Sun, Hua Wu, Haifeng Wang
- ICASSP 2025 Multi-Source Spatial Knowledge Understanding for Immersive Visual Text-to-Speech, Shuwei He (First Author), Rui Liu
🎖 Honors and Awards
- 2025.12 National Scholarship
- 2021.12 National Scholarship
📖 Education
- 2023.08 - 2026.06, M.S. in Artificial Intelligence, Inner Mongolia University
💻 Internships
- 2025.02 - 2026.01, Large Model Algorithm Intern, Baidu ERNIE Foundation Model Team, China.