Mingsheng Li

AI Researcher @ Qwen Team, Alibaba Group

Fudan University

I am currently an AI researcher at the Qwen Team, Alibaba Group. Before that, I graduated from Fudan University, where I was supervised by Prof. Tao Chen. I have also had the wonderful experience of working with Dr. Hongyuan Zhu from A*STAR; Dr. Gang Yu, Dr. Chi Zhang, and Dr. Xin Chen from Tencent; and Dr. Bo Zhang from Shanghai AI Lab.

My current work and research focus on Large Vision-Language Models, Multi-Agent Systems, Generative AI, and Embodied AI.

Collaborations

  • We are looking for candidates who are passionate about (1) native multimodal large models and (2) agents with deep reasoning and planning. We also welcome collaborations with academia and industry; please email limingsheng.lms@alibaba-inc.com.

Selected Publications

Tech Reports

Technical Report | Qwen-VLA

Qwen-VLA: Unifying Vision-Language-Action Modeling across Tasks, Environments, and Robot Embodiments

Co-First Author

  • An all-in-one unified vision-language-action generalist model that handles complex manipulation tasks, long-horizon navigation, and trajectory prediction within a single framework, generalizing across diverse tasks, environments, and robot embodiments.

Technical Report | Qwen3-VL

Qwen3-VL Technical Report

Core Author

  • A next-generation vision-language model featuring advanced visual reasoning, native dynamic-resolution processing, long-context video understanding, and multilingual support with both dense and MoE variants.

Technical Report | Qwen3.5-Omni

Qwen3.5-Omni Technical Report

Author

  • An upgraded end-to-end omni model with enhanced multimodal reasoning, real-time streaming interaction, and stronger cross-modal understanding and generation across text, image, audio, and video modalities.

Technical Report | Qwen3-Omni

Qwen3-Omni Technical Report

Author

  • A natively end-to-end multilingual omni model supporting real-time streaming interaction across text, image, audio, and video with unified perception, understanding, and generation.

Technical Report | StarVLA

StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

Author

  • A modular, Lego-like open-source codebase for agile VLA development, supporting seamless integration of diverse VLMs, action heads, and world models with unified training, evaluation, and deployment across robotics benchmarks.

Preprints

arXiv 2026

Unify Robot Actions in Camera Frame

Sicheng Xie, Lingchen Meng, Zijie Diao, Haidong Cao, Zhiying Du, Shuyuan Tu, Jiaqi Leng, Qiuyue Wang, Mingsheng Li, Shuai Bai, Zuxuan Wu, Yu-Gang Jiang

  • A training-free, robot-independent pipeline that estimates camera extrinsics for offline datasets and unifies robot actions in camera frame, enabling scalable cross-embodiment learning without any manual calibration.

arXiv 2026 | FineVLA

FineVLA: Fine-Grained Instruction Alignment for Steerable Vision-Language-Action Policies

Xintong Hu, Xuhong Huang, Jinyu Zhang, Yutong Yao, Yuchong Sun, Qiuyue Wang, Mingsheng Li, Sicheng Xie, Yitao Liu, Junhao Chen, Yixuan Chen, Yingming Zheng, Shuai Bai, Tao Yu

  • Fine-grained VLA supervision that improves steerable robot control.

Published

T-PAMI 2026 | StructChart

StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding

Renqiu Xia, Haoyang Peng, Hancheng Ye, Mingsheng Li, Xiangchao Yan, Peng Ye, Botian Shi, Yu Qiao, Junchi Yan, Bo Zhang

IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 2026.

  • A unified approach for visual chart perception and reasoning using Structured Triplet Representations.

ICLR 2026 | VLM4VLA

VLM4VLA: Revisiting Vision-Language-Models in Vision-Language-Action Models

Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, Jianyu Chen

International Conference on Learning Representations (ICLR), 2026.

  • A unified framework for studying how VLMs affect VLA performance.

ICCV 2025 | Chimera

Chimera: Improving Generalist Model with Domain-Specific Experts

Tianshuo Peng, Mingsheng Li (co-first), Jiakang Yuan, Hongbin Zhou, Renqiu Xia, Renrui Zhang, Lei Bai, Song Mao, Bin Wang, Aojun Zhou, Botian Shi, Tao Chen, Bo Zhang, Xiangyu Yue

IEEE/CVF International Conference on Computer Vision (ICCV), 2025.

  • Cost-effective general-specialist collaboration to improve LMMs.

ICLR 2025 | GeoX

GeoX: Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training

Renqiu Xia, Mingsheng Li (co-first), Hancheng Ye, Wenjie Wu, Hongbin Zhou, Jiakang Yuan, Tianshuo Peng, Xinyu Cai, Xiangchao Yan, Bin Wang, Conghui He, Botian Shi, Tao Chen, Junchi Yan, Bo Zhang

International Conference on Learning Representations (ICLR), 2025.

  • Formalized pre-training for geometry problem solving.

T-MM 2025 | WI3D

WI3D: Weakly Incremental 3D Detection via Vision Foundation Models

Mingsheng Li, Sijin Chen, Shengji Tang, Hongyuan Zhu, Yanyan Fang, Xin Chen, Zhuoyuan Li, Fukun Yin, Tao Chen

IEEE Transactions on Multimedia (T-MM), 2025.

  • Introducing new categories to 3D detectors via 2D foundation models.

NeurIPS 2024 | 3DET-Mamba

3DET-Mamba: State Space Model for End-to-End 3D Object Detection

Mingsheng Li, Jiakang Yuan, Sijin Chen, Lin Zhang, Anyu Zhu, Xin Chen, Tao Chen

Conference on Neural Information Processing Systems (NeurIPS), 2024.

  • End-to-end 3D detection with Mamba-based representation learning.

ECCV 2024 | M3DBench

M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts

Mingsheng Li, Xin Chen, Chi Zhang, Sijin Chen, Hongyuan Zhu, Fukun Yin, Gang Yu, Tao Chen

European Conference on Computer Vision (ECCV), 2024.

  • Large-scale 3D-language dataset with interleaved multimodal prompts.

T-PAMI 2024 | Vote2Cap-DETR++

Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning

Sijin Chen, Hongyuan Zhu, Mingsheng Li, Xin Chen, Peng Guo, Yinjie Lei, Gang Yu, Taihao Li, Tao Chen

IEEE Transactions on Pattern Analysis and Machine Intelligence (T-PAMI), 2024.

  • Decoupled queries for 3D localization and dense captioning.

CVPR 2024 | LL3DA

LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning

Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, Tao Chen

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

  • 3D-LLMs for visual and textual interactions in complex 3D scenes.

T-MM 2024 | LGD

Lightweight Model Pre-training via Language Guided Knowledge Distillation

Mingsheng Li, Lin Zhang, Mingzhen Zhu, Zilong Huang, Gang Yu, Jiayuan Fan, Tao Chen

IEEE Transactions on Multimedia (T-MM), 2024.

  • Language-guided knowledge distillation for lightweight model pre-training.

Experiences

Senior AI Researcher
Apr. 2025 - Present

Qwen Team, Alibaba Group

Research Intern
Jun. 2024 - Apr. 2025

Tencent Tech

Research Intern
Oct. 2023 - Jun. 2024

Shanghai AI Lab

Awards

  • 2025   Tencent Rhino-Bird Elite Talent Program
  • 2025   Outstanding Graduate of Shanghai
  • 2024   National Scholarship
  • 2023   First-Class Academic Scholarship
  • 2022   Outstanding Graduate of Fudan University
  • 2020   National 2nd Prize, China Undergraduate Mathematical Contest in Modeling
  • 2019   National 1st Prize, Chinese Mathematics Competitions (Top 20)

Services

Conference Reviewer: CVPR, ICCV, ECCV, NeurIPS, ICLR, ACM MM

Journal Reviewer: T-PAMI, T-MM