Xiaokang Chen (陈小康)

I am currently a researcher at DeepSeek AI, focusing on multimodal large language models. I am driven by the mission to push the frontiers of machine intelligence and weave it into the fabric of everyday life, ultimately augmenting human potential.

I obtained my Ph.D. degree from Peking University (PKU) in 2024, supervised by Professor Gang Zeng. Before that, I received my Bachelor's degree from Peking University in July 2019.


Education
  • Peking University
         Ph.D. Student
    Sep. 2019 - Jul. 2024
  • Peking University
         B.S. in Computer Science
    Sep. 2015 - Jul. 2019

Academic Service
  • Journal reviewer: IJCV, TPAMI, TIP, TCSVT, Neurocomputing, CVIU.
  • Conference reviewer: CVPR, ECCV, ICCV, NeurIPS, ICML, AAAI.
Honors & Awards
  • WAIC Yunfan Award (云帆奖)
    2025
  • Outstanding Graduate, Peking University
    2024
  • National Scholarship (Ministry of Education, PRC)
    2021, 2022, 2023
  • Merit Student, PKU
    2020, 2021, 2022, 2023
  • Top 10 Outstanding Researcher (学术十杰), PKU
    2021
  • Huawei Scholarship
    2021
  • Award for Academic Innovation, PKU
    2021
  • Schlumberger Scholarship
    2020
  • Award for Excellent Research, PKU
    2018, 2019
Experience
  • DeepSeek
         AGI Researcher
    Apr. 2024 - Present
  • Shanghai Artificial Intelligence Laboratory
         Research Intern, advised by Dr. Wenhai Wang and Dr. Jifeng Dai.
    Dec. 2022 - Nov. 2023
  • Baidu Research
         Research Intern, advised by Dr. Jingdong Wang.
    Dec. 2021 - Dec. 2022
  • Microsoft Research Asia (MSRA)
         Research Intern, advised by Dr. Jingdong Wang.
    Jun. 2020 - Dec. 2021
  • SenseTime Research
         Research Intern, advised by Dr. Kwan-Yee Lin and Dr. Wayne (Wenyan) Wu.
    Apr. 2019 - May 2020
News
2025
- Jul 27: Received the 2025 WAIC Yunfan Award.
- Jan 28: Released Janus-Pro for unified multimodal understanding and generation.
2024
- Dec 13: Released DeepSeek-VL2, a Mixture-of-Experts-based vision-language model.
- May 20: Successfully defended my Ph.D. thesis!
Selected Projects and Papers
Janus-Series: Unified Multimodal Understanding and Generation Models

DeepSeek

Project lead and core contributor.

Abstract
The Janus-Series pioneers the use of separate visual encoders for unified multimodal understanding and generation, effectively alleviating the inherent conflict of using a single encoder found in prior work. The series comprises three models: Janus, an autoregressive unified model published at CVPR 2025; JanusFlow, a Flow Matching-based unified model also published at CVPR 2025; and Janus-Pro, a version of Janus scaled up in both data and model size. Janus-Pro achieves state-of-the-art performance among open-source models in both multimodal understanding and generation. Notably, on the GenEval image generation benchmark, Janus-Pro scores 80, outperforming both DALL-E 3 and Stable Diffusion 3.
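
To make the decoupling concrete, here is a minimal, illustrative PyTorch sketch of the idea: one visual pathway produces semantic features for understanding, a separate discrete-token pathway serves generation, and both feed a shared autoregressive backbone. Module names, sizes, and the toy backbone are hypothetical placeholders, not the released Janus code.

```python
import torch
import torch.nn as nn

class JanusStyleSketch(nn.Module):
    """Illustrative sketch of the decoupled-encoder idea behind the Janus series:
    understanding and generation use separate visual pathways but share one
    autoregressive backbone. Names and sizes are hypothetical placeholders."""

    def __init__(self, d_model=512, text_vocab=32000, image_codebook=16384):
        super().__init__()
        # Pathway 1: semantic encoder + adaptor for multimodal understanding.
        self.und_encoder = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.und_adaptor = nn.Linear(d_model, d_model)
        # Pathway 2: embedding table over discrete image tokens for generation.
        self.gen_embed = nn.Embedding(image_codebook, d_model)
        # Shared autoregressive backbone (causal masking omitted for brevity).
        self.text_embed = nn.Embedding(text_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.text_head = nn.Linear(d_model, text_vocab)       # next text token
        self.image_head = nn.Linear(d_model, image_codebook)  # next image token

    def understand(self, image, text_ids):
        vis = self.und_encoder(image).flatten(2).transpose(1, 2)        # (B, N, d)
        seq = torch.cat([self.und_adaptor(vis), self.text_embed(text_ids)], dim=1)
        return self.text_head(self.backbone(seq))

    def generate_step(self, text_ids, image_token_ids):
        seq = torch.cat([self.text_embed(text_ids), self.gen_embed(image_token_ids)], dim=1)
        return self.image_head(self.backbone(seq))

model = JanusStyleSketch()
logits = model.understand(torch.randn(1, 3, 64, 64), torch.randint(0, 32000, (1, 8)))
```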

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

DeepSeek

Project co-lead and core contributor.

Abstract
DeepSeek-VL2 is a large multimodal foundation model based on the Mixture-of-Experts (MoE) architecture. It possesses a wide range of multimodal understanding capabilities, including image description, landmark recognition, chart understanding, OCR, meme understanding, multi-image understanding, object localization, and reasoning. Thanks to its MoE architecture, the model achieves better overall performance than Qwen2-VL-7B and InternVL2-8B while using only 4.1B active parameters. In terms of visual perception (e.g., image description), it surpasses Qwen2-VL-72B.
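
The parameter-efficiency claim rests on sparse Mixture-of-Experts routing: each token activates only its top-k experts, so the number of active parameters per token is a small fraction of the total. Below is a generic top-k MoE feed-forward layer as a rough sketch; the router, expert sizes, and k are illustrative assumptions, not the DeepSeek-VL2 configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Generic sparse MoE feed-forward layer: each token is routed to its
    top-k experts, so only a fraction of the total parameters is active
    per token. Illustrative only; not the DeepSeek-VL2 implementation."""

    def __init__(self, d_model=512, d_ff=1024, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                # x: (num_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        weight, idx = gate.topk(self.k, dim=-1)          # top-k experts per token
        weight = weight / weight.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                            # (num_tokens, k) boolean
            if mask.any():
                token_ids, slot = mask.nonzero(as_tuple=True)
                out[token_ids] += weight[token_ids, slot, None] * expert(x[token_ids])
        return out

x = torch.randn(16, 512)
print(TopKMoE()(x).shape)   # torch.Size([16, 512])
```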

CAE: Context Autoencoder for Self-Supervised Representation Learning

Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, Jingdong Wang

International Journal of Computer Vision (IJCV) 2023

Abstract
Core contributions: (1) proposed performing the prediction of masked image patches in the latent space; (2) decoupled the roles of the encoder and decoder during pre-training: the encoder is solely responsible for representation learning, while the decoder only completes the pre-training task. The method achieved state-of-the-art results across ViT models of various sizes (small, base, large, huge); in particular, the ViT-H based model reached 64.5% mAP on the COCO test set, ranking first on the leaderboard at the time of submission. The core idea is similar to that of I-JEPA, a slightly later work from Turing Award winner Yann LeCun, as both perform prediction in the latent space. CAE has been successfully applied in Baidu's large models for industrial vision, OCR text recognition, and human body analysis.
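
A stripped-down sketch of the two contributions: the encoder sees only visible patches (representation learning), a latent regressor predicts the representations of masked patches (prediction in the latent space, with stop-gradient targets from the encoder), and a separate decoder handles the pixel pre-training task. The stand-in modules and shapes are assumptions, not the released CAE code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAEStyleSketch(nn.Module):
    """Schematic of the CAE idea (not the official code): the encoder sees only
    visible patches; a latent regressor predicts the latents of masked patches;
    a separate decoder reconstructs pixels from those predicted latents."""

    def __init__(self, patch_dim=256, num_patches=196):
        super().__init__()
        self.encoder = nn.Linear(patch_dim, patch_dim)            # stands in for a ViT encoder
        self.mask_query = nn.Parameter(torch.zeros(1, num_patches, patch_dim))
        self.regressor = nn.MultiheadAttention(patch_dim, num_heads=4, batch_first=True)
        self.decoder = nn.Linear(patch_dim, patch_dim)            # pre-training task head only

    def forward(self, patches, vis_idx, mask_idx):
        z_vis = self.encoder(patches[:, vis_idx])                 # representation learning only
        queries = self.mask_query[:, mask_idx].expand(patches.size(0), -1, -1)
        z_pred, _ = self.regressor(queries, z_vis, z_vis)         # prediction in latent space
        with torch.no_grad():                                     # stop-gradient latent targets
            z_target = self.encoder(patches[:, mask_idx])
        latent_loss = F.mse_loss(z_pred, z_target)                # align predicted vs. target latents
        pixel_loss = F.mse_loss(self.decoder(z_pred), patches[:, mask_idx])
        return latent_loss + pixel_loss

model = CAEStyleSketch()
loss = model(torch.randn(2, 196, 256), torch.arange(0, 132), torch.arange(132, 196))
```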

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Wenhai Wang*, Zhe Chen*, Xiaokang Chen*, Jiannan Wu*, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, Jifeng Dai (* equal contribution)

Neural Information Processing Systems (NeurIPS) 2023

Abstract
We propose a vision-centric task framework based on large language models (LLMs). By treating images as a form of language and aligning vision tasks with language tasks—which can be flexibly defined and managed through linguistic instructions—this framework provides a unified perspective for both vision and language tasks. VisionLLM enables task customization at various levels via language instructions, ranging from fine-grained object-level to coarse-grained task-level customization. It achieves over 60% mAP on COCO, comparable to specialized detection models.
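
One simple way to make "treating images as a form of language" concrete is to express detection outputs as text-like tokens that an LLM decoder can emit. The snippet below only illustrates that general idea; the bin count, token syntax, and helper names are assumptions and do not reproduce VisionLLM's actual output format.

```python
# Illustrative only: encode detection outputs as language-style tokens so that an
# open-ended text decoder can produce them. NOT the exact VisionLLM format.
NUM_BINS = 1000  # discretize normalized coordinates into location "words"

def box_to_text(box, label):
    """box: (x1, y1, x2, y2) normalized to [0, 1]; returns a token-like string."""
    coords = " ".join(f"<loc_{int(round(v * (NUM_BINS - 1)))}>" for v in box)
    return f"<cls_{label}> {coords}"

def text_to_box(text):
    """Inverse mapping: parse a generated string back into a label and a box."""
    parts = text.split()
    label = parts[0].removeprefix("<cls_").rstrip(">")
    box = [int(p.removeprefix("<loc_").rstrip(">")) / (NUM_BINS - 1) for p in parts[1:]]
    return label, box

print(box_to_text((0.12, 0.30, 0.55, 0.80), "elephant"))
# <cls_elephant> <loc_120> <loc_300> <loc_549> <loc_799>
print(text_to_box("<cls_elephant> <loc_120> <loc_300> <loc_549> <loc_799>"))
```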

Conditional DETR for Fast Training Convergence

Xiaokang Chen*, Depu Meng*, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang (* equal contribution)

International Conference on Computer Vision (ICCV) 2021

Abstract
We solve the slow convergence of Detection Transformer (DETR) with our Conditional Spatial Query method. DETR converges slowly because it struggles to find key extremity regions of an object (e.g., an elephant's feet, back, or trunk), which are vital for accurate localization and recognition. Our method explicitly finds these extremity regions in space, constrains the search area, and speeds up DETR's convergence by 6-10x. This was one of the first works to address DETR's slow training, inspiring many later algorithms like DAB-DETR and DINO.
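
A rough sketch of the conditional spatial query idea: each object query predicts a reference point, the sinusoidal embedding of that point is scaled by a transformation computed from the decoder embedding, and the resulting spatial query is concatenated with the content query (and the positional embedding with the content key) so spatial attention can narrow its search to extremity regions. This single-head, simplified version uses assumed shapes and is not the official Conditional DETR code.

```python
import math
import torch
import torch.nn as nn

def sine_embed(points, d=128):
    """Sinusoidal embedding of 2-D reference points in [0, 1]; (N, 2) -> (N, 2*d)."""
    freqs = 10000 ** (torch.arange(d // 2, dtype=torch.float32) / (d // 2))
    x = points[..., 0:1] * 2 * math.pi / freqs
    y = points[..., 1:2] * 2 * math.pi / freqs
    return torch.cat([x.sin(), x.cos(), y.sin(), y.cos()], dim=-1)

class ConditionalCrossAttentionSketch(nn.Module):
    """Simplified conditional cross-attention (not the official code): the spatial
    query is the sinusoidal embedding of each query's reference point, scaled by a
    transformation predicted from the decoder embedding, and concatenated with the
    content query so that spatial attention can focus on extremity regions."""

    def __init__(self, d_model=256):
        super().__init__()
        self.ref_head = nn.Linear(d_model, 2)          # reference point per object query
        self.transform = nn.Linear(d_model, d_model)   # conditional spatial transformation T

    def forward(self, content_query, memory, memory_pos):
        # content_query: (Q, d); memory: (HW, d); memory_pos: (HW, d)
        ref = self.ref_head(content_query).sigmoid()   # reference points in [0, 1]^2
        spatial_query = self.transform(content_query) * sine_embed(ref, d=content_query.size(-1) // 2)
        q = torch.cat([content_query, spatial_query], dim=-1)   # content + spatial query
        k = torch.cat([memory, memory_pos], dim=-1)             # content + spatial key
        attn = torch.softmax(q @ k.t() / math.sqrt(q.size(-1)), dim=-1)
        return attn @ memory                                    # aggregated features per query

sk = ConditionalCrossAttentionSketch()
out = sk(torch.randn(100, 256), torch.randn(400, 256), torch.randn(400, 256))
```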

CPS: Semi-Supervised Semantic Segmentation with Cross Pseudo Supervision

Xiaokang Chen, Yuhui Yuan, Gang Zeng, Jingdong Wang

IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2021

Abstract
This work proposes a simple and efficient semi-supervised semantic segmentation algorithm that enforces consistency between two parallel segmentation networks via online-generated pseudo-labels. The approach achieves strong semi-supervised performance without any threshold-based filtering. It significantly outperforms contemporary semi-supervised segmentation algorithms on the PASCAL VOC 2012 and Cityscapes datasets, including Google's PseudoSeg (ICLR 2021). The method has become a key baseline in semi-supervised segmentation; the paper has garnered over 1,000 citations and was featured on a list of highly cited AI papers of 2021.
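
The mechanism fits in a few lines: the two networks each produce hard pseudo-labels online for the unlabeled images, and each network is trained against the other's pseudo-labels in addition to the standard supervised loss, with no confidence thresholding. The sketch below is a minimal illustration with toy stand-in networks and an assumed loss weight, not the released CPS code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cps_loss(net_a, net_b, labeled_x, labels, unlabeled_x, lam=1.5):
    """Schematic cross-pseudo-supervision step for two segmentation networks
    (a minimal sketch, not the released CPS code). Each network is supervised
    by the hard pseudo-labels produced online by the other network, plus the
    usual supervised loss on labeled data; no confidence thresholding."""
    # Supervised loss on labeled images for both branches.
    sup = F.cross_entropy(net_a(labeled_x), labels) + F.cross_entropy(net_b(labeled_x), labels)

    # Online pseudo-labels: argmax of the other branch's prediction (no gradient).
    logits_a, logits_b = net_a(unlabeled_x), net_b(unlabeled_x)
    pseudo_a = logits_a.detach().argmax(dim=1)   # supervises net_b
    pseudo_b = logits_b.detach().argmax(dim=1)   # supervises net_a
    cps = F.cross_entropy(logits_a, pseudo_b) + F.cross_entropy(logits_b, pseudo_a)
    return sup + lam * cps

# Toy check with 1x1-conv "segmentation networks" (placeholders for real backbones).
net_a, net_b = nn.Conv2d(3, 21, 1), nn.Conv2d(3, 21, 1)
x_l, y_l, x_u = torch.randn(2, 3, 8, 8), torch.randint(0, 21, (2, 8, 8)), torch.randn(2, 3, 8, 8)
print(cps_loss(net_a, net_b, x_l, y_l, x_u))
```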
