Xiaokang Chen (陈小康)

I am currently a researcher at DeepSeek AI, focusing on multimodal large language models. I am driven by the mission to push the frontiers of machine intelligence and weave it into the fabric of everyday life, ultimately augmenting human potential.

I obtained my Ph.D. degree from Peking University (PKU) in 2024, supervised by Professor Gang Zeng. Before that, I received my Bachelor's degree from Peking University in July 2019.


Education
  • Peking University
         Ph.D. Student
    Sep. 2019 - Jul. 2024
  • Peking University
         B.S. in Computer Science
    Sep. 2015 - Jul. 2019

Academic Service
  • Journal reviewer: IJCV, TPAMI, TIP, TCSVT, Neurocomputing, CVIU.
  • Conference reviewer: CVPR, ECCV, ICCV, NeurIPS, ICML, AAAI.
Honors & Awards
  • WAIC Yunfan Award (云帆奖)
    2025
  • Outstanding Graduate, Peking University
    2024
  • National Scholarship (Ministry of Education, PRC)
    2021, 2022, 2023
  • Merit Student, PKU
    2020, 2021, 2022, 2023
  • Top 10 Outstanding Researcher (学术十杰), PKU
    2021
  • Huawei Scholarship
    2021
  • Award for Academic Innovation, PKU
    2021
  • Schlumberger Scholarship
    2020
  • Award for Excellent Research, PKU
    2018, 2019
Experience
  • DeepSeek
         AGI Researcher
    Apr. 2024 - Present
  • Shanghai Artificial Intelligence Laboratory
         Research Intern, advised by Dr. Wenhai Wang and Dr. Jifeng Dai.
    Dec. 2022 - Nov. 2023
  • Baidu Research
         Research Intern, advised by Dr. Jingdong Wang.
    Dec. 2021 - Dec. 2022
  • Microsoft Research Asia (MSRA)
         Research Intern, advised by Dr. Jingdong Wang.
    Jun. 2020 - Dec. 2021
  • SenseTime Research
         Research Intern, advised by Dr. Kwan-Yee Lin and Dr. Wayne (Wenyan) Wu.
    Apr. 2019 - May 2020
News
2025
- Jul 27: Received the 2025 WAIC Yunfan Award.
- Jan 28: Released Janus-Pro for unified multimodal understanding and generation.
2024
- Dec 13: Released DeepSeek-VL2, a Mixture-of-Experts-based vision-language model.
- May 20: Successfully defended my Ph.D. thesis!
Selected Projects and Papers
Janus-Series: Unified Multimodal Understanding and Generation Models

DeepSeek

Project lead and core contributor.

Abstract
The Janus-Series pioneers the use of separate visual encoders for unified multimodal understanding and generation, effectively alleviating the inherent conflict of using a single encoder found in prior work. The series comprises three models: Janus, an autoregressive unified model published at CVPR 2025; JanusFlow, a Flow Matching-based unified model also published at CVPR 2025; and Janus-Pro, a version of Janus scaled up in both data and model size. Janus-Pro achieves state-of-the-art performance among open-source models in both multimodal understanding and generation. Notably, on the GenEval image generation benchmark, Janus-Pro scores 80, outperforming both DALL-E 3 and Stable Diffusion 3.
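
To make the decoupling concrete, here is a minimal, illustrative PyTorch sketch of the idea: one visual pathway produces semantic features for understanding, a separate discrete-token pathway serves generation, and both feed a shared autoregressive backbone. Module names, sizes, and the toy backbone are hypothetical placeholders, not the released Janus code.

```python
import torch
import torch.nn as nn

class JanusStyleSketch(nn.Module):
    """Illustrative sketch of the decoupled-encoder idea behind the Janus series:
    understanding and generation use separate visual pathways but share one
    autoregressive backbone. Names and sizes are hypothetical placeholders."""

    def __init__(self, d_model=512, text_vocab=32000, image_codebook=16384):
        super().__init__()
        # Pathway 1: semantic encoder + adaptor for multimodal understanding.
        self.und_encoder = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.und_adaptor = nn.Linear(d_model, d_model)
        # Pathway 2: embedding table over discrete image tokens for generation.
        self.gen_embed = nn.Embedding(image_codebook, d_model)
        # Shared autoregressive backbone (causal masking omitted for brevity).
        self.text_embed = nn.Embedding(text_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.text_head = nn.Linear(d_model, text_vocab)       # next text token
        self.image_head = nn.Linear(d_model, image_codebook)  # next image token

    def understand(self, image, text_ids):
        vis = self.und_encoder(image).flatten(2).transpose(1, 2)        # (B, N, d)
        seq = torch.cat([self.und_adaptor(vis), self.text_embed(text_ids)], dim=1)
        return self.text_head(self.backbone(seq))

    def generate_step(self, text_ids, image_token_ids):
        seq = torch.cat([self.text_embed(text_ids), self.gen_embed(image_token_ids)], dim=1)
        return self.image_head(self.backbone(seq))

model = JanusStyleSketch()
logits = model.understand(torch.randn(1, 3, 64, 64), torch.randint(0, 32000, (1, 8)))
```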

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

DeepSeek

Project co-lead and core contributor.

Abstract
DeepSeek-VL2 is a large multimodal foundation model based on the Mixture-of-Experts (MoE) architecture. It possesses a wide range of multimodal understanding capabilities, including image description, landmark recognition, chart understanding, OCR, meme understanding, multi-image understanding, object localization, and reasoning. Thanks to its MoE architecture, the model achieves better overall performance than Qwen2-VL-7B and InternVL2-8B while using only 4.1B active parameters. In terms of visual perception (e.g., image description), it surpasses Qwen2-VL-72B.
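
The parameter-efficiency claim rests on sparse Mixture-of-Experts routing: each token activates only its top-k experts, so the number of active parameters per token is a small fraction of the total. Below is a generic top-k MoE feed-forward layer as a rough sketch; the router, expert sizes, and k are illustrative assumptions, not the DeepSeek-VL2 configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Generic sparse MoE feed-forward layer: each token is routed to its
    top-k experts, so only a fraction of the total parameters is active
    per token. Illustrative only; not the DeepSeek-VL2 implementation."""

    def __init__(self, d_model=512, d_ff=1024, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                # x: (num_tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)
        weight, idx = gate.topk(self.k, dim=-1)          # top-k experts per token
        weight = weight / weight.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = (idx == e)                            # (num_tokens, k) boolean
            if mask.any():
                token_ids, slot = mask.nonzero(as_tuple=True)
                out[token_ids] += weight[token_ids, slot, None] * expert(x[token_ids])
        return out

x = torch.randn(16, 512)
print(TopKMoE()(x).shape)   # torch.Size([16, 512])
```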

CAE: Context Autoencoder for Self-Supervised Representation Learning

Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, Jingdong Wang

International Journal of Computer Vision (IJCV) 2023

Abstract
Core contributions: (1) proposed performing the prediction of masked image patches in the latent space; (2) decoupled the roles of the encoder and decoder during pre-training: the encoder is solely responsible for representation learning, while the decoder only completes the pre-training task. The method achieved state-of-the-art results across ViT models of various sizes (small, base, large, huge); in particular, the ViT-H based model reached 64.5% mAP on the COCO test set, ranking first on the leaderboard at the time of submission. The core idea is similar to that of I-JEPA, a slightly later work from Turing Award winner Yann LeCun, as both perform prediction in the latent space. CAE has been successfully applied in Baidu's large models for industrial vision, OCR text recognition, and human body analysis.
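
A stripped-down sketch of the two contributions: the encoder sees only visible patches (representation learning), a latent regressor predicts the representations of masked patches (prediction in the latent space, with stop-gradient targets from the encoder), and a separate decoder handles the pixel pre-training task. The stand-in modules and shapes are assumptions, not the released CAE code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAEStyleSketch(nn.Module):
    """Schematic of the CAE idea (not the official code): the encoder sees only
    visible patches; a latent regressor predicts the latents of masked patches;
    a separate decoder reconstructs pixels from those predicted latents."""

    def __init__(self, patch_dim=256, num_patches=196):
        super().__init__()
        self.encoder = nn.Linear(patch_dim, patch_dim)            # stands in for a ViT encoder
        self.mask_query = nn.Parameter(torch.zeros(1, num_patches, patch_dim))
        self.regressor = nn.MultiheadAttention(patch_dim, num_heads=4, batch_first=True)
        self.decoder = nn.Linear(patch_dim, patch_dim)            # pre-training task head only

    def forward(self, patches, vis_idx, mask_idx):
        z_vis = self.encoder(patches[:, vis_idx])                 # representation learning only
        queries = self.mask_query[:, mask_idx].expand(patches.size(0), -1, -1)
        z_pred, _ = self.regressor(queries, z_vis, z_vis)         # prediction in latent space
        with torch.no_grad():                                     # stop-gradient latent targets
            z_target = self.encoder(patches[:, mask_idx])
        latent_loss = F.mse_loss(z_pred, z_target)                # align predicted vs. target latents
        pixel_loss = F.mse_loss(self.decoder(z_pred), patches[:, mask_idx])
        return latent_loss + pixel_loss

model = CAEStyleSketch()
loss = model(torch.randn(2, 196, 256), torch.arange(0, 132), torch.arange(132, 196))
```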

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Wenhai Wang*, Zhe Chen*, Xiaokang Chen*, Jiannan Wu*, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, Jifeng Dai (* equal contribution)

Neural Information Processing Systems (NeurIPS) 2023

Abstract
We propose a vision-centric task framework based on large language models (LLMs). By treating images as a form of language and aligning vision tasks with language tasks—which can be flexibly defined and managed through linguistic instructions—this framework provides a unified perspective for both vision and language tasks. VisionLLM enables task customization at various levels via language instructions, ranging from fine-grained object-level to coarse-grained task-level customization. It achieves over 60% mAP on COCO, comparable to specialized detection models.
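
One simple way to make "treating images as a form of language" concrete is to express detection outputs as text-like tokens that an LLM decoder can emit. The snippet below only illustrates that general idea; the bin count, token syntax, and helper names are assumptions and do not reproduce VisionLLM's actual output format.

```python
# Illustrative only: encode detection outputs as language-style tokens so that an
# open-ended text decoder can produce them. NOT the exact VisionLLM format.
NUM_BINS = 1000  # discretize normalized coordinates into location "words"

def box_to_text(box, label):
    """box: (x1, y1, x2, y2) normalized to [0, 1]; returns a token-like string."""
    coords = " ".join(f"<loc_{int(round(v * (NUM_BINS - 1)))}>" for v in box)
    return f"<cls_{label}> {coords}"

def text_to_box(text):
    """Inverse mapping: parse a generated string back into a label and a box."""
    parts = text.split()
    label = parts[0].removeprefix("<cls_").rstrip(">")
    box = [int(p.removeprefix("<loc_").rstrip(">")) / (NUM_BINS - 1) for p in parts[1:]]
    return label, box

print(box_to_text((0.12, 0.30, 0.55, 0.80), "elephant"))
# <cls_elephant> <loc_120> <loc_300> <loc_549> <loc_799>
print(text_to_box("<cls_elephant> <loc_120> <loc_300> <loc_549> <loc_799>"))
```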

Conditional DETR for Fast Training Convergence

Xiaokang Chen*, Depu Meng*, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang (* equal contribution)

International Conference on Computer Vision (ICCV) 2021

Abstract
We solve the slow convergence of Detection Transformer (DETR) with our Conditional Spatial Query method. DETR converges slowly because it struggles to find key extremity regions of an object (e.g., an elephant's feet, back, or trunk), which are vital for accurate localization and recognition. Our method explicitly finds these extremity regions in space, constrains the search area, and speeds up DETR's convergence by 6-10x. This was one of the first works to address DETR's slow training, inspiring many later algorithms like DAB-DETR and DINO.
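
A rough sketch of the conditional spatial query idea: each object query predicts a reference point, the sinusoidal embedding of that point is scaled by a transformation computed from the decoder embedding, and the resulting spatial query is concatenated with the content query (and the positional embedding with the content key) so spatial attention can narrow its search to extremity regions. This single-head, simplified version uses assumed shapes and is not the official Conditional DETR code.

```python
import math
import torch
import torch.nn as nn

def sine_embed(points, d=128):
    """Sinusoidal embedding of 2-D reference points in [0, 1]; (N, 2) -> (N, 2*d)."""
    freqs = 10000 ** (torch.arange(d // 2, dtype=torch.float32) / (d // 2))
    x = points[..., 0:1] * 2 * math.pi / freqs
    y = points[..., 1:2] * 2 * math.pi / freqs
    return torch.cat([x.sin(), x.cos(), y.sin(), y.cos()], dim=-1)

class ConditionalCrossAttentionSketch(nn.Module):
    """Simplified conditional cross-attention (not the official code): the spatial
    query is the sinusoidal embedding of each query's reference point, scaled by a
    transformation predicted from the decoder embedding, and concatenated with the
    content query so that spatial attention can focus on extremity regions."""

    def __init__(self, d_model=256):
        super().__init__()
        self.ref_head = nn.Linear(d_model, 2)          # reference point per object query
        self.transform = nn.Linear(d_model, d_model)   # conditional spatial transformation T

    def forward(self, content_query, memory, memory_pos):
        # content_query: (Q, d); memory: (HW, d); memory_pos: (HW, d)
        ref = self.ref_head(content_query).sigmoid()   # reference points in [0, 1]^2
        spatial_query = self.transform(content_query) * sine_embed(ref, d=content_query.size(-1) // 2)
        q = torch.cat([content_query, spatial_query], dim=-1)   # content + spatial query
        k = torch.cat([memory, memory_pos], dim=-1)             # content + spatial key
        attn = torch.softmax(q @ k.t() / math.sqrt(q.size(-1)), dim=-1)
        return attn @ memory                                    # aggregated features per query

sk = ConditionalCrossAttentionSketch()
out = sk(torch.randn(100, 256), torch.randn(400, 256), torch.randn(400, 256))
```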

CPS: Semi-Supervised Semantic Segmentation with Cross Pseudo Supervision

Xiaokang Chen, Yuhui Yuan, Gang Zeng, Jingdong Wang

IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2021

Abstract
This work proposes a simple and efficient semi-supervised semantic segmentation algorithm that enforces consistency between two parallel segmentation networks via online-generated pseudo-labels. The approach achieves strong semi-supervised performance without any threshold-based filtering. It significantly outperforms contemporary semi-supervised segmentation algorithms on the PASCAL VOC 2012 and Cityscapes datasets, including Google's PseudoSeg (ICLR 2021). The method has become a key baseline in semi-supervised segmentation; the paper has garnered over 1,000 citations and was featured on a list of highly cited AI papers of 2021.
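
The mechanism fits in a few lines: the two networks each produce hard pseudo-labels online for the unlabeled images, and each network is trained against the other's pseudo-labels in addition to the standard supervised loss, with no confidence thresholding. The sketch below is a minimal illustration with toy stand-in networks and an assumed loss weight, not the released CPS code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def cps_loss(net_a, net_b, labeled_x, labels, unlabeled_x, lam=1.5):
    """Schematic cross-pseudo-supervision step for two segmentation networks
    (a minimal sketch, not the released CPS code). Each network is supervised
    by the hard pseudo-labels produced online by the other network, plus the
    usual supervised loss on labeled data; no confidence thresholding."""
    # Supervised loss on labeled images for both branches.
    sup = F.cross_entropy(net_a(labeled_x), labels) + F.cross_entropy(net_b(labeled_x), labels)

    # Online pseudo-labels: argmax of the other branch's prediction (no gradient).
    logits_a, logits_b = net_a(unlabeled_x), net_b(unlabeled_x)
    pseudo_a = logits_a.detach().argmax(dim=1)   # supervises net_b
    pseudo_b = logits_b.detach().argmax(dim=1)   # supervises net_a
    cps = F.cross_entropy(logits_a, pseudo_b) + F.cross_entropy(logits_b, pseudo_a)
    return sup + lam * cps

# Toy check with 1x1-conv "segmentation networks" (placeholders for real backbones).
net_a, net_b = nn.Conv2d(3, 21, 1), nn.Conv2d(3, 21, 1)
x_l, y_l, x_u = torch.randn(2, 3, 8, 8), torch.randint(0, 21, (2, 8, 8)), torch.randn(2, 3, 8, 8)
print(cps_loss(net_a, net_b, x_l, y_l, x_u))
```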
