2025

Janus-Series: Unified Multimodal Understanding and Generation Models

DeepSeek

Project lead and core contributor.

Abstract
The Janus series pioneers the use of separate visual encoders for unified multimodal understanding and generation, effectively alleviating the inherent conflict that arises when prior work relies on a single encoder for both tasks. The series comprises three models: Janus, an autoregressive unified model published at CVPR 2025; JanusFlow, a flow-matching-based unified model also published at CVPR 2025; and Janus-Pro, a version of Janus scaled up in both data and model size. Janus-Pro achieves state-of-the-art performance among open-source models in both multimodal understanding and generation. Notably, it scores 80 on the GenEval image-generation benchmark, outperforming both DALL-E 3 and Stable Diffusion 3.
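
Below is a minimal, hypothetical PyTorch sketch of the decoupled-encoder idea described in the abstract (the module names, dimensions, and stand-in encoders are invented for illustration and do not reflect the released Janus code): understanding feeds continuous semantic features into a shared autoregressive backbone, while generation predicts discrete image tokens from a separate tokenizer pathway.

```python
import torch
import torch.nn as nn

class ToyJanus(nn.Module):
    """Two visual pathways, one shared autoregressive transformer (toy sketch)."""
    def __init__(self, dim=256, text_vocab=1000, image_vocab=512):
        super().__init__()
        # Understanding pathway: continuous patch features (stand-in for a
        # semantic ViT/SigLIP-style encoder) projected into the LLM space.
        self.understand_proj = nn.Linear(768, dim)
        # Generation pathway: discrete image-token embeddings (stand-in for a
        # VQ tokenizer codebook) whose ids are predicted autoregressively.
        self.image_token_embed = nn.Embedding(image_vocab, dim)
        self.text_embed = nn.Embedding(text_vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.text_head = nn.Linear(dim, text_vocab)    # answers (understanding)
        self.image_head = nn.Linear(dim, image_vocab)  # next image token (generation)

    def understand(self, patch_feats, text_ids):
        x = torch.cat([self.understand_proj(patch_feats),
                       self.text_embed(text_ids)], dim=1)
        return self.text_head(self.backbone(x))

    def generate_step(self, text_ids, image_ids):
        x = torch.cat([self.text_embed(text_ids),
                       self.image_token_embed(image_ids)], dim=1)
        return self.image_head(self.backbone(x))[:, -1]  # logits for the next image token

model = ToyJanus()
answer_logits = model.understand(torch.randn(1, 196, 768), torch.randint(0, 1000, (1, 8)))
next_image_token_logits = model.generate_step(torch.randint(0, 1000, (1, 8)),
                                              torch.randint(0, 512, (1, 4)))
```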

2024

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

DeepSeek

Project co-lead and core contributor.

Abstract
DeepSeek-VL2 is a large multimodal foundation model based on the Mixture-of-Experts (MoE) architecture. It offers a wide range of multimodal understanding capabilities, including image description, landmark recognition, chart understanding, OCR, meme understanding, multi-image understanding, object localization, and reasoning. Thanks to its MoE architecture, the model achieves better overall performance than Qwen2-VL-7B and InternVL2-8B while using only 4.1B active parameters. On visual perception tasks such as image description, it even surpasses Qwen2-VL-72B.
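
As a rough illustration of why an MoE model can match larger dense models while keeping active parameters low, the toy layer below routes each token to only k of many expert FFNs (a generic, unoptimized top-k router written for clarity, not DeepSeek-VL2's actual MoE implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy top-k mixture-of-experts FFN: many experts are stored, but each
    token activates only k of them, so compute per token stays small."""
    def __init__(self, dim=256, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                                   # x: (batch, tokens, dim)
        gates = F.softmax(self.router(x), dim=-1)           # routing probabilities
        topk_gates, topk_idx = gates.topk(self.k, dim=-1)
        topk_gates = topk_gates / topk_gates.sum(-1, keepdim=True)
        out = torch.zeros_like(x)
        # Unoptimized for clarity: every expert runs on all tokens, and the
        # routing mask selects which outputs are actually kept.
        for slot in range(self.k):
            idx = topk_idx[..., slot]
            for e, expert in enumerate(self.experts):
                mask = (idx == e).unsqueeze(-1)
                if mask.any():
                    out = out + mask * topk_gates[..., slot:slot + 1] * expert(x)
        return out

y = TinyMoE()(torch.randn(2, 16, 256))                      # output: (2, 16, 256)
```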

D$^3$ETR: Decoder Distillation for Detection Transformer

Xiaokang Chen, Jiahui Chen, Yan Liu, Jiaxiang Tang, Gang Zeng

International Joint Conference on Artificial Intelligence (IJCAI) 2024

LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation

Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, Ziwei Liu

European Conference on Computer Vision (ECCV) 2024

The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Language Models

Yan Liu, Yu Liu, Xiaokang Chen, Pin-Yu Chen, Daoguang Zan, Min-Yen Kan, Tsung-Yi Ho

International Conference on Learning Representations (ICLR) 2024

Improving Long Text Understanding with Knowledge Distilled from Summarization Model

Yan Liu, Yazheng Yang, Xiaokang Chen

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024

2023

CAE: Context Autoencoder for Self-Supervised Representation Learning

Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, Jingdong Wang

International Journal of Computer Vision (IJCV) 2023

Abstract
Core contributions: (1) proposed performing the prediction of masked image patches in the latent space; (2) decoupled the roles of the encoder and decoder during pre-training, so that the encoder is solely responsible for representation learning while the decoder only completes the pre-training task. The method achieved state-of-the-art results across ViT models of various sizes (small, base, large, huge); in particular, the ViT-H based model reached 64.5 mAP on the COCO test set, ranking first on the leaderboard at the time of submission. The core idea is similar to that of I-JEPA, a slightly later work from Turing Award winner Yann LeCun's group, as both perform prediction in the latent space. CAE has been successfully applied in Baidu's large models for industrial vision, OCR text recognition, and human body analysis.
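
A heavily simplified sketch of "prediction in the latent space" under invented dimensions (the pooled regressor below stands in for CAE's latent regressor and decoder; it is an illustration, not the CAE code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 128
encoder = nn.Sequential(nn.Linear(768, dim), nn.GELU(), nn.Linear(dim, dim))
regressor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

patches = torch.randn(2, 196, 768)                     # patch embeddings of an image
visible, masked = patches[:, :147], patches[:, 147:]   # last 25% of patches are masked

z_visible = encoder(visible)                 # encoder only does representation learning
with torch.no_grad():                        # latent targets: encoder applied to masked patches
    z_target = encoder(masked)

# The regressor must predict the masked latents from the visible ones;
# mean-pooling replaces the real cross-attention regressor for brevity.
z_pred = regressor(z_visible.mean(dim=1, keepdim=True)).expand_as(z_target)
loss = F.mse_loss(z_pred, z_target)          # the prediction target lives in latent space
loss.backward()
```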

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Wenhai Wang*, Zhe Chen*, Xiaokang Chen*, Jiannan Wu*, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, Jifeng Dai (* equal contribution)

Neural Information Processing Systems (NeurIPS) 2023

Abstract
We propose a vision-centric task framework based on large language models (LLMs). By treating images as a form of language and aligning vision tasks with language tasks—which can be flexibly defined and managed through linguistic instructions—this framework provides a unified perspective for both vision and language tasks. VisionLLM enables task customization at various levels via language instructions, ranging from fine-grained object-level to coarse-grained task-level customization. It achieves over 60% mAP on COCO, comparable to specialized detection models.
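
The snippet below is a hypothetical illustration of phrasing a detection task as a language instruction with discretized coordinate outputs (the prompt wording, separator tokens, and bin count are invented for illustration and are not the paper's exact interface):

```python
# An instruction defines the task; the model answers with quantized box tokens.
instruction = (
    "<image> Detect every object belonging to the categories {person, dog}. "
    "For each object, output '<class> <x1> <y1> <x2> <y2>' with coordinates "
    "quantized to integers in [0, 999]."
)

# A model following this instruction might emit a sequence such as:
output_tokens = "person 103 250 412 871 <sep> dog 530 604 780 902 <eos>"

def parse_boxes(tokens: str, image_w: int, image_h: int, num_bins: int = 1000):
    """Map quantized box tokens back to pixel coordinates."""
    scale = lambda v, size: int(v) / (num_bins - 1) * size
    boxes = []
    for obj in tokens.replace("<eos>", "").split("<sep>"):
        cls, x1, y1, x2, y2 = obj.split()
        boxes.append((cls, scale(x1, image_w), scale(y1, image_h),
                      scale(x2, image_w), scale(y2, image_h)))
    return boxes

print(parse_boxes(output_tokens, image_w=640, image_h=480))
```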

Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment

Qiang Chen*, Xiaokang Chen*, Jian Wang, Haocheng Feng, Junyu Han, Errui Ding, Gang Zeng, Jingdong Wang (* equal contribution)

International Conference on Computer Vision (ICCV) 2023

Group DETR v2: Strong Object Detector with Encoder-Decoder Pretraining

Baidu Research Team

Technical Report 2023

The first model to achieve 64.5 mAP on the COCO test set leaderboard.

Uncovering and Quantifying Social Biases in Code Generation

Yan Liu, Xiaokang Chen#, Yan Gao, Zhe Su, Fengji Zhang, Daoguang Zan, Jian-Guang LOU, Pin-Yu Chen, Tsung-Yi Ho (# corresponding author)

Neural Information Processing Systems (NeurIPS) 2023

Interactive Segment Anything NeRF with Feature Imitation

Xiaokang Chen*, Jiaxiang Tang*, Diwen Wan, Jingbo Wang, Gang Zeng (* equal contribution)

Technical Report 2023

Parallel Sentence-Level Explanation Generation For Real-World Low-Resource Scenarios

Yan Liu, Xiaokang Chen, Qi Dai

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023

Delicate Textured Mesh Recovery from NeRF via Adaptive Surface Refinement

Jiaxiang Tang, Hang Zhou, Xiaokang Chen, Tianshu Hu, Errui Ding, Jingdong Wang, Gang Zeng

International Conference on Computer Vision (ICCV) 2023

Uncovering and Categorizing Social Biases in Text-to-SQL

Yan Liu, Yan Gao, Zhe Su, Xiaokang Chen, Elliott Ash, Jian-Guang LOU

Annual Meeting of the Association for Computational Linguistics (ACL) 2023

Understanding Self-Supervised Pretraining with Part-Aware Representation Learning

Jie Zhu*, Jiyang Qi*, Mingyu Ding*, Xiaokang Chen, Ping Luo, Xinggang Wang, Wenyu Liu, Leye Wang, Jingdong Wang (* equal contribution)

Transactions on Machine Learning Research (TMLR) 2023

CAE v2: Context Autoencoder with CLIP Target

Xinyu Zhang, Jiahui Chen, Junkun Yuan, Qiang Chen, Jian Wang, Xiaodi Wang, Shumin Han, Xiaokang Chen, Jimin Pi, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

Transactions on Machine Learning Research (TMLR) 2023

2022

Not All Voxels Are Equal: Semantic Scene Completion from the Point-Voxel Perspective

Xiaokang Chen, Jiaxiang Tang, Jingbo Wang, Gang Zeng

AAAI Conference on Artificial Intelligence (AAAI) 2022

Conditional DETR V2: Efficient Detection Transformer with Box Queries

Xiaokang Chen, Fangyun Wei, Gang Zeng, Jingdong Wang

Technical Report 2022

Compressible-composable NeRF via Rank-residual Decomposition

Jiaxiang Tang, Xiaokang Chen, Jingbo Wang, Gang Zeng

Neural Information Processing Systems (NeurIPS) 2022

Point Scene Understanding via Disentangled Instance Mesh Reconstruction

Jiaxiang Tang, Xiaokang Chen, Jingbo Wang, Gang Zeng

European Conference on Computer Vision (ECCV) 2022

MaskGroup: Hierarchical Point Grouping and Masking for 3D Instance Segmentation

Min Zhong, Xinghao Chen, Xiaokang Chen, Gang Zeng, Yunhe Wang

IEEE International Conference on Multimedia and Expo (ICME) 2022

2021

Conditional DETR for Fast Training Convergence

Xiaokang Chen*, Depu Meng*, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang (* equal contribution)

International Conference on Computer Vision (ICCV) 2021

Abstract
We solve the slow convergence of Detection Transformer (DETR) with our Conditional Spatial Query method. DETR converges slowly because it struggles to find key extremity regions of an object (e.g., an elephant's feet, back, or trunk), which are vital for accurate localization and recognition. Our method explicitly finds these extremity regions in space, constrains the search area, and speeds up DETR's convergence by 6-10x. This was one of the first works to address DETR's slow training, inspiring many later algorithms like DAB-DETR and DINO.
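
A toy sketch of the conditional spatial query (dimensions and the single attention step below are simplified for illustration; the actual model applies this inside every decoder cross-attention layer): the spatial part of each query is generated from its reference point and modulated by the content embedding, so attention concentrates on regions near that point.

```python
import math
import torch
import torch.nn as nn

def sine_embed(xy, dim=128):
    """2D sinusoidal positional encoding for normalized (x, y) points."""
    freqs = torch.pow(10000.0, torch.arange(dim // 4) / (dim // 4))
    pos = xy.unsqueeze(-1) / freqs                                # (..., 2, dim/4)
    return torch.cat([pos.sin(), pos.cos()], dim=-1).flatten(-2)  # (..., dim)

dim, num_queries, num_tokens = 128, 4, 16
content_q = torch.randn(num_queries, dim)                # content queries (decoder self-attn output)
ref_points = torch.rand(num_queries, 2)                  # reference points in [0, 1]^2
T = nn.Linear(dim, dim)                                  # transformation conditioned on content

spatial_q = T(content_q) * sine_embed(ref_points, dim)   # conditional spatial query
content_k = torch.randn(num_tokens, dim)                 # encoder memory (content keys)
spatial_k = sine_embed(torch.rand(num_tokens, 2), dim)   # encoder positional encodings

# Concatenating content and spatial parts keeps the two attention terms separate,
# so the spatial term can localize extremity regions near the reference point.
q = torch.cat([content_q, spatial_q], dim=-1)
k = torch.cat([content_k, spatial_k], dim=-1)
attn = torch.softmax(q @ k.t() / math.sqrt(q.shape[-1]), dim=-1)  # (queries, tokens)
```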

CPS: Semi-Supervised Semantic Segmentation with Cross Pseudo Supervision

Xiaokang Chen, Yuhui Yuan, Gang Zeng, Jingdong Wang

IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2021

Abstract
This work proposes a simple and efficient semi-supervised semantic segmentation algorithm that enforces consistency between the two branches of a dual-branch network using online-generated pseudo-labels. The approach achieves excellent semi-supervised performance without any threshold-based filtering. It significantly outperforms contemporary semi-supervised segmentation algorithms on the PASCAL VOC 2012 and Cityscapes datasets, including Google's PseudoSeg (ICLR 2021), and has become a key baseline in the field of semi-supervised segmentation. The paper has garnered over 1,000 citations and was featured in a list of highly cited AI papers of 2021.
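
A minimal sketch of the cross-pseudo-supervision loss, assuming two tiny convolutional branches as stand-ins for the two segmentation networks (the trade-off weight and toy data below are placeholders; the full method also applies the CPS loss to labeled images and uses stronger augmentation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_branch(num_classes=21):
    return nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, num_classes, 1))

net_a, net_b = make_branch(), make_branch()      # same architecture, different init

labeled = torch.randn(2, 3, 64, 64)
labels = torch.randint(0, 21, (2, 64, 64))
unlabeled = torch.randn(2, 3, 64, 64)

logits_a_l, logits_b_l = net_a(labeled), net_b(labeled)
logits_a_u, logits_b_u = net_a(unlabeled), net_b(unlabeled)

# Online pseudo-labels: each branch's hard argmax prediction, gradients detached,
# with no confidence thresholding.
pseudo_a = logits_a_u.detach().argmax(dim=1)
pseudo_b = logits_b_u.detach().argmax(dim=1)

sup_loss = F.cross_entropy(logits_a_l, labels) + F.cross_entropy(logits_b_l, labels)
cps_loss = F.cross_entropy(logits_a_u, pseudo_b) + F.cross_entropy(logits_b_u, pseudo_a)
loss = sup_loss + 1.5 * cps_loss    # cross-supervision weight is a tunable hyperparameter
loss.backward()
```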

Joint Implicit Image Function for Guided Depth Super-Resolution

Jiaxiang Tang, Xiaokang Chen, Gang Zeng

ACM Multimedia (ACM MM) 2021

2020

Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation

Xiaokang Chen, Kwan-Yee Lin, Jingbo Wang, Wayne Wu, Chen Qian, Hongsheng Li, Gang Zeng

European Conference on Computer Vision (ECCV) 2020

3D Sketch-aware Semantic Scene Completion via Semi-supervised Structure Prior

Xiaokang Chen, Kwan-Yee Lin, Chen Qian, Gang Zeng, Hongsheng Li

IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020

Real-time Semantic Scene Completion Via Feature Aggregation and Conditioned Prediction

Xiaokang Chen, Yajie Xing, Gang Zeng

International Conference on Image Processing (ICIP) 2020

2019

2.5D Convolution for RGB-D Semantic Segmentation

Yajie Xing, Jingbo Wang, Xiaokang Chen, Gang Zeng

International Conference on Image Processing (ICIP) 2019

Coupling Two-Stream RGB-D Semantic Segmentation Network by Idempotent Mappings

Yajie Xing, Jingbo Wang, Xiaokang Chen, Gang Zeng

International Conference on Image Processing (ICIP) 2019
