2025

Janus-Series: Unified Multimodal Understanding and Generation Models

DeepSeek

Project lead and core contributor.

Abstract
The Janus series pioneers the use of separate visual encoders for unified multimodal understanding and generation, effectively alleviating the inherent conflict that arises when prior work relies on a single encoder for both tasks. The series comprises three models: Janus, an autoregressive unified model published at CVPR 2025; JanusFlow, a flow-matching-based unified model also published at CVPR 2025; and Janus-Pro, a version of Janus scaled up in both data and model size. Janus-Pro achieves state-of-the-art performance among open-source models in both multimodal understanding and generation. Notably, it scores 80 on the GenEval image-generation benchmark, outperforming both DALL-E 3 and Stable Diffusion 3.
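
Below is a minimal, hypothetical PyTorch sketch of the decoupled-encoder idea described in the abstract (the module names, dimensions, and stand-in encoders are invented for illustration and do not reflect the released Janus code): understanding feeds continuous semantic features into a shared autoregressive backbone, while generation predicts discrete image tokens from a separate tokenizer pathway.

```python
import torch
import torch.nn as nn

class ToyJanus(nn.Module):
    """Two visual pathways, one shared autoregressive transformer (toy sketch)."""
    def __init__(self, dim=256, text_vocab=1000, image_vocab=512):
        super().__init__()
        # Understanding pathway: continuous patch features (stand-in for a
        # semantic ViT/SigLIP-style encoder) projected into the LLM space.
        self.understand_proj = nn.Linear(768, dim)
        # Generation pathway: discrete image-token embeddings (stand-in for a
        # VQ tokenizer codebook) whose ids are predicted autoregressively.
        self.image_token_embed = nn.Embedding(image_vocab, dim)
        self.text_embed = nn.Embedding(text_vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.text_head = nn.Linear(dim, text_vocab)    # answers (understanding)
        self.image_head = nn.Linear(dim, image_vocab)  # next image token (generation)

    def understand(self, patch_feats, text_ids):
        x = torch.cat([self.understand_proj(patch_feats),
                       self.text_embed(text_ids)], dim=1)
        return self.text_head(self.backbone(x))

    def generate_step(self, text_ids, image_ids):
        x = torch.cat([self.text_embed(text_ids),
                       self.image_token_embed(image_ids)], dim=1)
        return self.image_head(self.backbone(x))[:, -1]  # logits for the next image token

model = ToyJanus()
answer_logits = model.understand(torch.randn(1, 196, 768), torch.randint(0, 1000, (1, 8)))
next_image_token_logits = model.generate_step(torch.randint(0, 1000, (1, 8)),
                                              torch.randint(0, 512, (1, 4)))
```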

2024

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

DeepSeek

Project co-lead and core contributor.

Abstract
DeepSeek-VL2 is a large multimodal foundation model based on the Mixture-of-Experts (MoE) architecture. It offers a wide range of multimodal understanding capabilities, including image description, landmark recognition, chart understanding, OCR, meme understanding, multi-image understanding, object localization, and reasoning. Thanks to its MoE architecture, the model achieves better overall performance than Qwen2-VL-7B and InternVL2-8B while using only 4.1B active parameters. On visual perception tasks such as image description, it even surpasses Qwen2-VL-72B.
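
As a rough illustration of why an MoE model can match larger dense models while keeping active parameters low, the toy layer below routes each token to only k of many expert FFNs (a generic, unoptimized top-k router written for clarity, not DeepSeek-VL2's actual MoE implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy top-k mixture-of-experts FFN: many experts are stored, but each
    token activates only k of them, so compute per token stays small."""
    def __init__(self, dim=256, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                                   # x: (batch, tokens, dim)
        gates = F.softmax(self.router(x), dim=-1)           # routing probabilities
        topk_gates, topk_idx = gates.topk(self.k, dim=-1)
        topk_gates = topk_gates / topk_gates.sum(-1, keepdim=True)
        out = torch.zeros_like(x)
        # Unoptimized for clarity: every expert runs on all tokens, and the
        # routing mask selects which outputs are actually kept.
        for slot in range(self.k):
            idx = topk_idx[..., slot]
            for e, expert in enumerate(self.experts):
                mask = (idx == e).unsqueeze(-1)
                if mask.any():
                    out = out + mask * topk_gates[..., slot:slot + 1] * expert(x)
        return out

y = TinyMoE()(torch.randn(2, 16, 256))                      # output: (2, 16, 256)
```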

D$^3$ETR: Decoder Distillation for Detection Transformer

Xiaokang Chen, Jiahui Chen, Yan Liu, Jiaxiang Tang, Gang Zeng

International Joint Conference on Artificial Intelligence (IJCAI) 2024

LGM: Large Multi-View Gaussian Model for High-Resolution 3D Content Creation

Jiaxiang Tang, Zhaoxi Chen, Xiaokang Chen, Tengfei Wang, Gang Zeng, Ziwei Liu

European Conference on Computer Vision (ECCV) 2024

The Devil is in the Neurons: Interpreting and Mitigating Social Biases in Language Models

Yan Liu, Yu Liu, Xiaokang Chen, Pin-Yu Chen, Daoguang Zan, Min-Yen Kan, Tsung-Yi Ho

International Conference on Learning Representations (ICLR) 2024

Improving Long Text Understanding with Knowledge Distilled from Summarization Model

Yan Liu, Yazheng Yang, Xiaokang Chen

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2024

2023

CAE: Context Autoencoder for Self-Supervised Representation Learning

Xiaokang Chen, Mingyu Ding, Xiaodi Wang, Ying Xin, Shentong Mo, Yunhao Wang, Shumin Han, Ping Luo, Gang Zeng, Jingdong Wang

International Journal of Computer Vision (IJCV) 2023

Abstract
Core contributions: (1) proposed performing the prediction of masked image patches in the latent space; (2) decoupled the roles of the encoder and decoder during pre-training, so that the encoder is solely responsible for representation learning while the decoder only completes the pre-training task. The method achieved state-of-the-art results across ViT models of various sizes (small, base, large, huge); in particular, the ViT-H based model reached 64.5 mAP on the COCO test set, ranking first on the leaderboard at the time of submission. The core idea is similar to that of I-JEPA, a slightly later work from Turing Award winner Yann LeCun's group, as both perform prediction in the latent space. CAE has been successfully applied in Baidu's large models for industrial vision, OCR text recognition, and human body analysis.
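
A heavily simplified sketch of "prediction in the latent space" under invented dimensions (the pooled regressor below stands in for CAE's latent regressor and decoder; it is an illustration, not the CAE code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 128
encoder = nn.Sequential(nn.Linear(768, dim), nn.GELU(), nn.Linear(dim, dim))
regressor = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

patches = torch.randn(2, 196, 768)                     # patch embeddings of an image
visible, masked = patches[:, :147], patches[:, 147:]   # last 25% of patches are masked

z_visible = encoder(visible)                 # encoder only does representation learning
with torch.no_grad():                        # latent targets: encoder applied to masked patches
    z_target = encoder(masked)

# The regressor must predict the masked latents from the visible ones;
# mean-pooling replaces the real cross-attention regressor for brevity.
z_pred = regressor(z_visible.mean(dim=1, keepdim=True)).expand_as(z_target)
loss = F.mse_loss(z_pred, z_target)          # the prediction target lives in latent space
loss.backward()
```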

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Wenhai Wang*, Zhe Chen*, Xiaokang Chen*, Jiannan Wu*, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, Jifeng Dai (* equal contribution)

Neural Information Processing Systems (NeurIPS) 2023

Abstract
We propose a vision-centric task framework based on large language models (LLMs). By treating images as a form of language and aligning vision tasks with language tasks—which can be flexibly defined and managed through linguistic instructions—this framework provides a unified perspective for both vision and language tasks. VisionLLM enables task customization at various levels via language instructions, ranging from fine-grained object-level to coarse-grained task-level customization. It achieves over 60% mAP on COCO, comparable to specialized detection models.
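
The snippet below is a hypothetical illustration of phrasing a detection task as a language instruction with discretized coordinate outputs (the prompt wording, separator tokens, and bin count are invented for illustration and are not the paper's exact interface):

```python
# An instruction defines the task; the model answers with quantized box tokens.
instruction = (
    "<image> Detect every object belonging to the categories {person, dog}. "
    "For each object, output '<class> <x1> <y1> <x2> <y2>' with coordinates "
    "quantized to integers in [0, 999]."
)

# A model following this instruction might emit a sequence such as:
output_tokens = "person 103 250 412 871 <sep> dog 530 604 780 902 <eos>"

def parse_boxes(tokens: str, image_w: int, image_h: int, num_bins: int = 1000):
    """Map quantized box tokens back to pixel coordinates."""
    scale = lambda v, size: int(v) / (num_bins - 1) * size
    boxes = []
    for obj in tokens.replace("<eos>", "").split("<sep>"):
        cls, x1, y1, x2, y2 = obj.split()
        boxes.append((cls, scale(x1, image_w), scale(y1, image_h),
                      scale(x2, image_w), scale(y2, image_h)))
    return boxes

print(parse_boxes(output_tokens, image_w=640, image_h=480))
```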

Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment

Qiang Chen*, Xiaokang Chen*, Jian Wang, Haocheng Feng, Junyu Han, Errui Ding, Gang Zeng, Jingdong Wang (* equal contribution)

International Conference on Computer Vision (ICCV) 2023

Group DETR v2: Strong Object Detector with Encoder-Decoder Pretraining

Baidu Research Team

Technical Report 2023

The first model to achieve 64.5 mAP on the COCO test set leaderboard.

Uncovering and Quantifying Social Biases in Code Generation

Yan Liu, Xiaokang Chen#, Yan Gao, Zhe Su, Fengji Zhang, Daoguang Zan, Jian-Guang LOU, Pin-Yu Chen, Tsung-Yi Ho (# corresponding author)

Neural Information Processing Systems (NeurIPS) 2023

Interactive Segment Anything NeRF with Feature Imitation

Xiaokang Chen*, Jiaxiang Tang*, Diwen Wan, Jingbo Wang, Gang Zeng (* equal contribution)

Technical Report 2023

Parallel Sentence-Level Explanation Generation For Real-World Low-Resource Scenarios

Yan Liu, Xiaokang Chen, Qi Dai

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023

Delicate Textured Mesh Recovery from NeRF via Adaptive Surface Refinement

Jiaxiang Tang, Hang Zhou, Xiaokang Chen, Tianshu Hu, Errui Ding, Jingdong Wang, Gang Zeng

International Conference on Computer Vision (ICCV) 2023

Uncovering and Categorizing Social Biases in Text-to-SQL

Yan Liu, Yan Gao, Zhe Su, Xiaokang Chen, Elliott Ash, Jian-Guang LOU

Annual Meeting of the Association for Computational Linguistics (ACL) 2023

Understanding Self-Supervised Pretraining with Part-Aware Representation Learning

Jie Zhu*, Jiyang Qi*, Mingyu Ding*, Xiaokang Chen, Ping Luo, Xinggang Wang, Wenyu Liu, Leye Wang, Jingdong Wang (* equal contribution)

Transactions on Machine Learning Research (TMLR) 2023

CAE v2: Context Autoencoder with CLIP Target

Xinyu Zhang, Jiahui Chen, Junkun Yuan, Qiang Chen, Jian Wang, Xiaodi Wang, Shumin Han, Xiaokang Chen, Jimin Pi, Kun Yao, Junyu Han, Errui Ding, Jingdong Wang

Transactions on Machine Learning Research (TMLR) 2023

2022

Not All Voxels Are Equal: Semantic Scene Completion from the Point-Voxel Perspective

Xiaokang Chen, Jiaxiang Tang, Jingbo Wang, Gang Zeng

AAAI Conference on Artificial Intelligence (AAAI) 2022

Conditional DETR V2: Efficient Detection Transformer with Box Queries

Xiaokang Chen, Fangyun Wei, Gang Zeng, Jingdong Wang

Technical Report 2022

Compressible-composable NeRF via Rank-residual Decomposition

Jiaxiang Tang, Xiaokang Chen, Jingbo Wang, Gang Zeng

Neural Information Processing Systems (NeurIPS) 2022

Point Scene Understanding via Disentangled Instance Mesh Reconstruction

Jiaxiang Tang, Xiaokang Chen, Jingbo Wang, Gang Zeng

European Conference on Computer Vision (ECCV) 2022

MaskGroup: Hierarchical Point Grouping and Masking for 3D Instance Segmentation

Min Zhong, Xinghao Chen, Xiaokang Chen, Gang Zeng, Yunhe Wang

IEEE International Conference on Multimedia and Expo (ICME) 2022

2021

Conditional DETR for Fast Training Convergence

Xiaokang Chen*, Depu Meng*, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, Jingdong Wang (* equal contribution)

International Conference on Computer Vision (ICCV) 2021

Abstract
We solve the slow convergence of Detection Transformer (DETR) with our Conditional Spatial Query method. DETR converges slowly because it struggles to find key extremity regions of an object (e.g., an elephant's feet, back, or trunk), which are vital for accurate localization and recognition. Our method explicitly finds these extremity regions in space, constrains the search area, and speeds up DETR's convergence by 6-10x. This was one of the first works to address DETR's slow training, inspiring many later algorithms like DAB-DETR and DINO.
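
A toy sketch of the conditional spatial query (dimensions and the single attention step below are simplified for illustration; the actual model applies this inside every decoder cross-attention layer): the spatial part of each query is generated from its reference point and modulated by the content embedding, so attention concentrates on regions near that point.

```python
import math
import torch
import torch.nn as nn

def sine_embed(xy, dim=128):
    """2D sinusoidal positional encoding for normalized (x, y) points."""
    freqs = torch.pow(10000.0, torch.arange(dim // 4) / (dim // 4))
    pos = xy.unsqueeze(-1) / freqs                                # (..., 2, dim/4)
    return torch.cat([pos.sin(), pos.cos()], dim=-1).flatten(-2)  # (..., dim)

dim, num_queries, num_tokens = 128, 4, 16
content_q = torch.randn(num_queries, dim)                # content queries (decoder self-attn output)
ref_points = torch.rand(num_queries, 2)                  # reference points in [0, 1]^2
T = nn.Linear(dim, dim)                                  # transformation conditioned on content

spatial_q = T(content_q) * sine_embed(ref_points, dim)   # conditional spatial query
content_k = torch.randn(num_tokens, dim)                 # encoder memory (content keys)
spatial_k = sine_embed(torch.rand(num_tokens, 2), dim)   # encoder positional encodings

# Concatenating content and spatial parts keeps the two attention terms separate,
# so the spatial term can localize extremity regions near the reference point.
q = torch.cat([content_q, spatial_q], dim=-1)
k = torch.cat([content_k, spatial_k], dim=-1)
attn = torch.softmax(q @ k.t() / math.sqrt(q.shape[-1]), dim=-1)  # (queries, tokens)
```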

CPS: Semi-Supervised Semantic Segmentation with Cross Pseudo Supervision

Xiaokang Chen, Yuhui Yuan, Gang Zeng, Jingdong Wang

IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2021

Abstract
This work proposes a simple and efficient semi-supervised semantic segmentation algorithm that enforces consistency between the two branches of a dual-branch network using online-generated pseudo-labels. The approach achieves excellent semi-supervised performance without any threshold-based filtering. It significantly outperforms contemporary semi-supervised segmentation algorithms on the PASCAL VOC 2012 and Cityscapes datasets, including Google's PseudoSeg (ICLR 2021), and has become a key baseline in the field of semi-supervised segmentation. The paper has garnered over 1,000 citations and was featured in a list of highly cited AI papers of 2021.
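
A minimal sketch of the cross-pseudo-supervision loss, assuming two tiny convolutional branches as stand-ins for the two segmentation networks (the trade-off weight and toy data below are placeholders; the full method also applies the CPS loss to labeled images and uses stronger augmentation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_branch(num_classes=21):
    return nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, num_classes, 1))

net_a, net_b = make_branch(), make_branch()      # same architecture, different init

labeled = torch.randn(2, 3, 64, 64)
labels = torch.randint(0, 21, (2, 64, 64))
unlabeled = torch.randn(2, 3, 64, 64)

logits_a_l, logits_b_l = net_a(labeled), net_b(labeled)
logits_a_u, logits_b_u = net_a(unlabeled), net_b(unlabeled)

# Online pseudo-labels: each branch's hard argmax prediction, gradients detached,
# with no confidence thresholding.
pseudo_a = logits_a_u.detach().argmax(dim=1)
pseudo_b = logits_b_u.detach().argmax(dim=1)

sup_loss = F.cross_entropy(logits_a_l, labels) + F.cross_entropy(logits_b_l, labels)
cps_loss = F.cross_entropy(logits_a_u, pseudo_b) + F.cross_entropy(logits_b_u, pseudo_a)
loss = sup_loss + 1.5 * cps_loss    # cross-supervision weight is a tunable hyperparameter
loss.backward()
```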

Joint Implicit Image Function for Guided Depth Super-Resolution

Jiaxiang Tang, Xiaokang Chen, Gang Zeng

ACM Multimedia (ACM MM) 2021

2020

Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation

Xiaokang Chen, Kwan-Yee Lin, Jingbo Wang, Wayne Wu, Chen Qian, Hongsheng Li, Gang Zeng

European Conference on Computer Vision (ECCV) 2020

3D Sketch-aware Semantic Scene Completion via Semi-supervised Structure Prior

Xiaokang Chen, Kwan-Yee Lin, Chen Qian, Gang Zeng, Hongsheng Li

IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020

Real-time Semantic Scene Completion Via Feature Aggregation and Conditioned Prediction

Xiaokang Chen, Yajie Xing, Gang Zeng

International Conference on Image Processing (ICIP) 2020

2019

2.5D Convolution for RGB-D Semantic Segmentation

Yajie Xing, Jingbo Wang, Xiaokang Chen, Gang Zeng

International Conference on Image Processing (ICIP) 2019

Coupling Two-Stream RGB-D Semantic Segmentation Network by Idempotent Mappings

Yajie Xing, Jingbo Wang, Xiaokang Chen, Gang Zeng

International Conference on Image Processing (ICIP) 2019
