What question did this study set out to answer?

This research aims to develop a practical method for creating high-quality 3D head avatars using simple monocular video input.

May 8, 2026

Expressive Head Avatar Modeling from Monocular Video of Neutral Expression

Key Points

Introduced R² Avatar for generating expressive 3D head avatars using a reconstruction-by-restoration strategy.
Employed a geometry-guided warping module to synthesize coarse expressions from neutral input.
Utilized a restoration module to refine results and recover high-frequency facial details.
Achieved realistic avatars with improved expression diversity compared to baseline methods.
Demonstrated improved view consistency in the generated avatars compared to baseline methods.

Abstract

We study the reconstruction of high-quality 3D head avatars. Our goal is to reduce the reliance on dense capture data required by most existing approaches, which limits their practicality. Recent advances have attempted to address this using single or few input images by either training a prior model or fine-tuning multi-view diffusion models to generate pseudo training points. However, these methods fall short in producing multi-view-consistent, high-fidelity results aligned with the input data. This motivates us to explore a more practical and user-friendly input setting. Modern smartphone features such as Apple's Face ID already guide users to slowly rotate their heads in front of a single camera, enabling the capture of facial data across varying viewpoints with minimal effort. This simple and intuitive scanning motion has become a widely accepted user habit and provides sufficient geometric information, highlighting a natural opportunity for 3D head avatar creation from monocular videos of neutral expression. In this paper, we introduce R² Avatar, a lightweight and user-friendly framework for generating expressive 3D head avatars under this setting. Our method adopts a Reconstruction-by-Restoration strategy that avoids large-scale model pretraining while achieving high-quality animatable avatars. Specifically, a geometry-guided warping module first synthesizes coarse expression variations from the neutral input. Then, a restoration module refines the warped results by recovering high-frequency facial details, including the mouth interior, with the help of a data-driven 2D animation prior. These restored images serve as supervision targets to optimize the final avatar. Experiments demonstrate that our method produces realistic avatars with improved expression diversity and view consistency compared to baseline approaches. Our code and data will be released upon acceptance.

Authors

Qing Chang

Yao-Xiang Ding

Kun Zhou

Journals

IEEE Transactions on Visualization and Computer Graphics

Institutions

Zhejiang University
