What question did this study set out to answer?

The study aims to improve audio-driven portrait animation by strengthening identity consistency and facial detail fidelity.

April 7, 2026 | Open Access

Identity-consistent and high-fidelity audio-driven portrait animation with enhanced latent diffusion

Key Points

The work targets identity consistency and facial detail fidelity in audio-driven portrait animation.
The method uses the Hallo architecture as its backbone.
It introduces Multi-Source Self Attention (MSSA) to improve the interaction between the reference portrait and the denoising U-Net.
It proposes Unit-wise Spectral-Blend Temporal Attention (U-SBTA), a plug-and-play, training-free method for capturing local facial details.
The generated animations preserve identity better than those of previous methods.
The synthesized videos show improved background consistency and facial detail fidelity.
Qualitative and quantitative evaluations indicate that the proposed method outperforms previous methods.
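
The page does not describe how MSSA is built, so the following is only a generic sketch of the broader pattern the name suggests: self-attention in which the denoising tokens attend to themselves and to reference-portrait tokens at once. It is not the authors' MSSA. The function name, the token shapes, and the omission of learned query/key/value projections are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_source_self_attention(latent_tokens, reference_tokens):
    """Hypothetical sketch: queries come from the latent tokens, keys and
    values come from the latent AND reference tokens concatenated.

    latent_tokens:    (N, C) tokens of the frame being denoised
    reference_tokens: (M, C) tokens extracted from the reference portrait
    Learned projections are omitted (an assumption, to keep the sketch short).
    """
    channels = latent_tokens.shape[-1]
    keys_values = np.concatenate([latent_tokens, reference_tokens], axis=0)
    scores = latent_tokens @ keys_values.T / np.sqrt(channels)
    weights = softmax(scores, axis=-1)          # (N, N + M)
    return weights @ keys_values, weights       # (N, C), (N, N + M)

# Usage with random stand-in features.
rng = np.random.default_rng(0)
latent = rng.standard_normal((6, 4))
reference = rng.standard_normal((3, 4))
out, attn = multi_source_self_attention(latent, reference)
```

In this arrangement every latent token can draw directly on the reference tokens at each attention layer, which is one plausible way an architecture could keep identity and background tied to the reference portrait.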

Abstract

Speech-driven portrait animation models have made significant progress in generating realistic and dynamic portrait animations. End-to-end latent diffusion approaches, represented by Hallo, achieve impressive alignment between audio inputs and visual outputs, including lip movements, expressions, and head poses. However, because the interaction between the reference portrait information and the denoising U-Net is suboptimally designed in such architectures, certain frames in the output video suffer from inconsistent identity and background preservation. Moreover, the temporal attention in the temporal module aggregates information across all frames of each generation unit to capture overall motion trends but ignores shorter frame subsequences within the unit, and so loses fine-grained details between adjacent frames. To address these problems, we take the end-to-end latent diffusion model Hallo as the backbone and construct a Multi-Source Self Attention (MSSA) to optimize the interaction between the reference portrait's identity information and the denoising U-Net. We also propose a plug-and-play, training-free method, Unit-wise Spectral-Blend Temporal Attention (U-SBTA), which simultaneously captures local high-frequency facial details from shorter frame subsequences within each generation unit, thereby improving facial fidelity in the synthesized portrait videos. We evaluate our method qualitatively and quantitatively on a public dataset and on datasets we collected. The results show that the portrait animation videos generated by our method better preserve identity and background consistency with the reference portrait and exhibit superior facial detail fidelity.
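
The abstract gives only the idea behind U-SBTA, not its formulation, so the sketch below should be read as one possible interpretation rather than the paper's method. It assumes that "spectral blend" means combining, along the frame axis, the low temporal frequencies of attention computed over the whole generation unit with the high temporal frequencies of attention restricted to short windows. The names `temporal_attention`, `spectral_blend`, `usbta_sketch`, and the parameters `window` and `cutoff` are hypothetical, and learned projections are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax; masked (-inf) scores get zero weight."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(unit, window=None):
    """Self-attention along the frame axis of `unit`, shape (frames, channels).

    With window=None each frame attends to every frame in the unit.
    With an integer window each frame attends only to frames at most
    window // 2 steps away (a short subsequence around it).
    """
    frames, channels = unit.shape
    scores = unit @ unit.T / np.sqrt(channels)
    if window is not None:
        idx = np.arange(frames)
        too_far = np.abs(idx[:, None] - idx[None, :]) > window // 2
        scores = np.where(too_far, -np.inf, scores)
    return softmax(scores, axis=-1) @ unit

def spectral_blend(global_feat, local_feat, cutoff):
    """Take temporal frequencies <= cutoff (cycles per frame) from
    global_feat and all higher frequencies from local_feat."""
    frames = global_feat.shape[0]
    g_spec = np.fft.rfft(global_feat, axis=0)
    l_spec = np.fft.rfft(local_feat, axis=0)
    is_low = (np.fft.rfftfreq(frames) <= cutoff)[:, None]
    return np.fft.irfft(np.where(is_low, g_spec, l_spec), n=frames, axis=0)

def usbta_sketch(unit, window=4, cutoff=0.25):
    """Blend unit-wide and short-window temporal attention for one unit."""
    global_feat = temporal_attention(unit)                # overall motion trend
    local_feat = temporal_attention(unit, window=window)  # adjacent-frame detail
    return spectral_blend(global_feat, local_feat, cutoff)

# Usage: one generation unit of 16 frames with 8 feature channels.
rng = np.random.default_rng(0)
unit = rng.standard_normal((16, 8))
blended = usbta_sketch(unit)
```

Because the sketch only recombines the outputs of attention passes and introduces no new weights, it is consistent with the "plug-and-play, training-free" description, though the actual windowing and filter used in the paper may differ.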

Cite this study

Ma et al. studied this question.

www.synapsesocial.com/papers/69d49fc5b33cc4c35a228317 — DOI: https://doi.org/10.1038/s41598-026-46445-6

Authors

Xiangwen Ma

Jiaxin Zhao

Xiaoyu Huang

Journals

Scientific Reports

Institutions

Northeast Normal University

Heilongjiang University

Institute of Applied Physics and Computational Mathematics
