Speech-driven portrait animation models have made significant progress in generating realistic and dynamic portrait animations. End-to-end latent diffusion paradigms, represented by Hallo, achieve impressive alignment accuracy between audio inputs and visual outputs, covering lip movements, expressions, and head poses. However, because the interaction between reference portrait information and the denoising U-Net is suboptimally designed in such architectures, certain frames in the output video sequences suffer from inconsistencies in identity and background preservation. Moreover, the temporal attention in the temporal module aggregates information across all frames within each generation unit to capture overall motion trends but ignores shorter frame subsequences within the unit, and consequently loses fine-grained details between adjacent frames. To address these problems, we take the end-to-end latent diffusion paradigm Hallo as the backbone and construct a Multi-Source Self Attention (MSSA) to optimize the interaction between reference portrait identity information and the denoising U-Net. In addition, we propose a plug-and-play, training-free method, Unit-wise Spectral-Blend Temporal Attention (U-SBTA), which simultaneously captures local high-frequency facial details from shorter frame subsequences within each generation unit, thereby improving facial fidelity in synthesized portrait videos. We comprehensively evaluate our method on a public dataset and on datasets we collected, using both qualitative and quantitative analysis. The results demonstrate that the portrait animation videos generated by our method better preserve identity and background consistency with the reference portrait and exhibit superior facial detail fidelity.
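The abstract does not spell out how U-SBTA blends its two attention paths, so the following is only a toy sketch of one plausible reading, not the authors' implementation. It assumes that attention over the whole generation unit supplies the low temporal frequencies (overall motion), that attention over shorter windows supplies the high temporal frequencies (fine detail), and that the two are merged in the frequency domain along the time axis. The function names, the `window` and `cutoff` parameters, and the choice of `q = k = v` with no learned projections are all hypothetical.

```python
import cmath
import math


def self_attention(frames):
    """Softmax self-attention across frames; q = k = v = frames (assumption)."""
    dim = len(frames[0])
    out = []
    for q in frames:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in frames]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, frames)) / total
                    for c in range(dim)])
    return out


def windowed_attention(frames, window):
    """Run self_attention independently on consecutive shorter subsequences."""
    out = []
    for start in range(0, len(frames), window):
        out.extend(self_attention(frames[start:start + window]))
    return out


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def spectral_blend(global_out, local_out, cutoff):
    """Keep temporal frequencies <= cutoff from global_out, the rest from local_out."""
    n, dim = len(global_out), len(global_out[0])
    blended = [[0.0] * dim for _ in range(n)]
    for c in range(dim):
        g = dft([row[c] for row in global_out])
        l = dft([row[c] for row in local_out])
        # Bins k and n - k carry the same frequency, so the mask is symmetric
        # and the inverse transform of real inputs stays real.
        mixed = [g[k] if min(k, n - k) <= cutoff else l[k] for k in range(n)]
        for t, value in enumerate(idft(mixed)):
            blended[t][c] = value.real
    return blended


# Hypothetical generation unit: 8 frames, 3 feature channels per frame.
frames = [[math.sin(0.3 * t), math.cos(0.7 * t), 0.1 * t] for t in range(8)]
global_out = self_attention(frames)          # whole-unit temporal attention
local_out = windowed_attention(frames, 4)    # shorter-subsequence attention
blended = spectral_blend(global_out, local_out, cutoff=1)
```

Because the blend only post-processes attention outputs, a scheme of this shape needs no extra training, which is consistent with the plug-and-play, training-free description; the actual frequency split and window sizes used by the authors are not given in the abstract.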
Ma et al. studied this question.
www.synapsesocial.com/papers/69d49fc5b33cc4c35a228317 — DOI: https://doi.org/10.1038/s41598-026-46445-6
Xiangwen Ma
Jiaxin Zhao
Xiaoyu Huang
Scientific Reports
Northeast Normal University
Heilongjiang University
Institute of Applied Physics and Computational Mathematics