ABSTRACT Image inpainting aims to restore missing regions in a visually plausible and semantically coherent manner. Despite notable advances, existing deep learning approaches still face key limitations: heavy Transformer‐based architectures, unstable generative architectures, diffusion models with high computational cost, training pipelines that overlook the heterogeneous difficulty of diverse mask patterns, and the absence of explicit mechanisms to ensure smooth transitions along mask boundaries, the most perceptually sensitive area of the reconstruction. To address these challenges, we introduce CA‐Fill, a lightweight two‐stage encoder–decoder framework that balances global structure recovery and fine‐grained texture refinement. By jointly integrating structural perceptual progression and optimization progression, the proposed method realizes a dual‐progressive perceptual alignment strategy: it explicitly emphasizes boundary transition regions while progressively aligning training difficulty with the model's learning capacity. This design yields smoother boundary transitions, improved structural consistency, and enhanced perceptual realism under a lightweight computational budget. Extensive experiments on public benchmarks demonstrate that CA‐Fill achieves competitive or superior performance compared with representative baselines on both pixel‐level and perceptual evaluation metrics, while maintaining a low parameter count and inference cost.
XIANG et al. (Thu,) studied this question.