Abstract

Low-bit-rate image compression faces a persistent quality-efficiency dilemma: lightweight models such as VQ-VAE produce perceptually degraded reconstructions, while high-quality alternatives like VQGAN and diffusion models incur prohibitive computational costs. To bridge this gap, we propose HiRes-VQ, a lightweight, perceptually guided VQ-VAE that achieves high-fidelity reconstruction without sacrificing efficiency. Built upon VQ-VAE-2’s hierarchical quantization, HiRes-VQ introduces two key innovations: (1) an asymmetric encoder-decoder architecture, in which the encoder hierarchically extracts semantic features at multiple spatial scales and the decoder reconstructs low-frequency structures and high-frequency textures through separate frequency-domain pathways, together ensuring pixel-level fidelity; and (2) a multi-scale perceptual alignment loss that jointly optimizes pixel accuracy, semantic feature consistency, and style statistics, enabling perceptual-quality gains without compromising structural metrics. With only 3.21M parameters, HiRes-VQ achieves 18%–40% fidelity gains over similar-sized baselines on FFHQ-256 and ImageNet-256 across both pixel-level and semantic-level metrics, while surpassing high-complexity models such as VQGAN and OptVQ in the quality-efficiency trade-off. Ablation experiments confirm that the dual-path decoder and the perceptual loss play complementary roles, together yielding significant improvements in both pixel-level fidelity and semantic perceptual quality. These results demonstrate that HiRes-VQ effectively resolves the quality-efficiency dilemma, offering a practical solution for resource-constrained deployment.
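To make the three-term loss in (2) concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not give the exact form of each term, so everything here is an assumption: mean squared error for the pixel and feature terms, Gram-matrix differences as one common choice of "style statistics", a plain weighted sum over scales, and the hypothetical names `gram_matrix`, `perceptual_alignment_loss`, `w_pix`, `w_feat`, and `w_style`. The multi-scale feature maps are assumed to come from some fixed feature extractor, which the abstract does not name.

```python
import numpy as np


def gram_matrix(feat):
    """Channel-by-channel correlation of a (C, H, W) feature map."""
    c, h, w = feat.shape
    flat = feat.reshape(c, h * w)
    # Normalize by the number of elements so the scale is resolution-independent.
    return flat @ flat.T / (c * h * w)


def perceptual_alignment_loss(x, x_hat, feats, feats_hat,
                              w_pix=1.0, w_feat=1.0, w_style=1.0):
    """Weighted sum of pixel, feature, and style terms (assumed form).

    x, x_hat:         original and reconstructed images, same shape.
    feats, feats_hat: lists of (C, H, W) arrays, one per scale, computed
                      from x and x_hat by a fixed feature extractor
                      (hypothetical; not specified in the abstract).
    """
    # Pixel accuracy: mean squared error between the images.
    pixel = np.mean((x - x_hat) ** 2)
    # Semantic feature consistency: MSE between feature maps at each scale.
    feature = sum(np.mean((f - g) ** 2) for f, g in zip(feats, feats_hat))
    # Style statistics: MSE between Gram matrices at each scale.
    style = sum(np.mean((gram_matrix(f) - gram_matrix(g)) ** 2)
                for f, g in zip(feats, feats_hat))
    return w_pix * pixel + w_feat * feature + w_style * style
```

Under these assumptions, the loss is zero when the reconstruction and its features match the original exactly, and each weight controls how strongly pixel, semantic, or style mismatches are penalized.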
Bie et al. studied this question.