In this work, we present HieraTok, a novel multi-scale Vision Transformer (ViT)-based tokenizer that overcomes the inherent limitation of modeling single-scale representations. This is realized through two key designs: (1) multi-scale downsampling applied to the token map generated by the tokenizer encoder, producing a sequence of multi-scale tokens, and (2) a scale-causal attention mechanism that enables the progressive flow of information from low-resolution global semantic features to high-resolution structural details. Coupling these designs, HieraTok achieves significant improvements in both image reconstruction and generation tasks. Under identical settings, the multi-scale visual tokenizer improves rFID by 27.2% over its single-scale counterpart (1.47 → 1.07). When integrated into downstream generation frameworks, it achieves a 1.38× faster convergence rate and an 18.9% improvement in gFID (16.4 → 13.3), which may be attributed to the smoother and more uniformly distributed latent space. Furthermore, by scaling up the tokenizer's training, we demonstrate its potential with a state-of-the-art rFID of 0.45 and a gFID of 1.82 among ViT tokenizers. To the best of our knowledge, we are the first to introduce a multi-scale ViT-based tokenizer for image reconstruction and image generation. We hope our findings and designs advance ViT-based tokenizers in visual generation tasks.
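The two designs named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the downsampling operator or the exact attention mask, so the sketch assumes average pooling, one scalar per token, coarse-to-fine ordering, and a block-causal rule in which a token may attend to tokens at its own scale or any coarser scale. The helper names `avg_pool`, `multi_scale_tokens`, and `scale_causal_mask` are hypothetical.

```python
def avg_pool(grid, factor):
    """Average-pool a square 2D grid (list of rows) by an integer factor.

    Assumption: average pooling stands in for the paper's downsampling.
    """
    n = len(grid) // factor
    return [
        [
            sum(grid[i * factor + di][j * factor + dj]
                for di in range(factor) for dj in range(factor)) / factor ** 2
            for j in range(n)
        ]
        for i in range(n)
    ]


def multi_scale_tokens(grid, factors):
    """Flatten pooled grids coarse-to-fine; return tokens and per-token scale ids."""
    tokens, scale_ids = [], []
    for scale, factor in enumerate(factors):
        for row in avg_pool(grid, factor):
            tokens.extend(row)
            scale_ids.extend([scale] * len(row))
    return tokens, scale_ids


def scale_causal_mask(scale_ids):
    """mask[q][k] is True when query token q may attend to key token k.

    Assumed rule: a token sees its own scale and all coarser scales.
    """
    return [[k_scale <= q_scale for k_scale in scale_ids] for q_scale in scale_ids]


# Toy 4x4 token map with one scalar per token (real tokens would be vectors).
token_map = [[float(4 * r + c) for c in range(4)] for r in range(4)]

# Pooling factors 4, 2, 1 give 1x1, 2x2, and 4x4 grids: 1 + 4 + 16 = 21 tokens.
tokens, scale_ids = multi_scale_tokens(token_map, factors=[4, 2, 1])
mask = scale_causal_mask(scale_ids)
```

In this sketch the single coarsest token attends only to itself, while each finest-scale token attends to all 21 tokens, which mirrors the coarse-to-fine information flow described above.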
Chen et al. studied this question.
www.synapsesocial.com/papers/68f6379bb481a140a36cf4e8 — DOI: https://doi.org/10.48550/arxiv.2509.23736
Cong Chen
Ziyuan Huang
Cheng Zou