What type of study is this?

This is a quantitative study.

October 20, 2025 · Open Access

HieraTok: Multi-Scale Visual Tokenizer Improves Image Reconstruction and Generation

Key Points

HieraTok improves rFID by 27.2% (1.47 → 1.07) over its single-scale counterpart under identical settings.
Integrated into downstream generation frameworks, it converges 1.38× faster and improves gFID by 18.9% (16.4 → 13.3).
With scaled-up tokenizer training, HieraTok reaches an rFID of 0.45 and a gFID of 1.82, state of the art among ViT tokenizers.
These results indicate significant potential for multi-scale approaches in visual generation applications.
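
The percentage figures follow from the FID values reported in the abstract, where a lower FID is better. As a quick arithmetic check (the helper name `relative_improvement` is ours, not the paper's):

```python
def relative_improvement(before: float, after: float) -> float:
    """Percent reduction from `before` to `after` (lower FID is better)."""
    return (before - after) / before * 100.0


# FID values as reported in the abstract.
rfid_gain = relative_improvement(1.47, 1.07)   # reconstruction FID
gfid_gain = relative_improvement(16.4, 13.3)   # generation FID

print(f"rFID improvement: {rfid_gain:.1f}%")
print(f"gFID improvement: {gfid_gain:.1f}%")
```

Both results round to the reported 27.2% and 18.9%.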

Abstract

In this work, we present HieraTok, a novel multi-scale Vision Transformer (ViT)-based tokenizer that overcomes the inherent limitation of modeling single-scale representations. This is realized through two key designs: (1) multi-scale downsampling applied to the token map generated by the tokenizer encoder, producing a sequence of multi-scale tokens, and (2) a scale-causal attention mechanism that enables the progressive flow of information from low-resolution global semantic features to high-resolution structural details. Coupling these designs, HieraTok achieves significant improvements in both image reconstruction and generation tasks. Under identical settings, the multi-scale visual tokenizer outperforms its single-scale counterpart with a 27.2% improvement in rFID (1.47 → 1.07). When integrated into downstream generation frameworks, it achieves a 1.38× faster convergence rate and an 18.9% boost in gFID (16.4 → 13.3), which may be attributed to the smoother and more uniformly distributed latent space. Furthermore, by scaling up the tokenizer's training, we demonstrate its potential with a state-of-the-art rFID of 0.45 and a gFID of 1.82 among ViT tokenizers. To the best of our knowledge, we are the first to introduce a multi-scale ViT-based tokenizer for image reconstruction and image generation. We hope our findings and designs advance ViT-based tokenizers in visual generation tasks.
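
The two designs named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the downsampling operator, the scale factors, or whether tokens may attend within their own scale. The sketch assumes average pooling, factors of 4, 2, and 1 on a 4×4 token map, scalar tokens instead of feature vectors, and a mask that lets each token attend to its own scale and all coarser ones. All function names are hypothetical.

```python
def avg_pool(grid, factor):
    """Average-pool a square 2D grid of floats by an integer factor."""
    n = len(grid) // factor
    return [
        [
            sum(grid[i * factor + di][j * factor + dj]
                for di in range(factor) for dj in range(factor)) / factor ** 2
            for j in range(n)
        ]
        for i in range(n)
    ]


def multi_scale_tokens(grid, factors):
    """Flatten pooled grids coarse-to-fine; also return each token's scale index."""
    tokens, scale_ids = [], []
    for scale, factor in enumerate(factors):
        for row in avg_pool(grid, factor):
            tokens.extend(row)
            scale_ids.extend([scale] * len(row))
    return tokens, scale_ids


def scale_causal_mask(scale_ids):
    """mask[q][k] is True when query q may attend to key k (same or coarser scale)."""
    return [[k <= q for k in scale_ids] for q in scale_ids]


# Toy 4x4 "token map" with one scalar per position (real tokens are vectors).
grid = [[float(4 * r + c) for c in range(4)] for r in range(4)]
tokens, scale_ids = multi_scale_tokens(grid, factors=[4, 2, 1])
mask = scale_causal_mask(scale_ids)

print(len(tokens))                   # 1 + 4 + 16 tokens across three scales
print(sum(mask[0]), sum(mask[-1]))   # coarsest token sees 1 key, finest sees all
```

Under these assumptions, information can only flow from coarse scales to fine ones, which matches the abstract's description of a progression from low-resolution global features to high-resolution details.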

Cite this study

Chen et al. studied this question.

www.synapsesocial.com/papers/68f6379bb481a140a36cf4e8 — DOI: https://doi.org/10.48550/arxiv.2509.23736

Also consider

Synapse has enriched 5 closely related papers on similar questions. Consider them for comparative context:

An Image is Worth 32 Tokens for Reconstruction and Generation · 2024 · 4 citations
OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation · 2024
LookupViT: Compressing visual information to a limited number of tokens · 2024
ViTAR: Vision Transformer with Any Resolution · 2024 · 3 citations
HSViT: Horizontally Scalable Vision Transformer · 2024 · 2 citations

Authors

Cong Chen

Ziyuan Huang

Cheng Zou
