May 2, 2024

Vision transformer with pre-positional embedding

Abstract

Vision Transformer (ViT) is a Transformer-based neural network architecture for image processing that has achieved state-of-the-art performance on various computer vision tasks. This study attempts to improve the input layer of ViT by changing how positional information is embedded. We propose a ViT with pre-positional embedding, which adds constants to each pixel before the input image is divided into patches. The method assumes that images have the following properties: they are vertically asymmetric, they are horizontally symmetric, and similar features are distributed concentrically around the image center. Experimental results demonstrate that the proposed method achieves the same image recognition accuracy as the conventional method with positional embedding while reducing the number of trainable parameters.
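The abstract does not give the actual constants, so the following is only a minimal sketch of the general idea, assuming a fixed per-pixel map built from a vertical ramp plus a radial distance term. The function names (pre_positional_map, patchify, embed_input), the scale parameter, and the specific formula are all hypothetical and chosen only so that the map satisfies the three stated assumptions: it differs between top and bottom, it is mirror-symmetric left to right, and it varies concentrically from the center.

```python
import numpy as np


def pre_positional_map(height, width, scale=0.1):
    """Hypothetical fixed per-pixel offsets (NOT the paper's actual constants)."""
    ys = np.linspace(-1.0, 1.0, height)[:, None]  # top row = -1, bottom row = +1
    xs = np.linspace(-1.0, 1.0, width)[None, :]   # left column = -1, right column = +1
    vertical = ys                                 # odd in y: top and bottom differ
    radial = np.sqrt(xs ** 2 + ys ** 2)           # distance from center: even in x and y
    return scale * (vertical + radial)            # broadcasts to shape (height, width)


def patchify(image, patch):
    """Split an (H, W, C) image into (num_patches, patch * patch * C) rows."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch"
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)                # group the two patch-grid axes first
    return x.reshape(-1, patch * patch * c)


def embed_input(image, patch):
    """Add the fixed map to every channel first, then cut the image into patches."""
    pos = pre_positional_map(image.shape[0], image.shape[1])
    return patchify(image + pos[:, :, None], patch)


image = np.zeros((32, 32, 3))
tokens = embed_input(image, patch=8)
print(tokens.shape)
```

For comparison, a conventional learned positional embedding stores one vector per token, that is (N + 1) x D trainable values for N patches plus a class token at embedding width D. The map in this sketch is computed rather than learned, so it adds no trainable parameters; whether the paper's constants are fully fixed or partly learned is not stated in the abstract.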

Cite this study

Eguchi et al. (2024) studied this question.

www.synapsesocial.com/papers/68e6bd48b6db64358763dea6 — DOI: https://doi.org/10.1117/12.3018012

Authors

Takuro Eguchi

Yoshimitsu Kuroki

Institutions

National Institute of Technology, Kurume College
