What type of study is this?

This is an experimental study.

September 29, 2025 · Open Access

FlatAttention: Dataflow and Fabric Collectives Co-Optimization for Efficient Multi-Head Attention on Tile-Based Many-PE Accelerators

Key Points

FlatAttention achieves up to 89.3% utilization and a 4.1x speedup over the FlashAttention-3 dataflow on tile-based accelerators.
It reduces HBM traffic by 16x, and the co-explored accelerator configuration requires 40% less HBM bandwidth than the Nvidia H100.
Algorithm-architecture co-exploration identifies an optimal configuration: a 32x32 tile mesh with 1024 TFLOPS peak FP16 performance, comparable to the H100.
This configuration enables a 1.8x reduction in die size, estimated on the same technology node.

Abstract

Multi-Head Attention (MHA) is a critical computational kernel in transformer-based AI models. Emerging scalable tile-based accelerator architectures integrate increasing numbers of tightly-packed processing elements (PEs) with tensor units. MHA dataflow mapping is crucial for achieving high utilization of the available units. We propose FlatAttention, a new dataflow for MHA on tile-based many-PE accelerators, minimizing costly main memory (HBM) accesses by leveraging collective primitives integrated into the on-chip network fabric. FlatAttention achieves up to 89.3% utilization, and 4.1x performance speedup over FlashAttention-3 dataflow on tile-based accelerators whilst reducing HBM traffic by 16x. Through algorithm-architecture co-exploration, we identify an optimal configuration for a large scaled-out tile-based accelerator featuring a 32x32 tile mesh with 1024 TFLOPS @ FP16 peak performance, comparable to the state-of-the-art Nvidia H100 GPU. FlatAttention in this configuration achieves up to 1.3x higher utilization over FlashAttention-3 on the H100 GPU. Meanwhile, this tile-based accelerator configuration requires 40% less HBM bandwidth compared to the H100, enabling a 1.8x reduction in die size, estimated on the same technology node.
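As a rough illustration of why sharing data across tiles can cut HBM traffic, the sketch below uses a simplified tiled-attention traffic model. It is not the paper's model or implementation: the function name, the sequence length, head dimension, block size, and group size are all hypothetical values chosen for illustration, and the model assumes K and V are re-read from HBM once per block of query rows. The only figures taken from the abstract are the 32x32 mesh and the 1024 TFLOPS peak.

```python
def hbm_traffic_elems(seq_len, head_dim, block_rows):
    """Elements moved to/from HBM for one attention head (simplified model).

    Assumption: Q is read once and O is written once, while K and V are
    re-read once for every block of `block_rows` query rows.
    """
    num_q_blocks = -(-seq_len // block_rows)  # ceiling division
    q_and_o = 2 * seq_len * head_dim
    k_and_v = 2 * seq_len * head_dim * num_q_blocks
    return q_and_o + k_and_v


# Hypothetical sizes for illustration only (not taken from the paper).
SEQ_LEN, HEAD_DIM, TILE_BLOCK_ROWS = 4096, 128, 64
GROUP_SIZE = 8  # hypothetical: 8 tiles share each K/V block via on-chip multicast

# Each tile works alone: K/V are fetched from HBM once per small query block.
per_tile = hbm_traffic_elems(SEQ_LEN, HEAD_DIM, TILE_BLOCK_ROWS)

# A group of tiles acts like one larger block: K/V are fetched once per group.
grouped = hbm_traffic_elems(SEQ_LEN, HEAD_DIM, TILE_BLOCK_ROWS * GROUP_SIZE)

reduction = per_tile / grouped
print(f"HBM traffic reduction in this toy model: {reduction:.2f}x")

# Per-tile peak implied by the abstract: 1024 TFLOPS spread over a 32x32 mesh.
per_tile_peak_tflops = 1024 / (32 * 32)
print(f"Peak per tile: {per_tile_peak_tflops} TFLOPS (FP16)")
```

In this toy model the reduction approaches the group size as the sequence grows, because K/V re-reads dominate the traffic. The 16x figure reported in the abstract comes from the authors' own evaluation and is not reproduced by this sketch.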

Cite this study

Zhang et al. (2025) studied this question.

www.synapsesocial.com/papers/68da58d8c1728099cfd111ec — DOI: https://doi.org/10.48550/arxiv.2505.18824

Also consider

Synapse has enriched closely related papers on similar research questions. Consider them for comparative context:

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision · 2024 · 16 citations
HLX: A Unified Pipelined Architecture for Optimized Performance of Hybrid Transformer-Mamba Language Models · 2025 · 1 citation
Optimizing Attention by Exploiting Data Reuse on ARM Multi-core CPUs · 2024 · 7 citations
Optimizing Attention for Large Language Model Inference on the MT-3000 Many-Core Processor · 2026

Authors

Chi Zhang

Luca Colagrande

Renzo Andri
