March 3, 2026

Open Access

Single-precision Matrix Multiplication Performance on Cerebras CS-2: Evaluation and Modelling of Performance, Scalability and Energy Efficiency

Key Points

Maximum measured performance was 349.0 TFlops/s for single-precision matrix multiplication (matrix size 33,000×33,000 on 750×750 PEs).
Measurements showed a weak scaling efficiency of 1.00 across matrix sizes, which is linked to the architecture of the CS-2.
The evaluation covers matrix multiplication on the Cerebras CS-2, whose WSE-2 chip provides 745,500 processing elements.
Inter-node communication latency is highlighted as a significant factor limiting performance scaling in supercomputers.

Abstract

Although recent supercomputers have been improving their computational performance, achieving performance scaling with respect to the number of nodes is not easy due to long inter-node communication latency. Many attempts have been made to hide communication latency and maintain strong scalability even for dense matrix multiplication. Matrix multiplication is an ideal candidate for benchmarking the performance of supercomputers. The Cerebras CS-2 system is an accelerator for deep learning with the world's largest chip, the wafer-scale engine 2 (WSE-2). The WSE-2 can be considered a distributed memory system that comes with 745,500 processing elements (PEs) connected in a low-latency 2-D mesh topology. This paper presents the effective maximum performance and the weak and strong scaling performance, and proposes a performance model for single-precision matrix multiplication on the CS-2. We observed a maximum performance of 349.0 TFlops/s (matrix size: 33,000×33,000, used PEs: 750×750), a performance per watt of 79.66 GFlops/W, and a weak scaling efficiency of 1.00. The mean absolute percentage error between our performance model and the actual measurement was 9.2%.
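To make the headline figures easier to interpret, the following is a minimal Python sketch of the arithmetic behind them. It is not the authors' code or performance model. It assumes the conventional 2N³ operation count for a dense N×N product, and it assumes that the performance-per-watt figure refers to the same run as the peak performance; neither assumption is confirmed by the abstract. The function names (`matmul_flops`, `weak_scaling_efficiency`, `mape`) are hypothetical helpers introduced only for illustration, using the standard textbook definitions of each metric.

```python
"""Back-of-the-envelope figures derived from numbers quoted in the abstract."""


def matmul_flops(n: int) -> float:
    """Operation count for a dense n x n by n x n product.

    Assumption: the usual 2 * n^3 convention (one multiply and one add per term).
    """
    return 2.0 * n ** 3


def weak_scaling_efficiency(t_base: float, t_scaled: float) -> float:
    """Standard definition: baseline time divided by the time measured when
    the problem size grows in proportion to the number of PEs."""
    return t_base / t_scaled


def mape(measured, predicted) -> float:
    """Mean absolute percentage error between measurements and model output."""
    errors = [abs((m - p) / m) for m, p in zip(measured, predicted)]
    return 100.0 * sum(errors) / len(errors)


# Values quoted in the abstract.
PEAK_FLOPS = 349.0e12      # 349.0 TFlops/s
N = 33_000                 # matrix size 33,000 x 33,000
PES_USED = 750 * 750       # 750 x 750 processing elements
PERF_PER_WATT = 79.66e9    # 79.66 GFlops/W

# Derived quantities (valid only under the assumptions stated above).
implied_time_s = matmul_flops(N) / PEAK_FLOPS    # time for one multiplication
per_pe_gflops = PEAK_FLOPS / PES_USED / 1e9      # average throughput per PE
implied_power_w = PEAK_FLOPS / PERF_PER_WATT     # power if both figures share a run

if __name__ == "__main__":
    print(f"PEs used:          {PES_USED}")
    print(f"Implied run time:  {implied_time_s:.4f} s")
    print(f"Per-PE throughput: {per_pe_gflops:.4f} GFlops/s")
    print(f"Implied power:     {implied_power_w:.0f} W")
```

Under these assumptions, one 33,000×33,000 multiplication takes roughly 0.2 s, each of the 562,500 PEs in use sustains about 0.62 GFlops/s, and the implied power draw is on the order of 4.4 kW. These are derived estimates, not values reported in the abstract.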

Authors

Takaaki Miyajima

Ryosuke Matsuzaki

Daichi Mukunoki

Journals

Journal of Information Processing

Institutions

Nagoya University

Meiji University
