Although the computational performance of supercomputers continues to improve, scaling performance with the number of nodes remains difficult because of long inter-node communication latency. Many attempts have been made to hide communication latency and maintain strong scalability, even for dense matrix multiplication, which is an ideal candidate for benchmarking supercomputer performance. The Cerebras CS-2 system is a deep learning accelerator built around the world's largest chip, the Wafer-Scale Engine 2 (WSE-2). The WSE-2 can be regarded as a distributed-memory system with 745,500 processing elements (PEs) connected in a low-latency 2-D mesh topology. This paper reports the effective maximum performance and the weak and strong scaling performance of single-precision matrix multiplication on the CS-2, and proposes a performance model for it. We observed a maximum performance of 349.0 TFlops/s (matrix size: 33,000×33,000; PEs used: 750×750), a performance per watt of 79.66 GFlops/W, and a weak scaling efficiency of 1.00. The mean absolute percentage error between our performance model and the actual measurements was 9.2%.
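The reported metrics can be related to one another with a few lines of arithmetic. The following Python sketch is illustrative only and is not the authors' code: it assumes the conventional 2n³ operation count for an n×n dense matrix multiplication and the standard definitions of mean absolute percentage error (MAPE) and weak scaling efficiency. The function names are hypothetical, and the abstract does not state which conventions the paper actually uses.

```python
def gemm_flops(n: int) -> int:
    """Operation count of an n x n dense matrix multiplication.

    Assumption: the conventional 2 * n^3 count (one multiply and one add
    per inner-product term); the abstract does not state the convention.
    """
    return 2 * n**3


def implied_seconds(n: int, tflops_per_s: float) -> float:
    """Run time implied by a sustained rate given in TFlops/s."""
    return gemm_flops(n) / (tflops_per_s * 1e12)


def mape(predicted: list[float], measured: list[float]) -> float:
    """Mean absolute percentage error of a model against measurements."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [abs(p - m) / abs(m) for p, m in zip(predicted, measured)]
    return 100.0 * sum(errors) / len(errors)


def weak_scaling_efficiency(t_base: float, t_scaled: float) -> float:
    """Ratio of baseline time to scaled time at fixed work per PE.

    A value of 1.00 means the run time did not grow as PEs were added.
    """
    return t_base / t_scaled


if __name__ == "__main__":
    # Figures quoted in the abstract: n = 33,000 at 349.0 TFlops/s.
    t = implied_seconds(33_000, 349.0)
    print(f"implied run time: {t:.4f} s")
```

Under these assumptions, the quoted 349.0 TFlops/s at a matrix size of 33,000×33,000 corresponds to a run time of roughly 0.2 s per multiplication.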
Takaaki Miyajima
Ryosuke Matsuzaki
Daichi Mukunoki
Journal of Information Processing
Nagoya University
Meiji University
www.synapsesocial.com/papers/69a7611ec6e9836116a2ebaa — DOI: https://doi.org/10.2197/ipsjjip.34.132