Many companies are deploying services, either for consumers or industry, which are largely based on machine-learning algorithms for sophisticated processing of large amounts of data. The state-of-the-art and most popular such machine-learning algorithms are Convolutional and Deep Neural Networks (CNNs and DNNs), which are known to be both computationally and memory intensive. A number of neural network accelerators have recently been proposed which offer a high ratio of computational capacity to area, but which remain hampered by memory accesses. However, unlike the memory wall faced by processors on general-purpose workloads, the memory footprint of CNNs and DNNs, while large, is not beyond the capacity of the on-chip storage of a multi-chip system. This property, combined with the algorithmic characteristics of CNNs and DNNs, can lead to high internal bandwidth and low external communication, which can in turn enable a high degree of parallelism at a reasonable area cost. In this article, we introduce a custom multi-chip machine-learning architecture along those lines. We show that, on a subset of the largest known neural network layers, it is possible to achieve a speedup of 450.65x over a GPU, and reduce the energy by 150.31x on average for a 64-chip system. We implement the node down to the place and route at 28 nm, containing a combination of custom storage and computational units, with industry-grade interconnects.
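The footprint argument in the abstract comes down to simple arithmetic, which the following minimal sketch illustrates. Every number in it is a hypothetical placeholder chosen for illustration (layer dimensions, 2 bytes per weight, 16 MiB of storage per chip); none is taken from the paper, and the helper names `layer_weight_bytes` and `chips_needed` are likewise invented here.

```python
import math


def layer_weight_bytes(n_inputs: int, n_outputs: int, bytes_per_weight: int = 2) -> int:
    """Bytes needed to hold the weights of one fully connected layer.

    Assumes one weight per (input, output) pair; bytes_per_weight is a
    hypothetical precision, not a figure from the paper.
    """
    return n_inputs * n_outputs * bytes_per_weight


def chips_needed(footprint_bytes: int, storage_per_chip_bytes: int) -> int:
    """Smallest number of chips whose combined on-chip storage holds the footprint."""
    return math.ceil(footprint_bytes / storage_per_chip_bytes)


# Hypothetical layer: 4096 inputs x 4096 outputs at 2 bytes per weight.
footprint = layer_weight_bytes(4096, 4096)   # 33,554,432 bytes (32 MiB)

# Hypothetical per-chip on-chip storage: 16 MiB.
per_chip = 16 * 1024 * 1024

print(chips_needed(footprint, per_chip))     # prints 2
```

Under these assumed values the layer does not fit on one chip but fits across two, which is the shape of the claim: a footprint too large for a single chip can still sit entirely in the aggregate on-chip storage of a multi-chip system.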
Yunji Chen
Tao Luo
Shaoli Liu
Institut national de recherche en sciences et technologies du numérique
Inner Mongolia University
Chen et al. (2014) studied this question.
www.synapsesocial.com/papers/6a00c40cef8139f8ff77a66e — DOI: https://doi.org/10.1109/micro.2014.58