November 1, 2022

DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

Abstract

The landscape of transformer model inference is increasingly diverse in model size, model characteristics, latency and throughput requirements, and hardware requirements. This diversity makes designing a versatile inference system challenging. DeepSpeed-Inference addresses these challenges with (1) a multi-GPU inference solution that minimizes latency while maximizing throughput for both dense and sparse transformers when the model fits in aggregate GPU memory, and (2) a heterogeneous inference solution that leverages CPU, NVMe, and GPU memory to enable high-throughput inference for models larger than aggregate GPU memory. DeepSpeed-Inference reduces latency by 6.4× and increases throughput by 1.5× over the state of the art. It enables trillion-parameter-scale inference under real-time latency constraints by leveraging hundreds of GPUs, an unprecedented scale for inference. It can run inference on models 25× larger than GPU-only solutions can, while delivering a high throughput of 84 TFLOPS (over 50% of A6000 peak).

Authors

Reza Yazdani Aminabadi

Samyam Rajbhandari

Ammar Ahmad Awan

Institutions

Microsoft (United States)

Cite this study

Aminabadi et al. (2022) studied this question.

www.synapsesocial.com/papers/6a08b5e4ad370a6b44de498f — DOI: https://doi.org/10.1109/sc41404.2022.00051
