April 27, 2026

Open Access

From Attention to Inference - A Technical Study of Large Language Models

Key Points

Provides a detailed technical overview of the Transformer architecture used in Natural Language Processing (NLP).
Examines core components of the Transformer architecture, such as self-attention, tokenization, and positional encoding.
Discusses the main model families: encoder-only, decoder-only, and encoder-decoder.
Analyzes system-level considerations such as KV caching, throughput, and VRAM usage.
Presents a structured reference for practitioners and students in machine learning and NLP.
Highlights the importance of pretraining and fine-tuning for effective inference.
Discusses Time to First Token as a key metric of inference performance.

Abstract

This paper presents a technical overview of the Transformer architecture and its role in modern Natural Language Processing (NLP). It examines the core components of the paradigm, including self-attention mechanisms, tokenization, positional encoding, model families (encoder-only, decoder-only, and encoder–decoder), pretraining objectives, fine-tuning, and inference processes. System-level considerations such as KV caching, Time to First Token, throughput, and VRAM usage are also discussed. The paper is intended as a structured technical reference for practitioners and students working in machine learning and NLP.

Authors

THOMAS SIOUMPALAS
