What question did this study set out to answer?

The study analyses how much tokenisation matters in language model architecture and what it implies for model performance.

March 18, 2026 | Open Access

The Architecture of Language Discretisation

Key Points

Investigated historical evolution of tokenisation methods
Conducted a controlled experiment on a literary corpus
Compared whitespace-based tokenisation with the GPT-2 BPE tokeniser
Examined the implications of tokenisation for cognitive and computational performance
Whitespace-based tokenisation achieved better compression than the GPT-2 BPE tokeniser
Highlighted architectural flaws in BPE for whitespace handling
Revealed cognitive blind spots affecting arithmetic reasoning
Identified financial and computational disadvantages for non-English users
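
The whitespace-versus-BPE comparison in the points above can be illustrated with a small sketch. It is not the study's experiment: `toy_bpe`, `apply_merge`, the sample sentence, and the merge budget are all hypothetical, and the toy trainer merges characters directly (spaces included) rather than reproducing GPT-2's pre-tokenisation or vocabulary. It only shows how token counts for the two approaches can be compared on the same text.

```python
from collections import Counter


def whitespace_tokenise(text):
    """Split on runs of whitespace; each word becomes one token."""
    return text.split()


def apply_merge(symbols, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def toy_bpe(text, num_merges):
    """Toy BPE (assumption: no pre-tokenisation, spaces are ordinary symbols).

    Repeatedly merges the most frequent adjacent pair, stopping early
    once no pair occurs at least twice.
    """
    symbols = list(text)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter(zip(symbols, symbols[1:]))
        if not pair_counts:
            break
        best_pair, count = pair_counts.most_common(1)[0]
        if count < 2:
            break
        merges.append(best_pair)
        symbols = apply_merge(symbols, best_pair)
    return merges, symbols


# Hypothetical sample text, not taken from the study's literary corpus.
sample = "the cat sat on the mat and the cat ran"
word_tokens = whitespace_tokenise(sample)
merges, bpe_tokens = toy_bpe(sample, num_merges=10)

print(len(sample), "characters")
print(len(word_tokens), "whitespace tokens")
print(len(bpe_tokens), "toy BPE tokens after", len(merges), "merges")
```

One caveat when reading such counts: whitespace splitting discards the spacing itself and needs an open-ended vocabulary, whereas the toy BPE output can be joined back into the exact input. A lower token count alone therefore does not make one tokeniser better than the other.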

Abstract

This analysis investigates the role of tokenisation (the conversion of continuous text into discrete numerical representations) within the architecture of large language models (LLMs). It argues that this process, far from being a simple preprocessing step, fundamentally dictates a model's computational efficiency, morphological understanding, and inherent linguistic biases. The analysis begins by tracing the historical evolution from semantically rich but vocabulary-heavy word-level tokenisation to computationally expensive character-level methods, establishing the context for the rise of subword algorithms such as Byte-Pair Encoding (BPE).

In a controlled experiment on a literary corpus, a naive whitespace-based tokeniser achieves better compression than the GPT-2 BPE tokeniser. This counterintuitive result exposes architectural flaws in how BPE handles whitespace and fragments text. The analysis then examines the tangible consequences of tokenisation: cognitive blind spots in models (such as the inability to count characters in a word), degraded arithmetic reasoning, and a significant "multilingual token tax" that disadvantages non-English users both financially and computationally. It also challenges the long-held assumption that maximal text compression yields superior model performance, citing research showing that overly optimised tokenisers can underperform because they violate natural linguistic structures.

In response to these limitations, the analysis surveys two proposed solutions: SuperBPE, which captures multi-word expressions by allowing merges across spaces, and the more radical Byte Latent Transformer (BLT) paradigm, which aims to eliminate tokenisation entirely by processing raw byte streams. The work concludes that the future of the field lies in moving beyond the rigid constraints of static subword vocabularies toward more flexible, equitable, and architecturally native systems that better reflect the fluid reality of human language.
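
As a rough illustration of the "multilingual token tax" and the character-counting blind spot described in the abstract, the sketch below treats raw UTF-8 bytes as the smallest possible tokens. This is an assumption made for illustration: real subword tokenisers merge bytes into larger units, so actual token counts differ, and the helper name `utf8_byte_tokens` and both sample strings are hypothetical rather than taken from the paper.

```python
def utf8_byte_tokens(text):
    """Return the UTF-8 byte values of `text`, one integer per byte."""
    return list(text.encode("utf-8"))


# Hypothetical examples: a short English word and a common Hindi greeting.
english = "hello"
hindi = "नमस्ते"

english_bytes = utf8_byte_tokens(english)
hindi_bytes = utf8_byte_tokens(hindi)

# ASCII letters take one byte each; Devanagari code points take three,
# so the same number of characters costs more at the byte level.
print(len(english), "code points ->", len(english_bytes), "bytes")
print(len(hindi), "code points ->", len(hindi_bytes), "bytes")

# Counting letters is trivial when the characters themselves are visible.
# A model that receives only subword token IDs does not see them directly.
r_count = "strawberry".count("r")
print("letter 'r' in 'strawberry':", r_count)
```

Under this byte-level view, scripts outside the ASCII range start from longer sequences before any merges are learned, which is one plausible source of the cost gap the abstract describes.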

Authors

Partha Majumdar

Institutions

Swiss School of Public Health

Kalinga University
