What question did this study set out to answer?

The aim is to establish theoretical foundations for efficient local inference of large language models on limited hardware.

May 12, 2026 · Open Access

Theoretical Foundations for Memory-Hierarchical Local Inference of Large Language Models

Key Points

The aim is to establish theoretical foundations for efficient local inference of large language models on limited hardware.
Formalised a four-tier memory hierarchy, with infeasible configurations detected before a model is loaded.
Derived sufficient feasibility conditions for layer-streaming inference under an explicit simplified memory-budget model.
Extended KV cache theory to grouped-query attention, Multi-Head Latent Attention, and hybrid local/global attention, and derived distinct cache-efficiency metrics for MoE expert caches.
Established a formal architecture compatibility matrix to guide resource allocation in constrained environments.
Proved throughput bounds for inference under memory constraints with double-buffered asynchronous I/O.
Presented an importance-guided eviction heuristic for multi-tier key-value caches.

Abstract

Running large language models (LLMs) locally on hardware with severely constrained memory is a fundamental challenge in broadening access to advanced AI. Existing local inference systems rely predominantly on passive memory-mapped file access, a strategy that provably degrades to near-random I/O when available RAM is far smaller than the model's storage footprint. In this paper we develop the theoretical foundations of a framework designed to detect infeasible configurations before loading, prevent uncontrolled memory growth, and degrade gracefully under memory pressure, so that explicitly supported GGUF architectures can run with reduced speed while maintaining host responsiveness, even on machines with as little as 2 GB of RAM and no GPU. We formalise a four-tier memory hierarchy, derive sufficient feasibility conditions under an explicit simplified memory-budget model for layer-streaming inference, prove throughput bounds under double-buffered asynchronous I/O, and present an importance-guided eviction heuristic for multi-tier key-value caches. A central contribution is the extension of the KV cache theory to cover three fundamentally distinct attention architectures in current open-weight models: grouped-query attention, Multi-Head Latent Attention, and hybrid local/global attention. We further extend Mixture-of-Experts analysis to fine-grained expert segmentation, establish a formal architecture compatibility matrix, and derive distinct cache-efficiency metrics for MoE expert caches. Together these results provide a rigorous foundation for local LLM inference on constrained consumer hardware.
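To make the abstract's terms concrete, the following is a minimal sketch of what a memory-budget feasibility check and a double-buffered throughput estimate could look like. It is not the paper's model: the names `ModelSpec`, `streaming_feasible`, and `tokens_per_second_bound`, the two-layer buffer budget, and the `reserve_bytes` term are all assumptions made for illustration. The only standard formula used is the grouped-query attention KV cache size (keys plus values, per layer, per KV head).

```python
from dataclasses import dataclass


@dataclass
class ModelSpec:
    """Hypothetical model description; field names are illustrative."""
    n_layers: int
    layer_bytes: int            # weight bytes of one transformer layer
    n_kv_heads: int             # KV heads (smaller than query heads under GQA)
    head_dim: int
    kv_bytes_per_elem: int = 2  # assumes an fp16 KV cache


def kv_cache_bytes(m: ModelSpec, seq_len: int) -> int:
    # Keys and values (factor 2) for every layer, KV head, and position.
    return 2 * m.n_layers * m.n_kv_heads * m.head_dim * seq_len * m.kv_bytes_per_elem


def streaming_feasible(m: ModelSpec, seq_len: int,
                       ram_bytes: int, reserve_bytes: int) -> bool:
    # Assumed budget: two layer buffers (one computing, one being prefetched),
    # the full KV cache, and a reserve that keeps the host responsive.
    need = 2 * m.layer_bytes + kv_cache_bytes(m, seq_len) + reserve_bytes
    return need <= ram_bytes


def tokens_per_second_bound(m: ModelSpec, disk_bytes_per_s: float,
                            layer_compute_s: float) -> float:
    # With double buffering, loading layer i+1 overlaps computing layer i,
    # so each layer costs max(io, compute) instead of io + compute.
    io_s = m.layer_bytes / disk_bytes_per_s
    return 1.0 / (m.n_layers * max(io_s, layer_compute_s))


MIB = 2 ** 20
spec = ModelSpec(n_layers=32, layer_bytes=100 * MIB, n_kv_heads=8, head_dim=128)
fits = streaming_feasible(spec, seq_len=2048,
                          ram_bytes=2048 * MIB, reserve_bytes=1024 * MIB)
```

Under these assumed numbers the check is a conservative, sufficient condition only: a configuration that fails it might still run with a smaller context or a quantised KV cache, which this sketch does not model.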

Cite this study

Long Nguyen (Sun,) studied this question.

www.synapsesocial.com/papers/6a02c380ce8c8c81e9640cc4 — DOI: https://doi.org/10.5281/zenodo.20110705
