Running large language models (LLMs) locally on hardware with severely constrained memory is a fundamental challenge in broadening access to advanced AI. Existing local inference systems rely predominantly on passive memory-mapped file access, a strategy that provably degrades to near-random I/O when available RAM is far smaller than the model's storage footprint. In this paper we develop the theoretical foundations of a framework designed to detect infeasible configurations before loading, prevent uncontrolled memory growth, and degrade gracefully under memory pressure, so that explicitly supported GGUF architectures can run with reduced speed while maintaining host responsiveness, even on machines with as little as 2 GB of RAM and no GPU. We formalise a four-tier memory hierarchy, derive sufficient feasibility conditions under an explicit simplified memory-budget model for layer-streaming inference, prove throughput bounds under double-buffered asynchronous I/O, and present an importance-guided eviction heuristic for multi-tier key-value caches. A central contribution is the extension of the KV cache theory to cover three fundamentally distinct attention architectures in current open-weight models: grouped-query attention, Multi-Head Latent Attention, and hybrid local/global attention. We further extend Mixture-of-Experts analysis to fine-grained expert segmentation, establish a formal architecture compatibility matrix, and derive distinct cache-efficiency metrics for MoE expert caches. Together these results provide a rigorous foundation for local LLM inference on constrained consumer hardware.
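To make the idea of detecting infeasible configurations before loading concrete, the following is a minimal sketch of a pre-load check for layer-streaming inference. It is not the paper's actual condition: the budget model (two layer buffers for double buffering, plus KV cache and scratch space, against RAM minus a host reservation) and all names (`Budget`, `streaming_footprint`, `is_feasible`) are assumptions made purely for illustration.

```python
from dataclasses import dataclass

MiB = 1024 ** 2
GiB = 1024 ** 3


@dataclass
class Budget:
    """Hypothetical simplified memory budget; every field is an assumption."""
    ram_bytes: int            # total physical RAM
    reserved_bytes: int       # kept free for the host OS and other processes
    largest_layer_bytes: int  # size of the largest streamed layer
    kv_cache_bytes: int       # RAM-resident part of the KV cache
    scratch_bytes: int        # activations and temporary buffers


def streaming_footprint(b: Budget) -> int:
    # Double buffering keeps two layers resident at once:
    # one being computed and one being prefetched.
    return 2 * b.largest_layer_bytes + b.kv_cache_bytes + b.scratch_bytes


def is_feasible(b: Budget) -> bool:
    # Sufficient (not necessary) condition under this simplified model.
    return streaming_footprint(b) <= b.ram_bytes - b.reserved_bytes


small = Budget(2 * GiB, 768 * MiB, 200 * MiB, 256 * MiB, 128 * MiB)
large = Budget(2 * GiB, 768 * MiB, 600 * MiB, 256 * MiB, 128 * MiB)
print(is_feasible(small), is_feasible(large))
```

Running such a check before any weights are mapped lets a loader refuse a configuration outright instead of discovering the problem through swapping or an out-of-memory kill.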
Long Nguyen (Sun,) studied this question.
www.synapsesocial.com/papers/6a02c380ce8c8c81e9640cc4 — DOI: https://doi.org/10.5281/zenodo.20110705