Deploying machine learning models at the edge presents a difficult trade-off: we need fast inference to meet real-time requirements, but keeping models ready consumes expensive GPU resources. Current auto-scaling solutions wait until traffic increases before spinning up full GPU instances, which causes noticeable delays when users hit the system. Meanwhile, keeping everything running during quiet periods wastes money. We developed Edge-Aware Model Warmup to tackle both problems. Our approach uses a three-layer system that keeps lightweight, compressed versions of models always ready on edge nodes. When traffic patterns suggest a spike is coming, we automatically swap in the full model. During normal periods, the compressed surrogates handle requests efficiently. In our tests, this reduced latency by more than 40% during traffic bursts compared to standard auto-scaling. We also cut GPU-hour costs by over 25% on Azure Edge Zones. The compressed models maintain above 90% accuracy during the transition period. The whole system adds minimal overhead—just a heartbeat monitor and cache management—so it works well even in resource-limited edge environments.
Building similarity graph...
Analyzing shared references across papers
Loading...
Avgustin Chynarbekov
Ala-Too International University
Building similarity graph...
Analyzing shared references across papers
Loading...
Avgustin Chynarbekov (Wed,) studied this question.
synapsesocial.com/papers/69a75bf4c6e9836116a24355 — DOI: https://doi.org/10.5281/zenodo.18401884