March 3, 2026 · Open Access

Edge-Aware Model Warmup for Latency-Optimized MLOps in Cloud Computing

Key Points

Our edge-aware model warmup approach reduced latency during traffic spikes by more than 40% compared to standard auto-scaling.
The compressed models maintained above 90% accuracy during the transition from compressed to full models.
Auto-scaling methods typically wait for traffic to increase before scaling up, which causes delays, while keeping everything running during quiet periods wastes resources; our system addresses both problems.
The solution adds minimal overhead (a heartbeat monitor and cache management), which makes it suitable for resource-limited edge environments and improves cost efficiency.

Abstract

Deploying machine learning models at the edge presents a difficult trade-off: we need fast inference to meet real-time requirements, but keeping models ready consumes expensive GPU resources. Current auto-scaling solutions wait until traffic increases before spinning up full GPU instances, which causes noticeable delays for users who arrive during the ramp-up. Meanwhile, keeping everything running during quiet periods wastes money. We developed Edge-Aware Model Warmup to tackle both problems. Our approach uses a three-layer system that keeps lightweight, compressed versions of models always ready on edge nodes. During normal periods, these compressed surrogates handle requests efficiently. When traffic patterns suggest a spike is coming, the system automatically swaps in the full model. In our tests, this reduced latency by more than 40% during traffic bursts compared to standard auto-scaling. We also cut GPU-hour costs by over 25% on Azure Edge Zones. The compressed models maintain above 90% accuracy during the transition period. The whole system adds minimal overhead (just a heartbeat monitor and cache management), so it works well even in resource-limited edge environments.
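
The routing idea in the abstract (serve from a compressed surrogate by default, swap in the full model when a spike looks likely) can be illustrated with a minimal Python sketch. The abstract does not say how spikes are predicted or what the three layers are, so everything here is an assumption made for illustration: the class name `WarmupController`, the ratio-to-baseline rule, the thresholds `spike_ratio` and `calm_ratio`, and the smoothing factor `alpha` are all hypothetical and are not the author's implementation. The sketch models only two tiers, surrogate and full.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WarmupController:
    """Hypothetical heartbeat-driven switch between a surrogate and a full model."""

    spike_ratio: float = 1.5   # assumed: traffic/baseline ratio that triggers warmup
    calm_ratio: float = 1.1    # assumed: ratio at or below which the full model is released
    alpha: float = 0.2         # assumed: smoothing factor for the quiet-period baseline
    baseline: Optional[float] = None
    full_model_warm: bool = False

    def active_tier(self) -> str:
        return "full" if self.full_model_warm else "surrogate"

    def heartbeat(self, requests_per_second: float) -> str:
        """Record one traffic sample and return the tier that should serve requests."""
        if self.baseline is None:
            # First sample: nothing to compare against yet.
            self.baseline = requests_per_second
            return self.active_tier()

        if self.baseline > 0:
            ratio = requests_per_second / self.baseline
        else:
            ratio = float("inf") if requests_per_second > 0 else 1.0

        # Two thresholds (hysteresis) avoid flapping between tiers.
        if not self.full_model_warm and ratio >= self.spike_ratio:
            self.full_model_warm = True
        elif self.full_model_warm and ratio <= self.calm_ratio:
            self.full_model_warm = False

        # Only quiet-period samples update the baseline, so a sustained
        # burst does not get absorbed into what counts as "normal".
        if not self.full_model_warm:
            self.baseline = (1 - self.alpha) * self.baseline + self.alpha * requests_per_second
        return self.active_tier()


# Made-up traffic trace (requests per second), for illustration only.
traffic = [100, 105, 98, 180, 220, 150, 104]
controller = WarmupController()
tiers = [controller.heartbeat(rps) for rps in traffic]
print(tiers)
```

With this made-up trace, the controller stays on the surrogate for the first three samples, switches to the full model when traffic jumps to 180, and returns to the surrogate once traffic falls back near the baseline. A real deployment would replace the ratio rule with whatever traffic-pattern predictor the system actually uses.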

Authors

Avgustin Chynarbekov

Institutions

Ala-Too International University
