What does this research mean for the field?

Collaborative overload prevention, stream-based state-machine replication, and a framework for hybrid protocols enhance the reliability, maintainability, and resilience of small-scale state-machine replication systems.

What question did this study set out to answer?

The main aim is to enhance the practicality of state-machine replication in small-scale systems by addressing reliability, maintenance, and resilience.

March 16, 2026 · Open Access

Practical State-Machine Replication for Small-Scale Systems

Key Points

Identified key requirements for small-scale state-machine replication.
Proposed collaborative overload prevention to manage load and minimize tail latency.
Introduced stream-based state-machine replication to simplify implementation and maintainability.
Developed a framework for tailoring hybrid protocols for resilience against complex faults.
Collaborative overload prevention reduced tail latency under overload, providing stable response times.
Stream-based methodology lowered complexity in state-machine replication implementation.
Hybrid protocols adapted to specific use cases improved the system's resilience without excessive resource usage.

Abstract

State-machine replication is a popular technique for providing fault tolerance to important services. For core infrastructure services, such as coordination services or key-value stores used by other systems, the de facto industry standard is to use small-scale systems that provide crash fault tolerance. While such systems are based on well-researched theoretical properties, the practical aspects of the deployment also have a strong influence on their behavior and effectiveness. In this thesis, I consider three requirements for small-scale state-machine replication: reliability, low maintenance, and resilience. While all three are essential for state-machine replication, especially for core services, they are often not met by state-of-the-art solutions. I present three approaches that address these requirements in practical, small-scale systems.

Reliability is an important factor for core services; it includes not only fault tolerance but also a stable response time. However, the latter is impeded by tail latency, a common issue in state-machine replication, which results in sudden and unexpected increases in latency. To mitigate tail latency caused by system overload, I propose a technique called collaborative overload prevention. Here, replicas monitor the system’s load and independently accept or reject incoming requests. This allows the system to maintain a manageable load and provide stable latency. Explicit rejection notifications also enable the client to react to impending overload, for example, by activating a fallback computation.

The implementation of state-machine replication is considered notoriously complex and tedious. Since most existing state-machine replication systems are designed as their own applications, this also affects their maintainability. To reduce the effort of implementing and deploying state-machine replication, I propose the novel approach of stream-based state-machine replication. The idea is to design an agreement protocol as an application that runs on a stream-processing framework. Such frameworks are often readily available in data centers and handle the generic infrastructure tasks, such as replica deployment and communication, for the protocol. This already reduces the complexity of the protocol implementation, and stream-based state-machine replication can benefit further from additional features offered by the framework, such as the automatic recovery of crashed nodes.

When a system is particularly important or vulnerable, it is desirable to increase its resilience against faults that go beyond pure crashes. Since full-fledged Byzantine fault tolerance is often considered very expensive, hybrid protocols promise a balance between increased resilience and adequate resource consumption and performance. However, existing hybrid protocols often require special hardware or target specific deployment scenarios, which is why I propose a framework to tailor hybrid protocols to custom use cases. By using the paradigm of micro replication, the framework allows the system’s administrator to flexibly choose which parts of the system should consider a Byzantine fault model. In addition to this selective hybridization, the approach also enables selective diversification, which further enhances the system’s resilience by making diversification of replicas both more affordable and more effective.

Experimental evaluation of each of the presented approaches confirms that it improves the system with respect to its targeted requirement while still yielding practical state-machine replication systems with adequate performance.
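
As a rough illustration of the collaborative overload prevention idea described in the abstract, the following minimal sketch shows a replica that accepts or rejects requests based on its local load, and a client that switches to a fallback on an explicit rejection. Everything here is an assumption made for illustration: the names `Replica`, `offer`, `complete`, and `submit`, the in-flight counter as the load metric, and the fixed threshold are hypothetical and are not taken from the thesis, which may use a different load signal and decision rule.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical replica that admits requests based on its local load."""
    max_in_flight: int      # assumed load threshold, not from the thesis
    in_flight: int = 0      # requests accepted but not yet completed

    def offer(self, request: str) -> bool:
        """Accept the request if load is manageable, else reject explicitly."""
        if self.in_flight >= self.max_in_flight:
            return False    # stands in for an explicit rejection notification
        self.in_flight += 1
        return True

    def complete(self) -> None:
        """Mark one accepted request as finished, freeing capacity."""
        if self.in_flight > 0:
            self.in_flight -= 1


def submit(replica: Replica, request: str) -> str:
    """Client side: use a fallback computation when the request is rejected."""
    if replica.offer(request):
        return f"replicated:{request}"
    return f"fallback:{request}"


replica = Replica(max_in_flight=2)
results = [submit(replica, r) for r in ("a", "b", "c")]
replica.complete()                      # one accepted request finishes
results.append(submit(replica, "d"))    # capacity is available again
print(results)
# ['replicated:a', 'replicated:b', 'fallback:c', 'replicated:d']
```

In this sketch the third request is rejected because two requests are already in flight, so the client falls back instead of queuing behind them. Rejecting early rather than queuing is what keeps the load bounded in this simplified model.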
