Student Publications [Scholarly]

Revisiting State Machine Replication in Practice: Lessons from Building an etcd-inspired System

Document Type

Conference Proceeding

Abstract

State Machine Replication (SMR) is a foundational technique for building fault-tolerant distributed systems. It underpins infrastructure across cloud platforms, databases, and microservices, yet remains surprisingly difficult to implement efficiently in the real world. While prior work, both in academia and in industry technical blogs, has explored individual components such as consensus protocols or deployment techniques, there is still no clear, integrated guide for building high-performance SMR systems end-to-end. In this paper, we revisit SMR from a practical perspective by building a high-performance SMR system inspired by etcd. Rather than modifying etcd directly, which integrates tightly coupled components across multiple versions, we isolate and recompose its foundational building blocks, including its Raft library, write-ahead log, and BoltDB. This approach allows us to systematically evaluate performance bottlenecks and design tradeoffs in a more modular setting. Based on this experience, we propose a reference architecture that generalizes to a wide class of SMR systems and serves as a practical framework for engineers and researchers. Using simple, well-understood techniques such as parallelization and quorum reads, we demonstrate throughput gains of 3x under representative workloads while preserving fault-tolerance guarantees. © 2025 Copyright is held by the owner/author(s). Publication rights licensed to ACM.

Publication Title

SoCC 2025 - Proceedings of the 2025 ACM Symposium on Cloud Computing

Publication Date

1-13-2026

First Page

456

Last Page

463

ISBN

9798400722769

DOI

10.1145/3772052.3772246

Keywords

etcd, performance, practical lessons, state machine replication
