Introduction
In the last post, we implemented the leader election mechanism of Raft. In this post, we will implement log replication.
Log replication is a critical component of the Raft consensus algorithm. It ensures that log entries are consistently replicated across a majority of servers, maintaining the integrity and consistency of the system state.
How Does Log Replication Work?
For example, consider a K/V server cluster built on the Raft consensus protocol. When a client sends a put request to the K/V server, the server creates a log entry containing the request and forwards it to the Raft leader. The leader then replicates the log entry to its followers. Once the log entry is replicated to a majority of servers (more than half), the leader commits the log entry and applies the put request to store the key-value pair in the K/V server.
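To make this concrete, here is a rough sketch of the application-layer flow on top of the lab's Raft interface (Start, ApplyMsg, and the apply channel). The KVServer and PutCommand types and their wiring are hypothetical, shown only to illustrate the commit-then-apply pattern.

```go
// Hypothetical application-layer types; only Start, ApplyMsg, and the apply
// channel come from the Raft layer.
type PutCommand struct{ Key, Value string }

type KVServer struct {
	rf      *Raft             // the Raft instance of this server
	applyCh chan ApplyMsg     // Raft delivers committed entries here
	store   map[string]string // the replicated key/value state
}

// Put hands the command to Raft; only the leader accepts it.
func (kv *KVServer) Put(key, value string) bool {
	_, _, isLeader := kv.rf.Start(PutCommand{Key: key, Value: value})
	return isLeader // the entry is applied only after Raft commits it
}

// applyLoop consumes committed entries and applies them to the K/V state.
func (kv *KVServer) applyLoop() {
	for msg := range kv.applyCh {
		if cmd, ok := msg.Command.(PutCommand); ok {
			kv.store[cmd.Key] = cmd.Value
		}
	}
}
```

Note that Put only submits the command; the state change becomes visible to clients only after the corresponding entry is committed and delivered on the apply channel.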
Handling Leader Crashes
If the leader crashes before the log entry is committed, there are three possible scenarios:
- A majority of followers have the log entry. In this case, a new leader that has the log entry will be elected, and log replication will continue seamlessly.
- A majority of followers do not have the log entry, but a new leader that has the log entry is elected. The new leader will replicate the log entry to the followers, ensuring consistency.
- A majority of followers do not have the log entry, and a new leader without the log entry is elected. In this situation, some followers may have the log entry while others do not. Followers that have the entry will truncate it, so the entry is never committed. This is not a problem: the client never received an acknowledgment for the operation, so it knows the request may have failed and can handle it appropriately.
When the old leader eventually comes back online, it will become a follower and replicate the log entries from the new leader to catch up with the current state of the system.
The key takeaway is that log replication ensures that log entries are consistently replicated across a majority of servers, maintaining the integrity and consistency of the system state.
Handling Network Partitions
Consider a Raft cluster with five servers: S1, S2, S3, S4, and S5. The leader is S1, with S2, S3, S4, and S5 as followers. S1 and S2 are in one network segment (rack), while S3, S4, and S5 are in another. Suddenly, the switch connecting the two segments fails, causing a network partition. The servers are split into two groups: [S1, S2] and [S3, S4, S5].
In this situation, S3, S4, and S5 will elect a new leader, such as S3, while S2 will still believe S1 is the leader. Log entries sent to S1 will not be committed because a majority of servers do not receive them. However, log entries sent to S3 will be committed.
Now suppose that when the switch is fixed and the partition heals, S3 goes offline. S1 will still think it is the leader, but after sending AppendEntries RPCs to the other servers, it will realize its term is lower than theirs and step down to a follower, triggering a new leader election. The new leader will not be S1 or S2, because their logs are missing the entries committed during the partition; it will be S4 or S5. S1 and S2 will then truncate any conflicting log entries and synchronize with the current system state.
If the servers are partitioned into three or more groups with no group holding a majority (e.g., [S1, S2], [S3, S4], [S5]), and S5 is the old leader, log entries sent to S5 will not be committed. The other groups will repeatedly attempt to elect a leader but will fail. Once the network partition is healed, S5 will first become a follower. It might be elected leader again, or it might not; either way, since the log entries sent to S5 were never committed, they can be safely truncated.
How Does Voting Avoid Unnecessary Log Truncation?
Imagine a more complex scenario: after a series of network partitions and leader crashes, all issues are eventually resolved. However, due to many rounds of leader elections, each server now has different uncommitted log entries. The challenge is to minimize the number of log entries that need to be truncated to restore consistency.
Before voting in a leader election, a follower must check if the candidate’s log is at least as up-to-date as its own log. The follower uses the following rules:
- Compare terms: The follower first compares the term of the candidate's last log entry with the term of its own last entry. A higher last term wins, because it indicates more recent log entries.
- Compare log indexes: If the last terms are equal, the follower compares the last log indexes. If the candidate's last log index is greater than or equal to the follower's, the follower will vote for the candidate.
- Reject otherwise: If the candidate's log is not up-to-date (a lower last term, or the same term with a smaller last index), the follower rejects the vote request.
By following these rules, only candidates with the most up-to-date logs can become the leader. This approach helps avoid unnecessary log truncation and ensures the system converges to a consistent state more efficiently.
Summary
In summary, in a Raft cluster with 2F + 1 servers, the system can tolerate up to F server failures and still function correctly. If more than F servers fail, the system will stop working.
Implementing Log Replication
States
First, we must add log entries to the Raft state. We can define a LogEntry struct to represent a log entry. Each log entry contains a term and a command: the term is the term in which the entry was created, and the command is provided by the application layer.
commitIndex is the index of the highest log entry known to be committed, and lastApplied is the index of the highest log entry applied to the application state machine.
nextIndex[i] is the index of the next log entry to send to server i, and matchIndex[i] is the index of the highest log entry known to be replicated on server i. Only the leader maintains these two arrays.
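Here is a minimal sketch of this state, with field names following Figure 2 of the Raft paper; the exact layout, the ServerState type, and the election-related fields carried over from the previous post may differ in your implementation (sync and the lab's labrpc package are assumed to be imported).

```go
// LogEntry is a single entry in the replicated log.
type LogEntry struct {
	Term    int         // term in which the entry was created
	Command interface{} // command supplied by the application layer
}

// Raft shows only the fields relevant to log replication, plus a few
// carried over from the leader-election post (state, currentTerm, votedFor).
type Raft struct {
	mu    sync.Mutex
	peers []*labrpc.ClientEnd // RPC endpoints of all servers
	me    int                 // this server's index into peers
	state ServerState         // Follower, Candidate, or Leader

	currentTerm int
	votedFor    int
	log         []LogEntry // log[0] is a placeholder so indexes start at 1

	commitIndex int // index of the highest entry known to be committed
	lastApplied int // index of the highest entry applied to the state machine

	// Leader-only state, reinitialized after each election.
	nextIndex  []int // for each server, index of the next entry to send
	matchIndex []int // for each server, highest entry known to be replicated

	applyCh   chan ApplyMsg // committed entries are delivered here
	applyCond *sync.Cond    // created with sync.NewCond(&rf.mu); wakes the applier
}
```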
AppendEntries RPC
Implementing the AppendEntries RPC is a bit more involved. The leader sends AppendEntries RPCs to followers to replicate log entries, and each follower responds with success or failure depending on whether the entries can be appended to its log. We must handle conflicting log entries and ensure that the system converges to a consistent state.
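One possible shape for the request and reply messages, following Figure 2 of the Raft paper plus the two conflict fields used by the optimization described below; the author's actual definitions may differ slightly.

```go
// AppendEntriesArgs follows Figure 2 of the Raft paper.
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderId     int        // so followers can redirect clients
	PrevLogIndex int        // index of the entry immediately preceding Entries
	PrevLogTerm  int        // term of the entry at PrevLogIndex
	Entries      []LogEntry // entries to replicate (empty for heartbeat)
	LeaderCommit int        // leader's commitIndex
}

type AppendEntriesReply struct {
	Term    int  // follower's currentTerm, for the leader to update itself
	Success bool // true if the follower matched PrevLogIndex/PrevLogTerm

	// Extra fields for the conflict optimization described below.
	ConflictIndex int
	ConflictTerm  int
}
```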
The leader sends AppendEntries RPCs to its followers periodically. The request includes the leader's term, the leader ID, the previous log index and term, the log entries to replicate, and the leader's commit index. The follower checks the term and updates its state accordingly: if the request's term is higher than the follower's current term, the follower transitions to the follower state and updates its term; if the term is lower, the follower rejects the request.
Handle Conflicting Log Entries
As an optimization, we can add two fields to the AppendEntriesReply struct: ConflictIndex and ConflictTerm. If the follower detects a conflict, it sets these fields to the index and term of the conflicting log entry. The leader can use this information to optimize log replication: instead of decrementing nextIndex one entry at a time, it can jump back over the conflicting entries, remove them, and append new ones. This is easier to understand and implement than a binary search.
If the follower's log contains entries that conflict with the leader's log, the follower truncates the conflicting entries and appends the new entries from the leader.
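Here is a sketch of how the follower-side handler could look. It assumes the log keeps a placeholder entry at index 0 (so log indexes equal slice indexes), and that state, Follower, and resetElectionTimer come from the leader-election post; updateFollowerCommitIndex is a hypothetical helper shown further below.

```go
// AppendEntries is the follower-side handler.
func (rf *Raft) AppendEntries(args *AppendEntriesArgs, reply *AppendEntriesReply) {
	rf.mu.Lock()
	defer rf.mu.Unlock()

	reply.Term = rf.currentTerm
	reply.Success = false
	reply.ConflictIndex, reply.ConflictTerm = -1, -1

	// Reject requests from stale leaders.
	if args.Term < rf.currentTerm {
		return
	}
	// A valid leader exists for args.Term: step down and reset the election timer.
	if args.Term > rf.currentTerm {
		rf.currentTerm = args.Term
		rf.votedFor = -1
	}
	rf.state = Follower
	rf.resetElectionTimer()

	// Case 1: our log is too short to contain PrevLogIndex.
	if args.PrevLogIndex >= len(rf.log) {
		reply.ConflictIndex = len(rf.log)
		return
	}
	// Case 2: the entry at PrevLogIndex has a different term.
	if rf.log[args.PrevLogIndex].Term != args.PrevLogTerm {
		reply.ConflictTerm = rf.log[args.PrevLogIndex].Term
		// Report the first index of the conflicting term.
		i := args.PrevLogIndex
		for i > 1 && rf.log[i-1].Term == reply.ConflictTerm {
			i--
		}
		reply.ConflictIndex = i
		return
	}

	// The logs match up to PrevLogIndex: truncate any conflicting suffix
	// and append the leader's entries (never truncate on a mere duplicate).
	for i, entry := range args.Entries {
		idx := args.PrevLogIndex + 1 + i
		if idx >= len(rf.log) || rf.log[idx].Term != entry.Term {
			rf.log = append(rf.log[:idx], args.Entries[i:]...)
			break
		}
	}
	reply.Success = true

	// Finally, advance commitIndex (see the snippet further below).
	rf.updateFollowerCommitIndex(args)
}
```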
After receiving an AppendEntries reply, the leader updates nextIndex and matchIndex. On success, it advances matchIndex to the last entry it sent and sets nextIndex to matchIndex + 1. On failure, it moves nextIndex back: if the follower's log is shorter than the leader's, the leader decreases nextIndex to the follower's log length; if the follower reports a term conflict, the leader finds the last entry in its own log with that term and sets nextIndex just after it (falling back to the reported ConflictIndex if it has no entry with that term).
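A sketch of that bookkeeping, assuming it runs with the lock held after the RPC returns; handleAppendEntriesReply and advanceCommitIndex are hypothetical helper names (the latter is shown in the next section).

```go
// handleAppendEntriesReply updates leader state for peer `server` after an
// AppendEntries RPC returns (rf.mu held).
func (rf *Raft) handleAppendEntriesReply(server int, args *AppendEntriesArgs, reply *AppendEntriesReply) {
	if reply.Term > rf.currentTerm {
		// A newer term exists: step down.
		rf.currentTerm = reply.Term
		rf.votedFor = -1
		rf.state = Follower
		return
	}
	if rf.state != Leader || args.Term != rf.currentTerm {
		return // stale reply from an earlier term; ignore it
	}
	if reply.Success {
		// The follower now matches everything we sent.
		rf.matchIndex[server] = args.PrevLogIndex + len(args.Entries)
		rf.nextIndex[server] = rf.matchIndex[server] + 1
		rf.advanceCommitIndex()
		return
	}
	// The follower rejected the entries: back off nextIndex using the
	// conflict information instead of decrementing one step at a time.
	if reply.ConflictTerm == -1 {
		// Follower's log is shorter than PrevLogIndex.
		rf.nextIndex[server] = reply.ConflictIndex
	} else {
		// Look for the last entry in our log with ConflictTerm.
		next := reply.ConflictIndex
		for i := len(rf.log) - 1; i >= 1; i-- {
			if rf.log[i].Term == reply.ConflictTerm {
				next = i + 1
				break
			}
		}
		rf.nextIndex[server] = next
	}
}
```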
Commit Log Entries
After the leader receives replies from the followers, it updates commitIndex based on matchIndex. If the leader finds that a majority of servers have replicated a log entry at index i (that is, a majority of matchIndex values are at least i) and that entry belongs to the leader's current term, it sets commitIndex to i.
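A sketch of how the leader could advance commitIndex; advanceCommitIndex is a hypothetical helper, and the current-term restriction follows section 5.4.2 of the Raft paper.

```go
// advanceCommitIndex moves the leader's commitIndex forward (rf.mu held).
// An entry may only be committed by counting replicas if it belongs to the
// leader's current term.
func (rf *Raft) advanceCommitIndex() {
	for n := len(rf.log) - 1; n > rf.commitIndex; n-- {
		if rf.log[n].Term != rf.currentTerm {
			continue
		}
		count := 1 // the leader itself
		for peer := range rf.peers {
			if peer != rf.me && rf.matchIndex[peer] >= n {
				count++
			}
		}
		if count*2 > len(rf.peers) { // replicated on a majority
			rf.commitIndex = n
			rf.applyCond.Signal() // wake the applier goroutine
			break
		}
	}
}
```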
A follower simply sets commitIndex to the minimum of the leader's commit index and its own last log index.
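A sketch of the hypothetical helper called at the end of the AppendEntries handler above, with the lock still held:

```go
// updateFollowerCommitIndex advances the follower's commitIndex based on the
// leader's commit index (rf.mu held).
func (rf *Raft) updateFollowerCommitIndex(args *AppendEntriesArgs) {
	if args.LeaderCommit > rf.commitIndex {
		newCommit := len(rf.log) - 1 // last log index
		if args.LeaderCommit < newCommit {
			newCommit = args.LeaderCommit
		}
		rf.commitIndex = newCommit
		rf.applyCond.Signal() // wake the applier goroutine
	}
}
```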
After commitIndex is updated, the applier goroutine applies the log entries from lastApplied + 1 up to commitIndex to the application state machine.
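One way to structure the applier, assuming applyCond was created with sync.NewCond(&rf.mu) and ApplyMsg comes from the lab framework:

```go
// applier delivers committed but not-yet-applied entries to the application.
func (rf *Raft) applier() {
	for {
		rf.mu.Lock()
		// Sleep until commitIndex moves past lastApplied.
		for rf.lastApplied >= rf.commitIndex {
			rf.applyCond.Wait()
		}
		rf.lastApplied++
		msg := ApplyMsg{
			CommandValid: true,
			Command:      rf.log[rf.lastApplied].Command,
			CommandIndex: rf.lastApplied,
		}
		rf.mu.Unlock()
		// Send outside the lock so a slow application cannot block Raft.
		rf.applyCh <- msg
	}
}
```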
RequestVote RPC
There is only a small change to the RequestVote RPC: if the candidate's log is not at least as up-to-date as the follower's, the follower rejects the vote request.
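A sketch of the up-to-date check; isLogUpToDate is a hypothetical helper, and the LastLogIndex/LastLogTerm field names follow Figure 2 of the Raft paper.

```go
// isLogUpToDate reports whether a candidate's log (described by its last
// entry's index and term) is at least as up-to-date as ours (rf.mu held).
func (rf *Raft) isLogUpToDate(candidateLastIndex, candidateLastTerm int) bool {
	lastIndex := len(rf.log) - 1
	lastTerm := rf.log[lastIndex].Term
	if candidateLastTerm != lastTerm {
		return candidateLastTerm > lastTerm // higher last term wins
	}
	return candidateLastIndex >= lastIndex // same term: longer log wins
}
```

Inside the RequestVote handler, the follower grants its vote only if the usual term and votedFor checks pass and isLogUpToDate(args.LastLogIndex, args.LastLogTerm) returns true.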
Conclusion
In conclusion, log replication is a critical component of the Raft consensus algorithm. It ensures that log entries are consistently replicated across a majority of servers, maintaining the integrity and consistency of the system state. In this lab, we implemented the log replication mechanism of Raft, handling conflicting log entries, updating the commit index, and applying log entries to the application state machine. By implementing log replication, we have taken a significant step towards building a fault-tolerant and consistent distributed system.
In the next post, we will implement the Raft persistence and log compaction mechanism, ensuring that the Raft state is durable and efficient. Stay tuned for more updates!
My implementation of this lab can be found on GitHub; I hope it helps you understand the concepts better. If you have any questions or feedback, feel free to send me an email. Thank you for reading!