Understanding Etcd Consensus and How to Recover from Failure

by Ashley Schuett

on May 30, 2018


Kubernetes is backed by a distributed key-value store called etcd, which is used for storing and replicating the state of the cluster. While many Kubernetes users are removed from the complexities of running etcd through the use of cluster management services like Containership, it is always worthwhile to understand how the underlying technology works. In this post, we will do a deep dive into the main concept behind etcd: distributed consensus based on a quorum model. We will also walk through an example of how to recover an etcd cluster that has fallen out of quorum due to a failure.

Under the hood

Etcd uses a distributed consensus algorithm called Raft. Many popular applications use this same algorithm, including Docker Swarm and Consul. While consensus can be a complicated problem, Raft was designed to be easy to understand by breaking the problem of consensus down into smaller subproblems. This makes applications that rely on distributed consensus easier to build and reason about.

The basis of the algorithm is that no action can take place without a quorum, defined as floor(N/2)+1, where N is the number of nodes, or voting members, of the distributed system. When an action is requested, it is put to a vote among the etcd members. If it receives more than 50% of the votes, the action is allowed to take place and is written to a log on each node, which serves as the source of truth. Distributed consensus gives you the assurance that anything written to the log is an agreed-upon action, along with log replication and leader election.

Because quorum requires more than 50% of the votes, odd numbers of nodes increase your level of fault tolerance, while adding a node to make an even number does not increase the number of failures the cluster can survive. With two nodes, the chance of losing quorum actually increases: if either node goes down, no new action can take place. Along the same lines, another dangerous scenario is scaling a cluster from one node to two. If the newly added member does not come online, the cluster will not have quorum until it does and is able to communicate. To learn more about optimal cluster sizes, you can read these CoreOS docs.
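To make the arithmetic concrete, here is a small shell sketch of the quorum size and failure tolerance for different cluster sizes. Note how a 4-member cluster tolerates no more failures than a 3-member one:

```shell
# Quorum for an N-member cluster is floor(N/2) + 1; the cluster can
# tolerate N minus quorum failures before it stops accepting writes.
for n in 1 2 3 4 5; do
  quorum=$(( n / 2 + 1 ))
  echo "members=$n quorum=$quorum tolerable_failures=$(( n - quorum ))"
done
```

This is why etcd clusters are typically run with 3 or 5 members: each step up to the next odd number buys one more tolerable failure.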

When a cluster has quorum, the leader is responsible for keeping a consistent state, which it does by writing logs of the agreed-upon actions. To learn more about Raft, you can visit https://raft.github.io/, where there is also a visual guide to how distributed consensus works.

Losing quorum

Let’s assume you have your etcd cluster up and running and it’s chugging along, but then one day you go to add a new deployment to your Kubernetes cluster and it times out. You run `etcdctl endpoint health` and realize that your etcd cluster is no longer healthy. When etcd loses quorum it goes into a read-only state: it can still respond with data, but no new actions can take place, since the cluster is unable to decide whether an action is allowed.
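To see which members are actually reachable, you can point `etcdctl` at each member's client URL. A sketch, where the endpoint addresses are placeholders for your members' actual client URLs:

```shell
# Check each member individually; unreachable or unhealthy members
# will report an error rather than "is healthy".
ETCDCTL_API=3 etcdctl \
  --endpoints=https://10.0.0.1:2379,https://10.0.0.2:2379,https://10.0.0.3:2379 \
  endpoint health
```

This helps you distinguish between a single failed member (quorum intact) and enough failures to lose quorum.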

One way to fix this is to get the nodes in the initial cluster back up and running. If you can bring enough of the initial nodes back online, the cluster will be able to reach quorum, and etcd will begin writing logs again. However, let’s say you can’t get those initial nodes back up, and you can’t add new nodes since that would be a write action. Now what?

Restoring quorum

If you have a snapshot of the state, which can be taken using `etcdctl snapshot save /var/lib/etcd_backup/backup.db` (I recommend saving snapshots in a dedicated backup directory for ease of use when restoring), then it’s easy to start a new one-node cluster. Once the one-node cluster is initialized and healthy, you can go on adding new nodes to it as usual. To restore the backup, first locate where the data for etcd is being stored, which can be specified with the `--data-dir` flag in the `/etc/systemd/system/etcd.service` file used when the etcd cluster was first started. In this case, let’s say it’s stored in `/var/lib/etcd`. Since you have your backups saved in a separate directory as specified above, you can now restore your snapshot to a new cluster:
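Before a restore is ever needed, it's worth taking snapshots regularly and verifying them. A sketch, where the endpoint is a placeholder for your cluster's client URL:

```shell
# Save a snapshot of the current etcd state into a backup directory.
mkdir -p /var/lib/etcd_backup
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  snapshot save /var/lib/etcd_backup/backup.db

# Verify the snapshot is readable before relying on it.
ETCDCTL_API=3 etcdctl snapshot status /var/lib/etcd_backup/backup.db
```

Running the save command on a schedule (for example, from a cron job) ensures a recent snapshot is always available when disaster strikes.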

etcdctl snapshot restore /var/lib/etcd_backup/backup.db \
  --data-dir /var/lib/etcd_backup/etcd \
  --name nodename \
  --initial-cluster nodename=https://PRIVATE_IP:2380 \
  --initial-cluster-token new-etcd-cluster-1 \
  --initial-advertise-peer-urls https://PRIVATE_IP:2380

This creates the state of a new cluster with only one node. To get systemd to recognize this new cluster and use the new data instead of the unhealthy data, you’ll have to get rid of the old cluster’s data in the data directory and replace it with the data that was created when the snapshot was restored. This can be done with the following commands:

rm -rf /var/lib/etcd
mv /var/lib/etcd_backup/etcd /var/lib/etcd
systemctl restart etcd-member.service

Next, check that the new cluster was created and is running:

etcdctl member list

which should return only the one node that had the snapshot restored to it. You’ll also want to check the health of the etcd endpoint:

etcdctl endpoint health

which should return a status containing “is healthy”. From this point you can start adding more nodes to your cluster using the `member add` command. Now that your etcd cluster once again has quorum, you will be able to start writing data again, such as creating and updating resources in your Kubernetes cluster. But remember: the cluster will be in the state of the snapshot you restored, not the latest state before quorum was lost.
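Growing the cluster back out is a two-step process: register the new member, then start etcd on the new machine so it joins the existing cluster. A sketch, where the member name and IP are placeholders:

```shell
# Register the new member with the running cluster. etcdctl prints the
# environment variables (including ETCD_INITIAL_CLUSTER) to use when
# starting etcd on the new node.
ETCDCTL_API=3 etcdctl member add node2 \
  --peer-urls=https://NEW_NODE_PRIVATE_IP:2380

# On the new node, start etcd with --initial-cluster-state existing so
# it joins the running cluster instead of bootstrapping a new one.
```

Add members one at a time, waiting for each to become healthy before adding the next, so the cluster never sits with unreachable voting members.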

Next steps

Hopefully this has helped you understand how distributed consensus is used in an etcd cluster, as well as how to do some basic disaster recovery for your etcd cluster in the case that quorum isn’t being met. To have some fun and test this out you can set up an initial etcd cluster with three nodes, destroy two of them and see if you can get your etcd back to a functioning state!