The Ultimate Guide to Disaster Recovery for Your Kubernetes Clusters

Prafull Ladha

Cloud & DevOps

Tags: kubeadm, disaster recovery, kubernetes

Kubernetes allows us to run containerized applications at scale without drowning in the details of application load balancing. You can ensure high availability for your applications running on Kubernetes by running multiple replicas (pods) of each application. All the complexity of container orchestration is hidden away safely so that you can focus on developing your application instead of deploying it. High availability of Kubernetes clusters, and how you can use kubeadm to achieve it, is covered in a separate post.

But using Kubernetes has its own challenges, and getting Kubernetes up and running takes some real work. If you are not familiar with that process, you might want to read an introductory setup guide first.

Kubernetes allows us to have zero-downtime deployments, yet service-interrupting events are inevitable and can occur at any time. Your network can go down, your latest application push can introduce a critical bug, or, in the rarest case, you might even have to face a natural disaster.

When you are using Kubernetes, sooner or later you need to set up a backup. If your cluster goes into an unrecoverable state, you will need a backup to return to the previous stable state of the Kubernetes cluster.

Why Backup and Recovery?

There are three reasons why you need a backup and recovery mechanism in place for your Kubernetes cluster:

Recovering from disasters: for example, someone accidentally deletes the namespace where your deployments reside.
Replicating the environment: you want to replicate your production environment to a staging environment before any major upgrade.
Migrating the Kubernetes cluster: you want to migrate your Kubernetes cluster from one environment to another.

What to Backup?

Now that you know why, let’s see what exactly you need to back up. There are two things:

etcd: the state of your Kubernetes cluster is stored in etcd, so you need to back up the etcd state to capture all the Kubernetes resources.
Persistent volumes: if you have stateful containers (which you will have in the real world), you need a backup of the persistent volumes as well.

How to Backup?

Tools such as Heptio Ark and kube-backup can back up and restore a Kubernetes cluster on cloud providers. But what if you are not using a managed Kubernetes cluster? You might have to get your hands dirty if you are running Kubernetes on bare metal, just like we are.

We are running a three-master Kubernetes cluster with three etcd members, one on each master. If we lose one master, we can still recover it because the etcd quorum is intact. If we lose two masters, quorum is lost, and a production-grade cluster needs a mechanism to recover from that situation as well.
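
The quorum arithmetic behind that statement can be sketched in a few lines of Python. This is only an illustration of the majority rule etcd relies on; the function names are made up for this example.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting etcd members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum(members)


# Compare a few common cluster sizes, including the 3-member setup above.
for size in (1, 3, 5):
    print(f"{size} members: quorum={quorum(size)}, "
          f"tolerated failures={tolerated_failures(size)}")
```

With three members the cluster survives the loss of one, which is why losing two masters calls for a restore from backup rather than a simple rejoin.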


Taking etcd backup:

The mechanism for taking an etcd backup depends on how you set up your etcd cluster in the Kubernetes environment.

There are two ways to set up an etcd cluster in a Kubernetes environment:

Internal etcd cluster: you run your etcd cluster in the form of containers/pods inside the Kubernetes cluster, and it is the responsibility of Kubernetes to manage those pods.
External etcd cluster: you run your etcd cluster outside of the Kubernetes cluster, mostly in the form of Linux services, and provide its endpoints to the Kubernetes cluster to write to.

Backup Strategy for Internal Etcd Cluster:

To take a backup from inside an etcd pod, we will use the Kubernetes CronJob functionality, which does not require an etcdctl client to be installed on the host.

The following is the definition of a Kubernetes CronJob that takes an etcd backup every minute:

CODE: https://gist.github.com/velotiotech/57cb0774508f0594e97efbcc2d9f3f82.js
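
For readers who cannot open the gist, here is a rough sketch of what such a CronJob might look like, built as a Python dictionary and printed as JSON (which kubectl also accepts). It is not the definition from the gist. It assumes a kubeadm-style layout with etcd certificates under /etc/kubernetes/pki/etcd, an image that ships both etcdctl and a shell (the image name below is a placeholder), and control-plane nodes labelled node-role.kubernetes.io/control-plane; the backup directory is also an assumption.

```python
import json


def etcd_backup_cronjob(image, schedule="*/1 * * * *",
                        host_backup_dir="/var/backups/etcd"):
    """Build a CronJob manifest that snapshots the local etcd member."""
    pki = "/etc/kubernetes/pki/etcd"  # assumption: kubeadm default cert path
    script = (
        "ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 "
        f"--cacert={pki}/ca.crt --cert={pki}/healthcheck-client.crt "
        f"--key={pki}/healthcheck-client.key "
        "snapshot save /backup/etcd-snapshot-$(date +%Y%m%d-%H%M%S).db"
    )
    pod_spec = {
        "hostNetwork": True,  # lets the job reach etcd on the node's 127.0.0.1
        "restartPolicy": "OnFailure",
        # assumption: label used on control-plane nodes in this cluster
        "nodeSelector": {"node-role.kubernetes.io/control-plane": ""},
        "tolerations": [{"operator": "Exists", "effect": "NoSchedule"}],
        "containers": [{
            "name": "etcd-backup",
            "image": image,
            "command": ["/bin/sh", "-c", script],
            "volumeMounts": [
                {"name": "etcd-certs", "mountPath": pki, "readOnly": True},
                {"name": "backup", "mountPath": "/backup"},
            ],
        }],
        "volumes": [
            {"name": "etcd-certs",
             "hostPath": {"path": pki, "type": "Directory"}},
            {"name": "backup",
             "hostPath": {"path": host_backup_dir, "type": "DirectoryOrCreate"}},
        ],
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": "etcd-backup", "namespace": "kube-system"},
        "spec": {
            "schedule": schedule,  # "*/1 * * * *" runs every minute
            "jobTemplate": {"spec": {"template": {"spec": pod_spec}}},
        },
    }


# Placeholder image name; substitute one that contains etcdctl and /bin/sh.
manifest = etcd_backup_cronjob("registry.example.com/etcd-with-shell:3.5")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl. A one-minute schedule produces many snapshot files, so a real setup would likely also prune old ones.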

Backup Strategy for External Etcd Cluster:

If you are running the etcd cluster on Linux hosts as a service, you should set up a Linux cron job to take backups of your cluster.

Run the following command to save an etcd backup:

CODE: https://gist.github.com/velotiotech/d5a5c5f990e233944ae928b4a001a18e.js
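
As a hedged illustration (not the command from the gist), the sketch below assembles an etcdctl snapshot save invocation with a timestamped file name. The endpoint, certificate paths and backup directory are assumptions and need to match your own etcd service configuration.

```python
import shlex
from datetime import datetime


def snapshot_save_command(endpoint, cacert, cert, key, backup_dir, now):
    """Return the argument list for an `etcdctl snapshot save` call."""
    filename = f"{backup_dir}/etcd-snapshot-{now:%Y%m%d-%H%M%S}.db"
    return [
        "etcdctl",
        f"--endpoints={endpoint}",
        f"--cacert={cacert}",
        f"--cert={cert}",
        f"--key={key}",
        "snapshot", "save", filename,
    ]


cmd = snapshot_save_command(
    endpoint="https://127.0.0.1:2379",   # assumption: local etcd client URL
    cacert="/etc/etcd/pki/ca.crt",        # assumption: certificate locations
    cert="/etc/etcd/pki/client.crt",
    key="/etc/etcd/pki/client.key",
    backup_dir="/var/backups/etcd",       # assumption: backup directory
    now=datetime(2024, 1, 1, 2, 30, 0),   # fixed time keeps the example stable
)
# ETCDCTL_API=3 selects the v3 API on older etcdctl builds.
print("ETCDCTL_API=3 " + shlex.join(cmd))
```

The printed line is what you would place in a script called from cron, passing the current time instead of the fixed one used here.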

Disaster Recovery

Now, let’s say the Kubernetes cluster went completely down and we need to recover it from the etcd snapshot.

In general, you start the etcd cluster and then run kubeadm init on the master node with the etcd endpoints.

Make sure you put the backed-up certificates into the /etc/kubernetes/pki folder before running kubeadm init, so that it picks up the same certificates.

Restore Strategy for Internal Etcd Cluster:

CODE: https://gist.github.com/velotiotech/53c5dd7737853b69fcc6e2888f72274a.js

Restore Strategy for External Etcd Cluster

Restore etcd on the three nodes using the following commands:

CODE: https://gist.github.com/velotiotech/ae59e9c96435c3c644ddf9ef10749661.js

The above three commands will give you three restored folders on the three nodes, named master-0.etcd, master-1.etcd and master-2.etcd.
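
To make the shape of those per-node restores concrete, here is a minimal sketch that generates one etcdctl snapshot restore command per member. It is not the content of the gist: the peer URLs, snapshot file name and cluster token are hypothetical values, and only the member names come from the folder names above. When no data directory is given, etcdctl restores into a folder named after the member with an .etcd suffix.

```python
import shlex

# Hypothetical peer URLs; only the member names come from the article.
MEMBERS = {
    "master-0": "https://10.0.0.10:2380",
    "master-1": "https://10.0.0.11:2380",
    "master-2": "https://10.0.0.12:2380",
}


def restore_command(name, members, snapshot="snapshot.db",
                    token="etcd-cluster-restored"):
    """Return the `etcdctl snapshot restore` argument list for one member."""
    initial_cluster = ",".join(f"{n}={url}" for n, url in members.items())
    return [
        "etcdctl", "snapshot", "restore", snapshot,
        f"--name={name}",
        f"--initial-cluster={initial_cluster}",
        f"--initial-cluster-token={token}",
        f"--initial-advertise-peer-urls={members[name]}",
    ]


for name in MEMBERS:
    print(f"# run on {name}; restores into ./{name}.etcd")
    print("ETCDCTL_API=3 " + shlex.join(restore_command(name, MEMBERS)))
```

Every member restores from the same snapshot file and shares the same initial cluster list and token; only the name and advertised peer URL differ per node.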

Now, stop the etcd service on all nodes, replace the existing etcd data folder on each node with that node’s restored folder, and start the etcd service again. At first you can see all the nodes, but after some time only the master node stays ready and the other nodes go into the NotReady state. You need to join those two nodes again with the existing ca.crt file (you should have a backup of that).

Run the following command on the master node:

CODE: https://gist.github.com/velotiotech/b41f74d33fa9d970226e1c54287183a1.js

It will give you a kubeadm join command. Add the --ignore-preflight-errors flag to it and run the command on the other two nodes to bring them back into the ready state.
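
The small sketch below shows one way to append that flag to a printed join command before running it on the other nodes. The address, token and hash in the sample command are placeholders, and the helper name is made up for this example. The flag takes a comma-separated list of check names, or all to ignore every preflight error.

```python
import shlex


def with_ignored_preflight(join_command, checks=("all",)):
    """Append --ignore-preflight-errors to a `kubeadm join ...` command."""
    args = shlex.split(join_command)
    if args[:2] != ["kubeadm", "join"]:
        raise ValueError("expected a 'kubeadm join ...' command")
    args.append("--ignore-preflight-errors=" + ",".join(checks))
    return shlex.join(args)


# Placeholder values standing in for the command printed on the master node.
printed = (
    "kubeadm join 10.0.0.10:6443 --token abcdef.0123456789abcdef "
    "--discovery-token-ca-cert-hash sha256:" + "0" * 64
)
print(with_ignored_preflight(printed))
```

Ignoring all checks is the bluntest option; listing only the checks that fail on an already-configured node keeps the remaining safety checks active.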

Conclusion

One way to deal with master failure is to set up a multi-master Kubernetes cluster, but even that does not eliminate the need for Kubernetes etcd backup and restore: it is still possible to accidentally destroy data in an HA environment.

