Are you running a Kubernetes cluster to optimize your workloads? Excellent idea. Running apps in containers rather than on VMs or bare metal is often easier on your budget and usually much easier to manage.
While there is always standalone Docker or Docker Swarm, Kubernetes gives you a good number of options on top of that: easier management, better load balancing, a rich API, and a lot more.
Of course, running a production-grade container orchestration platform also requires a bit of regular maintenance on your side. While it usually runs pretty smoothly, a few steps are required regularly to keep it that way. One of them is keeping your cluster certificates up to date. Kubernetes makes extensive use of X.509 certificates for its security model, so the nodes and all those gizmos know whom they are allowed to talk to.
There are many ways of dealing with certs, some of which are documented here. Well, actually that page tells you almost nothing; software developers or technical writers can be funny at times. This screenshot is a bit more informative:
If you happened to create your cluster with kubeadm, this is what it should look like, at a minimum. Note the certificate authority: it should be valid for at least 10 years from when you create the certs. The actual certs, however, expire after a maximum of 1 year and need to be renewed before then. There are ways of automating this, and you should look into those.
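One simple way to keep an eye on expiry dates, independent of kubeadm, is to ask openssl directly. This is just a minimal sketch, assuming the default kubeadm certificate location under /etc/kubernetes/pki; the 30-day window is an example value, pick whatever lead time suits you:

# Print the expiry date of the API server certificate
$ sudo openssl x509 -enddate -noout -in /etc/kubernetes/pki/apiserver.crt

# Exit non-zero if it expires within the next 30 days (2592000 seconds),
# which makes it handy as a cron or monitoring check
$ sudo openssl x509 -checkend 2592000 -noout -in /etc/kubernetes/pki/apiserver.crt \
    || echo "apiserver cert expires within 30 days"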
Well, I am using this cluster for development purposes only, so I do not spend too much time on managing it. Today was one of those days when the certs expired, and the cluster went down. Usually all you have to do is run
$ sudo kubeadm certs check-expiration
This is the command that actually produces the output above, except that the certs may show up as expired or about to expire. In that case you renew the ones that need it. Or all of them, like so:
$ sudo kubeadm certs renew all
Usually that does the trick.
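If you would rather renew only individual certificates, kubeadm also accepts the certificate names shown in the check-expiration output. A small sketch; keep in mind that the control-plane components only pick up renewed certs after a restart, and the manifest shuffle below is just one way to force that:

# Renew selected certificates (names match the check-expiration output)
$ sudo kubeadm certs renew apiserver
$ sudo kubeadm certs renew apiserver-kubelet-client

# The control-plane static pods read their certs at startup, so restart them,
# for example by briefly moving their manifests out of the way:
$ sudo mv /etc/kubernetes/manifests /etc/kubernetes/manifests.off
$ sleep 20
$ sudo mv /etc/kubernetes/manifests.off /etc/kubernetes/manifests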
Today, for some reason, it did not work that well. The certs were renewed all right, but the client certs somehow did not get properly updated. The kubelet.conf, which is required for the kubelet service to run, is not managed by kubeadm. It points to the location of the client certs, and so the service was unable to start. It took me a good while of searching the net, and the usual commands recommended there kept returning syntax errors.
In the end, re-initializing the cluster configuration on the control plane did the trick. That sounds more dangerous than it actually is. Basically, we need to get rid of the old client certs (/var/lib/kubelet/pki/kubelet-client*) and of /etc/kubernetes/kubelet.conf. (We do not need to talk about backups, right? Right??) So far this is as described in the link above, but the commands listed there did not work for me. However, this one did:
$ sudo kubeadm init phase kubeconfig all
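Putting the pieces together, this is roughly the sequence I mean, as a sketch assuming the default kubeadm file locations; take backups first and adjust the paths to your setup:

# On each control-plane node: back up and remove the stale kubelet client config
$ sudo cp /etc/kubernetes/kubelet.conf /etc/kubernetes/kubelet.conf.bak
$ sudo rm /etc/kubernetes/kubelet.conf
$ sudo rm /var/lib/kubelet/pki/kubelet-client*

# Regenerate the kubeconfig files and bring the kubelet back up
$ sudo kubeadm init phase kubeconfig all
$ sudo systemctl restart kubelet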
Basically all this does is recreate the kubeconfig files for the cluster. You will need to do this on all your control planes. Once the control-plane nodes are back up and running, you may have to fix the worker nodes too. Just check on the nodes like this:
$ kubectl get nodes
If you see any nodes listed here as NotReady, you will need to fix them too. Just follow the procedure to (re)join the cluster, as sketched below.
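A minimal sketch of that rejoin, assuming a kubeadm cluster; the address, token, and hash below are placeholders, the real values come out of the token create command:

# On a control-plane node: create a fresh bootstrap token and print the join command
$ sudo kubeadm token create --print-join-command

# On the NotReady worker node: run the command printed above (placeholder values shown)
$ sudo kubeadm join 10.0.0.10:6443 --token abcdef.0123456789abcdef \
    --discovery-token-ca-cert-hash sha256:<hash>

Depending on how far gone the node is, you may need to run kubeadm reset on the worker first before the join goes through.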
In the worst-case scenario you may have to re-apply your deployments. Well, I have a nice CI/CD pipeline for that, using a separate Git repo and Jenkins, and it redeploys in a few seconds. If you are managing a production cluster, this is a must-have. By the way, if you happen to have to fully re-initialize the cluster from scratch, that takes only a little more time, since all the binaries should already be in place. I use Puppet for that.
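Even without a pipeline, re-applying from a directory of manifests is a one-liner; a sketch assuming your manifests live in a local deployments/ directory (a hypothetical path):

# Re-apply everything under the (hypothetical) deployments/ directory, recursively
$ kubectl apply -f deployments/ --recursive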