It has been a while since I blogged something. It's summer time: grab a cocktail, stretch out on the beach and enjoy the scenery!

Usually not my way of spending free time. This week I am enjoying some holiday, though, in Belgian nature (it does exist) with plenty of activities for the kids nearby. I made a little trip to Germany to pick up some spare parts for my broken RC boat (the propeller snapped when the rudder got loose). I really enjoyed my 300hp car and the German roads, as you would imagine if you know me.

At work I have been busy with development lately (Python/Go). Some cool projects around Elasticsearch and Kafka. I have some blog posts in the pipeline about this recent work, but nothing final yet. It takes a bit more time to create a clean, polished dev story.

But this is a Kubernetes/OpenShift/OKD story: I 'performed' some OKD upgrades on our semi-production on-premise bare-metal cluster, with the goal of reaching the latest '4.10' version at that time, the end of June. This is supposed to be a painless process where your system gets updated in all its components (OS, kernel, OKD) with a simple click on the 'upgrade' button. I was pretty confident in the upgrade process since I chose the latest version.
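
For reference, the same upgrade can also be inspected and kicked off from the CLI instead of the console button. A minimal sketch, where the target version is just a placeholder:

# Show the current version and the updates the upgrade graph considers safe
oc adm upgrade
# Start the upgrade to a specific version from that list
oc adm upgrade --to=<target-version>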

Well, this time shit hit the fan, which forced me to dive into the 'automatic' upgrade process to debug what was going on. I also learned some stuff about SELinux along the way and made a small contribution to the OKD community by suggesting possible root causes and promptly providing debug information via GitHub issues.

Luckily I was the only one suffering: mainly one worker node (out of three) was affected and down for a couple of days before I could get it back online. Kubernetes and Ceph storage, although degraded, were fully available. My colleagues hardly noticed anything while I was sacrificing sleep over this. Obviously not a comfortable situation: we were one problem away from full downtime. Guess that is life for infra people!

Network issues to spark up the fun

Not the core issue, but the mayhem broke loose starting with a network issue. The upgrade was a two-step process because I was a little behind. The OKD people recommend that you follow a 'graph' stating the supported upgrade steps, which I obviously did, assuming this would bring peace of mind while upgrading: just wait a bit, meditate or something like that, while stating in the work log that I was working for several hours. I wish! And boy, was I wrong about that 'peace of mind'.
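
The waiting part mostly consists of watching the cluster converge. These are the standard checks I keep an eye on during an upgrade (plain oc commands; the watch interval is just my habit):

# Overall upgrade status and history
oc get clusterversion
# Per-component progress; everything should end up Available and not Degraded
watch -n 30 oc get clusteroperators
# The Machine Config Operator rolling the new OS onto the nodes
oc get machineconfigpools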

If I recall correctly, the first step went smoothly. The second step, as you already guessed, not so much. The upgrade of the three masters still went smoothly: they gracefully rebooted one by one, and oc get nodes -o wide nicely showed the updated versions of the OS, kernel, Kubernetes and OKD patches.

The first worker, though, did not come back after the usual reboot, which is part of the upgrade process to bring up the new OS version and kernel. The console looked normal though, so it did boot. What the hell was going on?

This was the first time I could not access an OKD host over SSH, so I had to research how to get into my host from the console to investigate this. Following OKD practices, the core user has no password set out of the box.

So the usual dance of 'boot in single-user mode, update the core password and finally reboot' had to take place. Although Fedora CoreOS (FCOS) is different from the more traditional operating systems, the procedure is similar. It took me some time, but I can't remember any real gotchas. If you have not done this before, read a blog post such as https://www.vultr.com/docs/reset-passwords-and-ssh-keys-in-fedora-coreos/
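
From memory, and heavily abbreviated, the dance looks roughly like this; details differ per FCOS version, so treat the linked post as the authoritative source:

# At the GRUB menu, press 'e' and append rd.break to the kernel command line, then boot
# You land in an emergency shell with the real root mounted read-only at /sysroot
mount -o remount,rw /sysroot
chroot /sysroot
passwd core            # set a temporary password for the core user
touch /.autorelabel    # let SELinux relabel the touched files on the next boot (may not be needed everywhere)
exit                   # leave the chroot
exit                   # leave the emergency shell, boot continues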

Misconfigured bond interfaces

I found out that layer 2 network connectivity was gone. This should have put me on the right track and made me study the low-level network configuration of my nodes and switches. But it did not...
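
In hindsight, a few minutes of boring basics on the console would have pointed straight at the physical layer. Something like this (the gateway address is just an example from my setup):

# Link state of the physical NICs and the bond
ip -br link show
# Is anything on the local subnet reachable at all?
ping -c 3 192.168.1.1
# Are ARP/neighbour entries being learned?
ip neigh show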

Instead I embarked yet again on a fun process of blaming OKD networking. Surely it had to be broken after the upgrade. We use OVNKubernetes, which uses Open vSwitch under the hood. So somehow Open vSwitch can break layer 2?

I had discovered a problem with the physical connections of my host, so obviously this could not be in the Open vSwitch layer, since the observed layer 2 network issue had nothing to do with containers and their networking. Still, I believed it had to be, and spent many hours studying OVNKubernetes and how it is built up. Very insightful but utterly useless in the end, other than me learning something new. Still a great day in my professional life! :-)

The underlying issue turned out to be a simple physical network problem: the second interface of the bond became the active one for some reason, and that link had a configuration issue on the switch side, rendering the network unavailable.

I believe this change in 'bonding' behaviour is due to some kernel change (or bug?), but I have not looked that up yet. Once I fixed the switch configuration, we were back on our feet again, or so I thought...
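
For what it is worth, the bonding driver tells you exactly which leg is active; this is where I should have looked first (bond0 is an example name):

# Which slave is currently active, plus per-slave link status and the bonding mode
cat /proc/net/bonding/bond0
# The same information in iproute2 form
ip -d link show bond0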

A rogue DHCP IP

Another annoying network issue interfered with our little cluster. A secondary network interface that we use for KubeVirt got assigned a DHCP IP, since that is the default behavior of Fedora CoreOS. Obviously I had disabled this behavior following the recommended approach: 'day 2' network config via the NMState operator. Very cool, since you push some YAML to Kubernetes and the operator takes care of your network configuration for you.
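
As an illustration, such a day 2 policy is just a small NodeNetworkConfigurationPolicy. A minimal sketch, assuming the nmstate.io/v1 API and an interface called eno2 (both are examples, not our actual config):

cat <<'EOF' | oc apply -f -
apiVersion: nmstate.io/v1
kind: NodeNetworkConfigurationPolicy
metadata:
  name: eno2-no-dhcp
spec:
  desiredState:
    interfaces:
      - name: eno2
        type: ethernet
        state: up
        ipv4:
          enabled: false
          dhcp: false
        ipv6:
          enabled: false
          dhcp: false
EOF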

But in an attempt to recover from this issue, I reinstalled my host from scratch. This should be fine, but my host was initially lacking the day 2 network config, so the OS happily accepted the DHCP IP it got offered on a VLAN that was erroneously configured on the switch; I normally only use tagged traffic on this interface and there should not be any untagged traffic allowed. A 'great' feature of the kubelet daemon is that it seems to pick up the latest configured IP as the Kubernetes host IP.

Beats me why the host IP is configured like that. Some little bash script gets called by systemd before the kubelet service fires up. It is really surprising to see how modern platforms are patched up using little ugly hidden scripts such as this. The fix is simple: fix the switch config once more and/or remove the DHCP IP manually from the host, allowing kubelet to start. Once OKD is up and running, the 'day 2' network configuration will kick in, disabling this undesired DHCP activity.
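
Concretely, the manual cleanup boils down to something like this (interface name and address are examples):

# Drop the rogue DHCP address from the secondary interface
ip addr del 10.0.50.23/24 dev eno2
# Let kubelet come up again and pick the correct primary IP
systemctl restart kubelet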

The result of this wrongfully assigned host IP is that the host becomes unavailable, since the other hosts cannot reach it on that rogue DHCP IP. kubectl get nodes -o wide immediately exposes the problem (the IP column). Unfortunately, once the IP is removed and the kubelet service is restarted and comes up with the correct IP, you are not out of the woods: the host certificates are not valid for this 'new' IP.

I ended up reinstalling the host from scratch, after making sure that rogue DHCP server could not strike again. Only later did I find out that I probably just needed to restart some services on the host, or delete the host and re-add it to Kubernetes.
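
For the record, that gentler recovery would probably look something like the following. I did not try it this way at the time, so treat it as a sketch (worker-1 is an example node name):

# Remove the stale node object from the cluster
oc delete node worker-1
# On the worker itself: restart kubelet so it re-registers with the API server
systemctl restart kubelet
# New client/serving certificates show up as pending CSRs; approve them
oc get csr
oc adm certificate approve <csr-name>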

Lesson learned (once more): first stick to the basics of system debugging before diving into the many complexities that container networking has to offer. It might be useful to study the diagrams you (obviously) made of your systems prior to building them. It is easy to overlook implementation details such as network bonding after running your systems for a while.

But surely we are finally out of this upgrade mess, right? Wrong guess: SELinux and KubeVirt still have to haunt us.

Issue 3: Kubernetes node unavailable due to Kubelet restart loop

This is the interesting part. You can read about it all by yourself: https://github.com/openshift/okd/issues/1285

As described in the previous paragraphs, we had already suffered for days trying to upgrade the worker nodes of our on-premise semi-production OKD cluster. The upgrade hiccups were basically configuration errors, so I was quite assured the rest would go smoothly, as expected.

Unfortunately, the misery continued: the node's networking was restored, but the node stayed unavailable as far as Kubernetes was concerned. I quickly figured out that the kubelet daemon could not start, spitting out error messages like this:

systemd[1]: Starting Kubernetes Kubelet...
systemd[4040]: kubelet.service: Failed to locate executable /usr/bin/hyperkube: Permission denied
systemd[4040]: kubelet.service: Failed at step EXEC spawning /usr/bin/hyperkube: Permission denied
systemd[1]: kubelet.service: Main process exited, code=exited, status=203/EXEC

This looked like an OKD bug, but since I had already struggled for days and taken some drastic approaches to recover from the situation, like reinstalling the host, these error messages surely had to be caused by my own actions.

Once more I was stuck for days and tried some recovery actions, to no avail. Weirdly, I could start kubelet myself without problems. File permissions etc. looked normal. I already suspected SELinux was blocking kubelet, but I didn't find clear logging at the time. I was still on the path of 'what the hell did I do wrong to get into this situation?'
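
With hindsight, the denial was findable. These are the kind of checks I would run now; ausearch comes with the audit userspace tools, which may or may not be present on a stock FCOS node:

# Is SELinux enforcing at all?
getenforce
# Recent AVC denials from the audit log
ausearch -m avc -ts recent
# Or grep the journal if ausearch is not available
journalctl -b | grep -i -e avc -e selinux
# Which SELinux label does the kubelet binary actually carry?
ls -Z /usr/bin/hyperkube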

The ‘quick and dirty’ workaround

Luckily, one beautiful day somebody created an issue on GitHub describing a similar problem, mentioning he got around it by disabling SELinux. This triggered me to run the following, effectively starting kubelet without SELinux enforcing its policies:

setenforce 0 && systemctl start kubelet

And sure enough, the host immediately became an active cluster member and the upgrade (which was still stuck) could continue. When I examined my journal logging more closely, I found these telling messages:

kernel: SELinux:  Context system_u:object_r:kubelet_exec_t:s0 is not valid (left unmapped).
AVC avc:  denied  { execute } for  pid=4040 comm="(yperkube)" name="hyperkube" dev="sda4" ino=342043427 scontext=system_u:system_r:init_t:s0 tcontext=system_u:object_r:unlabeled_t:s0 tclass=file permissive=0 trawcon="system_u:object_r:kubelet_exec_t:s0"

Furthermore, every other worker node in the cluster suffered the same problem. This is not an ideal situation, because kubelet is running without SELinux protection and the host will not become available automatically after a reboot.

Clean workaround and fix?

Thanks to the OKD community it is finally clear what caused this: the cluster is running the Virtualization operator (KubeVirt) to host virtual machines. At startup time, KubeVirt sets up some SELinux policies, and apparently this stops Fedora CoreOS (FCOS) from upgrading its policies when the OS gets upgraded.
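
I have not dug into the details yet, but a quick way to check whether KubeVirt indeed left its mark on a node's SELinux policy is to list the installed policy modules. This assumes the policycoreutils tooling is present on the node, and the module names are a guess on my part:

# List SELinux policy modules and look for the ones KubeVirt's virt-handler installs
semodule -l | grep -i virt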

Some folks who know more about this FCOS issue explained to me how to work around the problem. That will be my first work item after my holiday this week. FCOS should already have a fix in place, so it should become available for OKD soon too. The issue is still open, so I guess we will find out sooner or later how it gets fixed.