This is the story of how we resolved a hard-to-debug issue while migrating a complex containerized application to a Kubernetes cluster. I learned some lessons along the way that might be useful for you too.

Problem statement

Application

Sorry for the vague description, but the application involves three pods. One pod contains a container running an OpenVPN (client) process. Users access the application through a web interface, and traffic flows back and forth through a VPN tunnel to a service running in a restricted network at the other end.

This application is customer-specific because of the VPN tunnel, and we run many instances of this setup, one per customer.

As you might imagine, the networking for this architecture is quite complicated. Running OpenVPN is not an everyday workload for k8s, and the pod needs special permissions (an SCC, a Security Context Constraint) to be able to open a tun interface and do the other low-level things a VPN solution typically needs.
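To give a rough idea of what that looks like in practice (a minimal sketch; the pod name, image and the exact SCC binding are hypothetical and cluster-specific), the OpenVPN container needs at least the NET_ADMIN capability to create its tun interface and manipulate routes:

```yaml
# Sketch of an OpenVPN client pod. Assumes the pod's service account is bound
# to an SCC that permits the NET_ADMIN capability. Names and image are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: openvpn-client
spec:
  containers:
    - name: openvpn
      image: example-registry/openvpn-client:latest  # hypothetical image
      securityContext:
        capabilities:
          add:
            - NET_ADMIN   # needed to create the tun device and manage routes
```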

Containerizing an application like this takes some experience, and I was impressed when I learned about this prior work. In my opinion, at the time, it was at the limit of what makes sense to containerize. Use cases like this really broadened my horizon of what is possible with containers.

Without going into details, security is also a big concern. You should always take precautions to shield your internal network from any k8s service exposed to the wild west that is the internet, but here this is even more complicated because of the OpenVPN tunnel.

Migration

Not my problem: the application had already been running for a few years and could handle any workload thrown at it. No doubt it needed a few tweaks along the way, but nothing major as far as I know.

At some point the need arose to move all of this to a brand new OKD 4 platform, set up by yours truly.

Turns out this was a lot harder than anticipated, as you will see shortly. Migrating from OKD 3 to 4 involves some changes at the YAML level, because you basically move from Kubernetes 1.11 to 1.20+. Some Openshift-specific things changed as well. Explaining these details is out of scope; I hope you can start from a clean slate and run the latest version.
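To give one illustrative example of such a YAML-level change (not our actual diff; the names and image are made up): Deployment manifests written against the old extensions/v1beta1 API are rejected by Kubernetes 1.16+, and the apps/v1 replacement also makes spec.selector mandatory.

```yaml
# Illustrative migration: the pre-1.16 form
#   apiVersion: extensions/v1beta1
#   kind: Deployment
# no longer applies on a modern cluster; apps/v1 requires an explicit selector.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app          # hypothetical name
spec:
  replicas: 1
  selector:                  # mandatory in apps/v1
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app     # must match the selector above
    spec:
      containers:
        - name: web
          image: example-registry/example-app:latest  # hypothetical image
```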

But in the end, Kubernetes promises to be a perfect generalized abstraction over any infrastructure, one that makes moving services around a piece of cake. Maybe you don’t know this, but Google’s huge investment in k8s was mainly driven by competition with its popular cloud rival, AWS.

Easy migration? Not this time!

Problem 1: networking

But as said, this was not my problem: the application had already run containerized for a few years and could handle any workload thrown at it. No doubt it evolved and needed a few tweaks along the way, but that is not my story to tell.

I came into the picture when the application instances had to move to the new OKD 4 platform. Initially things seemed to go smoothly. Deploying and testing in the test and staging environments went fine, so a few weeks later we informed some customers and started to move the least used instances to the new cluster.

The deploys succeeded. Basic manual tests satisfied us, so I went to sleep with the nice feeling of having done something useful that day. Shortly after, probably the next morning around 9, the phone rang…

The nice feeling made way for mild panic. Now what? A customer could not use the application: they could log in to the web portal, but the application itself did not work. After some time a red error dialog popped up, signalling that something had gone horribly wrong.

Long story short, the technical translation of the customer’s “it does not work” is: the customer seems to experience connection timeouts.

Here we could state a first lesson: test more thoroughly. As a manual test I had just logged in, checked the OpenVPN logs and performed a basic ping test through the tunnel. What I failed to do was test the application end to end. The reason is simple: it takes a lot of time to set up a full environment. The existing dev environment was no longer working, so I decided to skip this, assuming I could do without.

First step in debugging: determine the root issue and buy time

It seemed clear that ‘networking’ was to blame, but a few facts made it hard to see what was going on:

  • Other customers had no problems at all.
  • I had created a new version of the application container images with updated components. Surely that was to blame, or so I thought.
  • I didn’t have a clear understanding of the application architecture; I had only bothered with a general idea of what was going on: VPN tunnel, web sockets, …
  • I didn’t have deep knowledge of Openshift networking yet.

Luckily the customers’ suffering was brief: we simply brought down the new VPN tunnels, brought the ‘old’ ones back up and pointed the customers’ DNS records (which fortunately have a TTL of five minutes) back to the old cluster.

So we had what now seems a monumental task at hand: which networking bit was biting us? Was it OpenVPN? The Openshift networking stack, OVNKubernetes? Or maybe the physical host networking? Since we were moving from another Kubernetes platform, it made sense to focus on the differences between the two.

Anyhow, there was no way around it: we had to understand both Openshift networking (OVNKubernetes) and the application. Bummer. The customers had been migrated back to the old cluster, so we could take the time to tackle this problem properly.

Initial black box debugging

Also known as trial and error. We all know this technique and apply it often, don’t deny it. It enabled us to narrow down the huge problem space. It soon became clear that varying the container image versions had no effect.

The next important step was figuring out the difference between the affected customers and the unaffected ones. To make things worse, the application has many parameters for configuring the OpenVPN tunnel: TCP or UDP, port, HTTP proxy or not, …

Luckily we have plenty of customers, so after some research we could eliminate the OpenVPN specifics. And after studying the different namespaces in k8s, it became clear that the scheduling decisions for the pods might matter.

Bear in mind, our application is architected as three pods. It became clear that the affected customers had pods scheduled on different hosts, while the unaffected ones did not.
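Spotting this takes one command; the namespace below is hypothetical, but the NODE column tells the story:

```bash
# List the application's pods together with the node each one landed on.
# For affected customers the NODE column differs between the three pods.
# The namespace is hypothetical.
oc get pods -n customer-a -o wide

# Or just the pod -> node mapping:
oc get pods -n customer-a \
  -o custom-columns='POD:.metadata.name,NODE:.spec.nodeName'
```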

This was a very important clue: somewhere inside k8s, traffic got lost when it had to cross host boundaries.

The good news was that we had a quick fix: introduce so-called pod affinity rules to make sure the pods are scheduled on the same host. Read more about it in the k8s docs if you are interested: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#affinity-and-anti-affinity
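Here is a minimal sketch of such a rule, assuming the pod carrying the OpenVPN container is labeled app: openvpn (all names, labels and the image are hypothetical):

```yaml
# Illustrative pod affinity: force this pod onto the same node as the pod
# labeled app: openvpn. Names, labels and image are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: web-frontend
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app: openvpn
          topologyKey: kubernetes.io/hostname   # "same host" granularity
  containers:
    - name: web
      image: example-registry/web-frontend:latest  # hypothetical image
```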

Still, I didn’t feel comfortable running a network stack we didn’t really understand, with a potential issue in it. We had to find out what was going on before we could trust our most important customers to this new platform.

Kubernetes networking

The next step was diving deeper into the networking specifics of k8s. Moving from Openshift version 3 to version 4 involves a change from Openshift SDN to OVNKubernetes. So this must be it!

As you might know, k8s specifies how pod networking should look logically: pods see a flat network, it is possible to define services, and so on.

The implementation is left to others; Openshift nowadays pushes OVNKubernetes as the main supported so-called Container Network Interface (CNI) plugin.

Naturally, I started reading about the Openshift specifics; you can go see for yourself: https://docs.okd.io/latest/networking/ovn_kubernetes_network_provider/about-ovn-kubernetes.html

For a while we were stuck here. Surely it was some complicated low-level thing that would take us weeks to figure out. Guess we would have to keep that old cluster online for a while longer. Pity.

I had my fair share of frustration, and for a while I even lost my belief in OVNKubernetes as a stable CNI networking plugin.

Luckily I regained my cool after some rest and after focusing on other things for a while. At some point I returned to the basics of networking: IP, TCP and routing tables, and studied the OpenVPN containers more closely in both the old and the new cluster.
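The comparison itself needs nothing fancy. A sketch of the idea, with hypothetical namespace and pod names, run once against each cluster:

```bash
# Dump the routing table inside the OpenVPN container and compare clusters.
# Run once with the kubeconfig of the old cluster and once with the new.
# Namespace and pod names are hypothetical.
oc exec -n customer-a openvpn-0 -- ip route > routes-new.txt
diff routes-old.txt routes-new.txt
```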

And that was the golden idea: there was a clear difference in the routing tables. The problem was that k8s traffic was inadvertently sent through the VPN tunnel whenever the pod running OpenVPN needed to communicate with a pod on another host. The solution was then simple: add an extra route directing k8s traffic to the OVNKubernetes gateway. OpenVPN has a hook mechanism for performing custom actions when the tunnel is brought up.
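To sketch what such a hook can look like (the CIDRs below are the OKD defaults and the gateway address is hypothetical; read the real values from the pod’s routing table before the tunnel starts), the client config would reference an up script via OpenVPN’s script-security and up directives:

```bash
#!/bin/sh
# /etc/openvpn/up.sh -- run by OpenVPN when the tunnel comes up, referenced
# from the client config with:
#   script-security 2
#   up /etc/openvpn/up.sh
# Keeps cluster-internal traffic out of the tunnel by pinning it to the
# OVNKubernetes gateway. CIDRs are OKD defaults; the gateway is hypothetical.

CLUSTER_NET="10.128.0.0/14"   # default OKD pod network (verify for your cluster)
SERVICE_NET="172.30.0.0/16"   # default OKD service network (verify as well)
K8S_GW="10.128.2.1"           # hypothetical: the pod's original eth0 gateway

# Route pod-to-pod and pod-to-service traffic via the cluster gateway instead
# of letting it fall into the VPN's redirected routes.
ip route add "$CLUSTER_NET" via "$K8S_GW" dev eth0
ip route add "$SERVICE_NET" via "$K8S_GW" dev eth0
```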

Finally, case closed! OVNKubernetes was not to blame. I guess we were just lucky that this problem never hit us with Openshift SDN on the old cluster. Although the logical networking of k8s is standardized, the implementation details of your specific CNI plugin might bite you.

What I take away from cases like this: keep your cool. When frustration gets the better of you, just step away. Focus on other things and revisit the problem later. Problems like this demand creativity, and a clear mind improves your efficiency.

So we could finally trust our customers to the new platform, or so we thought…

Problem 2: container limits

Murphy was not out of ideas yet. After running the fixed application for weeks, we were confident it was rock solid. So one beautiful day we deployed the updated application for our biggest customer, late at night. To make things even more fun: my family and I were leaving for some well deserved time off the day after.

A recipe for disaster… I guess you are curious? Sorry, but this post is getting out of hand and my family demands some time. Some other time, some other blog post.

Preliminary conclusion

In our line of work you are often a full stack engineer. I hope you have a broad interest and can just deal with that. Don’t lose courage when it takes some time; it just does.

I (should have) learned a few lessons on the way:

  • Test your application thoroughly, do not just assume you can get away with some limited testing. Take the time to build a fully functional end-to-end test setup.
  • Understand the application and its architecture. Read the code, ask your colleagues, do whatever it takes to really get what makes the application tick.
  • Demystify the inner workings of your infrastructure. You’ll learn a lot and become the expert debugger you always wanted to be.
  • Deal with the fact that it will take time. Take some rest and revisit the problem later with a fresh mind if frustration takes over.

I guess we all know these lessons from experience, yet we keep bumping into them. Workload might keep us from working thoroughly, and quite often problems like this force you to do it the right way later anyhow.