This post is a continuation of https://maarten.gent/posts/debugging/debug-complex-containerized-application/

A quick recap: we moved a complex 3-pod containerized application from OKD (OpenShift) 3 to 4 and hit a network issue. Some lengthy debugging was required, but once it was resolved we felt confident we could move our biggest customers to the new platform.

Setting the stage

Weekends are family time (or blog time), but not this time: a switch died in the datacenter, which brought down the storage of the old OKD cluster. Luckily only a few applications were still harboured there, so I decided to move the remaining workload to the brand new OKD cluster. Our beloved containerized application, the biggest customer, was one of those.

My family and I were due for a well-deserved week of vacation, and I decided to move that last big customer the weekend before. Since many deployments had been working great for weeks, we were relaxed.

Mayhem

Monday morning around 9, Murphy struck again, and hard: a colleague got a phone call from the customer saying that nothing was working at all. Holiday or not, this obviously could not wait.

My colleague and I did some quick debugging and determined that there were timeouts under load, but sometimes, when trying over and over, a connection could get through. Unfortunately nothing obvious popped up in the logs, so I feared yet another obscure network bug, possibly in the OpenShift networking itself.

My children were shouting and crying in the background. My colleague at first didn’t realize I was on holiday and suspected me of running a daycare as a side job. Yeah, it was real fun.

Anyhow, the customer was down and losing money, so the only feasible solution was moving them back to the old platform. But... the old platform was no more. The switch had died, remember?

Still, resurrecting the old platform seemed easier to me than debugging a vague network issue in a ‘daycare’ environment.

I will not go into details, but basically I had to update NFS exports (after shutting down ‘storage clustering’ and so on) on a storage system I hadn’t dared to touch before, so that it would use the backup switch. I could then update the NFS provisioner in Kubernetes to use the backup server IP. Unfortunately Kubernetes does not allow you to update existing persistent volumes, so I had to delete the non-working ones and create new ones, letting the NFS provisioner use the new storage IP. Luckily we do backups, and they even work, and we only had to resurrect the platform for this one customer. I didn’t care about the others, or even about core Kubernetes services such as monitoring, logging, the web console, …
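
For illustration, recreating a volume against the backup server looks roughly like this. This is a minimal sketch assuming a statically defined NFS persistent volume; all names, IPs and paths are hypothetical.

# delete the persistent volume that still points at the dead storage IP
kubectl delete pv customer-data

# recreate it with the same spec, but pointing at the backup NFS server
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: customer-data
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 192.0.2.20          # backup NFS server IP (hypothetical)
    path: /exports/customer-data
EOF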

About an hour later we could relaunch the services and the client was back on its way. I could definitely use the week of de-stressing, that is for sure. Apart from the occasional issue that surfaced because of the emergency migrations of the weekend before, I did enjoy the time off.

The debugging task at hand

As in part 1, there was again no hurry, and the debugging task promised to be complicated: the problem only occurred when there were many concurrent connections. We could not afford any more downtime experimenting with the client’s setup, so we really had to invest time in a test setup replicating the problem environment as closely as possible.

We first set up a parallel OpenVPN tunnel to the customer that we could use for experimenting. For some reason I suspected that the MTU used in OVN-Kubernetes networking might be biting us: the service involves some possibly large frames that are not allowed to be fragmented.

Indeed, the MTU was 50 bytes lower on the new platform, so it was definitely possible. Some initial tests using ping seemed to suggest that large frames could get lost without OpenVPN signalling this back to the application via ICMP. So we must have been on the right track, or so we thought. We should have realized that TCP is really resilient to this kind of problem: if packets get dropped, retransmissions and automatic adjustments should put things straight.
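
For reference, the kind of ping probe we used looks like this (Linux iputils ping; the hostname and sizes are hypothetical): set the don’t-fragment bit and pick a payload that, together with the 28 bytes of ICMP and IP headers, exceeds the suspected path MTU.

# -M do  : set the don't-fragment bit, so an oversized packet fails instead of being fragmented
# -s 1422: 1422 bytes of payload + 28 bytes of headers = 1450 bytes on the wire
ping -M do -s 1422 rdp-gateway.customer.example

Depending on where the oversized packet is dropped, you either get an ICMP ‘fragmentation needed’ reply or simply no answer.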

Although the initial debugging raised some suspicion about the lower MTU, we could not fully simulate the problem, since initiating many concurrent connections would disturb the customer.

Another detail that matters here: the service involves Windows Remote Desktop (RDP), and due to Microsoft licensing it is really hard to set up a test involving many concurrent connections. Windows 10, for example, only supports a single concurrent RDP session.

It turned out that, once we had a working lab setup that could support many users, the debugging itself took only 30 minutes!

Building the test setup

We used our in-house OpenStack platform to build a test ‘customer’ network with a Windows 10 VM. Using https://github.com/stascorp/rdpwrap we could convince Windows to allow concurrent connections, as many as the VM resources allowed.

The legality of this tool is obviously questionable; please read https://github.com/stascorp/rdpwrap/issues/26 if you are a lawyer. Please forgive us: we only needed it for five minutes, to help many customers that no doubt pay tons of money to Microsoft.

We set up an OpenShift instance of our product and could finally check what was going on. And sure enough, we found out right away.

Eureka

Immediately we noticed the timeouts when creating many connections. We saw ‘timeout’ log messages in one of the pods, and when we tried to open a console on it, kubectl returned this strange error:

ERRO[0000] exec failed: container_linux.go:380: starting container process caused: read init-p: connection reset by peer
command terminated with exit code 1

Or this one, which reveals the core issue:

bash: fork: retry: Resource temporarily unavailable
bash: fork: retry: Resource temporarily unavailable
bash: fork: retry: Resource temporarily unavailable

If you have some experience, like me, you immediately realize you are hitting the so-called pids.max limit: the maximum number of processes and/or threads a cgroup (or the system as a whole) is allowed to spawn. To prevent one process from ‘DDoSing’ your system by spawning thousands of processes or threads, this is usually limited.

This definitely makes sense in a container context: the promise is that containers ‘contain’ your application and allow it to share a system with possibly thousands of others in a secure way.
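
As a minimal illustration of the mechanism (cgroup v1 paths, as root, on a throwaway test machine rather than a production node):

# create a pids cgroup with a tiny limit and move the current shell into it
mkdir /sys/fs/cgroup/pids/demo
echo 10 > /sys/fs/cgroup/pids/demo/pids.max
echo $$ > /sys/fs/cgroup/pids/demo/cgroup.procs

# spawning more tasks than allowed now fails with EAGAIN, which bash reports as
#   bash: fork: retry: Resource temporarily unavailable
for i in $(seq 1 20); do sleep 60 & done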

And indeed, when diving deeper, the cgroup limit was set to 1024, and the service worked great after manually increasing it.

Some pointers for the interested (run as root on the worker node, from the pids cgroup hierarchy, typically mounted at /sys/fs/cgroup/pids on cgroup v1):

# cat ./kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod5fc72040_5873_4ace_b4fc_3913f0372284.slice/crio-396495f6de551417a06a7a11f4b474fdcfbf5d6ecb2b2c94a9f5e95c2a2a7b62.scope/pids.max
1024
# echo 2048 > ./kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod5fc72040_5873_4ace_b4fc_3913f0372284.slice/crio-396495f6de551417a06a7a11f4b474fdcfbf5d6ecb2b2c94a9f5e95c2a2a7b62.scope/pids.max
# cat ./kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod5fc72040_5873_4ace_b4fc_3913f0372284.slice/crio-396495f6de551417a06a7a11f4b474fdcfbf5d6ecb2b2c94a9f5e95c2a2a7b62.scope/pids.max
2048
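
The same scope directory also exposes pids.current, which is handy to confirm the container really is bumping into the ceiling:

# cat ./kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-pod5fc72040_5873_4ace_b4fc_3913f0372284.slice/crio-396495f6de551417a06a7a11f4b474fdcfbf5d6ecb2b2c94a9f5e95c2a2a7b62.scope/pids.current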

Implementing the solution

A big relief: it all finally made sense. It turns out OKD 3 used a higher limit than OKD 4. A Red Hat solution describes the problem and the fix: https://access.redhat.com/solutions/5366631

The solution is then really simple: commit the 20 lines of YAML to our GitOps repository and watch the nodes reboot one by one, applying the fix to CRI-O and the kubelet.

If you like YAML, here it is:

apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: set-pids-limit
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    pidsLimit: 16384
---
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: worker-kubeconfig-morepids
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  kubeletConfig:
    podPidsLimit: 16384
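
Applying it by hand instead of through GitOps looks roughly like this (the filename is hypothetical); the Machine Config Operator then rolls the change out over the worker pool node by node:

# apply both objects and watch the worker pool pick up the new rendered config
oc apply -f pids-limit.yaml
oc get machineconfigpool worker -w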

Conclusion

Debugging is hard work. You need persistence and should never give up. You might be puzzled for weeks but in the end it always makes sense.

Applications can really obfuscate what is going on: only timeouts were logged. The application was frantically creating threads that were killed immediately, without anything showing up in the logs. Developers and their frameworks and libraries usually do not foresee low-level problems like this, so you cannot expect much help from them.

That is just the way it is, deal with it.