The Kubernetes contribution that could have been
2020-01-04

The What
The problem we are going to explore in this post arises whenever you have at least one Layer 4 component between the client and the server. These components, such as load balancers, don't understand anything beyond TCP or UDP when operating at L4. Of course, you can have L7-aware load balancers, but those are out of scope for this blog post.
Also, these components have to deal with problems like dangling connections, and so they try to protect themselves from bad actors. One way in which they can protect their precious resources is by killing connections they believe have been unused for a given period of time.
This is what I spotted. Stay with me.
Note that in Kubernetes there are many clients. They range from the standard kubectl CLI tool, to the client-go library that many projects use, to other client libraries. In the end, many components in Kubernetes are clients themselves, watching resources and performing changes on them. What we perceive as Kubernetes is the sum of all these components working together, each one doing a well defined and controlled task.
The issue we are exploring today involves any client that needs to go through a Layer 4 component to reach the apiserver while keeping long-lived connections in an idle state.
And so, our workbench starts to shape up. Let's use kind to create a local HA cluster. kind will create an HAProxy load balancer instance automatically if we request to set up more than one control plane instance.
I have written a patch for kind that makes the problem pop up in a shorter period of time. If you are going to try this out, I strongly suggest you apply this patch and build kind with it for the rest of the exercise.
Let's create a kind cluster:
```
~ > kind create cluster --config ~/.kind/3-masters-1-worker.yaml
```
Where my ~/.kind/3-masters-1-worker.yaml configuration file looks like:
```yaml
kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha3
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
```
After the cluster has been created and is up, we can try to reproduce one of the commands that trigger the problem:
```
~ > kubectl exec -it kube-apiserver-kind-control-plane -n kube-system sh
# ⏎
~ >
```
The connection was closed. That's right. I did nothing. Let's check some other commands, like kubectl logs -f:
```
~ > kubectl logs -lk8s-app=kube-dns -f -n kube-system
...
error: unexpected EOF
~ >
```
So you start to see what's going on. You can reproduce this problem with every command that leaves a connection open, on a stream where changes might or might not happen.
Let's look at what HAProxy has to say when inspecting its stats socket. Install socat:
```
~ > docker exec $(docker ps -q -f name=kind-external-load-balancer) apk add socat
fetch http://dl-cdn.alpinelinux.org/alpine/v3.9/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.9/community/x86_64/APKINDEX.tar.gz
(1/5) Installing ncurses-terminfo-base (6.1_p20190105-r0)
(2/5) Installing ncurses-terminfo (6.1_p20190105-r0)
(3/5) Installing ncurses-libs (6.1_p20190105-r0)
(4/5) Installing readline (7.0.003-r1)
(5/5) Installing socat (1.7.3.2-r5)
Executing busybox-1.29.3-r10.trigger
OK: 15 MiB in 22 packages
```
And watch the HAProxy sessions:
```
~ > docker exec $(docker ps -q -f name=kind-external-load-balancer) watch -n1 'echo "show sess" | socat /var/run/haproxy.sock stdio' | grep "172.17.0.1"
```
Now, let's try to tail again, while we are watching the sessions:
```
~ > kubectl logs -lk8s-app=kube-dns -f -n kube-system
...
error: unexpected EOF
~ >
```
And so, what we see on the output from the session watch is the following:
```
0x55f6155b8720: proto=tcpv4 src=172.17.0.1:60120 fe=control-plane be=kube-apiservers srv=kind-control-plane ts=00 age=0s calls=2 rate=2 cpu=0 lat=0 rq[f=848202h,i=0,an=00h,rx=4s,wx=,ax=] rp[f=80048202h,i=0,an=00h,rx=4s,wx=,ax=] s0=[8,200008h,fd=31,ex=] s1=[8,200018h,fd=46,ex=] exp=4s
0x55f6155b8720: proto=tcpv4 src=172.17.0.1:60120 fe=control-plane be=kube-apiservers srv=kind-control-plane ts=00 age=1s calls=2 rate=0 cpu=0 lat=0 rq[f=848202h,i=0,an=00h,rx=3s,wx=,ax=] rp[f=80048202h,i=0,an=00h,rx=3s,wx=,ax=] s0=[8,200008h,fd=31,ex=] s1=[8,200018h,fd=46,ex=] exp=3s
0x55f6155b8720: proto=tcpv4 src=172.17.0.1:60120 fe=control-plane be=kube-apiservers srv=kind-control-plane ts=00 age=2s calls=2 rate=0 cpu=0 lat=0 rq[f=848202h,i=0,an=00h,rx=2s,wx=,ax=] rp[f=80048202h,i=0,an=00h,rx=2s,wx=,ax=] s0=[8,200008h,fd=31,ex=] s1=[8,200018h,fd=46,ex=] exp=2s
0x55f6155b8720: proto=tcpv4 src=172.17.0.1:60120 fe=control-plane be=kube-apiservers srv=kind-control-plane ts=00 age=3s calls=2 rate=0 cpu=0 lat=0 rq[f=848202h,i=0,an=00h,rx=1s,wx=,ax=] rp[f=80048202h,i=0,an=00h,rx=1s,wx=,ax=] s0=[8,200008h,fd=31,ex=] s1=[8,200018h,fd=46,ex=] exp=1s
0x55f6155b8720: proto=tcpv4 src=172.17.0.1:60120 fe=control-plane be=kube-apiservers srv=kind-control-plane ts=00 age=4s calls=2 rate=0 cpu=0 lat=0 rq[f=848202h,i=0,an=00h,rx=0s,wx=,ax=] rp[f=80048202h,i=0,an=00h,rx=0s,wx=,ax=] s0=[8,200008h,fd=31,ex=] s1=[8,200018h,fd=46,ex=] exp=0s
```
Here we can see the expiration countdown (through the exp in each log line). If no activity happens during this time, when exp reaches 0, HAProxy will close both connections.
So, as long as there is some traffic, everything will be fine. The L4 component will update its stats and will not close the connection. When the stream is silent for long enough, the L4 component will rightfully close the connection. After all, it cannot tell whether a bad actor on the other side is causing this behavior, or whether any of the myriad things that can go wrong with networking has happened.
I could spot this behavior with a number of client commands:
- kubectl exec -it
- kubectl attach
- kubectl logs -f
- kubectl get -w
I'm sure this is not a complete list, but these are the ones I spotted and tried to fix in some way.
Speaking of which…
The How
L4 fine tuning
At first sight it looks like configuration could help here. We increase the timeout, and that's it, right? Well, there are some downsides to that:
- Increasing timeouts in general makes for situations that are harder to debug. They happen less frequently, but they do happen. And when they do, it takes more time to reproduce the issue and find out what's going on.
- Resources are limited; if you mix higher timeouts with bad actors, you get a higher allocation of sockets and possibly exhaustion of certain resources.
Nevertheless, I tried this approach just for the sake of it.
I'll spare you all the possibilities I tried and went through. I ended up setting HAProxy's timeout tunnel to an unreasonably high value (like 1 year), to see what happened.
The timeout tunnel seemed to be exactly what I was looking for:
The tunnel timeout applies when a bidirectional connection is established between a client and a server, and the connection remains inactive in both directions. This timeout supersedes both the client and server timeouts once the connection becomes a tunnel. In TCP, this timeout is used as soon as no analyzer remains attached to either connection (e.g. tcp content rules are accepted).
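To give an idea, the tweak boiled down to something like this in the load balancer configuration (a minimal sketch, not the full haproxy.cfg that kind generates; 8760h is the value the error below complains about):

```
defaults
  mode tcp
  timeout connect 10s
  timeout client  1m
  timeout server  1m
  # roughly 1 year; spoiler: HAProxy will not accept this
  timeout tunnel  8760h
```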
And so, HAProxy refused to start with the given configuration:
```
[ALERT] 330/181418 (1) : parsing [/usr/local/etc/haproxy/haproxy.cfg:17] : timer overflow in argument '8760h' to 'timeout tunnel' (maximum value is 2147483647 ms or ~24.8 days)
```
Even if I wanted to go with this apparently bad idea, HAProxy would refuse to keep a connection alive any longer than ~24.8 days. Fair enough.
If it wasn't obvious enough before, playing with timeouts on the L4 load balancer configuration was not going to get me much further. Moreover, if anything L4-only aware got in the middle, we would still have the same issue if this new component had lower timeouts, which would be perfectly possible (and reasonable).
At this point I started to look at the problem from a different angle. The application should make sure to generate some kind of traffic, even if there's no data on the stream, so anything between the client and the server would notice that there's something happening, and that the stream is not idle from the TCP perspective.
The Kubernetes patch
The idea at this point was to include a new parameter in kubectl called --ping-interval. We'll revisit what ping means in each context.
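Purely as a usage sketch (the flag and its behavior are the WIP proposal from the PR, not something that ever shipped in kubectl):

```sh
# Hypothetical flag from the WIP PR: generate some traffic on the
# underlying connection every 30 seconds while following logs.
kubectl logs -f -lk8s-app=kube-dns -n kube-system --ping-interval=30s
```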
I started looking into how kubectl implemented the commands previously mentioned.
Let's run kubectl with high verbosity to learn what it is doing behind the scenes.
kubectl exec -it
```
~ > kubectl exec -v8 -it kube-apiserver-kind-control-plane -n kube-system sh
(output omitted)
I1127 23:54:11.979201 28755 round_trippers.go:427] Request Headers:
I1127 23:54:11.979209 28755 round_trippers.go:431] X-Stream-Protocol-Version: v4.channel.k8s.io
I1127 23:54:11.979216 28755 round_trippers.go:431] X-Stream-Protocol-Version: v3.channel.k8s.io
I1127 23:54:11.979221 28755 round_trippers.go:431] X-Stream-Protocol-Version: v2.channel.k8s.io
I1127 23:54:11.979228 28755 round_trippers.go:431] X-Stream-Protocol-Version: channel.k8s.io
I1127 23:54:11.979234 28755 round_trippers.go:431] User-Agent: kubectl/v1.18.0 (linux/amd64) kubernetes/9905a33
I1127 23:54:11.995835 28755 round_trippers.go:446] Response Status: 101 Switching Protocols in 16 milliseconds
I1127 23:54:11.995852 28755 round_trippers.go:449] Response Headers:
I1127 23:54:11.995860 28755 round_trippers.go:452] Connection: Upgrade
I1127 23:54:11.995867 28755 round_trippers.go:452] Upgrade: SPDY/3.1
I1127 23:54:11.995873 28755 round_trippers.go:452] X-Stream-Protocol-Version: v4.channel.k8s.io
I1127 23:54:11.995878 28755 round_trippers.go:452] Date: Wed, 27 Nov 2019 22:54:11 GMT
# ⏎
~ >
```
kubectl attach
```
~ > kubectl attach -v8 kube-apiserver-kind-control-plane -n kube-system
(output omitted)
I1127 23:59:52.210600 31733 round_trippers.go:420] POST https://127.0.0.1:35411/api/v1/namespaces/kube-system/pods/kube-apiserver-kind-control-plane/attach?container=kube-apiserver&stderr=true&stdout=true
I1127 23:59:52.210609 31733 round_trippers.go:427] Request Headers:
I1127 23:59:52.210614 31733 round_trippers.go:431] X-Stream-Protocol-Version: v4.channel.k8s.io
I1127 23:59:52.210618 31733 round_trippers.go:431] X-Stream-Protocol-Version: v3.channel.k8s.io
I1127 23:59:52.210621 31733 round_trippers.go:431] X-Stream-Protocol-Version: v2.channel.k8s.io
I1127 23:59:52.210625 31733 round_trippers.go:431] X-Stream-Protocol-Version: channel.k8s.io
I1127 23:59:52.210631 31733 round_trippers.go:431] User-Agent: kubectl/v1.18.0 (linux/amd64) kubernetes/9905a33
I1127 23:59:52.298756 31733 round_trippers.go:446] Response Status: 101 Switching Protocols in 88 milliseconds
I1127 23:59:52.298777 31733 round_trippers.go:449] Response Headers:
I1127 23:59:52.298783 31733 round_trippers.go:452] X-Stream-Protocol-Version: v4.channel.k8s.io
I1127 23:59:52.298788 31733 round_trippers.go:452] Date: Wed, 27 Nov 2019 22:59:52 GMT
I1127 23:59:52.298793 31733 round_trippers.go:452] Connection: Upgrade
I1127 23:59:52.298798 31733 round_trippers.go:452] Upgrade: SPDY/3.1
~ >
```
kubectl logs -f
```
~ > kubectl logs -v8 -lk8s-app=kube-dns -f -n kube-system
(output omitted)
I1127 23:58:21.008411 31143 round_trippers.go:420] GET https://127.0.0.1:35411/api/v1/namespaces/kube-system/pods/coredns-5644d7b6d9-8vflb/log?follow=true&tailLines=10
I1127 23:58:21.008411 31143 round_trippers.go:420] GET https://127.0.0.1:35411/api/v1/namespaces/kube-system/pods/coredns-5644d7b6d9-r2vkw/log?follow=true&tailLines=10
I1127 23:58:21.008436 31143 round_trippers.go:427] Request Headers:
I1127 23:58:21.008440 31143 round_trippers.go:427] Request Headers:
I1127 23:58:21.008445 31143 round_trippers.go:431] Accept: application/json, */*
I1127 23:58:21.008447 31143 round_trippers.go:431] Accept: application/json, */*
I1127 23:58:21.008452 31143 round_trippers.go:431] User-Agent: kubectl/v1.18.0 (linux/amd64) kubernetes/9905a33
I1127 23:58:21.008456 31143 round_trippers.go:431] User-Agent: kubectl/v1.18.0 (linux/amd64) kubernetes/9905a33
I1127 23:58:21.021126 31143 round_trippers.go:446] Response Status: 200 OK in 12 milliseconds
I1127 23:58:21.021151 31143 round_trippers.go:449] Response Headers:
I1127 23:58:21.021162 31143 round_trippers.go:452] Content-Type: text/plain
I1127 23:58:21.021179 31143 round_trippers.go:452] Date: Wed, 27 Nov 2019 22:58:21 GMT
I1127 23:58:21.021187 31143 round_trippers.go:452] Cache-Control: no-cache, private
.:53
2019-11-27T22:37:59.639Z [INFO] plugin/reload: Running configuration MD5 = f64cb9b977c7dfca58c4fab108535a76
2019-11-27T22:37:59.639Z [INFO] CoreDNS-1.6.2
2019-11-27T22:37:59.639Z [INFO] linux/amd64, go1.12.8, 795a3eb
CoreDNS-1.6.2
linux/amd64, go1.12.8, 795a3eb
I1127 23:58:21.023865 31143 round_trippers.go:446] Response Status: 200 OK in 15 milliseconds
I1127 23:58:21.023881 31143 round_trippers.go:449] Response Headers:
I1127 23:58:21.023890 31143 round_trippers.go:452] Cache-Control: no-cache, private
I1127 23:58:21.023908 31143 round_trippers.go:452] Content-Type: text/plain
I1127 23:58:21.023915 31143 round_trippers.go:452] Date: Wed, 27 Nov 2019 22:58:21 GMT
.:53
2019-11-27T22:37:59.608Z [INFO] plugin/reload: Running configuration MD5 = f64cb9b977c7dfca58c4fab108535a76
2019-11-27T22:37:59.608Z [INFO] CoreDNS-1.6.2
2019-11-27T22:37:59.608Z [INFO] linux/amd64, go1.12.8, 795a3eb
CoreDNS-1.6.2
linux/amd64, go1.12.8, 795a3eb
F1127 23:58:22.025841 31143 helpers.go:114] error: unexpected EOF
~ >
```
kubectl get -w
```
~ > kubectl get -v8 -w pods -n kube-system
(output omitted)
I1128 00:07:33.132398 36629 round_trippers.go:420] GET https://127.0.0.1:35411/api/v1/namespaces/kube-system/pods?resourceVersion=3023&watch=true
I1128 00:07:33.132412 36629 round_trippers.go:427] Request Headers:
I1128 00:07:33.132421 36629 round_trippers.go:431] Accept: application/json;as=Table;v=v1beta1;g=meta.k8s.io, application/json
I1128 00:07:33.132430 36629 round_trippers.go:431] User-Agent: kubectl/v1.18.0 (linux/amd64) kubernetes/9905a33
I1128 00:07:33.133569 36629 round_trippers.go:446] Response Status: 200 OK in 1 milliseconds
I1128 00:07:33.133583 36629 round_trippers.go:449] Response Headers:
I1128 00:07:33.133590 36629 round_trippers.go:452] Date: Wed, 27 Nov 2019 23:07:33 GMT
I1128 00:07:33.133594 36629 round_trippers.go:452] Cache-Control: no-cache, private
I1128 00:07:33.133599 36629 round_trippers.go:452] Content-Type: application/json
I1128 00:07:34.135522 36629 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
~ >
```
Observations
At first sight, exec and attach are very similar. Both are upgrading the HTTP connection to SPDY/3.1.
On the other hand, logs -f and get -w are HTTP/1.1 requests.
The patch
The pull request I opened can be seen here. Please bear in mind that it was a heavy work in progress and many things were not even considered. Let's go through it.
The strategy
Different commands had slightly different strategies.
exec -it and attach
Both commands rely on apimachinery's httpstream. The main idea was to add a Ping() method to the Connection interface. In this case, this method would rely directly on calling into docker's spdystream backend (that looks like this).
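A minimal sketch of that idea, with the interface trimmed down to the relevant bits (the real httpstream.Connection has more methods; everything here is illustrative rather than the actual apimachinery code):

```go
// Sketch only: not the actual apimachinery code, just the shape of the idea.
package sketch

import (
	"time"

	"github.com/docker/spdystream"
)

// Connection is a trimmed-down stand-in for httpstream.Connection,
// extended with the proposed Ping method.
type Connection interface {
	Close() error
	// Ping generates traffic on the underlying TCP connection by sending
	// a SPDY ping frame and returns the measured round-trip time.
	Ping() (time.Duration, error)
}

// spdyConnection implements Connection on top of docker's spdystream.
type spdyConnection struct {
	conn *spdystream.Connection
}

func (c *spdyConnection) Close() error { return c.conn.Close() }

// Ping delegates directly to the spdystream backend, which already knows
// how to send a ping frame to the other end.
func (c *spdyConnection) Ping() (time.Duration, error) {
	return c.conn.Ping()
}
```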
logs -f
The ResponseWrapper interface has a Stream() function, so the idea was to add a new StreamWithPing() function. That way, users of the client-go library could choose which one to use depending on their needs, while keeping the original behavior intact.
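A rough sketch of the shape this took (the interface below is a simplified stand-in for client-go's ResponseWrapper; StreamWithPing and its signature are the proposal, not an existing API):

```go
// Sketch only: illustrative signatures, not the real client-go interface.
package sketch

import (
	"io"
	"time"
)

// ResponseWrapper mimics the relevant part of client-go.
type ResponseWrapper interface {
	// Stream is the existing behavior: open the stream and return it.
	Stream() (io.ReadCloser, error)
	// StreamWithPing is the proposed addition: same as Stream, but it also
	// generates some traffic on the connection every interval while the
	// stream stays open, so idle timeouts in between are never hit.
	StreamWithPing(interval time.Duration) (io.ReadCloser, error)
}
```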
get -w
get -w is another slightly different case. It will run in a loop watching for changes, so there's no need to strictly keep the connection alive. If the connection is closed on the server end, the client could just be instructed to reissue the same request.
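Something along these lines on the client side (a simplified sketch; watchFrom and Event are hypothetical stand-ins for the real client-go watch machinery):

```go
// Sketch only: re-issue the watch request whenever the server (or anything
// in between) closes the connection, resuming from the last seen version.
package sketch

import "fmt"

// Event and watchFrom are hypothetical stand-ins for the real
// client-go watch machinery.
type Event struct {
	ResourceVersion string
	Object          interface{}
}

func watchFrom(resourceVersion string) (<-chan Event, error) {
	// ... issue GET ...?watch=true&resourceVersion=<resourceVersion> ...
	return nil, fmt.Errorf("not implemented in this sketch")
}

func watchForever(handle func(Event)) error {
	lastSeen := ""
	for {
		events, err := watchFrom(lastSeen)
		if err != nil {
			return err
		}
		// When the channel closes (e.g. the L4 component killed the idle
		// connection), fall through and start a new watch from lastSeen.
		for ev := range events {
			lastSeen = ev.ResourceVersion
			handle(ev)
		}
	}
}
```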
client-go
Then, a keepalive package would be included in client-go, so it could be used with different backends (e.g. spdystream, golang's net/http and net/http2…). This package would define a Pinger interface containing a single function: Ping(). The different implementations here are in part what I liked the least about this proposal. There were several pinger implementations in the PR: an spdy pinger, a restclient pinger and an http pinger.
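The heart of such a keepalive package can be sketched in a few lines (illustrative; the names are mine, not necessarily the ones used in the PR):

```go
// Sketch only: a single-method Pinger interface plus a loop driving it
// at a fixed interval.
package keepalive

import "time"

// Pinger is anything that can generate some traffic on the wire.
type Pinger interface {
	Ping() error
}

// Run calls pinger.Ping() every interval until stop is closed.
// Errors are ignored here; the point is only to keep the socket busy.
func Run(pinger Pinger, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			_ = pinger.Ping()
		case <-stop:
			return
		}
	}
}
```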
While the spdy pinger was clearly the implementation I felt most comfortable with, that wasn't the case for the restclient and http pinger implementations.
All my tests were successful with all backends, but the restclient and http solutions were only working because of how golang's net/http internally reuses sockets.
Apart from that, it wasn't possible to have a dedicated net/http2 backend: if HTTP/2 is used through golang's net/http (which bundles net/http2), it's impossible to access net/http2 internals from within net/http, in particular the ClientConn associated with the connection, in order to Ping() the other end. Had golang permitted this somehow, it would have been possible to take advantage of HTTP/2's native ping frame, which would have been sufficient to generate some TCP traffic while knowing that the right socket would have been used.
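To make that concrete: when you do control the connection yourself, golang.org/x/net/http2 exposes exactly what is needed (a sketch; dialing and TLS setup are elided to the bare minimum, error handling kept minimal):

```go
// Sketch only: using golang.org/x/net/http2 directly, where the ClientConn
// (and its Ping method, which sends an HTTP/2 PING frame) is accessible.
// When HTTP/2 is used through net/http, this ClientConn is hidden, which
// is the problem described above.
package sketch

import (
	"context"
	"crypto/tls"
	"time"

	"golang.org/x/net/http2"
)

func pingLoop(ctx context.Context, addr string, interval time.Duration) error {
	conn, err := tls.Dial("tcp", addr, &tls.Config{NextProtos: []string{"h2"}})
	if err != nil {
		return err
	}
	tr := &http2.Transport{}
	cc, err := tr.NewClientConn(conn)
	if err != nil {
		return err
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// Sends a real HTTP/2 PING frame on this exact TCP connection.
			if err := cc.Ping(ctx); err != nil {
				return err
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```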
The main problem with the restclient and http pingers is that, if golang's internal implementation changed and didn't reuse sockets under certain circumstances, our "pings" wouldn't go through the expected TCP socket, so they wouldn't count as traffic on that connection, and eventually the connection would be closed anyway.
Also, there is the problem of what ping means in an HTTP/1.1 context. In my very first proposal in the PR it was a mere HTTP request to scheme://hostPort/healthz. Whether this operation was allowed, whether we had permissions to perform it, or even whether the endpoint existed were things we didn't really care about. As long as the request was made, and some TCP traffic made it through, the goal was achieved.
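In essence, the HTTP/1.1 "ping" was nothing more than this (sketched with plain net/http; in the real code the client and hostPort come from the kubeconfig):

```go
// Sketch only: an HTTP/1.1 "ping" is just a cheap request whose sole
// purpose is to move some bytes over the (hopefully reused) TCP socket.
// Whether /healthz answers 200, 403 or 404 is irrelevant here.
package sketch

import "net/http"

func ping(client *http.Client, hostPort string) {
	resp, err := client.Get("https://" + hostPort + "/healthz")
	if err != nil {
		return // best effort: errors are ignored
	}
	// Close the body so the connection can go back to the pool and be reused.
	resp.Body.Close()
}
```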
In any case, the solution was not optimal for restclient or for http, and it wasn't possible to properly extract and create a dedicated http2 pinger because of how golang hides the internal implementation of net/http2 behind net/http when HTTP/2 is used through net/http.
And so, while I was thinking about all these problems, and all these things I didn't really like about this possible solution, another possibility was proposed.
The golang patch
There was a proposal for golang to have the runtime perform these pings automatically on http2.
Even though this solution might not be the definitive one (and it only fits http2), I think it's good enough to get the ball rolling in the golang space.
Knowing that golang will automatically send ping frames to the other end at intervals allows us to just configure the intervals and let the golang runtime do the rest. Since this is clearly an application layer problem I think it makes a lot of sense to have the solution in this space.
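For illustration, and assuming the shape this proposal eventually took in golang.org/x/net/http2 (none of these knobs existed when this post was written), configuring the intervals looks roughly like this:

```go
// Sketch only: HTTP/2 keepalive pings configured at the transport level,
// using the knobs the proposal eventually shipped with in
// golang.org/x/net/http2.
package sketch

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func newPingingClient() (*http.Client, error) {
	tr := &http.Transport{}
	h2, err := http2.ConfigureTransports(tr)
	if err != nil {
		return nil, err
	}
	// If no frame is received for ReadIdleTimeout, send a PING frame;
	// close the connection if no answer arrives within PingTimeout.
	h2.ReadIdleTimeout = 30 * time.Second
	h2.PingTimeout = 15 * time.Second
	return &http.Client{Transport: tr}, nil
}
```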
Conclusion
Depending on how this feature lands in golang, and depending on its design, I might think about this problem again in the Kubernetes space, and check again if it makes sense to implement something like --ping-interval baked into kubectl and client-go.
For now, on to something different.