The Kubernetes contribution that could have been
2020-01-04

The What
The problem we are going to explore in this post arises whenever you have at least one Layer 4 component between the client and the server. These components, such as load balancers, don't understand anything beyond TCP or UDP when operating at L4. Of course, you can have L7-aware load balancers, but those are out of scope for this blog post.
Also, these components have to deal with problems like dangling connections, and so they try to protect themselves from bad actors. One way in which they can protect their precious resources is by killing connections they believe have been unused for a given period of time.
This is what I spotted. Stay with me.
Note that in Kubernetes there are many clients. They range from the standard kubectl CLI tool, to the client-go library that many projects use, to other client libraries. In the end, many components in Kubernetes are clients themselves, watching resources and performing changes on them. What we perceive as Kubernetes is the sum of all these components working together, each one doing a well defined and controlled task.
The issue we are exploring today involves any client that needs to go through a Layer 4 component to reach the apiserver while keeping long-lived connections in an idle state.
And so, our workbench starts to shape up. Let's use kind to create a local HA cluster. kind will create an HAProxy load balancer instance automatically if we request to set up more than one control plane instance.
I have written a patch for kind that makes the problem pop up in a shorter period of time. If you are going to try this out, I strongly suggest you apply this patch and build kind with it for the rest of the exercise.
Let's create a kind cluster:
```
~ > kind create cluster --config ~/.kind/3-masters-1-worker.yaml
```
Where my ~/.kind/3-masters-1-worker.yaml configuration file looks like:
```yaml
kind: Cluster
apiVersion: kind.sigs.k8s.io/v1alpha3
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
```
After the cluster has been created and is up, we can try to reproduce one of the commands that trigger the problem:
```
~ > kubectl exec -it kube-apiserver-kind-control-plane -n kube-system sh
# ⏎
~ >
```
The connection was closed. That's right. I did nothing. Let's check some other commands, like kubectl logs -f:
```
~ > kubectl logs -lk8s-app=kube-dns -f -n kube-system
...
error: unexpected EOF
~ >
```
So you start to see what's going on. You can reproduce this problem with every command that leaves a connection open, on a stream where changes might or might not happen.
Let's look at what HAProxy has to say when inspecting its stats socket. Install socat:
```
~ > docker exec $(docker ps -q -f name=kind-external-load-balancer) apk add socat
fetch http://dl-cdn.alpinelinux.org/alpine/v3.9/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.9/community/x86_64/APKINDEX.tar.gz
(1/5) Installing ncurses-terminfo-base (6.1_p20190105-r0)
(2/5) Installing ncurses-terminfo (6.1_p20190105-r0)
(3/5) Installing ncurses-libs (6.1_p20190105-r0)
(4/5) Installing readline (7.0.003-r1)
(5/5) Installing socat (1.7.3.2-r5)
Executing busybox-1.29.3-r10.trigger
OK: 15 MiB in 22 packages
```
And watch the HAProxy sessions:
```
~ > docker exec $(docker ps -q -f name=kind-external-load-balancer) watch -n1 'echo "show sess" | socat /var/run/haproxy.sock stdio' | grep "172.17.0.1"
```
Now, let's try to tail again, while we are watching the sessions:
```
~ > kubectl logs -lk8s-app=kube-dns -f -n kube-system
...
error: unexpected EOF
~ >
```
And so, what we see on the output from the session watch is the following:
```
0x55f6155b8720: proto=tcpv4 src=172.17.0.1:60120 fe=control-plane be=kube-apiservers srv=kind-control-plane ts=00 age=0s calls=2 rate=2 cpu=0 lat=0 rq[f=848202h,i=0,an=00h,rx=4s,wx=,ax=] rp[f=80048202h,i=0,an=00h,rx=4s,wx=,ax=] s0=[8,200008h,fd=31,ex=] s1=[8,200018h,fd=46,ex=] exp=4s
0x55f6155b8720: proto=tcpv4 src=172.17.0.1:60120 fe=control-plane be=kube-apiservers srv=kind-control-plane ts=00 age=1s calls=2 rate=0 cpu=0 lat=0 rq[f=848202h,i=0,an=00h,rx=3s,wx=,ax=] rp[f=80048202h,i=0,an=00h,rx=3s,wx=,ax=] s0=[8,200008h,fd=31,ex=] s1=[8,200018h,fd=46,ex=] exp=3s
0x55f6155b8720: proto=tcpv4 src=172.17.0.1:60120 fe=control-plane be=kube-apiservers srv=kind-control-plane ts=00 age=2s calls=2 rate=0 cpu=0 lat=0 rq[f=848202h,i=0,an=00h,rx=2s,wx=,ax=] rp[f=80048202h,i=0,an=00h,rx=2s,wx=,ax=] s0=[8,200008h,fd=31,ex=] s1=[8,200018h,fd=46,ex=] exp=2s
0x55f6155b8720: proto=tcpv4 src=172.17.0.1:60120 fe=control-plane be=kube-apiservers srv=kind-control-plane ts=00 age=3s calls=2 rate=0 cpu=0 lat=0 rq[f=848202h,i=0,an=00h,rx=1s,wx=,ax=] rp[f=80048202h,i=0,an=00h,rx=1s,wx=,ax=] s0=[8,200008h,fd=31,ex=] s1=[8,200018h,fd=46,ex=] exp=1s
0x55f6155b8720: proto=tcpv4 src=172.17.0.1:60120 fe=control-plane be=kube-apiservers srv=kind-control-plane ts=00 age=4s calls=2 rate=0 cpu=0 lat=0 rq[f=848202h,i=0,an=00h,rx=0s,wx=,ax=] rp[f=80048202h,i=0,an=00h,rx=0s,wx=,ax=] s0=[8,200008h,fd=31,ex=] s1=[8,200018h,fd=46,ex=] exp=0s
```
Here we can see the expiration countdown (through the exp in each log line). If no activity happens during this time, when exp reaches 0, HAProxy will close both connections.
So, as long as there is some traffic, everything will be fine. The L4 component will update its stats and will not close the connection. When the stream is silent for long enough, the L4 component will rightfully close the connection. After all, it cannot tell whether a bad actor on the other side is causing this behavior, or whether any of the myriad things that can go wrong with networking has happened.
I could spot this behavior with a number of client commands:
- kubectl exec -it
- kubectl attach
- kubectl logs -f
- kubectl get -w
I'm sure this is not a complete list, but these are the ones I spotted and tried to fix in some way.
Speaking of which…
The How
L4 fine tuning
At first sight it looks like configuration could help here. We increase the timeout, and that's it, right? Well, there are some downsides to that:
- Increasing timeouts in general makes for situations that are harder to debug. They happen less frequently, but they do happen. And when they do, it takes more time to reproduce the issue and find out what's going on.
- Resources are limited; if you mix higher timeouts with bad actors, you get a higher allocation of sockets and possibly exhaustion of certain resources.
Nevertheless, I tried this approach just for the sake of it.
I'll spare you all the possibilities I tried and went through. I ended up setting HAProxy's timeout tunnel to an unreasonably high value (like 1 year), to see what happened.
The timeout tunnel seemed to be exactly what I was looking for:
The tunnel timeout applies when a bidirectional connection is established between a client and a server, and the connection remains inactive in both directions. This timeout supersedes both the client and server timeouts once the connection becomes a tunnel. In TCP, this timeout is used as soon as no analyzer remains attached to either connection (e.g. tcp content rules are accepted).
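To give an idea, the tweak boiled down to something like this in the load balancer configuration (a minimal sketch, not the full haproxy.cfg that kind generates; 8760h is the value the error below complains about):

```
defaults
  mode tcp
  timeout connect 10s
  timeout client  1m
  timeout server  1m
  # roughly 1 year; spoiler: HAProxy will not accept this
  timeout tunnel  8760h
```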
And so, HAProxy refused to start with the given configuration:
```
[ALERT] 330/181418 (1) : parsing [/usr/local/etc/haproxy/haproxy.cfg:17] : timer overflow in argument '8760h' to 'timeout tunnel' (maximum value is 2147483647 ms or ~24.8 days)
```
Even if I wanted to go with this apparently bad idea, HAProxy would refuse to keep a connection alive any longer than ~24.8 days. Fair enough.
If it wasn't obvious enough before, playing with timeouts on the L4 load balancer configuration was not going to get me much further. Moreover, if anything L4-only aware got in the middle, we would still have the same issue if this new component had lower timeouts, which would be perfectly possible (and reasonable).
At this point I started to look at the problem from a different angle. The application should make sure to generate some kind of traffic, even if there's no data on the stream, so anything between the client and the server would notice that there's something happening, and that the stream is not idle from the TCP perspective.
The Kubernetes patch
The idea at this point was to include a new parameter in kubectl called --ping-interval. We'll revisit what ping means in each context.
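Purely as a usage sketch (the flag and its behavior are the WIP proposal from the PR, not something that ever shipped in kubectl):

```sh
# Hypothetical flag from the WIP PR: generate some traffic on the
# underlying connection every 30 seconds while following logs.
kubectl logs -f -lk8s-app=kube-dns -n kube-system --ping-interval=30s
```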
I started looking into how kubectl implemented the commands previously mentioned.
Let's run kubectl with high verbosity to learn what it is doing behind the scenes.
kubectl exec -it
```
~ > kubectl exec -v8 -it kube-apiserver-kind-control-plane -n kube-system sh
(output omitted)
I1127 23:54:11.979201 28755 round_trippers.go:427] Request Headers:
I1127 23:54:11.979209 28755 round_trippers.go:431] X-Stream-Protocol-Version: v4.channel.k8s.io
I1127 23:54:11.979216 28755 round_trippers.go:431] X-Stream-Protocol-Version: v3.channel.k8s.io
I1127 23:54:11.979221 28755 round_trippers.go:431] X-Stream-Protocol-Version: v2.channel.k8s.io
I1127 23:54:11.979228 28755 round_trippers.go:431] X-Stream-Protocol-Version: channel.k8s.io
I1127 23:54:11.979234 28755 round_trippers.go:431] User-Agent: kubectl/v1.18.0 (linux/amd64) kubernetes/9905a33
I1127 23:54:11.995835 28755 round_trippers.go:446] Response Status: 101 Switching Protocols in 16 milliseconds
I1127 23:54:11.995852 28755 round_trippers.go:449] Response Headers:
I1127 23:54:11.995860 28755 round_trippers.go:452] Connection: Upgrade
I1127 23:54:11.995867 28755 round_trippers.go:452] Upgrade: SPDY/3.1
I1127 23:54:11.995873 28755 round_trippers.go:452] X-Stream-Protocol-Version: v4.channel.k8s.io
I1127 23:54:11.995878 28755 round_trippers.go:452] Date: Wed, 27 Nov 2019 22:54:11 GMT
# ⏎
~ >
```
kubectl attach
```
~ > kubectl attach -v8 kube-apiserver-kind-control-plane -n kube-system
(output omitted)
I1127 23:59:52.210600 31733 round_trippers.go:420] POST https://127.0.0.1:35411/api/v1/namespaces/kube-system/pods/kube-apiserver-kind-control-plane/attach?container=kube-apiserver&stderr=true&stdout=true
I1127 23:59:52.210609 31733 round_trippers.go:427] Request Headers:
I1127 23:59:52.210614 31733 round_trippers.go:431] X-Stream-Protocol-Version: v4.channel.k8s.io
I1127 23:59:52.210618 31733 round_trippers.go:431] X-Stream-Protocol-Version: v3.channel.k8s.io
I1127 23:59:52.210621 31733 round_trippers.go:431] X-Stream-Protocol-Version: v2.channel.k8s.io
I1127 23:59:52.210625 31733 round_trippers.go:431] X-Stream-Protocol-Version: channel.k8s.io
I1127 23:59:52.210631 31733 round_trippers.go:431] User-Agent: kubectl/v1.18.0 (linux/amd64) kubernetes/9905a33
I1127 23:59:52.298756 31733 round_trippers.go:446] Response Status: 101 Switching Protocols in 88 milliseconds
I1127 23:59:52.298777 31733 round_trippers.go:449] Response Headers:
I1127 23:59:52.298783 31733 round_trippers.go:452] X-Stream-Protocol-Version: v4.channel.k8s.io
I1127 23:59:52.298788 31733 round_trippers.go:452] Date: Wed, 27 Nov 2019 22:59:52 GMT
I1127 23:59:52.298793 31733 round_trippers.go:452] Connection: Upgrade
I1127 23:59:52.298798 31733 round_trippers.go:452] Upgrade: SPDY/3.1
~ >
```
kubectl logs -f
```
~ > kubectl logs -v8 -lk8s-app=kube-dns -f -n kube-system
(output omitted)
I1127 23:58:21.008411 31143 round_trippers.go:420] GET https://127.0.0.1:35411/api/v1/namespaces/kube-system/pods/coredns-5644d7b6d9-8vflb/log?follow=true&tailLines=10
I1127 23:58:21.008411 31143 round_trippers.go:420] GET https://127.0.0.1:35411/api/v1/namespaces/kube-system/pods/coredns-5644d7b6d9-r2vkw/log?follow=true&tailLines=10
I1127 23:58:21.008436 31143 round_trippers.go:427] Request Headers:
I1127 23:58:21.008440 31143 round_trippers.go:427] Request Headers:
I1127 23:58:21.008445 31143 round_trippers.go:431] Accept: application/json, */*
I1127 23:58:21.008447 31143 round_trippers.go:431] Accept: application/json, */*
I1127 23:58:21.008452 31143 round_trippers.go:431] User-Agent: kubectl/v1.18.0 (linux/amd64) kubernetes/9905a33
I1127 23:58:21.008456 31143 round_trippers.go:431] User-Agent: kubectl/v1.18.0 (linux/amd64) kubernetes/9905a33
I1127 23:58:21.021126 31143 round_trippers.go:446] Response Status: 200 OK in 12 milliseconds
I1127 23:58:21.021151 31143 round_trippers.go:449] Response Headers:
I1127 23:58:21.021162 31143 round_trippers.go:452] Content-Type: text/plain
I1127 23:58:21.021179 31143 round_trippers.go:452] Date: Wed, 27 Nov 2019 22:58:21 GMT
I1127 23:58:21.021187 31143 round_trippers.go:452] Cache-Control: no-cache, private
.:53
2019-11-27T22:37:59.639Z [INFO] plugin/reload: Running configuration MD5 = f64cb9b977c7dfca58c4fab108535a76
2019-11-27T22:37:59.639Z [INFO] CoreDNS-1.6.2
2019-11-27T22:37:59.639Z [INFO] linux/amd64, go1.12.8, 795a3eb
CoreDNS-1.6.2
linux/amd64, go1.12.8, 795a3eb
I1127 23:58:21.023865 31143 round_trippers.go:446] Response Status: 200 OK in 15 milliseconds
I1127 23:58:21.023881 31143 round_trippers.go:449] Response Headers:
I1127 23:58:21.023890 31143 round_trippers.go:452] Cache-Control: no-cache, private
I1127 23:58:21.023908 31143 round_trippers.go:452] Content-Type: text/plain
I1127 23:58:21.023915 31143 round_trippers.go:452] Date: Wed, 27 Nov 2019 22:58:21 GMT
.:53
2019-11-27T22:37:59.608Z [INFO] plugin/reload: Running configuration MD5 = f64cb9b977c7dfca58c4fab108535a76
2019-11-27T22:37:59.608Z [INFO] CoreDNS-1.6.2
2019-11-27T22:37:59.608Z [INFO] linux/amd64, go1.12.8, 795a3eb
CoreDNS-1.6.2
linux/amd64, go1.12.8, 795a3eb
F1127 23:58:22.025841 31143 helpers.go:114] error: unexpected EOF
~ >
```
kubectl get -w
```
~ > kubectl get -v8 -w pods -n kube-system
(output omitted)
I1128 00:07:33.132398 36629 round_trippers.go:420] GET https://127.0.0.1:35411/api/v1/namespaces/kube-system/pods?resourceVersion=3023&watch=true
I1128 00:07:33.132412 36629 round_trippers.go:427] Request Headers:
I1128 00:07:33.132421 36629 round_trippers.go:431] Accept: application/json;as=Table;v=v1beta1;g=meta.k8s.io, application/json
I1128 00:07:33.132430 36629 round_trippers.go:431] User-Agent: kubectl/v1.18.0 (linux/amd64) kubernetes/9905a33
I1128 00:07:33.133569 36629 round_trippers.go:446] Response Status: 200 OK in 1 milliseconds
I1128 00:07:33.133583 36629 round_trippers.go:449] Response Headers:
I1128 00:07:33.133590 36629 round_trippers.go:452] Date: Wed, 27 Nov 2019 23:07:33 GMT
I1128 00:07:33.133594 36629 round_trippers.go:452] Cache-Control: no-cache, private
I1128 00:07:33.133599 36629 round_trippers.go:452] Content-Type: application/json
I1128 00:07:34.135522 36629 streamwatcher.go:114] Unexpected EOF during watch stream event decoding: unexpected EOF
~ >
```
Observations
At first sight, exec and attach are very similar. Both are upgrading the HTTP connection to SPDY/3.1.
On the other hand, logs -f and get -w are HTTP/1.1 requests.
The patch
The pull request I opened can be seen here. Please bear in mind that it was a heavy work in progress and many things were not even considered. Let's go through it.
The strategy
Different commands had slightly different strategies.
exec -it and attach
Both commands rely on apimachinery's httpstream. The main idea was to add a Ping() method to the Connection interface. In this case, this method would rely directly on calling into docker's spdystream backend (that looks like this).
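A minimal sketch of that idea, with the interface trimmed down to the relevant bits (the real httpstream.Connection has more methods; everything here is illustrative rather than the actual apimachinery code):

```go
// Sketch only: not the actual apimachinery code, just the shape of the idea.
package sketch

import (
	"time"

	"github.com/docker/spdystream"
)

// Connection is a trimmed-down stand-in for httpstream.Connection,
// extended with the proposed Ping method.
type Connection interface {
	Close() error
	// Ping generates traffic on the underlying TCP connection by sending
	// a SPDY ping frame and returns the measured round-trip time.
	Ping() (time.Duration, error)
}

// spdyConnection implements Connection on top of docker's spdystream.
type spdyConnection struct {
	conn *spdystream.Connection
}

func (c *spdyConnection) Close() error { return c.conn.Close() }

// Ping delegates directly to the spdystream backend, which already knows
// how to send a ping frame to the other end.
func (c *spdyConnection) Ping() (time.Duration, error) {
	return c.conn.Ping()
}
```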
logs -f
The ResponseWrapper interface has a Stream() function, so the idea was to add a new StreamWithPing() function. That way, users of the client-go library could choose which one to use depending on their needs, while keeping the original behavior intact.
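A rough sketch of the shape this took (the interface below is a simplified stand-in for client-go's ResponseWrapper; StreamWithPing and its signature are the proposal, not an existing API):

```go
// Sketch only: illustrative signatures, not the real client-go interface.
package sketch

import (
	"io"
	"time"
)

// ResponseWrapper mimics the relevant part of client-go.
type ResponseWrapper interface {
	// Stream is the existing behavior: open the stream and return it.
	Stream() (io.ReadCloser, error)
	// StreamWithPing is the proposed addition: same as Stream, but it also
	// generates some traffic on the connection every interval while the
	// stream stays open, so idle timeouts in between are never hit.
	StreamWithPing(interval time.Duration) (io.ReadCloser, error)
}
```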
get -w
get -w is another slightly different case. It will run in a loop watching for changes, so there's no need to strictly keep the connection alive. If the connection is closed on the server end, the client could just be instructed to reissue the same request.
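Something along these lines on the client side (a simplified sketch; watchFrom and Event are hypothetical stand-ins for the real client-go watch machinery):

```go
// Sketch only: re-issue the watch request whenever the server (or anything
// in between) closes the connection, resuming from the last seen version.
package sketch

import "fmt"

// Event and watchFrom are hypothetical stand-ins for the real
// client-go watch machinery.
type Event struct {
	ResourceVersion string
	Object          interface{}
}

func watchFrom(resourceVersion string) (<-chan Event, error) {
	// ... issue GET ...?watch=true&resourceVersion=<resourceVersion> ...
	return nil, fmt.Errorf("not implemented in this sketch")
}

func watchForever(handle func(Event)) error {
	lastSeen := ""
	for {
		events, err := watchFrom(lastSeen)
		if err != nil {
			return err
		}
		// When the channel closes (e.g. the L4 component killed the idle
		// connection), fall through and start a new watch from lastSeen.
		for ev := range events {
			lastSeen = ev.ResourceVersion
			handle(ev)
		}
	}
}
```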
client-go
Then, a keepalive package would be included in client-go, so it could be used with different backends (e.g. spdystream, golang's net/http and net/http2…). This package would define a Pinger interface containing a single function: Ping(). The different implementations here are in part what I liked the least about this proposal. There were several pinger implementations in the PR: an spdy pinger, a restclient pinger and an http pinger.
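The heart of such a keepalive package can be sketched in a few lines (illustrative; the names are mine, not necessarily the ones used in the PR):

```go
// Sketch only: a single-method Pinger interface plus a loop driving it
// at a fixed interval.
package keepalive

import "time"

// Pinger is anything that can generate some traffic on the wire.
type Pinger interface {
	Ping() error
}

// Run calls pinger.Ping() every interval until stop is closed.
// Errors are ignored here; the point is only to keep the socket busy.
func Run(pinger Pinger, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			_ = pinger.Ping()
		case <-stop:
			return
		}
	}
}
```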
While the spdy pinger was clearly the implementation I felt most comfortable with, that wasn't the case for the restclient and http pinger implementations.
All my tests were successful with all backends, but the restclient and http solutions were only working because of how golang's net/http internally reuses sockets.
Apart from that, it wasn't possible to have a dedicated net/http2 backend: if HTTP/2 is used through golang's net/http (which bundles net/http2), it's impossible to access net/http2 internals from within net/http, in particular the ClientConn associated with the connection, in order to Ping() the other end. Had golang permitted this somehow, it would have been possible to take advantage of HTTP/2's native ping frame, which would have been sufficient to generate some TCP traffic while knowing that the right socket would have been used.
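To make that concrete: when you do control the connection yourself, golang.org/x/net/http2 exposes exactly what is needed (a sketch; dialing and TLS setup are elided to the bare minimum, error handling kept minimal):

```go
// Sketch only: using golang.org/x/net/http2 directly, where the ClientConn
// (and its Ping method, which sends an HTTP/2 PING frame) is accessible.
// When HTTP/2 is used through net/http, this ClientConn is hidden, which
// is the problem described above.
package sketch

import (
	"context"
	"crypto/tls"
	"time"

	"golang.org/x/net/http2"
)

func pingLoop(ctx context.Context, addr string, interval time.Duration) error {
	conn, err := tls.Dial("tcp", addr, &tls.Config{NextProtos: []string{"h2"}})
	if err != nil {
		return err
	}
	tr := &http2.Transport{}
	cc, err := tr.NewClientConn(conn)
	if err != nil {
		return err
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			// Sends a real HTTP/2 PING frame on this exact TCP connection.
			if err := cc.Ping(ctx); err != nil {
				return err
			}
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}
```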
The main problem with the restclient and http pingers is that, if golang's internal implementation changed and didn't reuse sockets under certain circumstances, our "pings" wouldn't go through the expected TCP socket, so they wouldn't count as traffic on that connection, and eventually the connection would be closed anyway.
Also, there is the problem of what ping means in an HTTP/1.1 context. In my very first proposal in the PR it was a mere HTTP request to scheme://hostPort/healthz. Whether this operation was allowed, whether we had permissions to perform it, or even whether the endpoint existed were things we didn't really care about. As long as the request was made, and some TCP traffic made it through, the goal was achieved.
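In essence, the HTTP/1.1 "ping" was nothing more than this (sketched with plain net/http; in the real code the client and hostPort come from the kubeconfig):

```go
// Sketch only: an HTTP/1.1 "ping" is just a cheap request whose sole
// purpose is to move some bytes over the (hopefully reused) TCP socket.
// Whether /healthz answers 200, 403 or 404 is irrelevant here.
package sketch

import "net/http"

func ping(client *http.Client, hostPort string) {
	resp, err := client.Get("https://" + hostPort + "/healthz")
	if err != nil {
		return // best effort: errors are ignored
	}
	// Close the body so the connection can go back to the pool and be reused.
	resp.Body.Close()
}
```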
In any case, the solution was not optimal for restclient or for http, and it wasn't possible to properly extract and create a dedicated http2 pinger because of how golang hides the internal implementation of net/http2 behind net/http when HTTP/2 is used through net/http.
And so, while I was thinking about all these problems, and all these things I didn't really like about this possible solution, another possibility was proposed.
The golang patch
There was a proposal for golang to have the runtime perform these pings automatically on http2.
Even though this solution might not be the definitive one (and it only fits http2), I think it's good enough to get the ball rolling in the golang space.
Knowing that golang will automatically send ping frames to the other end at intervals allows us to just configure the intervals and let the golang runtime do the rest. Since this is clearly an application layer problem I think it makes a lot of sense to have the solution in this space.
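For illustration, and assuming the shape this proposal eventually took in golang.org/x/net/http2 (none of these knobs existed when this post was written), configuring the intervals looks roughly like this:

```go
// Sketch only: HTTP/2 keepalive pings configured at the transport level,
// using the knobs the proposal eventually shipped with in
// golang.org/x/net/http2.
package sketch

import (
	"net/http"
	"time"

	"golang.org/x/net/http2"
)

func newPingingClient() (*http.Client, error) {
	tr := &http.Transport{}
	h2, err := http2.ConfigureTransports(tr)
	if err != nil {
		return nil, err
	}
	// If no frame is received for ReadIdleTimeout, send a PING frame;
	// close the connection if no answer arrives within PingTimeout.
	h2.ReadIdleTimeout = 30 * time.Second
	h2.PingTimeout = 15 * time.Second
	return &http.Client{Transport: tr}, nil
}
```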
Conclusion
Depending on how this feature lands in golang, and depending on its design, I might think about this problem again in the Kubernetes space, and check again if it makes sense to implement something like --ping-interval baked into kubectl and client-go.
For now, on to something different.