DNS intermittent delays of 5s #56903
Comments
@kubernetes/sig-network-misc
I have a similar issue: consistently slow DNS resolution from pods, 20 seconds plus for google.com. I just created a 1.8.5 cluster in AWS with kops, and the only deviation from the standard config is that I am using CentOS host machines (ami-e535c59d for us-west-2). Resolution from hosts is instantaneous; from pods it is consistently slow.
We observe the same on GKE with version v1.8.4-gke0, with both busybox (latest) and Debian 9: `$ kubectl exec -ti busybox -- time nslookup storage.googleapis.com`. DNS latency varies between 10 and 40s, in multiples of 5s.
5s is pretty much ALWAYS indicating a DNS timeout, meaning some packet got dropped somewhere.
Yes, it seems as if the local DNS servers time out instead of answering: `[root@busybox /]# nslookup google.com` while watching `tcpdump port 53`.
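To make those stalls visible without tcpdump, a simple timing loop can be run inside an affected pod. This is a sketch, not from the thread; `time_lookups` is a hypothetical helper, and it assumes GNU `date` (for `%N`) and `getent` are available in the container:

```shell
# Hypothetical diagnostic helper (not from the thread): time repeated
# lookups of a name and print each duration in milliseconds, so 5s
# retry stalls stand out as ~5000 ms outliers.
time_lookups() {
  name=$1
  count=$2
  i=0
  while [ "$i" -lt "$count" ]; do
    start=$(date +%s%N)              # nanoseconds since epoch (GNU date)
    getent hosts "$name" > /dev/null
    end=$(date +%s%N)
    echo "lookup $i: $(( (end - start) / 1000000 )) ms"
    i=$((i + 1))
  done
}

time_lookups localhost 5
```

Inside an affected pod you would point it at the name that stalls, e.g. `time_lookups storage.googleapis.com 20`; latencies clustering near multiples of ~5000 ms match the retry behaviour described above.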
Just in case someone got here because of DNS delays: in our case it was ARP table overflow on the nodes (`arp -n` showing more than 1000 entries). Increasing the limits solved the problem.
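For the ARP-overflow case, the neighbour-table limits live in sysctl. The comment doesn't give the exact values used, so the numbers below are purely illustrative; a sketch of what "increasing the limits" can look like on a node (requires root):

```shell
# Check current neighbour table size against the kernel limits (on the node).
arp -n | wc -l
sysctl net.ipv4.neigh.default.gc_thresh1 \
       net.ipv4.neigh.default.gc_thresh2 \
       net.ipv4.neigh.default.gc_thresh3

# Illustrative example only: raise the thresholds. Persist the settings
# in /etc/sysctl.d/ so they survive reboots.
sysctl -w net.ipv4.neigh.default.gc_thresh1=2048
sysctl -w net.ipv4.neigh.default.gc_thresh2=4096
sysctl -w net.ipv4.neigh.default.gc_thresh3=8192
```

Once the entry count approaches `gc_thresh3`, new neighbour entries (and hence new flows, including DNS) can fail until garbage collection frees slots, which matches the intermittent symptom.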
We have the same issue in all of our kops-deployed AWS clusters (5). We tried moving from Weave to Flannel to rule out the CNI, but the issue is the same. Our kube-dns pods are healthy, one on every host, and they have not crashed recently. Our ARP tables are nowhere near full (less than 100 entries usually).
There are QPS limits on DNS at various places. I think in the past, people have hit AWS DNS server QPS limits in some cases, that may be worth checking. |
@bowei sadly this happens in very small clusters as well for us, ones that have so few containers that there is no feasible way we'd be hitting the QPS limit from AWS |
Same here, small clusters, no arp nor QPS limits. |
@mikksoone exact same situation as us then. dnsPolicy: Default fixes the problem entirely, but of course breaks access to services internal to the cluster, which is a no-go for most of our workloads.
@bowei We have the same problem here. |
It seems to be a problem with glibc. On CoreOS Stable (glibc 2.23) this problem appears. Setting the timeout to 0 in resolv.conf still gets you a 1 second delay. I've tried disabling IPv6, without success.
In my tests, using this option in /etc/resolv.conf fixed the problem.
@mikksoone Could you try whether it solves your problem too?
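The comment above does not quote the exact option. For illustration only, glibc's resolver documents several options in resolv.conf(5) that affect this behaviour: `single-request` makes glibc send the A and AAAA queries sequentially instead of in parallel, and `single-request-reopen` makes it reopen the socket when only one of the two parallel replies arrives. A sketch of such a file (the nameserver address is a placeholder, and these options are examples, not necessarily the one the commenter used):

```
# /etc/resolv.conf inside the pod (illustrative values)
nameserver 10.96.0.10            # placeholder cluster DNS address
options single-request-reopen    # glibc: reopen socket if a parallel reply is lost
options timeout:1 attempts:3     # fail over faster than the 5s default timeout
```

In Kubernetes these options can also be injected per pod via `dnsConfig.options` in the pod spec rather than by editing the image's resolv.conf.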
Also experiencing this on 1.8.6-gke.0 — @vasartori's suggested solution resolved the issue for us too 👍🏻
Doesn't solve the issue for me. Even with this option in resolv.conf I get timeouts of 5s, 2.5s and 3.5s - and they happen very often, twice per minute or so. |
We have the same symptoms on 1.8, intermittent DNS resolution stall of 5 seconds. The suggested workaround seems to be effective for us as well. Thank you @vasartori ! |
I've been having this issue for some time on Kubernetes 1.7 and 1.8. DNS queries were being dropped from time to time.
Same problem, but the strangest thing is that it appears only on some nodes.
Requesting your feedback, therefore tagging you. Could this be of interest here: https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02 There are multiple issues reported for this in the Kubernetes project, and it would be great to have it resolved for everyone.
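The race described in that article leaves a fingerprint in conntrack statistics: the `insert_failed` counter rises on affected nodes. A small sketch that totals the counter across CPUs from `conntrack -S`-style output; the `sum_insert_failed` helper is hypothetical (mine, not from the thread), and it assumes the conntrack-tools CLI is installed on the node:

```shell
# Hypothetical helper: sum insert_failed counters across CPUs from
# `conntrack -S`-style output. A steadily rising total on a node is
# the usual signature of the conntrack DNAT insertion race.
sum_insert_failed() {
  grep -o 'insert_failed=[0-9]*' | cut -d= -f2 \
    | awk '{ s += $1 } END { print s + 0 }'
}

# Captured sample output for demonstration; on a real node run:
#   conntrack -S | sum_insert_failed
printf 'cpu=0 found=10 invalid=3 insert_failed=20 drop=20\ncpu=1 found=5 invalid=1 insert_failed=7 drop=7\n' \
  | sum_insert_failed
# prints 27
```

Sampling this total a minute apart and comparing the two values shows whether inserts are currently failing, which correlates with the 5s stalls reported above.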
Tried with several versions of Kubernetes on fresh clusters; all have the same problem to some degree: DNS lookups get lost on the way and retries have to be made. I've also tested kubenet, Flannel, Canal and Weave as network providers, with the lowest incidence on Flannel. I've also tried overloading the nodes and splitting the nodes (DNS on its own machine), but it made no difference. On my production cluster the incidence of this issue is way higher than on a brand-new cluster, and I can't find a way to isolate the problem :(
We faced the same issue on a small self-managed cluster. Cluster info:
Similar to this issue. My solution is: PS: I suspect the cache-related part has a bug.
Alpine 3.18 with included musl 1.2.4 seemed to have finally fixed this issue https://www.alpinelinux.org/posts/Alpine-3.18.0-released.html |
I upgraded a container image to Alpine 3.18 (the container is running on AWS EKS Fargate), and I'm still getting the issue intermittently: a 5s delay on some DNS requests and occasional failures with
I want to know: which version of the Linux kernel has completely solved this problem?
@jmjoy Sorry, it's a problem of libc rather than the Linux kernel. Alpine 3.18 uses musl libc v1.2.4, which fixes this issue (Alpine 3.18 also uses kernel 6.1). Different Linux distributions use different versions, and most use glibc instead of musl libc. So I assume what you really want to know is which version of glibc fixes this issue. There's a fix at https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ed07d9a021df6da53456663a76999189badc432a
This seems to be just fixed by musl, not the kernel. |
@jm I think the root cause is in conntrack; it's fixed by this patch, which was first included in v4.19-rc1, so kernel 4.19 or a later version.
I have read the complete issue and it should have been fully fixed in 5.0. (#56903 (comment)). |
Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug
What happened:
DNS lookup is sometimes taking 5 seconds.
What you expected to happen:
No delays in DNS.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
Environment:
- Kubernetes version (use `kubectl version`):
- Kernel (e.g. `uname -a`):

Similar issues:
/sig network