
DNS intermittent delays of 5s #56903

Closed
mikksoone opened this issue Dec 6, 2017 · 261 comments
Labels
area/dns, kind/bug (Categorizes issue or PR as related to a bug), sig/network (Categorizes an issue or PR as relevant to SIG Network)

Comments

@mikksoone

mikksoone commented Dec 6, 2017

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:
DNS lookup is sometimes taking 5 seconds.

What you expected to happen:
No delays in DNS.

How to reproduce it (as minimally and precisely as possible):

  1. Create a cluster in AWS using kops with cni networking:
kops create cluster     --node-count 3     --zones eu-west-1a,eu-west-1b,eu-west-1c     --master-zones eu-west-1a,eu-west-1b,eu-west-1c     --dns-zone kube.example.com   --node-size t2.medium     --master-size t2.medium  --topology private --networking cni   --cloud-labels "Env=Staging"  ${NAME}
  2. CNI plugin:
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
  3. Run this script in any pod that has curl (a way to start a throwaway pod for this is sketched right after the script):
var=1
while true ; do
  res=$( { curl -o /dev/null -s -w %{time_namelookup}\\n  http://www.google.com; } 2>&1 )
  var=$((var+1))
  if [[ $res =~ ^[1-9] ]]; then
    now=$(date +"%T")
    echo "$var slow: $res $now"
    break
  fi
done
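
If no pod with curl is handy, a throwaway test pod can be started for this; a rough sketch, where the pod name and image are arbitrary placeholders:

# start an interactive pod that is removed on exit
kubectl run -it --rm dns-probe --image=ubuntu:16.04 --restart=Never -- bash
# inside the pod: install curl, then paste the loop above
apt-get update && apt-get install -y curl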

Anything else we need to know?:

  1. I am encountering this issue in both staging and production clusters, but for some reason the staging cluster sees a lot more 5s delays.
  2. Delays happen both for external names (google.com) and for internal ones, such as service.namespace.
  3. Happens on both Kubernetes 1.6 and 1.7, but I did not encounter these issues in 1.5 (though the setup was a bit different - no CNI back then).
  4. I have not tested 1.7 without CNI yet.

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.2", GitCommit:"bdaeafa71f6c7c04636251031f93464384d54963", GitTreeState:"clean", BuildDate:"2017-10-24T19:48:57Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.10", GitCommit:"bebdeb749f1fa3da9e1312c4b08e439c404b3136", GitTreeState:"clean", BuildDate:"2017-11-03T16:31:49Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
AWS
  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Ubuntu 16.04.3 LTS"
  • Kernel (e.g. uname -a):
Linux ingress-nginx-3882489562-438sm 4.4.65-k8s #1 SMP Tue May 2 15:48:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Similar issues

  1. Kube DNS Latency dns#96 - closed but seems to be exactly the same
  2. kube-dns: dnsmasq intermittent connection refused #45976 - has some comments matching this issue, but has taken the direction of fixing the kube-dns up/down scaling problem, and is not about the intermittent failures.

/sig network

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 6, 2017
@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Dec 6, 2017
@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Dec 6, 2017
@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Dec 6, 2017
@cmluciano

@kubernetes/sig-network-misc

@kgignatyev-inspur

I have a similar issue: consistently slow DNS resolution from pods, 20 seconds plus.
from busybox:
time nslookup google.com
Server: 100.64.0.10
Address 1: 100.64.0.10

Name: google.com
Address 1: 2607:f8b0:400a:806::200e
Address 2: 172.217.3.206 sea15s12-in-f14.1e100.net
real 0m 50.03s
user 0m 0.00s
sys 0m 0.00s
/ #

I just created a 1.8.5 cluster in AWS with kops; the only deviation from the standard config is that I am using CentOS host machines (ami-e535c59d for us-west-2).

Resolution from the hosts is instantaneous; from pods it is consistently slow.

@ani82

ani82 commented Dec 23, 2017

We observe the same on GKE with version v1.8.4-gke0, with both Busybox (latest) and Debian 9:

$ kubectl exec -ti busybox -- time nslookup storage.googleapis.com
Server: 10.39.240.10
Address 1: 10.39.240.10 kube-dns.kube-system.svc.cluster.local

Name: storage.googleapis.com
Address 1: 2607:f8b0:400c:c06::80 vl-in-x80.1e100.net
Address 2: 74.125.141.128 vl-in-f128.1e100.net
real 0m 10.02s
user 0m 0.00s
sys 0m 0.00s

DNS latency varies between 10 and 40s in multiples of 5s.

@thockin
Member

thockin commented Jan 6, 2018

5s pretty much ALWAYS indicates a DNS timeout, meaning some packet got dropped somewhere.

@ani82

ani82 commented Jan 9, 2018

Yes, it seems as if the local DNS servers time out instead of answering:

[root@busybox /]# nslookup google.com
;; connection timed out; trying next origin
;; connection timed out; trying next origin
;; connection timed out; trying next origin
;; connection timed out; trying next origin
;; connection timed out; trying next origin
;; connection timed out; no servers could be reached

[root@busybox /]# tcpdump port 53
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:38:10.423547 IP busybox.46239 > kube-dns.kube-system.svc.cluster.local.domain: 51779+ A? google.com.default.svc.cluster.local. (54)
15:38:10.424120 IP busybox.46757 > kube-dns.kube-system.svc.cluster.local.domain: 41018+ PTR? 10.240.39.10.in-addr.arpa. (43)
15:38:10.424595 IP kube-dns.kube-system.svc.cluster.local.domain > busybox.46757: 41018 1/0/0 PTR kube-dns.kube-system.svc.cluster.local. (95)
15:38:15.423611 IP busybox.46239 > kube-dns.kube-system.svc.cluster.local.domain: 51779+ A? google.com.default.svc.cluster.local. (54)
15:38:20.423809 IP busybox.46239 > kube-dns.kube-system.svc.cluster.local.domain: 51779+ A? google.com.default.svc.cluster.local. (54)
15:38:25.424247 IP busybox.44496 > kube-dns.kube-system.svc.cluster.local.domain: 63451+ A? google.com.svc.cluster.local. (46)
15:38:30.424508 IP busybox.39936 > kube-dns.kube-system.svc.cluster.local.domain: 14687+ A? google.com.cluster.local. (42)
15:38:35.424767 IP busybox.56675 > kube-dns.kube-system.svc.cluster.local.domain: 37241+ A? google.com.c.retailcatalyst-187519.internal. (61)
15:38:40.424992 IP busybox.35842 > kube-dns.kube-system.svc.cluster.local.domain: 22668+ A? google.com.google.internal. (44)
15:38:45.425295 IP busybox.52037 > kube-dns.kube-system.svc.cluster.local.domain: 6207+ A? google.com. (28)
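
For context, the query pattern above (the name tried against each search suffix in turn, 5s apart) follows from a typical pod resolv.conf under dnsPolicy: ClusterFirst; a sketch of what it likely looks like here, reconstructed from the capture (exact search domains vary per cluster):

nameserver 10.39.240.10
search default.svc.cluster.local svc.cluster.local cluster.local c.retailcatalyst-187519.internal google.internal
options ndots:5
# glibc waits its default 5s timeout for each unanswered query before retrying or
# moving on to the next search suffix, which is where the 5-second steps come from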

@aguerra

aguerra commented Jan 19, 2018

Just in case someone got here because of DNS delays: in our case it was ARP table overflow on the nodes (arp -n showing more than 1000 entries). Increasing the limits solved the problem.
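
For anyone checking this on their own nodes, the neighbour-table limits can be inspected and raised with sysctl; a sketch with example values (tune them to the cluster size, and persist them in /etc/sysctl.d to survive reboots):

# current number of ARP entries on the node
arp -n | wc -l
# current garbage-collection thresholds for the neighbour table
sysctl net.ipv4.neigh.default.gc_thresh1 net.ipv4.neigh.default.gc_thresh2 net.ipv4.neigh.default.gc_thresh3
# raise the limits (example values)
sysctl -w net.ipv4.neigh.default.gc_thresh1=8192
sysctl -w net.ipv4.neigh.default.gc_thresh2=16384
sysctl -w net.ipv4.neigh.default.gc_thresh3=32768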

@lbrictson

We have the same issue in all of our kops-deployed AWS clusters (5). We tried moving from weave to flannel to rule out the CNI, but the issue is the same. Our kube-dns pods are healthy, one on every host, and they have not crashed recently.

Our ARP tables are nowhere near full (usually fewer than 100 entries).

@bowei
Member

bowei commented Jan 19, 2018

There are QPS limits on DNS in various places. I think in the past people have hit the AWS DNS server QPS limits in some cases; that may be worth checking.

@lbrictson

@bowei sadly this happens in very small clusters as well for us, ones that have so few containers that there is no feasible way we'd be hitting the QPS limit from AWS

@mikksoone
Author

Same here: small clusters, no ARP nor QPS limits.
dnsPolicy: Default works without delays, but unfortunately it cannot be used for all deployments.
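
For anyone trying that workaround, dnsPolicy is set per pod spec; a minimal sketch (name and image are placeholders), keeping in mind the pod then inherits the node's resolv.conf and loses cluster-internal service name resolution:

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  dnsPolicy: Default       # use the node's resolv.conf instead of kube-dns
  containers:
  - name: app
    image: nginx           # placeholder image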

@lbrictson

@mikksoone exact same situation as us then: dnsPolicy: Default fixes the problem entirely, but of course breaks access to services internal to the cluster, which is a no-go for most of ours.

@vasartori
Contributor

vasartori commented Jan 19, 2018

@bowei We have the same problem here.
But we are not using AWS.

@vasartori
Contributor

It seems to be a problem with glibc.
If you set a timeout in your /etc/resolv.conf, that timeout will be respected.

On CoreOS Stable (glibc 2.23) this problem appears.

Setting the timeout to 0 in resolv.conf, you still get a 1 second delay...

I've tried disabling IPv6, without success...

@vasartori
Contributor

In my tests, adding this option to /etc/resolv.conf:

options single-request-reopen

fixed the problem. But I haven't found a "clean" way to set it on pods in Kubernetes 1.8. What I do:

        lifecycle:
          postStart:
            exec:
              command:
              - /bin/sh
              - -c 
              - "/bin/echo 'options single-request-reopen' >> /etc/resolv.conf"

@mikksoone Could you check whether it solves your problem too?
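
For what it's worth, on newer Kubernetes versions (dnsConfig was added as alpha in 1.9 and became beta in 1.10) the same resolv.conf option can be set declaratively instead of through a postStart hook; a sketch, with placeholder name and image:

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  dnsPolicy: ClusterFirst
  dnsConfig:
    options:
    - name: single-request-reopen   # appended to the pod's /etc/resolv.conf
  containers:
  - name: app
    image: nginx                    # placeholder image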

@aca02djr

Also experiencing this on 1.8.6-gke.0 - @vasartori's suggested solution resolved the issue for us too 👍🏻

@mikksoone
Author

Doesn't solve the issue for me. Even with this option in resolv.conf I get timeouts of 5s, 2.5s and 3.5s - and they happen very often, twice per minute or so.

@lauri-elevant

lauri-elevant commented Feb 5, 2018

We have the same symptoms on 1.8, intermittent DNS resolution stall of 5 seconds. The suggested workaround seems to be effective for us as well. Thank you @vasartori !

@sdtokkolabs

I've been having this issue for some time on Kubernetes 1.7 and 1.8; I was dropping DNS queries from time to time.
Yesterday I upgraded my cluster from 1.8.10 to 1.9.6 (kops from 1.8 to 1.9.0-alpha.3) and I started having this same issue ALL THE TIME. The workaround suggested in this issue has no effect and I can't find any way of stopping it. I've made a small workaround by mapping the most requested (and problematic) DNS names to fixed IPs in /etc/hosts.
Any idea where the real problem is?
I'll test with a brand new cluster on the same versions and report back.

@xiaoxubeii
Member

Same problem, but the strangest thing is that it appears only on some nodes.

@rajatjindal
Contributor

@thockin @bowei

Requesting your feedback, therefore tagging you.

Could this be of interest here: https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02

There are multiple issues reported for this problem in the Kubernetes project, and it would be great to have it resolved for everyone.
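
For what it's worth, the conntrack race described in that article shows up as insert_failed in the node's connection-tracking statistics; one way to check for it on a node (assuming conntrack-tools is installed):

# per-CPU conntrack counters; steadily growing insert_failed values point to the
# race on parallel UDP DNS queries (A + AAAA from the same socket)
conntrack -S
# the same counters are also exposed (in hex) in /proc/net/stat/nf_conntrack
cat /proc/net/stat/nf_conntrack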

@sdtokkolabs

Tried with several versions of Kubernetes on fresh clusters; all have the same problem to some degree: DNS lookups get lost on the way and retries have to be made. I've also tested kubenet, flannel, canal and weave as network providers, with the lowest incidence on flannel. I've also tried overloading the nodes and splitting the nodes (DNS on its own machine), but it made no difference. On my production cluster the incidence of this issue is way higher than on a brand-new cluster and I can't find a way to isolate the problem :(

@debMan

debMan commented Jan 3, 2021

We faced the same issue on a small self-managed cluster.
The problem was solved by scaling the CoreDNS pods down to 1 replica (see the command sketched after the cluster info).
This is a strange and unexpected solution, but it has solved the problem.

Cluster info:

nodes arch/OS:  amd64/debian
master nodes:   1
worker nodes:   6
deployments:    100
pods:           150
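
A possible command for that, assuming CoreDNS runs as the usual coredns Deployment in kube-system:

kubectl -n kube-system scale deployment coredns --replicas=1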

@Mainintowhile

Mainintowhile commented Nov 17, 2021

Similar to this issue: the first several times I got the right response, and then I got Address: 127.0.0.1 every time.

My solution (version: 1.19.0):
1. Use the upstream CoreDNS template (sed -f transforms2sed.sed coredns.yaml.base > coredns.yaml)
2. Simplify the config, for example:
.:53 {
    errors
    health {
        lameduck 5s
    }
    ready
    kubernetes cluster.local {
        ttl 0
    }
    forward . /etc/resolv.conf
    loop
    reload
}

PS: I suspect the cache-related part has a bug.
Note that the IP settings should stay consistent, e.g. 10.244.0.1/16 (while the node is perhaps 10.244.0.1/24).
For comparison:
1. Keeping etcd and CoreDNS together resolves perfectly.
2. Only simplifying the config is not stable.

@onedr0p

onedr0p commented May 10, 2023

Alpine 3.18 with included musl 1.2.4 seemed to have finally fixed this issue

https://www.alpinelinux.org/posts/Alpine-3.18.0-released.html

@tomwidmer

tomwidmer commented Oct 25, 2023

I upgraded a container image to Alpine 3.18 (the container is running on AWS EKS Fargate), but I'm still getting the issue intermittently: a 5s delay on some DNS requests and occasional failures with getaddrinfo EAI_AGAIN from Node.js. This happens when resolving local service addresses in the same namespace using the short unqualified name service-name.

@jmjoy

jmjoy commented Apr 18, 2024

I want to know: which version of the Linux kernel has completely solved this problem?

@haorenfsa

haorenfsa commented Apr 19, 2024

Alpine 3.18 with included musl 1.2.4 seemed to have finally fixed this issue

https://www.alpinelinux.org/posts/Alpine-3.18.0-released.html

@jmjoy at least we know it's fixed on Linux kernel 6.1, since Alpine 3.18 is using 6.1.

Sorry - it's a problem of libc rather than the Linux kernel. Alpine 3.18 uses musl libc v1.2.4, which fixes this issue. Also, Alpine 3.18 uses kernel 6.1.

Different Linux distributions may use different versions; most releases use glibc instead of musl libc.

So I assume what you really want to know is which version of glibc fixes this issue. I guess none yet?

There's also a kernel-side fix: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ed07d9a021df6da53456663a76999189badc432a

@jmjoy

jmjoy commented Apr 19, 2024

Alpine 3.18 with included musl 1.2.4 seemed to have finally fixed this issue
https://www.alpinelinux.org/posts/Alpine-3.18.0-released.html

@jmjoy at least we know it's fixed on Linux kernel 6.1, since Alpine 3.18 is using 6.1.

This seems to be just fixed by musl, not the kernel.

@haorenfsa

@jmjoy I think the root cause in conntrack is fixed by this patch, which was first included in v4.19-rc1, so kernel 4.19 or a later version.

@jmjoy

jmjoy commented Apr 19, 2024

@jmjoy I think the root cause in conntrack is fixed by this patch, which was first included in v4.19-rc1, so kernel 4.19 or a later version.

I have read the complete issue and it should have been fully fixed in kernel 5.0 (#56903 (comment)).
