
DNS intermittent delays of 5s #56903

Closed
mikksoone opened this issue Dec 6, 2017 · 261 comments
Labels
area/dns, kind/bug (Categorizes issue or PR as related to a bug), sig/network (Categorizes an issue or PR as relevant to SIG Network)

Comments

@mikksoone

mikksoone commented Dec 6, 2017

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:
DNS lookup is sometimes taking 5 seconds.

What you expected to happen:
No delays in DNS.

How to reproduce it (as minimally and precisely as possible):

  1. Create a cluster in AWS using kops with cni networking:
kops create cluster     --node-count 3     --zones eu-west-1a,eu-west-1b,eu-west-1c     --master-zones eu-west-1a,eu-west-1b,eu-west-1c     --dns-zone kube.example.com   --node-size t2.medium     --master-size t2.medium  --topology private --networking cni   --cloud-labels "Env=Staging"  ${NAME}
  2. CNI plugin:
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
  3. Run this script in any pod that has curl (a way to start a throwaway pod for this is sketched right after the script):
var=1
while true ; do
  res=$( { curl -o /dev/null -s -w %{time_namelookup}\\n  http://www.google.com; } 2>&1 )
  var=$((var+1))
  if [[ $res =~ ^[1-9] ]]; then
    now=$(date +"%T")
    echo "$var slow: $res $now"
    break
  fi
done
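
If no pod with curl is handy, a throwaway test pod can be started for this; a rough sketch, where the pod name and image are arbitrary placeholders:

# start an interactive pod that is removed on exit
kubectl run -it --rm dns-probe --image=ubuntu:16.04 --restart=Never -- bash
# inside the pod: install curl, then paste the loop above
apt-get update && apt-get install -y curl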

Anything else we need to know?:

  1. I am encountering this issue in both staging and production clusters, but for some reason the staging cluster sees a lot more 5s delays.
  2. Delays happen both for external names (google.com) and for internal ones, such as service.namespace.
  3. Happens on both Kubernetes 1.6 and 1.7, but I did not encounter these issues in 1.5 (though the setup was a bit different - no CNI back then).
  4. I have not tested 1.7 without CNI yet.

Environment:

  • Kubernetes version (use kubectl version):
Client Version: version.Info{Major:"1", Minor:"8", GitVersion:"v1.8.2", GitCommit:"bdaeafa71f6c7c04636251031f93464384d54963", GitTreeState:"clean", BuildDate:"2017-10-24T19:48:57Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.10", GitCommit:"bebdeb749f1fa3da9e1312c4b08e439c404b3136", GitTreeState:"clean", BuildDate:"2017-11-03T16:31:49Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
  • Cloud provider or hardware configuration:
AWS
  • OS (e.g. from /etc/os-release):
PRETTY_NAME="Ubuntu 16.04.3 LTS"
  • Kernel (e.g. uname -a):
Linux ingress-nginx-3882489562-438sm 4.4.65-k8s #1 SMP Tue May 2 15:48:24 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Similar issues

  1. Kube DNS Latency dns#96 - closed but seems to be exactly the same
  2. kube-dns: dnsmasq intermittent connection refused #45976 - has some comments matching this issue, but has taken the direction of fixing the kube-dns up/down scaling problem, and is not about the intermittent failures.

/sig network

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Dec 6, 2017
@k8s-github-robot k8s-github-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Dec 6, 2017
@k8s-ci-robot k8s-ci-robot added the sig/network Categorizes an issue or PR as relevant to SIG Network. label Dec 6, 2017
@k8s-github-robot k8s-github-robot removed the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Dec 6, 2017
@cmluciano

@kubernetes/sig-network-misc

@kgignatyev-inspur

I have a similar issue: consistently slow DNS resolution from pods, 20 seconds plus.
from busybox:
time nslookup google.com
Server: 100.64.0.10
Address 1: 100.64.0.10

Name: google.com
Address 1: 2607:f8b0:400a:806::200e
Address 2: 172.217.3.206 sea15s12-in-f14.1e100.net
real 0m 50.03s
user 0m 0.00s
sys 0m 0.00s
/ #

I just created a 1.8.5 cluster in AWS with kops; the only deviation from the standard config is that I am using CentOS host machines (ami-e535c59d for us-west-2).

Resolution from the hosts is instantaneous; from pods it is consistently slow.

@ani82

ani82 commented Dec 23, 2017

We observe the same on GKE with version v1.8.4-gke0, with both Busybox (latest) and Debian 9:

$ kubectl exec -ti busybox -- time nslookup storage.googleapis.com
Server: 10.39.240.10
Address 1: 10.39.240.10 kube-dns.kube-system.svc.cluster.local

Name: storage.googleapis.com
Address 1: 2607:f8b0:400c:c06::80 vl-in-x80.1e100.net
Address 2: 74.125.141.128 vl-in-f128.1e100.net
real 0m 10.02s
user 0m 0.00s
sys 0m 0.00s

DNS latency varies between 10 and 40s in multiples of 5s.

@thockin
Member

thockin commented Jan 6, 2018

5s pretty much ALWAYS indicates a DNS timeout, meaning some packet got dropped somewhere.

@ani82

ani82 commented Jan 9, 2018

Yes, it seems as if the local DNS servers time out instead of answering:

[root@busybox /]# nslookup google.com
;; connection timed out; trying next origin
;; connection timed out; trying next origin
;; connection timed out; trying next origin
;; connection timed out; trying next origin
;; connection timed out; trying next origin
;; connection timed out; no servers could be reached

[root@busybox /]# tcpdump port 53
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
15:38:10.423547 IP busybox.46239 > kube-dns.kube-system.svc.cluster.local.domain: 51779+ A? google.com.default.svc.cluster.local. (54)
15:38:10.424120 IP busybox.46757 > kube-dns.kube-system.svc.cluster.local.domain: 41018+ PTR? 10.240.39.10.in-addr.arpa. (43)
15:38:10.424595 IP kube-dns.kube-system.svc.cluster.local.domain > busybox.46757: 41018 1/0/0 PTR kube-dns.kube-system.svc.cluster.local. (95)
15:38:15.423611 IP busybox.46239 > kube-dns.kube-system.svc.cluster.local.domain: 51779+ A? google.com.default.svc.cluster.local. (54)
15:38:20.423809 IP busybox.46239 > kube-dns.kube-system.svc.cluster.local.domain: 51779+ A? google.com.default.svc.cluster.local. (54)
15:38:25.424247 IP busybox.44496 > kube-dns.kube-system.svc.cluster.local.domain: 63451+ A? google.com.svc.cluster.local. (46)
15:38:30.424508 IP busybox.39936 > kube-dns.kube-system.svc.cluster.local.domain: 14687+ A? google.com.cluster.local. (42)
15:38:35.424767 IP busybox.56675 > kube-dns.kube-system.svc.cluster.local.domain: 37241+ A? google.com.c.retailcatalyst-187519.internal. (61)
15:38:40.424992 IP busybox.35842 > kube-dns.kube-system.svc.cluster.local.domain: 22668+ A? google.com.google.internal. (44)
15:38:45.425295 IP busybox.52037 > kube-dns.kube-system.svc.cluster.local.domain: 6207+ A? google.com. (28)
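
For context, the query pattern above (the name tried against each search suffix in turn, 5s apart) follows from a typical pod resolv.conf under dnsPolicy: ClusterFirst; a sketch of what it likely looks like here, reconstructed from the capture (exact search domains vary per cluster):

nameserver 10.39.240.10
search default.svc.cluster.local svc.cluster.local cluster.local c.retailcatalyst-187519.internal google.internal
options ndots:5
# glibc waits its default 5s timeout for each unanswered query before retrying or
# moving on to the next search suffix, which is where the 5-second steps come from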

@aguerra

aguerra commented Jan 19, 2018

Just in case someone got here because of DNS delays: in our case it was ARP table overflow on the nodes (arp -n showing more than 1000 entries). Increasing the limits solved the problem.
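
For anyone checking this on their own nodes, the neighbour-table limits can be inspected and raised with sysctl; a sketch with example values (tune them to the cluster size, and persist them in /etc/sysctl.d to survive reboots):

# current number of ARP entries on the node
arp -n | wc -l
# current garbage-collection thresholds for the neighbour table
sysctl net.ipv4.neigh.default.gc_thresh1 net.ipv4.neigh.default.gc_thresh2 net.ipv4.neigh.default.gc_thresh3
# raise the limits (example values)
sysctl -w net.ipv4.neigh.default.gc_thresh1=8192
sysctl -w net.ipv4.neigh.default.gc_thresh2=16384
sysctl -w net.ipv4.neigh.default.gc_thresh3=32768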

@lbrictson

We have the same issue in all of our kops-deployed AWS clusters (5). We tried moving from weave to flannel to rule out the CNI, but the issue is the same. Our kube-dns pods are healthy, one on every host, and they have not crashed recently.

Our ARP tables are nowhere near full (usually fewer than 100 entries).

@bowei
Member

bowei commented Jan 19, 2018

There are QPS limits on DNS in various places. I think in the past people have hit the AWS DNS server QPS limits in some cases; that may be worth checking.

@lbrictson

@bowei sadly this happens in very small clusters as well for us, ones that have so few containers that there is no feasible way we'd be hitting the QPS limit from AWS

@mikksoone
Author

Same here: small clusters, no ARP nor QPS limits.
dnsPolicy: Default works without delays, but unfortunately it cannot be used for all deployments.
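
For anyone trying that workaround, dnsPolicy is set per pod spec; a minimal sketch (name and image are placeholders), keeping in mind the pod then inherits the node's resolv.conf and loses cluster-internal service name resolution:

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  dnsPolicy: Default       # use the node's resolv.conf instead of kube-dns
  containers:
  - name: app
    image: nginx           # placeholder image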

@lbrictson

@mikksoone exact same situation as us then: dnsPolicy: Default fixes the problem entirely, but of course breaks access to services internal to the cluster, which is a no-go for most of ours.

@vasartori
Contributor

vasartori commented Jan 19, 2018

@bowei We have the same problem here.
But we are not using AWS.

@vasartori
Contributor

It seems to be a problem with glibc.
If you set a timeout in your /etc/resolv.conf, that timeout will be respected.

On CoreOS Stable (glibc 2.23) this problem appears.

Setting the timeout to 0 in resolv.conf, you still get a 1 second delay...

I've tried disabling IPv6, without success...

@vasartori
Contributor

In my tests, adding this option to /etc/resolv.conf:

options single-request-reopen

fixed the problem. But I haven't found a "clean" way to set it on pods in Kubernetes 1.8. What I do:

        lifecycle:
          postStart:
            exec:
              command:
              - /bin/sh
              - -c 
              - "/bin/echo 'options single-request-reopen' >> /etc/resolv.conf"

@mikksoone Could you check whether it solves your problem too?
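
For what it's worth, on newer Kubernetes versions (dnsConfig was added as alpha in 1.9 and became beta in 1.10) the same resolv.conf option can be set declaratively instead of through a postStart hook; a sketch, with placeholder name and image:

apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  dnsPolicy: ClusterFirst
  dnsConfig:
    options:
    - name: single-request-reopen   # appended to the pod's /etc/resolv.conf
  containers:
  - name: app
    image: nginx                    # placeholder image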

@aca02djr

Also experiencing this on 1.8.6-gke.0 - @vasartori's suggested solution resolved the issue for us too 👍🏻

@mikksoone
Author

Doesn't solve the issue for me. Even with this option in resolv.conf I get timeouts of 5s, 2.5s and 3.5s - and they happen very often, twice per minute or so.

@lauri-elevant

lauri-elevant commented Feb 5, 2018

We have the same symptoms on 1.8, intermittent DNS resolution stall of 5 seconds. The suggested workaround seems to be effective for us as well. Thank you @vasartori !

@sdtokkolabs

I've been having this issue for some time on Kubernetes 1.7 and 1.8; I was dropping DNS queries from time to time.
Yesterday I upgraded my cluster from 1.8.10 to 1.9.6 (kops from 1.8 to 1.9.0-alpha.3) and I started having this same issue ALL THE TIME. The workaround suggested in this issue has no effect and I can't find any way of stopping it. I've made a small workaround by mapping the most requested (and problematic) DNS names to fixed IPs in /etc/hosts.
Any idea where the real problem is?
I'll test with a brand new cluster on the same versions and report back.

@xiaoxubeii
Member

Same problem, but the strangest thing is that it appears only on some nodes.

@rajatjindal
Contributor

@thockin @bowei

Requesting your feedback, therefore tagging you.

Could this be of interest here: https://tech.xing.com/a-reason-for-unexplained-connection-timeouts-on-kubernetes-docker-abd041cf7e02

There are multiple issues reported for this problem in the Kubernetes project, and it would be great to have it resolved for everyone.
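
For what it's worth, the conntrack race described in that article shows up as insert_failed in the node's connection-tracking statistics; one way to check for it on a node (assuming conntrack-tools is installed):

# per-CPU conntrack counters; steadily growing insert_failed values point to the
# race on parallel UDP DNS queries (A + AAAA from the same socket)
conntrack -S
# the same counters are also exposed (in hex) in /proc/net/stat/nf_conntrack
cat /proc/net/stat/nf_conntrack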

@sdtokkolabs

Tried with several versions of Kubernetes on fresh clusters; all have the same problem to some degree: DNS lookups get lost on the way and retries have to be made. I've also tested kubenet, flannel, canal and weave as network providers, with the lowest incidence on flannel. I've also tried overloading the nodes and splitting the nodes (DNS on its own machine), but it made no difference. On my production cluster the incidence of this issue is way higher than on a brand-new cluster and I can't find a way to isolate the problem :(

@debMan

debMan commented Jan 3, 2021

We faced the same issue on a small self-managed cluster.
The problem was solved by scaling the CoreDNS pods down to 1 replica (see the command sketched after the cluster info).
This is a strange and unexpected solution, but it has solved the problem.

Cluster info:

nodes arch/OS:  amd64/debian
master nodes:   1
worker nodes:   6
deployments:    100
pods:           150
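
A possible command for that, assuming CoreDNS runs as the usual coredns Deployment in kube-system:

kubectl -n kube-system scale deployment coredns --replicas=1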

@Mainintowhile

Mainintowhile commented Nov 17, 2021

Similar to this issue: the first several times I got the right response, and then I got Address: 127.0.0.1 every time.

My solution (version: 1.19.0):
1. Use the upstream CoreDNS template (sed -f transforms2sed.sed coredns.yaml.base > coredns.yaml)
2. Simplify the config, for example:
.:53 {
    errors
    health {
        lameduck 5s
    }
    ready
    kubernetes cluster.local {
        ttl 0
    }
    forward . /etc/resolv.conf
    loop
    reload
}

PS: I suspect the cache-related part has a bug.
Note that the IP settings should stay consistent, e.g. 10.244.0.1/16 (while the node is perhaps 10.244.0.1/24).
For comparison:
1. Keeping etcd and CoreDNS together resolves perfectly.
2. Only simplifying the config is not stable.

@onedr0p

onedr0p commented May 10, 2023

Alpine 3.18 with included musl 1.2.4 seemed to have finally fixed this issue

https://www.alpinelinux.org/posts/Alpine-3.18.0-released.html

@tomwidmer

tomwidmer commented Oct 25, 2023

I upgraded a container image to Alpine 3.18 (the container is running on AWS EKS Fargate), but I'm still getting the issue intermittently: a 5s delay on some DNS requests and occasional failures with getaddrinfo EAI_AGAIN from Node.js. This happens when resolving local service addresses in the same namespace using the short unqualified name service-name.

@jmjoy

jmjoy commented Apr 18, 2024

I want to know: which version of the Linux kernel has completely solved this problem?

@haorenfsa

haorenfsa commented Apr 19, 2024

Alpine 3.18 with included musl 1.2.4 seemed to have finally fixed this issue

https://www.alpinelinux.org/posts/Alpine-3.18.0-released.html

@jmjoy at least we know it's fixed on Linux kernel 6.1, since Alpine 3.18 is using 6.1.

Sorry - it's a problem of libc rather than the Linux kernel. Alpine 3.18 uses musl libc v1.2.4, which fixes this issue. Also, Alpine 3.18 uses kernel 6.1.

Different Linux distributions may use different versions; most releases use glibc instead of musl libc.

So I assume what you really want to know is which version of glibc fixes this issue. I guess none yet?

There's also a kernel-side fix: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ed07d9a021df6da53456663a76999189badc432a

@jmjoy

jmjoy commented Apr 19, 2024

Alpine 3.18 with included musl 1.2.4 seemed to have finally fixed this issue
https://www.alpinelinux.org/posts/Alpine-3.18.0-released.html

@jmjoy at least we know it's fixed on Linux kernel 6.1, since Alpine 3.18 is using 6.1.

This seems to be just fixed by musl, not the kernel.

@haorenfsa

@jmjoy I think the root cause in conntrack is fixed by this patch, which was first included in v4.19-rc1, so kernel 4.19 or a later version.

@jmjoy

jmjoy commented Apr 19, 2024

@jmjoy I think the root cause in conntrack is fixed by this patch, which was first included in v4.19-rc1, so kernel 4.19 or a later version.

I have read the complete issue and it should have been fully fixed in kernel 5.0 (#56903 (comment)).
