Kube DNS Latency #96
5s is clearly a timeout: is kube-dns crashing? Do you have enough replicas?
On Wed, May 24, 2017 at 10:21 PM, Alok Kumar Singh wrote:
We have dns pods running in our cluster (cluster details below)
Issue is every 2-3 requests out of 5 is having a latency of 5 seconds
because of the dns.
$ time curl http://myservice.central:8080/status
{
"host": "myservice-3af719a-805113283-x35p1",
"status": "OK"
}
real 0m5.523s
user 0m0.004s
sys 0m0.000s
$ time curl http://myservice.central:8080/status
{
"host": "myservice-3af719a-805113283-x35p1",
"status": "OK"
}
real 0m0.013s
user 0m0.000s
sys 0m0.004s
*Cluster details*: We are running Kubernetes latest version 1.6.4,
installed using kops. It is a multi-AZ cluster in AWS.
Below are the kube dns details
- kubedns: gcr.io/google_containers/k8s-dns-kube-dns-amd64:1.14.1
- dnsmasq: gcr.io/google_containers/k8s-dns-dnsmasq-nanny-amd64:1.14.1
- sidecar: gcr.io/google_containers/k8s-dns-sidecar-amd64:1.14.1
Our kube-dns pods run with the below resource requests:
- cpu: 200m
- memory: 70Mi
Please let us know what the issue might be and how to fix it.
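A side note on the 5 s figure: it matches the glibc resolver's default per-query timeout (`options timeout:5` in resolv.conf), so a single dropped UDP reply stalls the lookup for exactly that long. The fragment below is an illustrative mitigation, not something proposed in this thread; the nameserver IP is the cluster DNS address reported later in the thread, so adjust it for your cluster:

```
# /etc/resolv.conf (illustrative fragment)
nameserver 100.64.0.10
search central.svc.cluster.local svc.cluster.local cluster.local
options timeout:1 attempts:2 single-request-reopen
```

`single-request-reopen` tells glibc to close and reopen its socket before retrying when one of the parallel A/AAAA replies goes missing, which works around a known UDP conntrack race; `timeout:1` caps the per-attempt stall at one second.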
@thockin No restarts in the kube-dns pods. We are running 10 t2.large nodes and each node has a DNS pod running. This is a new cluster with very little load on it.
@bowei :)
@bowei We are running the same versions of the cluster and DNS in another environment, but it is not multi-AZ, and there are no such lags there. We also tried hitting the cluster IP directly, and it works without lag (in both clusters). The issue is with DNS resolution in this cluster only. Can you help debug the problem?
@SleepyBrett: are you on AWS as well?
I am on AWS as well. The graphs above are from a multi-AZ cluster; however, we are seeing the same thing on a single-AZ cluster.
AZ = Azure?
Availability zone. I'm on Slack if you want to chat.
Is this maybe related to #76?
@SleepyBrett I meant to say not highly available, as that cluster is in Singapore (AWS) with only two zones.
@bowei @SleepyBrett Observation: this issue happens in the morning, when the staging cluster gets a lot of load as developers create many services, deployments, and pods, and the DNS latency kicks in. At night, when those resources get deleted, it is OK again. Is the latency because it has to scan many record entries in the morning?
What is the number of services etc.? You can also try setting the logging level to --v=0 for the kube-dns container in the kube-dns pod, as that is impacted by the number of services as well.
@bowei it's happening in our prod cluster now as well...
After a few hundred requests, one request shows a huge latency of around 5 seconds.
@bowei @thockin Skipping the Kubernetes DNS service completely resolved our issue. We changed our
instead of
So there are two things -
request
tcpdump output of the kube-dns nameserver
I have been looking into this also. This is from an alpine:3.6 container (after obtaining curl
so the [AAAA?] (IPv6) lookup seems to be causing the timing latency. Edit: for reference on @curl-format.txt: https://blog.josephscott.org/2011/10/14/timing-details-with-curl/
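A quick way to confirm whether the AAAA leg is the slow one is to time each address family separately. This is a minimal sketch, not from the thread; `myservice.central` is the service name used earlier, so substitute your own:

```python
import socket
import time

def timed_lookup(host, family):
    """Time a single getaddrinfo() call for one address family.

    Returns (elapsed_seconds, result), where result is either the
    getaddrinfo address list or the gaierror raised on failure.
    """
    start = time.monotonic()
    try:
        result = socket.getaddrinfo(host, 80, family, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        result = exc
    return time.monotonic() - start, result

# Hypothetical usage against the service from this thread.
for label, family in (("A (IPv4)", socket.AF_INET), ("AAAA (IPv6)", socket.AF_INET6)):
    elapsed, _ = timed_lookup("myservice.central", family)
    print(f"{label:12s} lookup took {elapsed:.3f}s")
```

If only the AF_INET6 call takes ~5 s, the AAAA query (or its dropped reply) is the culprit.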
@andrewgdavis: is it possible to post (in a gist) the output of (for each of the executions of
Digging a bit further: this behavior does not happen with a busybox container.
Tried with alpine:3.6 and 3.5 (and added an /etc/nsswitch.conf setting of
Am I reading your test right in that it returns NXDOMAIN if mDNS does not respond with a name? What happens if you remove
By default there isn't anything in /etc/nsswitch.conf in the alpine:{3.6,3.5} containers... is there some other configuration you want me to try?
Narrowing this down a bit more... by modifying the ConfigMap from:
to a single corporate DNS nameserver, Alpine works fine. Still investigating, but it seems like at least a valid workaround...
Does 10.0.2.3 have different entries than the public DNS servers? All servers in the
From http://www.thekelleys.org.uk/dnsmasq/docs/dnsmasq-man.html:
@andrewgdavis in which ConfigMap are you defining these nameservers? I also want to use a corporate nameserver for a particular domain (say
http://blog.kubernetes.io/2017/04/configuring-private-dns-zones-upstream-nameservers-kubernetes.html
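The post linked above describes per-domain upstream resolvers via the kube-dns ConfigMap (supported by kube-dns 1.14+, which this cluster is running). A sketch of what that looks like; the corporate zone `corp.example.com` and the 10.0.2.3 resolver are placeholders standing in for your own:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  # Queries for *.corp.example.com go to the corporate resolver (placeholder IP).
  stubDomains: |
    {"corp.example.com": ["10.0.2.3"]}
  # Everything else not in cluster.local goes to these upstreams
  # instead of the node's /etc/resolv.conf.
  upstreamNameservers: |
    ["8.8.8.8", "8.8.4.4"]
```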
@bowei The DNS lookup issue came back... even after skipping kube-dns completely by changing resolv.conf to point to our AWS-managed DNS service (10.0.0.2). But the number of occurrences has reduced.
If you are not using kube-dns, then it seems like there is an issue with your 10.0.0.2 DNS service. What DNS service/server are you using?
@bowei The numbers have reduced significantly; I think because the number of lookups has dropped, as there is no search-path expansion happening for
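The reduction makes sense given how the resolver's search list works: with the typical in-cluster resolv.conf (ndots:5 plus several cluster search domains), a short name like `myservice.central` is expanded into multiple candidate queries, each of which may also be tried for both A and AAAA. A rough sketch of the expansion logic, with illustrative search domains (real glibc behavior has more edge cases):

```python
def candidate_queries(name, search_domains, ndots=5):
    """Sketch of glibc-style search-list expansion: names with fewer than
    `ndots` dots are tried with each search domain appended before being
    tried as-is. An absolute name (trailing dot) skips expansion."""
    if name.endswith("."):
        return [name.rstrip(".")]
    candidates = []
    if name.count(".") < ndots:
        candidates = [f"{name}.{domain}" for domain in search_domains]
    return candidates + [name]

# Illustrative kops-style in-cluster search path for the "central" namespace.
search = ["central.svc.cluster.local", "svc.cluster.local", "cluster.local"]
print(candidate_queries("myservice.central", search))
```

Each of those candidates is a chance for a dropped packet and a 5 s timeout, which is why bypassing the cluster search path reduced the number of slow requests.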
From the
Sorry if there are conflated issues here --- latency can be caused by many different factors. My particular issue seems to point to a problem with Alpine containers in conjunction with VirtualBox networking when resolving IPv6 traffic. Workarounds:
@alok87 -- you can check the logs for dnsmasq-nanny; it will log when an update has been applied. The ConfigMap update interval is around ~30 s by default, I think.
@bowei yep, it's showing the server flag for my domain.
As alluded to above, if you are still seeing delays even when skipping kube-dns, then I suspect an issue with your DNS server itself, and I would investigate that leg of the DNS query communications.
@bowei Yep, the issue is with that DNS, but I wanted to cache requests in the dnsmasq we already have in place, to reduce the number of requests to the AWS-managed DNS. Does dnsmasq cache custom domains if I use the private DNS setup you mentioned?
Yes, it will be cached with the TTL given by the server. Is it OK to close this issue, as it is not related to k8s itself?
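For reference, the caching semantics being described (an answer is reusable only until its server-supplied TTL expires, after which the upstream must be queried again) can be sketched as:

```python
import time

class TTLCache:
    """Minimal sketch of TTL-based caching, as a DNS cache like dnsmasq
    applies it: each answer is stored with the TTL the upstream returned."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable for testing
        self._entries = {}

    def put(self, name, answer, ttl_seconds):
        self._entries[name] = (answer, self._clock() + ttl_seconds)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        answer, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # expired: caller must re-query upstream
            return None
        return answer
```

So a custom domain answered with, say, a 30-second TTL saves round-trips to the AWS-managed DNS only within that window; a very low TTL from the authoritative server limits how much the cache can help.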
@bowei We have solved this issue in our production cluster. We did the below:
Might not be the same issue, but we have noticed systemic kube-dns problems since moving to 1.6 from 1.4, also on AWS: sporadic kube-dns failures. It also happens consistently (but somewhat rarely) when kube-dns is deployed. We only use kube-dns for internal resolution; everything else gets directed to our own DNS servers, which have never had a problem. When I have some time, I'll try to reproduce it.
@bowei I ran the perf-test on our prod cluster and queried
I suspect you probably exceeded some QPS bound (the tool reports 40k/s), resulting in dropped packets...
@bowei What is the QPS supported by kube-dns at present?
That depends on how many replicas you have; it should be scaled to match your QPS. If you are running dnsperf to max out the system, it will keep increasing the number of packets sent until there is loss... There is also the issue of running out of conntrack entries for UDP.
I suspect this has bitten us too now. Every morning, one of our microservices that renders images from Cloudinary times out resolving DNS via kube-dns. I scaled the kube-dns pods up from 2 to 9, but in vain, even though we had the DNS autoscaler pod from the add-on manager.
So I tried injecting Google's DNS servers into /etc/resolv.conf in my containers, but they were overwritten with the cluster DNS, 100.64.0.10; I don't know what overwrites that. @alok87 how did you bypass kube-dns to use your own DNS? Not to forget, my cluster is on k8s 1.5.7.
You can set the pod dnsPolicy to Default instead of ClusterFirst to disable use of kube-dns. What QPS of DNS queries are you seeing? What platform are you on?
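For reference, that pod-level setting looks like the following (the pod name and image are placeholders); `dnsPolicy: Default` makes the pod inherit the node's /etc/resolv.conf instead of pointing at the cluster DNS service:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-default-example
spec:
  dnsPolicy: Default   # use the node's resolvers; skip kube-dns entirely
  containers:
  - name: app
    image: alpine:3.6
```

Note the naming is counterintuitive: `ClusterFirst` is the default policy for pods, while `Default` means "defer to the node."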
@Miyurz We in our
@bowei Hope we can fix this problem permanently...
We're having similar problems in our cluster. Occasionally DNS requests time out for clients. We already scaled the kube-dns deployment up to about 8 pods manually, so we have about 20% CPU load per pod, shared between kube-dns and dnsmasq. We're also not hitting nf_conntrack limits, and we're using Calico on AWS in a single-AZ setup.
Similar to @hanikesn, I'm also observing sporadic DNS failures against kube-dns. Even using dnsmasq to mitigate it, kube-dns requests will inexplicably time out over a period of seconds. External DNS requests to our own DNS servers are all fine. I haven't dug into it yet, but there seems to be some sort of systemic issue, based on other people's experiences.
Cross-posting, as this all seems related: kubernetes/kubernetes#45976 (comment)
What are your environments (platform, kernel, k8s version)? Can you open a new bug, as this one is closed?
/area dns