
dns i/o timeout with dns01 when trying to issue a certificate via clouddns provider #896

Closed
jar3b opened this issue Sep 13, 2018 · 5 comments · Fixed by #1111

Comments

@jar3b

jar3b commented Sep 13, 2018

Hello.

I'm trying to get certificates from Let's Encrypt using the Google Cloud DNS provider and the dns01 challenge.

cert-manager is installed on bare metal with kubeadm and Helm.

Helm: 2.10.0
Kubectl: 1.11.2
Kubeadm: 1.11.2
Cert-manager: 0.4.1

ClusterIssuer config:

apiVersion: certmanager.k8s.io/v1alpha1
kind: ClusterIssuer
metadata:
  name: dns-issuer
  namespace: {{.Release.Namespace}}
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: mymail@mail.com
    privateKeySecretRef:
      name: dns-issuer-account-key
    dns01:
      providers:
      - name: clouddns
        clouddns:
          project: "{{.Values.googleProjectId}}"
          serviceAccountSecretRef:
            name: clouddns-svc-acct-secret
            key: service-account.json

Certificate config:

apiVersion: certmanager.k8s.io/v1alpha1
kind: Certificate
metadata:
  name: main-domain-crt
  namespace: {{.Release.Namespace}}
spec:
  secretName: main-domain-crt-secret
  commonName: '*.{{.Values.mainDomain}}'
  dnsNames:
  - "{{.Values.mainDomain}}"
  acme:
    config:
    - dns01:
        provider: clouddns
      domains:
      - '*.{{.Values.mainDomain}}'
      - "{{.Values.mainDomain}}"
  issuerRef:
    name: dns-issuer
    kind: ClusterIssuer

I deploy with helm using the flag --set podDnsConfig.nameservers={"8.8.8.8","8.8.4.4"}, and 216.239.32.109 is the Google DNS IP (the NS for my domain) in my case.
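
For reference, the same nameserver override expressed in values.yaml would look roughly like this (a sketch only; it assumes the chart forwards podDnsConfig into the cert-manager pod spec, which is what the --set flag above relies on):

# values.yaml sketch -- equivalent to the --set flag above,
# assuming the chart passes podDnsConfig through to the pod spec
podDnsConfig:
  nameservers:
    - "8.8.8.8"
    - "8.8.4.4"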

And cert-manager logs with errors (multiple times):

I0913 08:24:02.489967       1 dns.go:79] Checking DNS propagation for "example.com" using name servers: [10.96.0.10:53 8.8.8.8:53 8.8.4.4:53]
I0913 08:24:22.542534       1 helpers.go:188] Found status change for Certificate "main-domain-crt" condition "Ready": "False" -> "False"; setting lastTransitionTime to 2018-09-13 08:24:22.542505872 +0000 UTC m=+58735.909437624
I0913 08:24:22.542617       1 sync.go:244] Error preparing issuer for certificate my-namespace/main-domain-crt: [read udp 10.244.0.89:42150->216.239.32.109:53: i/o timeout, another authorization for domain "example.com" is in progress]
E0913 08:24:22.542813       1 sync.go:165] [my-namespace/main-domain-crt] Error getting certificate 'main-domain-crt-secret': secret "main-domain-crt-secret" not found
E0913 08:24:22.553475       1 controller.go:190] certificates controller: Re-queuing item "my-namespace/main-domain-crt" due to error processing: [read udp 10.244.0.89:42150->216.239.32.109:53: i/o timeout, another authorization for domain "example.com" is in progress]

The main problem is the i/o timeout, another authorization for domain "example.com" is in progress error; as a result, the certificate is not issued.

If I run kubectl exec -ti cert-manager-xxxxxxxxxxxx-xxxxx -n kube-system nslookup example.com, I get:

nslookup: can't resolve '(null)': Name does not resolve

Name:      example.com
Address 1: <my ip> <my ip>.kubernetes.default.svc.cluster.local

Does this mean DNS resolution works as expected? I think so, because I can reach hosts by name from inside the pod, but it seems that cert-manager itself cannot reach the DNS server.

I don't use --dns01-self-check-nameservers= because I don't understand how to pass this parameter via helm install. Might this flag solve the problem?

And what is the proper way to obtain certificates? Thanks!

@jar3b
Author

jar3b commented Sep 18, 2018

So, the problem was slow UDP requests (or ones with no response at all; I haven't checked which yet). Similar issues: kubernetes/kubernetes#62628, kubernetes/kubernetes#56903, etc. The proposed solutions (mostly modifying resolv.conf) were not relevant, because cert-manager does not use resolv.conf options for its DNS lookups.

My solution is to patch the cert-manager code to fall back to TCP when a timeout occurs.
The resulting code in pkg/issuer/acme/dns/util/wait.go, starting from line 144:

		// Retry over TCP when the UDP response was truncated or the
		// UDP read timed out (the error seen in the logs above).
		if err == dns.ErrTruncated ||
			(err != nil && strings.HasPrefix(err.Error(), "read udp") && strings.HasSuffix(err.Error(), "i/o timeout")) {
			tcp := &dns.Client{Net: "tcp", Timeout: DNSTimeout}
			// If the TCP request succeeds, err is reset to nil
			in, _, err = tcp.Exchange(m, ns)
		}

With this dirty fix, cert issuance finally works... but the underlying Kubernetes UDP problem is unfortunately still unsolved :(

My proposal for this project: allow the user to force TCP-only resolution in checkAuthoritativeNss() via a flag or environment variable, or something similar. Making DNSTimeout configurable (instead of the default 10 s) would also be a good option.
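
For illustration, here is the same fallback as a self-contained query helper (a sketch only: it assumes an older miekg/dns release in which Exchange still returns dns.ErrTruncated for truncated UDP replies, and the dnsQuery helper name and the 8.8.8.8 target in main are made up for this example):

package main

import (
	"fmt"
	"strings"
	"time"

	"github.com/miekg/dns"
)

// DNSTimeout mirrors cert-manager's default per-query timeout.
const DNSTimeout = 10 * time.Second

// dnsQuery (hypothetical helper) sends a question over UDP and falls
// back to TCP when the reply is truncated or the UDP read times out.
func dnsQuery(fqdn string, rtype uint16, ns string) (*dns.Msg, error) {
	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn(fqdn), rtype)

	udp := &dns.Client{Net: "udp", Timeout: DNSTimeout}
	in, _, err := udp.Exchange(m, ns)

	// Retry over TCP on truncation or on the "read udp ... i/o timeout"
	// error reported in this issue.
	if err == dns.ErrTruncated ||
		(err != nil && strings.HasPrefix(err.Error(), "read udp") && strings.HasSuffix(err.Error(), "i/o timeout")) {
		tcp := &dns.Client{Net: "tcp", Timeout: DNSTimeout}
		in, _, err = tcp.Exchange(m, ns)
	}
	return in, err
}

func main() {
	in, err := dnsQuery("example.com", dns.TypeTXT, "8.8.8.8:53")
	fmt.Println(in, err)
}

A TCP-only mode would simply use the tcp client unconditionally, and making DNSTimeout a flag would cover the second proposal.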

@chriskolenko

For the helm chart, use extraArgs: ["--dns01-self-check-nameservers=8.8.8.8:53"]
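
In values.yaml form, that would be roughly (a sketch; it assumes the chart version in use appends extraArgs to the cert-manager container's command line):

# values.yaml sketch -- assumes the chart forwards extraArgs
# to the cert-manager container's arguments
extraArgs:
  - --dns01-self-check-nameservers=8.8.8.8:53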

@kellycampbell
Contributor

I ran into a similar issue using AWS. The problem started a week or two ago. Before that it was running fine.

The error from 0.5.2 was:

I1126 12:51:37.281048       1 sync.go:276] Error preparing issuer for certificate ambassador/ambassador: [read udp [f00d::6460:400:0:39f6]:39068->[2600:9000:5307:1400::1]:53: i/o timeout, read udp [f00d::6460:400:0:39f6]:33329->[2600:9000:5305:c000::1]:53: i/o timeout]

I was able to do nslookup from within the cert-manager pod using the ipv6 address above.

A similar error from a version I built from master, after setting the --dns01-self-check-nameservers flag:

E1127 16:01:29.784036       1 controller.go:162] challenges controller: Re-queuing item "ambassador/ambassador-1311446307-1" due to error processing: read udp [f00d::6460:400:0:b64a]:50841->[2600:9000:5305:c000::1]:53: i/o timeout

I noticed the dns library revision is over a year old, so I tried updating it, but that didn't fix the issue.

The TCP fallback from the comment above worked.

I can send PRs for both the dns dependency update and the TCP fallback.

@munnerz
Member

munnerz commented Nov 27, 2018

Interesting that there appear to be issues with IPv6 too. I don't have an environment set up to test this in, nor have I been able to reproduce it.

@kellycampbell
Contributor

kellycampbell commented Nov 27, 2018

This was on a test cluster running k8s 1.11.4, built with kops 1.11alpha, using the Cilium network provider, on t3 instance types.

I found this issue on the dns project, which says that DNS servers will ignore invalid requests: miekg/dns#784

The fact that a TCP request works OK makes me think it's something to do with the length of the request. Maybe the IPv6 addressing contributes to this?

FYI, the hostnames it was failing on were 32 and 30 chars long.
