
dns i/o timeout with dns01 when trying to issue a certificate via clouddns provider #896

Closed
jar3b opened this issue Sep 13, 2018 · 5 comments · Fixed by #1111

Comments

@jar3b

jar3b commented Sep 13, 2018

Hello.

I'm trying to get certificates from Let's Encrypt using the Google Cloud DNS provider and the dns01 challenge.

cert-manager is installed on bare metal with kubeadm and Helm.

Helm: 2.10.0
Kubectl: 1.11.2
Kubeadm: 1.11.2
Cert-manager: 0.4.1

ClusterIssuer config:

apiVersion: certmanager.k8s.io/v1alpha1
kind: ClusterIssuer
metadata:
  name: dns-issuer
  namespace: {{.Release.Namespace}}
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: mymail@mail.com
    privateKeySecretRef:
      name: dns-issuer-account-key
    dns01:
      providers:
      - name: clouddns
        clouddns:
          project: "{{.Values.googleProjectId}}"
          serviceAccountSecretRef:
            name: clouddns-svc-acct-secret
            key: service-account.json

Certificate config:

apiVersion: certmanager.k8s.io/v1alpha1
kind: Certificate
metadata:
  name: main-domain-crt
  namespace: {{.Release.Namespace}}
spec:
  secretName: main-domain-crt-secret
  commonName: '*.{{.Values.mainDomain}}'
  dnsNames:
  - "{{.Values.mainDomain}}"
  acme:
    config:
    - dns01:
        provider: clouddns
      domains:
      - '*.{{.Values.mainDomain}}'
      - "{{.Values.mainDomain}}"
  issuerRef:
    name: dns-issuer
    kind: ClusterIssuer

I deploy with helm using the flag --set podDnsConfig.nameservers={"8.8.8.8","8.8.4.4"}, and 216.239.32.109 is the Google DNS IP (the NS for my domain) in my case.
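
For reference, the same nameserver override expressed in values.yaml would look roughly like this (a sketch only; it assumes the chart forwards podDnsConfig into the cert-manager pod spec, which is what the --set flag above relies on):

# values.yaml sketch -- equivalent to the --set flag above,
# assuming the chart passes podDnsConfig through to the pod spec
podDnsConfig:
  nameservers:
    - "8.8.8.8"
    - "8.8.4.4"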

And cert-manager logs with errors (multiple times):

I0913 08:24:02.489967       1 dns.go:79] Checking DNS propagation for "example.com" using name servers: [10.96.0.10:53 8.8.8.8:53 8.8.4.4:53]
I0913 08:24:22.542534       1 helpers.go:188] Found status change for Certificate "main-domain-crt" condition "Ready": "False" -> "False"; setting lastTransitionTime to 2018-09-13 08:24:22.542505872 +0000 UTC m=+58735.909437624
I0913 08:24:22.542617       1 sync.go:244] Error preparing issuer for certificate my-namespace/main-domain-crt: [read udp 10.244.0.89:42150->216.239.32.109:53: i/o timeout, another authorization for domain "example.com" is in progress]
E0913 08:24:22.542813       1 sync.go:165] [my-namespace/main-domain-crt] Error getting certificate 'main-domain-crt-secret': secret "main-domain-crt-secret" not found
E0913 08:24:22.553475       1 controller.go:190] certificates controller: Re-queuing item "my-namespace/main-domain-crt" due to error processing: [read udp 10.244.0.89:42150->216.239.32.109:53: i/o timeout, another authorization for domain "example.com" is in progress]

The main problem is the i/o timeout, another authorization for domain "example.com" is in progress error; as a result, the certificate is not issued.

If I run kubectl exec -ti cert-manager-xxxxxxxxxxxx-xxxxx -n kube-system nslookup example.com, I get:

nslookup: can't resolve '(null)': Name does not resolve

Name:      example.com
Address 1: <my ip> <my ip>.kubernetes.default.svc.cluster.local

Does this mean DNS resolution works as expected? I think so, because I can reach hosts by name from inside the pod, but it seems that cert-manager itself cannot reach the DNS server.

I don't use --dns01-self-check-nameservers= because I don't understand how to pass this parameter via helm install. Might this flag solve the problem?

And what is the proper way to obtain certificates? Thanks!

@jar3b
Author

jar3b commented Sep 18, 2018

So, the problem was slow UDP requests (or ones with no response at all; I haven't checked which yet). Similar issues: kubernetes/kubernetes#62628, kubernetes/kubernetes#56903, etc. The proposed solutions (mostly modifying resolv.conf) were not relevant, because cert-manager does not use resolv.conf options for its DNS lookups.

My solution is to patch the cert-manager code to fall back to TCP when a timeout occurs.
The resulting code in pkg/issuer/acme/dns/util/wait.go, starting from line 144:

		// Retry over TCP when the UDP response was truncated or the
		// UDP read timed out (the error seen in the logs above).
		if err == dns.ErrTruncated ||
			(err != nil && strings.HasPrefix(err.Error(), "read udp") && strings.HasSuffix(err.Error(), "i/o timeout")) {
			tcp := &dns.Client{Net: "tcp", Timeout: DNSTimeout}
			// If the TCP request succeeds, err is reset to nil
			in, _, err = tcp.Exchange(m, ns)
		}

With this dirty fix, cert issuance finally works... but the underlying Kubernetes UDP problem is unfortunately still unsolved :(

My proposal for this project: allow the user to force TCP-only resolution in checkAuthoritativeNss() via a flag or environment variable, or something similar. Making DNSTimeout configurable (instead of the default 10 s) would also be a good option.
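
For illustration, here is the same fallback as a self-contained query helper (a sketch only: it assumes an older miekg/dns release in which Exchange still returns dns.ErrTruncated for truncated UDP replies, and the dnsQuery helper name and the 8.8.8.8 target in main are made up for this example):

package main

import (
	"fmt"
	"strings"
	"time"

	"github.com/miekg/dns"
)

// DNSTimeout mirrors cert-manager's default per-query timeout.
const DNSTimeout = 10 * time.Second

// dnsQuery (hypothetical helper) sends a question over UDP and falls
// back to TCP when the reply is truncated or the UDP read times out.
func dnsQuery(fqdn string, rtype uint16, ns string) (*dns.Msg, error) {
	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn(fqdn), rtype)

	udp := &dns.Client{Net: "udp", Timeout: DNSTimeout}
	in, _, err := udp.Exchange(m, ns)

	// Retry over TCP on truncation or on the "read udp ... i/o timeout"
	// error reported in this issue.
	if err == dns.ErrTruncated ||
		(err != nil && strings.HasPrefix(err.Error(), "read udp") && strings.HasSuffix(err.Error(), "i/o timeout")) {
		tcp := &dns.Client{Net: "tcp", Timeout: DNSTimeout}
		in, _, err = tcp.Exchange(m, ns)
	}
	return in, err
}

func main() {
	in, err := dnsQuery("example.com", dns.TypeTXT, "8.8.8.8:53")
	fmt.Println(in, err)
}

A TCP-only mode would simply use the tcp client unconditionally, and making DNSTimeout a flag would cover the second proposal.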

@chriskolenko

For the helm chart, use extraArgs: ["--dns01-self-check-nameservers=8.8.8.8:53"]
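
In values.yaml form, that would be roughly (a sketch; it assumes the chart version in use appends extraArgs to the cert-manager container's command line):

# values.yaml sketch -- assumes the chart forwards extraArgs
# to the cert-manager container's arguments
extraArgs:
  - --dns01-self-check-nameservers=8.8.8.8:53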

@kellycampbell
Contributor

I ran into a similar issue using AWS. The problem started a week or two ago. Before that it was running fine.

The error from 0.5.2 was:

I1126 12:51:37.281048       1 sync.go:276] Error preparing issuer for certificate ambassador/ambassador: [read udp [f00d::6460:400:0:39f6]:39068->[2600:9000:5307:1400::1]:53: i/o timeout, read udp [f00d::6460:400:0:39f6]:33329->[2600:9000:5305:c000::1]:53: i/o timeout]

I was able to do nslookup from within the cert-manager pod using the ipv6 address above.

A similar error from a version I built from master, after setting the --dns01-self-check-nameservers flag:

E1127 16:01:29.784036       1 controller.go:162] challenges controller: Re-queuing item "ambassador/ambassador-1311446307-1" due to error processing: read udp [f00d::6460:400:0:b64a]:50841->[2600:9000:5305:c000::1]:53: i/o timeout

I noticed the dns library revision is over a year old, so I tried updating it, but that didn't fix the issue.

The TCP fallback from the comment above worked.

I can send PRs for both the dns dependency update and the TCP fallback.

@munnerz
Member

munnerz commented Nov 27, 2018

Interesting that there appear to be issues with IPv6 too. I don't have an environment set up to test this in, nor have I been able to reproduce it.

@kellycampbell
Contributor

kellycampbell commented Nov 27, 2018

This was on a test cluster running k8s 1.11.4, built with kops 1.11alpha, using the Cilium network provider, on t3 instance types.

I found this issue on the dns project, which says that DNS servers will ignore invalid requests: miekg/dns#784

The fact that a TCP request works OK makes me think it's something to do with the length of the request. Maybe the IPv6 addressing contributes to this?

FYI, the hostnames it was failing on were 32 and 30 chars long.
