
Intermittent CrashLoopBackOff in Windows Containers Running on AKS (.NET 6 Apps with System.Net.Sockets.SocketException 11001 and 10060) #3598

Closed
chamindac opened this issue Apr 6, 2023 · 11 comments
Labels: action-required, Needs Attention 👋, question, stale

@chamindac commented Apr 6, 2023

I am running .NET 6 apps in AKS: 11 Linux-based apps and 11 Windows-based apps (the latter due to a legacy dependency on a C++ library that requires the Windows platform). After all 22 apps are deployed via parallel Azure DevOps pipelines, each using Kubernetes manifest files, one or two apps randomly fail to start with an intermittent CrashLoopBackOff. Each app runs in a separate pod, and some apps have a minimum of 2 instances. After a deployment, one instance of an app sometimes starts fine while another instance of the same app goes into CrashLoopBackOff while trying to connect to the Azure App Configuration service.

Analyzing the logs of a CrashLoopBackOff container shows that it fails to connect to the Azure App Configuration service, which uses an Azure Key Vault to keep secrets. The error is one of the following.

System.Net.Sockets.SocketException (10060): A connection attempt failed

---> Azure.RequestFailedException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (my-demo-dev-appconfig-ac.azconfig.io:443)
 ---> System.Net.Http.HttpRequestException: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. (my-demo-dev-appconfig-ac.azconfig.io:443)
 ---> System.Net.Sockets.SocketException (10060): A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.

System.Net.Sockets.SocketException (11001): No such host is known.

Unhandled exception. System.AggregateException: Retry failed after 3 tries. Retry settings can be adjusted in ClientOptions.Retry. (No such host is known. (my-demo-dev-appconfig-ac.azconfig.io:443)) (No such host is known. (my-demo-dev-appconfig-ac.azconfig.io:443)) (No such host is known. (my-demo-dev-appconfig-ac.azconfig.io:443))
 ---> Azure.RequestFailedException: No such host is known. (my-demo-dev-appconfig-ac.azconfig.io:443)
 ---> System.Net.Http.HttpRequestException: No such host is known. (my-demo-dev-appconfig-ac.azconfig.io:443)
 ---> System.Net.Sockets.SocketException (11001): No such host is known.

Attempted Fixes

Initially the issue was happening in both Windows and Linux containers. However, after adding the following dnsConfig to the Kubernetes deployment manifests, the Linux containers no longer run into CrashLoopBackOff, and all Linux apps start fine after a deployment or a rolling restart.

template:
    metadata:
      labels:
        app: ${aks_app_name}$
        service: ${aks_app_name}$
    spec:
      nodeSelector:
        "kubernetes.io/os": ${aks_app_nodeselector}$
      priorityClassName: ${aks_app_container_priorityclass_name}$
      #------------------------------------------------------
      # setting pod DNS policies to enable faster DNS resolution
      # https://kubernetes.io/docs/concepts/services-networking/dns-pod-service/#pod-s-dns-policy
      dnsConfig:
        options:
          # use FQDNs everywhere
          # any cluster-local access from pods needs the fully qualified name to resolve
          # short names will not resolve to internal cluster domains
          - name: ndots
            value: "2"
          # dns resolver timeout and attempts
          - name: timeout
            value: "15"
          - name: attempts
            value: "3"
          # use TCP to resolve DNS instead of UDP (UDP is lossy and pods need to wait for a timeout for lost packets)
          - name: use-vc
          # open new socket for retrying
          - name: single-request-reopen
      #------------------------------------------------------
      volumes:

However, the Windows containers still run into this issue randomly after most deployments. I have not seen it happen when pods are scaled out under load by the horizontal pod autoscaler. On rolling restarts, the CrashLoopBackOff occurs rarely for the Windows containers, with one of the socket exceptions mentioned above.

The following articles, GitHub issues, etc. have already been referred to for this.

Setting up a node-local DNS cache is not possible in AKS, as AKS does not allow it by design. #1435

From the .NET point of view, the following were referred to.

I tried adding the code changes below to increase the number of retries and the delays between retries in the apps.

// Retry settings for the Azure App Configuration client
return configurationBuilder.AddAzureAppConfiguration(options =>
{
    options
        .Connect(appConfigurationEndpoint)
        .ConfigureClientOptions(clientOptions =>
        {
            clientOptions.Retry.Delay = TimeSpan.FromSeconds(10);
            clientOptions.Retry.MaxDelay = TimeSpan.FromSeconds(40);
            clientOptions.Retry.MaxRetries = 5;
            clientOptions.Retry.Mode = RetryMode.Exponential;
        });
});

// Retry settings for the Key Vault SecretClient
SecretClientOptions secretClientOptions = new()
{
    Retry =
    {
        Delay = TimeSpan.FromSeconds(10),
        MaxDelay = TimeSpan.FromSeconds(40),
        MaxRetries = 5,
        Mode = RetryMode.Exponential
    }
};
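
For reference, the sketch below shows one way the two snippets above might be combined, registering the retry-configured SecretClient through ConfigureKeyVault so that Key Vault references resolved by App Configuration get the same retry behavior. The wiring is an assumption (the issue does not show it), and AddAppConfigWithKeyVault, appConfigurationConnectionString, keyVaultUri, and DefaultAzureCredential (from Azure.Identity) are placeholders rather than names from the original code.

using System;
using Azure.Core;
using Azure.Identity;
using Azure.Security.KeyVault.Secrets;
using Microsoft.Extensions.Configuration;

public static class AppConfigSetup
{
    public static IConfigurationBuilder AddAppConfigWithKeyVault(
        IConfigurationBuilder configurationBuilder,
        string appConfigurationConnectionString, // assumed connection value, as in the issue's Connect call
        Uri keyVaultUri)                         // placeholder: the Key Vault backing App Configuration
    {
        // Retry settings for the Key Vault SecretClient (same values as above).
        SecretClientOptions secretClientOptions = new()
        {
            Retry =
            {
                Delay = TimeSpan.FromSeconds(10),
                MaxDelay = TimeSpan.FromSeconds(40),
                MaxRetries = 5,
                Mode = RetryMode.Exponential
            }
        };

        return configurationBuilder.AddAzureAppConfiguration(options =>
        {
            options
                .Connect(appConfigurationConnectionString)
                .ConfigureClientOptions(clientOptions =>
                {
                    // Retry settings for the App Configuration client (same values as above).
                    clientOptions.Retry.Delay = TimeSpan.FromSeconds(10);
                    clientOptions.Retry.MaxDelay = TimeSpan.FromSeconds(40);
                    clientOptions.Retry.MaxRetries = 5;
                    clientOptions.Retry.Mode = RetryMode.Exponential;
                })
                // Assumed wiring (not shown in the issue): resolve Key Vault references
                // with the retry-configured SecretClient.
                .ConfigureKeyVault(keyVault =>
                    keyVault.Register(
                        new SecretClient(keyVaultUri, new DefaultAzureCredential(), secretClientOptions)));
        });
    }
}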

But these retry delay etc. changes in app only resulted in, such failing container to take long time like 20 minutes to run into CrashLoopBackOff. Without above retry settings, such container (pod) runs into CrashLoopBackOff sooner with one of above socket exceptions. Sometimes when such a pod is left alone for longer time like 45 minutes to one hour it automatically manages to get up and running on the restart attempts by Kubernetes.
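
Since the container ultimately exits because the AggregateException from the exhausted retries goes unhandled, another app-side option (a hypothetical sketch, not something tried in the issue) would be to wrap the whole configuration load in an outer retry loop at startup, so the process waits and retries instead of crashing:

using System;
using System.Threading;
using Microsoft.Extensions.Configuration;

public static class StartupConfigLoader
{
    // buildConfiguration is a placeholder for the AddAzureAppConfiguration setup shown above,
    // e.g. () => new ConfigurationBuilder().AddAzureAppConfiguration(...).Build()
    public static IConfiguration LoadWithRetry(Func<IConfiguration> buildConfiguration, int maxAttempts = 10)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return buildConfiguration();
            }
            catch (AggregateException ex) when (attempt < maxAttempts)
            {
                // The "Retry failed after 3 tries" / "No such host is known" failures
                // surface here; wait and retry instead of letting the process exit.
                Console.WriteLine($"Configuration load attempt {attempt} failed: {ex.Message}. Retrying...");
                Thread.Sleep(TimeSpan.FromSeconds(30));
            }
        }
    }
}

Whether keeping the pod alive and retrying in-process is preferable to letting Kubernetes restart it is a design choice; this sketch only trades CrashLoopBackOff restarts for a longer-running startup.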

Any thoughts or ideas for a fix, either on the AKS side or in the .NET app, would be helpful, as this CrashLoopBackOff is annoying. For now, after a deployment I check the AKS workloads and pods from the Azure portal (or using kubectl) and kill any pod stuck in CrashLoopBackOff; Kubernetes then creates a replacement pod, which runs fine.

@chamindac (Author) commented

As a workaround, I have automated the cleanup of pods running into CrashLoopBackOff with the known socket exceptions, as described here. The workaround is OK for now; however, I would prefer to fix the issue rather than rely on it.
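
The linked workaround itself is not reproduced in this thread. Purely as an illustration (not necessarily how it was actually implemented), a minimal cleanup of this kind using the Kubernetes C# client (the KubernetesClient NuGet package) could look like the sketch below; the namespace parameter is a placeholder, and the real workaround also filters on the known socket exceptions in the container logs, which this sketch omits.

using System;
using System.Linq;
using System.Threading.Tasks;
using k8s;

public static class CrashLoopCleanup
{
    public static async Task CleanAsync(string ns) // ns: target namespace (placeholder)
    {
        // In-cluster config when running inside the cluster; BuildConfigFromConfigFile() when run locally.
        var config = KubernetesClientConfiguration.InClusterConfig();
        IKubernetes client = new Kubernetes(config);

        var pods = await client.CoreV1.ListNamespacedPodAsync(ns);
        foreach (var pod in pods.Items)
        {
            var crashLooping = pod.Status?.ContainerStatuses?
                .Any(cs => cs.State?.Waiting?.Reason == "CrashLoopBackOff") ?? false;

            if (crashLooping)
            {
                // Deleting the pod lets the ReplicaSet create a fresh one, which
                // (as described above) usually starts fine.
                Console.WriteLine($"Deleting pod {pod.Metadata.Name} (CrashLoopBackOff)");
                await client.CoreV1.DeleteNamespacedPodAsync(pod.Metadata.Name, ns);
            }
        }
    }
}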

ghost added the action-required label May 1, 2023

ghost commented May 6, 2023

Action required from @Azure/aks-pm

ghost added the Needs Attention 👋 label May 6, 2023

ghost commented May 21, 2023

Issue needing attention of @Azure/aks-leads

5 similar comments

microsoft-github-policy-service bot added the stale label Feb 2, 2024

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.


This issue will now be closed because it hasn't had any activity for 7 days after stale. chamindac feel free to comment again on the next 7 days to reopen or open a new issue after that time if you still have a question/issue or suggestion.
