Intermittent CrashLoopBackOff in Windows Containers Running on AKS (.NET 6 Apps with System.Net.Sockets.SocketException 11001 and 10060) #3598
Comments
I have automated the cleanup of pods running into CrashLoopBackOff with the known socket exceptions, as a workaround as described here. The workaround is OK for now; however, I would prefer to fix the issue rather than relying on it.
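The automation itself is not shown in this thread; a minimal sketch of that kind of cleanup, assuming `kubectl` access to the cluster (the namespace `apps` is a placeholder):

```shell
# Delete every pod whose container is waiting in CrashLoopBackOff.
# Kubernetes then recreates the pod, which usually starts fine.
kubectl get pods -n apps \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}' \
  | grep CrashLoopBackOff \
  | cut -f1 \
  | xargs -r -n1 kubectl delete pod -n apps
```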
Action required from @Azure/aks-pm
Issue needing attention of @Azure/aks-leads
This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.
Issue needing attention of @Azure/aks-leads
This issue will now be closed because it hasn't had any activity for 7 days after being marked stale. @chamindac, feel free to comment again within the next 7 days to reopen, or open a new issue after that time if you still have a question, issue, or suggestion.
I am running .NET 6 apps in AKS: 11 Linux-based apps and 11 Windows-based apps (due to a legacy dependency on a C++ library that requires the Windows platform). Once a deployment of all 22 apps is done via parallel Azure DevOps pipelines, which use Kubernetes manifest files to deploy each app, apps intermittently go into CrashLoopBackOff at random: one or two apps randomly fail at startup. Each app runs in a separate pod, and some apps have a minimum of 2 instances. After a deployment, sometimes one instance starts fine while another instance of the same app goes into CrashLoopBackOff while trying to connect to the Azure App Configuration service.
Analyzing the logs of a CrashLoopBackOff container shows that it fails to connect to the Azure App Configuration service, which uses an Azure Key Vault to keep secrets. The error is one of the following:
System.Net.Sockets.SocketException (10060): A connection attempt failed
System.Net.Sockets.SocketException (11001): No such host is known.
Attempted Fixes
Initially the issue was happening in both Windows and Linux containers. However, after making the following fix (dnsConfig) in the Kubernetes deployment manifests, the Linux containers no longer hit CrashLoopBackOff, and all Linux apps start fine after a deployment or a rolling restart.
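The dnsConfig snippet itself did not survive the issue formatting; a minimal sketch of that kind of manifest change, assuming the commonly used `ndots` option (the exact options in the original fix are not shown):

```yaml
# Hypothetical deployment fragment; option values are illustrative.
# Lowering ndots reduces the number of search-domain lookups tried
# before the external hostname is resolved as-is.
spec:
  template:
    spec:
      dnsConfig:
        options:
          - name: ndots
            value: "2"
```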
However, the Windows containers still randomly run into this issue after a deployment, most of the time. I have not seen it happen when pods scale under load based on horizontal pod autoscaler settings. During rolling restarts, CrashLoopBackOff occurs only rarely for the Windows containers, with one of the socket exceptions above.
The following articles, GitHub issues, etc. have already been consulted for this.
Setting up a node-local DNS cache is not possible, as AKS does not allow this by design. #1435
From the .NET point of view, the following were consulted.
Tried adding code changes to increase the number of retries and the delays between retries in the apps.
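The exact code changes are not shown in this thread; a minimal sketch of that kind of retry tuning, assuming the Microsoft.Extensions.Configuration.AzureAppConfiguration provider is used (the endpoint, credential, and retry values are placeholders, not the original configuration):

```csharp
using Azure.Core;     // RetryMode
using Azure.Identity; // DefaultAzureCredential

var builder = WebApplication.CreateBuilder(args);

builder.Configuration.AddAzureAppConfiguration(options =>
{
    // Placeholder endpoint and credential; the original app may
    // connect with a connection string instead.
    options.Connect(new Uri("https://my-appconfig.azconfig.io"),
                    new DefaultAzureCredential());

    // Raise the Azure SDK client's retry count and back-off delays
    // so transient DNS/connect failures are retried longer.
    options.ConfigureClientOptions(clientOptions =>
    {
        clientOptions.Retry.MaxRetries = 10;                  // default is 3
        clientOptions.Retry.Delay = TimeSpan.FromSeconds(2);  // initial back-off
        clientOptions.Retry.MaxDelay = TimeSpan.FromSeconds(30);
        clientOptions.Retry.Mode = RetryMode.Exponential;
    });
});
```

Note that this only stretches out the failure window, which matches the behavior described below: the container still crashes, just later.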
But these retry/delay changes in the apps only resulted in the failing container taking a long time, around 20 minutes, to go into CrashLoopBackOff. Without the retry settings, such a container (pod) hits CrashLoopBackOff sooner, with one of the socket exceptions above. Sometimes, when such a pod is left alone for a longer time, around 45 minutes to an hour, it eventually gets up and running through the restart attempts made by Kubernetes.
Any thoughts or ideas for a fix, either in AKS or in the .NET apps, would be helpful, as this CrashLoopBackOff is annoying. What is being done now: after a deployment, I check the AKS workloads and pods from the Azure portal (or using kubectl) and kill the pod in CrashLoopBackOff; Kubernetes then creates another pod, which runs fine.