daemon: Run conntrack GC after Endpoint Restore #32012
Conversation
/test
Wow, easy. Is this a mostly cosmetic change (as referenced in #32013), or is this also causing perceptible issues?
Because #31205 added an explicit revision counter, this change doesn't introduce any risk of GCing active connections. So that looks good 👍
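For intuition, here's a minimal sketch of the revision-counter pattern that comment refers to. This is not Cilium's actual implementation — `ctMap`, `entry`, `insert`, and `gc` are made-up names — it only illustrates why a GC pass that snapshots the revision at scan start cannot collect entries written afterwards:

```go
package main

import "fmt"

// entry is a hypothetical conntrack record tagged with the map revision
// at which it was last written.
type entry struct {
	key      string
	revision uint64
	expired  bool
}

// ctMap is a toy map whose revision counter is bumped on every write.
type ctMap struct {
	revision uint64
	entries  []entry
}

func (m *ctMap) insert(key string) {
	m.revision++
	m.entries = append(m.entries, entry{key: key, revision: m.revision})
}

// gc snapshots the revision when the scan begins and only collects expired
// entries written at or before that snapshot. In the real map, writers bump
// the revision concurrently, so an entry refreshed mid-scan is skipped even
// if it looked expired when the scan started.
func (m *ctMap) gc() {
	scanRev := m.revision
	kept := m.entries[:0]
	for _, e := range m.entries {
		if e.expired && e.revision <= scanRev {
			continue // stale and untouched since the scan began: collect it
		}
		kept = append(kept, e)
	}
	m.entries = kept
}

func main() {
	m := &ctMap{}
	m.insert("conn-1")
	m.entries[0].expired = true
	m.insert("conn-2") // still active

	m.gc()
	fmt.Println(len(m.entries)) // 1: only the expired conn-1 was collected
}
```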
The main thing that I noticed with #32013 is that it didn't work for the first several minutes after startup without this change. The reason is that #32013 relies on understanding the timing of previous and next conntrack GC events from this logic running against all endpoints. Because of the timing before this PR, the initial GC would not report those prev/next conntrack GC timestamps into the DNS history for each endpoint (see cilium/pkg/maps/ctmap/gc/gc.go, lines 167 to 169 at a6b8538).
In my local testing, if I waited another ~7m30s then the next conntrack GC would trigger, the DNS timers would be updated, and everything would work as expected. The idea behind the conntrack GC on startup is to ensure that when Cilium starts, everything is as up-to-date and synchronized as it can be. So we trigger GC on startup, and for the most part this works. There are only a couple of things that rely on a correct list: (a) marking the alive time for individual connections in
So the main impact I can identify at this point is just that the garbage collection we expect to happen within the first ~30s of Cilium startup will be deferred to the first ~10m of Cilium startup. Now that I put it that way, the logic seems inconsistent with the intent of the code, but it's not that severe a bug.
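To make that timing dependency concrete, here's a minimal, self-contained sketch; `GCInterval`, `Endpoint`, `UpdateCTGCTime`, and `runGC` are hypothetical stand-ins rather than Cilium's API, and the ~7m30s interval is taken from the local testing observation above. It shows why a GC pass that runs before any endpoints are exposed leaves the DNS timestamps empty until the next interval:

```go
package main

import (
	"fmt"
	"time"
)

// GCInterval is an assumed conntrack GC period, matching the ~7m30s
// wait observed above.
const GCInterval = 7*time.Minute + 30*time.Second

// Endpoint is a stand-in that records the previous/next GC timestamps
// that the real code pushes into each endpoint's DNS history.
type Endpoint struct {
	ID           int
	PrevCTGCTime time.Time
	NextCTGCTime time.Time
}

func (e *Endpoint) UpdateCTGCTime(prev, next time.Time) {
	e.PrevCTGCTime, e.NextCTGCTime = prev, next
}

// runGC marks the GC window on every endpoint currently exposed. If it
// runs before any endpoints are restored, the loop is a no-op and the
// timestamps stay zero until the next interval fires.
func runGC(endpoints []*Endpoint) {
	now := time.Now()
	for _, e := range endpoints {
		e.UpdateCTGCTime(now, now.Add(GCInterval))
	}
}

func main() {
	runGC(nil) // startup GC before restore: no endpoints, nothing updated

	restored := []*Endpoint{{ID: 1}}
	runGC(restored) // GC after restore: timestamps are populated

	fmt.Println(restored[0].NextCTGCTime.Sub(restored[0].PrevCTGCTime)) // 7m30s
}
```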
Force-pushed from d1135ed to 15772a8
/ci-l4lb
Force-pushed from 39c0a25 to e828af6
Nice
Force-pushed from e828af6 to 584a1b6
Could you please rebase?
The reverse call tree for RestoreEndpoint, which exposes all restored endpoints in the EndpointManager, is as follows:

INCOMING CALLS
- f RestoreEndpoint                github.com/cilium/cilium/pkg/endpointmanager
  - f regenerateRestoredEndpoints  github.com/cilium/cilium/pkg/endpointmanager
    - f initRestore                github.com/cilium/cilium/daemon/cmd
      + f startDaemon              github.com/cilium/cilium/daemon/cmd

Previously, the `CTNATMapGC.Enable()` call, which invokes `gc.endpointsManager.GetEndpoints()`, would be called prior to exposing these endpoints in the EndpointManager. As a result, the step where the initial scan attempts to update each Endpoint's DNSHistory with the latest CT GC timers would fail, leaving the timestamps empty.

The potential impact of this is that DNS entries that should expire soon after a cilium-agent restart may not time out for an extra entire conntrack garbage collection interval several minutes later.

Signed-off-by: Joe Stringer <joe@cilium.io>
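As a sketch of the ordering fix the commit message describes — the types and method bodies here are simplified stand-ins, not the literal daemon code:

```go
package main

import "fmt"

// EndpointManager is a minimal stand-in: GetEndpoints only returns
// endpoints that have already been exposed via RestoreEndpoint.
type EndpointManager struct{ endpoints []string }

func (m *EndpointManager) GetEndpoints() []string { return m.endpoints }
func (m *EndpointManager) RestoreEndpoint(name string) {
	m.endpoints = append(m.endpoints, name)
}

// CTNATMapGC stands in for the conntrack/NAT map garbage collector.
type CTNATMapGC struct{ mgr *EndpointManager }

// Enable kicks off the initial GC scan over all exposed endpoints.
func (gc *CTNATMapGC) Enable() {
	fmt.Printf("initial GC sees %d endpoint(s)\n", len(gc.mgr.GetEndpoints()))
}

func main() {
	mgr := &EndpointManager{}
	gc := &CTNATMapGC{mgr: mgr}

	// Before this PR, Enable() ran here, so the initial scan saw zero
	// endpoints and wrote no DNSHistory timestamps:
	//   gc.Enable() // -> "initial GC sees 0 endpoint(s)"

	// initRestore -> regenerateRestoredEndpoints -> RestoreEndpoint
	mgr.RestoreEndpoint("ep-1")

	// After this PR, the restored endpoints are exposed before GC runs.
	gc.Enable() // -> "initial GC sees 1 endpoint(s)"
}
```

The fix is purely a reordering: since `GetEndpoints()` only returns endpoints already exposed in the EndpointManager, `Enable()` has to run after the restore path has completed for the initial scan to do useful work.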
Force-pushed from 15772a8 to 09fac0c
/test
Relies on #32068 to debug test failures.