Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

daemon: Run conntrack GC after Endpoint Restore #32012

Merged
merged 1 commit into from
Apr 30, 2024

Conversation

joestringer
Copy link
Member

@joestringer joestringer commented Apr 17, 2024

Relies on #32068 to debug test failures.

The reverse call tree for RestoreEndpoint, which exposes all restored
endpoints in the EndpointManager, is as follows:

INCOMING CALLS
- f RestoreEndpoint github.com/cilium/cilium/pkg/endpointmanager
  - f regenerateRestoredEndpoints github.com/cilium/cilium/pkg/endpointmanager
    - f initRestore github.com/cilium/cilium/daemon/cmd
      + f startDaemon github.com/cilium/cilium/daemon/cmd

Previously, the CTNATMapGC.Enable() call, which invokes
gc.endpointsManager.GetEndpoints(), would be called prior to exposing
these endpoints in the EndpointManager. As a result, the step where the
initial scan attempts to update each Endpoint's DNSHistory with the
latest CT GC timers would fail, leaving the timestamps empty.

The potential impact of this is that DNS entries that should expire soon
after a cilium-agent restart may not time out for an extra entire
conntrack garbage collection interval several minutes later.

@joestringer joestringer added release-note/bug This PR fixes an issue in a previous release of Cilium. needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch labels Apr 17, 2024
@joestringer joestringer requested a review from squeed April 17, 2024 00:06
@joestringer
Copy link
Member Author

/test

@squeed
Copy link
Contributor

squeed commented Apr 17, 2024

Wow, easy.

Is this a mostly cosmetic change (as referenced in #32013), or is this also causing perceptible issues?

@squeed
Copy link
Contributor

squeed commented Apr 17, 2024

Because #31205 added an explicit revision counter, this change doesn't introduce any risk of GCing active connections. So that looks good 👍

@joestringer
Copy link
Member Author

The main thing that I noticed with #32013 is that #32013 didn't work for the first several minutes after startup without this change. The reason is that #32013 relies on understanding the timing of previous and next conntrack GC events from this logic running against all endpoints. Because of the timing before this PR, the initial GC would not report those prev/next conntrack GC timestamps into the DNS history for each endpoint here (eps would be empty):

for _, e := range eps {
e.MarkCTGCTime(gcStart)
}

In my local testing, if I waited another ~7m30s then the next conntrack GC would trigger and the DNS timers would be updated, then everything would work as expected.

The idea behind the conntrack GC on startup is to ensure that when Cilium starts, everything is as up-to-date and synchronized as it can be. So we trigger GC on startup, and for the most part this works. There's only a couple of things that rely on a correct list, (a) marking the alive time for individual connections in Endpoint.DNSZombies and (b) Setting the most recent run time for CT GC inside the Endpoint.DNSZombies. Both of these end up boiling down to a set of checks in the DNS cache GC which asserts that the connections are alive if the timestamp from (a) is older than the timestamp for (b). (a) is presumably restored on startup from a previous run, whereas (b) will remain zero during initial CT GC scan, so all entries are considered "alive" until the second CT GC run, when the timestamps will be refreshed. So it doesn't look like we will prematurely delete DNS entries on startup due to this bug.

$ git grep -C 1 lastCTGCUpdate
pkg/fqdn/cache.go-      // DNSZombieMappings.MarkAlive.
pkg/fqdn/cache.go:      // When AliveAt is later than DNSZombieMappings.lastCTGCUpdate the zombie is
pkg/fqdn/cache.go-      // considered alive.
--
pkg/fqdn/cache.go-      // multiple times.
pkg/fqdn/cache.go:      // When DNSZombieMappings.lastCTGCUpdate is earlier than DeletePendingAt a
pkg/fqdn/cache.go-      // zombie is alive.
--
pkg/fqdn/cache.go-      deletes        map[netip.Addr]*DNSZombieMapping
pkg/fqdn/cache.go:      lastCTGCUpdate time.Time
pkg/fqdn/cache.go-      nextCTGCUpdate time.Time // estimated
--
pkg/fqdn/cache.go-// Zombie is considered dead if all of these conditions apply:
pkg/fqdn/cache.go:// 1. CT GC has run after the DNS Expiry time and grace period (lastCTGCUpdate > DeletePendingAt + GracePeriod), and
pkg/fqdn/cache.go:// 2. The CT GC run did not mark the Zombie alive (lastCTGCUpdate > AliveAt)
pkg/fqdn/cache.go-// 3. CT GC has run at least 2 times since Zombie was entered
--
pkg/fqdn/cache.go-func (zombies *DNSZombieMappings) isConnectionAlive(zombie *DNSZombieMapping) bool {
pkg/fqdn/cache.go:      if !zombies.lastCTGCUpdate.After(zombie.DeletePendingAt.Add(option.Config.ToFQDNsIdleConnectionGracePeriod)) {
pkg/fqdn/cache.go-              return true
pkg/fqdn/cache.go-      }
pkg/fqdn/cache.go:      if !zombies.lastCTGCUpdate.After(zombie.AliveAt) {
pkg/fqdn/cache.go-              return true
--
pkg/fqdn/cache.go-      zombies.Lock()
pkg/fqdn/cache.go:      zombies.lastCTGCUpdate = ctGCStart
pkg/fqdn/cache.go-      zombies.nextCTGCUpdate = estNext

So then the main impact I can identify at this point is just that the garbage collection that we expect to happen within the first ~30s of Cilium startup will be deferred to the first ~10m of Cilium startup. Now I put it that way, the logic seems inconsistent with the intent of the code but it's not that severe of a bug.

@joestringer joestringer removed the needs-backport/1.15 This PR / issue needs backporting to the v1.15 branch label Apr 17, 2024
@joestringer joestringer changed the base branch from main to pr/joe/improve-nat46x64-debugging April 18, 2024 21:54
@joestringer joestringer added the dont-merge/blocked Another PR must be merged before this one. label Apr 18, 2024
@joestringer
Copy link
Member Author

/ci-l4lb

@joestringer joestringer marked this pull request as ready for review April 22, 2024 16:26
@joestringer joestringer requested a review from a team as a code owner April 22, 2024 16:26
@joestringer joestringer force-pushed the pr/joe/improve-nat46x64-debugging branch from 39c0a25 to e828af6 Compare April 22, 2024 21:34
@joestringer joestringer requested review from a team as code owners April 22, 2024 21:34
@joestringer joestringer requested review from aspsk, nebril and viktor-kurchenko and removed request for a team April 22, 2024 21:34
Copy link
Member

@nebril nebril left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

@joestringer joestringer force-pushed the pr/joe/improve-nat46x64-debugging branch from e828af6 to 584a1b6 Compare April 25, 2024 03:06
Base automatically changed from pr/joe/improve-nat46x64-debugging to main April 25, 2024 07:57
@aspsk
Copy link
Contributor

aspsk commented Apr 25, 2024

Could you please rebase?

The reverse call tree for RestoreEndpoint, which exposes all restored
endpoints in the EndpointManager, is as follows:

INCOMING CALLS
- f RestoreEndpoint github.com/cilium/cilium/pkg/endpointmanager
  - f regenerateRestoredEndpoints github.com/cilium/cilium/pkg/endpointmanager
    - f initRestore github.com/cilium/cilium/daemon/cmd
      + f startDaemon github.com/cilium/cilium/daemon/cmd

Previously, the `CTNATMapGC.Enable()` call, which invokes
`gc.endpointsManager.GetEndpoints()`, would be called prior to exposing
these endpoints in the EndpointManager. As a result, the step where the
initial scan attempts to update each Endpoint's DNSHistory with the
latest CT GC timers would fail, leaving the timestamps empty.

The potential impact of this is that DNS entries that should expire soon
after a cilium-agent restart may not time out for an extra entire
conntrack garbage collection interval several minutes later.

Signed-off-by: Joe Stringer <joe@cilium.io>
@joestringer
Copy link
Member Author

/test

@joestringer joestringer removed the dont-merge/blocked Another PR must be merged before this one. label Apr 25, 2024
@joestringer joestringer added this pull request to the merge queue Apr 30, 2024
Merged via the queue into main with commit 6b0d76a Apr 30, 2024
261 of 265 checks passed
@joestringer joestringer deleted the pr/joe/ct-gc-after-restore branch April 30, 2024 01:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-note/bug This PR fixes an issue in a previous release of Cilium.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants