Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ipcache: Introduce the ability to inherit CIDR prefixes #32578

Merged
merged 9 commits into from
May 29, 2024

Conversation

gandro
Copy link
Member

@gandro gandro commented May 16, 2024

This PR introduces a new feature in the IPCache metadata subsystem:
The ability for prefixes with local identities to inherit CIDR labels
from encompassing prefixes. This is needed to implement
cilium/design-cfps#17 where we want to introduce a new way to manage
identities for IPs allowlisted by DNS lookups. Without going into too
much detail, the key point of that CFP is that IPs returned by DNS
lookups will be associated with a fqdn: label.

Those IPs returned by a DNS lookup however might also be selected by a
CIDR prefix, and thus might also optionally need a cidr: label if (but
only if) a selecting CIDR policy is present.

To give an example: If we perform a DNS lookup on one.one.one.one on a
pod with DNS redirects enabled, we may observe IP 1.1.1.1 and thus
associated that IP with a fqdn: label. If at the same time, we also
have a separate CIDR policy allowing traffic to 1.0.0.0/8, then that
IP also needs the cidr:1.0.0.0/8 label, to make it is selected by both
the FQDN and the CIDR policy.

To implement this, the IPCache metadata subsystem is slightly augmented:

  1. Whenever metadata for a prefix is upserted or removed, we no longer
    enqueue only the prefix it self, but also all potentially affected
    prefixes or IPs contained by it. This allows InjectLabels to then
    down-propagate the changes to any child prefixes.
  2. In resolveIdentity (called by InjectLabels), we check if there
    are any parent prefixes with a CIDR label that we want to inherit.

At the moment, only local identities without a reserved: label may
inherit a CIDR label from a parent prefix. This currently therefore only
applies to identities containing fqdn: labels. Going forward, this
infrastructure might however also be useful to implement non-K8s label
sources.

@gandro gandro added sig/policy Impacts whether traffic is allowed or denied based on user-defined policies. release-note/misc This PR makes changes that have no direct user impact. labels May 16, 2024
@gandro gandro force-pushed the pr/gandro/ipcache-inherit-cidr-labels branch 2 times, most recently from a8d8300 to 53185d7 Compare May 16, 2024 14:59
@gandro
Copy link
Member Author

gandro commented May 16, 2024

/test

!lbls.Has(labels.LabelIngress[labels.IDNameIngress]) {
!lbls.Has(labels.LabelIngress[labels.IDNameIngress]) &&
!lbls.HasSource(labels.LabelSourceFQDN) &&
!lbls.HasSource(labels.LabelSourceCIDR) {
cidrLabels := labels.GetCIDRLabels(prefix)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note: This case should not be hit anymore, even before this PR. But we still have unit tests relying on it. To avoid adding even more commits to this PR, I decided to not emit a "should not be reached" warning log message here and leave this for a follow-up PR.

@gandro gandro requested review from squeed and removed request for squeed May 16, 2024 15:38
@gandro gandro marked this pull request as ready for review May 16, 2024 15:38
@gandro gandro requested review from a team as code owners May 16, 2024 15:38
pkg/ipcache/metadata.go Show resolved Hide resolved
pkg/ipcache/metadata.go Outdated Show resolved Hide resolved
pkg/ipcache/metadata.go Outdated Show resolved Hide resolved
Copy link
Member

@nathanjsweet nathanjsweet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool!

@gandro gandro force-pushed the pr/gandro/ipcache-inherit-cidr-labels branch from 53185d7 to 9574178 Compare May 21, 2024 07:59
@gandro
Copy link
Member Author

gandro commented May 21, 2024

Dropped the last ipcache: Only update policy maps for new identities commit as requested. It's now its own PR #32628

@gandro gandro requested a review from christarazi May 21, 2024 07:59
@gandro
Copy link
Member Author

gandro commented May 21, 2024

/test

Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate on

A note on the implementation: To find affected child prefixes, we
currently perform a linear scan over the metadata map. While this is a
O(n) operation, finding CIDR parents in a trie could be done in O(log n), but at the moment we expect n to be in hundreds to the low
thousands at most. For most real world cases, we therefore expect the
current implementation to perform just as well if not better than a
trie. Also note that the linear scan is skipped if the prefix is a
single IP and thus cannot contain any other prefixes besides itself.

What is part of n where we expect it to be in the hundreds or low thousands? What if we add pod IPs into this as #31409 would do?

@gandro
Copy link
Member Author

gandro commented May 23, 2024

What is part of n where we expect it to be in the hundreds or low thousands? What if we add pod IPs into this as #31409 would do?

So that sentence was written with the status quo, where only CIDR prefixes (per node), node IPs (global) and DNS lookups (per node) are added to metadata. Of course, once we add pod IPs (global) to also use the async API, the order of magnitude changes and the statement might no longer be true.

It's hard to say at what point an allocation-heavy but asymptotically more optimal data-structure like a trie outperforms a naive linear scan. Let me try to add some benchmarks to see how expensive the current approach is.

This helper allows to check if a label set contains a label from a
particular source.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This label source will superseed CIDR labels for identities generated by
ToFQDN rules. Like CIDR labels, FQDN labels are node local identities.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit removes the direct access to `PrefixInfo` as it is not
needed for the current caller in the node manager (which only requires
the source). A subsequent commit will introduce additional labels to a
prefix not found in the `PrefixInfo`. Therefore, to avoid callers
observing incomplete information, we hide the direct access.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
There are no longer is access outside of the IPCache package.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
A IP prefix may have multiple representations. For example
`192.168.0.0/24` can also be represented as `192.168.50.60/24` (a IPv4
prefix using a non-masked address) or `::ffff:192.168.0.0/24` (a
IPv4-mapped IPv6 address which will use the IPv4 stack on the host).
Those prefixes might look different, but they all represent exactly the
same destination. To avoid issues where callers might provide metadata
for a prefix in non-canonical form, this PR introduces a helper which
translates every IP used for upserts, deletions and lookups into the
canonical form before the metadata map is accessed.

This commit does not fix any known bug and merely introduces this in the
spirit of defensive programming.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
The comment was written is no longer valid, as CIDR policies now also
funnel their prefixes through the IPCache metadata subsystem.

However, the logic needs to remain, as other callers (e.g. the node
manager) now can inject multiple prefixes that share the same identity.
The code comment is updated to reflect that.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit introduces a new feature in the IPCache metadata subsystem:
The ability for prefixes with local identities to inherit CIDR labels
from encompassing prefixes. This is needed to implement
cilium/design-cfps#17 where we want to introduce a new way to manage
identities for IPs allowlisted by DNS lookups. Without going into too
much detail, the key point of that CFP is that IPs returned by DNS
lookups will be associated with a `fqdn:` label.

Those IPs returned by a DNS lookup however might also be selected by a
CIDR prefix, and thus might also optionally need a `cidr:` label if (but
only if) a selecting CIDR policy is present.

To give an example: If we perform a DNS lookup on `one.one.one.one` on a
pod with DNS redirects enabled, we may observe IP `1.1.1.1` and thus
associated that IP with a `fqdn:` label. If at the same time, we also
have a separate CIDR policy allowing traffic to `1.0.0.0/8`, then that
IP also needs the `cidr:1.0.0.0/8` label, to make it is selected by both
the FQDN and the CIDR policy.

To implement this, the IPCache metadata subsystem is slightly augmented:

1. Whenever metadata for a prefix is upserted or removed, we no longer
   enqueue only the prefix it self, but also all potentially affected
   prefixes or IPs contained by it. This allows `InjectLabels` to then
   down-propagate the changes to any child prefixes.
2. In `resolveIdentity` (called by `InjectLabels`), we check if there
   are any parent prefixes with a CIDR label that we want to inherit.

At the moment, only local identities without a `reserved:` label may
inherit a CIDR label from a parent prefix. This currently therefore only
applies to identities containing `fqdn:` labels. Going forward, this
infrastructure might however also be useful to implement non-K8s label
sources.

A note on the implementation: To find affected child prefixes, we
currently perform a linear scan over the metadata map. While this is a
`O(n)` operation, finding CIDR parents in a trie could be done in `O(log
n)`, but at the moment we expect `n` to be in hundreds to the low
thousands at most. For most real world cases, we therefore expect the
current implementation to perform just as well if not better than a
trie. Also note that the linear scan is skipped if the prefix is a
single IP and thus cannot contain any other prefixes besides itself.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/ipcache-inherit-cidr-labels branch from 9574178 to 8a8c0b0 Compare May 27, 2024 15:50
This adds a benchmark testing the overhead of using a naive linear scan
over a Go map vs using a bitlpm trie.

The benchmark harness generates a number of child (leaf) prefixes, plus
a number of parent (non-leaf) prefixes.  To simulate somewhat realistic
behavior, we don't allocate IPs randomly across the whole IP space, but
rather from a contiguous block of IPs with a given sparseness factor
(i.e. a factor of 1.0 will allocate every single IP out of the block, a
factor of 0.1 will allocate only 10% of the available IP space).

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
@gandro gandro force-pushed the pr/gandro/ipcache-inherit-cidr-labels branch from 8a8c0b0 to e6b3759 Compare May 27, 2024 15:53
@gandro
Copy link
Member Author

gandro commented May 27, 2024

I've pushed the code for a benchmark comparing the findAffectedChildPrefixes function when implemented on a Golang map vs a bitlpm trie. There are of course a lot of variables that can impact the trie behavior (how are the prefixes distributed etc), but for the Golang map the algorithm is pretty much agnostic to those changes, and the Golang hash map behavior which is what we are mainly interested in here. In other words, the goal of the benchmark is not to test the behavior of our trie implementation, but mostly figure out "how bad" the naive linear scan really is.

Here are the average times it takes one invocation of findAffectedChildPrefixes - i.e. the cost of upserting a non-leaf prefix, i.e. the cost of ingesting a policy containing a ToCIDR rule (upserts for single IPs do not trigger findAffectedChildPrefixes):

IPs/CIDRs LPM Trie Hash Map
1k/10 0.003335 ms 0.029799 ms
10k/100 0.004634 ms 0.25511 ms
100k/1k 0.009221 ms 2.028705 ms
1M/10k 0.015925 ms 26.684717 ms

It's pretty clear that the the trie is O(log n) where as the map is O(n).

In terms of of allocations performed, trie also seems roughly O(log n), where as the map is actually O(1):

IPs/CIDRs LPM Trie Hash Map
1k/10 26 allocs/op 6 allocs/op
10k/100 36 allocs/op 6 allocs/op
100k/1k 47 allocs/op 6 allocs/op
1M/10k 56 allocs/op 6 allocs/op

As is, I would say the 10k/100 run is about the expected order of magnitude we can expect in a large-ish cluster (5000 nodes, as pods are excluded in this) today. 100k/1k is probably what we can expect for a large-ish cluster after #31409 (i.e. 5k nodes, 90k pods). The 1 million entries can be treated as a theoretical worst case today (it would require ~500k nodes). Also note IPCache currently supports up to 512k entries by default.

In my opinion, the status quo is acceptable. Once #31409 lands, I think the case for the trie becomes much more interesting (though the increased memory pressure needs to be investigated).

And keep in mind, this cost only occurs whenever a CIDR policy is changed (and /32 CIDRs do not trigger this). It does not affect upserts from the FQDN subsystem, does not affect upserts from the node manager and does not upserts from CEPs once #31409 lands.

Given these results, I would prefer to not block this PR on the overall worse performance and defer the trie backend as a performance optimization to later. But please let me know what you think!

@gandro
Copy link
Member Author

gandro commented May 27, 2024

/test

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 28, 2024
@gandro gandro removed the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 28, 2024
Copy link
Member

@christarazi christarazi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding the benchmark! I agree with not blocking the PR. I was trying to understand just how much the impact was in case we faced issues with users and how much work it would be to move towards the trie as the mitigation.

@maintainer-s-little-helper maintainer-s-little-helper bot added the ready-to-merge This PR has passed all tests and received consensus from code owners to merge. label May 28, 2024
@gandro gandro merged commit 8e3dec8 into cilium:main May 29, 2024
63 of 64 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready-to-merge This PR has passed all tests and received consensus from code owners to merge. release-note/misc This PR makes changes that have no direct user impact. sig/policy Impacts whether traffic is allowed or denied based on user-defined policies.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants