[FlowAggregator] Add "proxy" mode #6773

Open
antoninbas opened this issue Oct 25, 2024 · 0 comments
Labels
area/flow-visibility/aggregator Issues or PRs related to Flow Aggregator area/flow-visibility Issues or PRs related to flow visibility support in Antrea kind/design Categorizes issue or PR as related to design.


antoninbas commented Oct 25, 2024

Describe what you are trying to solve
The FlowAggregator collects records from all Agents / Nodes (FlowExporters), aggregates the records by correlating records from the source Node and the destination Node and by adding some missing information (e.g., Pod labels), and finally exports the aggregated data to the configured "destination". The destination can, for example, be a ClickHouse database, or an "external" IPFIX collector.
In a large cluster with many connections, we have observed that the FlowAggregator can be quite memory intensive, because it needs to keep a list of all ongoing connections.

When the destination is itself an IPFIX collector that supports aggregation / correlation, performing flow aggregation in Antrea seems redundant.

Consider the following topology:

[Figure: IPFIX topology diagram]

K8s Nodes are deployed as VMs. The colored arrows show the flow of IPFIX data. IPFIX data is exported by Antrea via the FlowAggregator (Pod-to-Pod flows), by the hypervisor (VM-to-VM flows), assuming it supports IPFIX export, and potentially even by underlay network devices (physical routers). All of this data goes to one IPFIX collector, and ideally this collector is smart enough to correlate all the records corresponding to the same end-to-end flow between Pod X and Pod Y, showing the full network path for the traffic belonging to the flow. I believe that in such a scenario, with an IPFIX collector capable of providing advanced network visualization, it is unnecessary to perform correlation / aggregation in the FlowAggregator. Instead, IPFIX records generated by Antrea should be treated no differently from those coming from the hypervisor (assuming hypervisors connect to the IPFIX collector directly, without aggregation) or the underlay. Each Antrea vSwitch is just a normal network hop (one at the source Node and one at the destination Node in the case of Pod-to-Pod traffic).

Describe the solution you have in mind

I would like to propose a "proxy" mode for the FlowAggregator. In this mode, the FlowAggregator would operate in a stateless fashion: IPFIX records received from each Agent are directly forwarded to the destination IPFIX collector, after adding some new Information Elements (IEs). In proxy mode, the FlowAggregator can still add missing information such as Pod labels, but does not perform flow correlation, which means it no longer needs to hold connection information in memory.
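To make the stateless behavior concrete, here is a minimal sketch of what handling a single record could look like in proxy mode. The `Record` type, the `LabelLookup` function type and the `proxyRecord` helper are hypothetical names used for illustration only; they are not the actual FlowAggregator types, and the label IE name is just an example.

```go
package main

// Record is a hypothetical, simplified stand-in for a decoded IPFIX data
// record. The real FlowAggregator works with richer types.
type Record struct {
	SourcePodNamespace string
	SourcePodName      string
	Fields             map[string]string // IE name -> value, simplified
}

// LabelLookup is a hypothetical function type: it returns the serialized
// labels of a Pod, and false if the Pod is not known.
type LabelLookup func(namespace, name string) (string, bool)

// proxyRecord enriches one record and forwards it right away. It keeps no
// per-connection state between calls, which is the point of proxy mode.
func proxyRecord(r Record, lookup LabelLookup, send func(Record) error) error {
	out := Record{
		SourcePodNamespace: r.SourcePodNamespace,
		SourcePodName:      r.SourcePodName,
		Fields:             make(map[string]string, len(r.Fields)+1),
	}
	// Copy the fields so that the incoming record is left untouched.
	for k, v := range r.Fields {
		out.Fields[k] = v
	}
	// Add the missing information (here: Pod labels) when it is available.
	if labels, ok := lookup(r.SourcePodNamespace, r.SourcePodName); ok {
		out.Fields["sourcePodLabels"] = labels // illustrative IE name
	}
	return send(out)
}
```

Because each call only depends on its input record and on the lookup function, memory usage no longer grows with the number of ongoing connections.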

If we use the unofficial terminology from https://www.ietf.org/proceedings/69/slides/ipfix-9.pdf, the FlowAggregator in proxy mode is an IPFIX distributor performing IPFIX protocol mediation.

Describe how your solution impacts user flows
New configuration parameters will be introduced for the FlowAggregator, so that users can enable proxy mode and provide configuration values that are specific to proxy mode and do not apply to the default "aggregation" mode. The existing "aggregation" mode will remain the default, so an existing FlowAggregator configuration will produce exactly the same behavior after support for proxy mode is introduced.

Note that proxy mode will only be available for IPFIX export. Exporting flows to ClickHouse or S3 will not be supported in proxy mode.
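The constraints above (aggregation stays the default, proxy mode only works with IPFIX export) could be enforced at configuration validation time. The sketch below is only an assumption about how that might look: the `Config` struct, its field names, and the mode values are hypothetical and do not reflect the actual FlowAggregator configuration schema.

```go
package main

import (
	"errors"
	"fmt"
)

// Mode is a hypothetical configuration value selecting the operating mode.
type Mode string

const (
	ModeAggregate Mode = "Aggregate" // hypothetical name for the default mode
	ModeProxy     Mode = "Proxy"     // hypothetical name for proxy mode
)

// Config is a simplified, hypothetical view of the FlowAggregator config.
type Config struct {
	Mode                 Mode
	IPFIXExporterEnabled bool
	ClickHouseEnabled    bool
	S3Enabled            bool
}

// effectiveMode returns the configured mode, defaulting to aggregation so
// that existing configurations keep their current behavior.
func (c Config) effectiveMode() Mode {
	if c.Mode == "" {
		return ModeAggregate
	}
	return c.Mode
}

// validate rejects configurations that combine proxy mode with a
// destination other than an IPFIX collector.
func validate(c Config) error {
	switch c.effectiveMode() {
	case ModeAggregate:
		return nil
	case ModeProxy:
		if c.ClickHouseEnabled || c.S3Enabled {
			return errors.New("proxy mode only supports IPFIX export")
		}
		if !c.IPFIXExporterEnabled {
			return errors.New("proxy mode requires the IPFIX exporter")
		}
		return nil
	default:
		return fmt.Errorf("unknown mode %q", c.Mode)
	}
}
```

Failing fast at startup is one possible design choice; another would be to ignore unsupported destinations with a warning.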

In theory, proxy mode makes it possible to run multiple replicas of the FlowAggregator, in order to scale horizontally.

Describe the main design/architecture of your solution
When proxy mode is enabled, the aggregationProcess will not be used:

aggregationProcess ipfix.IPFIXAggregationProcess

Instead we will read records from the collectingProcess and send them to the IPFIX exporter directly, after adding some extra IEs. At first we plan on adding the following standard IEs: originalObservationDomainId, originalExporterIPv4Address, originalExporterIPv6Address. We will also add Pod labels when requested via the FlowAggregator config.
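As an illustration of how the three IEs relate to each other, the sketch below derives them from the address of the original exporter (the Agent) and its observation domain ID. Only one of the two address IEs is set, depending on the address family. The `originalExporterFields` helper and the string-based field map are hypothetical simplifications, not the encoding used by the actual exporter.

```go
package main

import (
	"fmt"
	"net/netip"
	"strconv"
)

// originalExporterFields is a hypothetical helper that returns the extra IEs
// describing the original exporter, keyed by IE name. Values are kept as
// strings here for simplicity; a real exporter would encode them as IPFIX
// fields with the proper data types.
func originalExporterFields(exporterAddr string, obsDomainID uint32) (map[string]string, error) {
	addr, err := netip.ParseAddr(exporterAddr)
	if err != nil {
		return nil, fmt.Errorf("invalid exporter address %q: %w", exporterAddr, err)
	}
	// Treat IPv4-mapped IPv6 addresses (::ffff:a.b.c.d) as plain IPv4.
	addr = addr.Unmap()

	fields := map[string]string{
		"originalObservationDomainId": strconv.FormatUint(uint64(obsDomainID), 10),
	}
	if addr.Is4() {
		fields["originalExporterIPv4Address"] = addr.String()
	} else {
		fields["originalExporterIPv6Address"] = addr.String()
	}
	return fields, nil
}
```

For example, `originalExporterFields("10.0.0.1", 7)` yields the observation domain ID plus the IPv4 address IE only, with no IPv6 address IE.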

A lot of the code can be shared, but we will need one flowExportLoop for each mode:

// flowExportLoop is the main loop for the FlowAggregator. It runs in a single
// goroutine. All calls to exporter.Interface methods happen within this
// function, hence preventing any concurrency issue as the exporter.Interface
// implementations are not safe for concurrent access.
func (fa *flowAggregator) flowExportLoop(stopCh <-chan struct{}) {

Alternative solutions that you considered
An alternative approach would be to avoid the FlowAggregator altogether, and have all Agents connect directly to the final IPFIX collector. However, I think it makes sense to keep the FlowAggregator:

  1. More symmetry with the "aggregation" mode
  2. Simpler configuration and debugging; only the FlowAggregator needs to connect to the external IPFIX collector
  3. External IPFIX collector has fewer connections / sessions to manage
  4. Ability to add information in a centralized place (even though in theory, extra information such as Pod labels could be added in the Agent)
  5. We keep the door open for changing the interface used by the Agent to export IPFIX records

Test plan
Unit tests and at least one e2e test to validate proxy mode.

Additional context

@antoninbas antoninbas added kind/design Categorizes issue or PR as related to design. area/flow-visibility Issues or PRs related to flow visibility support in Antrea area/flow-visibility/aggregator Issues or PRs related to Flow Aggregator labels Oct 25, 2024
@antoninbas antoninbas self-assigned this Oct 25, 2024
@antoninbas antoninbas added this to the Antrea v2.3 release milestone Nov 5, 2024