Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(outputs): expose outputs_queue config for memory control #2711

Merged
merged 10 commits into from
Sep 7, 2023

Conversation

incertum
Copy link
Contributor

@incertum incertum commented Aug 1, 2023

What type of PR is this?

Uncomment one (or more) /kind <> lines:

/kind bug

/kind cleanup

/kind design

/kind documentation

/kind failing-test

/kind feature

/kind release

Any specific area of the project related to this PR?

Uncomment one (or more) /area <> lines:

/area build

/area engine

/area tests

/area proposals

/area CI

What this PR does / why we need it:

The tbb output queue is currently unbounded and is a very likely the cause for memory leak issues in production when the pipe is not holding up, see some simulated results #2495 (comment).

We discussed that exposing new configs is best practices offering more options for adopters to control consumption and limits beyond the external cgroups limits. It's also in line with the kernel buffer being bounded and its size exposed as config.

After all no one wants SRE s or system admins calling out Falco may be leaking memory.

There could still be other issues. Therefore, the best next step would be to merge an acceptable solution and then test deploy and see and then go from there.

@leogr @jasondellaluce

Which issue(s) this PR fixes:

#2495 (comment)

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

fix(outputs): expose queue_capacity_outputs config for memory control

falco.yaml Outdated
# recovery: 1 means simply exit (default behavior).
# recovery: 2 means empty the queue and then continue.
queue_capacity_outputs:
items: 0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@leogr as per our chat preserving the old default (unbounded and exiting). Obviously we can change the encoding etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Furthermore, @leogr I would like to better understand the interrelationships with for instance output_timeout, buffered_outputs and outputs.rate and outputs.max_burst. Their descriptions are also still more cryptic especially for the outputs one. aka I am looking for a scaling factor recommendation aka if you change let's

outputs:
  rate: 100000
  max_burst: 100000000

what capacity of the queue should you choose in relation?

Maybe the timeout default is not right? we block for too long? How was the current default derived?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also reading this raises some question marks ... because not having a recovery strategy as well here appears suboptimal:

# It's important to note that Falco outputs will not be discarded from the
# output queue. This means that if an output channel becomes blocked
# indefinitely, it indicates a potential issue that needs to be addressed by the
# user.
output_timeout: 2000

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Furthermore, @leogr I would like to better understand the interrelationships with for instance output_timeout, buffered_outputs and outputs.rate and outputs.max_burst.

Let me explain them one by one in a logical order (from upstream to downstream).

  • (step 0: main thread) the rule engine process incoming events

  • (step 1: main thread) outputs.rate and outputs.max_burst configure simple rate limiter. It is applied immediately and after a rule's condition is met and immediately before pushing the message in the queue:

    // As the inspector has no filter at its level, all
    // events are returned here. Pass them to the falco
    // engine, which will match the event against the set
    // of rules. If a match is found, pass the event to
    // the outputs.
    std::unique_ptr<falco_engine::rule_result> res = s.engine->process_event(source_engine_idx, ev);
    if(res)
    {
    if (!rate_limiter_enabled || rate_limiter.claim())
    {
    s.outputs->handle_event(res->evt, res->rule, res->source, res->priority_num, res->format, res->tags);
    }
    else
    {
    falco_logger::log(LOG_DEBUG, "Skipping rate-limited notification for rule " + res->rule + "\n");
    }
    }

    It was a legacy way to protect the outputs system from being flooded by a peak of messages. However, this mechanism should be discouraged since the rate limiter discards those events, so one can potentially lose critical alerts. For this reason, it's not enabled by default.

  • (step 2) here's the queue

  • (step 3a: in another thread) A worker pops messages from the queue and fans out them to each (enabled) output channel implementation.

  • (step 3b: in another thread) A watchdog checks that the output channel implementation is not blocking the worker (and thus the whole queue) for longer than the value configured in output_timeout. When the timeout is reached, a log message is triggered as a last resort.

  • (step 4: in the same thread of 3a) each enabled output channel implementation receives the message and outputs it

    • Among those output implementations, some support the buffered_outputs option (for example, the file output and stdout output. The actual meaning of buffered may be slightly different depending on the specific output, and not all outputs support it. Anyway, it generally refers to enabling (or not) the buffer on the underlying byte stream on the output channel implementation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the output_timeout is so important?

When output consumer blocks (ie. imagine the program output or any I/O related op that happens downstream to Falco) and the alert can be delivered within the given deadline, Falco uses stderr to notify the user of a not addressable problem.

Why stderr ? Because output channels are blocked.

Why the problem is not addressable by Falco? Because such a situation indicates an I/O problem downstream to Falco (for example, a network slowdown) or, more rarely, a misconfiguration (for instance, imagine configuring the program output with a program that blocks for more than 2 seconds, you can simulate this using sleep 10).

So output_timeout is just an emergency mechanism to notify the user that something (outside Falco) is blocking Falco and Falco is giving up because it can't handle the problem. It has no direct with Falco's performance or with its behavior during normal operations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you Leo for the good explanation!
What about deprecating the rate_limiter (just put a deprecation warning in Falco and a [DEPRACATED] label in the config) in this release 0.36 and then remove it in 0.37? WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 on this @Andreagit97:

Don't wanna come across as always being in favor of just getting rid of things, but at the same time, we do need to improve technical clarity and clean things up. Referring to it as "technical debt" without any negative connotation -- it's normal in every project.

@Andreagit97 would you be willing to run point on this? Asking because you know I will be out most of Sep and will limit code contributions now for this release, but happy to help as reviewer!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion on the rate limiter. I wonder if it can still be helpful with some use cases or not.
I vaguely recall I had a conversation with @jasondellaluce a while ago.

Anyway, I'm okay with removing it if we get a consensus. In such a case, I recommend putting a deprecation warning for at least 1 release before completely removing it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I am just confused that we then silently drop events without even counting these drops. Is there another way to do it? If we keep it can we at least add drop counters in?

I guess it comes back to Falco being a bit of a different type of a service, where the end goal would be to not produce too many alerts and try to never drop an event.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I will open an issue to deprecate it and see if we reach consensus

falco.yaml Outdated
# recovery: 1 means simply exit (default behavior).
# recovery: 2 means empty the queue and then continue.
queue_capacity_outputs:
items: 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Furthermore, @leogr I would like to better understand the interrelationships with for instance output_timeout, buffered_outputs and outputs.rate and outputs.max_burst.

Let me explain them one by one in a logical order (from upstream to downstream).

  • (step 0: main thread) the rule engine process incoming events

  • (step 1: main thread) outputs.rate and outputs.max_burst configure simple rate limiter. It is applied immediately and after a rule's condition is met and immediately before pushing the message in the queue:

    // As the inspector has no filter at its level, all
    // events are returned here. Pass them to the falco
    // engine, which will match the event against the set
    // of rules. If a match is found, pass the event to
    // the outputs.
    std::unique_ptr<falco_engine::rule_result> res = s.engine->process_event(source_engine_idx, ev);
    if(res)
    {
    if (!rate_limiter_enabled || rate_limiter.claim())
    {
    s.outputs->handle_event(res->evt, res->rule, res->source, res->priority_num, res->format, res->tags);
    }
    else
    {
    falco_logger::log(LOG_DEBUG, "Skipping rate-limited notification for rule " + res->rule + "\n");
    }
    }

    It was a legacy way to protect the outputs system from being flooded by a peak of messages. However, this mechanism should be discouraged since the rate limiter discards those events, so one can potentially lose critical alerts. For this reason, it's not enabled by default.

  • (step 2) here's the queue

  • (step 3a: in another thread) A worker pops messages from the queue and fans out them to each (enabled) output channel implementation.

  • (step 3b: in another thread) A watchdog checks that the output channel implementation is not blocking the worker (and thus the whole queue) for longer than the value configured in output_timeout. When the timeout is reached, a log message is triggered as a last resort.

  • (step 4: in the same thread of 3a) each enabled output channel implementation receives the message and outputs it

    • Among those output implementations, some support the buffered_outputs option (for example, the file output and stdout output. The actual meaning of buffered may be slightly different depending on the specific output, and not all outputs support it. Anyway, it generally refers to enabling (or not) the buffer on the underlying byte stream on the output channel implementation.

falco.yaml Outdated
# recovery: 1 means simply exit (default behavior).
# recovery: 2 means empty the queue and then continue.
queue_capacity_outputs:
items: 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the output_timeout is so important?

When output consumer blocks (ie. imagine the program output or any I/O related op that happens downstream to Falco) and the alert can be delivered within the given deadline, Falco uses stderr to notify the user of a not addressable problem.

Why stderr ? Because output channels are blocked.

Why the problem is not addressable by Falco? Because such a situation indicates an I/O problem downstream to Falco (for example, a network slowdown) or, more rarely, a misconfiguration (for instance, imagine configuring the program output with a program that blocks for more than 2 seconds, you can simulate this using sleep 10).

So output_timeout is just an emergency mechanism to notify the user that something (outside Falco) is blocking Falco and Falco is giving up because it can't handle the problem. It has no direct with Falco's performance or with its behavior during normal operations.

falco.yaml Outdated Show resolved Hide resolved
@github-actions
Copy link

github-actions bot commented Aug 4, 2023

This PR may bring feature or behavior changes in the Falco engine and may require the engine version to be bumped.

Please double check userspace/engine/falco_engine_version.h file. See versioning for FALCO_ENGINE_VERSION.

/hold

@leogr leogr added this to the 0.36.0 milestone Aug 11, 2023
@incertum incertum force-pushed the queue-capacity-outputs branch 2 times, most recently from 8dba4ee to 530d5d9 Compare August 23, 2023 21:18
userspace/engine/falco_common.cpp Outdated Show resolved Hide resolved
userspace/engine/falco_common.cpp Outdated Show resolved Hide resolved
userspace/engine/falco_common.h Outdated Show resolved Hide resolved
userspace/falco/app/actions/process_events.cpp Outdated Show resolved Hide resolved
falco.yaml Show resolved Hide resolved
falco.yaml Outdated Show resolved Hide resolved
userspace/falco/falco_outputs.cpp Outdated Show resolved Hide resolved
@@ -268,8 +276,21 @@ inline void falco_outputs::push(const ctrl_msg& cmsg)
{
if (!m_queue.try_push(cmsg))
{
fprintf(stderr, "Fatal error: Output queue reached maximum capacity. Exiting.\n");
exit(EXIT_FAILURE);
switch (m_recovery)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we are 100% thread safe here 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not, maybe @jasondellaluce once back may have additional ideas that we could also address in a follow up PR since Jason worked a lot on the thread safety aspects all around.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what would happen in multi-source scenarios?

userspace/falco/stats_writer.cpp Outdated Show resolved Hide resolved
userspace/falco/stats_writer.cpp Outdated Show resolved Hide resolved
Copy link
Member

@leogr leogr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just left a few comments.

Also, please keep in mind that this PR should be rebased, and we must make sure the changes introduced by #2663 are preserved (i.e. the WASM build support)

Finally, a question. Since we are dropping a message from the queue, what will happen when a control message is lost (for example, ctrl_msg_type::CTRL_MSG_STOP or
ctrl_msg_type::CTRL_MSG_REOPEN)?

I wonder if it can be a problem in some case 🤔

userspace/falco/falco_outputs.cpp Outdated Show resolved Hide resolved
userspace/falco/falco_outputs.cpp Outdated Show resolved Hide resolved
userspace/falco/falco_outputs.cpp Outdated Show resolved Hide resolved
@incertum
Copy link
Contributor Author

Also, please keep in mind that this PR should be rebased, and we must make sure the changes introduced by #2663 are preserved (i.e. the WASM build support)

Let me rebase very carefully and then implement each change request very carefully later on.

Also @Andreagit97 great suggestions all around!

Finally, a question. Since we are dropping a message from the queue, what will happen when a control message is lost (for example, ctrl_msg_type::CTRL_MSG_STOP or ctrl_msg_type::CTRL_MSG_REOPEN)?

I wonder if it can be a problem in some case 🤔

Good question, we need to check.

Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
@incertum
Copy link
Contributor Author

Hi 👋 kindly checking in on the status of this PR.

Eager to give my best to get this over the finish line, at the same time please note I have limited availability next week and after that I am out until after the release (something I have well communicated early on).

Rating this PR as important to be able to get to the bottom of memory leaks. Falco effectively can leak memory today in production as confirmed by multiple adopters (including myself).

Copy link
Member

@Andreagit97 Andreagit97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR LGTM! I still have an open point that I would like to understand unrelated to the PR,
why do we have 2 queues, one for outputs and another for metrics? @jasondellaluce @leogr

@@ -258,6 +260,18 @@ void falco_configuration::load_yaml(const std::string& config_name, const yaml_h
}

m_buffered_outputs = config.get_scalar<bool>("buffered_outputs", false);
m_outputs_queue_capacity = config.get_scalar<size_t>("outputs_queue.capacity", DEFAULT_OUTPUTS_QUEUE_CAPACITY_UNBOUNDED_MAX_LONG_VALUE);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use an int64_t instead of size_t? in this way, the DEFAULT_OUTPUTS_QUEUE_CAPACITY could be simply be -1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can, at the same time always preferring to stay close to the framework, here TBB, and they use size_t. Let me know if its a [nit] or if we significantly gain from this change, then I'll do it.

falco.yaml Outdated
# recovery: 1 means simply exit (default behavior).
# recovery: 2 means empty the queue and then continue.
queue_capacity_outputs:
items: 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah I will open an issue to deprecate it and see if we reach consensus

@Andreagit97
Copy link
Member

Finally, a question. Since we are dropping a message from the queue, what will happen when a control message is lost (for example, ctrl_msg_type::CTRL_MSG_STOP or
ctrl_msg_type::CTRL_MSG_REOPEN)?

This is a good question, actually, the worker thread won't stop because it doesn't receive the stop message... My real question is, should a user use Falco with alert drops? I mean, if you drop alerts it means that Falco is no longer useful, so yes I get the point of having a bounded queue but if the user faces some alert drops or there is a bug or he has to configure Falco with a great queue capacity 🤔

@Andreagit97
Copy link
Member

on a second thought, I'm asking myself if we really want the continue and empty policies 🤔

IMO what we need to achieve is to prevent the OOM killer from killing us and provide users with a graceful shutdown. If we drop alerts this is a symptom of a bug because we have a dedicated thread that just outputs logs all the time so it's almost impossible that it is not able to keep up with the main thread. It could happen if the tool is misconfigured (too noise, so rule misconfiguration, too outputs, so tool misconfiguration). In my opinion in this case we want to report the problem to the user and stop Falco instead of "hiding" the issue keeping up with drops. Moreover as pointed out by @leogr we could face issues if we drop CTRL messages... WDYT? @incertum @jasondellaluce @leogr

Please let's try to reach a consensus before Melissa go away in this way we can close this workstream 👍

@incertum
Copy link
Contributor Author

incertum commented Aug 30, 2023

Finally, a question. Since we are dropping a message from the queue, what will happen when a control message is lost (for example, ctrl_msg_type::CTRL_MSG_STOP or
ctrl_msg_type::CTRL_MSG_REOPEN)?

This is a good question, actually, the worker thread won't stop because it doesn't receive the stop message...

Currently there can also be issues with the control messages being possibly stuck a long time. Personally think the control messages should not be in the alerts output queue and rather directly be processes or on a separate hot path. @leogr and I chatted and concluded it's not a high priority. Would vouch for cleaning it up in a separate PR regardless.

My real question is, should a user use Falco with alert drops? I mean, if you drop alerts it means that Falco is no longer useful, so yes I get the point of having a bounded queue but if the user faces some alert drops or there is a bug or he has to configure Falco with a great queue capacity 🤔

I do agree that Falco shall not be used with random alert sampling / dropping or dropping based the queue filling up. Here is how I think about this initial change: It's a first step towards performing data-driven decisions (vs speculating) to see how much of an issue this can be in production. It helps determining if this could cause the memory leaks on some servers or it can show that this is not a problem and we have to look elsewhere. Based on some field experience we can come back and optimize this. For example, it would be interesting to see if there could be few kernel side drops, but because Falco is used for too verbose auditing / alerting we fail to keep up in userspace and such.

@incertum
Copy link
Contributor Author

incertum commented Aug 30, 2023

on a second thought, I'm asking myself if we really want the continue and empty policies 🤔

Let's experiment with it. Maybe empty can help handling short bursts vs restarting is a more undesired recovery. If it's not useful, we remove the recovery strategies again, since we tag this as "Experimental" we can do this easily.

IMO what we need to achieve is to prevent the OOM killer from killing us and provide users with a graceful shutdown. If we drop alerts this is a symptom of a bug because we have a dedicated thread that just outputs logs all the time so it's almost impossible that it is not able to keep up with the main thread. It could happen if the tool is misconfigured (too noise, so rule misconfiguration, too outputs, so tool misconfiguration). In my opinion in this case we want to report the problem to the user and stop Falco instead of "hiding" the issue keeping up with drops. Moreover as pointed out by @leogr we could face issues if we drop CTRL messages... WDYT? @incertum @jasondellaluce @leogr

Please let's try to reach a consensus before Melissa go away in this way we can close this workstream 👍

ACK, few more thoughts: I believe the new Falco metrics counters will be helpful to gain a better understanding about what could be going on on troublesome servers, hence we do count the new drops just like the kernel side drops. In addition, adding the capacity config is also good software engineering practice and effectively will make the kernel buffer and output queue equivalent in terms of end user exposed pre-defined bounds.

Re your last point what the end user should do after possibly finding out that the output queue is not working ... I would defer this discussion once we have some data.

@Andreagit97
Copy link
Member

Let's experiment with it. Maybe empty can help handling short bursts vs restarting is a more undesired recovery. If it's not useful, we remove the recovery strategies again, since we tag this as "Experimental" we can do this easily

Yep since they are experimental we could also introduce them for now but i would consider it again before promoting them as stable

@leogr
Copy link
Member

leogr commented Sep 5, 2023

Let's experiment with it. Maybe empty can help handling short bursts vs restarting is a more undesired recovery. If it's not useful, we remove the recovery strategies again, since we tag this as "Experimental" we can do this easily

Yep since they are experimental we could also introduce them for now but i would consider it again before promoting them as stable

I agree with experimenting with this. Also, if we merge this soon, we can include it in Falco 0.36.0 RC1 and give it a try. We will then have a couple of weeks to fine-tune it if needed.

leogr
leogr previously approved these changes Sep 5, 2023
@poiana poiana added the lgtm label Sep 5, 2023
@poiana
Copy link
Contributor

poiana commented Sep 5, 2023

LGTM label has been added.

Git tree hash: 6bc1e99a5c1e81dc9d56395894c2f5836fdc7162

Copy link
Contributor

@jasondellaluce jasondellaluce left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/hold

Found a thread safety hazard. Overall looks good!

switch (m_outputs_queue_recovery)
{
case falco_common::RECOVERY_EXIT:
falco_logger::log(LOG_ERR, "Fatal error: Output queue out of memory. Exiting ...");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not throwing an exception here? It would cause one open event source thread to be stopped (and in cascade, all the others open in parallel), and Falco to terminate gracefully.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good will do!

falco_logger::log(LOG_ERR, "Fatal error: Output queue out of memory. Exiting ...");
exit(EXIT_FAILURE);
case falco_common::RECOVERY_EMPTY:
m_outputs_queue_num_drops += m_queue.size();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stating the docs of falco_outputs:

All methods in this class are thread-safe. The output framework supports a multi-producer model where messages are stored in a queue and consumed by each configured output asynchrounously.

Since m_queue is a thread-safe object as guaranteed by TBB, I'd say m_outputs_queue_num_drops should be atomic here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also realized I think when counting the number of drops for the empty queue case I need to add +1 will do this later.

incertum and others added 2 commits September 5, 2023 20:58
Co-authored-by: Jason Dellaluce <jasondellaluce@gmail.com>
Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
…recovery 'empty'

Signed-off-by: Melissa Kilby <melissa.kilby.oss@gmail.com>
Copy link
Member

@Andreagit97 Andreagit97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

@poiana
Copy link
Contributor

poiana commented Sep 7, 2023

LGTM label has been added.

Git tree hash: 935425a0859a89c2261e84d27cc2b0106435b8ae

@poiana
Copy link
Contributor

poiana commented Sep 7, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Andreagit97, incertum, leogr

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [Andreagit97,incertum,leogr]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Andreagit97
Copy link
Member

/unhold

@poiana poiana merged commit 73f15e6 into falcosecurity:master Sep 7, 2023
17 checks passed
@incertum incertum deleted the queue-capacity-outputs branch March 5, 2024 22:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

5 participants