Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for active timeout #37834

Closed
wants to merge 15 commits into from

Conversation

kvalliyurnatt
Copy link
Contributor

@kvalliyurnatt kvalliyurnatt commented Feb 1, 2024

Proposed commit message

Add support for Active timeout. This would kill the flow when timeout is hit irrespective when the last traffic was seen on the flow.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Related issues

@kvalliyurnatt kvalliyurnatt requested review from yspotts, jaymode and a team February 1, 2024 21:54
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Feb 1, 2024
Copy link
Contributor

mergify bot commented Feb 1, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @kvalliyurnatt? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@kvalliyurnatt kvalliyurnatt marked this pull request as ready for review February 1, 2024 22:11
@kvalliyurnatt kvalliyurnatt requested a review from a team as a code owner February 1, 2024 22:11
@elasticmachine
Copy link
Collaborator

Pinging @elastic/security-external-integrations (Team:Security-External Integrations)

@elasticmachine
Copy link
Collaborator

❕ Build Aborted

There is a new build on-going so the previous on-going builds have been aborted.

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Start Time: 2024-02-01T22:10:55.508+0000

  • Duration: 7 min 8 sec

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Copy link
Collaborator

❕ Build Aborted

There is a new build on-going so the previous on-going builds have been aborted.

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Start Time: 2024-02-01T21:54:52.997+0000

  • Duration: 23 min 15 sec

Test stats 🧪

Test Results
Failed 2
Passed 814
Skipped 1
Total 817

Test errors 2

Expand to view the tests failures

Build&Test / packetbeat-rhel-9-rhel-9 / TestFlowsActiveTimeout – github.com/elastic/beats/v7/packetbeat/flows
    Expand to view the error details

     Failed 
    

    Expand to view the stacktrace

     === RUN   TestFlowsActiveTimeout
        flows_test.go:264: event: {"destination":{"float2":1.4142,"int2":-1,"ip":"128.0.1.2","mac":"06-05-04-03-02-01","port":512,"uint2":5},"event":{"action":"network_flow","category":["network"],"dataset":"flow","duration":7899,"end":"2024-02-01T22:13:36.340Z","kind":"event","start":"2024-02-01T22:13:36.340Z","type":["connection","end"]},"flow":{"final":true,"id":"EQQA////DP//////FP8BAAEBAgMEBQYGBQQDAgF/AAABgAABAgABAAI","kill_reason":"ActiveTimeout"},"network":{"bytes":0,"community_id":"1:lov+6OShb6bXMyK0gZjGX3RabtQ=","packets":0,"transport":"tcp","type":"ipv4"},"source":{"float1":3.14,"int1":-1,"ip":"127.0.0.1","mac":"01-02-03-04-05-06","port":256,"uint1":1},"type":"flow"}
        flows_test.go:297: 
            	Error Trace:	/var/lib/jenkins/workspace/PR-37834-1-bcaae585-13ca-42ce-a652-3e11eed81501/src/github.com/elastic/beats/packetbeat/flows/flows_test.go:297
            	Error:      	Not equal: 
            	            	expected: flows.flowKillReason(2)
            	            	actual  : string("ActiveTimeout")
            	Test:       	TestFlowsActiveTimeout
    --- FAIL: TestFlowsActiveTimeout (0.00s)
     
    

Build&Test / packetbeat-unitTest / TestFlowsActiveTimeout – github.com/elastic/beats/v7/packetbeat/flows
    Expand to view the error details

     Failed 
    

    Expand to view the stacktrace

     === RUN   TestFlowsActiveTimeout
        flows_test.go:264: event: {"destination":{"float2":1.4142,"int2":-1,"ip":"128.0.1.2","mac":"06-05-04-03-02-01","port":512,"uint2":5},"event":{"action":"network_flow","category":["network"],"dataset":"flow","duration":12710,"end":"2024-02-01T22:13:52.002Z","kind":"event","start":"2024-02-01T22:13:52.002Z","type":["connection","end"]},"flow":{"final":true,"id":"EQQA////DP//////FP8BAAEBAgMEBQYGBQQDAgF/AAABgAABAgABAAI","kill_reason":"ActiveTimeout"},"network":{"bytes":0,"community_id":"1:lov+6OShb6bXMyK0gZjGX3RabtQ=","packets":0,"transport":"tcp","type":"ipv4"},"source":{"float1":3.14,"int1":-1,"ip":"127.0.0.1","mac":"01-02-03-04-05-06","port":256,"uint1":1},"type":"flow"}
        flows_test.go:297: 
            	Error Trace:	/var/lib/jenkins/workspace/PR-37834-1-8abf3fab-ebfb-4294-8c61-1e01041894c9/src/github.com/elastic/beats/packetbeat/flows/flows_test.go:297
            	Error:      	Not equal: 
            	            	expected: flows.flowKillReason(2)
            	            	actual  : string("ActiveTimeout")
            	Test:       	TestFlowsActiveTimeout
    --- FAIL: TestFlowsActiveTimeout (0.00s)
     
    

Steps errors 7

Expand to view the steps failures

packetbeat-unitTest - mage build unitTest
  • Took 2 min 29 sec . View more details here
  • Description: mage build unitTest
packetbeat-unitTest - mage build unitTest
  • Took 0 min 24 sec . View more details here
  • Description: mage build unitTest
packetbeat-unitTest - mage build unitTest
  • Took 0 min 23 sec . View more details here
  • Description: mage build unitTest
packetbeat-rhel-9-rhel-9 - mage build unitTest
  • Took 2 min 17 sec . View more details here
  • Description: mage build unitTest
packetbeat-rhel-9-rhel-9 - mage build unitTest
  • Took 0 min 20 sec . View more details here
  • Description: mage build unitTest
packetbeat-rhel-9-rhel-9 - mage build unitTest
  • Took 0 min 20 sec . View more details here
  • Description: mage build unitTest
Error signal
  • Took 0 min 0 sec . View more details here
  • Description: Error 'hudson.AbortException: script returned exit code 1'

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Copy link
Collaborator

❕ Build Aborted

Either there was a build timeout or someone aborted the build.

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Duration: 7 min 35 sec

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Copy link
Collaborator

❕ Build Aborted

Either there was a build timeout or someone aborted the build.

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Duration: 22 min 1 sec

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Copy link
Collaborator

❕ Build Aborted

There is a new build on-going so the previous on-going builds have been aborted.

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Start Time: 2024-02-01T22:30:51.139+0000

  • Duration: 7 min 37 sec

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Copy link
Collaborator

💔 Build Failed

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Duration: 26 min 37 sec

Pipeline error 1

This error is likely related to the pipeline itself. Click here
and then you will see the error (either incorrect syntax or an invalid configuration).

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@elasticmachine
Copy link
Collaborator

❕ Build Aborted

Either there was a build timeout or someone aborted the build.

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Duration: 22 min 56 sec

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@kvalliyurnatt
Copy link
Contributor Author

@elastic/sec-linux-platform could I get some reviews on this PR. TIA.

Copy link
Contributor

@nicholasberlin nicholasberlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few nitpicks and a potential unit test addition.
Overall, it seems fine to me, not familiar with the code though 🤷

packetbeat/config/config.go Outdated Show resolved Hide resolved
packetbeat/flows/worker.go Outdated Show resolved Hide resolved
packetbeat/flows/worker_test.go Show resolved Hide resolved
kvalliyurnatt and others added 2 commits February 15, 2024 16:43
Co-authored-by: Nicholas Berlin <56366649+nicholasberlin@users.noreply.github.com>
packetbeat/config/config.go Outdated Show resolved Hide resolved
packetbeat/flows/worker.go Outdated Show resolved Hide resolved
// The Flow was terminated because it was considered to be idle.
IdleTimeout
// The Flow was terminated for reporting purposes while it was still active.
ActiveTimeout
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we implement the logic to add a reason to events when the end of flow is detected (i.e. TCP FIN)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a reason, but not a 100% sure if this is what you had in mind. Please take a look and let me know. Thanks

@@ -213,8 +264,12 @@ func makeWorker(processor *flowsProcessor, tick time.Duration, timeout, period i
if nPeriod <= 0 {
nPeriod = period
}
handleActiveTimeout := enableActiveFlowTimeout && nActiveTimeout == 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is nActiveTimeout and why might it be 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tick will be gcd of period, timeout and activetimeout
so if period = 10, timeout = 30 and activetimeout = 60
tick = 10
Initial nPeriod = 1
Initial ntimeout = 3
Initial nActiveTimeout = 6

At each tick we decrement these values and then when any of these values reach 0 we either report flow, check for timeout or check for active timeout.

Now if say activetimeout is not set, or period is set to -1

then tick = timeout
nperiod = -1
nActivetimeout = -1

we will never decrement nperiod or nActivetimeout during the ticks, they will never reach 0 so we will never check for period or active timeout.

I added unit tests around this. Maybe that helps a little bit

}
}
}

fw.spool.flush()
}

func (fw *flowsProcessor) report(w *worker, ts time.Time, flow *biFlow, isOver bool, intNames, uintNames, floatNames []string) {
event := createEvent(fw.watcher, ts, flow, isOver, intNames, uintNames, floatNames)
func shouldEndFlow(flow *biFlow, fw *flowsProcessor, ts time.Time, activeFlowTimeout bool) (flowEndReason, bool) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im a little confused here - what if there are two different timeouts; one for inactive and one for active. how do we account for those two different values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is handled at a different place .That is handled in makeWorker function based on timeout, period, activeTimeout . However these is a bug I need to change the if condition

if checkTimeout {

to

if checkTimeout || handleActiveTimeout {

Copy link
Contributor Author

@kvalliyurnatt kvalliyurnatt Feb 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up actually changing the check altogether

@elasticmachine
Copy link
Collaborator

💔 Build Failed

Failed CI Steps

History

cc @kvalliyurnatt

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @kvalliyurnatt

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @kvalliyurnatt

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @kvalliyurnatt

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @kvalliyurnatt

@elasticmachine
Copy link
Collaborator

💚 Build Succeeded

History

cc @kvalliyurnatt

@andrewkroh
Copy link
Member

andrewkroh commented Mar 7, 2024

One issue we discussed about this change is that because Packetbeat sends bidirectional flow records the accuracy of source and destination are more important compared to a unidirectional flow exporter. If the active flow timeout purges the flow table entry then when the next packet is observed the source and destination might flip because we lost the state about which side initiated the flow. Packetbeat should keep this state.

IIUC the goal with adding the active flow timeout is to generate mid-flow events whose packet/byte counts are not cumulative over the life of the flow but are only accumulated since the previous flow event.

My initial though was to reset the packet/byte counters but keep the entry in the flow table. But that would change the behavior the the final: true flow record in that it no longer contains a complete accounting for the flow. The current behavior (without any changes) is actually pretty close to what you want IIUC in that it sends both mid-flow records and a final record after idle-timeout. But those records all contain cumulative counts which is not what you want. Maybe it makes sense to explore a config option to enable delta reporting (how many packets/bytes were observed since the previous event was sent) instead of cumulative reporting.

@yspotts
Copy link
Contributor

yspotts commented Mar 7, 2024

@andrewkroh thanks. I think the delta in the intermediate would be helpful for us; however, what would the final record look like? If that is cumalitive, then it would throw off our calculation. Unless we had a very very long idle timeout? Would that help?

@andrewkroh
Copy link
Member

We should probably make the final event also use delta reporting if we create a new delta option. Does that make sense?

@yspotts
Copy link
Contributor

yspotts commented Mar 7, 2024

We should probably make the final event also use delta reporting if we create a new delta option. Does that make sense?

yes, perfect

@kvalliyurnatt
Copy link
Contributor Author

closing this in favour of #38223

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add support for active flow timeout for packetbeat
5 participants