
Refactor sqsReader.Receive: move queue ack waiting and message deleti… #38146

Closed

wants to merge 49 commits into from

Conversation

@aspacca aspacca commented Feb 29, 2024

…on in a separated goroutine

Enhancement

Proposed commit message

Refactor AWS S3-SQS input in order to decouple the number of messages processed from waiting for a flush from the publishing queue.

WHAT:
Each SQS message in the AWS S3-SQS input is handled in a separate goroutine, where operations are sequential. Instead of waiting inside each goroutine for a flush from the publishing queue, we return as soon as all events are sent to the queue and start a new goroutine for a new SQS message.

Waiting for the flush of the events of each SQS message happens in separate goroutines; the message is kept in flight until its flush is done. Only then can we decide whether to send the message back to the queue or delete it.

This implementation decouples the number of messages processed from waiting for a flush from the publishing queue: only message deletion/return to the queue waits for the flush.
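A minimal sketch of this flow in Go (names such as deletionRequest, process, and deleteMessage are illustrative, not the identifiers used in this PR):

package sqsflow

import (
    "context"
    "sync"
)

// deletionRequest carries what the deletion stage needs once every event of
// an SQS message has been handed to the publishing queue.
type deletionRequest struct {
    receiptHandle string
    // waitForFlush blocks until the publishing queue has acked every event
    // of the message; it returns false if the wait is aborted.
    waitForFlush func(ctx context.Context) bool
}

// receive handles each SQS message in its own goroutine and returns from that
// goroutine as soon as the events are sent to the publishing queue; waiting
// for the flush and deleting the message happens in a separate stage.
func receive(ctx context.Context, messages <-chan string,
    process func(ctx context.Context, msg string) deletionRequest,
    deleteMessage func(receiptHandle string)) {

    deletions := make(chan deletionRequest)
    var deletionWg sync.WaitGroup

    // Single goroutine draining deletion requests; each request gets its own
    // goroutine so a slow queue flush never throttles message processing.
    go func() {
        for req := range deletions {
            deletionWg.Add(1)
            go func(req deletionRequest) {
                defer deletionWg.Done()
                if req.waitForFlush(ctx) {
                    deleteMessage(req.receiptHandle)
                }
                // Otherwise the message stays in flight and goes back to the queue.
            }(req)
        }
    }()

    var processWg sync.WaitGroup
    for msg := range messages {
        processWg.Add(1)
        go func(msg string) {
            defer processWg.Done()
            // Return as soon as the events are published; the flush is not awaited here.
            deletions <- process(ctx, msg)
        }(msg)
    }

    processWg.Wait()
    close(deletions)
    deletionWg.Wait()
}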

WHY:
In the existing implementation, waiting for the flush from the publishing queue implicitly throttles SQS message processing to the throughput of the publishing queue flush.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Benchmarks comparison

Benchmark scenarios:

  1. New implementation: after all the events from an SQS message are published to the queue (same goroutine behaviour as now) we send the data needed to process the deletion to a channel; a single goroutine reads from this channel and starts a goroutine for each element read in order to delete the message. awscommon.EventACKTracker changed (see below).
  2. Same as above, but capping the number of concurrent deletion goroutines to 3200 (currently hardcoded; chosen to match the queue size I’ve tested). The reason is to cap memory usage and goroutine spawning (see the sketch after this list).
  3. Current codebase, with only a change in awscommon.EventACKTracker (brought from this PR) to handle a race condition where PendingACKs could reach 0 and then increase again when not all the Add calls (increasing PendingACKs) arrive before all the ACK calls (decreasing PendingACKs): the existing implementation stops waiting for all events to be acked by the queue even though not all of them were actually acked.
  4. Same as before, but in addition to the logic change in awscommon.EventACKTracker, switching from a mutex guarding PendingACKs against race conditions to the atomic package.
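A minimal sketch of the cap in scenario 2, using a buffered channel as a semaphore (the constant and the names are illustrative, not the PR's actual code):

package sqsflow

import "sync"

// maxDeletionGoroutines caps concurrent deletions; 3200 matches the queue
// size used in the benchmarks and is illustrative only.
const maxDeletionGoroutines = 3200

// runDeletions executes each deletion job in its own goroutine, but never
// more than maxDeletionGoroutines at once, bounding memory usage and
// goroutine spawning.
func runDeletions(jobs <-chan func()) {
    sem := make(chan struct{}, maxDeletionGoroutines)
    var wg sync.WaitGroup
    for job := range jobs {
        sem <- struct{}{} // blocks while the cap is reached
        wg.Add(1)
        go func(job func()) {
            defer func() { <-sem; wg.Done() }()
            job()
        }(job)
    }
    wg.Wait()
}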

For each scenario I’ve tested three different types of load:

  1. Dynamic number of messages: I generate 1.1x max_number_of_messages, testing from 1 to 1024 max_number_of_messages in power-of-2 steps. The size of each message, the number of S3 notifications in each, and the number of events for each S3 object are generated randomly with the same seed, so different max_number_of_messages values produce different loads.
  2. A fixed set of 71 messages that I identified as a particularly performant benchmark load during development; I test this exact load in every scenario.
  3. The "1 SQS message : 1 S3 object : 1 Event" load, again randomized according to the max_number_of_messages tested. This is the original load type that prompted the refactoring of the input.

See https://github.com/elastic/beats/blob/01ee8d18fcb523586883cf946914643902b01631/x-pack/filebeat/input/awss3/benchmarks-TO-BE-DELETED.md for results

@aspacca aspacca self-assigned this Feb 29, 2024
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Feb 29, 2024
Contributor

mergify bot commented Feb 29, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @aspacca? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@aspacca aspacca added the Team:obs-ds-hosted-services Label for the Observability Hosted Services team label Feb 29, 2024
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Feb 29, 2024
@elasticmachine
Collaborator

elasticmachine commented Feb 29, 2024

💚 Build Succeeded


Build stats

  • Duration: 140 min 35 sec

❕ Flaky test report

No test was executed to be analysed.

🤖 GitHub comments


To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

x-pack/filebeat/input/awss3/sqs.go (outdated review thread, resolved)
x-pack/libbeat/common/aws/acker.go (outdated review thread, resolved)
x-pack/filebeat/input/awss3/sqs.go (outdated review thread, resolved)
@aspacca
Author

aspacca commented Mar 12, 2024

@cmacknz @faec

Benchmark from last commit:
Roughly 4 minutes for full ingestion, 290 MB RAM, ~5000 EPS on average, with tens of messages in flight at any time.

Memory decreases after ingestion ends.

@faec faec (Contributor) left a comment

I'm about to be on PTO (back Wednesday 3/20) so I can't make the sync tomorrow. I've gone through the new revisions in more detail, hopefully these comments can keep things progressing in the meantime.

"elapsed_time_ns", time.Since(a.start))
}
} else {
a.log.Infow("Skipping deleting SQS message, not all events acked.",
Contributor

Messages should still be deleted even if some events are dropped. For example a user might configure their processors to filter messages with a particular tag, but this doesn't mean that anything is wrong or that processing of that SQS message is not really "done." Until the client itself is closed, the Beats pipeline will only drop events that should be permanently dropped based on user settings.

@aspacca aspacca (Author) Mar 14, 2024

Messages should still be deleted even if some events are dropped. For example a user might configure their processors to filter messages with a particular tag, but this doesn't mean that anything is wrong or that processing of that SQS message is not really "done." Until the client itself is closed, the Beats pipeline will only drop events that should be permanently dropped based on user settings.

Can you please clarify between "dropped", "published" and "acked"?

I assumed the following

  • dropped: discarded by a processor in beats. not sent to the output
  • published: sent to the output
  • acked: output returned a positive response of receiving the event


// This is eating its own tail: we should check for dropped+published, but then we won't wait for acked.
// Acked might not be equal to published?
return a.EventsDropped.Load()+a.EventsAcked.Load() == eventsToBeTracked
Contributor

EventsDropped shouldn't be included here, and probably shouldn't be tracked at all since it has no effect that the input can act on (see comments on (*eventListener).AddEvent and FlushForSQS)

Author

I see we still need to track it.
We know that an SQS message has 10 events:
3 of them were dropped,
7 were acked.

Summing the two gives 7+3 == 10, so we know every event in the SQS message was taken care of by the queue and we can delete the message without waiting for anything else.
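A minimal sketch of that bookkeeping, here with the standard library's sync/atomic (field and method names are illustrative; the PR's EventACKTracker differs in detail):

package sqsflow

import "sync/atomic"

// ackTracker counts, per SQS message, how many events the pipeline has
// finished with, either by acking them or by dropping them in a processor.
type ackTracker struct {
    eventsAcked   atomic.Uint64
    eventsDropped atomic.Uint64
    eventsTotal   atomic.Uint64
    totalKnown    atomic.Bool // false until the whole S3 object has been read
}

func (a *ackTracker) ACK()  { a.eventsAcked.Add(1) }
func (a *ackTracker) Drop() { a.eventsDropped.Add(1) }

// MarkProcessed records the final event count once the S3 object is fully read.
func (a *ackTracker) MarkProcessed(total uint64) {
    a.eventsTotal.Store(total)
    a.totalKnown.Store(true)
}

// FullyTracked reports whether every event of the message is accounted for:
// the total is known and acked + dropped == total (e.g. 7 + 3 == 10).
func (a *ackTracker) FullyTracked() bool {
    return a.totalKnown.Load() &&
        a.eventsAcked.Load()+a.eventsDropped.Load() == a.eventsTotal.Load()
}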


func (a *eventListener) ClientClosed() {}

func (a *eventListener) AddEvent(event beat.Event, published bool) {
Contributor

EventPrivateReporter already triggers your callback in NewEventACKHandler for every event, whether it is published or not, so the current Drop handling leads to double counting. I believe you can remove eventListener entirely, along with EventsDropped and EventsPublished -- there should be no reason to alter behavior based on whether something is published or dropped, since you just need to compare the final event count for a message with the callback invocations from EventPrivateReporter.

@aspacca aspacca (Author) Mar 14, 2024

I can see that in AddEvent I have the published param, which tells me whether the event was published or dropped.

EventPrivateReporter already triggers your callback in NewEventACKHandler for every event, whether it is published or not

OK, I thought it triggered my callback only for every "acked" event (in the sense of "the output returned a positive response of receiving the event").

If "acked" means that the queue handled the event, either publishing it or dropping it, then yes: I can just use EventPrivateReporter and LastEventPrivateReporter.

In EventPrivateReporter I will just call acker.ACK() for every event, and in LastEventPrivateReporter:

if acker.FullyTracked() {
    acker.FlushForSQS()
}

But I need to introduce one client for every acker

It seems that this way we don't have the possibility to ensure at-least-once delivery, at least in the sense that the output ingested the event.
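For reference, a sketch of that wiring, assuming libbeat's acker.EventPrivateReporter and acker.LastEventPrivateReporter helpers; FullyTracked and FlushForSQS are the methods discussed in this thread, not an existing libbeat API:

package sqsflow

import (
    "github.com/elastic/beats/v7/libbeat/beat"
    "github.com/elastic/beats/v7/libbeat/common/acker"
)

// sqsACKer is the per-message tracker discussed above; FullyTracked and
// FlushForSQS are hypothetical here.
type sqsACKer interface {
    ACK()
    FullyTracked() bool
    FlushForSQS()
}

// newEventACKHandler acks every event carrying an sqsACKer in its private
// data and, once the last event of a publish batch is acked and the message
// is fully tracked, triggers the delete/return decision for the SQS message.
func newEventACKHandler() beat.EventListener {
    return acker.Combine(
        acker.EventPrivateReporter(func(_ int, privates []interface{}) {
            for _, private := range privates {
                if a, ok := private.(sqsACKer); ok {
                    a.ACK()
                }
            }
        }),
        acker.LastEventPrivateReporter(func(_ int, private interface{}) {
            if a, ok := private.(sqsACKer); ok && a.FullyTracked() {
                a.FlushForSQS()
            }
        }),
    )
}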

x-pack/filebeat/input/awss3/sqs_acker.go (outdated review thread, resolved)
EventsToBeTracked: atomic.NewUint64(0),
}

go func() {
Contributor

Creating a looping goroutine for each message isn't needed. It should be enough to check FullyTracked:

  • In (*EventACKTracker).ACK
  • In (*EventACKTracker).MarkSQSProcessedWithData

since these are the only two places where the acked count and/or EventsToBeTracked can be modified. (This could be done safely with just atomics given some careful ordering, but it's also fine to use a mutex for the initial implementation, that's still a lot better than an extra goroutine for every message.)
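A minimal sketch of what that suggestion would look like, with the completion check done under a mutex in both mutation points (names are illustrative):

package sqsflow

import "sync"

// eventACKTracker sketches the suggestion: completion is checked both when an
// event is acked and when the final event count becomes known, so the action
// fires no matter which of the two happens last.
type eventACKTracker struct {
    mu         sync.Mutex
    acked      uint64
    total      uint64
    totalKnown bool
    done       bool
    onComplete func() // e.g. delete the SQS message or send it back to the queue
}

func (t *eventACKTracker) finishIfDone() {
    if !t.done && t.totalKnown && t.acked == t.total {
        t.done = true
        go t.onComplete() // keep the pipeline's ack callback from blocking
    }
}

func (t *eventACKTracker) ACK() {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.acked++
    t.finishIfDone()
}

func (t *eventACKTracker) MarkSQSProcessedWithData(total uint64) {
    t.mu.Lock()
    defer t.mu.Unlock()
    t.total = total
    t.totalKnown = true
    t.finishIfDone()
}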

Author

Creating a looping goroutine for each message isn't needed. It should be enough to check FullyTracked:

  • In (*EventACKTracker).ACK
  • In (*EventACKTracker).MarkSQSProcessedWithData

that's what we wanted to remove, because it requires a mutex and it would block the listener callback.

Author

In EventPrivateReporter I will just call acker.ACK() for every event, and in LastEventPrivateReporter:

if acker.FullyTracked() {
    acker.FlushForSQS()
}

Beware that also with this approach we can end up with the following timeline:
. T1 client publishes event 1
. T2 client publishes event 2
. T3 private event listener acks event 1
. T4 private event listener acks event 2
. T5 last event listener checks how many events have to be acked (0, because T6 hasn't happened yet)
. T6 sqs event processor marks the sqs message as processed, informing the acker that there are only 2 events

The last event listener will not be invoked again. We don't delete the message and the input will hang on shutdown because we have deletionWg.Wait().

So I guess a goroutine is needed indeed

Member

The core problem being that the S3 objects are read and published as a stream, where you don't know how many events to expect (and hence when the message can be deleted) until you have read them all?

https://github.com/aspacca/beats/blob/8c3c39a69efe506763aaad50efc5e8280cc0f7c5/x-pack/filebeat/input/awss3/s3_objects.go#L398-L403

		message, err := reader.Next()
		if len(message.Content) > 0 {
			event := p.createEvent(string(message.Content), offset)
			event.Fields.DeepUpdate(message.Fields)
			offset += int64(message.Bytes)
			p.publish(&event)
		}

		if errors.Is(err, io.EOF) {
			// No more lines
			break
		}
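A sketch of how the total only becomes known once the stream hits io.EOF, at which point the acker can be told how many events to expect (the loop mirrors the excerpt above; the reader interface, publish, and markProcessed callbacks are stand-ins for the real code, e.g. MarkSQSProcessedWithData):

package sqsflow

import (
    "errors"
    "io"
)

// lineReader stands in for the reader used in s3_objects.go: it returns one
// message per call and io.EOF when the object is exhausted.
type lineReader interface {
    Next() (content []byte, err error)
}

// publishAll streams an S3 object, publishing one event per line; only after
// io.EOF do we know how many events the SQS message produced, so only then
// can the acker be told the total it has to wait for.
func publishAll(r lineReader, publish func(content []byte), markProcessed func(total uint64)) error {
    var total uint64
    for {
        content, err := r.Next()
        if len(content) > 0 {
            publish(content)
            total++
        }
        if errors.Is(err, io.EOF) {
            break // no more lines
        }
        if err != nil {
            return err
        }
    }
    markProcessed(total) // e.g. acker.MarkSQSProcessedWithData(total)
    return nil
}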

Comment on lines 86 to 88
clientsMutex.Lock()
clients[id] = client
clientsMutex.Unlock()
Member

Is concurrent access to the map actually possible here?

Author

no indeed

acker := NewEventACKTracker(ctx, deletionWg)

deletionWg.Add(1)
deletionWaiter.Swap(false)
Member

Is the bool ever actually set to true? The default value of deletionWaiter := new(atomic.Bool) is false, isn't it?

Author

good catch!

// an event has been ACKed an output. If the event contains a private metadata
// pointing to an eventACKTracker then it will invoke the trackers ACK() method
// to decrement the number of pending ACKs.
func NewEventACKHandler() beat.EventListener {
Contributor

Can the NewEventACKHandler function stay inside libbeat?

Author

We already have awscommon.EventACKTracker in github.com/elastic/beats/v7/x-pack/libbeat/common/aws.
This acker is specific to SQS-S3 notifications.

event.Private = ack
func (p *s3ObjectProcessor) publish(event *beat.Event) {
if p.acker != nil {
event.Private = p.acker
Contributor

also need p.acker.Add() first?

Author

p.acker is awss3.EventACKTracker: it has no Add()
p.ackerForPollin is awscommon.EventACKTracker: it has Add()

@elasticmachine

💚 Build Succeeded

cc @aspacca

@elasticmachine
Collaborator

elasticmachine commented Mar 19, 2024

💔 Build Failed

Failed CI Steps

History

cc @aspacca

@aspacca aspacca closed this Mar 20, 2024
@aspacca aspacca deleted the sqs-s3-input-wait-for-ack-in-a-separated-goroutine branch March 20, 2024 03:35