Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement polling improvements #2652

Merged
merged 144 commits into from
Oct 27, 2023
Merged

Conversation

zalegrala
Copy link
Contributor

@zalegrala zalegrala commented Jul 13, 2023

What this PR does:

Currently, each polling cycle, the block list is read from the backend, followed by a read of each meta.json file. If the meta.json returns not-found, then an additional call is made to the backend to fetch the meta.compacted.json. During compaction, the meta.json is replaced with meta.compacted.json, but the details about the block remain unchanged. This means that tempo performs lots of fetching for information that is already known from the previous poll. This can be a significant source of backend calls, and in turn, latency of the polling cycle. In environments with a high number of blocks, this problem is amplified.

Here we extend the backend interfaces to create a ListBlocks() method that can be used to target block information direclty from each backend.

  • list the backend, splitting on an empty delimter to get all objects in the backend
  • parse the response, sorting the results into compacted and not-compacted
  • for each block, use the previous blocklist to match data we have in memory
  • for each new block, we perform the fetch as normal to read the metadata

This results in fewer calls to the backend, and a may decrease in total time spent for each poll cycle.

Additional changes to the structure of the backend are made to improve the duration while keeping the total calls in check.

  • the s3 and gcs ListBlocks() call has implemented a parallel list over the search space

Which issue(s) this PR fixes:
Fixes #2521

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@zalegrala zalegrala force-pushed the pollingLessWork branch 2 times, most recently from 7e423c0 to 63f1c85 Compare July 13, 2023 22:25
@zalegrala zalegrala force-pushed the pollingLessWork branch 5 times, most recently from f0417b7 to a47fec5 Compare July 26, 2023 21:04
@zalegrala
Copy link
Contributor Author

Blocklist poll duration decreases with an image built from this PR in GCS.
image

@zalegrala zalegrala changed the title Poller updates in progress Implement "quick" polling Jul 27, 2023
@zalegrala
Copy link
Contributor Author

Blocklist poll duration increases in s3.
image

The list call itslef is now taking longer in s3, but likely still saves cost on requests. Perhaps there is a different approach to the s3 implementation we can take. I'll do a little research and see if I can come up with something, but feel free to chime in if you have ideas.

@joe-elliott
Copy link
Member

As long as the polling takes less time than the polling cycle I think we're ok. Double the polling time might not matter if it's going from 20s -> 40s on 200k blocks or something like that.

@zalegrala
Copy link
Contributor Author

I've made a small change to use n as a delimiter for both the s3 and the gcs case. This works because .json files end with n. Use against the actual GCS API was working fine, but against the fake-gcs-server that we use in our CI was not accepting multi-character delimiters. I applied the same change to s3 thinking that perhaps the multi-character delimiter was slowing down the responses, but this didn't seem to pan out when I deployed. I think its worth a review at this point.

@zalegrala zalegrala marked this pull request as ready for review July 28, 2023 16:48
Copy link
Member

@joe-elliott joe-elliott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First pass. Have some Qs. Let me know if you want to have a sync convo.

tempodb/backend/azure/azure.go Outdated Show resolved Hide resolved
tempodb/backend/raw.go Outdated Show resolved Hide resolved
tempodb/backend/raw.go Outdated Show resolved Hide resolved
tempodb/blocklist/poller.go Outdated Show resolved Hide resolved
@zalegrala
Copy link
Contributor Author

Thanks for the review @joe-elliott. I'm happy to chat through this whenever.

@zalegrala zalegrala force-pushed the pollingLessWork branch 2 times, most recently from ee8950c to bf5f1c6 Compare August 21, 2023 21:22
} else if cm != nil {
chCompactedMeta <- cm
} else if err != nil {
mtx.Lock()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the cost of the lock in this context is basically nothing. lock/defer unlock is a safer pattern that is more difficult to mess up as functionality changes.

you could also rely on the lock in this case to protect the err var. up to you

integration/poller/poller_test.go Show resolved Hide resolved
tempodb/backend/gcs/gcs.go Show resolved Hide resolved
tempodb/backend/gcs/gcs.go Outdated Show resolved Hide resolved
tempodb/blocklist/poller.go Outdated Show resolved Hide resolved
tempodb/blocklist/poller.go Outdated Show resolved Hide resolved
Copy link
Member

@joe-elliott joe-elliott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of smaller comments, but I'm g2g. Great improvement!

.github/workflows/ci.yml Show resolved Hide resolved
tempodb/backend/gcs/gcs.go Outdated Show resolved Hide resolved
tempodb/backend/gcs/gcs.go Outdated Show resolved Hide resolved
continue
}

// TODO: Review the ability to avoid polling for compacted blocks that we
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the code you've added does this? if so we can clean up this todo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not quite. What I had in mind might not be possible, but the note here is about avoiding pulling the data entirely if the block transitions from meta -> compactedMeta. We have all the detail except the time it was compacted. More generally, I wonder if we already had this data in memory, then we know that it is only one polling cycle out of date, and perhaps we could be fuzzy enough on this that we don't have to know the precise time. More thinking needed probably.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve Polling Performance at high block counts
5 participants