
[Enhancement] support lake table cache select in physical way #55328

Status: Open. Wants to merge 1 commit into base: branch-3.3.

Conversation

@starrocks-xupeng (Contributor) commented Jan 22, 2025

Why I'm doing:

What I'm doing:

Part of the cache select process is CPU heavy; this work is unnecessary for warming the cache and can be removed.

100 GB SSB, everything already cached, run `cache select * from lineorder;`.

| | Previous Implementation | New Implementation |
| --- | --- | --- |
| Total | 2s373ms | 418ms |
| IOTime (IO heavy) | 272.922ms | 281.019ms |
| Decompress Page + Checksum Check + Form Chunk (CPU heavy) | 1.336s | 0s |

100 GB TPCH, everything already cached, run `cache select * from lineitem;`.

| | Previous Implementation | New Implementation |
| --- | --- | --- |
| Total | 9s254ms | 1s190ms |
| IOTime (IO heavy) | 348.076ms | 274.867ms |
| Decompress Page + Checksum Check + Form Chunk (CPU heavy) | 5.402s | 0s |

What type of PR is this:

  • BugFix
  • Feature
  • Enhancement
  • Refactor
  • UT
  • Doc
  • Tool

Does this PR entail a change in behavior?

  • Yes, this PR will result in a change in behavior.
  • No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • Parameter changes: default values, similar parameters but with different default values
  • Policy changes: use new policy to replace old one, functionality automatically enabled
  • Feature removed
  • Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • I have added test cases for my bug fix or my new feature
  • This PR needs user documentation (for new or modified features or behaviors)
    • I have added documentation for my new feature or new function
  • This is a backport PR

Bugfix cherry-pick branch check:

  • I have checked the version labels which the pr will be auto-backported to the target branch
    • 3.4
    • 3.3
    • 3.2
    • 3.1
    • 3.0

Signed-off-by: starrocks-xupeng <xupeng@starrocks.com>
@ctbrennan (Contributor) commented Jan 22, 2025

Hi Xupeng, thank you very much for this change.

I have a question to clarify my own understanding of the time spent performing file I/O. Since we're talking about data that's already been cached, I assume the file I/O is streaming the data from local disk into memory (and not downloading the file from S3 to local disk). Please correct me if this understanding of the I/O is wrong.

Assuming my understanding above is correct, could we, in the future, add new code to StarOs and the backend client so that we can verify the file exists on local disk without streaming it into memory? This might be a relevant performance improvement when our tables are 1 or 10 TB in size.

This PR is already a great improvement, just asking about future plans. Thanks again!

@starrocks-xupeng (Contributor, Author)
In this PR, the file will only be read from local disk into memory if it is already cached.

Yes, you are right: that is the best-performing solution for your case, but it needs some more work to be done.

  1. In your case you only run `cache select * from xxx`, so the whole file is needed. But if you run `cache select A,B from xxx` on a table with 4 columns A, B, C, D, part of the file is not needed;
  2. We use a block cache instead of a file cache, so a 100 MB file will be divided into 100 blocks of 1 MB each. You would need to write code like the following, which will be a bit tricky; it also requires modifying both the starlet code and the StarRocks BE/CN code:
```cpp
for (int i = 0; i < 100; ++i) {
    if (!file->exist_block(i)) {  // starlet needs to provide this API
        file->load_block(i);      // can be done with the current file->read API
    }
}
```
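As a sketch of how a partial cache select (point 1 above) could limit that loop to only the blocks a column actually occupies, here is a hypothetical helper that maps a byte range within the file to 1 MB block indices. `blocks_for_range` and `kBlockSize` are illustrative names, not starlet or StarRocks APIs:

```cpp
#include <cstdint>
#include <vector>

// 1 MB block size, matching the 100 MB file / 100 blocks example above.
constexpr uint64_t kBlockSize = 1024 * 1024;

// Hypothetical helper: given a column's byte range [offset, offset + length)
// within the file, return the indices of the cache blocks it touches.
std::vector<uint64_t> blocks_for_range(uint64_t offset, uint64_t length) {
    std::vector<uint64_t> blocks;
    if (length == 0) return blocks;
    uint64_t first = offset / kBlockSize;
    uint64_t last = (offset + length - 1) / kBlockSize;
    for (uint64_t b = first; b <= last; ++b) blocks.push_back(b);
    return blocks;
}
```

With such a helper, `cache select A,B` would only need the exist/load check for the blocks covering the byte ranges of columns A and B, rather than every block in the file.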

Actually, in your case there is another possible solution: multi-replica.
Warehouse A does all the loading/compaction and, in the background, sends all writes to warehouse B. We are currently developing multi-replica within one warehouse, and some time in the future we will implement cross-warehouse multi-replica. All of these features are in the enterprise version.
