According to an HN discussion, RG decides its read strategy (mmap vs. no-mmap) based on the predicted workload. In theory, could this be optimized if the file size were known ahead of time? #1769

Qix- · 2020-12-27T15:21:25Z


Title. For context, I found this comment of yours mentioning the strategy for choosing the read approach while researching filesystem scanning techniques.

As this matched my intuition, I was curious: in the hypothetical case where a file stat has already taken place (and thus isn't part of the "benchmark" time) and the file size is known beforehand, could that be used to speed up reading by selectively mmapping only the files that exceed a threshold size?

Understandably, this information wouldn't be available in RG's case, since it would incur a stat() call for every file (slowing things way down), so this is more of a hypothetical than anything. I'm doing research for a case where the file sizes of a known set of files are cached and loaded prior to performing the scan, and thus might be used to predict the kind of workload that might occur on a per-file basis.
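
To make the idea concrete, here is a minimal sketch of per-file strategy selection from a cached size. It is not how ripgrep works; `MMAP_THRESHOLD`, `count_occurrences`, and `search_file` are hypothetical names, the 1 MiB cutoff is an arbitrary placeholder, and the cached size is assumed to be accurate.

```python
import mmap

# Hypothetical cutoff; a useful value (if one exists) has to come from measurement.
MMAP_THRESHOLD = 1 << 20  # 1 MiB


def count_occurrences(buf, needle):
    """Count non-overlapping occurrences; works on bytes and mmap objects alike."""
    if not needle:
        raise ValueError("needle must be non-empty")
    count, pos = 0, buf.find(needle)
    while pos != -1:
        count += 1
        pos = buf.find(needle, pos + len(needle))
    return count


def search_file(path, needle, cached_size, threshold=MMAP_THRESHOLD):
    """Search one file, picking the read strategy from a size known in advance.

    cached_size is assumed to come from a cache built before the scan,
    so no stat() call happens here.
    """
    if cached_size == 0:
        return 0  # an empty file cannot be memory mapped, and has no matches
    with open(path, "rb") as f:
        if cached_size >= threshold:
            # Large file: map it instead of copying it into a buffer.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                return count_occurrences(m, needle)
        # Small file: a plain read avoids the mapping setup cost.
        return count_occurrences(f.read(), needle)
```

Both branches run the same scan, so the only thing that changes with the threshold is how the bytes are obtained.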

I'm aware this wouldn't make sense for ripgrep but this is kind of the king utility for file scanning that I know of, so I figured this might be the best place to ask.

Thanks for any insight :)

EDIT: P.S. Thanks for making ripgrep; I use it all the time and it's a breath of fresh air.


BurntSushi · 2020-12-27T16:42:41Z

Maintainer

I think you've kind of answered the question yourself already. You mentioned the main caveat: the additional stat call could indeed doom the strategy in many workloads. There are two circumstances where ripgrep will stat every file (on Unix at least), but both are pretty uncommon. The first is if ripgrep has its output redirected to a file: it will stat each file it searches to ensure that it doesn't search the file its output is being redirected to (otherwise you could end up with an infinite feedback loop). The second is if the --max-filesize flag is used, since a stat is necessary to determine whether a file is too big to search. Doing the memory map optimization you mention in these two cases is almost certainly not worth it.

Otherwise, the only thing left to really do is to test the hypothesis and measure it.
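
One rough way to start measuring is sketched below. It is an illustrative harness, not ripgrep's benchmark suite: the function names, file sizes, and repeat count are all assumptions, and it only times warm-cache reads of synthetic files.

```python
import mmap
import os
import tempfile
import time

LINE = b"some line of text\n"  # 18 bytes of synthetic content


def count_newlines(buf):
    """Same scan for both strategies, so only the read path differs."""
    count, pos = 0, buf.find(b"\n")
    while pos != -1:
        count += 1
        pos = buf.find(b"\n", pos + 1)
    return count


def scan_plain(path):
    with open(path, "rb") as f:
        return count_newlines(f.read())


def scan_mmap(path):
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            return count_newlines(m)


def best_time(fn, path, repeats=5):
    """Best wall-clock time over several runs, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(path)
        best = min(best, time.perf_counter() - start)
    return best


def compare(sizes=(4 << 10, 1 << 20, 16 << 20)):
    """Return {size: (plain_seconds, mmap_seconds)} for synthetic files."""
    results = {}
    for size in sizes:
        with tempfile.NamedTemporaryFile(delete=False) as tmp:
            tmp.write(LINE * (size // len(LINE)))
        try:
            results[size] = (best_time(scan_plain, tmp.name),
                             best_time(scan_mmap, tmp.name))
        finally:
            os.unlink(tmp.name)
    return results


if __name__ == "__main__":
    for size, (plain, mapped) in compare().items():
        print(f"{size:>10} bytes  read: {plain:.6f}s  mmap: {mapped:.6f}s")
```

Because the files were just written, they are almost certainly in the page cache, and the scan loop runs in the interpreter, so the numbers say little about cold reads or about a native implementation. A real test would need representative files and many small ones, where per-file mapping overhead is most likely to show up.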

1 reply

Qix- Dec 27, 2020
Author

@BurntSushi Awesome, just confirming my suspicions! Thank you for the additional information, too. Good stuff.

Happy new year!
