According to an HN discussion, RG decides its read strategy (mmap vs. no-mmap) based on the predicted workload. In theory, could this be optimized if the file size were known ahead of time? #1769
Title. For context, I found this comment of yours mentioning the strategy for choosing the read approach while researching filesystem scanning techniques. As this followed my intuition, I was curious: in the hypothetical case where a file stat had already taken place (and thus isn't part of the "benchmark" time) and the file size was known beforehand, could this be used to improve the speed of reading a file by selectively mmapping larger files that surpass a threshold size? Understandably, this information wouldn't be available in RG's case since it would incur a stat() call for every file (slowing things way down), so this is more of a hypothetical than anything. I'm doing research for a case where the file sizes of a known set of files are cached and loaded prior to performing the scan, and thus might be used to predict the kind of workload that might occur on a per-file basis. I'm aware this wouldn't make sense for ripgrep, but this is kind of the king utility for file scanning that I know of, so I figured this might be the best place to ask. Thanks for any insight :) EDIT: P.S. thanks for making ripgrep, I use it all the time and it's a breath of fresh air.
Replies: 1 comment 1 reply
I think you've kind of answered the question yourself already. You've already mentioned the main caveat: the additional stat call could indeed doom the strategy in many workloads. There are two circumstances where ripgrep will stat every file (on Unix at least), but both are pretty uncommon. The first is if ripgrep has its output redirected to a file. It will stat each file it searches to ensure that it doesn't search the file its output is being redirected to (otherwise you could end up with an infinite feedback loop). The other is if the --max-filesize flag is used. A stat is necessary to determine whether a file's size is too big to search. Doing the memory map optimization you mention …

Otherwise, the only thing left to really do is to test the hypothesis and measure it.
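For anyone who wants to try the measurement, here is a rough sketch of what the experiment could look like. It is not how ripgrep does it. It assumes the file sizes are already cached in a dict, so no stat happens inside the timed region. The names `MMAP_THRESHOLD`, `count_matches`, and `time_strategy` are made up for illustration, and the 1 MiB cutoff is an arbitrary starting point rather than a recommendation.

```python
import mmap
import time

# Hypothetical cutoff; the right value (if one exists) has to come from measurement.
MMAP_THRESHOLD = 1 << 20  # 1 MiB


def count_in(buf, needle):
    """Count non-overlapping occurrences of `needle` in a bytes or mmap object."""
    count, pos = 0, buf.find(needle)
    while pos != -1:
        count += 1
        pos = buf.find(needle, pos + len(needle))
    return count


def count_matches(path, size, needle, threshold=MMAP_THRESHOLD):
    """Search one file, choosing the read strategy from its already-known size."""
    with open(path, "rb") as f:
        # Empty files cannot be mapped portably, so they always take the read path.
        if size > 0 and size >= threshold:
            # Large file: map it and let the OS page it in on demand.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
                return count_in(m, needle)
        # Small file: a plain read avoids the mmap setup/teardown cost.
        return count_in(f.read(), needle)


def time_strategy(sizes, needle, threshold):
    """Time one pass over `sizes`, a path -> size dict loaded from the cache."""
    start = time.perf_counter()
    total = sum(count_matches(p, s, needle, threshold) for p, s in sizes.items())
    return total, time.perf_counter() - start
```

Running `time_strategy` over the same file set with a threshold of 1 (mmap every non-empty file), a huge threshold (never mmap), and a few values in between gives the numbers to compare. Results will likely depend on whether the files are already in the page cache, so it is probably worth repeating each run several times and testing both warm and cold caches.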