Implement multi-pass prefetch for memory efficiency (#2566)
Summary:
Pull Request resolved: #2566

## Context

Memory snapshots show significant memory usage during the prefetch kernels (specifically, `linearize_cache_index` and `lru_cache_populate`), estimated at 6x the input size. Unfortunately, because these kernels run on a dedicated stream, that memory cannot be reused by any other stream without a performance penalty. We therefore need to lower peak prefetch memory usage as much as possible.

## Multi-Pass Prefetch (MPP)

Multi-pass prefetch trades a small amount of extra running time for lower peak memory during prefetch. We observed that the intermediate memory usage of every function in the prefetch path is `O(N)`, so we reduce the number of indices prefetched per pass (`N`) to reduce the peak temporary usage. Subsequent passes recycle the memory used by the first pass, so they do not further increase the memory footprint. (A minimal chunking sketch follows this message.)

**Benefit**

With MPP enabled, peak memory usage drops from `6 * input_size` to `(6 / M) * input_size`, where `M` is the configured number of passes.

**Overhead**

Overall, the larger `M` is, the slower prefetch becomes, but the overhead is acceptable:

- **Efficiency regression**: Prefetch takes longer because the cache lookup is repeated for every duplicate index. Previously, indices were deduplicated before lookup; now an index may be looked up multiple times if its duplicates land in different passes.
  - The regression is insignificant overall, since the dominant cost is data movement between DDR and HBM. Data is still copied only once, even when an index is duplicated across passes.
  - The regression is also likely hidden from actual training performance, since prefetch happens on a separate stream. As long as it is not long enough to block the sparse backward pass, it is invisible.
- **Spamming the CUDA launch queue**: CUDA allows at most 1024 pending kernels, and the CPU blocks if more are submitted. If a pass's kernels are very small, we can easily spam the launch queue and severely hurt QPS. We mitigate this by enforcing a minimum number of elements per pass.

## What's in the patch?

1. Add a multi-pass prefetch config to the TBE interface. It defaults to `None` for full backward compatibility.
2. Make `lru_find_uncached` idempotent: if the same id is locked multiple times within a single timestep (but across multiple passes), the lock counter is incremented only once. (A sketch of this invariant also appears below.)

Reviewed By: sryap

Differential Revision: D56908989

fbshipit-source-id: 94413acf18f4652687f594a3f69365fe9d68b54b
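The pass-splitting idea above can be illustrated with a short sketch. This is a minimal model, not the actual FBGEMM implementation: `MultiPassPrefetchConfig`, `split_into_passes`, and `prefetch_one_pass` are hypothetical names, and the real work happens in CUDA kernels rather than Python.

```python
from dataclasses import dataclass
from typing import Callable, List

import torch


@dataclass
class MultiPassPrefetchConfig:
    # Hypothetical config mirroring the interface described in the patch:
    # how many passes to split prefetch into, plus a floor on per-pass size
    # so tiny passes cannot spam the CUDA launch queue.
    num_passes: int = 12
    min_pass_size: int = 1024


def split_into_passes(
    indices: torch.Tensor, config: MultiPassPrefetchConfig
) -> List[torch.Tensor]:
    # Temporary memory during prefetch is O(chunk size), so cutting the
    # indices into M chunks drops the peak from O(N) to roughly O(N / M).
    n = indices.numel()
    pass_size = max(
        (n + config.num_passes - 1) // config.num_passes,  # ceil(N / M)
        config.min_pass_size,
    )
    return list(torch.split(indices, pass_size))


def multi_pass_prefetch(
    indices: torch.Tensor,
    prefetch_one_pass: Callable[[torch.Tensor], None],
    config: MultiPassPrefetchConfig,
) -> None:
    # prefetch_one_pass stands in for the existing single-pass path
    # (linearize_cache_index + lru_cache_populate). Later passes reuse the
    # temporaries freed by earlier ones, so the footprint does not grow
    # with the number of passes.
    for chunk in split_into_passes(indices, config):
        prefetch_one_pass(chunk)
```

With `num_passes = M`, the peak temporary allocation scales with the largest chunk rather than the full input, matching the `(6 / M) * input_size` estimate above, while the `min_pass_size` floor caps how many kernels one prefetch can enqueue.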
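Item 2 of the patch (the idempotent `lru_find_uncached`) can likewise be modeled in a few lines. This is a Python sketch of the locking invariant only, with hypothetical names (`LruLockModel`, `lock_counter`, `locked_at`); the real logic lives in the CUDA kernel.

```python
from collections import defaultdict
from typing import Dict


class LruLockModel:
    # Models the per-slot pin counter that keeps a cache line resident
    # while it is in use; not the actual kernel implementation.
    def __init__(self) -> None:
        self.lock_counter: Dict[int, int] = defaultdict(int)
        self.locked_at: Dict[int, int] = {}  # slot -> timestep of last lock

    def lock(self, slot: int, timestep: int) -> None:
        # Idempotent within one timestep: if passes 1 and 3 of the same
        # prefetch both hit this slot, the counter rises by only one, so
        # the matching unlock at the end of the step stays balanced.
        if self.locked_at.get(slot) == timestep:
            return
        self.locked_at[slot] = timestep
        self.lock_counter[slot] += 1
```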
1 parent 108e8ef · commit 578ab67
Showing 5 changed files with 245 additions and 66 deletions.