Slides at https://github.com/rbtcollins/reveal.js
Robert Collins ; He/him
@rbtcollins (Twitter)
@robert.collins (Slack)
Rustup working group member Note: Thank you for coming to listen to my tale today. I hope you will enjoy it. We're going to talk about a 10x performance improvement in rustup on Windows, well, mostly on Windows.
On a road trip, far from my chunky workstation
Note: I was using my Surface, horrid motel Wifi, worse 4G. Rust-lang installed very very slowly. What was up with that?
Note: Ah good, there's a bug report
Note: Correctly blaming anti-virus! Problem Solved! Why is the bug still open?
Note: Oh. It's still slow even if you turn off Defender :(.
..techsplaining mostly elided..
What is common between these things?
rustup (Windows) 3m43s
7-zip (Windows) 2m35s
FastCopy (Windows) 2m53s
tar (Linux) <1.5s
Note: None of them are able to make NTFS perform well. Or, for that matter, NFS, GlusterFS, CephFS and a whole range of other file systems. Ext4 on an SSD, great.
The bug had concluded the problem was not solvable within Rustup. And then I "helped".
- "NTFS is what it is and can't be fixed"
- Let's not ship docs (but users want them on flights)
- Let's ship them in a zip or .ISO
- Let's only unpack them when the user asks for it
Note: Fortunately, I ran across this broader bug for all file systems.
Around about here I submitted my first patch - let's call this v1
Date: Tue Apr 9 12:53:15 2019 +1200
Less copying during dist installation
Per #904 copying the contents of dists out of the extracted staging
directory and then later deleting that same staging directory consumes
60s out of a total of 200s on Windows.
Wall clock testing shows this patch reduces
`rustup toolchain install nightly` from 3m45 to 2m23 for me - including
download times etc.
I'm sure there is more that can be done, thus I'm not marking this as
a closing merge for #904.
Note: 1 minute on the total runtime isn't too bad: ~30%.
Note: Whatever Rust project I had been working on is long forgotten at this point. I'm just sitting there installing and uninstalling Rust, looking at Process Monitor traces (a Sysinternals tool, now from Microsoft - think a GUI strace). rustup isn't super easy to do perf work on: the different stages it goes through are not well isolated, and I was doing hacky things to avoid downloading Rust from the internet each time (copying files while they were in use, and arrgh).
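In sketch form, the shape of that first patch (illustrative code, not the actual diff - the helper name and the fallback path are assumptions): move the staging tree into place with a rename where possible, instead of copying every file and then deleting the staging copy.

use std::fs;
use std::io;
use std::path::Path;

// Hypothetical helper, not rustup's code: a same-filesystem rename moves the
// whole extracted tree in one metadata operation, versus copying every file
// and then recursively deleting the staging directory afterwards.
fn install_from_staging(staging: &Path, dest: &Path) -> io::Result<()> {
    match fs::rename(staging, dest) {
        Ok(()) => Ok(()),
        Err(_) => {
            // Fallback (e.g. staging and dest on different filesystems):
            // pay the original copy-then-delete cost.
            copy_tree(staging, dest)?;
            fs::remove_dir_all(staging)
        }
    }
}

// Plain recursive copy, used only by the fallback path above.
fn copy_tree(src: &Path, dst: &Path) -> io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        let target = dst.join(entry.file_name());
        if entry.file_type()?.is_dir() {
            copy_tree(&entry.path(), &target)?;
        } else {
            fs::copy(entry.path(), &target)?;
        }
    }
    Ok(())
}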
v2: terminfo
v3: parent dir stat caching: 1-2 seconds
v4: larger untar read buffer: 0-12 seconds
v5: larger hash checking read buffer
Note: My v2 patch fixed a bug where terminfo was opened far too many times - global state needs to be managed as global state, and rustup isn't /good rust/ in this regard (not a reflection on Rust itself). My v3 patch improved tar unpacking so that the parent directory of a path was only statted once per directory name, rather than once per child of that directory. That saved 20k stats. Sorry about the wall of text! I've skipped half the commit as-is. Another 12 seconds. Are you keeping track? For v4 and v5: Rust has a very small built-in buffer size (8K), which is never going to be optimal for disk or network I/O anywhere. I looked at submitting a patch upstream, but because that 8K ends up as on-stack buffers in various sys functions, it was not going to go anywhere.
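For v4 and v5 the shape of the change is simply a bigger read buffer. A minimal sketch (the 1 MiB size and the toy checksum are illustrative, not what rustup uses):

use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

// Sketch: read with a large user-space buffer instead of the 8K default,
// so each read() syscall moves far more data per kernel transition.
fn sum_file(path: &Path) -> io::Result<u64> {
    let mut file = File::open(path)?;
    let mut buf = vec![0u8; 1024 * 1024]; // 1 MiB, illustrative
    let mut total: u64 = 0; // stand-in for feeding bytes to a real hasher
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 {
            return Ok(total);
        }
        total += n as u64;
    }
}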
Note: Remember that very first comment
- Stat
- Create, write data
- Set mtime and atime
- Stat
- Set permissions
And then there is another later walk over the same file. And a complete copy of everything.
v6: 1 line, 4-6 seconds
We don't care about mtime. Turning it off saves an open, a read, a write, and a close. Note: NTFS stores the mtime in the Master File Table - the NT kernel could have a syscall that sets file times without opening the file, but doesn't. Probably makes the security surface cleaner. 4 kernel-mode transitions * 20K files = 4 to 6 seconds. Docs is 20k files; src is 7k? The toolchain is many fewer files, but larger. Also note that if these lower layers had benchmark suites using e.g. dm-delay, it's likely the death-by-a-thousand-cuts effect would have been mitigated. One of the things experienced folk will tend to tell you is that it really matters what is happening under the hood, and you may be getting a sense that that is what's involved here.
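At the tar-rs layer the one-line change looks roughly like this (a sketch assuming the tar crate's set_preserve_mtime; not the exact rustup diff):

use std::fs::File;
use std::io;
use std::path::Path;
use tar::Archive;

// Sketch: skip restoring mtimes during unpack, avoiding the extra
// open/write/close round trip per extracted file described above.
fn unpack(tarball: &Path, dest: &Path) -> io::Result<()> {
    let mut archive = Archive::new(File::open(tarball)?);
    archive.set_preserve_mtime(false); // the one-line switch
    archive.unpack(dest)
}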
fn chmod(fname: &Path, mode: u32) -> Result<()> {
    let file = open(fname)?;  // one kernel-mode transition
    fchmod(&file, mode)?;     // a second
    close(file)               // and a third for the close
}
Note: To make it more concrete than me just talking about it, this is what an emulated chmod inside something like Cygwin or WSL ends up looking like.
Here I ran off for a bit and worked on tar-rs
- Time-of-check/time-of-use bugs (O_EXCL etc.)
- Do the modeset while the file is open (fchmod vs chmod)
- Set mtime and atime on file handles
- Optimise out readonly modesetting on Windows
Note: All of these things are better for Linux and ext4 as well, but Linux and ext4 are remarkably tolerant of things being done less than ideally. A sketch of the handle-based approach follows.
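The handle-based modeset, sketched in std Rust (not the actual tar-rs change): apply permissions through the handle that is already open for writing, so there is no second open/close just for the chmod.

use std::fs::{File, Permissions};
use std::io::{self, Write};
use std::path::Path;

// Sketch: one open, one close. Permissions are applied through the open
// handle (fchmod-style) rather than a later path-based chmod.
fn write_with_mode(path: &Path, data: &[u8], perm: Permissions) -> io::Result<()> {
    let mut file = File::create(path)?;
    file.write_all(data)?;
    file.set_permissions(perm)?; // operates on the handle we already hold
    Ok(())
    // the single close happens when `file` is dropped here
}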
v7 - call CloseHandle() from threads - 11s total
Avoid blocking on CloseHandle
On Windows closing a file involves CloseHandle, which can be quite
slow (6+ms) for various reasons, including circumstances outside our
control such as virus scanners, content indexing, device driver cache
write-back synchronisation and so forth. Rather than having a super
long list of all the things users need to do to optimise the performance
of CloseHandle, use a small threadpool to avoid blocking package
extraction when closing file handles.
This does run a risk of resource exhaustion, but as we only have 20k
files in the largest package today that should not be a problem in
practice.
https://www.mercurial-scm.org/pipermail/mercurial-devel/2016-January/078404.html
provided inspiration for this.
My benchmark system went from 21/22s to 11s with this change, with either
4 or 8 threads.
Note: I specifically want to acknowledge the Mercurial developers, who had identified that close latency was a performance issue for hg - see the link in my commit message. Talk here about the I/O completion model - on Windows the driver doesn't own the pages, the app still owns them - versus Linux, where the data is moved into the write-cache pathway and control returns immediately.
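A minimal sketch of the v7 idea (rustup used a small pool; the single worker and the channel bound here are illustrative):

use std::fs::File;
use std::sync::mpsc;
use std::thread;

// Sketch: hand finished files to a worker thread so the potentially slow
// CloseHandle happens off the extraction path.
fn spawn_closer() -> (mpsc::SyncSender<File>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::sync_channel::<File>(128);
    let worker = thread::spawn(move || {
        // Dropping each File here is what performs the close.
        for file in rx {
            drop(file);
        }
    });
    (tx, worker)
}

// Usage: after writing a file, send it instead of letting it close inline:
//   closer_tx.send(file).expect("closer alive");
// When extraction finishes, drop the sender and join the worker so the
// remaining closes drain before reporting success.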
Note: Whew. Time for a pause. We're down to 11 seconds now - is NTFS still in the picture? Is Defender? Not to mention McAfee/Norton/Kaspersky.
How did just eliminating a directory copy save a whole minute?
Is I/O on NTFS that bad? Note: mention burning temps on tablet back
rust-docs has ~20k files
each file was being scanned each time it was written Note: I don't know about you, but I can't read 20k documents in 11 seconds
Note: This provoked some jealousy. Scott watched an earlier version of this talk; he fields dozens of these queries daily. Why did I get access? The report was clear and well articulated. I had done my prep work, and that enabled both the introduction and the subsequent response I got from the Defender group.
Note: The discussion with Microsoft staff was excellent. They asked for traces via the feedback hub, shared how defender works with me and included me in discussions about the solution.
CloseHandle -> Scan -> Ok/Error Note: This is Defender's default behaviour, and I have traces with the CloseHandle call taking many tens of milliseconds. If we dispatch work to Defender - to the scan routine - faster than Defender can process it, it backs up.
Open -> Scan -> Ok/Error Note: This is what Defender has to do after a reboot. It is also an opt-in mode for individual processes at runtime - but not one a process can request for itself.
Defender was updated within a couple of weeks Note: to recognise rustup and default to scan-on-first-use rather than scan-on-close for the files that rustup writes.
I suggested a userspace API to let all open source projects opt in... but that didn't go anywhere.
Think git/hg/bzr, etc.
v8 - make it slower when it fails
Fix #1870: Race condition with some virus scanners
Some virus scanners take out handles on files we are going to
rename, causing access errors and preventing the rename. Typically
this is short lived. Retry rather than erroring.
Note: McAfee
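In sketch form (retry count, backoff, and the specific error codes checked are illustrative, not the exact rustup logic):

use std::io;
use std::path::Path;
use std::thread;
use std::time::Duration;

// Sketch: retry a rename that fails because a scanner briefly holds a handle
// on the file. 5 = ERROR_ACCESS_DENIED, 32 = ERROR_SHARING_VIOLATION.
fn rename_with_retry(from: &Path, to: &Path) -> io::Result<()> {
    let mut delay = Duration::from_millis(1);
    for _ in 0..20 {
        match std::fs::rename(from, to) {
            Ok(()) => return Ok(()),
            Err(e) if matches!(e.raw_os_error(), Some(5) | Some(32)) => {
                // Usually transient: back off briefly and try again.
                thread::sleep(delay);
                delay = (delay * 2).min(Duration::from_millis(250));
            }
            Err(e) => return Err(e),
        }
    }
    // Final attempt: if the handle is still held, surface the real error.
    std::fs::rename(from, to)
}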
Note: From the big issue about Windows doc installs being slow. Perhaps it really isn't just Windows. Explain the KB/s install metric. Also Reddit.
Note: So this bug was filed about Lustre being slow, and it turned out NFS was slow. This leads to...
v9 - close() in a thread Note: Narrator: this did not solve the problem
openat(): ~3ms
fchmod(): ~2ms
write(): ~10ns
close(): ~10ns
chmod(): ~1ms
Note: 20k items, so 1ms == 20 seconds. That's 60 seconds in open(), 40 seconds in fchmod(), and 20 seconds in the final chmods = 120s. So, the things we need to do to make it fast for NFS filesystems are:
- don't block on inode creation
- don't apply chmods twice
- and do the final chmod threaded or in the initial IO
v10 - thread all file IO Note: A version of this logic is included in Rust 1.37 so that rustdoc (and cargo doc etc.) generates documentation in a threaded fashion, for exactly the same reason. winapi, for instance, really benefits.
v11 - thread directory operations as well Note: This required cross-thread coordination because directories are effectively global state
v12 - set permissions during extraction Note: because duh. Otherwise we just blow the disk cache.
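Sketched together, v10-v12 look roughly like this (not rustup's actual code - the item type, pool size, and error handling are simplified assumptions): each file becomes one unit of work that creates, writes, and sets permissions in a single pass, and the units are fanned out to worker threads so no single slow create or close blocks the rest; directories are handled centrally because they are shared state.

use std::fs::{self, File, Permissions};
use std::io::{self, Write};
use std::path::PathBuf;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// One extracted file: everything needed to create it in a single pass.
struct Item {
    path: PathBuf,
    contents: Vec<u8>,
    perm: Permissions,
}

// Sketch: directories are created up front on the dispatching thread, then
// file items are fanned out so slow creates/permission-sets/closes overlap.
fn extract_threaded(items: Vec<Item>, threads: usize) -> io::Result<()> {
    let (tx, rx) = mpsc::channel::<Item>();
    let rx = Arc::new(Mutex::new(rx));
    let workers: Vec<_> = (0..threads)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || -> io::Result<()> {
                loop {
                    let item = match rx.lock().unwrap().recv() {
                        Ok(item) => item,
                        Err(_) => return Ok(()), // channel closed: no more work
                    };
                    let mut file = File::create(&item.path)?;
                    file.write_all(&item.contents)?;
                    file.set_permissions(item.perm)?; // v12: while still open
                }
            })
        })
        .collect();

    for item in items {
        if let Some(dir) = item.path.parent() {
            fs::create_dir_all(dir)?; // v11 in spirit: directories stay centralised
        }
        let _ = tx.send(item); // a failed send means workers died; join reports it
    }
    drop(tx); // close the channel so workers exit when the queue drains
    for worker in workers {
        worker.join().expect("worker panicked")?;
    }
    Ok(())
}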
I also added a tracing system in there that outputs Catapult trace files - that's the Chromium in-browser profiler.
Syscalls are not free
Anywhere
Don't blame your serial code's serial performance on what you are running on top of (like NTFS)
Collaborate with the owners of components to get the best results
There are often many causes to problems
Don't stop at the first cause
There's a whole other chapter on WSLv1, WSLv2 and more, but ETIME.
These slides: https://github.com/rbtcollins/reveal.js/tree/master/cognite-devcon-installing-rust.html
- Robert Collins
- @rbtcollins (twitter)
- robert.collins@cognite.com