
Demystifying needed options with QubesOS pool in btrfs reflink (multiple cow snapshots rotating, beesd dedup and load avg hitting 65+ on 12 cores setup) #283

Open
tlaurion opened this issue Jun 27, 2024 · 5 comments

Comments


tlaurion commented Jun 27, 2024

I'll revisit this issue and update this post with details as I gather more information.

First, sharing the current frozen screen: the system is stuck on iowait, writing changes to disk, after running beesd on a QubesOS system deployed with qusal. That deployment creates a lot of clones from the base minimal templates, which are then specialized by installing different packages in the derived templates, while the origins of the clones also get updated. In other words, the origins and their clones start out as identical disk images in the reflink pool, diverge over time, and bees deduplicates the extents that remain identical between them.

Notes:

  • QubesOS rotates its CoW snapshots at boot and shutdown of its Xen qubes, keeping "dirty" CoW disks while a qube runs and effectively providing revertible snapshots through its qvm-volume revert qube:volume helper. This lets the end user revert up to two past states of a qube after shutting it down (e.g. after accidentally wiping ~/), for up to two subsequent reboots of that qube, without having to fall back on backups to restore files/disk image states (see the command sketch after this list).
  • This QubesOS deployment applies the qusal salt recipes, which first download the debian-minimal and fedora-minimal templates, then clone them before specializing them for different use cases.
  • The current iowait explosion resulted from updating QubesOS templates in parallel from the command line, where the same extents are accessed by most of the template clones, leaving the system completely unresponsive. Also note that I'm using wyng-backup to push/receive deltas of block-level disk images; wyng creates snapshots of its own that stay at the last backed-up disk state, so it can easily diff the current disk image state against them and produce the changeset that gets sent to the archive as a volume.
  • As a result of combining the standard qvm-volume behavior, wyng and qusal, needless to say, the parent disk images (qube = appvm/templates/wyng snapshots) are exact same-origin point-in-time copies of a disk image that then diverge over time.
  • From testing, beesd is more efficient in this scenario with scan-mode 3, where it seems to pick up newer snapshots/files and known shared snapshots (not btrfs subvolumes here: CoW snapshot files) before rescanning old disk image files that have been rotated (renamed); because the files were moved, beesd has to rescan them, and only then sees some changes.
  • In the QubesOS context, the beesd logs contain a lot of "FIXME: too many dedup candidates" warnings. Ideally this should not be a problem, @Zygo? Is it a problem?
  • Undoing beesd's looooong work is not appealing: the 445 GB of actual disk image state prior to dedup results in approximately 100 GB of deduped consumed space on the 1.8 TB pool.
  • When deploying QubesOS, everything sits on one single btrfs filesystem, which includes the rootfs (dom0) and the reflink pool under /var/lib/qubes.
  • When setting up wyng, the tool requires working on a btrfs subvolume. wyng therefore instructs the end user to shut down all qubes, move /var/lib/qubes to /var/lib/qubes-old, create a btrfs subvolume at /var/lib/qubes, and move the content of /var/lib/qubes-old/* back under /var/lib/qubes/ (see the migration sketch after this list).
  • Neither wyng, beesd nor QubesOS is prescriptive about tuning the btrfs filesystem any further after QubesOS creates it. Note that QubesOS dom0 is based on Fedora 37 and ships btrfs-progs from that era, with no-holes, extended iref and skinny metadata turned on. I will provide details in subsequent comments.
  • I experimented a bit with /etc/fstab:
    • activating ssd mode
    • autodefrag (seems bad and counterproductive with dedup, so removed)
    • activated zstd compression, but not forced
    • none of which, as of now, improves the situation when deduped CoW volumes get rotated by the current QubesOS renaming scheme at VM power-on/power-off
    • maybe trimming gets in the way?
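
A quick sketch of the two operations referenced above (the qube name 'myqube' is a placeholder, and the migration commands simply transcribe wyng's documented procedure):

# list revisions of a qube's private volume, then revert to the newest one after shutdown
qvm-volume info myqube:private
qvm-volume revert myqube:private
# lower (or disable) snapshot rotation for that volume
qvm-volume config myqube:private revisions_to_keep 0

# migration wyng asks for: turn /var/lib/qubes into a btrfs subvolume
qvm-shutdown --all --wait
sudo mv /var/lib/qubes /var/lib/qubes-old
sudo btrfs subvolume create /var/lib/qubes
sudo mv /var/lib/qubes-old/* /var/lib/qubes/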

[Photo of the frozen console: PXL_20240627_143958263.jpg]


@tasket @Zygo: do you have guidelines on proper btrfs tuning, i.e. what is best known to work for CoW disk images in a virtualization context, and more specifically what should be tweaked for the QubesOS use case of btrfs? I'm willing to reinstall and restore from backup if needed, though from my current understanding most of this can/should be tweakable through balancing/fstab/btrfstune without a reinstall.

Any insights welcome. I repeat: if I defrag, the dedup is undone and performance goes back to normal. Thanks for your time.

@tlaurion changed the title from "Demystifying needed options with QubesOS pool in btrfs reflink (multiple cow snapshots rotating, beesd dedup and load avg hitting 26 on 12 cores setup)" to "Demystifying needed options with QubesOS pool in btrfs reflink (multiple cow snapshots rotating, beesd dedup and load avg hitting 65+ on 12 cores setup)" on Jun 27, 2024

tasket commented Jun 28, 2024

@tlaurion Having recently helped someone (and myself) with degraded Btrfs performance, two things stood out:

  • btrfs fi defrag -r -t 256K /var/lib/qubes made lagging filesystems performant again. Note this is not 'autodefrag'. For multi-terabyte filesystems, consider values larger than 256K.
  • Using fstrim can be a trigger for sudden onset of poor performance. (This also happens with tLVM, except in that case it can quickly degrade into a corrupted pool.) I'd expect deduplication to have a similar impact.

What I prescribe is pretty simple:

  • Use batch defrag on a regular basis (weekly is probably fine); a cron sketch follows this list.
  • Avoid any maintenance that creates a sudden jump in fs metadata use, like fstrim or intensive deduplication. IIRC there is at least one Qubes issue that concludes it's best to rely on a moderate filesystem 'discard' mode instead of fstrim for this reason. For dedup, if dedup is necessary, use a threshold larger than 256KB (i.e. avoid splitting data into extents that are smaller).
  • Avoid fs compression for disk images, which increases the number of extents.
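
A minimal sketch of that batch-defrag routine, as a root cron entry (the schedule, target size and path are examples to adapt, not a prescription from this thread):

# /etc/cron.d/btrfs-defrag (sketch): weekly batch defrag of the reflink pool
0 3 * * 0  root  /usr/sbin/btrfs filesystem defragment -r -t 256K /var/lib/qubes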

Also:

  • Lowering Qubes' snapshot 'revisions_to_keep' can certainly help. Regular VM use is pretty CoW-intensive as well, and the snapshot history exacerbates the issue. Setting it to '0' is fine if you do regular backups.
  • Lowering Wyng's snapshot profile is simple: running the wyng monitor command in between infrequent backups will reduce the snapshot frag footprint to near zero each time it's run. Frequent backups also have this effect. (And it's now possible to use wyng receive --use-snapshot as a kind of stand-in for Qubes' volume revert feature, having a single snapshot serve both roles.) A short sketch follows.
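
For example (a minimal sketch; '<volume>' is a placeholder for whatever volume name the wyng archive uses):

# in between infrequent backups, refresh wyng's snapshots so their frag footprint stays near zero
wyng monitor
# use the local snapshot as a stand-in for Qubes' volume revert
wyng receive --use-snapshot <volume>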

Worth trying: the 'ssd_spread' mount option, if it has any effect on metadata.

Batch defrag is by far the most important factor above, IMO. Using a margin of additional space in exchange for smooth operation seems like a small price to pay (note that other filesystems make space/frag tradeoffs automatically). Making the defrag a batch op, with days in between, gives us some of the best aspects of what various storage systems do, preserving responsiveness while avoiding the worst write-amplification effects.

'autodefrag' will make baseline performance more like tLVM and could also increase write-amplification more than other options. But it can help avoid hitting a performance wall if for some reason you don't want to use btrfs fi defrag.

Long term, I would suggest someone convince the Btrfs devs to take a hard look at the container/VM image use case so they might help people avoid these pitfalls. Qubes could also help here: if we created one subvol per VM and used subvol snapshots instead of reflinks, then the filesystem could be in 'nodatacow' mode and you would have a metadata performance profile closer to NTFS, with generally less fragmentation because not every re-write would create detached/split extents. Qubes could also create a designation for backup snapshots, including them in the revisions_to_keep count.

With that said, all the CoW filesystems have these same issues. Unless some new mathematical principle is applied to create a new kind of write-history, then the trade-offs will be similar across different formats. We also need to reflect on what deduplication means for active online systems and the degree to which it should be used; the fact that we can dedup intensively doesn't mean that practice isn't better left to archival or offline 'warehouse' roles (one of Btrfs' target use cases).


FWIW, the Btrfs volume I use most intensively has some non-default properties:

  • no-holes, skinny metadata, big metadata, mixed backref, extended iref enabled
  • xxhash64 hashing
  • 'RAID1' metadata (default for JBOD)
  • multi-device 'JBOD' mode (just multiple partitions on same drive)
  • resides on LUKS 4KB blocksize volumes
  • not the same fs as '/' root

Probably 'no-holes' has the greatest impact. I suspect the JBOD setup hurts performance slightly. The safety margins on Btrfs are such that I'd feel safe turning off RAID1 metadata if it enhanced performance. Also, I never initiate balancing (no reason to).
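
For reference, a rough sketch of creating a filesystem with a similar feature set (device paths and label are placeholders; no-holes, skinny metadata and extended iref are already on by default in recent btrfs-progs):

# sketch: multi-device 'JBOD' btrfs with RAID1 metadata and xxhash64 checksums
mkfs.btrfs --csum xxhash --metadata raid1 --data single \
    --label qubes_pool /dev/mapper/luks-a /dev/mapper/luks-b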


tlaurion commented Jul 3, 2024

Better, but not quite there yet.

Some options left at the QubesOS 4.2.1 installer's filesystem creation defaults, unchanged so far:

(130)$ sudo btrfs inspect-internal dump-super /dev/mapper/luks-5d997862-6372-4574-aa47-563060917b19
superblock: bytenr=65536, device=/dev/mapper/luks-5d997862-6372-4574-aa47-563060917b19
---------------------------------------------------------
csum_type		0 (crc32c)
csum_size		4
csum			0x7e1ea852 [match]
bytenr			65536
flags			0x1
			( WRITTEN )
magic			_BHRfS_M [match]
fsid			d6cf356b-495b-4b99-bd6d-1071f51cf1ef
metadata_uuid		00000000-0000-0000-0000-000000000000
label			qubes_dom0
generation		153330
root			3625061007360
sys_array_size		129
chunk_root_generation	90053
root_level		0
chunk_root		3158019358720
chunk_root_level	1
log_root		3625006022656
log_root_transid (deprecated)	0
log_root_level		0
total_bytes		1973612969984
bytes_used		486205415424
sectorsize		4096
nodesize		16384
leafsize (deprecated)	16384
stripesize		4096
root_dir		6
num_devices		1
compat_flags		0x0
compat_ro_flags		0x3
			( FREE_SPACE_TREE |
			  FREE_SPACE_TREE_VALID )
incompat_flags		0x371
			( MIXED_BACKREF |
			  COMPRESS_ZSTD |
			  BIG_METADATA |
			  EXTENDED_IREF |
			  SKINNY_METADATA |
			  NO_HOLES )
cache_generation	0
uuid_tree_generation	153330
dev_item.uuid		feaab371-72eb-488a-ae0a-923cf57cf6f2
dev_item.fsid		d6cf356b-495b-4b99-bd6d-1071f51cf1ef [match]
dev_item.type		0
dev_item.total_bytes	1973612969984
dev_item.bytes_used	1973611921408
dev_item.io_align	4096
dev_item.io_width	4096
dev_item.sector_size	4096
dev_item.devid		1
dev_item.dev_group	0
dev_item.seek_speed	0
dev_item.bandwidth	0
dev_item.generation	0

fstab:

# BTRFS pool within LUKSv2:
UUID=d6cf356b-495b-4b99-bd6d-1071f51cf1ef			/                       btrfs   subvol=root,x-systemd.device-timeout=0,ssd_spread,space_cache=v2 0 0 #w/o  autodefrag, w/o discard=async, w/o compress=zstd (incompatible with bees?)

Ran
sudo btrfs filesystem defragment -r -t 256K /var/lib/qubes
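
(For a quick sanity check of whether the defrag actually reduced extent counts, something like the following can be run on any large image in the pool; the path is just an example:)

sudo filefrag /var/lib/qubes/appvms/work/private.img
# prints the number of extents; compare before and after the defrag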

Still:
[Screenshot: 2024-07-03-112521]

This happens on the cp --reflink=always calls made by wyng, with beesd enforced (but not currently running in the background).
See the amount of I/O writes without reads? I'm a bit confused here about what to tweak, @Zygo.

@tasket: I thought reflink was not supposed to copy the image, but only reference the disk images.
I'm really not sure I understand what is happening, nor how to dig deeper into it.


tasket commented Jul 4, 2024

I thought reflink was not supposed to copy the image, but only reference the disk images.
I'm really not sure I understand what is happening, nor how to dig deeper into it.

Reflink copy will duplicate all the extent information in the source file's metadata to the dest file. It's not like a hard link (which is just one pointer to an inode) but usually much bigger. I am pretty sure Wyng is using reflink copy the same way the Qubes Btrfs driver is. One difference is that after making reflinks, Wyng creates a read-only subvol snapshot, reads extent metadata from it, then deletes the snapshot (when it displays "Acquiring deltas"). You might try looking at a 'top' listing during that phase to see if there is anything unusual. For volumes over a certain size (about 128GB), Wyng will use a tmp directory in /var instead of /tmp; the more complex/deduped a large volume is, the more data it will write to /var (vaguely possible it's creating your spike, but unlikely). Also check for swap activity.
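
A few generic things to watch during that phase (a sketch; these tools are not part of Wyng, and iotop may need to be installed in dom0):

# show only processes currently doing I/O
sudo iotop -o
# sample swap and writeback activity once per second
vmstat 1
# keep an eye on free space under /var while deltas are acquired
watch -n 5 df -h /var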


tasket commented Jul 4, 2024

PS: Look at btrfs subvolume list / to check whether there are any extra/stray subvolumes on that filesystem. You should see only the default one and the one you made for /var/lib/qubes.
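
For example, the output should have roughly this shape (IDs and generation numbers are illustrative and will differ):

sudo btrfs subvolume list /
# ID 256 gen 153330 top level 5 path root
# ID 257 gen 153330 top level 256 path var/lib/qubes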


rustybird commented Oct 31, 2024

@tasket:

Qubes could also help here: if we created one subvol per VM and used subvol snapshots instead of reflinks, then the filesystem could be in 'nodatacow' mode and you would have a metadata performance profile closer to NTFS, with generally less fragmentation because not every re-write would create detached/split extents.

How do snapshots make a difference here vs. reflinks? If you want nodatacow for the whole dom0 root filesystem (mounted with that option before the first .img file is created) or only for the pool directory (chattr +C while the directory is still empty), either setup will work fine with the file-reflink driver.
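
A minimal sketch of those two setups (the fstab line is illustrative, not anyone's actual config in this thread):

# option 1: mount the whole root filesystem nodatacow before any .img file exists (fstab sketch)
# UUID=<fs-uuid>  /  btrfs  subvol=root,nodatacow,x-systemd.device-timeout=0  0 0
# option 2: mark only the still-empty pool directory as no-CoW
sudo chattr +C /var/lib/qubes
lsattr -d /var/lib/qubes   # should show the 'C' attribute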

But with snapshots or with reflinks, I think the limiting factor is QubesOS/qubes-issues#8767 i.e. the inherent amount of CoW currently required by the Qubes OS storage API.
