diff --git a/content/blog/linux/bughunting/grub-bug-broke-my-servers-bootloader.md b/content/blog/linux/bughunting/grub-bug-broke-my-servers-bootloader.md new file mode 100644 index 0000000..91cf14c --- /dev/null +++ b/content/blog/linux/bughunting/grub-bug-broke-my-servers-bootloader.md @@ -0,0 +1,373 @@ +--- +title: Bug Hunting - E02 - How Grub broke my Server, Rebooting into a broken Bootloader +description: We take a deep dive into how a GRUB Bug broke my Servers Bootloader and left me frustrated and chrooting into it, in order to fix it. +date: 2024-03-12 +tags: + - bug + - 'bug hunting' + - zfs + - grub + - linux + - file system + - server +--- + +_This Article serves as a chronicle recapitulation of the events that took place over the course of last saturday ( +09.03.24)._ + +## What happened? + +My Saturday morning started rather boring with my alarm going off at 07:00 as usual. +Whilst sleepy me was browsing through the notifications I received since traversing the land of the sleeping, some mail +grabbed my attention. +My Server notified me of a broken APT upgrade. +This is nothing unusual itself, the unattended-upgrades package performs APT upgrades for the Ubuntu LTS running on my +server and notifies me in the event of any mishaps. +The major fuckup happened to me still sleepy not reading the whole message but rather only skimming it on my phone +before taking immediate action. +But we will come to that in a minute. + +{% image "./mail-reporting-apt-failure.png", "The Mail Report that started it all" %} + +So let us examine that mail together. +First of all, we see that the Linux Kernel Package received an upgrade which lead to the rebuild of the Servers +Initramfs and grub. +The latter step failed since there was no space left. +Also, we are informed that we - obviously - should reboot after the upgrade as it finished successfully, since only then +we get to apply the new kernel version. +Ok, at this point in time I decided to quickly pull out my laptop from the nightstand ssh into my server, fix the error, +reapply the package upgrade (therefore initramfs and grub update) and reboot. +Should be as easy as this - right? +Spoiler: **No** +This took me way longer in the end... + +## My Fuckup + +I ssh'ed into my Server and validated that the boot partition was indeed out of space. +The reason was quite easy, I utilize zfs-auto-snapshot as a service, but did not care to configure it properly. +Since the boot partition - as automatically created via Ubuntus Installer - is quite small, the automatic Snapshots of +the corresponding Pool and Datasets quickly fill it up with every kernel upgrade. +I validated this quickly via Shell (the command output below is recreated therefore showing the current and correct +space usage). + +{% highlight "bash" %} +$ zpool list +NAME SIZE ALLOC FREE CKPOINT EXPANDSZ FRAG CAP DEDUP HEALTH ALTROOT +bpool 1.88G 316M 1.57G - - 1% 16% 1.00x ONLINE - +rpool 920G 77.4G 843G - - 15% 8% 1.00x ONLINE - +xypool 3.62T 1.71T 1.92T - - 0% 47% 1.00x ONLINE - +{% endhighlight %} + +Snapshots of the boot partition do not contain any relevant data or serve any additional purpose. +I mostly utilize Snapshots for easy rollbacks and recovery of deleted files on my data partition (xypool/data) as well +as the Docker containers running all relevant services on the root (rpool). +Therefore, I decided the easiest way out would be to purge all snapshots of the bpool. + +{% highlight "bash" %} +$ zfs list -H -o name -t snapshot bpool/BOOT/ubuntu_ww2tf2 | xargs -n1 zfs destroy +$ zfs list -H -o name -t snapshot bpool/BOOT/ | xargs -n1 zfs destroy +$ zfs list -H -o name -t snapshot bpool/ | xargs -n1 zfs destroy +{% endhighlight %} +_Contrary to the Syntax Highlighting, these commands are executed as root, zfs assumes you know what you are doing, +there is no way back._ + +Ok fine, now that there is sufficient space, let us rerun `apt upgrade`. +Finished without visible errors on first sight (hint for the fuckup) aaaand `sudo reboot`. + +## Everything is in Shambles + +As I am used to reboots taking some time (fastboot is disabled anyway, grub has a long timeout) I usually ping my +server. + +{% highlight "bash" %} +$ ping 192.168.178.5 +PING 192.168.178.5 (192.168.178.5) 56(84) bytes of data. +From 192.168.178.40 icmp_seq=1 Destination Host Unreachable +From 192.168.178.40 icmp_seq=2 Destination Host Unreachable +From 192.168.178.40 icmp_seq=3 Destination Host Unreachable +{% endhighlight %} + +Well that did not look good, as it is necessary to supply the zfs key for the rpool (root partition) in order to boot I +utilize a kvm which came in handy. +But what I saw through the HDMI Screens stream, did freak me out. +My urge to quickly rush through the required steps to fixup the server, in order to get some breakfast, coffee and go +bouldering now fell back on me. + +{% image "./grub-rescue-shell.png", "Symbolic Image Grub Rescue Shell" %}Symbolic Image [1] + +The particularly attentive out of you may have already noticed the second error. +My mails screenshot above contains a scrollbar, which might reveal important information. + +{% highlight "bash" %} +Processing triggers for linux-image-6.5.0-25-generic (6.5.0-25.25~22.04.1) ... +/etc/kernel/postinst.d/initramfs-tools: +update-initramfs: Generating /boot/initrd.img-6.5.0-25-generic +/etc/kernel/postinst.d/zz-update-grub: +Sourcing file `/etc/default/grub' +Sourcing file `/etc/default/grub.d/init-select.cfg' +Generating grub configuration file ... +Found linux image: vmlinuz-6.5.0-25-generic in rpool/ROOT/ubuntu_ww2tf2 +Found initrd image: initrd.img-6.5.0-25-generic in rpool/ROOT/ubuntu_ww2tf2 +Found linux image: vmlinuz-6.5.0-21-generic in rpool/ROOT/ubuntu_ww2tf2 +Found initrd image: initrd.img-6.5.0-21-generic in rpool/ROOT/ubuntu_ww2tf2 +Found linux image: vmlinuz-6.2.0-37-generic in rpool/ROOT/ubuntu_ww2tf2 +Found initrd image: initrd.img-6.2.0-37-generic in rpool/ROOT/ubuntu_ww2tf2 +[...] +/usr/sbin/grub-probe: error: compression algorithm inherit not supported +. +Memtest86+ needs a 16-bit boot, that is not available on EFI, exiting +Warning: os-prober will not be executed to detect other bootable partitions. +Systems on them will not be added to the GRUB boot configuration. +Check GRUB_DISABLE_OS_PROBER documentation entry. +Adding boot menu entry for UEFI Firmware Settings ... +done +{% endhighlight %} + +And there it was, right in front of my eyes, the cryptic error +message `grub-probe: error: compression algorithm inherit not supported`. + +## An annoying GRUB Bug + +After long traversing trough loads of issues I found the culprit.[4][5] + +{% highlight "bash" %} +grub-core/fs/zfs/zfs.c:3395:zfs: endian = 1 +grub-core/fs/zfs/zfs.c:3170:zfs: endian = 1 +grub-core/fs/zfs/zfs.c:1885:zfs: zio_read: E 0: size 0/512 +grub-core/kern/fs.c:79:fs: error: compression algorithm inherit not supported +{% endhighlight %}[5] + +The underlying root cause is that a nasty bug in grub leads to grub-prober (grubs detection mechanism for boot +partitions) to no longer detect the zfs boot partition if the top level pool is snapshotted. +As you have witnessed above, I did deploy regular snapshots on a regular basis. +What frightens me is the fact, that this bugs seems not to occur every time since I do snapshots of bpool since the +server is up and running and were confronted by this bug only months later. + +## The Mitigation + +Ok now that I knew the issue, the real work of fixing my broken bootloader started. +I moved over to the pivkm and mounted an image of ubuntu I could boot into. +Or so I thought, I should have more carefully investigated the issue and read the proposed solutions. +To be fair, this was 20 minutes into the chaos and i was far from being awake or in reach of my first coffee. + +{% image "./192.168.178.55_kvm_.png", "pikvm Web GUI" %} + +From there it was the usual hustle of chrooting into an encrypted zfs partition. +I utilized this Guide [6] as a base and adapted it to my Setup as well as an native encrpyted zfs install, which takes +some additional hoops to jump through. + +{% highlight "bash" %} + +# import root pool + +zpool import -f rpool -R /mnt + +# ensure root pool datasets are mounted + decrypted + +cryptsetup open /dev/zvol/rpool/keystore zfskey + +# proceeding to mount the key via nautilus (file explorer) + +cat /media/ubuntu/keystore-rpool/system.key | zfs load-key -L prompt rpool + +# mount all datasets + +zfs mount -a + +# import boot pool + +zpool import -f bpool -R /mnt + +# mount EFI Partition + +mount -t msdos /dev/nvme0n1p1 /mnt/boot/efi + +# mount special file systems + +for i in proc dev sys dev/pts; do mount -v --bind /$i /mnt/$i; done + +# bind mount grub folder in efi partition + +mount -v --bind /mnt/boot/efi/grub /mnt/boot/grub + +# mount efivars for grub reinstall + +mount --bind /sys/firmware/efi/efivars /mnt/sys/firmware/efi/efivars + +# chroot into FS + +chroot /mnt /bin/bash +{% endhighlight %} + +Jumping into the servers shell I first had to finish DNS resolution to get working internet access (especially since the +DNS server my router was pointing to should have been running on the very server I was working from). +So i proceeded to reinstall grub, rebuilt initramfs and grub. + +{% highlight "bash" %} + +# fix dns resolution + +vim /etc/resolv.conf # add 'nameserver 8.8.8.8' + +# proceed to upadte initramfs + +update-initramfs -uvk all + +# update bootloader + +update-grub + +# reinstall grub-packages + +apt update +apt --reinstall install grub-common grub-efi-amd64-bin grub-efi-amd64-signed os-prober + +# install grub + +grub-install --bootloader-id=ubuntu --recheck --target=x86_64-efi --efi-directory=/boot/efi --no-floppy +{% endhighlight %} + +After then backing out of the chroot, I prepared the system for shutdown. + +{% highlight "bash" %} + +# unmount special FS + +for i in proc dev/pts dev sys boot/grub; do umount -v /mnt/$i; done + +# remove EFI Partition + efi vars + +umount -v /dev/nvme0n1p1 +umount /mnt/sys/firmware/efi/efivars + +# unmount zfs pools + +zpool export bpool +zfs umount -a + +# ensure to umount keystore before + +zpool export rpool +{% endhighlight %} + +After a reboot I was struck, once again with the grub rescue shell - I will spare you the symbolic picture again. +That could not be right? +After more careful reading I realized that the proposed solution was to recreate the boot pool. +But hey at least this was a good exercise in desaster chrooting. +As I was not willing to go through that process again I decided to bypass grubs configuration and manually boot my +server from the rescue shell. +This seemed a viable option as I was certain that the boot partition contained the correct entries. + +{% highlight "bash" %} +set root=(hd3,gpt3) +linux /BOOT/ubuntu_ww2tf2/@/vmlinuz root=ZFS=rpool/ROOT/ubuntu_ww2tf2 boot=zfs +initrd /BOOT/ubuntu_ww2tf2/@/initrd.img +boot +{% endhighlight %} + +After some minutes of figuring the partition layout out later, I booted into my server. +There I straight up went to work, backing up my boot partition, deleting the pool and dataset. + +{% highlight "bash" %} +ll / | grep boot +-rw-r--r-- 1 root root 300M Mär 9 16:48 backup-boot-partition-before-zfs-problems-grub.tgz +drwxr-xr-x 4 root root 19 Mär 14 07:37 boot +{% endhighlight %} + +Then I went into recreating it with the proposed mitigating parameters[7] combined with the default parameters for +ubuntu 22.04 taken from the openzfs documentation[8]. +I was aware that this was highly experimental, especially cannonical does not utilize the default openzfs configuration +oftentimes. +But my backup plan was to utilize zfs boot manager until the grub bugfix was present in my Ubuntu version (which I was +aware could potentially mean never, since people reported issues with this constellation for years). + +{% highlight "bash" %} + +# create pool (exemplary did not recover command) + +zpool create \ +-o ashift=12 \ +-o autotrim=on \ +-o cachefile=/etc/zfs/zpool.cache \ +-o compatibility=grub2 \ +-o feature@livelist=enabled \ +-o feature@zpool_checkpoint=enabled \ +-O devices=off \ +-O acltype=posixacl -O xattr=sa \ +-O compression=lz4 \ +-O normalization=formD \ +-O relatime=on \ +-o feature@extensible_dataset=disabled \ +-o feature@bookmarks=disabled \ +-o feature@filesystem_limits=disabled \ +-o feature@large_blocks=disabled \ +-o feature@large_dnode=disabled \ +-o feature@sha512=disabled \ +-o feature@skein=disabled \ +-o feature@edonr=disabled \ +-o feature@userobj_accounting=disabled \ +-o feature@encryption=disabled \ +-o feature@project_quota=disabled \ +-o feature@obsolete_counts=disabled \ +-o feature@bookmark_v2=disabled \ +-o feature@redaction_bookmarks=disabled \ +-o feature@redacted_datasets=disabled \ +-o feature@bookmark_written=disabled \ #some properties needed to be removed +-o feature@livelist=disabled \ #due to incompatibility with my package versions +-o feature@zstd_compress=disabled \ +-o feature@head_errlog=disabled \ +-O canmount=off -O mountpoint=/boot -R /mnt \ +bpool ${DISK}-part3 + +# create dataset used for partition with mountpoint (exemplary did not recover command) + +zfs create -o canmount=off -o mountpoint=none rpool/ROOT +zfs create -o mountpoint=/boot bpool/BOOT/ubuntu_$UUID +{% endhighlight %}[7][8] + +From there I restored my bootloader backup and rebuilt the initramfs as well as grub just to be sure. +Also, I checked that grub-prober did indeed detect the partition without erros, which worked out fine. +Additionally, I made sure, that I will never be plagued by this bug again, by disabling auto snapshots by property for +the pool and datasets. + +{% highlight "bash" %} + +# rebuild initramfs + +update-initramfs -c -k all + +# rebuild grub + +update-grub + +# check grub-prober works as intended + +grub-probe /boot + +# disable snapshotting on pool and datasets + +zfs set com.sun:auto-snapshot=false bpool +zfs set com.sun:auto-snapshot=false bpool/BOOT +zfs set com.sun:auto-snapshot=false bpool/BOOT/ubuntu_ww2tf2 +{% endhighlight %} + +After a reboot later my server came up without any problems - finally! +And this is how 10 minutes of work, turned into hours of debugging once again. + +--- + +[1] - how to use grub rescue mode, olinux.net +[2] - grub-probe ("suddenly") fails with " +algorithm inherit not supported" #15261 , GitHub Openzfs +[3] - snapshotting +top-level "bpool" filesystem causes grub to fail #13873, GitHub Openzfs +[4] - grub-probe: error: +compression algorithm inherit not supported , Bugs Launchpad +[5] - update-grub giving errors +and apparently not locating /boot on correct zfs pool after upgrade to Ubuntu Mantic, Bugs Launchpad +[6] - +Mount Ubuntu 22.04 ZFS partitions using live ISO for disaster recovery, Developer Monkey +[7] - +Comment 9 for bug 2041739, Bugs Launchpad +[8] - +Ubuntu 22.04 Root on ZFS, openzfs documentation \ No newline at end of file diff --git a/content/blog/linux/bughunting/grub-rescue-shell.png b/content/blog/linux/bughunting/grub-rescue-shell.png new file mode 100644 index 0000000..c2b9d86 Binary files /dev/null and b/content/blog/linux/bughunting/grub-rescue-shell.png differ diff --git a/content/blog/linux/bughunting/mail-reporting-apt-failure.png b/content/blog/linux/bughunting/mail-reporting-apt-failure.png new file mode 100644 index 0000000..0b15064 Binary files /dev/null and b/content/blog/linux/bughunting/mail-reporting-apt-failure.png differ