Rust 2020 - helping to get Rust deployed

This is my contribution to Rust 2020, where community members put forward ideas on what they think Rust should aim to achieve in 2020.

In my view, Rust has had an amazing adoption by developers, and is great if you are in a position to deploy it in your own infrastructure, but we have yet to really see Rust make it into broad low-level components (i.e. in a Linux distro or other infrastructure).

As someone who works on “enterprise” software (389-ds) and my own IDM project (kanidm), there is a need to have software packaged and distributed. We cannot ask our consumers to build and compile these tools. One could view it as a chain, where I develop software in a language, it’s packaged by a company (like SUSE), and then consumed by a customer (could be anyone!) who provides a service to others (indirect users).

Rust however has always been modelled as though there is no “middle” section. You either have a developer whose intent is to develop for other developers - this is where Rust ideas like crates.io become involved. Alternately, you have the larger example of firefox, where developers build a project and can “bundle” everything into a whole unit that is then distributed directly to customers.

The major difference is that in the intermediate distribution case, we have to take on different responsibilities such as security auditing, building, ensuring dependencies exist etc.

So this leads me to:

1: Cargo Vendor Needs Some Love

Cargo vendor today is quite confusing in some scenarios, and it’s not clear how to make it work for projects that require offline builds. I have raised issues about this, but largely they have not been acted upon.
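
For context, the kind of workflow I’m describing is roughly this (a sketch only - check the output of cargo vendor itself, as the exact config it prints can differ between cargo versions):

# vendor all dependencies into ./vendor in the source tree
cargo vendor
# copy the [source] replacement config that cargo vendor prints into .cargo/config,
# then on the isolated build host, build without network access:
cargo build --offline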

2: Cargo is Difficult to Use in Mixed Language Projects

A large value proposition of Rust is the ability to use it with FFI and C. This is great if, say, you have cargo compile your C code for you.

But most major existing projects don’t. They use autotools, cmake, maybe meson, or something more esoteric like waf (looking at you, samba). Cargo’s extreme opinionation in this area makes it very difficult to integrate Rust into an existing build system reliably. It’s hard to point to one fault; it’s more a broad “lack of concern” in the space, and I think cargo needs to evolve improvements to allow Rust to be used from other build tools.
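
As a sketch of what I mean, this is roughly how an autotools or cmake project has to drive cargo today - the paths and library name here are hypothetical:

# invoked by the existing build system as an external build step
cargo build --release \
    --manifest-path "$SRCDIR/rust/Cargo.toml" \
    --target-dir "$BUILDDIR/rust"
# then hand the resulting staticlib back to the C linker
cp "$BUILDDIR/rust/release/libexample.a" "$BUILDDIR/.libs/"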

3: Rust Moves Too Fast

A lot of “slower” enterprise companies want to move slowly, including compiler versions. Admittedly, this conservative behaviour is because of the historical instability of gcc versions and how it can change or affect your code between releases. Rust doesn’t suffer this, but people are still wary of fast version changes. This means Rustc + Cargo will get pinned to some version that may be 6 months old.

However crate authors don’t consider this - they will use the latest and greatest features from stable (and sometimes still nightly … grrr) in releases. Multiple times I have found that in my development environment, even with a 3 month old compiler, dependencies won’t build.

Compounding this, crates.io doesn’t distinguish a security release from a feature one. Crates also encourage continual version bumping, rather than maintenance of versioned branches. I.e. version 0.4.3 of a crate with a security fix will become 0.4.4, but then a feature update to include try_from may go to 0.4.5 as it “adds” to the api, or because they use it internally as a cleanup.
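
As a consumer, about the best you can do today is pin the exact versions you have reviewed - the crate name and versions here are only illustrative:

# pin a dependency in Cargo.lock to the known security-fix release
cargo update -p examplecrate --precise 0.4.4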

Part of this issue is that Rust applications need to be treated more like docker images: static whole units where only the resulting binary is supported, rather than the toolchain that built it. But that only works for pure Rust applications - any mixed C + Rust application will hit this issue due to the difference between the system Rust version and what crate dependencies publish.

So I think that from this it leads to:

3.1: Crates need to indicate a minimum supported compiler version

Rust has “toyed” with the idea of editions, but within 2018 we’ve seen new features like MaybeUninit and try_from land which, within an “edition”, caused crates to stop working on older compilers.

As a result, I think editions are “too broad”: people will fear incrementing them, and Rust will add features without changing edition anyway. Instead Rust needs to consider following up on the minimum supported rust version flag RFC. Based on crate developers and what they are releasing, it’s pretty clear the only “edition” flag that matters is the rust compiler version.
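
Until something like that lands, about all a crate author can do is actually build and test against the oldest compiler they claim to support, for example via rustup (the version here is only illustrative):

# install the claimed minimum supported compiler and test with it
rustup toolchain install 1.34.0
cargo +1.34.0 test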

3.2: Rust Needs to Think “What’s Our End Goal”

Rust is still moving incredibly fast, and I think in a way we need to ask … when will Rust be finished? When will it as a language go from continually rapid growth to stable and known feature sets? The idea of Rust editions acts as though this has happened (saying we change only every few years) when this is clearly not the case. Rust is evolving release-on-release, every 6 weeks.

4: Zero Cost Needs to Factor in Human Cost

My final wish for Rust is that sometimes we are so obsessed with the technical desire for zero cost abstractions that we forget the high human cost and barriers that can exist as a result, making feature adoption challenging. Rust has a great community that treats people very well, and I think sometimes we need to extend that into feature development, to really consider the human cognitive cost of a feature.

Summarised - what’s the benefit of a zero cost abstraction if people can not work out how to use it?

Summary

I want to see Rust become a major part of operating systems and how we build computer systems, but I think that we need to pace ourselves, improve our tooling, and have some better ideas around what Rust should look like.

Recovering LVM when a device is missing with a cache pool lv

I had a heart-stopping moment today: after running a command, lvm proudly announced it had removed an 8TB volume containing all of my virtual machine backing stores.

Everyone, A short view back to the past …

I have a home server, with the configured storage array of:

  • 2x 8TB SMR (Shingled Magnetic Recording) archive disks (backup target)
  • 2x 8TB disks (vm backing store)
  • 2x 1TB nvme SSD (os + cache)

The vm backing store also had an lvm cache segment via the nvme ssds in a raid 1 configuration. This means that the 2x 8TB drives are in raid 1, a partition on each of the nvme devices is in raid 1, and then the two are composed to allow the nvme to cache blocks from/to the 8TB array.

Two weeks ago I noticed one of the nvme drives was producing IO errors indicating a fault of the device. Not wanting to risk corruption or other issues from growing out of hand, I immediately shutdown the machine and identified the nvme disk with the error.

At this stage I took the precaution of imaging (dd) both the good and bad nvme devices to the archive array. Subsequently I completed a secure erase of the faulty nvme drive before returning it to the vendor for RMA.

I then left the server offline, as I was away from home for more than a week and would not need it, and I was unable to monitor whether the other drives would produce further errors.

Returning home …

I decided to ignore William of the past (always a bad idea) and to “break” the raid on the remaining nvme device so that my server could operate allowing me some options for work related tasks.

This is an annoying process in lvm - you need to remove the missing device from the volume group, as well as indicating to the array that it should no longer be in a raid state. The vgreduce step is only meant to remove missing PVs; it shouldn’t be doing anything else.

I initiated the raid break process on the home, root and swap devices. The steps are:

vgreduce --removemissing <vgname>
lvconvert -m0 <vgname>/<lvname>

This occurred without fault due to these lvs being on an isolated “system” volume group, so the partial lvs were untouched and left on the remaining pv in the vg.

When I then initiated this process on the “data” vg which contained the libvirt backing store, vgreduce gave me the terrifying message:

Removing logical volume "libvirt_t2".

Oh no ~

Recovery Attempts

When a logical volume is removed, it can be recovered as lvm stores backups of the LVM metadata state in /etc/lvm/archive.
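
These archives are plain text and easy to inspect, for example (using the archive name from my data vg):

# list the metadata archives, newest last
ls -ltr /etc/lvm/archive/
# each archive records the vg layout and the command that triggered the backup
head /etc/lvm/archive/data_00001-2096569583.vg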

My first reaction was that, since I was on a live disk, I needed to back up this content or it would be lost on reboot. I chose to put it on the unaffected and healthy SMR archive array.

mount /dev/mapper/archive-backup /archive
cp -a /etc/lvm /archive/lvm-backup

At this point I knew that randomly attempting commands would cause further damage and likely prevent any ability to recover.

The first step was to activate ssh so that I could work from my laptop - rather than the tty with keyboard and monitor on my floor. It also means you can copy paste, which reduces errors. Remember, I’m booted on a live usb, which is why I reset the password.

# Only needed in a live usb.
passwd
systemctl start sshd

I then formulated a plan and wrote it out. This helps to ensure that I’ve thought through the recovery process and the risks, and it helps me to be fully aware of the situation.

vim recovery-plan.txt

Into this I laid out the commands I would follow. Here is the plan:

bytes 808934440960
data_00001-2096569583

dd if=/dev/zero of=/mnt/lv_temp bs=4096 count=197493760
losetup /dev/loop10 /mnt/lv_temp
pvcreate --restorefile /etc/lvm/archive/data_00001-2096569583.vg --uuid iC4G41-PSFt-6vqp-GC0y-oN6T-NHnk-ivssmg /dev/loop10
vgcfgrestore data --test --file /etc/lvm/archive/data_00001-2096569583.vg

Now to explain this: The situation we are in is:

  • We have a removed data/libvirt_t2 logical volume
  • The VG data is missing a single PV (nvme0). It still has three PVs (nvme1, sda1, sdb1).
  • We can not restore metadata unless all devices are present as per the vgcfgrestore man page.

This means, we need to make a replacement device to replace into the array, and then to restore the metadata with that.

The “bytes” section you see is the size of the missing nvme0 partition that was a member of this array - we need to create a loopback device of the same or greater size to allow us to restore the metadata (dd, losetup).
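
The dd count above is simply that size in bytes divided by the block size:

echo $(( 808934440960 / 4096 ))   # 197493760, the count used for dd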

Once the loopback is created, we can then recreate the pv on the loopback device with the same UUID as the missing device.

Once this is present, we can now restore the metadata as documented which should contain the logical volume.

I ran these steps and it was all great until vgcfgrestore. I can not remember the exact error but it was along the lines of:

Unable to restore metadata as PV was missing for VG when last modification was performed.

Yep, the vgreduce command has changed the VG state, triggering a metadata backup, but because a device was missing at the time, we can not restore this metadata.

Options …

At this point I had to consider alternate options. I conducted research into this topic to see if others had encountered this case (apparently no one has ever been unable to restore their metadata in this situation …). The options that I arrived at:

    1. Restore the metadata from the nvme /root as it has older (but known) states - however I had recently expanded the libvirt_t2 volume from a live disk, meaning it may not have the correct part sizes.
    2. Attempt to extract the xfs filesystem with dd from the disk to begin a data recovery.
    3. Cry in a corner.
    4. Use lvcreate with the “same parameters” and hope that it aligns the start at the same location as the former data/libvirt_t2, allowing the xfs filesystem to be accessed.

All of these weren’t good - especially not 3.

I opted to attempt solution 1, and then if that failed, I would disconnect one of the 8TB disks, attempt solution 4, then if that ALSO failed, I would attempt 2, and with all else lost I would begin solution 3. The major risk of relying on 4 and 2 is that LVM has dynamic geometry on disk; it does not always allocate contiguously. This means that attempting 4 with lvcreate may not create the volume with the same geometry, and it may write to incorrect locations causing data loss. The risk of 2 was, again, that due to the dynamic geometry what we recover may be re-arranged and corrupt.

This meant option 1 was the best way to proceed.

I mounted the /root volume of the host and using the lvm archive I was able to now restore the metadata.

vgcfgrestore data --test --file /system/etc/lvm/archive/data_00004-xxxx.vg

Once completed I performed an lvscan to refresh what block devices were available. I was then shown that every member of the VG data had a conflicting seqno, and that the metadata was corrupt and unable to proceed.

Somehow we’d made it worse :(

Successful Procedure

At this point, faced with 3 options that were all terrible, I started to do more research. I finally discovered a post describing that the lvm metadata is stored on disk in the same format as the .vg files in the archive, and it’s a ring buffer. We may be able to restore from these.

To do so, you must dd out of the disk into a file, and then manipulate the file to only contain a single metadata entry.

Remember how I made images of my disks before I sent them back? This was their time to shine.

I did make a recovery plan with these commands too, but it was more of an evolving document due to the parameters involved, so it changed frequently with the offsets. The plan was very similar to above - use a loop device as a stand-in for the missing block device, restore the metadata, and then go from there.

We know that LVM metadata occurs in the first section of the disk, just after the partition start. So to work out where this is we use gdisk to show the partitions in the backup image.

# gdisk /mnt/mion.nvme0n1.img
GPT fdisk (gdisk) version 1.0.4
...

Command (? for help): p
Disk /mnt/mion.nvme0n1.img: 2000409264 sectors, 953.9 GiB
Sector size (logical): 512 bytes
...

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048         1026047   500.0 MiB   EF00
   2         1026048       420456447   200.0 GiB   8E00
   3       420456448      2000409230   753.4 GiB   8E00

It’s important to note the sector size flag, as well as the fact the output is in sectors.

The LVM header occupies 255 sectors after the start of the partition. With this in mind, we can now create a dd command to extract the needed information.

dd if=/mnt/mion.nvme0n1.img of=/tmp/lvmmeta bs=512 count=255 skip=420456448

bs sets the block size to 512 (matching the sector size), count reads 255 blocks of size ‘bs’, and skip says to skip ‘skip’ blocks of size ‘bs’ before reading.

At this point, we can now copy this and edit the file:

cp /tmp/lvmmeta /archive/lvm.meta.edit

Within this file you can see the ring buffer of lvm metadata. You need to find the highest seqno that is a complete record. For example, my seqno = 20 was partial (is the lvm metadata area longer than 255 sectors? please contact me if you know!), but seqno = 19 was complete.
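
A quick way to see which seqno values are present in the dump is grep (-a because the file contains binary padding):

grep -a 'seqno' /archive/lvm.meta.edit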

Here is the region:

# ^ more data above.
}
# Generated by LVM2 version 2.02.180(2) (2018-07-19): Mon Nov 11 18:05:45 2019

contents = "Text Format Volume Group"
version = 1

description = ""

creation_host = "linux-p21s"    # Linux linux-p21s 4.12.14-lp151.28.25-default #1 SMP Wed Oct 30 08:39:59 UTC 2019 (54d7657) x86_64
creation_time = 1573459545      # Mon Nov 11 18:05:45 2019

^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@data {
id = "4t86tq-3DEW-VATS-1Q5x-nLLy-41pR-zEWwnr"
seqno = 19
format = "lvm2"

So from there you remove everything above the “contents = …” line, and clean up the vgname header. It should look something like this.

contents = "Text Format Volume Group"
version = 1

description = ""

creation_host = "linux-p21s"    # Linux linux-p21s 4.12.14-lp151.28.25-default #1 SMP Wed Oct 30 08:39:59 UTC 2019 (54d7657) x86_64
creation_time = 1573459545      # Mon Nov 11 18:05:45 2019

data {
id = "4t86tq-3DEW-VATS-1Q5x-nLLy-41pR-zEWwnr"
seqno = 19
format = "lvm2"

Similarly, you then need to find the bottom of the segment (look for the next highest seqno) and remove everything below the line “# Generated by LVM2 …”.

Now, you can import this metadata to the loop device for the missing device. Note I had to wipe the former lvm meta segment due to the previous corruption, which caused pvcreate to refuse to touch the device.

dd if=/dev/zero of=/dev/loop10 bs=512 count=255
pvcreate --restorefile lvmmeta.orig.nvme1.edited --uuid iC4G41-PSFt-6vqp-GC0y-oN6T-NHnk-ivssmg /dev/loop10

Now you can do a dry run:

vgcfgrestore --test -f lvmmeta.orig.nvme1.edited data

And the real thing:

vgcfgrestore -f lvmmeta.orig.nvme1.edited data
lvscan

Hooray! We have volumes! Let’s check them, and ensure their filesystems are sane:

lvs
lvchange -ay data/libvirt_t2
xfs_repair -n /dev/mapper/data-libvirt_t2

If xfs_repair says no errors, then go ahead and mount!

At this point, lvm started to resync the raid, so I’ll leave that to complete before I take any further action to detach the loopback device.
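
You can watch the resync progress with something like:

# Cpy%Sync reaching 100% means the raid1 legs are consistent again
lvs -a -o name,sync_percent data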

How to Handle This Next Time

The cause of this issue really comes from vgreduce --removemissing removing the device when a cache member can’t be found. I plan to report this as a bug.

However another key challenge was the inability to restore the lvm metadata when the metadata archive reported a missing device. This is what stopped me from being able to restore the array in the first place, even though I had a “fake” replacement. This is also an issue I intend to raise.

Next time I would (sketched as commands after this list):

  • Activate the array as a partial
  • Remove the cache device first
  • Then stop the raid
  • Then perform the vgreduction
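
As commands, that plan would look roughly like this (using the names from my setup - untested, since I hope to never need it):

vgchange -ay --activationmode partial data
lvconvert --uncache data/libvirt_t2
lvconvert -m0 data/libvirt_t2
vgreduce --removemissing data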

I really hope this doesn’t happen to you!

Upgrading OpenSUSE 15.0 to 15.1

It’s not immediately obvious how to do this. You have to edit the repo files to change the release version, then refresh + update.

sed -ri 's/15\.0/15.1/' /etc/zypp/repos.d/*.repo
zypper ref
zypper dup
reboot

Note this works on a transactional host too:

sed -ri 's/15\.0/15.1/' /etc/zypp/repos.d/*.repo
transactional-update dup
reboot

It would be nice if there was an upgrade tool that would attempt the upgrade and revert the repo files, or use temporary repo files for the upgrade. That would be a nicer user experience than sed-ing the repo files.

Announcing Kanidm - A new IDM project

Today I’m starting to talk about my new project - Kanidm. Kanidm is an IDM project designed to be correct, simple and scalable. As an IDM project we should be able to store the identities and groups of people, authenticate them securely to various other infrastructure components and services, and much more.

You can find the source for kanidm on github.

For more details about what the project is planning to achieve, and what we have already implemented please see the github.

What about 389 Directory Server

I’m still part of the project, and working hard on making it the best LDAP server possible. Kanidm and 389-ds have different goals. 389 Directory Server is a globally scalable, distributed database that can store huge amounts of data and process thousands of operations per second. 389-ds lets you build a system on top, in any way you want. If you want an authentication system today, use 389-ds. We are even working on a self-service web portal soon too (one of our most requested features!). Besides myself, no one on the (amazing) 389 DS team has any association with kanidm (yet?).

Kanidm is an opinionated IDM system, and has strong ideas about how authentication and users should be processed. We aim to be scalable, but that’s a long road ahead. We also want to have more web integrations, client tools and more. We’ll eventually write a kanidm to 389-ds sync tool.

Why not integrate something with 389? Why something new?

There are a lot of limitations with LDAP when it comes to modern web-focused auth processes such as webauthn. Because of this, I wanted to make something that didn’t have the same limitations, and that had different ideas about data storage and APIs. That’s why I wanted to make something new in parallel. It was a really hard decision to make something outside of 389 Directory Server (because I really do love the project, and have great pride in the team), but I felt it was going to be more productive to build in parallel than on top.

When will it be ready?

I think that a single-server deployment will be usable for small installations early 2020, and a fully fledged system with replication would be late 2020. It depends on how much time I have and what parts I implement in what order. Current rough work order (late 2019) is indexing, RADIUS integration, claims, and then self-service/web ui.

OpenSUSE Leap as a virtualisation host

I’ve been rebuilding my network to use SUSE from CentOS, and the final server was my hypervisor. Most of the reason for this is the change in my employment, so I feel it’s right to dogfood for my workplace.

What you will need

  • Some computer parts (assembly may be required)
  • OpenSUSE LEAP 15.1 media (dd if=opensuse.iso of=/dev/a_usb_i_hope)

What are we aiming for?

My new machine has dual NVME and dual 8TB spinning disks. The intent is to have the OS on the NVME and to have a large part of the NVME act as LVM cache for the spinning disk. The host won’t run any applications beside libvirt, and has to connect a number of vlans over a link aggregation.

Useful commands

Throughout this document I’ll assume some details about your devices and partitions. To find your own, and to always check and confirm what you are doing, these commands will help:

lsblk  # Shows all block storage devices, and how (if) they are mounted.
lvs  # shows all active logical volumes
vgs  # shows all active volume groups
pvs  # shows all active physical volumes
dmidecode  # show hardware information
ls -al /dev/disk/by-<ID TYPE>/  # how to resolve disk path to a uuid etc.

I’m going to assume you have devices like:

/dev/nvme0  # the first nvme of your system that you install to.
/dev/nvme1  # the second nvme, used later
/dev/sda    # Two larger block storage devices.
/dev/sdb

Install

Install and follow the prompts. Importantly, when you install, install to a single NVME only, and choose transactional server + lvm, then put btrfs on the /root in the lvm. You want to partition such that there is still free space on the NVME - I left about 400GB unpartitioned for this. In other words, the disk should look like:

[ /efi | pv + vg system               | pv (unused) ]
       | /root (btrfs), /boot, /home  |

Remember to leave about 1GB of freespace on the vg system to allow raid 1 metadata later!
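
After the install you can confirm the free space is there with:

# VFree should show roughly the space you left unallocated
vgs -o vg_name,vg_size,vg_free system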

Honestly, it may take you a try or two to get this right with YaST, and it was probably the trickiest part of the install.

You should also select that network management is via networkmanager, not wicked. You may want to enable ssh here. I personally disabled the firewall because there are no applications on this host and it interferes with the bridging for the vms.

Packages

Because this is suse transactional we need to add packages and reboot each time. Here is what I used, but you may find you don’t need everything here:

transactional-update pkg install libvirt libvirt-daemon libvirt-daemon-qemu \
  sssd sssd-ad sssd-ldap sssd-tools docker zsh ipcalc python3-docker rdiff-backup \
  vim rsync iotop tmux fwupdate fwupdate-efi bridge-utils qemu-kvm apcupsd

Reboot, and you are ready to partition.

Partitioning - post install

First, copy your gpt from the first NVME to the second. You can do this by hand with:

gdisk /dev/nvme0
p
q

gdisk /dev/nvme1
c
<duplicate the parameters as required>

Now we’ll make your /efi at least a little redundant

mkfs.fat /dev/nvme1p0
ls -al /dev/disk/by-uuid/
# In the above, look for your new /efi fs, IE CE0A-2C1D -> ../../nvme1n1p1
# Now add a line to /etc/fstab like:
UUID=CE0A-2C1D    /boot/efi2              vfat   defaults                      0  0

Now to really make this work, because it’s transactional, you have to make a change to the /root, which is readonly! To do this run

transactional-update shell dup

This puts you in a shell at the end. Run:

mkdir /boot/efi2

Now reboot. After the reboot your second efi should be mounted. rsync /boot/efi/* to /boot/efi2/. I leave it to the reader to decide how to sync this periodically.
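
A simple one-shot sync looks like this - how you schedule it is up to you:

rsync -av --delete /boot/efi/ /boot/efi2/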

Next you can setup the raid 1 mirror for /root and the system vg.

pvcreate /dev/nvme1p1
vgextend system /dev/nvme1p1

Now we have enough pvs to make a raid 1, so we convert all our volumes:

lvconvert --type raid1 --mirrors 1 system/home
lvconvert --type raid1 --mirrors 1 system/root
lvconvert --type raid1 --mirrors 1 system/boot

If this fails with “not enough space to alloc metadata”, it’s because you didn’t leave space on the vg during install. Honestly, I made this mistake twice due to confusion about things leading to two reinstalls …

Getting ready to cache

Now lets get ready to cache some data. We’ll make pvs and vgs for data:

pvcreate /dev/nvme0p2
pvcreate /dev/nvme1p2
pvcreate /dev/sda1
pvcreate /dev/sdb1
vgcreate data /dev/nvme0p2 /dev/nvme1p2 /dev/sda1 /dev/sdb1

Create the larger volume

lvcreate --type raid1 --mirrors 1 -L7.5T -n libvirt_t2 data /dev/sda1 /dev/sdb1

Prepare the caches

lvcreate --type raid1 --mirrors 1 -L 4G -n libvirt_t2_meta data
lvcreate --type raid1 --mirrors 1 -L 400G -n libvirt_t2_cache data
lvconvert --type cache-pool --poolmetadata data/libvirt_t2_meta data/libvirt_t2_cache

Now put the caches in front of the disks. It’s important for you to check you have the correct cachemode at this point, because you can’t change it without removing and re-adding the cache. I chose writeback because my nvme devices are in a raid 1 mirror, and it helps to accelerate writes. You may prefer the default (writethrough), where the SSDs effectively act as a read cache only.

lvconvert --type cache --cachemode writeback --cachepool data/libvirt_t2_cache data/libvirt_t2
mkfs.xfs /dev/mapper/data-libvirt_t2

You can monitor the amount of “cached” data in the data column of lvs.
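
For example (the vg name data matches my setup):

# Data% shows how much of the cache is in use
lvs -a -o name,size,data_percent data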

Now you can add this to /etc/fstab as any other xfs drive. I mounted it to /var/lib/libvirt/images.

Network Manager

Now, I have to assemble the network bridges. Network Manager has some specific steps to follow to achieve this. I have:

  • two intel gigabit ports
  • the ports are link aggregated by 802.3ad
  • there are multiple vlans ontop of the link agg
  • bridges must be built on top of the vlans

This layering requires a specific set of steps, because network manager sees the bridge and the lagg as separate things that require the vlan to tie them together.

Configure the link agg, and add our two ethernet phys

nmcli conn add type bond con-name bond0 ifname bond0 mode 802.3ad ipv4.method disabled ipv6.method ignore
nmcli connection add type ethernet con-name bond0-eth1 ifname eth1 master bond0 slave-type bond
nmcli connection add type ethernet con-name bond0-eth2 ifname eth2 master bond0 slave-type bond

Add a bridge for a vlan:

nmcli connection add type bridge con-name net_18 ifname net_18 ipv4.method disabled ipv6.method ignore

Now tie together a vlan on the bond, to the bridge we created.

nmcli connection add type vlan con-name bond0.18 ifname bond0.18 dev bond0 id 18 master net_18 slave-type bridge

You will need to repeat these last two commands as required for the vlans you have.
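
For example, a second vlan would look like this (the id of 20 here is just illustrative):

nmcli connection add type bridge con-name net_20 ifname net_20 ipv4.method disabled ipv6.method ignore
nmcli connection add type vlan con-name bond0.20 ifname bond0.20 dev bond0 id 20 master net_20 slave-type bridge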

House Keeping

Finally you need to do some housekeeping. Transactional server will automatically reboot and update, so you need to be ready for this. You may disable this with:

systemctl disable transactional-update.timer

You likely want to edit:

/etc/sysconfig/libvirt-guests

To be able to handle guest shutdown policy due to a UPS failure or a reboot.
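
The settings I would look at are along these lines (the values are just a starting point - the comments in the file itself explain the options):

ON_BOOT=ignore
ON_SHUTDOWN=shutdown
SHUTDOWN_TIMEOUT=300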

Now you can enable and start libvirt:

systemctl enable libvirtd
systemctl start libvirtd

Finally you can build and import virtual machines.