Upgrading OpenSUSE 15.0 to 15.1

It’s not immediately obvious how to do this. You have to edit the repo files to change the release version, then refresh and update.

sed -ri 's/15\.0/15.1/' /etc/zypp/repos.d/*.repo
zypper ref
zypper dup
reboot

Note this works on a transactional host too:

sed -ri 's/15\.0/15.1/' /etc/zypp/repos.d/*.repo
transactional-update dup
reboot

It would be nice if there was an upgrade tool that attempted the upgrade and reverted the repo files on failure, or used temporary repo files for the upgrade. That would be a nicer user experience than sed-ing the repo files.

Announcing Kanidm - A new IDM project

Today I’m starting to talk about my new project - Kanidm. Kanidm is an IDM project designed to be correct, simple and scalable. As an IDM project it should be able to store the identities and groups of people, authenticate them securely to various other infrastructure components and services, and much more.

You can find the source for kanidm on GitHub.

For more details about what the project is planning to achieve, and what we have already implemented, please see the GitHub repository.

What about 389 Directory Server

I’m still part of the project, and working hard on making it the best LDAP server possible. Kanidm and 389-ds have different goals. 389 Directory Server is a globally scalable, distributed database that can store huge amounts of data and process thousands of operations per second. 389-ds lets you build a system on top, in any way you want. If you want an authentication system today, use 389-ds. We are even working on a self-service web portal soon too (one of our most requested features!). Besides myself, no one on the (amazing) 389 DS team has any association with kanidm (yet?).

Kanidm is an opinionated IDM system, and has strong ideas about how authentication and users should be processed. We aim to be scalable, but that’s a long road ahead. We also want to have more web integrations, client tools and more. We’ll eventually write a kanidm to 389-ds sync tool.

Why not integrate something with 389? Why something new?

There are a lot of limitations with LDAP when it comes to modern web-focused auth processes such as WebAuthn. Because of this, I wanted to make something that didn’t have the same limitations, and that had different ideas about data storage and APIs. It was a really hard decision to build something outside of 389 Directory Server (because I really do love the project, and have great pride in the team), but I felt it was going to be more productive to build in parallel than on top.

When will it be ready?

I think that a single-server deployment will be usable for small installations in early 2020, and a fully-fledged system with replication by late 2020. It depends on how much time I have and what parts I implement in what order. The current rough work order (late 2019) is indexing, RADIUS integration, claims, and then self-service/web UI.

OpenSUSE leap as a virtualisation host

I’ve been rebuilding my network, migrating from CentOS to SUSE, and the final server was my hypervisor. Most of the reason for this is the change in my employment, so I feel it’s right to dogfood for my workplace.

What you will need

  • Some computer parts (assembly may be required)
  • OpenSUSE LEAP 15.1 media (dd if=opensuse.iso of=/dev/a_usb_i_hope)

What are we aiming for?

My new machine has dual NVME and dual 8TB spinning disks. The intent is to have the OS on the NVME and to have a large part of the NVME act as LVM cache for the spinning disks. The host won’t run any applications besides libvirt, and has to connect a number of vlans over a link aggregation.

Useful commands

Throughout this document I’ll assume some details about your devices and partitions. To find your own, and to check and confirm what you are doing as you go, these commands will help:

lsblk  # Shows all block storage devices, and how (if) they are mounted.
lvs  # shows all active logical volumes
vgs  # shows all active volume groups
pvs  # shows all active physical volumes
dmidecode  # show hardware information
ls -al /dev/disk/by-<ID TYPE>/  # how to resolve disk path to a uuid etc.

I’m going to assume you have devices like:

/dev/nvme0  # the first nvme of your system that you install to.
/dev/nvme1  # the second nvme, used later
/dev/sda    # Two larger block storage devices.
/dev/sdb

Install

Install and follow the prompts. Importantly, install to a single NVME only, choose transactional server + LVM, and put btrfs on /root inside the LVM. You want to partition such that there is still free space on the NVME - I left about 400GB unpartitioned for this. In other words, the disk should look like:

[ /efi | pv + vg system               | pv (unused) ]
       | /root (btrfs), /boot, /home  |

Remember to leave about 1GB of free space in the system vg to allow for raid 1 metadata later!

Honestly, it may take you a try or two to get this right with YaST, and it was probably the trickiest part of the install.

You should also select NetworkManager (not wicked) for network management. You may want to enable ssh here. I disabled the firewall personally, because there are no applications on the host and it interferes with the bridging for the VMs.

Packages

Because this is a SUSE transactional server we need to add packages and reboot each time. Here is what I used, but you may find you don’t need everything here:

transactional-update pkg install libvirt libvirt-daemon libvirt-daemon-qemu \
  sssd sssd-ad sssd-ldap sssd-tools docker zsh ipcalc python3-docker rdiff-backup \
  vim rsync iotop tmux fwupdate fwupdate-efi bridge-utils qemu-kvm apcupsd

Reboot, and you are ready to partition.

Partitioning - post install

First, copy your GPT from the first NVME to the second. You can do this by hand with:

gdisk /dev/nvme0
p    # print the partition table, note the sizes, types and layout
q    # quit without writing

gdisk /dev/nvme1
n    # create each partition, duplicating the parameters from above
<duplicate the parameters as required>
w    # write the new table
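
If you prefer, sgdisk (from the same gptfdisk package) can replicate the table in one step. This is a sketch, assuming your NVMEs appear as nvme0n1 and nvme1n1 - double check the source/target order on your version before running it:

# Copy the GPT from the first NVME onto the second, then randomise the GUIDs
# so the two disks don't share partition and disk identifiers.
sgdisk --replicate=/dev/nvme1n1 /dev/nvme0n1
sgdisk --randomize-guids /dev/nvme1n1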

Now we’ll make your /efi at least a little redundant

mkfs.fat /dev/nvme1p0
ls -al /dev/disk/by-uuid/
# In the above, look for your new /efi fs, IE CE0A-2C1D -> ../../nvme1n1p1
# Now add a line to /etc/fstab like:
UUID=CE0A-2C1D    /boot/efi2              vfat   defaults                      0  0

Now to really make this work, because it’s transactional, you have to make a change to /root, which is read-only! To do this run:

transactional-update shell dup

This puts you in a shell at the end. Run:

mkdir /boot/efi2

Now reboot. After the reboot your second EFI partition should be mounted. rsync /boot/efi/* to /boot/efi2/. I leave it to the reader to decide how to sync this periodically.
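
As a sketch, a one-shot sync could look like the following - check the paths before trusting it with your boot files:

# Mirror the primary EFI partition onto the second one.
rsync -av --delete /boot/efi/ /boot/efi2/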

Next you can setup the raid 1 mirror for /root and the system vg.

pvcreate /dev/nvme1p1
vgextend system /dev/nvme1p1

Now we have enough pvs to make a raid 1, so we convert all our volumes:

lvconvert --type raid1 --mirrors 1 system/home
lvconvert --type raid1 --mirrors 1 system/root
lvconvert --type raid1 --mirrors 1 system/boot

If this fails with “not enough space to alloc metadata”, it’s because you didn’t leave free space in the vg during install. Honestly, I made this mistake twice, which led to two reinstalls …
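
Before converting, you can confirm the vg still has unallocated space - the VFree column needs to be non-zero:

vgs system   # VFree shows the space left for the raid 1 metadata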

Getting ready to cache

Now let’s get ready to cache some data. We’ll make the pvs and a vg for data:

pvcreate /dev/nvme0p2
pvcreate /dev/nvme1p2
pvcreate /dev/sda1
pvcreate /dev/sdb1
vgcreate data /dev/nvme0p2 /dev/nvme1p2 /dev/sda1 /dev/sdb1

Create the larger volume

lvcreate --type raid1 --mirrors 1 -L7.5T -n libvirt_t2 data /dev/sda1 /dev/sdb1

Prepare the caches

lvcreate --type raid1 --mirrors 1 -L 4G -n libvirt_t2_meta data
lvcreate --type raid1 --mirrors 1 -L 400G -n libvirt_t2_cache data
lvconvert --type cache-pool --poolmetadata data/libvirt_t2_meta data/libvirt_t2_cache

Now put the caches in front of the disks. It’s important to check you have the correct cachemode at this point, because you can’t change it without removing and re-adding the cache. I chose writeback because my nvme devices are in a raid 1 mirror, and it helps to accelerate writes. You may prefer the default (writethrough), where the SSDs effectively act as a read cache only.

lvconvert --type cache --cachemode writeback --cachepool data/libvirt_t2_cache data/libvirt_t2
mkfs.xfs /dev/mapper/data-libvirt_t2

You can monitor the amount of “cached” data in the Data% column of lvs.
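
For example, something like this shows the cache utilisation - the exact columns available depend on your lvm version:

lvs -a -o lv_name,lv_size,data_percent,pool_lv data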

Now you can add this to /etc/fstab as any other xfs drive. I mounted it to /var/lib/libvirt/images.
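
As a sketch, the fstab entry could look like the following - substitute the mapper path or UUID from your own system:

/dev/mapper/data-libvirt_t2   /var/lib/libvirt/images   xfs   defaults   0  0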

Network Manager

Now I have to assemble the network bridges. NetworkManager has some specific steps to follow to achieve this. I have:

  • two intel gigabit ports
  • the ports are link aggregated by 802.3ad
  • there are multiple vlans on top of the link agg
  • bridges must be built on top of the vlans

This requires layering the configuration in a specific order, because NetworkManager sees the bridge and the lagg as separate things, with the vlan tying them together.

Configure the link agg, and add our two ethernet phys:

nmcli conn add type bond con-name bond0 ifname bond0 mode 802.3ad ipv4.method disabled ipv6.method ignore
nmcli connection add type ethernet con-name bond0-eth1 ifname eth1 master bond0 slave-type bond
nmcli connection add type ethernet con-name bond0-eth2 ifname eth2 master bond0 slave-type bond

Add a bridge for a vlan:

nmcli connection add type bridge con-name net_18 ifname net_18 ipv4.method disabled ipv6.method ignore

Now tie a vlan on the bond to the bridge we created:

nmcli connection add type vlan con-name bond0.18 ifname bond0.18 dev bond0 id 18 master net_18 slave-type bridge

You will need to repeat these last two commands as required for the vlans you have.
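
For example, a hypothetical vlan 19 would follow the same pattern:

nmcli connection add type bridge con-name net_19 ifname net_19 ipv4.method disabled ipv6.method ignore
nmcli connection add type vlan con-name bond0.19 ifname bond0.19 dev bond0 id 19 master net_19 slave-type bridge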

House Keeping

Finally you need to do some housekeeping. A transactional server will automatically update and reboot, so you need to be ready for this. You may disable this with:

systemctl disable transactional-update.timer

You likely want to edit:

/etc/sysconfig/libvirt-guests

to handle the guest shutdown policy for a UPS failure or a host reboot.
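
As a sketch, the settings I would look at are along these lines - the values here are examples only, check the comments in the file for the full set:

# /etc/sysconfig/libvirt-guests (example values)
ON_SHUTDOWN=shutdown      # shut guests down rather than suspending them
SHUTDOWN_TIMEOUT=120      # seconds to wait for a guest before giving up
PARALLEL_SHUTDOWN=4       # how many guests to shut down at once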

Now you can enable and start libvirt:

systemctl enable libvirtd
systemctl start libvirtd

Finally you can build and import virtual machines.

LDAP Filter Syntax Validation

Today I want to do a deep-dive into a change that will be released in 389 Directory Server 1.4.2. It’s a reasonably complicated change for our server, but it has a simple user interaction for admins and developers. I want to peel back some of the layers to explain what kind of experience, thought and team work goes into a change like this.

TL;DR - just keep upgrading your 389 Directory Server instance, and our ‘correct by default’ policy will apply, and you’ll keep having the best LDAP server we can make :)

LDAP Filters and How They Work

LDAP filters are one of the primary methods of expression in LDAP, and are used in almost every aspect of the system - from finding who you are when you login, to asserting you are member of a group or have other security attributes.

For the purposes of this discussion we’ll look at this filter:

'(|(cn=william)(cn=claire))'

In order to execute these queries quickly (LDAP is designed to handle thousands of operations per second) we rely heavily on indexing. Indexing is often thought of as some kind of “magic”, but it’s reasonably simple: indexes are pre-computed partial result sets. So why do we need these?

We’ll imagine we have two entries (invalid, and truncated for brevity).

dn: cn=william,...
cn: william

dn: cn=claire,...
cn: claire

These entries both have entry-ids - these ids are integers and are per-server within a replication group. You can show them by requesting entryid as an attribute in 389.

dn: cn=william,...
entryid: 1
cn: william

dn: cn=claire,...
entryid: 2
cn: claire

Our entries are stored in the main-entry database in /var/lib/dirsrv/slapd-standalone1/db/userRoot in the file “id2entry.db4”. This is a key-value database where the keys are the entryid, and the value is the serialised entry itself. Roughly, it’s:

[ ID ][ Entry             ]
  1     dn: cn=william,...
        cn: william

  2     dn: cn=claire,...
        cn: claire

Now, if we had NO indexes, to evaluate our filters we have to scan every entry of id2entry to determine if the filter matches. This algorithm is:

candidate_set = []
for id in id-min to id-max:
    entry = load_entry_by_id(id)
    if apply_filter(filter, entry):
        candidate_set.append(entry)

For two entries, this may be fast, but when you have 1,000, 10,000, or even millions of entries, this is extremely slow. We call these searches full table scans, or in 389 DS, ALLIDS searches.

To make our searches faster we have indexes. An index is a mapping of a partial query term to an id list (In 389 we call these IDLs). An IDL is a set of integers. Our index for these examples would likely be something like:

cn
=william: [1, ]
=claire: [2, ]

These indexes are also stored in key-value databases in userRoot - you can see this as cn.db4.

So when we have an indexed term, to evaluate the query we load the indexes, then using mathematical set operations we produce a candidate_id_set, and we can then load only the entries that match.

For example, in pseudo python code:

# Assume query is: (cn=william)

attr = filter.get_attr_name()
with open('%s.db' % attr) as index:
    idl = index.get('=william') # from the filter :)

for id in idl:
    ... # as before.

So we can see now that when we load the idl for cn index, this would give us the set [1, ]. Even if the database had 100 million entries, as our idl is a single value, we only need to load the one entry that matches. Neat!

When we have a more complex operation such as AND and OR, we can now manipulate the idl sets. For example:

(|(cn=claire)(cn=william))
   cn =claire -> idl [2, ]
   cn =william -> idl [1, ]

candidate_idl_set = union([2, ], [1, ])
# [1, 2]

This means again, even with millions of entries, we only need to load entry 1 and 2 to uphold the query provided to us.

So we finally know enough to understand how our example query is executed. PHEW!

Unindexed Attributes

However, it’s not always so easy. We also have to handle attributes that aren’t indexed. In these cases, while we operate on the idl set, we may insert an idl with the value of ALLIDS (which, as previously mentioned, is the “set of all entries”). This can have various effects.

If this was an AND query, we can annotate that the filter is partially resolved. This means that if we had:

(&(cn=william)(unindexed=foo))

Because in an AND condition both filter components must be satisfied, we have a partial candidate set from cn=william of [1, ]. We can load this partial candidate set, and then apply the filter test as in the full table scan case - but as we only apply it to a single entry, this is really fast.
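
In pseudo python, the partially resolved AND looks something like this - load_entry_by_id and apply_filter are the same illustrative helpers as in the earlier full table scan example:

# Query: (&(cn=william)(unindexed=foo))
partial_idl = [1, ]               # from the cn index
candidate_set = []
for id in partial_idl:
    entry = load_entry_by_id(id)
    # The unindexed component is resolved by the full filter test,
    # but only against the entries in the partial idl - one entry here.
    if apply_filter(filter, entry):
        candidate_set.append(entry)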

The real problem is OR queries. If we had:

(|(cn=william)(unindexed=foo))

Because OR means either filter component could be satisfied, we have to turn unindexed into ALLIDS, and the result of the OR as a whole is ALLIDS. So even if we have 30 indexed values in the OR, a single ALLIDS (unindexed value) will always turn that OR into a full table scan. This is not good for performance!
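
In set terms, a single ALLIDS poisons the whole union. Again in pseudo python, with ALLIDS as an illustrative sentinel value:

# Query: (|(cn=william)(unindexed=foo))
ALLIDS = "ALLIDS"                     # sentinel meaning "every id in the database"

def idl_union(a, b):
    if a is ALLIDS or b is ALLIDS:
        # One unindexed term forces the whole OR to a full table scan.
        return ALLIDS
    return sorted(set(a) | set(b))

candidate_idl_set = idl_union([1, ], ALLIDS)
# -> ALLIDS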

Missing Attributes

So as a weirder case … what if the attribute doesn’t exist in schema at all? For example we could search for Microsoft AD attributes in 389 Directory Server, or we could submit bogus filters like “(whargarble=foo)”. What happens here?

Well, historically we treated these the same as unindexed queries, which means that any term not in schema would be treated as ALLIDS. This led to a “quietly known” denial of service attack against 389 Directory Server, where you could emit a large number of queries for attributes that don’t exist, causing the server to attempt many ALLIDS scans. We have some defences, like the allids limit (how many entries you can full table scan before giving up), but it can still cause entry cache churn and other performance issues.

I was first made aware of this issue in 2014 while working for the University of Adelaide, where our VMware service would query LDAP for MS attributes, causing a large performance issue. We resolved this by adding the MS attributes to schema and indexing them so that they would create empty indexes - in 389 Directory Server we would now call this “idl_alloc(0)” or an “empty IDL”.

When I was initially hired by Red Hat in 2015 I knew this issue existed, but I didn’t know enough about the server core to really fix it, so it went to the back of my mind … it was rare for a customer to raise this issue, but we had the workaround and so I was able to advise support services on how to mitigate this.

In 2019 however, while investigating an issue related to filter optimisation, I was made aware that FreeIPA was doing certmap queries that requested MS certificate attributes, causing large performance issues. We now had the issue again, in a large, widely installed product, so it was time to tackle it.

How to handle this?

A major issue in this is “never breaking customers”. Because we had always supported this behaviour there is a risk that any solution would cause customer queries to “silently” begin to break if we issued a fix or change. More accurately, any change to how we execute the filters could cause results of the filters to change, which would disrupt customers.

That said, there is also precedent that 389 Directory Server was handling this incorrectly. The RFC for LDAP notes:

Any assertion about the values of such an attribute is only defined if the AttributeType is known by the evaluating mechanism, the purported AttributeValue(s) conforms to the attribute syntax defined for that attribute type, the implied or indicated matching rule is applicable to that attribute type, and (when used) a presented matchValue conforms to the syntax defined for the indicated matching rules. When these conditions are not met, the FilterItem shall evaluate to the logical value UNDEFINED. An assertion which is defined by these conditions additionally evaluates to UNDEFINED if it relates to an attribute value and the attribute type is not present in an attribute against which the assertion is being tested. An assertion which is defined by these conditions and relates to the presence of an attribute type evaluates to FALSE.

Translation: if a filter component (i.e. nonexist=foo) is in a filter but NOT in the schema, the result of that component’s evaluation is the empty set, aka undefined.

It was also clear that if an engaged and active consumer like FreeIPA was making this mistake, then many others must be making it without noticing. So there is sometimes value in helping to raise the standard so that everyone benefits, and in highlighting mistakes sooner.

The Technical Solution

This is the easy part - we add a new configuration option with three states: “on”, “warn” and “off”. “On” enables the strictest handling of filters, rejecting them and not continuing if any requested attribute is not in the schema. “Warn” provides the RFC-compliant behaviour, mapping to the empty-set index and noting in the logs that this occurred. Finally, “off” is the previous “silently allow” behaviour.

This was easily achieved in filter parsing, by checking the attribute of each filter component against our schema hashmap. We then tag the filter element, and depending on the current setting level reject or continue.

In the query execution code, we now check the filter tag to understand if the attribute is present in the schema or not. If it’s flagged as “undefined”, then we immediately shortcut to return idl_alloc(0) instead of returning ALLIDS on the failure to find the relevant index db.
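
A rough sketch of that execution-time check, in pseudo python rather than the actual server code - load_index and IndexNotFound are illustrative names, not real 389-ds symbols:

def resolve_index(filter_component):
    if filter_component.tag == "undefined":
        # Attribute not in schema: return the empty idl, never ALLIDS.
        return idl_alloc(0)
    try:
        return load_index(filter_component)
    except IndexNotFound:
        # In schema but unindexed: fall back to ALLIDS as before.
        return ALLIDS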

We can show the performance impact of this change:

Before with non-existent attribute

Average rate:    7.40/thr

After with “warn” enabled (map to empty set)

Average rate: 4808.70/thr

This is a huge improvement, and certainly shows the risk of DOS and how effective the solution was!

The Social Solution

Of course, this is the hard part - the 389 Directory Server team are all amazingly smart people, from many countries, and all live around the world. They all care that the server is the best possible, and that our standards as a team are high. That means when introducing a change that has a risk of affecting query result sets like this, they pay attention, and ask many difficult questions about how the feature will be implemented.

The first important justification - is a change like this worthwhile? We can see from the performance results that the risk of DOS is real, so the answer becomes yes from a security point of view. But it’s also important to consider the cost to consumers - is this change going to benefit FreeIPA, for example? As the author I am biased, but I want to say “yes” - by notifying or rejecting invalid filters earlier, we can help FreeIPA developers improve their code quality, without expecting them to know LDAP inside and out.

The next major question is performance - before the feature was developed there is clearly a risk of DOS, but implementing it means performing additional locking on the schema. Is that a risk to our performance under normal operating conditions? This also had to be discussed and assessed.

A really important point raised by Thierry is how we communicate these errors. Previously we would use the “notes=” field of the access log. It looks like this:

conn=1 op=4 RESULT err=0 tag=101 nentries=13 etime=0.0003795424 notes=U

The challenge with the notes= field is that it’s easy to overlook, and unless you are familiar with it, it’s hard to see what it is indicating. In this case, notes=U means a partially unindexed query (one filter component, but not all, returned ALLIDS).

We can’t change the notes field due to the risk of breaking our own scripts like logconv.pl, support tools developed by RH or SUSE, or even integrations with platforms like Splunk. But clearly we need a way to detail what is happening with your filter. So Thierry suggested an extension that provides details about the notes. Now we get:

conn=1 op=4 RESULT err=0 tag=101 nentries=13 etime=0.0003795424 notes=U details="Partially Unindexed Filter"
conn=1 op=8 RESULT err=0 tag=101 nentries=0 etime=0.0001886208 notes=F details="Filter Element Missing From Schema"

So we have extended our log message, but without breaking existing integrations.

The final question is what our defaults should be. It’s one thing to have this feature, but what should we ship with? Do we strictly reject filters? Warn? Or disable it, and expect people to turn it on?

This became a long discussion between Ludwig, Thierry and me - we discussed the risk of DOS in the first place, what the impact of each level could be, and how it could break legacy applications or sites using deprecated features or with weird data imports. Many different aspects were considered. We decided to default to “warn” (non-existent becomes empty set), and we settled on communicating with support to advise them of the upcoming change; we also agreed that our “back out” plan is to change the default and ship a patch if there is a large volume of negative feedback.

Conclusion

As of today, the PR is merged, and the code is on its way to the next release. It’s a long process, but the process exists to ensure we do what’s best for our users while balancing many different aspects. We have a great team of people, with decades of experience from many backgrounds, which means these discussions can be long and detailed, but in the end we hope to deliver the best product possible to our community.

It’s also valuable to share how much thought and effort goes into projects - in your life you may only interact with 1% of our work through our configuration and system, but we have an iceberg of decisions and design process that affects you every day, where we have to be responsible and considerate in our actions.

I hope you enjoyed this exploration of this change!

References

PR#50379

Using ramdisks with Cargo

I have a bit of a history of killing SSDs - probably because I do a bit too much compiling and management of thousands of tiny files. Plenty of developers have this problem! So while thinking one evening, I was curious if I could setup a ramdisk on my mac for my cargo work to output to.

Making the ramdisk

On Linux you’ll need to use tmpfs, or use /dev/shm directly.
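
A minimal tmpfs sketch, as root, and assuming you have the memory to spare:

# Create a 4GB tmpfs-backed ramdisk at /mnt/ramdisk.
mkdir -p /mnt/ramdisk
mount -t tmpfs -o size=4G tmpfs /mnt/ramdisk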

On OSX you need to run a script like the following:

diskutil partitionDisk $(hdiutil attach -nomount ram://4096000) 1 GPTFormat APFS 'ramdisk' '100%'

This creates and mounts a roughly 2GB ramdisk (4096000 × 512-byte sectors) at /Volumes/ramdisk. Make sure you have enough ram!

Asking cargo to use it

We probably don’t want to make our changes permanent in the cargo config yet, so we’ll use the environment:

CARGO_TARGET_DIR=/Volumes/ramdisk/rs cargo ...

Does it work?

Yes!

Disk Build (SSD, 2018MBP)

Finished dev [unoptimized + debuginfo] target(s) in 2m 29s

2 GB APFS ramdisk

Finished dev [unoptimized + debuginfo] target(s) in 1m 53s

For me it’s more valuable to try and save those precious SSD write cycles, so I think I’ll stick with this setup. You can see how much Rust writes by doing a clean + build. My project used the following:

Filesystem                             Size   Used  Avail Capacity    iused               ifree %iused  Mounted on
/dev/disk110s1                        2.0Gi  1.2Gi  751Mi    63%       3910 9223372036854771897    0%   /Volumes/ramdisk

Make it permanent

Put the following in /Library/LaunchDaemons/au.net.blackhats.fy.ramdisk.plist

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
        <dict>
                <key>Label</key>
                <string>au.net.blackhats.fy.ramdisk</string>
                <key>Program</key>
                <string>/usr/local/libexec/ramdisk.sh</string>
                <key>RunAtLoad</key>
                <true/>
                <key>StandardOutPath</key>
                <string>/var/log/ramdisk.log</string>
        </dict>
</plist>

And the following into /usr/local/libexec/ramdisk.sh

#!/bin/bash
date
diskutil partitionDisk $(hdiutil attach -nomount ram://4096000) 1 GPTFormat APFS 'ramdisk' '100%'

Finally, put this in your cargo config file of choice (for example ~/.cargo/config):

[build]
target-dir = "/Volumes/ramdisk/rs"

Future william will need to work out if there are negative consequences to multiple cargo projects sharing the same target directory … hope not!

Launchctl tips

# Init the service
launchctl load /Library/LaunchDaemons/au.net.blackhats.fy.ramdisk.plist

References

launchd.info