Linux Filesystems

Surprising absolutely no-one, I, as an underachieving white man, have opinions on things that are somewhat outside my normal subject matter areas. What is more surprising is that for some reason people want to hear them, so I have been bullied into writing a blog post to answer the general question: "What do I think about filesystems on Linux?"

What Currently Exists?

Today, in August of 2024, the current discourse around Linux filesystems generally revolves around the following culprits, presented in no particular order.

  • ext3 / ext4
  • xfs
  • btrfs
  • zfs

For comparison's sake, I also want to include:

  • apfs (MacOS)

What Do We Expect From A Filesystem?

In the past we didn't expect much - just a place to create folders and files, and trust that they will be there tomorrow when we come back to our computer.

Of course, life is never so simple: hardware fails, power abruptly cuts out, bits flip, and more. As a result, we now demand that our filesystems survive these events and keep our cat pictures intact through all of it.

Now dear reader, be careful when you read the preceding paragraph. It's easy to start ascribing technical solutions to the above problems (e.g. ECC for bit flip protection). But don't do that! We need to examine the expected user experience first.

Today consumers expect:

  • Resilience to power loss
  • Files not to be corrupted by bugs (either in userland software, or the filesystem itself)
  • Low latency and high throughput to access data

In addition, power users or heavily technical folk have an extended list of expectations:

  • The ability to examine past versions of files.
  • Ability to boot previous OS configurations in event of a bad update.
  • Detection of and correction of corruption in files.
  • Compression of certain file types to save space.
  • Features to protect from catastrophic hardware failure.

How Do Our Current Filesystems Fit In?

| Filesystem | Resilience to Power Loss | Corruption Protection | Performance | Versions | Boot Versions | Error Correction | Compression | Hardware Failure |
|------------|--------------------------|-----------------------|-------------|----------|---------------|------------------|-------------|------------------|
| Ext3/Ext4  | Minimal                  | Partial               | High        | No       | No            | No               | No          | No               |
| Xfs        | High                     | Partial               | High        | No       | No            | No               | No          | No               |
| Btrfs      | Partial                  | Partial               | Low         | Yes      | Yes           | Partial          | No (1)      | Minimal          |
| Apfs       | Full                     | Partial               | High        | Yes      | Yes           | Partial          | No          | Yes (2)          |
| ZFS        | Full                     | Full                  | High        | Yes      | Yes (3)       | Yes              | Yes         | Yes              |
  1. Btrfs does support compression, but it's so obscure to access and use that it may as well not exist.
  2. Apfs doesn't support RAID, but its integration with Time Machine gives it excellent hardware failure resilience.
  3. ZFS does support boot versions, they just aren't integrated in Linux yet.

If you want to understand how I came to this assessment, go check out the Roasting Corner at the end of this blog.

The main thing here is that every Linux filesystem falls short of both APFS and ZFS. You end up having to trade features against performance and reliability.

How Did We Get Here?

When you look at any software development - filesystems, databases or anything else - you can often see that the software becomes a reflection of its developers' priorities and attitudes.

Ext4

For example, ext4 was seen as a "stop gap" until btrfs was ready. This wasn't just some outsider commentary; it was a statement from the principal developer of ext3/4 themself.

As a result ext4 was never about developing something good, but about adding extra features that were required circa 2005 while btrfs was being developed.

A more damning reflection is that ext3/4 have been pushed for performance so much that the default mode is data=ordered, where only the metadata is written to the journal. This means that in the case of a partial write you will end up with corrupted data.

XFS

Whereas when we look at XFS, we see a filesystem that prioritises robustness (reflected in the fact that the XFS test suite has become the Linux filesystem test suite) and performance. XFS has its origins in high performance computing and graphics, where both robustness and performance are critical.

But we also see that the developers of XFS believe that concerns like checksumming and RAID are best handled at the block level or the application level. This is why XFS has avoided these features for so long.

ZFS

ZFS was developed by SUN in 2001 and made available in Solaris in 2006. At the time, Solaris was the Sega of operating systems - so far ahead of its time it was doomed to be misunderstood.

Solaris developed virtualised networking, containers, transactional system updates and more. A key piece of this was ZFS. Unlike the filesystems preceding it, ZFS was developed with the goal of "enabling admins to access advanced storage features, with a reliable and intuitive interface".

This led to Solaris supporting features like parallel boot environments, snapshots, network incremental backups, high performance, compression and more. Each feature was designed with the user in mind, and achieved a goal in a way that protected the users data at all times.

This has made ZFS the gold standard - ZFS is the bar to meet or exceed if you are now developing a filesystem.

APFS

APFS was developed by Apple to replace the aging HFS+. HFS+ had a major issue: it liked to shit the bed if you had too many files in a single folder.

Guess how Time Machine worked at the time? That's right: hardlinks to files, all stored in a single folder. You do the math.

As a result, Apple needed something better. Apple had previously attempted to include ZFS, but the deal fell through. This led to the need to make something new.

Apple's developers opted to be solution driven rather than taking inspiration from existing filesystems. This led to a production ready and robust filesystem that prioritised reliability and integrated features. Within the space of 4 years from inception, APFS became the default on MacOS 10.13.

Btrfs

So let's tackle the elephant in the room.


Btrfs was created in 2007 and integrated into the Linux kernel in 2009. Btrfs was created to address the shortcomings of filesystems in Linux, especially in light of the success of ZFS (and later APFS). Since APFS and ZFS took 4 and 5 years respectively to become mainline, reliable and mature implementations, surely btrfs had achieved this by 2013?

Nope.

Even now in 2024 btrfs is one of the slowest Linux filesystems, and it does not take long to find reports of ongoing data corruption issues.

But most egregiously, btrfs is a reflection of the intent to prioritise features above all else.

Remember how I said not to ascribe technical solutions before we describe the user experience? Well, btrfs did the opposite. They said "here are the technical solutions we want" but never considered the user experience of those features.

As a result, btrfs has every feature under the sun, but all of them are half baked in some way.

Let's look at subvolumes.

You'll notice that most administration commands refer to the path where the subvolume is mounted. This is a common source of confusion because btrfs administration doesn't have clear points of interaction. For example, say I have a btrfs volume containing:

/myvolume
/myvolume/root
/myvolume/other

And I mount /myvolume/root as my / filesystem. Then I can no longer administer /myvolume or /myvolume/other unless I mount them somewhere. This in itself becomes weird, because if I do mount /myvolume then I can also access /myvolume/root through it.
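
A minimal sketch of that situation, assuming a hypothetical btrfs device at /dev/sdb1 (subvolid=5 is always the btrfs top-level subvolume):

# create the layout above (hypothetical device)
mount /dev/sdb1 /myvolume
btrfs subvolume create /myvolume/root
btrfs subvolume create /myvolume/other

# mount the "root" subvolume as /
mount -o subvol=root /dev/sdb1 /

# to administer /myvolume or /myvolume/other you must mount the top level
# again somewhere - and then "root" is also visible inside it as a directory
mount -o subvolid=5 /dev/sdb1 /myvolume
btrfs subvolume list /myvolume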

Counterintuitively, snapshots on btrfs aren't recursive. This means you can't take a snapshot that keeps two subvolumes precisely in lockstep.
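
Continuing the sketch above: a read-only snapshot of the top level simply won't descend into the nested subvolumes - they show up as empty directories inside the snapshot.

# "root" and "other" are NOT captured by this snapshot
btrfs subvolume snapshot -r /myvolume /myvolume/snap-today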

In btrfs raid 1, when a disk dies btrfs won't tell you which one died. Even btrfs' own documentation doesn't tell you how to replace a failed disk.

Btrfs data scrubbing uses easy to remember commands like systemd-run -p "IOReadBandwidthMax=/dev/sdx 10M" btrfs scrub start -B /, which is needed because btrfs doesn't rate limit scrubs in the background like ZFS does.
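
For contrast, the ZFS equivalent (assuming a pool named tank) deprioritises itself behind normal I/O without any ceremony:

# start a scrub; progress and results show up in zpool status
zpool scrub tank
zpool status tank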

I could go on, but every feature of btrfs has these papercuts - needless complexities, flaws where there should be none, half baked parts, and risks to the user while attempting to do basic tasks.

But currently btrfs is the default in OpenSUSE/Fedora because it looks good on paper. Look at that list of features! So many features. How could it be bad?

Why Don't We All Just Move To ZFS If You Like It So Much?

ZFS was developed by SUN for their Solaris operating system. At the time, Solaris was afraid of being overtaken in the datacentre by Linux, and so, in an attempt to garner interest from the open source community, they released Solaris under the CDDL license. The CDDL is an open source license, but it's written in such a way as to prevent CDDL code being used in the Linux kernel. Since then Oracle has bought SUN, and as a particularly hostile company, it's unlikely that Oracle will find it deep in the tiny black void that is their corporate conscience to fix the licensing issues with ZFS.

It's worth noting this doesn't affect Illumos or FreeBSD, which are developed under different license terms.

What If We Just Fixed Btr...

You can't.

Software projects are a reflection of their developers' intents. We know that btrfs is the reflection of features at all costs, so how would you pivot to "user experience" or "reliability"? The issue is you can't tack those things on later. They need to be there from the start (like in ZFS).

When you do attempt to tack on user experience, it ends up half baked in its own way.

Let's go back to the subvolume example. How would you fix btrfs to handle subvolumes in a way that doesn't require a mountpoint? The whole existing ecosystem is already deeply embedded in the idea of a subvolume being a magic folder, so trying to unpick that into a clearer administrative structure would be an impossible task.

To really "get this", watch the talk by George Wilson and Matt Ahrens about ZFS. It really shows the care and thought that has to exist from the start of development - not later.

Can We Add Snapshots To XFS That Woul....

Stratis exists as a direct implementation of this idea, as envisioned by an XFS developer.

What Filesystem Should I Use Today?

Do you want to guarantee your data is free of corruption?

ZFS for the data you care about (/home) and XFS on LVM for everything else.
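
A minimal sketch of that layout, assuming two spare disks for a mirrored /home pool and an existing LVM volume group called sysvg (all names hypothetical):

# ZFS mirror for the data you care about
zpool create -o ashift=12 homepool mirror /dev/disk/by-id/diskA /dev/disk/by-id/diskB
zfs create -o mountpoint=/home -o compression=lz4 homepool/home

# XFS on LVM for everything else
lvcreate -L 50G -n root sysvg
mkfs.xfs /dev/sysvg/root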

Do you want to avoid mucking about with kernel modules?

XFS on LVM

But why not btr...

Do you value your data? Like seriously?

But I really like snapshots

Fine. Use btrfs on your root (/) volume. But at least make /home xfs.

And have backups. borg backup is good.
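
A minimal borg sketch (the repository path is hypothetical; sort out passphrase handling and off-site replication yourself):

# one-time repository setup
borg init --encryption=repokey /backup/borg-repo

# a daily archive of /home, plus pruning of old archives
borg create --compression zstd /backup/borg-repo::home-{now} /home
borg prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6 /backup/borg-repo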

No, btrfs snapshots aren't backups. Don't try that argument here.

The Roasting Corner 🔥

Wherein I roast every filesystem for being shit except ZFS.

ZFS

ZFS is the gold standard. It sets the bar for every other filesystem to meet, and then fall well short of.

ZFS is a high-performance copy-on-write voluming filesystem. Given a pool of disks, ZFS will create checksummed, redundant storage for you. Natively supporting fast, transparent, inline compression out of the box, snapshots that are fast and efficient, and automatic healing of data if any corruption is found, ZFS not only stores your data, it actively protects it from corruption.

The design of ZFS makes it impervious to filesystem corruption, even with catastrophic disk loss or power failure.
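
To give a flavour of what that looks like in practice - a sketch with hypothetical disk names:

# a mirrored, compressed pool and a dataset for home directories
zpool create tank mirror /dev/disk/by-id/diskA /dev/disk/by-id/diskB
zfs set compression=lz4 tank
zfs create -o mountpoint=/mnt/home tank/home

# snapshots are instant, and rolling back is a single command
zfs snapshot tank/home@before-upgrade
zfs rollback tank/home@before-upgrade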

In addition, the administration tools are second to none. Just look at this:

# zpool status tank

  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 18:47:51 with 0 errors on Thu Aug  8 14:47:53 2024
config:

	NAME                                     STATE     READ WRITE CKSUM
	tank                                     ONLINE       0     0     0
	  mirror-0                               ONLINE       0     0     0
	    scsi-SATA_ST18000NM000J-2T_ZR5EWW2S  ONLINE       0     0     0
	    scsi-SATA_ST18000NM000J-2T_ZR5ER3PJ  ONLINE       0     0     0
	  mirror-2                               ONLINE       0     0     0
	    scsi-SATA_ST18000NM000J-2T_ZR5EY4RR  ONLINE       0     0     0
	    scsi-SATA_ST18000NM000J-2T_ZR5ERJNC  ONLINE       0     0     0

errors: No known data errors

We can easily see that we have a pool of 4 disks, arranged in a raid 10 (raid 1 mirrors assembled into a raid 0). We can see each disk's health and status. We can see the state of the pool, and whether any errors have occurred (in the READ, WRITE, CKSUM columns). We can see the results of the last filesystem check (scrub).

Nothing else comes close to this level of administration quality.

Even though it's a copy-on-write filesystem, ZFS has no issues with storage accounting:

# zfs list -r tank
NAME                                                       USED  AVAIL  REFER  MOUNTPOINT
tank                                                      20.4T  12.2T  2.08M  none
...
tank/pub                                                  5.65T  1.35T  5.65T  /mnt/pub
tank/home                                                 2.69T  12.2T   288K  /mnt/home
tank/home/william                                         1.62T   836G  1.62T  /mnt/home/william

We can see exactly where data is used (the USED column), how much a dataset logically references on disk after compression (REFER), and how much space remains once quotas are considered (AVAIL).
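
Quotas slot straight into the same accounting. For example, capping a (hypothetical) user's dataset changes its AVAIL immediately:

zfs set quota=2T tank/home/william
zfs get used,available,quota tank/home/william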

Even tools for measuring I/O statistics are easy to access:

# zpool iostat -v tank
                                           capacity     operations     bandwidth
pool                                     alloc   free   read  write   read  write
---------------------------------------  -----  -----  -----  -----  -----  -----
tank                                     19.5T  13.2T      0     30   168K  4.31M
  mirror-0                               9.74T  6.62T      0     15  84.5K  2.15M
    scsi-SATA_ST18000NM000J-2T_ZR5EWW2S      -      -      0      7  41.9K  1.08M
    scsi-SATA_ST18000NM000J-2T_ZR5ER3PJ      -      -      0      7  42.5K  1.08M
  mirror-2                               9.76T  6.59T      0     15  83.8K  2.15M
    scsi-SATA_ST18000NM000J-2T_ZR5EY4RR      -      -      0      7  41.6K  1.08M
    scsi-SATA_ST18000NM000J-2T_ZR5ERJNC      -      -      0      7  42.2K  1.08M
---------------------------------------  -----  -----  -----  -----  -----  -----

Or how the filesystem cache is performing (for the record, ZFS bypasses the kernel's default page cache in favour of its vastly improved adaptive replacement cache (ARC)):

# arc_summary

------------------------------------------------------------------------
ZFS Subsystem Report                            Tue Aug 13 10:53:40 2024
Linux 5.14.21-150500.55.68-default                               2.2.5-1
Machine: morgan (x86_64)                                         2.2.5-1

ARC status:                                                      HEALTHY
        Memory throttle count:                                         0

ARC size (current):                                    19.6 %   15.7 GiB
        Target size (adaptive):                        20.0 %   16.0 GiB
        Min size (hard limit):                         20.0 %   16.0 GiB
        Max size (high water):                            5:1   80.0 GiB
...
        MFU data target:                               48.4 %    7.4 GiB
        MFU data size:                                 56.1 %    8.5 GiB
        MFU ghost data size:                                     3.5 GiB
        MFU metadata target:                           11.1 %    1.7 GiB
        MFU metadata size:                              4.8 %  748.8 MiB
        MFU ghost metadata size:                                 0 Bytes
        MRU data target:                               29.4 %    4.5 GiB
        MRU data size:                                 21.2 %    3.2 GiB
        MRU ghost data size:                                     5.2 GiB
        MRU metadata target:                           11.1 %    1.7 GiB
        MRU metadata size:                              8.6 %    1.3 GiB
        MRU ghost metadata size:                                 0 Bytes

...

ARC total accesses:                                                20.7G
        Total hits:                                   100.0 %      20.7G
        Total I/O hits:                               < 0.1 %     511.7k
        Total misses:                                 < 0.1 %       2.8M

I challenge you to find statistics on the Linux filesystem cache hit percentages that are as accessible and easy to use as this.
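
And if the ARC is taking more memory than you would like, capping it is a single module parameter (a sketch - the value is in bytes, 16 GiB here; check where your distro prefers module options to live):

# cap the ARC for the running system
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max

# or persistently across reboots
echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs.conf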

ZFS really sets a high bar for quality, performance and administration. I could go on further but you're probably already bored.

Ext3 / Ext4

Ext3 and Ext4 are journaled filesystems. The primary difference between ext3 and ext4 is the addition of extent allocation in ext4 (allowing larger ranges to be allocated contiguously for files).

Outside of that, ext is very simple. Its journaling supports mounting with journal_checksum, but it appears not to be enabled by default. Here is output from a filesystem I just created.

# sudo dumpe2fs /dev/disk/by-id/something | less
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery \
    extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl

This means that corruption in the journal can damage your filesystem if a power loss occurs. To add further insult, by default most distros don't even use the journal for data - only metadata. Don't believe me? Check the ext4 man page and read the section on the data= defaults. This means that by default you have no crash protection for your data, only the metadata.
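
If you want the journal to actually cover your data, you have to opt in yourself. A hedged sketch (hypothetical device and mountpoint; expect a throughput cost, which is exactly why distros don't default to it):

mount -o data=journal,journal_checksum /dev/sdb1 /data

# or persistently, in /etc/fstab:
# /dev/sdb1  /data  ext4  data=journal,journal_checksum  0 2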

If we examine corruption arising from hardware errors, we also see that those errors are silently ignored.

This means that outside of "performance", files on ext3/4 are generally at risk of silent damage or loss in the event of a kernel crash or sudden power loss.

XFS

XFS is another journaled filesystem which originally came from Silicon Graphics.

XFS supports full metadata and journal checksumming by default, as well as internally implementing transactional operations within the journal.
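
You can confirm this on any v5-format XFS filesystem - assuming it is mounted at /:

# the meta-data section should report crc=1
xfs_info / | grep crc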

However, similar to ext3/4, hardware errors are silently ignored.

Unlike ext3/4, XFS journals your data as well.

This leaves XFS as a very fast simple filesystem that is more reliable than ext3/4, but still prone to damage.

As a consolation prize, XFS has a filesystem test suite called xfstests. This is now the de facto Linux filesystem test suite, which speaks volumes about XFS caring about quality versus everyone else.

Btrfs

Btrfs is cited by one source as being perpetually half finished, and I cannot fault that description.

As a filesystem, btrfs makes a large number of promises and half-delivers on all of them.

Unlike ext3/4 and XFS, btrfs is the first introduction of a copy-on-write filesystem in Linux. This means that rather than writing to a journal before a write-in-place of data, btrfs copies extents/blocks of data and writes them separately from the existing data. In theory, this provides perfect corruption and power loss resilience, because all writes go into new blocks, and only once they are all complete is the filesystem tree atomically flipped from the old version to the new version.

Additionally, this means that the checksums of all the blocks can be created and chained upward in the filesystem tree, providing a perfect trail for validation that each bit stored on the filesystem is exactly as it should be.

However, it doesn't work out quite so nicely in btrfs - btrfs has a long history of filesystem corruption, even though that should be impossible in the design presented. This necessitated the development of a filesystem checker, since btrfs can tie itself in knots so easily. Contrast this with ZFS, which has no such issues.

To add insult to injury, btrfs is the slowest filesystem on linux by a wide margin.

This is to say nothing of the confusing administration experience, and the difficulty btrfs has showing basic data like "free space" or "which disk has died in a raid 1". This is why I chose to rate btrfs as having minimal hardware fault protection: being able to identify which disk died in a raid is table stakes, and btrfs somehow still fails at this basic operation.

And this is to say nothing of the process to fix a damaged raid 1 - it's painful and risky at best.

Apfs

Apfs is Apple's filesystem on MacOS, iPhoneOS and iPadOS. It's not available on Linux, but we include it here because it acts as a great comparison point.

Apfs is a copy-on-write, voluming filesystem. It supports cryptographically signed read-only volumes, which Apple uses for its boot partitions to allow cryptographic verification of the whole OS from the moment the device turns on (Linux has nothing like this!). This also means that if an update goes bad, it can trivially be rolled back.

The tools to manage Apfs (if you go digging) are really straightforward and easy to understand. There are tools to automatically thin snapshots if too many exist and you need local space reclaimed, to add new volumes, and so on.
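
For example (macOS commands; the container identifier disk1 is hypothetical, use diskutil apfs list to find yours):

# list containers, volumes and local snapshots
diskutil apfs list
tmutil listlocalsnapshots /

# reclaim local space by thinning snapshots (purge target ~10GB)
tmutil thinlocalsnapshots / 10000000000

# add a new volume that shares space with the existing ones
diskutil apfs addVolume disk1 APFS Scratch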

The biggest weakness of Apfs is that the developers opted to skip data checksumming, stating that "nvme drives automatically provide ECC meaning that bad blocks won't be returned from the drive". This is, as we call it in the industry, a crock of shit. NVMe drives barely function day to day without ECC, meaning that anything beyond those already minimal margins will result in data loss.

Where Apfs excels, though, is integration with Time Machine, Apple's built-in backup system. By default, Apfs will snapshot your system (even when you are out and about). When Time Machine syncs to the backup destination, it simply transmits the needed Apfs snapshots.

This gives you the ability to locally browse snapshots to review versions of files and recover things you deleted (without needing a recycle bin), and only if you need deep historical access do you have to browse older versions on the backup destination. Add to this that Time Machine is "set and forget", and it's the absolute gold standard in backups today.

This is the major reason Apfs scores points for hardware failure protection - the fact that backups are built in and part of the filesystem from day 1 shines through at every point, giving even consumers data integrity and protection guarantees that no other OS comes near.