Fixing a MacBook Pro 8,2 with dead AMD GPU

I’ve owned a MacBook Pro 8,2 late 2011 edition, which I used from 2011 to about 2018. It was a great piece of hardware, and honestly I’m surprised it lasted so long given how many MacOS and Fedora installs it’s seen.

I upgraded to a MacBook Pro 15,1, and I gave the 8,2 to a friend who was in need of a new computer for her work. It worked really well for her until today, when she messaged me to say the machine was having a problem.

The Problem

The machine appeared to be in a bootloop: just before swapping from the EFI GPU to the main display server, the screen would go black and the machine would lock up or reboot. Booting to single user mode (boot holding cmd + s) showed the machine’s disk was intact with a clean APFS filesystem. The system.log showed corruption at the time of the fault, which didn’t instill confidence.

Attempting a recovery boot (boot holding cmd + r) also yielded the bootloop, so we have potentially eliminated the installed copy of MacOS as the source of the issue.

I then used the Apple Hardware Test (boot while holding d), and it gave the machine a clean bill of health.

I have seen one of these machines give up in the past - my friend’s mother had one from the same generation and it died in almost the same way - could it be the same fault?

The 8,2’s cursed GPU stack

The 8,2 15” MBP has dual GPUs - the on-CPU Intel HD 3000, and an AMD Radeon HD 6750M. The two pass through an LVDS graphics multiplexer (GMUX) to the main panel. The external display port however is not so clear cut - the DDC lines are passed through the GMUX, but the data lines attach directly to the DisplayPort.

The machine is also able to boot with EFI rendering to either card. By default this is the AMD Radeon. Whichever card is used at boot is also the first card MacOS attempts to use, but it will try to swap to the Radeon later on.

A large number of Radeons from this generation developed faults in their 3D rendering capability, so the card would render the EFI buffer correctly, but fail as soon as 3D rendering was initiated. Sounds like what we have here!

To fix this …

Okay, so this is fixable. First, we need to tell EFI to boot primarily from the Intel card. Boot to single user mode and then run:

nvram fa4ce28d-b62f-4c99-9cc3-6815686e30f9:gpu-power-prefs=%01%00%00%00

Now we need to prevent loading of the AMD drivers so that during boot MacOS doesn’t attempt to swap from the Intel to the Radeon. We can do this by hiding the drivers. System Integrity Protection will stop you, so you need to do this from recovery. Boot with cmd + r, which now works thanks to the EFI change, then open Terminal:

cd "/Volumes/Macintosh HD"
sudo mkdir amdkext
sudo mv System/Library/Extensions/AMDRadeonX3000.kext amdkext/

Then reboot. You’ll notice the fans go crazy, because the Radeon can’t be powered down without its driver. To fix this, we can load the driver after boot, which lets the system spin the card down and quiet the fans.

To achieve this we make a helper script:

# cat /usr/local/libexec/amd_kext_load.sh
#!/bin/sh
/sbin/kextload /amdkext/AMDRadeonX3000.kext

And a launchd daemon to run it:

# cat /Library/LaunchDaemons/au.net.blackhats.fy.amdkext.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
        <dict>
                <key>Label</key>
                <string>au.net.blackhats.fy.amdkext</string>
                <key>Program</key>
                <string>/usr/local/libexec/amd_kext_load.sh</string>
                <key>RunAtLoad</key>
                <true/>
                <key>StandardOutPath</key>
                <string>/var/log/amd_kext_load.log</string>
        </dict>
</plist>

Now if you reboot, you’ll have a working Mac, and the fans will spin down properly. I’ve tested this with suspend and resume too, and it works! The old beast continues to live :)

There are no root causes

At the Gold Coast LCA2020 I gave a lightning talk on swiss cheese. Well, maybe not really swiss cheese, but the swiss cheese failure model, which was proposed at the University of Manchester.

Please note this will cover some of the same topics as the talk, but in more detail, and with fewer jokes.

An example problem

So we’ll discuss the current issues behind modern CPU isolation attacks, i.e. Spectre. Spectre is an attack that uses the timing of a CPU’s speculative execution unit to retrieve information from another running process on the same physical system.

Modern computers rely on hardware features in their CPU to isolate programs from each other. This could be isolating your web-browser from your slack client, or your sibling’s login from yours.

This isolation however has been compromised by attacks like Spectre, and it looks unlikely that it can be resolved.

What is speculative execution?

In order to be “fast”, modern CPUs are far more complex than most of us have been taught. Often we believe that a CPU thread/core is executing “one instruction/operation” at a time. However this isn’t how most CPUs work. Most work by having a pipeline of instructions that are in various stages of execution. You could imagine it like this:

let mut x = 0;
let mut y = 0;
x = 15 * some_input;
y = 10 * other_input;
if x > y {
    return true;
} else {
    return false;
}

This is some made up code, but in a CPU, every part of this could be in the “pipeline” at once.

let mut x = 0;                  <<-- at the head of the queue and "further" along completion
let mut y = 0;                  <<-- it's executed part way, but not to completion
x = 15 * some_input;
y = 10 * other_input;           <<-- all of these are in pipeline, and partially complete
if x > y {                      <<-- what happens here?
    return true;
} else {
    return false;
}

So how does this “pipeline” handle the if statement? If the pipeline is looking ahead, how can we handle a choice like an if? Can we really predict the future?

Speculative execution

At the if statement, the CPU uses past measurements to make a prediction about which branch might be taken, and it then begins to execute that path, even though ‘x > y’ has not been resolved yet - at this point x and y may not even have finished being computed!

Let’s assume for now our branch predictor thinks that ‘x > y’ is false, so we’ll start to execute the “return false” or any other content in that branch.

Now the earlier instructions catch up and we resolve: “did we really predict correctly?”. If we did, great! We have been able to advance the program state asynchronously, even without knowing the answer until we got there.

If not, ohh nooo. We have to unwind what we were doing, clear some of the pipeline and try to do the correct branch.

Of course this has an impact on the timing of the program. Some people found you could write a program to manipulate this predictor, and that with specific addresses and content, the speculative executor could be made to touch memory the program is not allowed to access before the unroll occurs. By timing these variations, they could retrieve the contents of memory they are not allowed to access, breaking isolation.

Owwww my brain

Yes. Mine too.

Community Reactions

Since this was found, a large amount of the community reaction has been about the “root cause”. ‘Clearly’ the root cause is “Intel are bad at making CPUs”, so everyone should buy AMD instead because they “weren’t affected quite as badly”. We’ve had some Intel CPU updates and kernel/program fixes, so all good right? We addressed the root cause.

Or … did we?

Our computers are still asynchronous, and contain many out-of-order parts. It’s hard to believe we have “found” every method of exploiting this. Indeed, in the last year many more ways to bypass hardware isolation due to our systems’ async nature have been found.

Maybe the “root cause” wasn’t addressed. Maybe … there are no ….

History

To understand how we got to this situation we need to look at how CPUs have evolved. This is not a complete history.

The PDP-11 was a system used at Bell Labs, where the C programming language was developed. Back then CPUs were very simple - a CPU and memory, executing one instruction at a time.

The C programming language gained a lot of popularity as it was able to be “quickly” ported to other CPU models to allow software to be compiled on other platforms. This led to many systems being developed in C.

Intel introduced the 8086, and many C programs were ported to run on it. Intel then released the 80486 in 1989, which added an on-chip cache and an instruction pipeline to improve performance. In order to continue to support C, the memory model could not change from the PDP-11’s - the cache had to be transparent, and the pipeline could not expose state.

This has of course led to computers being more important in our lives and businesses, so we expected further performance, leading to increased frequencies and async behaviours.

The limits of frequency were really hit in the Pentium 4 era, when about 4GHz was shown to be a barrier of stability for those systems. They had very deep pipelines to improve performance, but that also caused issues when branch prediction failed, resulting in pipeline stalls. Systems had to improve their async behaviours further to squeeze out every single piece of performance possible.

Compiler developers also wanted more performance so they started to develop ways to transform C in ways that “took advantage” of x86_64 tricks, by manipulating the environment so the CPU is “hinted” into states we “hope” it gets into.

Many businesses also started to run servers to provide to consumers, and in order to keep costs low they would put many users onto single pieces of hardware so they could share or overcommit resources.

This has created a series of positive reinforcement loops - C is ‘ABI stable’ so we keep developing in it due to its universal nature. C code can’t be changed without breaking every existing system. We can’t change the CPU memory model without breaking C, which is hugely prevalent. We improve the CPU to make C faster, transparently, so that users and businesses can run more C programs and users. And then we improve compilers to make C faster given the quirks of the current CPU models that exist …

Swiss cheese model

It’s hard to look at the current state of systems security and simply say “it’s the CPU vendors’ fault”. There are many layers that have come together to cause this situation.

This is called the “swiss cheese model”. Imagine you take a stack of swiss cheese slices and rotate and rearrange them. You will not be able to see through it. But as you continue to rotate and rearrange, eventually you may see a tunnel through the cheese where all the holes line up.

This is what has happened here - we developed many layers, socially and technically, that all seemed reasonable over time, and only after enough time and re-arrangement of the layers have we arrived at a situation where a failure has occurred that permeates all of computer hardware.

To address it, we need to look beyond just “blaming hardware makers” or “software patches”. We need to help developers move away from C to other languages that can be brought onto new memory models with manual or other cache strategies. We need hardware vendors to implement different async models. We need to educate businesses on risk analysis and how hardware works to enable proper decision making. We need developers to alter their behaviour to work in environments with higher performance constraints. And probably much much more.

There are no root causes

It is a very pervasive attitude in IT that every issue has a root cause. However, looking at the above we can see it’s never quite so simple.

Saying an issue has a root cause prevents us from examining the social, political, economic and human factors that all become contributing factors to failure. Because we are unable to examine them, we are unable to address the various layers that have contributed to our failures.

There are no root causes. Only contributing factors.

Concurrency 2: Concurrently Readable Structures

In this post, I’ll discuss concurrently readable datastructures that exist, and ideas for future structures. Please note, this post is an in-progress design, and may be altered in the future.

Before you start, make sure you have read part 1

Concurrent Cell

The simplest form of concurrently readable structure is a concurrent cell. This is equivalent to a read-write lock, but with concurrently readable properties instead. The key mechanism that enables this is that when the writer begins, it clones the data before writing it. We trade more memory usage for a gain in concurrency.

To see an implementation, see my Rust crate, concread.
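
To make the mechanism concrete, here is a minimal sketch of the idea in Rust - an illustration of the concept only, not the actual concread API. The writer clones the current generation, mutates the clone, then publishes it, while readers simply take a handle to whatever generation is current:

use std::sync::{Arc, Mutex};

// A toy concurrently readable cell (a sketch, not the concread API).
struct CowCell<T> {
    active: Mutex<Arc<T>>, // the currently published generation
    write: Mutex<()>,      // serialises writers
}

impl<T: Clone> CowCell<T> {
    fn new(data: T) -> Self {
        CowCell {
            active: Mutex::new(Arc::new(data)),
            write: Mutex::new(()),
        }
    }

    // Readers get an Arc to the generation that was current when they began;
    // it stays valid even if a writer publishes a new generation later.
    fn read(&self) -> Arc<T> {
        self.active.lock().unwrap().clone()
    }

    // The writer clones the data, lets the caller mutate the clone, then
    // swaps it in as the new active generation.
    fn write_with(&self, f: impl FnOnce(&mut T)) {
        let _guard = self.write.lock().unwrap();
        let mut next = (**self.active.lock().unwrap()).clone();
        f(&mut next);
        *self.active.lock().unwrap() = Arc::new(next);
    }
}

Readers never block the writer or each other; the trade-off is the full clone on each write.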

Concurrent Tree

The concurrent cell is good for small data, but a larger structure - like a tree - may take too long to clone on each write. A good estimate is that if your data in the cell is larger than about 512 bytes, you likely want a concurrent tree instead.

In a concurrent tree, only the branches involved in the operation are cloned. Imagine the following tree:

         --- root 1 ---
        /               \
    branch 1         branch 2
    /     \         /        \
leaf 1   leaf 2  leaf 3    leaf 4

When we attempt to change a value in leaf 4 we copy it before we begin.

      ---------------------------
     /   --- root 1 ---          \-- root 2
     v  /               \                \
    branch 1         branch 2            branch 2(c)
    /     \         /        \           /    \
leaf 1   leaf 2  leaf 3    leaf 4        |    leaf 4(c)
                    ^                    |
                    \-------------------/

In the process, the pointer from the new root 2 to branch 1 is maintained, and branch 2(c) maintains a pointer to leaf 3.

This means that in this example only 3 of the 7 nodes are copied, saving a lot of cloning. As your tree grows this saves a lot of work. Consider a tree with a node width of 7 pointers and a height of 5 levels. Assuming perfect layout, you only need to clone 5 of ~16000 nodes - a huge saving in memory copies!

The interesting part is that a reader of root 1 is also unaffected by the changes in root 2 - the tree from root 1 hasn’t been changed, as all its pointers and nodes are still valid.

When the last reader of root 1 ends, we clean up all the nodes it pointed to that are no longer needed by root 2 (this can be done with atomic reference counting, or garbage lists in transactions).

         --- root 2 ---
        /               \
    branch 1         branch 2(c)
    /     \         /        \
leaf 1   leaf 2  leaf 3    leaf 4(c)

It is through this copy-on-write (also called multi-version concurrency control) that we achieve concurrent readability in the tree.
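
As a minimal sketch of the path-copying idea - a toy binary tree here, rather than the wide B+tree a real implementation would use:

use std::sync::Arc;

// A toy copy-on-write binary tree. An update clones only the nodes on the
// path from the root to the changed leaf; untouched subtrees are shared
// between the old and new roots via Arc.
enum Node {
    Leaf(u64),
    Branch { left: Arc<Node>, right: Arc<Node> },
}

// Returns a new root in which the right-most leaf holds `value`. The old
// root is untouched and remains valid for any reader still using it.
fn set_rightmost(node: &Arc<Node>, value: u64) -> Arc<Node> {
    match node.as_ref() {
        Node::Leaf(_) => Arc::new(Node::Leaf(value)),
        Node::Branch { left, right } => Arc::new(Node::Branch {
            left: Arc::clone(left),             // shared, not copied
            right: set_rightmost(right, value), // copied along the path
        }),
    }
}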

This is really excellent for databases where you have in-memory structures that work in parallel to the database transactions. In kanidm an example is the in-memory schema that is used at run time but loaded from the database. These structures require transactional behaviours to match the database, and ACID properties, so that readers of a past transaction have the “matched” schema in memory.

Future Idea - Concurrent Cache

A design I have floated in my head is a concurrently readable cache - it should have the same transactional properties as a concurrently readable structure - one writer, multiple readers with consistent views of the data. As well, it should support rollback if a writer fails.

This scheme should work with any cache type - LRU, LRU2Q, LFU. I plan to use ARC however.

ARC (Adaptive Replacement Cache) was popularised by ZFS - but ARC is not specific to ZFS, it’s a strategy for cache replacement.

ARC is a combination of an LRU and an LFU with a set of ghost lists and a weighting factor. When an entry is “missed”, it’s inserted into the LRU. When it’s accessed from the LRU a second time, it moves to the LFU.

When entries are evicted from the LRU or LFU they are added to the ghost list. When a cache miss occurs, the ghost list is consulted. If the entry “would have been” in the LRU, but was not, the LRU grows and the LFU shrinks. If the item “would have been” in the LFU but was not, the LFU is expanded.

This causes ARC to be self tuning to your workload, as well as balancing “high frequency” and “high locality” operations.

A major problem though is that ARC is not designed for concurrency - the LRU/LFU rely on doubly linked lists, which are very much something only a single thread can modify safely.

How to make ARC concurrent

To make this concurrent, I think it’s important to specify the goals.

  • Readers should be able to read and find entries in the cache
  • If a reader locates a missing entry it must be able to load it from the database
  • The reader should be able to send loaded entries to the cache so they can be used
  • Reader access metrics should be acknowledged by the cache
  • Multiple reader generations should exist
  • A writer should be able to load entries to the cache
  • A writer should be able to modify an entry of the cache without affecting readers
  • Writers should be able to be rolled back with low penalty

There are a lot of places to draw inspiration from, and I don’t think I can list - or remember - them all.

My current “work in progress” is that we use a concurrently readable pair of trees to store the LRU and LFU. These trees are able to be read by readers, and a writer can concurrently write changes.

The ghost lists of the LRU/LFU are maintained single-threaded by the writer. The linked lists for both are also single-threaded, and use key references from the main trees to maintain themselves. The writer holds the receive end of an mpsc queue. Finally, the writer has an always-incrementing transaction id associated with it.

A reader, when initiated, has access to the send side of the queue and the transaction id of the writer generation that created it. The reader also has an empty hash map.

Modification to ARC

A modification is that we need to retain the transaction ids related to items. This means the LRU and LFU contain:

type Txid = usize;

struct ARC<K, V> {
    lru: LRU<K, Value<V>>,
    lfu: LFU<K, Value<V>>,
    ghost_lru: BTreeMap<K, Txid>,
    ghost_lfu: BTreeMap<K, Txid>,
}

struct Value<V> {
    txid: Txid,
    data: V,
}

Reader Behaviour

The reader is the simpler part of the two, so we’ll start with that.

When a reader seeks an item in the cache, it references the read-only LRU/LFU trees. If the item is found, we queue a cache-hit marker to the channel.

If we miss, we look in our local hashmap. If found we return that.

If it is not in the local hashmap, we now seek in the database - if found, we load the entry. The entry is stored in our local hashmap.

As the reader transaction ends, we send the set of entries in our local hash map as values (see Modification to ARC), so that the data is associated with the transaction id of the generation in which it was loaded. These have to be kept together, as the queue could be receiving items from many generations at once.

The reader attempts a “try_include” at the end of the operation, and if unable, it proceeds.

enum State<V> {
    Missed(V),
    Accessed,
}

struct ChanValue<K, V> {
    txid: Txid,
    key: K,
    data: State<V>
}
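
Putting that together, the reader lookup path might look something like this - a sketch only, reusing the Txid, Value, State and ChanValue types above, with a read-only map standing in for the LRU/LFU trees and a hypothetical db_load fallback:

use std::collections::{BTreeMap, HashMap};
use std::sync::mpsc::Sender;

// A sketch of the reader lookup path. `snapshot` stands in for the
// read-only trees of this generation, `local` is the reader's private
// hashmap of earlier misses, and `db_load` is a hypothetical database
// fallback.
fn reader_get<K: Ord + Clone + std::hash::Hash, V: Clone>(
    snapshot: &BTreeMap<K, Value<V>>,
    local: &mut HashMap<K, V>,
    tx: &Sender<ChanValue<K, V>>,
    txid: Txid,
    key: &K,
    db_load: impl Fn(&K) -> Option<V>,
) -> Option<V> {
    // 1. Hit in the shared read-only snapshot: report the access so the
    //    writer can update the LRU/LFU metrics in its include step.
    if let Some(v) = snapshot.get(key) {
        let _ = tx.send(ChanValue {
            txid,
            key: key.clone(),
            data: State::Accessed,
        });
        return Some(v.data.clone());
    }
    // 2. Hit in our private, per-transaction store of earlier misses.
    if let Some(v) = local.get(key) {
        return Some(v.clone());
    }
    // 3. Miss: load from the database and remember it locally. At the end
    //    of the transaction these entries are sent as State::Missed with
    //    this generation's txid.
    let v = db_load(key)?;
    local.insert(key.clone(), v.clone());
    Some(v)
}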

Writer Behaviour

There are a number of aspects to writer behaviour. The writer is responsible for maintaining a local cache of missed items, a local cache of written (dirty) items, managing the global LRU/LFU, and responding to reader inclusion requests.

When the writer looks up a value, it looks in the LFU/LRU. If found (and the writer is reading) we return the data to the caller, and add an “accessed” value to the local thread store.

If the writer is attempting to mutate, we clone the value and put it into the local thread store in the “dirty” state.

enum State<V> {
    Dirty(V),
    Clean(V),
    Accessed,
}

struct Value<V> {
    txid: usize,
    state: State<V>
}

If it is not found, we seek the value in the database. It is added to the cache. If this is a write, we flag the entry as dirty. Else it’s flagged clean.

If we abort, we move to the include step before we complete the operation.

If we commit, we write our clean- and dirty-flagged data to the LFU/LRU as required. The LRU/LFU manages its own lists and sets, and as only the writer touches them this is safe alongside the concurrent readers. We indicate which items have been accessed.

We then perform an “include” operation. Readers attempt this at the end of their operations if the lock can be taken, and skip it if not.

We dequeue values from the queue, up to some limit. For each value that is dequeued, we look it up in our LRU/LFU and apply the following rules (a rough sketch in code follows the list).

  • If the value was not in the ARC, but was in the ghost list, include it and its txid if the txid is higher than the ghost key’s txid
  • If the value was not in the ARC, and not in the ghost list, include it.
  • If the value is in the ARC, but a lower txid, we update the access metrics.
  • If the value is in the ARC and a higher txid, we update the access metrics and update the value to the newer version.
  • If the value is an accessed marker, and the item is in the ghost list, continue
  • If the value is an accessed marker, and the item is in the ARC, we update its access metrics
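
As a rough sketch of those rules - simplifying the cache and ghost lists to plain maps, and reusing the ChanValue/State types from the reader section; a real version would act on the LRU/LFU trees and update access metrics where the comments indicate:

use std::collections::BTreeMap;

// Simplified include step: `cache` stands in for the LRU/LFU and maps a key
// to the (txid, data) we currently hold; `ghost` merges the ghost lists and
// maps a key to the txid of the version that was evicted.
fn include<K: Ord, V>(
    cache: &mut BTreeMap<K, (Txid, V)>,
    ghost: &BTreeMap<K, Txid>,
    msgs: Vec<ChanValue<K, V>>,
) {
    for msg in msgs {
        match msg.data {
            State::Missed(data) => {
                let cached_txid = cache.get(&msg.key).map(|(t, _)| *t);
                match cached_txid {
                    // Cached with an equal or newer txid: metrics update only.
                    Some(t) if t >= msg.txid => { /* update access metrics */ }
                    // Cached but older: take the newer value (and touch it).
                    Some(_) => {
                        cache.insert(msg.key, (msg.txid, data));
                    }
                    // Not cached: include unless a newer eviction is recorded
                    // in the ghost list.
                    None => {
                        let shadowed =
                            ghost.get(&msg.key).map_or(false, |g| msg.txid <= *g);
                        if !shadowed {
                            cache.insert(msg.key, (msg.txid, data));
                        }
                    }
                }
            }
            // Accessed markers only update metrics if the item is still cached.
            State::Accessed => {
                if cache.contains_key(&msg.key) { /* update access metrics */ }
            }
        }
    }
}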

Questions for Future William

ARC moves an entry from the LRU to the LFU if the LRU has a hit, but this seems overly aggressive. Perhaps this should only happen if the LRU gets a hit on 2 occasions?

A background thread must wake and feed the cache if the reader queue is not otherwise being drained, as we don’t want the queue to grow without bound.

Limitations and Concerns

Cache misses are very expensive - multiple threads may load the same value, the readers must queue the value, and the writer must then act on the queue. Sizing the cache to be large enough is critically important, as eviction/missing will have a higher penalty than normal. Optimally the cache will be as large as - or larger than - the working set.

Due to the inclusion cost, the cache may be “slow” during warm up, so this style of cache really matters for highly concurrent software that can not tolerate locking behaviour, and for items where the normal code paths are extremely slow - i.e. large item deserialisation and return.

Concurrency 1: Types of Concurrency

I want to explain different types of concurrent datastructures, so that we can explore their properties and when or why they might be useful.

As our computer systems become increasingly parallel and asynchronous, it’s important that our applications are able to work in these environments effectively. Languages like Rust help us to ensure our concurrent structures are safe.

CPU Memory Model Crash Course

In no way is this a thorough, complete, or 100% accurate representation of CPU memory. My goal is to give you a quick brief on how it works. I highly recommend you read “What Every Programmer Should Know About Memory” if you want to learn more.

In a CPU we have a view of a memory space. That could be in the order of KB to TB. But it’s a single coherent view of that space.

Of course, over time systems and people have demanded more and more performance. But we also have languages like C that won’t change their view of a system as a single memory space, or change how they work. Of course, it turns out C is not a low level language, but we like to convince ourselves it is.

To keep working with C and others, CPUs have acquired caches that are transparent to the operation of memory. You have no control over what is - or is not - in the cache. It “just happens” asynchronously. This is exactly why Spectre and Meltdown happened (and will continue to happen): these async behaviours will always have the observable effect of making your CPU faster. Who knew!

Anyway, for this to work, each CPU has multiple layers of cache. At L3 the cache is shared by all the cores on the die. At L1 it is per-core.

Of course it’s a single view into memory. So if address 0xff is in the CPU cache of core 1, and also in cache of core 2, what happens? Well it’s supported! Caches between cores are kept in sync via a state machine called MESI. These states are:

  • Exclusive - The cache is the only owner of this value, and it is unchanged.
  • Modified - The cache is the only owner of this value, and it has been changed.
  • Invalid - The cache holds this value but another cache has changed it.
  • Shared - This cache and maybe others are viewing this valid value.

To gloss very heavily over this topic, we want to avoid “invalid”. Why? Because it means two CPUs are contending for the value, causing many attempts to keep each other in check. These contentions cause CPUs to slow down.

We want values to be either in E/M or S. In shared, many CPUs are able to read the value at maximum speed, all the time. In E/M, we know only this CPU is changing the value.

This cache coherency is also why mutexes and locks exist - they issue the needed CPU commands to keep the caches in the correct states for the memory we are accessing.

Keep in mind Rust’s variables are either immutable and able to be shared between threads, or mutable and single thread only. Sound familiar? Rust is helping with concurrency by keeping our variables in the fastest possible cache states.
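
A tiny sketch of that split, using the standard library’s scoped threads:

use std::thread;

fn main() {
    // Immutable data can be shared by reference across many threads at once,
    // so every core can hold it in the fast shared cache state ...
    let shared = vec![1, 2, 3];
    thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| println!("{:?}", shared));
        }
    });

    // ... while mutable data is exclusive: only one &mut can exist at a time,
    // so no other thread can be reading while we change it.
    let mut exclusive = vec![1, 2, 3];
    exclusive.push(4);
}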

Data Structures

We use data structures in programming to help improve behaviours of certain tasks. Maybe we need to find values quicker, sort contents, or search for things. Data Structures are a key element of modern computer performance.

However most data structures are not thread safe. This means only a single CPU can access or change them at a time. Why? Because if a second CPU were to read them at the same time, due to cache differences in content it may see an invalid datastructure, leading to undefined behaviour.

Mutexes can be used, but this causes other CPUs to stall and wait for the mutex to be released - not really what we want on our system. We want every CPU to be able to process data without stopping!

Thread Safe Datastructures

There exist many types of thread safe datastructures that can work on parallel systems. They often avoid mutexes to try and keep CPUs moving as fast as possible, relying on special atomic CPU operations to keep all the threads in sync.

Multiple classes of these structures exist, which have different properties.

Mutex

I have mentioned these already, but it’s worth specifying the properties of a mutex. A mutex is a system where a single CPU at a time holds the mutex. It becomes the one “reader/writer”, and all other CPUs must wait until the mutex is released by the current holder.
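
For example, a counter protected by Rust’s std::sync::Mutex - every access takes the lock, so the threads serialise on it:

use std::sync::Mutex;
use std::thread;

fn main() {
    // Only the thread holding the lock can read or write the counter;
    // all other threads block until the mutex is released.
    let counter = Mutex::new(0u64);
    thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| {
                for _ in 0..1000 {
                    *counter.lock().unwrap() += 1;
                }
            });
        }
    });
    println!("{}", *counter.lock().unwrap()); // 4000
}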

Read Write Lock

Often called RWlock, these allow one writer OR multiple parallel readers. If a reader is reading then a writer request is delayed until the readers complete. If a writer is changing data, all new reads are blocked. All readers will always be reading the same data.

These are great for highly concurrent systems provided your data changes infrequently. If you have a writer changing data a lot, your readers will be continually blocked. The delay on the writer is also high, due to the potentially large number of parallel readers that need to exit.
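
In Rust this maps to std::sync::RwLock - many read guards may be held at once, but the write guard is exclusive:

use std::sync::RwLock;
use std::thread;

fn main() {
    let config = RwLock::new(String::from("v1"));
    thread::scope(|s| {
        // Many readers can hold the read lock in parallel ...
        for _ in 0..4 {
            s.spawn(|| {
                let r = config.read().unwrap();
                println!("reader sees {}", *r);
            });
        }
        // ... but the writer needs exclusive access, waiting for current
        // readers to finish and blocking new ones while it holds the lock.
        s.spawn(|| {
            let mut w = config.write().unwrap();
            *w = String::from("v2");
        });
    });
}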

Lock Free

Lock free is a common (and popular) datastructure type. These are structures that don’t use a mutex at all, and can have multiple readers and multiple writers at the same time.

The most common and popular lock free structure is the queue, where many CPUs can append items and many can dequeue at the same time. There are also a number of lock free sets which can be updated in the same way.

An interesting part of lock free is that all CPUs are working on the same set - if CPU 1 reads a value, then CPU 2 writes the same value, the next read from CPU 1 will show the new value. This is because these structures aren’t transactional - lock free, but not transactional. There are times where this is a really useful property - when you need a single view of the world between all threads, and your program can tolerate data changing between reads.
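
A small illustration of that single shared view, using an atomic counter - one of the building blocks lock free structures are made from - instead of a mutex:

use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

fn main() {
    // No mutex: every thread updates the same value with an atomic
    // instruction, and any later read observes the newest value - there is
    // no per-reader snapshot or transaction.
    static COUNTER: AtomicU64 = AtomicU64::new(0);
    thread::scope(|s| {
        for _ in 0..4 {
            s.spawn(|| {
                for _ in 0..1000 {
                    COUNTER.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    println!("{}", COUNTER.load(Ordering::Relaxed)); // 4000
}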

Wait Free

This is a specialisation of lock free, where the reader/writer has guaranteed characteristics about the time they will wait to read or write data. This is very detailed and subtle, only affecting real time systems that have strict deadline and performance requirements.

Concurrently Readable

In between all of these is a type of structure called concurrently readable. A concurrently readable structure allows one writer and multiple parallel readers. An interesting property is that when the reader “begins” to read, the view for that reader is guaranteed not to change until the reader completes. This means that the structure is transactional.

An example being if CPU 1 reads a value, and CPU 2 writes to it, CPU 1 would NOT see the change from CPU 2 - it’s outside of the read transaction!

In this way there is a lot of read-only immutable data, and one writer mutating and changing things … sounds familiar? It’s very close to how our CPU caches work!

These structures also naturally lend themselves well to long-running processing or database systems where you need transactional (ACID) properties. In fact some databases use concurrently readable structures to achieve ACID semantics.

If it’s not obvious - concurrent readability is where my interest lies, and in the next post I’ll discuss some specific concurrently readable structures that exist today, and ideas for future structures.

Packaging and the Security Proposition

As a follow up to my post on distribution packaging, Fraser Tweedale (@hackuador) commented that traditionally the “security” aspects of distribution packaging were a compelling reason to use distribution packages over “upstreams”. I want to dig into this further.

Why does C need “securing”

C as a language is unsafe in every meaning of the word. The best C programmers on the planet are incapable of writing a secure program. This is because to code in C you have to express a concurrent problem in a language that is linearised, which is compiled relying on undefined behaviour, to be executed on an asynchronous, concurrent, out of order CPU. What could possibly go wrong?!

There is a lot you need to hold in mind to make C work. I can tell you now that I spend a majority of my development time thinking about the code to change rather than writing C because of this!

This has led to C based applications having just about every security issue known to man.

How is C “secured”

So as C is security swiss cheese, we have developed processes around the language to soften this issue - for example, advice like “patch and update continually”, as new changes are continually released to resolve issues.

Distribution packages have always been the “source” of updates for these libraries and applications. These packages are maintained by humans who need to update them. This means when a C project releases a fix, the maintainers apply the patch to various versions, and then release the updates. Because of C’s dynamic linking, these library updates mean that when the machine is next rebooted (yes rebooted, not application restarted) the fixes apply to all consumers who have linked to that library - change one, fix everything. Great!

But there are some (glaring) weaknesses in this model. C historically has little to no application testing, so many of these patches and their effects can’t be reproduced. This subsequently means that consuming applications also aren’t re-tested adequately. It can also have impacts where a change to a shared library affects a consuming application in a way that was unforeseen as the library changed.

The Dirty Secret

The dirty secret of many of these things is that “thoughts and prayers” is often the testing strategy of choice when patches are applied. It’s only because humans carefully think about and write tiny amounts of C that we have any reliability in our applications. And we already established that it’s nearly impossible for humans to write correct C …

Why Are We Doing This?

Because C linking and interfaces are so fragile, and due to the huge scope in which C can go wrong due to being a memory unsafe language, distributions and consumers have learnt to fear version changes. So instead we patch ancient C code stacks, barely test them, and hope that our castles of sand don’t fall over, all so we can keep “the same version” of a program to avoid changing it as much as possible. Ironically this makes those stacks even worse because we’ve developed infinite numbers of bespoke barely tested packages that people rely on daily.

To add more insult to this, most of the process is manual - humans monitor mailing lists, and have to know what code needs which patch, and when, in which release streams. It’s a monumental amount of human time and labour to keep the sand castles standing. This manual involvement leads to information overload, and to maintainers potentially missing security updates or releases, which causes many distribution packages to be outdated, missing patches, or vulnerable more often than not. In other cases, packages continue to be shipped that are unmaintained or have no upstream, so any issues that may exist are unknown or unresolved.

Distribution Security

This means all of platform and distribution security comes down to one factor.

A lot of manual human labour.

It is only because distributions have so many volunteers and paid staff that this entire system continues to progress and give the illusion of security and reliability. When it fails, it fails silently.

Heartbleed really dragged the poor state of C security into the open, and it’s still not been addressed.

When people say “how can we secure docker/flatpak/Rust” like we do with distributions, I say: “Do we really secure distributions at all?”. We only have a veneer of best effort masquerading as a secure supply chain.

A Different Model …

So let’s look briefly at Rust and how you package it today (against distribution maintainer advice).

Because it’s statically linked, each application must be rebuilt if a library changes. Because the code comes from a central upstream, there are automated tools to find security issues (like cargo audit). Updates are pulled from the library as a whole, working, tested unit, and then built into our application to receive further testing and verification of the application as a whole, singular functional unit.

These dependencies can then be vendored into a tar (allowing offline builds and some aspects of reproducibility). This vendor.tar.gz is placed into the source RPM along with the application source, and then built.

There is a much stronger pipeline of assurances here! And to further aid Rust’s cause, because it is a memory safe language it eliminates most of the security issues that C is afflicted by, causing security updates to be far fewer and to often affect higher level or more esoteric situations. If you don’t believe me, look at the low frequency, and low severity, of commits to the Rust advisory-db.

People have worried that because Rust is statically linked we’ll have to rebuild and update applications continually to keep them secure - I’d say that because it’s Rust we’ll have stronger guarantees at build time that security issues are less likely to exist, and we won’t have to ship updates nearly as often as a C stack.

Another point to make is that Rust libraries don’t release patches - because of Rust’s stronger guarantees at compile time and through integrated testing, people are less afraid of updating to new versions. We are very unlikely to see Rust libraries releasing patches; rather they just ship “updates” and expect you to update. Because these are statically linked, we don’t have to worry about the versions of other libraries on the platform - we only need to assure that the application is currently working as intended. Because of the strong typing, the interfaces of those libraries have stronger guarantees at compile time, meaning the issues around shared object versioning and symbol/version mismatching simply don’t exist - one of the key reasons people became version-change averse in the first place.

So Why Not Package All The Things?

Many distribution packagers have been demanding a C-like model for Rust and others (remember, square peg, round hole). This means every single crate (library) is packaged, and then added to a set of buildrequires for the application. When a crate updates, it triggers the application to rebuild. When a security update for a library comes out, it rebuilds etc.

This should sound familiar … because it is. It’s reinventing Cargo in a clean-room.

RPM provides a way to manage dependencies. Cargo provides a way to manage dependencies.

RPM provides a way to offline build sources. Cargo provides a way to offline build sources.

RPM provides a way to patch sources. Cargo provides a way to update them inplace - and patch if needed.

RPM provides a way to … okay you get the point.

There is also a list of what we won’t get from distribution packages - remember, distribution packages are the C language packaging system.

We won’t get the same level of attention to detail, innovation and support as the upstream language tooling has. Simply put, users of the language just won’t use distribution packages (or toolchains, libraries …) in their workflows.

What distribution packages can’t offer is the integration into tools like cargo-audit for scanning for security issues - that still needs Cargo, not RPM, meaning the RPM tooling would need to emulate what Cargo does exactly.

Using distribution packages means you have an untested pipeline that may add more risks now. Developers won’t use distribution packages - they’ll use cargo. Remember applications work best as they are tested and developed - outside of that environment they are an unknown.

Finally, the distribution maintainers’ security proposition is to secure our libraries - for distributions only. That’s acting in self interest. Cargo is offering a way to secure upstream so that everyone benefits. That means less effort and less manual labour all around. And secure libraries are not the full picture - secure applications are what matters.

The larger concern is the sheer amount of human effort. We would spend hundreds if not thousands of hours to reinvent a functional tool in a disengaged manner, just so that we can do things as they have always been done in C - for the benefit of distributions individually rather than languages upstream.

What is the Point

Again - as a platform our role is to provide applications that people can trust. The way we provide these applications is never going to be one size fits all. Our objective isn’t to secure “this library” or “that library”, it’s to secure applications as a functional whole. That means that companies shipping those applications, should hire maintainers to work on those applications to secure their stacks.

Today I honestly think Rust has a better security and updating story than C packages ever have had, powered by automation and upstream integration. Let’s lean on that, contribute to it, and focus on shipping applications instead of reinventing tools. We need to accept that our current model is focused on C, that developers have moved around distribution packaging, and that we need to change our approach to eliminate the large human risk factor that currently exists.

We can’t keep looking to the models of the past, we need to start to invest in new methods for the future.

Today, distributions should focus on supporting and distributing _applications_ and work with native language supply chains to enable this.

Which is why I’ll keep using cargo’s tooling and auditing, and use distribution packages as a delivery mechanism for those applications.

What Could it Look Like?

We have a platform that updates as a whole (Fedora Atomic comes to mind …) with known snapshots that are tested and well known. This platform has methods to run applications, and those applications are isolated from each other, have their own libraries, and security audits.

And because there are now far fewer moving parts, quality is easier to assert and understand, and security updates are far easier, faster, and less risky.

It certainly sounds a lot like what macOS and iOS have been doing with a read-only base, and self-contained applications within that system.