
Cleaning up after BPF exceptions


By Daroc Alden
April 15, 2024

Kumar Kartikeya Dwivedi has been working to add support for exceptions to BPF since mid-2023. In July, Dwivedi posted the first patch set in this effort, which adds support for basic stack unwinding. In February 2024, he posted the second patch set aimed at letting the kernel release resources held by the BPF program when an exception occurs. This makes exceptions usable in many more contexts.

BPF exceptions are somewhat dissimilar to exceptions in other languages. For one thing, they cannot be caught — any call to bpf_throw() will result in the BPF program exiting. There is a callback that the user can register to set the return code of the BPF program in the event of an exception, but there is no way for the program to recover. In the same vein, there are not different types of exceptions — all BPF exceptions behave the same way. BPF exceptions are subtly different from exit() because they do still unwind the stack.

Currently, unwinding the stack doesn't make much difference. The BPF verifier prevents programs that hold resources (data structures, such as sockets, that must be disposed of in a specific way) from raising an exception by calling bpf_throw(). Letting them do so would be a problem because the current kernel is not prepared to release those resources, which would be a violation of BPF's safety guarantees. Dwivedi's new patch set takes advantage of the fact that BPF exceptions still unwind the stack to release the resources held by each function as its stack frame is unwound. For now, only some types of resource are supported — notably not including spinlocks or read-copy-update (RCU) structures — but future work can add additional types of resource over time.

Or that's the theory, at least. As it stands, the BPF verifier does not always prevent programs from throwing exceptions with resources held. The first patch of Dwivedi's new set notes:

However, there currently exists a loophole in this restriction due to the way the verification procedure is structured. The verifier will first walk over the main subprog's instructions, but not descend into subprog calls to ones with global linkage. These global subprogs will then be independently verified instead. Therefore, in a situation where a global subprog ends up throwing an exception (either directly by calling bpf_throw, or indirectly by way of calling another subprog that does so), the verifier will fail to notice this fact and may permit throwing BPF exceptions with non-zero acquired references.

That patch fixes the issue by making the verifier determine early on which functions can throw exceptions, so that information is available when performing the main analysis. The remaining patches in the set add a new pass to the verifier to let it collect the information needed to actually release any resources the program holds.

In order to free resources the BPF program holds, the kernel needs to know two things: where they are, and what kind of resource they are. The new verifier pass walks the program and generates a map for each location through which an exception could cause the stack to unwind. Each one records which locations on the stack could hold releasable resources at that point in time. These maps also store the type of resource, which the verifier already has to track in order to ensure that resources are properly released in the course of normal execution.

Not all relevant resources live on the stack, however. BPF has callee-saved registers (R6-R9) which might contain resources when bpf_throw() is called. To handle this, the patch set inserts a new hidden wrapper around bpf_throw() that explicitly spills those registers to the stack.

At run time, when an exception is thrown, the kernel looks up the relevant stack map using the current instruction pointer. For each releasable resource, the kernel calls the release function associated with its type. Then the stack frame is unwound, and the kernel repeats the process with the stack map of the calling function. When unwinding the program, the kernel also needs to have the location of callee-saved registers recorded in the map, so that it can restore them to the correct location. This keeps the state of the program being unwound consistent, so that stack maps of earlier frames remain correct.

One advantage of this approach is that subprograms that do not use exceptions don't incur any additional runtime overhead, because they do not need stack maps. In contrast, one complication is that it is perfectly legal for a BPF program to store different things in the same stack slot in different execution paths of the function, as long as the verifier can show that the types remain correct. A completely comprehensive approach to a stack map would therefore need to include some amount of runtime information about which execution path a function has taken.

Dwivedi's patch set does not go that far. Luckily, it turns out that most real-world BPF programs do not actually use stack slots in this way. Dwivedi's patches "merge" stack maps from divergent execution paths when they have compatible types, and return an error when they do not. He investigated and found that existing BPF programs do not run into this error, and that merges of conflicting types were "unlikely to often occur in practice".

There is one special case, however. It is somewhat common for a program to acquire a resource conditionally, which means that its stack slot might contain a null pointer. The new verifier pass handles this by merging other types with null pointers when necessary. In the end, it requires that all the execution paths of a function either store the same type of resource in the same slot, or leave a null pointer there. This allows the verifier to coalesce all of the maps for a given function, preventing a potential combinatorial explosion.

Eduard Zingerman raised some concerns with that approach, saying that he worried about the "possibility that frame merge might fail because of some e.g. clang version changes" that modify how the compiler chooses to lay out the stack. Zingerman suggested a run-time approach that tracks resources as they are acquired and released by the actual program instead, saying such an approach "seems simpler than frame descriptors tracking and merging", and that it could support aborting BPF programs at any point, instead of only at calls to bpf_throw(). The downside would be run-time overhead, even for programs that never actually throw an exception.

Dwivedi responded: "I went over this option early on but rejected it due to the typical downsides you observed". He went on to explain the overhead such a runtime approach would require in detail, concluding by saying: "I just think this is all unnecessary especially when most of the time the exception is not going to be thrown anyway, so it's all cost incurred for a case that will practically never happen in a correct program."

David Vernet also reviewed the patch set, pointing out that the new pass looks fairly similar to the existing check_max_stack_depth_subprog() code, and asking whether they could be combined. Dwivedi agreed that this would be a good idea. He plans to incorporate that work (and a related refactoring of the stack-depth-checking code) into version two of the patch set.


Index entries for this article
Kernel: BPF



Cleaning up after BPF exceptions

Posted Apr 16, 2024 4:58 UTC (Tue) by rywang014 (subscriber, #167182) [Link]

I am not familiar with BPF - sounds like this exception is exactly what `bpf_exit` should do. If a BPF program is still holding resources, does the verifier allow it to call `bpf_exit`? If so, who is in charge of releasing the resources?

Cleaning up after BPF exceptions

Posted Apr 16, 2024 7:42 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]

BPF throw might be executed by a helper, so the verifier will have no insight into it.

Cleaning up after BPF exceptions

Posted Apr 16, 2024 10:41 UTC (Tue) by roc (subscriber, #30627) [Link]

> Luckily, it turns out that most real-world BPF programs do not actually use stack slots in this way. Dwivedi's patches "merge" stack maps from divergent execution paths when they have compatible types, and return an error when they do not. He investigated and found that existing BPF programs do not run into this error, and that merges of conflicting types were "unlikely to often occur in practice".

I don't understand how the BPF ecosystem works. Normally when you have a virtual or physical machine being targeted by a compiler, there's a specification for what programs that machine will accept, and compiler writers use that to ensure they only generate valid programs. Changes to that contract are made carefully, and only when absolutely necessary do you start rejecting programs previously deemed valid. However it appears that BPF doesn't have such a specification, and that backward compatibility can be ignored as long as a few programs tested by one kernel developer compiled with one specific version of one compiler continue to work.

So how does this work in practice? Is there a larger suite of BPF bytecode programs that are checked in CI or before a kernel release, or some other way to detect breakage of existing compiled programs? And how does someone writing a compiler to BPF bytecode find out what is considered valid and guaranteed to be supported in the future?

Cleaning up after BPF exceptions

Posted Apr 16, 2024 15:09 UTC (Tue) by foom (subscriber, #14868) [Link]

> And how does someone writing a compiler to BPF bytecode find out what is considered valid

There's no way to find out. When the compiler is modified, it will cause programs (which previously worked) to fail verifier checks. Then, either the verifier will be improved for new kernel versions to handle that new output, or unprincipled hacks will be made to the compiler's bpf target to avoid certain optimizations in hopes that it'll allow said program to pass the verifier.

It is a disaster.

Cleaning up after BPF exceptions

Posted Apr 16, 2024 19:44 UTC (Tue) by kkdwivedi (guest, #130744) [Link]

Your criticism of this approach is fair. It's not an ideal situation. The above is one of the reasons why I took a step back and tried to rethink the merging logic for paths with different contents in the same stack slot.

The reason this is unlikely (I don't want to use impossible, that's too strong a word and it is certainly possible to concoct such a scenario) to occur in practice is because a valid program in the path of execution which does not throw, would need to release the different kernel resources at the same stack slot. In such a case, there would be a path condition known to the verifier such that it can see the right release function being called for that kernel resource. While such a program can be constructed, it is unlikely the compiler emits such code. I tried making different kernel helpers cancellation points (which amount to throwing an exception and generating frame descriptors) in the selftest suite (which is a reasonable representative set of BPF programs used in production, and tested in CI), and didn't encounter this.

However, if you look at the final reply to Eduard, we eventually decided to resolve the merge conflicts on the same stack slot, instead of rejecting the program, by spilling every unique acquisition of a conflicting resource (tied to a single program counter) to a unique stack slot in a stack frame. All such stack slots would need to be zeroed on entry, but it would mean that when such a situation does occur in the future, the resource would be saved to a unique location and merge conflicts in the same stack slot would be resolved by instead depending on the unique stack slot's value, ignoring other copies of the same resource elsewhere in the stack.

This is a good middle path between the two extremes (actively keeping track of acquired resources on acquisition and release, which has a non-zero cost linear in the number of resource acquisitions and releases on a given path, and keeping track of resources at any program point statically, placing some constraints on program behaviour), and would work for cases where we cannot ensure resource cleanup by just depending on the verifier's knowledge of the state of the stack and registers at a given program point. It mitigates conflicts in the same stack slot for the uncommon cases, while not requiring every resource to be actively accounted for at runtime in the common case.

I am planning to pursue this option in the next revision of this patch set, alleviating these concerns.

Cleaning up after BPF exceptions

Posted Apr 16, 2024 10:56 UTC (Tue) by roc (subscriber, #30627) [Link]

It's interesting that BPF continues to evolve away from static cleanup checking to dynamic cleanup-on-unwind. I expect BPF will converge on a model where any BPF function can throw at specified cancellation points, where every loop and function call includes at least one cancellation point, and unwinding due to throw can release any/all resources. At that point, there's no need for BPF programs to include code to specifically release those resources, so some kind of scope mechanism will be introduced to trigger cleanup automatically on non-throwing paths (e.g. function returns). Static checking of termination will be completely superseded by dynamic counting of instructions or blocks triggering an interrupt at the next cancellation point. Unsurprisingly it will look more and more like a conventional virtual machine with RAII.

Cleaning up after BPF exceptions

Posted Apr 17, 2024 3:59 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

I called it 5 years ago! :) The trajectory of BPF is utterly predictable, they initially hobbled themselves with a made-up restriction (static runtime verification) that perverted the entire BPF ecosystem architecture, so they had to NIH tons of stuff.

But it turned out that nobody actually cares too much about static runtime verification, so it gradually weakened. First by allowing BPF programs to run for millions of instructions instead of several thousand, then by introducing "helper" iterators that already allow BPF programs to run for multiple seconds. Now they are thinking about using runtime metering on back branches instead of static verification.

BPF will eventually be a hacky subset of WASM, fully NIH-ed. It will also experience all the typical issues with homegrown bytecode interpreters/JITs: sandbox escapes, DoS due to memory blowups, DoS due to CPU starvation, etc.

Cleaning up after BPF exceptions

Posted Apr 17, 2024 13:34 UTC (Wed) by auc (subscriber, #45914) [Link]

Will it grow enough to fulfil Greenspun's tenth rule? :)


Copyright © 2024, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds