
LWN.net Weekly Edition for April 18, 2024

Welcome to the LWN.net Weekly Edition for April 18, 2024

This edition contains the following feature content:

  • Fedora 40 firms up for release: a look at the desktop and system-wide changes coming in the next Fedora release.
  • Identifying dependencies used via dlopen(): a proposal to make dlopen()ed library dependencies visible to tools again.
  • Completing the EEVDF scheduler: lag handling and time-slice control for the kernel's new CPU scheduler.
  • A tale of two troublesome drivers: the debates over the mlx5ctl and fbnic drivers.
  • Cleaning up after BPF exceptions: releasing resources when a BPF program throws an exception.
  • Managing to-do lists on the command line with Taskwarrior: a flexible, extensible task manager.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Fedora 40 firms up for release

By Joe Brockmeier
April 16, 2024

Fedora 40 Beta was released on March 26, and the final release is nearing completion. So far, the release is coming together nicely with major updates for GNOME, KDE Plasma, and the usual cavalcade of smaller updates and enhancements. As part of the release, the project also scuttled Delta RPMs and OpenSSL 1.1.

A Fedora release is not really a single release. The project has five official editions: Workstation for desktop users, Server for "traditional" server use, CoreOS for running container-based applications, IoT aimed at edge-computing applications, and Cloud for (unsurprisingly) running Fedora as a virtual machine in cloud environments. The project also offers a wide selection of Spins targeting specific desktops and use cases. All of that is a bit much to cover in a single article. The focus here is desktop use and system-wide changes that impact most or all of the various Fedora editions and spins.

Desktop

Fedora Workstation is updated to GNOME 46, which we covered on its release in March. It is not a huge leap from GNOME 45, so Fedora Workstation users who are upgrading from Fedora 39 will find few surprises. Upgrading from Workstation 39 to 40, even prior to the beta release, went smoothly.

Performing a fresh install of Workstation also went well, and the Anaconda installation workflow will be familiar to anyone who's installed Fedora recently. That was not the original plan for this release, however. Fedora has been working on an Anaconda web-based installation interface based on Cockpit (which we looked at back in March) to replace the standard installer for Fedora Workstation. The web-based installer was originally scheduled to debut in Fedora 39, but it was delayed to this release and then postponed again to Fedora 41. The fly in the ointment is the partitioning workflow, which doesn't (at least yet) provide the same features as the traditional "expert" mode for partitioning. For example, the new tooling makes changes to partitions immediately—rather than batching them—which makes it more likely that a user could accidentally delete a partition they wish to keep. The discussion about the right approach to partitioning is still underway.

The KDE spin has been updated to KDE Plasma 6. LWN covered the Plasma 6 release in February. As expected, packages for KDE Plasma X11 are not installed by default with Fedora's KDE spin. The packages are available, however, for users who want or need X11 support. It is good to have a fallback to X11 if needed, but it's been unnecessary on my test systems. Plasma on Wayland has been stable on two laptops and one mini-PC.

I've rotated between GNOME and KDE for the past several weeks for daily use, to give (roughly) equal time to each, while using my normal set of applications. This includes Firefox, Emacs, Claws Mail, the Strawberry music player, GNOME Terminal or Konsole depending on which desktop is in use, and Distrobox to run packages from other distributions. My daily use also includes a lot of shuffling the primary laptop from the desk to the couch, connecting and disconnecting an external monitor, keyboard, and mouse via USB-C. Twice, over several weeks, the primary laptop failed to wake from suspend. That is not a perfect track record, but it's tolerable (especially for pre-release software), and the problem was neither persistent nor reproducible.

Since it was tax season in the US, it was also necessary to dust off the laser printer and print a few things. It was a pleasant surprise to find that the printer utility in both GNOME and KDE found my networked printer and set it up correctly with a few clicks. The surprise may speak more to how infrequently the printer is used than anything else, but they can be difficult devices to tame.

LWN is, blissfully if unusually, a largely meeting-free workplace. This meant few opportunities to test out video calls, which are an unavoidable part of daily life for many Linux users. I did log a few hours of video calls using Google Meet, and those were (again) uneventful. A few screen-sharing tests also went well, including sharing individual windows instead of the entire screen. However, it is worth mentioning that Fedora 40 did not detect the newest laptop's microphone. It did, however, happily detect and make use of a USB-connected Logitech webcam without any problems.

Even with the major KDE update, this Fedora update is more iterative than innovative. None of the changes have added major new functionality that I've noticed, nor had a dramatic impact on daily desktop use. That's not a bad thing, though, because none of the changes have removed any features or functionality that I'd miss.

Looking under the proverbial hood, one will also find a number of iterative changes that are worthy of note.

Progress on unified kernel images (UKI)

The six-month development cycle for Fedora means that some features have to be tackled over two or more releases. Such is the case with UKI, which combines the kernel image, initrd, signature, and other components into a single EFI executable. The usefulness of this, according to the phase one change proposal, is to make Fedora "more robust and secure". LWN covered the project's early UKI plans in January 2023. Support for UKI has been making steady progress, with the first phase, putting the pieces in place to work with UKIs for use in virtual machines, spanning the Fedora 38 and 39 releases. A UKI for virtual machines is available for Fedora 38 and 39, and the default size of the EFI system partition (ESP) was increased to a minimum of 500MB in part to accommodate UKIs.
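To get a feel for the format, here is a minimal sketch of assembling a UKI by hand with systemd's ukify tool (available in systemd 254 and later); the kernel version, paths, and command line are illustrative:

    ukify build \
        --linux=/boot/vmlinuz-6.8.5 \
        --initrd=/boot/initramfs-6.8.5.img \
        --cmdline="root=UUID=... ro quiet" \
        --output=linux-6.8.5.efi

The result is a single EFI binary that a boot loader (or the firmware itself) can launch directly, and that can be signed as one unit for Secure Boot.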

Part of the phase two plan required changes in Koji (Fedora's build system) to support KIWI in order to build cloud images with a unified kernel. That, too, has been completed and the project is now building Fedora Cloud base images with the unified kernel. It's unclear whether these images will be advertised on the Fedora Cloud download page for the final release. The images do not yet show up with other beta images for download. It should be noted that regular kernel images are not going away, and there are no plans to switch to UKI as the default.

Goodbye Delta RPMs

Delta RPMs have gone away, however. This has been in the works for some time, but the plug was finally pulled in Koji in November of last year: Delta RPMs are no longer being produced for Fedora, and DNF/DNF5 have had support turned off in their default configurations.

The discussion of the change was largely in favor of removing support. Jonathan Dieter said: "As the one who did most of the original work of getting deltarpms into Fedora, I wholeheartedly support this change. I'm sorry to see them go, but it's time." Some users were less pleased. Flo pointed out that there are still many Fedora users on slow and metered connections who would benefit from the promise (if not reality) of Delta RPMs. "It would be more inclusive to fix delta rpm and enable bandwidth savings for those who need it." Fedora Project Leader Matthew Miller suggested that alternative spins of Fedora like Silverblue that use rpm-ostree images with delta updates may be a better option.

Miscellaneous upgrades and changes

The Podman container-management tool has gotten a major version update to 5.0.1 (from 4.9.4 currently in Fedora 39). The 5.0 series has a laundry list of minor feature updates and a few breaking changes as well. Some of the new Podman features are meant to achieve better compatibility with Docker. For example, the --gpus option for podman create and podman run, which gives containers access to the host's GPUs, now supports NVIDIA GPUs in addition to other GPUs. Podman had support for NVIDIA GPU pass-through previously, but it required a different option (--device) that was incompatible with applications expecting Docker-compatible flags.
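As a sketch of what the Docker-compatible syntax looks like, assuming the NVIDIA container toolkit is configured on the host and using an illustrative image name:

    podman run --rm --gpus all docker.io/nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi

Scripts and tools written around docker run --gpus should now work without translation.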

Container networking interface (CNI) support had been deprecated prior to Podman 5 and is no longer included in builds. Podman has picked up a new tool called pasta that provides user-mode networking without requiring additional capabilities or privileges.

The default configuration for NetworkManager has changed slightly for this release to provide individual, stable MAC addresses for WiFi connections. The idea is to make it harder to identify and track users by their MAC address. When a system connects to a new WiFi network, it will generate a random MAC address for that specific network. For example, if a user connects to a WiFi network at their office, they'll have a MAC address for that network. But if they take the laptop down the street to a coffee shop, NetworkManager will generate a new MAC address for that network. Upon returning to work, NetworkManager will use the same MAC address it generated before.
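The behavior is controlled by NetworkManager's wifi.cloned-mac-address property. A minimal sketch of a drop-in configuration file, assuming the per-network "stable-ssid" mode that this change is built on (the file path is illustrative):

    # /etc/NetworkManager/conf.d/99-wifi-mac.conf
    [connection]
    # Generate a MAC address that is stable for a given WiFi network
    # (SSID) but differs from network to network.
    wifi.cloned-mac-address=stable-ssid

Setting the property to "permanent" instead restores the use of the hardware address.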

IP-address conflicts are not uncommon on home and work networks, at least in my experience. When these conflicts happen, the symptoms can be puzzling and make it hard to assess the actual cause. Much better, then, to prevent conflicts in the first place. That's the theory behind enabling IP-address-conflict detection by default. This may make initial connections to networks slower, while the host checks for duplicate addresses, but it should prevent two systems from fighting over the same IP address. The feature works with both DHCP and static addresses.

GNU Wget has been upgraded to Wget2, which will replace the old Wget on upgraded systems. Wget2 adds support for multithreaded downloads, HTTP/2, Atom and RSS feeds, and more. The change proposal notes that this is "mostly transparent to users" as a drop-in replacement. "Mostly" here means that Wget2 has a number of user-facing changes in the form of changed CLI options and the removal of FTP and web archive (WARC) support.

A somewhat stealthy removal of the network-scripts portion of the initscripts package landed in Fedora 40 in February. Adam Williamson credited Michel Lind for spotting the removal and argued that the change had not been properly communicated, if it had been communicated at all. While these are legacy networking scripts that have mostly been replaced by NetworkManager, Williamson pointed out that he still uses these scripts to work with Open vSwitch and to pre-create TAP devices; both features that NetworkManager is lacking. Miller pointed out several proposed changes that relate to the package, but did acknowledge that prior change proposals did not specifically say this package was going away:

The words "the network-scripts subpackage is deprecated and will be removed" do not seem to have appeared in the release notes, and in retrospect that probably should have been stronger.

Williamson has asked the Fedora Engineering Steering Council (FESCo) to require that the package be reinstated. At the April 15 meeting, FESCo voted to reinstate the package on release.

In contrast, the planned GNU toolchain update has landed as scheduled. This updates Fedora to GCC 14.0.1, GNU C Library (glibc) 2.39, Binutils 2.41, and the GNU Debugger (gdb) 14.1.

Fedora moved to OpenSSL 3.0 with Fedora 36, and deprecated OpenSSL 1.1 in 37, but the package remained for third-party applications that still depended on the older version. The 1.1 release went end-of-life (EOL) in September 2023, but it was still packaged for Fedora 39. It is finally removed in 40 and Rawhide.

Python 3.7 reached EOL in September 2023, but remained in Fedora 39 for developers who were targeting other distributions such as Debian 10 that still included 3.7. With Debian 10 going EOL in June, 3.7 was retired in Fedora 40. The default Python is now 3.12. Users can also find packages for versions 3.8 through 3.11 as well as a pre-release 3.13 package. Python 3.6 remains available for developers who need compatibility with Red Hat Enterprise Linux (RHEL) 8.

Release date

The original ship date for the final release was set for April 16, but several blocker bugs have gotten in the way. Aoife Moloney announced that the new target date is April 23. The next Go/No-Go meeting will be on April 18 to determine whether that holds or slips further. Assuming that the final release is as stable as the beta, it's a safe if not mandatory upgrade for Fedora 39 users and a good time for new users to dip their toes in the Fedora waters.

Comments (30 posted)

Identifying dependencies used via dlopen()

By Daroc Alden
April 16, 2024

The recent XZ backdoor has sparked a lot of discussion about how the open-source community links and packages software. One possible security improvement being discussed is changing how projects like systemd link to dynamic libraries that are only used for optional functionality: using dlopen() to load those libraries only when required. This could shrink the attack surface exposed by dependencies, but the approach is not without downsides — most prominently, it makes discovering which dynamic libraries a program depends on harder. On April 11, Lennart Poettering proposed one way to eliminate that problem in a systemd RFC on GitHub.

The systemd project had actually already been moving away from directly linking optional dependencies — but not for security reasons. In Poettering's explanation of his proposal on Mastodon he noted: "The primary reason for [using dlopen()] was to make it easier to build small disk images without optional components, in particular for the purpose of initrds or container deployments." Some people have speculated that this change is what pushed "Jia Tan" to launch their attack at the beginning of April, instead of waiting until it was more robust.

There are several problems with using dlopen() for dependencies, however. One is that, unlike normal dynamic linking, using dlopen() exposes the functions provided by the dependency as void pointers, which must be cast to the correct type. If the type in the dependency does not match the type in the dependent program, this can open a potential avenue for type-confusion attacks. Several respondents to Poettering's explanation on Mastodon worried that promoting the use of dlopen() would be a detriment to security for this reason. James Henstridge said: "I imagine you could hide some interesting bugs via not-quite-compatible function signatures (e.g. cause an argument to be truncated at 32 bits)." Poettering replied:

In current systemd git we systematically use some typeof() macro magic that ensures we always cast the func ptrs returned by dlopen() to the actual prototype listed in the headers of the library in question. Thus we should get the exact same type safety guarantees as you'd get when doing regular dynamic lib linking. Took us a bit to come up with the idea that typeof() can be used for this, but it's amazing, as we don't have to repeat other libraries' prototypes in our code at all anymore.
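The trick is easy to sketch in a few lines. The following is not systemd's actual macro (which handles many symbols at once); it is a minimal illustration, using zlib's compressBound() as a stand-in for a function from an optional dependency:

    #include <dlfcn.h>
    #include <stdio.h>
    #include <zlib.h>   /* declares compressBound() with its real prototype */

    /* Cast dlsym()'s void * to the exact type of the symbol as declared
     * in the library's own header. typeof() is a GNU C extension (and
     * standard in C23), so the prototype never has to be repeated. */
    #define DLSYM_TYPED(handle, symbol) \
            ((typeof(symbol) *) dlsym((handle), #symbol))

    int main(void)
    {
            void *handle = dlopen("libz.so.1", RTLD_NOW);
            if (handle == NULL)
                    return 1;

            typeof(compressBound) *bound = DLSYM_TYPED(handle, compressBound);
            if (bound != NULL)
                    printf("worst-case compressed size: %lu\n", bound(1000));

            dlclose(handle);
            return 0;
    }

Because the pointer's type is derived from the header's declaration, assigning the result to a mismatched function pointer now draws a compiler diagnostic instead of silently misbehaving at run time.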

Henstridge agreed after looking at the code that it was "quite elegant. It also neatly solves the problem of assigning a symbol to the wrong function pointer." Not all of the problems are so easily dismissed, however. The real problem, according to Poettering's announcement, is the fact that using dlopen() removes information from the program's ELF headers about what its dependencies are.

Now, I think there are many pros of this approach, but there are cons too. I personally think the pros by far outweigh the cons, but the cons *do* exist. The most prominent one is that turning shared library dependencies into dlopen() dependencies somewhat hides them from the user's and tools view, as the aforementioned tools won't show them anymore. Tools that care about this information are package managers such as rpm/dpkg (which like to generate automatic package dependencies based on ELF dependencies), as well initrd generators such as dracut.

His proposed solution is to adopt a new convention for explicitly listing optional dependencies as part of the program itself. In the systemd RFC, he gave an example of a macro that could be used to embed the name of optional dependencies in a special section of the binary called ".note.uapi.dlopen". "UAPI" stands for Userspace API — referring to the Linux Userspace API Group, a relatively recent collaboration between distributions, package managers, and large software projects to define standardized interfaces for user-space software. The initial proposal for what to encode in the note section was fairly bare-bones — just a type field, the string "uapi" as the ELF note's "vendor" (owner) field, and the name of the dependency in question.
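ELF notes have a simple, fixed layout (name size, descriptor size, type, then the padded name and descriptor), so such a section can be emitted from plain C. The following is a hedged sketch only; the type value, structure layout, and dependency name are placeholders, not the format from the RFC:

    #include <stdint.h>

    /* One ELF note: 4-byte name size, descriptor size, and type, then
     * the name and descriptor themselves, each padded to a 4-byte
     * boundary. */
    struct dlopen_note {
            uint32_t namesz;
            uint32_t descsz;
            uint32_t type;
            char     name[8];      /* "uapi" + NUL, padded to 8 bytes */
            char     desc[12];     /* dependency name + NUL */
    };

    __attribute__((used, section(".note.uapi.dlopen"), aligned(4)))
    static const struct dlopen_note note = {
            .namesz = sizeof("uapi"),         /* 5 */
            .descsz = sizeof("libfoo.so.1"),  /* 12 */
            .type   = 1,                      /* placeholder type value */
            .name   = "uapi",
            .desc   = "libfoo.so.1",          /* hypothetical dependency */
    };

    int main(void) { return 0; }

Running "readelf -n" on the compiled binary should list the note alongside the usual GNU build-ID and ABI notes, which is exactly how package managers and initrd generators would consume it.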

Poettering was also clear that it wouldn't be useful to implement this for systemd on its own; the note would only be useful if other tooling decided to read it, and other projects choose to implement it. Mike Yuan was quick to comment positively about the possibility of adding support to mkinitcpio, Arch Linux's initramfs generation tool, and pacman, the Arch package manager.

Luca Boccassi agreed that he could "look into adding a debhelper addon for this", but wondered if there should be some way to indicate whether a dependency is truly optional or that the program will fail if the dependency is missing. Poettering responded: "If it is a hard dep, then it should not be a dlopen() one. The whole reason for using dlopen() over regular ELF shared library deps is after all that they can be weak", although he did point out that the type field means that "the door is open to extend this later."

Antonio Álvarez Feijoo raised another concern, pointing out: "Some people are very picky about the size of the initrd and don't like to include things that aren't really necessary. [...] So yes, it's great to know which libraries are necessary, but how to know what systemd component requires them?" Boccassi replied that this was an example of a situation where information on whether a dependency is required or recommended could be useful. Poettering disagreed, asserting that "which libraries to actually include in an initrd is up to local configuration or distro policy." Ultimately, consumers of the new note section can do whatever they would like with the information, including automatically generating dependencies, or merely using them as a "linter" to complain about new weak dependencies that are not already known.

I think all such approaches are better than the status quo though: we'll add a weak dep or turn a regular dep into a weak dep, and unless downstream actually read NEWS closely (which well, they aren't necessarily good at) they'll just rebuild their packages/initrd and now everything is hosed.

This appealed to Feijoo, who agreed that using the information as a sanity-check on package definitions made sense.

Carlos O'Donell asked whether Poettering cared about exposing the specific symbols and symbol versions that systemd uses, pointing out that existing ELF headers include this information. He asserted that RPM uses this information when packaging a program. Poettering said that was a good question, but replied:

To speak for the systemd usecase: even though we dlopen() quite a number of libraries these days (21 actually), and we actually do provide symbol versioning for our own libraries, we do not bother with listing symbol versions for the stuff we dlopen(). We use plain dlsym() for all of them, not dlvsym().

He went on to point out that requiring people to pin down symbol versions would be "a major additional ask".

Poettering did seem to think that there was some benefit to integrating this new proposal into the existing implementation of dynamic linking in the GNU C library (glibc). He asked O'Donell and Florian Weimer — who are both involved in the glibc project — "should we proceed with this being some independent spec somewhere that just says '.note.uapi.dlopen contains this data, use it or don't, bla bla bla'. Or did the concept of weak linking interest you enough that you think this should be done natively in glibc, binutils and so on?" Some other operating systems — notably macOS — have a native concept of "weak linking" for optional dependencies, so the idea of incorporating this information into the build system and standard library is not new.

Zbigniew Jędrzejewski-Szmek brought up an additional question about the formatting of the new section, asking whether it would make sense to use "a compact JSON representation". Jędrzejewski-Szmek said that this could make it easy to add a human-meaningful description of what the dependency is used for. With that addition, "it should be fairly easy to integrate this in the rpm build system." Boccassi agreed that the payload should be JSON. Poettering replied: "I have nothing against using JSON for this, but it's key we can reasonably generate this from a simple no-deps C macro I'd say."

Ultimately, the idea of having a standard encoding for optional dependencies seems to have been well-received, with several package managers potentially interested in adding support. With discussion still ongoing and the final format of the added information up in the air, however, it's too soon to say exactly what form the information will take. Anything intended to help ameliorate the pain of moving away from traditional dynamically linked dependencies seems like a good idea, though, since those dependencies widen the surface open to XZ-backdoor-like attacks.

Comments (63 posted)

Completing the EEVDF scheduler

By Jonathan Corbet
April 11, 2024
The Earliest Virtual Deadline First (EEVDF) scheduler was merged as an option for the 6.6 kernel. It represents a major change to how CPU scheduling is done on Linux systems, but the EEVDF front has been relatively quiet since then. Now, though, scheduler developer Peter Zijlstra has returned from a long absence to post a patch series intended to finish the EEVDF work. Beyond some fixes, this work includes a significant behavioral change and a new feature intended to help latency-sensitive tasks.

A quick EEVDF review

The EEVDF scheduler works to divide the available CPU time equally between all of the runnable tasks in the system (assuming all have the same priority). If four equal-priority tasks are contending for the CPU, each will be given a 25% share. Every task is assigned a virtual run time that represents its allocated share of the CPU; in the four-task example, the virtual run time can be thought of as a clock that runs at 25% of wall-clock speed. As tasks run, the kernel computes the difference between a task's virtual run time and its actual running time; the result is called "lag". A positive lag value means that a task is owed CPU time, while a negative value indicates that a task has received more than its share.

A task is deemed "eligible" if its lag value is zero or greater; whenever the CPU scheduler must pick a task to run, it chooses from the set of eligible tasks. For each of those tasks, a virtual deadline is computed by adding the time remaining in its time slice to the time it became eligible. The task with the earliest virtual deadline will be the one that is chosen to run. Since a longer time slice will lead to a later virtual deadline, tasks with shorter time slices (which are often latency sensitive) will tend to run first.

An example might help to visualize how lag works. Imagine three CPU-bound tasks, called A, B, and C, that all start at the same time. Before any of them runs, they will all have a lag of zero:

             A       B       C
    Lag:    0ms     0ms     0ms

Since none of the tasks have a negative lag, all are eligible. If the scheduler picks A to run first with a 30ms (to pick a number) time slice, and if A runs until the time slice is exhausted, the lag situation will look like this:

             A       B       C
    Lag:  -20ms    10ms    10ms

Over those 30ms, each task was entitled to 10ms (one-third of the total) of CPU time. A actually got 30ms, so it accumulated a lag of -20ms; the other two tasks, which got no CPU time at all, ended up with 10ms of lag, reflecting the 10ms of CPU time that they should have received.

Task A is no longer eligible, so the scheduler will have to pick one of the others next. If B is given (and uses) a 30ms time slice, the situation becomes:

             A       B       C
    Lag:  -10ms   -10ms    20ms

Once again, each task has earned 10ms of lag corresponding to the CPU time it was entitled to, and B burned 30ms by actually running. Now only C is eligible, so the scheduler's next decision is easy.

One property of the EEVDF scheduler that can be seen in the above tables is that the sum of all the lag values in the system is always zero.
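The bookkeeping in the example can be condensed into a toy model. This sketch (illustrative only, not kernel code) updates the lag values for three equal-weight, always-runnable tasks as each one consumes a time slice:

    #include <stdio.h>

    #define NTASKS 3

    /* Run task 'who' for 'slice' ms: every runnable task earns its
     * entitled share of the interval, and the runner pays for the CPU
     * time it actually consumed. */
    static void run(int lag[NTASKS], int who, int slice)
    {
            for (int i = 0; i < NTASKS; i++)
                    lag[i] += slice / NTASKS;   /* entitled: 10ms of 30ms */
            lag[who] -= slice;                  /* consumed: all 30ms */
    }

    int main(void)
    {
            int lag[NTASKS] = { 0, 0, 0 };      /* A, B, C */

            run(lag, 0, 30);    /* A runs: lag becomes -20, 10, 10 */
            run(lag, 1, 30);    /* B runs: lag becomes -10, -10, 20 */

            for (int i = 0; i < NTASKS; i++)
                    printf("%c: %4dms\n", 'A' + i, lag[i]);
            return 0;
    }

Each call to run() adds slice/NTASKS three times and subtracts slice once, a net change of zero, which is why the sum of all lag values stays pinned at zero.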

Lag and sleeping

The lag calculation is only relevant for tasks that are runnable; a task that sleeps for a day is not actually missing out on its virtual run time (since it has none), so it does not accumulate a huge lag value. The scheduler does, however, retain a task's current lag value when it goes to sleep, and starts from that value when the task wakes. So, if a task had run beyond its allocation before it sleeps, it will pay the penalty for that later, when it wakes.

There does come a point, though, where it may not make sense to preserve a task's lag. Should that task that just slept for a day really be penalized for having been allowed to run a bit over its allocation yesterday? It seems clear that, sooner or later, a task's lag should revert back to zero. But when that should happen is not entirely clear. As Zijlstra pointed out in this patch from the series, forgetting lag immediately on sleep would make it possible for tasks to game the system by sleeping briefly at the end of their time slice (when their lag is probably negative), with the result that they get more than their share of CPU time. Simply decaying the lag value over time will not work well either, he concluded, since lag is tied to virtual run time, which passes at a different (and varying) rate.

The solution is to decay a sleeping task's lag over virtual run time instead. The implementation of this idea in the patch set is somewhat interesting. When a task sleeps, it is normally removed from the run queue so that the scheduler need not consider it. With the new patch, instead, an ineligible process that goes to sleep will be left on the queue, but marked for "deferred dequeue". Since it is ineligible, it will not be chosen to execute, but its lag will increase according to the virtual run time that passes. Once the lag goes positive, the scheduler will notice the task and remove it from the run queue.

The result of this implementation is that a task that sleeps briefly will not be able to escape a negative lag value, but long-sleeping tasks will eventually have their lag debt forgiven. Interestingly, a positive lag value is, instead, retained indefinitely until the task runs again.

Time-slice control

As noted above, tasks with a shorter time slice will have an earlier virtual deadline, causing them to be selected sooner by the scheduler. But, in current kernels, that implicit priority only takes effect when the scheduler is looking for a new task to run. If a latency-sensitive task with a short time slice wakes up, it may still have to wait for the current task to exhaust its time slice (which might be long) before being able to run. Zijlstra's patch series changes that, though, by allowing one task to preempt another if its virtual deadline is earlier. This change provides more consistent timings for short-time-slice tasks, while perhaps slowing long-running tasks slightly.

That leaves one open question, though: how does one specify that a given task should be given a short time slice? In current kernels, there is no way for a non-realtime process to tell the kernel what its time slice should be, so this patch series adds that capability. Specifically, a task can use the sched_setattr() system call, passing the desired slice time (in nanoseconds) in the sched_runtime field of the sched_attr structure. This field, in current kernels, is only used for deadline scheduling. With this addition, any task can request shorter time slices, which will cause it to be run sooner and, possibly, more frequently. If, however, the requested time slice is too short, the task will find itself frequently preempted and will run slower overall.

The allowed range for time slices is 100µs to 100ms. For the curious, Zijlstra has illustrated the results of various time-slice choices as an impressive set of ASCII-art diagrams in the changelog for this patch.
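A minimal sketch of what a latency-sensitive task's request might look like, assuming the proposed patches are applied (current kernels may reject or ignore sched_runtime for normal tasks); glibc provides no wrapper for sched_setattr(), so the raw system call is used:

    #define _GNU_SOURCE
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* The kernel defines this structure; it is repeated here to keep
     * the example self-contained. */
    struct sched_attr {
            uint32_t size;
            uint32_t sched_policy;
            uint64_t sched_flags;
            int32_t  sched_nice;
            uint32_t sched_priority;
            uint64_t sched_runtime;     /* requested slice, nanoseconds */
            uint64_t sched_deadline;
            uint64_t sched_period;
    };

    int main(void)
    {
            struct sched_attr attr = {
                    .size          = sizeof(attr),
                    .sched_policy  = 0,         /* SCHED_OTHER */
                    .sched_runtime = 1000000,   /* 1ms, within 100µs-100ms */
            };

            /* Apply to the calling thread (pid 0), no flags. */
            if (syscall(SYS_sched_setattr, 0, &attr, 0) < 0) {
                    perror("sched_setattr");
                    return 1;
            }
            return 0;
    }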

These changes are in a relatively early state and seem likely to require revisions before they can be considered for merging. Among other things, the interaction with control groups has not yet been investigated and may well not work properly. But, once the details have been taken care of, the EEVDF scheduler should be getting to the point where it is ready for wider use.

Comments (18 posted)

A tale of two troublesome drivers

By Jonathan Corbet
April 12, 2024
The kernel project merges dozens of drivers with every development cycle, and almost every one of those drivers is entirely uncontroversial. Occasionally, though, a driver submission raises wider questions, leading to lengthy discussion and, perhaps, opposition. That is currently the case with two separate drivers, both with ties to the networking subsystem. One of them is hung up on questions of whether (and how) all device functionality should be made available to user space, while the other has run into turbulence because it drives a device that is unobtainable outside of a single company.

mlx5ctl and fwctl

The mlx5ctl driver is not a new problem; it was covered here in December 2023. In short: this driver implements a transport channel allowing user space to query and manipulate parameters on an mlx5 device (which provides a range of networking, RDMA, and InfiniBand functionality), crucially without any understanding of those parameters on the kernel's part. Proponents say that this driver is needed to provide users with the access needed to configure and debug their hardware, especially on locked-down systems where other methods of talking directly to the hardware are unavailable. Opponents see it as a way of circumventing the normal development process that governs how device parameters are exported to user space.

Saeed Mahameed posted a new version of the mlx5ctl patch series at the beginning of February, saying: "We continue to think that mlx5ctl is reasonable and aligned with the greater kernel community values". Christoph Hellwig responded with an ack and a complaint about the "subsystem maintainer overreach" that has blocked the merging of this driver. Networking maintainer Jakub Kicinski agreed that "overreach is unfortunate", but also maintained the position that this driver should not be merged: "We have a clear rule against opaque user space to FW [firmware] interfaces". Beyond that, there was not a lot of other discussion on the submission at that time.

At the beginning of March, though, Jason Gunthorpe posted a proposal for a new subsystem called "fwctl" that would be the home for drivers like mlx5ctl. Modern devices, he wrote, tend to come with a large set of tunable parameters controlling many aspects of their functionality; these parameters need to be made accessible by user space if users are to be able to use their hardware.

fwctl's purpose is to define a common set of limited rules, described below, that allow user space to securely construct and execute RPCs inside device FW. The rules serve as an agreement between the operating system and FW on how to correctly design the RPC interface. As a uAPI the subsystem provides a thin layer of discovery and a generic uAPI to deliver the RPCs and collect the response. It supports a system of user space libraries and tools which will use this interface to control the device.

The proposal goes into some detail on the types of functionality that will be made available via fwctl interfaces. It also covers the functionality that cannot be provided, including the ability to DMA to arbitrary memory, manipulate kernel memory or subsystems outside of the driver itself, or provide functionality, such as sending a network packet, that should be handled by another subsystem.

As before, the primary opposition (to both mlx5ctl and fwctl) came from Kicinski. He described the justification for this work as "smoke and mirrors", saying it was a way for manufacturers to "hide as much as possible of what you consider your proprietary advantage in the 'AI gold rush'". Complex hardware, he said, does not need a backdoor to talk to the firmware without the kernel's mediation; he cited the network interfaces used at Meta (his employer) as an example. He questioned whether the restrictions on fwctl drivers would be enforced, and said that the conversation did not appear to be going anywhere useful:

Or should we go for another loop of me talking about openness and building common abstractions, and vendors saying how their way of doing basic configuration is so very special, and this is just for debug and security and because others.

There's absolutely no willingness to try and build a common interface here.

Kicinski has repeatedly said that this functionality should be provided via an API like devlink, where parameters are exposed after a community review that is, among other things, intended to force consistency between hardware from different manufacturers. He complained that his offer to quickly review proposed devlink knobs had been ignored by the vendors looking for interfaces like fwctl.

On the other side, David Ahern asserted that fwctl is the common interface that Kicinski is looking for. Gunthorpe said that all complex devices require hardware-specific tooling to configure them to the customer's needs. The only reason Meta does not need such tools is that, as a large customer, it is able to receive its hardware preconfigured from the vendor; smaller customers do not receive that level of service. Vendors have been providing these tools for years, he said; fwctl is just a way to provide a common interface for them.

The problem with the devlink approach, Gunthorpe added, is that, beyond the slow and painful nature of the process, it is guaranteed to fail. To be useful, an interface must be able to work with all of the parameters provided by the device:

As far as configuration/provisioning goes, it is really all or nothing.

If a specific site can configure only 90% of the stuff required because you will NAK the missing 10% then it is still not usable and is a wasted effort for everyone.

You have never shown that there is a path to 100% with your approach to devlink. In fact I believe you've said flat out that 100% is not achievable.

Kicinski was not receptive to this argument, though, calling many of the knobs "hacks and lazy workarounds".

As of this writing, this discussion does not appear to be any closer to a resolution than it was in December. The positions taken have only hardened over time. In the end, the fate of this driver (or for a future fwctl subsystem) may well depend on whether Linus Torvalds is willing to allow a networking maintainer to block the merging of a driver that is, by most accounts, independent of the networking subsystem.

A network interface for one

At the beginning of April, Alexander Duyck posted a driver called "fbnic" for a custom network interface card that is used only within Meta. That prompted an immediate question from Jiri Pirko, who wondered why the community needs a driver for a device that nobody is able to acquire. Duyck responded that upstreaming the driver would make maintenance easier, that it would make it easier to introduce new networking features implemented in the driver, and that the company might someday open some of the hardware information as well. Pirko was unimpressed and said that the driver should not be merged.

Duyck called this reasoning "arbitrary and capricious". The driver will have a lot of users at Meta, he said. There have been other proprietary devices added to the kernel in the past; the Intel IDPF driver was mentioned as an example elsewhere in the conversation. Drivers also often show up for devices that are not yet for sale, and may never make it to the market. Rejecting the driver, he said, would amount to accusing Meta of "some lack of good faith".

Kicinski tried to redirect the discussion somewhat, saying that he did not want to be in the position of judging the "good faith" of companies. The community, he said, had to make its decision based on the interest of the project and the broader user base. He did not say, then, whether he thought the driver should be merged or not. Others, though, such as John Fastabend and Paolo Abeni, argued that fbnic appeared to be good code, and that in any case it is only a network-interface driver with no potential to harm the rest of the kernel, so there is no reason to keep it out.

Gunthorpe, while not arguing against the merging of fbnic, raised some concerns. There is a strong feeling that code should not be merged solely for the purpose of supporting proprietary user-space code, he said, and "this submission is clearly blurring that line". That could, he said, lead to problems in the future as more features are added to the driver.

There was a brief turn in the conversation when Andrew Lunn referred to the mlx5ctl discussion and asked Duyck to show that a separate firmware-tuning driver would not be required for this device. Kicinski said that showing that would not change anybody's mind. Ahern suggested that, in the future, when "the inevitable production problems" show up, a separate, mlx5ctl-like driver may well become necessary.

Perhaps the biggest concern, though, was expressed by Kicinski: what happens if changes elsewhere in the kernel break the driver, creating a regression for its users? Since the community as a whole cannot test the driver, such breaks could be hard to avoid and even harder to fix; that could lead to kernel changes being reverted. In such a situation, a private driver like fbnic could impede kernel development in general.

For that reason, though Kicinski eventually concluded that "there's broad support for merging the driver", he also said that there needs to be a slightly different set of rules governing drivers for private devices. These would include "weaker 'no regression' guarantees" and an expectation that the driver maintainers will participate actively in efforts to refactor subsystem interfaces. In the absence of such participation, a driver for a private device could be removed from the kernel. Pirko eventually agreed that, if the driver were to be marked as belonging to this new regime (which would have to be documented), it "would be ok to let this in".

So the fbnic driver seems likely to be merged in the end. The same may eventually be true of mlx5ctl in some form as well. The Linux kernel did not get to the position it is in by refusing to let users access the full capabilities of their hardware, and it seems unlikely to adopt such a policy now. A more difficult prospect, though, is to guess how many more lengthy discussions will be required to reach that decision.

Comments (25 posted)

Cleaning up after BPF exceptions

By Daroc Alden
April 15, 2024

Kumar Kartikeya Dwivedi has been working to add support for exceptions to BPF since mid-2023. In July, Dwivedi posted the first patch set in this effort, which adds support for basic stack unwinding. In February 2024, he posted the second patch set aimed at letting the kernel release resources held by the BPF program when an exception occurs. This makes exceptions usable in many more contexts.

BPF exceptions are somewhat dissimilar to exceptions in other languages. For one thing, they cannot be caught — any call to bpf_throw() will result in the BPF program exiting. There is a callback that the user can register to set the return code of the BPF program in the event of an exception, but there is no way for the program to recover. In the same vein, there are not different types of exceptions — all BPF exceptions behave the same way. BPF exceptions are subtly different from exit() because they do still unwind the stack.
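As a sketch of what this looks like from the program's side, here is an illustrative libbpf-style fragment (not from the patch set; the length check is an arbitrary stand-in for an invariant):

    #include <vmlinux.h>
    #include <bpf/bpf_helpers.h>

    /* bpf_throw() is a kernel function (kfunc), so it is declared
     * rather than defined here. */
    extern void bpf_throw(u64 cookie) __ksym;

    SEC("tc")
    int check_packet(struct __sk_buff *skb)
    {
            /* If this "impossible" condition ever occurs, abort the
             * program. The throw cannot be caught; the stack is unwound
             * and the program exits with the default (or callback-
             * provided) return code. */
            if (skb->len < 14)
                    bpf_throw(0);

            return 0;
    }

    char LICENSE[] SEC("license") = "GPL";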

Currently, unwinding the stack doesn't make much difference. The BPF verifier prevents programs that hold resources (data structures, such as sockets, that must be disposed of in a specific way) from raising an exception by calling bpf_throw(). Letting them do so would be a problem because the current kernel is not prepared to release those resources, which would be a violation of BPF's safety guarantees. Dwivedi's new patch set takes advantage of the fact that BPF exceptions still unwind the stack to release the resources held by each function as its stack frame is unwound. For now, only some types of resource are supported — notably not including spinlocks or read-copy-update (RCU) structures — but future work can add additional types of resource over time.

Or that's the theory, at least. As it stands, the BPF verifier does not always prevent programs from throwing exceptions with resources held. The first patch of Dwivedi's new set notes:

However, there currently exists a loophole in this restriction due to the way the verification procedure is structured. The verifier will first walk over the main subprog's instructions, but not descend into subprog calls to ones with global linkage. These global subprogs will then be independently verified instead. Therefore, in a situation where a global subprog ends up throwing an exception (either directly by calling bpf_throw, or indirectly by way of calling another subprog that does so), the verifier will fail to notice this fact and may permit throwing BPF exceptions with non-zero acquired references.

That patch fixes the issue by making the verifier determine early on which functions can throw exceptions, so that information is available when performing the main analysis. The remaining patches in the set add a new pass to the verifier to let it collect the information needed to actually release any resources the program holds.

In order to free resources the BPF program holds, the kernel needs to know two things: where they are, and what kind of resource they are. The new verifier pass walks the program and generates a map for each location through which an exception could cause the stack to unwind. Each one records which locations on the stack could hold releasable resources at that point in time. These maps also store the type of resource, which the verifier already has to track in order to ensure that resources are properly released in the course of normal execution.

Not all relevant resources live on the stack, however. BPF has callee-saved registers (R6-R9) which might contain resources when bpf_throw() is called. To handle this, the patch set inserts a new hidden wrapper around bpf_throw() that explicitly spills those registers to the stack.

At run time, when an exception is thrown, the kernel looks up the relevant stack map using the current instruction pointer. For each releasable resource, the kernel calls the release function associated with its type. Then the stack frame is unwound, and the kernel repeats the process with the stack map of the calling function. When unwinding the program, the kernel also needs to have the location of callee-saved registers recorded in the map, so that it can restore them to the correct location. This keeps the state of the program being unwound consistent, so that stack maps of earlier frames remain correct.

One advantage of this approach is that subprograms that do not use exceptions don't incur any additional runtime overhead, because they do not need stack maps. In contrast, one complication is that it is perfectly legal for a BPF program to store different things in the same stack slot in different execution paths of the function, as long as the verifier can show that the types remain correct. A completely comprehensive approach to a stack map would therefore need to include some amount of runtime information about which execution path a function has taken.

Dwivedi's patch set does not go that far. Luckily, it turns out that most real-world BPF programs do not actually use stack slots in this way. Dwivedi's patches "merge" stack maps from divergent execution paths when they have compatible types, and return an error when they do not. He investigated and found that existing BPF programs do not run into this error, and that merges of conflicting types were "unlikely to often occur in practice".

There is one special case, however. It is somewhat common for a program to acquire a resource conditionally, which means that its stack slot might contain a null pointer. The new verifier pass handles this by merging other types with null pointers when necessary. In the end, it requires that all the execution paths of a function either store the same type of resource in the same slot, or leave a null pointer there. This allows the verifier to coalesce all of the maps for a given function, preventing a potential combinatorial explosion.

Eduard Zingerman raised some concerns with that approach, saying that he worried about the "possibility that frame merge might fail because of some e.g. clang version changes" that modify how the compiler chooses to lay out the stack. Zingerman suggested a run-time approach that tracks resources as they are acquired and released by the actual program instead, saying such an approach "seems simpler than frame descriptors tracking and merging", and that it could support aborting BPF programs at any point, instead of only at calls to bpf_throw(). The downside would be run-time overhead, even for programs that never actually throw an exception.

Dwivedi responded: "I went over this option early on but rejected it due to the typical downsides you observed". He went on to explain the overhead such a runtime approach would require in detail, concluding by saying: "I just think this is all unnecessary especially when most of the time the exception is not going to be thrown anyway, so it's all cost incurred for a case that will practically never happen in a correct program."

David Vernet also reviewed the patch set, pointing out that the new pass looks fairly similar to the existing check_max_stack_depth_subprog() code, and asking whether they could be combined. Dwivedi agreed that this would be a good idea. He plans to incorporate that work (and a related refactoring of the stack-depth-checking code) into version two of the patch set.

Comments (8 posted)

Managing to-do lists on the command line with Taskwarrior

April 17, 2024

This article was contributed by Koen Vervloesem

Managing to-do lists is something of a universal necessity. While some people handle them mentally or on paper, others resort to a web-based tool or a mobile application. For those preferring the command line, the MIT-licensed Taskwarrior offers a flexible solution with a healthy community and lots of extensions.

Getting started with Taskwarrior is straightforward, but it also supports sophisticated functionality, including projects, due dates, dependencies, user-defined metadata, and hook scripts. The program's philosophy describes values such as openness, low friction, no performance penalty for unused features, extensibility, and a focus on doing one thing well. Taskwarrior does not dictate a specific methodology for users to manage their to-do list. It provides advanced functionality enabling users to integrate the program into their existing workflows. The documentation lists some workflow examples, some of them including elements from Getting Things Done and Kanban methodologies.

Taskwarrior has been in development since 2006. On March 24, the project released Taskwarrior 3.0. Most Linux distributions are still packaging an older version, but downloading and building Taskwarrior 3.0 is a simple process if a reasonably recent version of the Rust toolchain (1.70 or later) is installed. I did stumble upon a build error; however, this was a known issue that was solved by removing a line in the project's Cargo.toml file.

Simple

In its simplest form, using Taskwarrior only requires knowledge of three subcommands: "task add <description>" to add a task to the to-do list, "task <ID> done" to mark a task as done, and "task list" to get a list of all tasks. After adding a task, Taskwarrior shows its ID, which can be used to refer to the task in other commands, for instance to mark it as done.
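A complete first session, then, might look like this (output abridged and approximate):

    $ task add Write trip report
    Created task 1.
    $ task list

    ID Age Description
     1 5s  Write trip report

    1 task
    $ task 1 done
    Completed task 1 'Write trip report'.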

One potential hiccup is that this ID represents the task's index in the list of all pending tasks. After marking a task as done, the IDs of all subsequent tasks decrement by one. This change, however, only takes effect after a command displaying IDs is executed. Therefore, after listing the tasks and looking up their IDs, the IDs can be safely used in commands to mark various tasks as done. But as soon as a new "task list" command is executed, the IDs will be changed to consecutive numbers again. In practice, this is a non-issue for me. As the shell history shows the previously entered Taskwarrior commands with the IDs of the tasks, I rarely need to explicitly ask for the IDs with a "task list" command.

Taskwarrior has a built-in set of sensible defaults, which can be overridden individually in a configuration file. On its first use, the "task" command creates a minimal configuration file where the user can add configuration options. The "task show" command reveals all defaults and overrides. For a complete description of all supported configuration options, the configuration file's man page can be consulted with "man taskrc". The configuration file can be manually edited, or modified using the "task config" command.

Advanced

Taskwarrior implements many commands using a flexible command-line syntax, which gives access to its advanced functionality. The general form of its syntax is:

    task [<filter>] [<command>] [<modifications> | <miscellaneous>]

A command like "task 12 done" uses 12 as a filter to exclusively mark the task with ID 12 as done. However, a filter can also be a tag like "+work" or an attribute value like "status:pending". Filters can be used to restrict the tasks in the list output. For example, "task project:Book list" shows only the tasks assigned to the project "Book".

Tags are simply single-word alphanumeric labels, and a task may have any number of them. Additionally, a task can be linked to a project, which is a way to group tasks; each task can only be linked to a single project, though each project can have multiple subprojects. Tags and projects assist in filtering tasks conveniently.

Modifications are often used with the "modify" command, which allows users to change tasks after adding them. For instance, this command adds the task with ID 12 to the project Book:

    task 12 modify project:Book

In a similar way, a due date can be assigned to a task:

    task 12 modify due:tomorrow

Furthermore, a task can have a scheduled date (representing the earliest opportunity to work on a task), a wait date (which keeps the task hidden until the specified date), and an until date (the date after which the task is automatically deleted). All these types of dates are optional.

Some commands accept neither a filter nor modifications, but do accept miscellaneous arguments. An example of this is the "show" command, which displays the values of configuration options, such as in "task show verbose".

Recurring tasks are also supported. The following command adds a new task on the first day of every month until the end of March 2026:

    task add Pay the rent due:1st recur:monthly until:2026-03-31

The recurring task does not appear in the list of tasks, however. It has a status of "recurring" and serves as a template: Taskwarrior creates instances of it and adds those to the list of tasks, not the template itself. By default, only the next occurrence of the task is added to the list, but the tool can be configured to maintain several additional instances if desired.

Taskwarrior further supports a considerable number of reports for visualizing tasks. This includes burndown charts (seen below) showing the number of pending, active, and completed tasks over time (by day, week, or month), history reports, and lists of projects and tags.

[Burndown chart]

The basic task list command is also a report. Many aspects of its output are configurable. Users can override the filter (which, by default, only shows the pending tasks), the columns (which show metadata for each task), and the task order in the list. Similarly, a "task all" command shows all pending, completed, and deleted tasks.

Extending Taskwarrior

Various external projects exist that extend Taskwarrior's functionality. The Tools page on the project's web site enumerates 870 projects (410 when excluding dormant projects). For instance, taskwarrior-tui and vit implement interactive terminal user interfaces. The Bugwarrior extension can update a local Taskwarrior database from Bugzilla, GitHub, GitLab, and other bug-tracking systems.

Another noteworthy companion tool is Timewarrior, which was created by the same developers as Taskwarrior. A hook script integrates Timewarrior with Taskwarrior to track time whenever a task is active. Hence, starting a task with "task <ID> start" prompts the hook script to begin time tracking in Timewarrior until Taskwarrior stops the task after a "task <ID> stop" command. The "timew summary" command will then show a report of the tracked intervals of the day; users can bring up some charts about the tracked tasks, as well. Timewarrior shares Taskwarrior's simple approach for basic functionality coupled with flexible configurability, as can be seen in its documentation.

Upgrade issues

Taskwarrior 3.0 introduced a completely new task storage and synchronization backend written in Rust: TaskChampion. As a result, users running a 2.x version of Taskwarrior need to export their tasks from Taskwarrior 2 and, after installing Taskwarrior 3, re-import all of their tasks, as explained in the documentation about the upgrade process. After this, the old plaintext data files can be deleted, since the tasks are now stored in a SQLite file.
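Assuming both versions are at hand, the migration amounts to a round trip through Taskwarrior's JSON export format:

    task export > tasks-backup.json    # run with Taskwarrior 2.x
    # ...install Taskwarrior 3.0...
    task import tasks-backup.json      # populate the new backend

Keeping the JSON file around afterward is cheap insurance against surprises.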

The upgrade caught some users by surprise, however. For instance, the Arch Linux package didn't provide a migration script nor instructions about the necessary steps. As a result, upgrading the Taskwarrior package on Arch Linux led to an empty task database.

Another breaking change concerns the synchronization functionality. Taskwarrior 2 has its own server for synchronizing to-do lists across machines, "taskd". The documentation still explains this server setup, but Taskwarrior 3 is not compatible with this implementation. For the new version, Taskwarrior developers recommend the use of a cloud-storage backend. However, according to the "task-sync" man page, Google Cloud Platform appears to be the only supported platform at the moment.

The man page offers two alternate options: storing Taskwarrior's data directory on a file sharing service such as Dropbox or Google Drive, or running a TaskChampion sync server. Unfortunately, the latter option suffers from a lack of documentation. In response to an issue about this problem, Taskwarrior developer Dustin J. Mitchell admitted: "I suspect you're not alone in wanting to deploy a self-hosted sync backend. We don't have a great answer for that right now." Thus, users relying on Taskwarrior 2's taskd should probably wait before upgrading to Taskwarrior 3. This includes users of the Flutter-based TaskWarrior Mobile for Android.

Conclusion

More than twelve years after embracing Taskwarrior 2.0 for managing my to-do list and tracking time, I'm still a happy user. My usage pattern has remained relatively unchanged since then, which is a testament to Taskwarrior's non-intrusive nature. Remembering a few commands is enough for my daily work, and its command-line nature makes it easy to integrate into other tools. For instance, I display the ID and description of the currently active task in my tmux status bar. For the rare occasions when I need advanced functionality, the man pages and documentation are excellent. For users spending a lot of time with the command line, Taskwarrior is a great solution for frictionless management of their to-do lists.

Comments (16 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Briefs: Social engineering; Putty 0.81; XZ takeaways; Quotes; ...
  • Announcements: Newsletters, conferences, security updates, patches, and more.
Next page: Brief items>>

Copyright © 2024, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds