Systemd heads for a big round-number release

By Daroc Alden
May 7, 2024

The systemd project is preparing for a new release. Version 256-rc1 was released on April 25 with a large number of changes and new features. Most of the changes relate to security, easier configuration, unprivileged access to system resources, or all three of these. Users of systemd will find setting up containers — even without root access — much simpler and more secure.

Lennart Poettering chose to experiment with a new format for announcing features this year: posting a series of Mastodon threads that cover features that he's excited about in more detail. Poettering said that he found it easier to get ideas out on Mastodon than in a more official venue, and invited anyone who wished to consolidate his thoughts as a long-form article to do so. One thread — on systemd's new run0 tool — has already generated substantial commentary.

The first thread describes the new way that systemd finds configuration files. Currently, many tools, systemd included, support reading multiple configuration files from a directory (whose name typically ends in .d) and combining them to produce the final configuration. As Poettering points out in his thread, this approach is useful for package managers, because it lets individual packages add to the configuration while keeping those contributions separate.

There are some situations where it's less important to have files from many packages than from different versions of the same package: for example, a container runtime needing to deal with versioned images. Ideally, existing containers could continue using an older version, while new containers would seamlessly use the newest version. Systemd now supports this use case by reading files from a directory whose name ends in ".v". When a systemd tool goes looking for a particular file — example.ext, for example — it will now accept a directory called example.ext.v/ with files example_[version].ext inside. Of the available files, the tool will pick the one with the highest semantic version number.
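
The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `version_key` and `pick_newest` are hypothetical, and versions are treated as plain dot-separated integers, which is simpler than whatever comparison systemd itself applies.

```python
import re
from pathlib import Path
from typing import Optional, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    # Simplification (assumption): versions are dot-separated integers only.
    return tuple(int(part) for part in version.split("."))


def pick_newest(vdir: Path) -> Optional[Path]:
    """Return the highest-versioned file inside a directory such as example.ext.v/."""
    if not vdir.name.endswith(".v"):
        raise ValueError("expected a directory whose name ends in .v")
    base = vdir.name[: -len(".v")]          # "example.ext.v" -> "example.ext"
    stem, dot, ext = base.rpartition(".")   # -> "example", ".", "ext"
    if not dot:                             # no extension at all
        stem, ext = base, ""
    # Candidates look like example_[version].ext
    pattern = re.compile(
        re.escape(stem) + r"_(\d+(?:\.\d+)*)" + re.escape(dot + ext)
    )
    best = None
    for entry in vdir.iterdir():
        match = pattern.fullmatch(entry.name)
        if match is None:
            continue  # unrelated file, ignore it
        key = version_key(match.group(1))
        if best is None or key > best[0]:
            best = (key, entry)
    return best[1] if best else None
```

Comparing tuples of integers rather than strings is what makes 1.10 sort above 1.2 here.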

The rest of the changes Poettering has chosen to highlight are a bit larger. Systemd has had support for encrypted credentials for some time. In systemd terms, a credential is a named blob that an application may interpret however it likes. Credentials are locked to a computer's trusted platform module (TPM), or stored on an encrypted disk if no TPM is available. These credentials have only been usable by system services, however, not by per-user services. Poettering shared that systemd version 256 would support making credentials available to user services. This is useful in its own right, but other improvements make this feature more useful than it might initially appear.

The release also includes support for working with discoverable disk images (DDIs) in an unprivileged context. DDIs are disk images with embedded metadata that systemd uses for various purposes. DDIs are often used as filesystem images for systemd-nspawn containers. Letting unprivileged users work with DDIs was the last step required to permit unprivileged systemd-nspawn containers.

Finally, systemd also supports configuring some settings by adding encrypted credentials, even if these things are not traditional "credentials", but rather just a useful way to pass configuration parameters into a service using an interface that already existed. For example, systemd-firstboot looks for a credential called firstboot.locale and uses its value as the system's locale. On a physical computer or a virtual machine, those credentials can be passed in via the BIOS or UEFI ESP. In a container, they can be passed in via a mount under /run/host. The number of settings that can be configured this way has been greatly expanded in the new release:

Thus, a regular systemd system will now allow you to configure via credentials: keymap, locale, timezone, issue file, motd file, hosts file, .link files, .network files, .netdev files, DNS servers, DNS search domains, root passwords, root shell, SSH key of root, additional SSH address/port to listen on, sysuser.d/ additions, tmpfiles.d/ additions, sysctl.d/ additions, fstab additions, console font, additional TTYs to spawn gettys on, socket to forward journal data to, socket for sd_notify() messages from the system, machine ID, hostname, systemd-homed users to create, cryptsetup passwords and pins, additional unit files and drop-ins for unit files, udev rules, and more.

The combination of these features means that it is now possible for an unprivileged user to configure their own systemd-nspawn containers — or even entire hierarchies of such containers — using encrypted credentials that are protected from other users on the host system.
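
From the point of view of a service, consuming a credential is simple: systemd exposes the credentials passed to a service as files in the directory named by the $CREDENTIALS_DIRECTORY environment variable. The following is a minimal sketch of that lookup; the function name `read_credential` is hypothetical, and firstboot.locale is used only because it is the example named above.

```python
import os
from pathlib import Path
from typing import Optional


def read_credential(name: str, default: Optional[str] = None) -> Optional[str]:
    """Return the contents of a named credential, or `default` if it is absent."""
    cred_dir = os.environ.get("CREDENTIALS_DIRECTORY")
    if not cred_dir:
        return default  # not started with any credentials
    try:
        # Each credential is a file named after the credential.
        return (Path(cred_dir) / name).read_text().strip()
    except FileNotFoundError:
        return default


if __name__ == "__main__":
    # Falls back to a default when no credential was supplied.
    print(read_credential("firstboot.locale", default="C.UTF-8"))
```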

That isn't the only feature designed to make interacting with containers or virtual machines more pleasant, however. Many readers may be aware of the sd_notify() protocol that systemd uses to get information from system services about their status. Less well-publicized is the fact that systemd actually sends sd_notify() messages to whatever started it. This is useful for running systemd under another init system, but it also means that systemd can signal the host of a container this way. Since version 253, systemd has also supported the AF_VSOCK option for sending sd_notify() messages, letting it send messages to the virtual machine manager responsible for more traditional virtual machines.
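
For readers who have not seen the protocol up close, a notification is just a datagram of KEY=VALUE lines sent to the socket named in $NOTIFY_SOCKET. The sketch below shows the sending side for the AF_UNIX case only; it does not handle the AF_VSOCK variant mentioned above, and the optional `socket_path` parameter is an addition for illustration and testing.

```python
import os
import socket
from typing import Optional


def sd_notify(message: str, socket_path: Optional[str] = None) -> bool:
    """Send one notification datagram; return False if there is nowhere to send it."""
    path = socket_path or os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False  # nobody is listening for notifications
    if path.startswith("@"):
        # A leading "@" denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message.encode(), path)
    return True


if __name__ == "__main__":
    # A service would typically report readiness like this.
    sd_notify("READY=1")
```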

Version 256 adds a new message that systemd will send when a given target is fully activated: X_SYSTEMD_UNIT_ACTIVE=[unit name]. Poettering calls this "both a progress notification and a feature notification". One example use is letting the host system of a virtual machine know when the SSH socket (which systemd sets up before starting SSH, and then hands over when the service is up in socket-activated configurations) is bound, and therefore it can connect without errors or retries. Other uses include discovering what services are running on a virtual machine, or providing a more granular view of how far into starting up the machine is.
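
On the receiving side, a container or virtual machine manager could watch for the new field with something like the sketch below. It assumes the usual newline-separated KEY=VALUE layout of notification datagrams; the function names and the unit name example.target are hypothetical.

```python
from typing import Dict, Iterable, List


def parse_notify(datagram: bytes) -> Dict[str, str]:
    """Split one notification datagram into its KEY=VALUE fields."""
    fields: Dict[str, str] = {}
    for line in datagram.decode("utf-8", errors="replace").splitlines():
        key, sep, value = line.partition("=")
        if sep and key:
            fields[key] = value
    return fields


def activated_units(datagrams: Iterable[bytes]) -> List[str]:
    """Collect the unit names announced via X_SYSTEMD_UNIT_ACTIVE, in order."""
    units: List[str] = []
    for datagram in datagrams:
        unit = parse_notify(datagram).get("X_SYSTEMD_UNIT_ACTIVE")
        if unit is not None:
            units.append(unit)
    return units


if __name__ == "__main__":
    seen = activated_units([
        b"READY=1\nSTATUS=booting",
        b"X_SYSTEMD_UNIT_ACTIVE=example.target\n",
    ])
    print(seen)
```

A manager waiting for SSH could block until the relevant unit name shows up in this list, instead of retrying connections.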

Another feature that existed previously in a smaller form, but which is now available to the whole system, is a configuration option called ProtectSystem. Services with this option run in a separate mount namespace where important system directories — particularly /usr — are mounted read-only. Since few programs need to write to /usr, this is a fairly seamless way to make the system more secure.

With version 256, this option can now be applied to the entire system instead of on a service-by-service basis. While this is not practical for most systems, since tools like package managers do still need to write to /usr on occasion, there is one place where enabling the option by default makes sense: the system's initial ramdisk.

When a Linux system starts up, it begins by creating a temporary, in-memory filesystem and unpacking the initial ramdisk into it. Then it starts the init process from that ramdisk, and leaves the task of actually setting up all the expected filesystem mounts and so on to user space. Often, this setup involves talking to the network, receiving encryption secrets to unlock the hard disk, or both. Exposing trusted code to the network is always risky, but the code to handle both of those things can also write to the temporary filesystem, opening an even larger attack surface. With the new version, however, ProtectSystem becomes the default for systemd on a ramdisk, causing it to remount the temporary filesystem as read-only before proceeding with the rest of the boot. Early tests revealed few problems with this change, Poettering said. The only distribution to have a serious problem with it was Fedora; dracut (the tool Fedora uses to create an initial ramdisk) had problems writing hook files with the new protection in place, but has since been fixed.

The final feature that Poettering has discussed at the time of writing (although more threads seem sure to follow) is a quality-of-life improvement for users of systemd-homed, a service that encrypts users' home directories until they log in. Unfortunately, encrypted home directories don't work with SSH because it doesn't include a mechanism to ask for encryption secrets before trying to start a shell (systemd-homed loads SSH authorized keys from outside the home directory, so that is not a barrier to SSH logins). Currently, users must log in locally at least once (so that they are prompted to unlock their home directory) before SSH logins will work correctly. With the new update, systemd has added a shim that will intercept SSH logins for a user with an encrypted home directory and prompt them to enter encryption credentials over the network.

New systemd versions don't just bring new features, however. They also bring the deprecation of old features. In this case, the most noticeable deprecation is that systemd is finally dropping support for version 1 control groups (cgroups) in favor of the newer version 2 cgroups. On a system that boots with version 1 control groups, systemd will now fail loudly with an error, although version 1 cgroups can still be turned on with an option on the kernel's command line, for now.

There are other, less notable additions and deprecations with the release as well, including changes to nscd caching, configuration file locations, and many others. Interested readers can find the full list in the project's NEWS file. Systemd releases usually have three or four release candidates approximately a week apart, so it is reasonable to expect that systemd version 256 will be fully released in approximately a month, and make its way into distributions from there.

Systemd heads for a big round-number release

Posted May 7, 2024 16:21 UTC (Tue) by bluca (subscriber, #118303) [Link]

> The release also includes support for working with discoverable disk images (DDIs) in an unprivileged context. DDIs are disk images with embedded metadata that systemd uses for various purposes. DDIs are often used as filesystem images for systemd-nspawn containers. Letting unprivileged users work with DDIs was the last step required to permit unprivileged systemd-nspawn containers.

Note that this feature is not exclusively for nspawn - there are two new services, publishing IPC interfaces, that allow any unprivileged user to send over uninitialized user+mount namespace FDs, get a uid range assigned to the former, and an image mounted in the latter. This is gated by polkit and/or the image verity signature being trusted by the kernel.
This can be used to implement real unprivileged container managers - unlike the currently existing ones, which say they are unprivileged but actually need setuid binaries, which means they are anything but.

Systemd heads for a big round-number release

Posted May 7, 2024 20:42 UTC (Tue) by intelfx (subscriber, #130118) [Link]

> This can be used to implement real unprivileged container managers - unlike the currently existing ones, which say they are unprivileged but actually need setuid binaries

You mean "disk-image" container managers? Not sure how to call it, but I'm pretty sure podman is already truly unprivileged...

Anyway, I can see this could be pretty useful but also dangerous because it would allow the kernel to trip over a potentially malicious filesystem image. I'm assuming polkit/verity integration is there for this exact reason, with polkit covering the workstation use-case and verity covering the "container fleet" use-case?

Systemd heads for a big round-number release

Posted May 7, 2024 21:17 UTC (Tue) by bluca (subscriber, #118303) [Link]

> You mean "disk-image" container managers? Not sure how to call it, but I'm pretty sure podman is already truly unprivileged...

Nope, uses setuid binaries

> Anyway, I can see this could be pretty useful but also dangerous because it would allow the kernel to trip over a potentially malicious filesystem image. I'm assuming polkit/verity integration is there for this exact reason, with polkit covering the workstation use-case and verity covering the "container fleet" use-case?

Yes, pretty much

Systemd heads for a big round-number release

Posted May 8, 2024 1:01 UTC (Wed) by intelfx (subscriber, #130118) [Link]

> Nope, uses setuid binaries

Okay, what am I missing here?

# find /usr -perm /u+s,g+s -print0 | parallel -0 -X pacman -Qqo | sort -u | grep -Fxf <(pactree --linear --unique podman)
dbus
krb5
pam
shadow
util-linux

Systemd heads for a big round-number release

Posted May 8, 2024 15:00 UTC (Wed) by pbonzini (subscriber, #60935) [Link]

I think podman uses newuidmap/newgidmap, which is part of the shadow package (it's in a package called shadow-utils on Fedora)

podman unprivileged (Systemd heads for a big round-number release)

Posted May 8, 2024 1:07 UTC (Wed) by geofft (subscriber, #59789) [Link]

podman uses the newuidmap/newgidmap commands to get access to a range of subuids. There are cases where this is close enough to "unprivileged," because those commands can come from the OS and have a pretty stable/minimal interface, and an unprivileged user can install whatever version of podman they like via whatever means (e.g. building from source) and it will work with the existing newuidmap/newgidmap binaries - nothing in podman itself needs to be setuid.

So, for instance, this is more unprivileged than anything that requires a new version of systemd to be running as root: an unprivileged user will often be on some system that sets up subuids/subgids for them and has the setuid helpers because that's what the distro does by default, but the system is running some existing stable release of a distro. (Also, all of this only works if unprivileged user namespaces are enabled, which some distros are clamping down on.)

It is possible to get podman to work without the setuid binaries at all, though - assuming you don't have subuids set up, you can just do podman run --uidmap 0:0:1, and it will realize it doesn't need to map things. (If you do have /etc/subuid and /etc/subgid files but newuidmap/newgidmap aren't actually setuid, e.g. because you're in some no-new-privs mode or using strace or whatever, it's currently a little bit annoying but doable. I got this working with suitable fake subuidmap and subgidmap commands.)

But also, this is a sort of different thing from what systemd-nspawn has gained a privileged helper for. There isn't anything that handles UID/GID mapping in this setup, is there? So the limitation, which would apply to both systemd-nspawn and fully unprivileged podman, is that you only get a single UID/GID inside the container. Which is often fine but not always.

The privileged helper for systemd-nspawn is, as I understand it, an IPC interface to get a mounted root filesystem for the container. podman does not need privilege for that - but it does need FUSE accessible to unprivileged users, which is another common but not guaranteed configuration. (Or, I think, it can just copy a bunch of files with the default "vfs" driver.) I'm curious if you considered a FUSE approach of some sort to allow using untrusted images.

podman unprivileged (Systemd heads for a big round-number release)

Posted May 8, 2024 1:12 UTC (Wed) by intelfx (subscriber, #130118) [Link]

Thanks, this has just filled a gap in my understanding (that, to my shame, I did not even realize was there).

podman unprivileged (Systemd heads for a big round-number release)

Posted May 8, 2024 1:23 UTC (Wed) by geofft (subscriber, #59789) [Link]

On a side note - if systemd wanted to incorporate some IPC interface for doing the thing that newuidmap/newgidmap does but without a setuid binary, that would be awesome, and very much in line with the run0 stuff. I tried looking at whether we could eliminate setuid binaries (or equivalently stick people in a no-new-privs environment at login) at my workplace and I think newuidmap/newgidmap and sudo were the most difficult ones to deal with.

podman unprivileged (Systemd heads for a big round-number release)

Posted May 8, 2024 9:52 UTC (Wed) by bluca (subscriber, #118303) [Link]

That is exactly what the new IPC does, it's provided by the new service systemd-nsresourced, and that's why as I mentioned this is really setuid-free:

https://www.freedesktop.org/software/systemd/man/devel/sy...

podman unprivileged (Systemd heads for a big round-number release)

Posted May 8, 2024 14:11 UTC (Wed) by intelfx (subscriber, #130118) [Link]

BTW, is there any reason this uses varlink instead of D-Bus? Are there any plans to reconcile the two, or make Varlink _the_ primary IPC method (obsoleting D-Bus), or is this another one of those "15 competing standards" situations?

podman unprivileged (Systemd heads for a big round-number release)

Posted May 8, 2024 14:28 UTC (Wed) by bluca (subscriber, #118303) [Link]

The last one

podman unprivileged (Systemd heads for a big round-number release)

Posted May 8, 2024 14:34 UTC (Wed) by daroc (editor, #160859) [Link]

As my mother says, the nice thing about standards is that there are so many of them.

podman unprivileged (Systemd heads for a big round-number release)

Posted May 8, 2024 14:35 UTC (Wed) by paulj (subscriber, #341) [Link]

Sounds like NIH. Also, basing a protocol for simplicity on JSON is.... interesting (passing numbers in JSON in a generally robust way is anything but simple).

podman unprivileged (Systemd heads for a big round-number release)

Posted May 8, 2024 9:55 UTC (Wed) by bluca (subscriber, #118303) [Link]

Podman is not really unprivileged though, as it uses setuid binaries, which escalate privileges of the callers and have all the well-known issues and vulnerabilities. Nspawn doesn't use any of that since this release, and instead gets its UID mappings assigned via IPC from systemd-nsresourced, so it's radically different - no setuid binaries involved.
And this is an IPC API, so anything else can make use of it, if it is enabled on the system, not just nspawn.

podman unprivileged (Systemd heads for a big round-number release)

Posted May 8, 2024 13:47 UTC (Wed) by rbranco (subscriber, #129813) [Link]

Which command? I don't see any setuid binaries for podman on openSUSE, Fedora, Debian & Ubuntu.

podman unprivileged (Systemd heads for a big round-number release)

Posted May 8, 2024 13:53 UTC (Wed) by bluca (subscriber, #118303) [Link]

newuidmap and newgidmap

podman unprivileged (Systemd heads for a big round-number release)

Posted May 8, 2024 21:55 UTC (Wed) by paravoid (subscriber, #32869) [Link]

> podman does not need privilege for that - but it does need FUSE accessible to unprivileged users, which is another common but not guaranteed configuration. (Or, I think, it can just copy a bunch of files with the default "vfs" driver.)

I think you're talking about the fuse-overlayfs driver. That's actually not needed anymore: the kernel's overlayfs these days* supports user namespaces, and can thus be used without root (assuming unprivileged user namespaces are enabled of course). So FUSE is not required anymore.

*: Linux >= 5.11

Systemd heads for a big round-number release

Posted May 7, 2024 23:05 UTC (Tue) by smcv (subscriber, #53363) [Link]

> I'm pretty sure podman is already truly unprivileged...

Sadly, no. If it was, it would have the same limitations as bubblewrap, which can only provide two user IDs (the one that is mapped to the caller's uid, and the kernel's overflow uid, which you can think of as "me" and "not me" respectively) and the analogous situation for group IDs.

This is a kernel-imposed restriction, because the kernel doesn't know that it's OK for me (uid 1000, say) to run arbitrary code as some mapped uid (uid 100000, say). That policy is provided by /etc/subuid and /etc/subgid, which are read by setuid programs like newuidmap but are not special to the kernel.

That's usually fine for a typical bubblewrap use-case like Flatpak that only wants to run a single app, with the wider system protected from the app, but no further privilege separation within it; but it would not be enough for a whole-system container manager like Podman or Incus that wants to distinguish between various different uids inside the container, or perhaps even run a whole OS from init upwards.
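
To make the policy side of this concrete, here is a rough sketch of the kind of lookup a helper has to do against /etc/subuid, whose entries have the form name:start:count. The function names are hypothetical, and the check is deliberately simplified; the real newuidmap performs additional validation that is not modeled here.

```python
from typing import List


def parse_subid(text: str, user: str) -> List[range]:
    """Return the subordinate ID ranges delegated to `user` in subuid-style text."""
    ranges: List[range] = []
    for line in text.splitlines():
        parts = line.strip().split(":")
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, start, count = parts
        if name == user:
            ranges.append(range(int(start), int(start) + int(count)))
    return ranges


def may_map(uid: int, ranges: List[range]) -> bool:
    """True if `uid` falls inside one of the delegated ranges."""
    return any(uid in r for r in ranges)


if __name__ == "__main__":
    # Made-up sample contents, not read from a real system.
    sample = "alice:100000:65536\nbob:165536:65536\n"
    print(may_map(100000, parse_subid(sample, "alice")))
```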

Systemd heads for a big round-number release

Posted May 8, 2024 1:04 UTC (Wed) by intelfx (subscriber, #130118) [Link]

> This is a kernel-imposed restriction, because the kernel doesn't know that it's OK for me (uid 1000, say) to run arbitrary code as some mapped uid (uid 100000, say). That policy is provided by /etc/subuid and /etc/subgid, which are read by setuid programs like newuidmap but are not special to the kernel.

Ah, I see. I did not know it worked this way.

Systemd heads for a big round-number release

Posted May 8, 2024 1:21 UTC (Wed) by geofft (subscriber, #59789) [Link]

Well, bubblewrap's manpage explicitly mentions using those tools:

--userns-block-fd FD: Do not initialize the user namespace but wait on FD until it is ready. This allow external processes (like newuidmap/newgidmap) to setup the user namespace before it is used by the sandbox process.

So I don't think that this is really a limitation of either podman or bubblewrap. By themselves they can only put you in single-UID mode (see my other comment for how to do this with podman); with the setuid helpers, they can both support multi-UID mode. podman defaults to calling the helpers and bubblewrap won't do it on its own, but that's more because of what the two projects are trying to be.

(on a side note, thank you for maintaining bubblewrap, it's awesome)

Systemd heads for a big round-number release

Posted May 8, 2024 4:43 UTC (Wed) by josh (subscriber, #17465) [Link]

> This is a kernel-imposed restriction, because the kernel doesn't know that it's OK for me (uid 1000, say) to run arbitrary code as some mapped uid (uid 100000, say).

This was the problem I attempted to solve many years ago with "supplementary UIDs", which would have allowed the login mechanism to give the user access to a range of additional UIDs that it could use as it saw fit. It used a "setusers" syscall, analogous to "setgroups".

Unfortunately, in the course of developing this, the theoretical possibility came up of a file that gives more permission to "group" or "other" than to "user", so the ability to drop a UID from your current identity was considered a security issue. (The observation that supplementary GIDs would already allow that as well led to a security bugfix to prevent that, which is why container GID management has the extra "setgroups" hoop to jump through.)

Because of that, I gave up on the patch. I would love to see someone revive it; it would be really useful for containers.

Systemd heads for a big round-number release

Posted May 11, 2024 11:38 UTC (Sat) by ringerc (subscriber, #3071) [Link]

I actually always assumed that privileges were the greater of the u, g and o bits. It never occurred to me that a mode like 0460 would be writeable by group *unless* the actor is also the file owner. Wacky. I'll have to try that out now.
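
For anyone else who wants to try it, the classic check can be modeled as picking exactly one permission class, in owner, group, other order, and never falling through to a later one. The sketch below is a simplified model with a hypothetical function name; it ignores root, capabilities, and ACLs.

```python
from typing import Set

READ, WRITE, EXECUTE = 4, 2, 1


def allowed(mode: int, file_uid: int, file_gid: int,
            uid: int, gids: Set[int], want: int) -> bool:
    """Model the traditional Unix permission check for one access request."""
    if uid == file_uid:
        bits = (mode >> 6) & 7   # owner class wins, even if it grants less
    elif file_gid in gids:
        bits = (mode >> 3) & 7   # group class
    else:
        bits = mode & 7          # everyone else
    return (bits & want) == want


if __name__ == "__main__":
    # Mode 0460: owner may read, group may read and write.
    print(allowed(0o460, 1000, 50, uid=1000, gids={50}, want=WRITE))  # owner
    print(allowed(0o460, 1000, 50, uid=1001, gids={50}, want=WRITE))  # group member
```

The owner is denied the write even though they belong to the group, which is the behavior that made dropping a UID or GID a security concern.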

Systemd heads for a big round-number release

Posted May 13, 2024 6:22 UTC (Mon) by immibis (subscriber, #105511) [Link]

What you're describing is a manager that says it's unprivileged but actually needs a binary already running as root, which is the same security level as a setuid binary.

Systemd heads for a big round-number release

Posted May 13, 2024 8:35 UTC (Mon) by anselm (subscriber, #2796) [Link]

> What you're describing is a manager that says it's unprivileged but actually needs a binary already running as root, which is the same security level as a setuid binary.

It's still a lot more difficult to get that binary to do stuff that it's not supposed to do than it is to get a setuid binary to do stuff that it's not supposed to do, so that's a net win.

(For one, all of the setuid binary is running as root, while the unprivileged manager makes requests to the remote binary using something like D-Bus, and only those requests are then handled as root. These requests can hopefully be vetted very carefully at the remote end before anything at all is being done.)

Systemd heads for a big round-number release

Posted May 16, 2024 22:20 UTC (Thu) by immibis (subscriber, #105511) [Link]

I disagree. The already-running binary has to be careful to replicate all of the environment that the setuid binary would inherit, or it creates its own set of bugs. Imagine the havoc if "rm /file" and "sudo rm /file" operate on two different files, because you are in a chroot or a mount namespace.

Systemd heads for a big round-number release

Posted May 16, 2024 22:39 UTC (Thu) by bluca (subscriber, #118303) [Link]

If you are in a chroot or a mount namespace you don't have access to the D-Bus system socket, so you can't run it in the first place. That's a good thing.

Systemd heads for a big round-number release

Posted May 16, 2024 22:40 UTC (Thu) by immibis (subscriber, #105511) [Link]

It would be difficult with a chroot, but normal for a mount namespace, to have access to that socket.

Systemd heads for a big round-number release

Posted May 17, 2024 11:36 UTC (Fri) by farnz (subscriber, #17727) [Link]

It's a tradeoff; run0 has to be careful to replicate enough of the environment that the setuid binary would inherit. setuid binaries have to be careful to not let the inherited environment affect behaviour in undesirable ways; imagine the havoc if rm and sudo rm loaded completely different "rm" binaries thanks to the dynamic linker misinterpreting the inherited environment in some way.

Systemd heads for a big round-number release

Posted May 7, 2024 17:29 UTC (Tue) by flussence (subscriber, #85566) [Link]

> […] systemd is finally dropping support for version 1 control groups (cgroups) in favor of the newer version 2 cgroups.

Great! Maybe that'll encourage some more porting of missing functionality to cgroups2. It's been annoying to repeatedly go through a cycle of finding a useful detail, spending half an hour in confusion trying to get it to work, then finally noticing the fine print in the docs saying it only exists in cg1.

Systemd heads for a big round-number release

Posted May 8, 2024 14:48 UTC (Wed) by cortana (subscriber, #24596) [Link]

v2 has been around for long enough that I'm surprised there's still new stuff being added to v1!

Systemd heads for a big round-number release

Posted May 8, 2024 20:39 UTC (Wed) by mss (subscriber, #138799) [Link]

Can you elaborate on what you find still missing in cgroups2 from cgroups1 in the latest upstream kernel?

Systemd heads for a big round-number release

Posted May 9, 2024 2:08 UTC (Thu) by flussence (subscriber, #85566) [Link]

The big one (for me) is the net.* controllers. I tried having them enabled in the kernel but they're absent in the v2 hierarchy. Right now I have unwieldy "meta skuid {…}" nftables rules to put network services into appropriate tc buckets; I'd prefer to reduce the number of hardcoded lists of magic values in there. Ideally I'd set a default classid once each for user-interactive and non-interactive processes (what systemd calls user.slice and system.slice).

IMO replacing the device controller with BPF hooks was a big usability downgrade too. I guess that's hinting that the audience for that feature isn't really system administrators on the ground but TiVoizers.

The documentation (kernel.org/doc/html/latest/admin-guide/) also could be more helpful. v1 has a page per controller, while v2 comes as one gigantic page (and the TOC only shows top-level headings). Figuring out what's what from that alone is hard and I had to dig into the source to write this.