Inheritable credentials for directory file descriptors

By Jonathan Corbet
May 2, 2024

In Unix-like systems, an open file descriptor carries the right to access the opened object in specific ways. As a general rule, that file descriptor does not enable access to any other objects. The recently merged BPF token feature runs counter to this practice by creating file descriptors that carry specific BPF-related access rights. A similar but different approach to capability-carrying file descriptors, in the form of directory file descriptors that include their own credentials, is currently under consideration in the kernel community.

Linux systems allow a process to open a directory with any of the variants of the open() system call. The resulting "directory file descriptor" can be used to read the contents of the directory; it is also useful, when passed to system calls like openat(), to specify the starting directory for the pathname lookup of the file to be opened. A privileged process can open a directory and give the file descriptor to a less-privileged process (or simply drop its own privileges), and that descriptor will continue to be usable to access the directory, even if the owning process would otherwise be unable to do so.

That access does not, however, extend to any files contained within that directory.
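The current behavior can be illustrated with a short sketch. It uses Python, whose os.open() performs an openat()-style lookup when given a dir_fd argument; the helper name open_relative() is made up for this example.

```python
import os


def open_relative(dir_path: str, name: str) -> bytes:
    """Read up to 4096 bytes of `name`, looked up relative to a directory fd."""
    # O_DIRECTORY makes the open fail unless dir_path really is a directory.
    dirfd = os.open(dir_path, os.O_RDONLY | os.O_DIRECTORY)
    try:
        # With dir_fd set, the lookup of `name` starts at the directory
        # behind dirfd, as with openat(). The permission check on the file
        # itself still uses the caller's current credentials, not the ones
        # that were in effect when dirfd was opened.
        fd = os.open(name, os.O_RDONLY, dir_fd=dirfd)
        try:
            return os.read(fd, 4096)
        finally:
            os.close(fd)
    finally:
        os.close(dirfd)
```

If the directory descriptor had been created by a more privileged process, the second os.open() call would still fail with a permission error for any file that the calling process cannot read on its own.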

Stas Sergeev recently proposed a change to that situation in the form of a new flag (OA2_INHERIT_CRED) for the openat2() system call. If a process uses that flag while opening a file, and that process provides a directory file descriptor, the file will be opened using the credentials that were in effect when the directory was opened. So, if a privileged process created the directory file descriptor, any other process owning that descriptor could open files in the reference directory using the privileged process's user and group IDs.

In other words, when this flag is used, a directory file descriptor grants more than just access to the directory itself; it also provides credentials to access files within the directory. This feature can be used, according to Sergeev, to implement a sort of lightweight sandboxing mechanism to restrict a process (or a container) to a specific directory tree. Such restrictions can be implemented now, but they are rather more cumbersome to set up.

Andy Lutomirski said that he liked the idea; "it's a sort of move toward a capability system". He added, though, that turning a directory file descriptor into this sort of capability should require an explicit act — it should not just happen by default. Not every process providing a directory file descriptor to another will want to hand over its rights to access objects in the directory as well. He also worried about potential mischief resulting from directory file descriptors opened in special filesystems like /proc.

As a result of these comments, a number of changes had been made by the time that the patch series got to version 6. To be usable with the (renamed) OA2_CRED_INHERIT flag, a directory file descriptor must have been opened with the new O_CRED_ALLOW flag. An attempt to use the OA2_CRED_INHERIT flag on a directory file descriptor created without O_CRED_ALLOW will just result in an EPERM error. The kernel will also reject OA2_CRED_INHERIT opens that involve /proc or symbolic links that lead out of the directory. Any file descriptors opened using OA2_CRED_INHERIT will be automatically closed in an execve() call.
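The version-6 rules can be summarized as a toy user-space model. This is only a sketch of the behavior described above, not kernel code: the class and function names are invented, the order of the checks is an assumption, and since the error returned for rejected /proc or escaping lookups is not stated here, a placeholder exception stands in for it.

```python
import errno
from dataclasses import dataclass


@dataclass(frozen=True)
class Cred:
    uid: int
    gid: int


@dataclass(frozen=True)
class DirFd:
    opener: Cred      # credentials in effect when the directory was opened
    cred_allow: bool  # True if it was opened with the proposed O_CRED_ALLOW


@dataclass(frozen=True)
class OpenedFile:
    cred: Cred        # credentials used for the permission check
    cloexec: bool     # closed automatically on execve()


class LookupRejected(Exception):
    """Placeholder: the real error code for this case is not specified here."""


def open_with_inherit(dirfd: DirFd, caller: Cred, cred_inherit: bool,
                      under_proc: bool = False,
                      escapes_dir: bool = False) -> OpenedFile:
    if not cred_inherit:
        # Ordinary open: the caller's own credentials are checked.
        # (A caller-requested O_CLOEXEC is not modeled.)
        return OpenedFile(cred=caller, cloexec=False)
    if not dirfd.cred_allow:
        # The directory's opener never opted in to sharing its credentials.
        raise PermissionError(errno.EPERM, "directory lacks O_CRED_ALLOW")
    if under_proc or escapes_dir:
        # /proc, or a symbolic link leading out of the directory.
        raise LookupRejected("lookup not permitted with OA2_CRED_INHERIT")
    # The file is opened with the directory opener's credentials, and the
    # resulting descriptor does not survive execve().
    return OpenedFile(cred=dirfd.opener, cloexec=True)
```

In this model, an unprivileged caller holding a DirFd created by root with cred_allow=True gets a file checked against root's credentials; the same call against a DirFd with cred_allow=False raises PermissionError with EPERM.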

Meanwhile, O_CRED_ALLOW directory file descriptors cannot be passed to any other process over a Unix-domain socket. This would appear to be the only case where the SCM_RIGHTS mechanism restricts the type of file descriptor that can be passed in this way. This restriction prevents a container from giving its special permissions to a process outside of the container, but it will also block attempts to pass an O_CRED_ALLOW file descriptor into a container. For the intended use case (where a privileged process sets up the file descriptor before dropping privileges) this restriction will not be a problem, but it could possibly impede other use cases.
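For reference, this is roughly what ordinary SCM_RIGHTS descriptor passing looks like today, sketched in Python with the standard socket module; the helper names send_fd() and recv_fd() are made up for the example. Under the proposed patches, the send would be refused for a directory descriptor opened with O_CRED_ALLOW; the exact failure mode is not described here, so it is not modeled.

```python
import array
import os
import socket


def send_fd(sock: socket.socket, fd: int) -> None:
    """Pass one open file descriptor over a Unix-domain socket."""
    # At least one byte of ordinary data must accompany the control message.
    sock.sendmsg(
        [b"x"],
        [(socket.SOL_SOCKET, socket.SCM_RIGHTS, array.array("i", [fd]))],
    )


def recv_fd(sock: socket.socket) -> int:
    """Receive one file descriptor sent with send_fd()."""
    fds = array.array("i")
    _msg, ancdata, _flags, _addr = sock.recvmsg(
        1, socket.CMSG_LEN(fds.itemsize)
    )
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            # Ignore any trailing partial integer in the control data.
            usable = len(data) - (len(data) % fds.itemsize)
            fds.frombytes(data[:usable])
            return fds[0]
    raise RuntimeError("no file descriptor received")
```

The receiver ends up with its own descriptor number referring to the same open directory, which it can then use as the dir_fd for openat()-style lookups.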

Sergeev notes in the series that, if this idea is accepted, there are more patches to come:

"This patch is just a first step to such sandboxing. If things go well, in the future the same extension can be added to more syscalls. These should include at least unlinkat(), renameat2() and the not-yet-upstreamed setxattrat()."

Whether things will, in fact, go well is yet to be determined; this sort of security-related change to a core system call tends to need a high degree of review. And, of course, there will be people with other ideas of how this functionality could be provided. For example, Lutomirski proposed a somewhat more elaborate mechanism where credentials could be attached using open_tree() (which is part of the new(ish) mount API); a process could then mount the given subtree as a separate filesystem. This would allow him to "pick a host directory, pick a host *principal* (UID, GID, label, etc), and have the *entire container* access the directory as that principal".

Lutomirski was seeking comments on this approach and did not include an implementation of this idea. The comment he got came from filesystem-layer maintainer Christian Brauner, who pointed out that ID-mapped mounts can already provide most of the functionality that Lutomirski appeared to be looking for. Lutomirski has not yet responded to indicate whether he agrees.

It may take some time to see whether this work is accepted, and in which form. Adding new security features to an operating-system kernel needs to be done with care; there can often be surprising interactions with existing features, and they may be used in surprising ways. Serious vulnerabilities have resulted from file descriptors passed into containers in the recent past; developers would want to be sure that this feature would not lead to similar problems. But, regardless of how this specific patch set is ultimately received, it does demonstrate a direction — toward more capability-oriented systems — that many developers would like to pursue.

Inheritable credentials for directory file descriptors

Posted May 2, 2024 17:03 UTC (Thu) by Cyberax (✭ supporter ✭, #52523)

How is it going to interact with SELinux (AppArmor, etc.)?

Inheritable credentials for directory file descriptors

Posted May 3, 2024 1:16 UTC (Fri) by gerdesj (subscriber, #5446)

nwfs and nss (Novell filesystems) used to have a rather nifty feature: ACLs applied at a certain point would flow downwards unless blocked by another ACL mask. This means that setting access on huge file systems takes seconds and not minutes/hours.

I remember auditing and applying trustee rights to some pretty large filesystems within a few seconds, programmatically (an Excel workbook and some VB OCXs).

NWFS volumes would take a while to mount due to loading a lot of data structures into RAM but NSS volumes seemed to work on the fly rather well and offered pretty much everything that ZFS does but without the ridiculous amount of time that ACLs take to write out to every single object.

I used to have a spreadsheet that would find all pools and volumes on site (for a former employer) and query them and work out what space was used by each folder and account for compression too. Due to the trustee/ACL model this was pretty rapid. You could amend the space allowances and hit another button and within about a minute the assignments were fixed up to the new values.

That was back in say 1990-2010ish. Obviously we are far more sophisticated these days.

Inheritable credentials for directory file descriptors

Posted May 3, 2024 2:42 UTC (Fri) by willy (subscriber, #9762)

The problem with that is hard links and NFS file handles. There may be more than one way to get to a file or we may not know any way to get to a file.

Inheritable credentials for directory file descriptors

Posted May 3, 2024 8:19 UTC (Fri) by skissane (subscriber, #38675)

> The problem with that is hard links and NFS file handles. There may be more than one way to get to a file or we may not know any way to get to a file.

It would have been better if Unix inodes had a reverse pointer back to the directory entries. Then of all the names of a file, you can single one out as "primary". This is how APFS handles hard links.

Inheritable credentials for directory file descriptors

Posted May 3, 2024 8:29 UTC (Fri) by mjg59 (subscriber, #23239)

I think it'd probably have been easier to just not provide hardlinks at all - if symlinks had backlinks this would be much easier (it would also make, eg, inotify handles on an entire directory structure possible). NFS is, well, its own special set of problems.

Inheritable credentials for directory file descriptors

Posted May 3, 2024 8:30 UTC (Fri) by mjg59 (subscriber, #23239)

(And, yes, this would make cross-filesystem symlinks much harder, but if we'd never had them would we have figured something else out instead?)

Inheritable credentials for directory file descriptors

Posted May 3, 2024 9:28 UTC (Fri) by skissane (subscriber, #38675)

I've often thought we need a third type of link, harder than symlinks but softer than hard links.

One whose link target is a numeric file ID not a path, so if you move the target file, the link still just works. However, if you delete the target file, then the link becomes broken.

Inheritable credentials for directory file descriptors

Posted May 3, 2024 19:45 UTC (Fri) by NYKevin (subscriber, #129325)

I think we have too many kinds of link right now. In practice, hard links have so many asterisks attached to them that they're basically unusable by end users:

1. It looks like a(n optimized) copy, but it is not a copy, because software could overwrite it in-place and destroy the original.
2. It looks like a (durable) link, but it is not a (durable) link, because software could easily break it without specifically intending to do so.
3. Whether you have problem (1) or problem (2) is purely dependent on the intentions of the user, which are unknowable. There is also no way for any given piece of software to avoid doing one or the other, if it is going to overwrite a file at all. Consequently, every piece of software that modifies files on the filesystem will always cause at least one of those problems for some subset of users (who create their own hard links).
4. Has to be on the same filesystem. Many users do not appreciate what a "filesystem" is.
5. NTFS stores some metadata redundantly in the dentry (as well as the inode), and it can get out of date if there are multiple hard links.
6. NTFS also implements a sort of "quantum tunneling" in which it tries to recognize whether two different inodes are intended to be "the same file" (because they eventually have the same path and one is created at around the same time as the other is unlinked). This is already a bad idea anyway, but I'm sure it's much more complicated and less predictable when one or both of the inodes can have more than one hard link. For example, imagine deleting a local Git clone, then quickly cloning a different local repo into the same directory.
7. I imagine there's probably a version of NFS that is confused by hard links, because there is always a version of NFS that breaks on any given "interesting" filesystem feature.
8. The concept of "it's one file with two names" is a bit too abstract by itself, so you end up having to explain the underlying mechanism, and users don't care.
9. Anything that makes "one file = one path" less true tends to make filesystem-adjacent security more difficult to correctly implement.

It's too late to get rid of hard links by now, but let's not add any more categories of link.
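Points 1 and 2 in the list above can be reproduced with a small sketch; the function name and file names are made up for the illustration. It overwrites a hard-linked file twice, once in place and once by writing a new file and renaming it over the old name.

```python
import os


def hard_link_demo(workdir: str):
    original = os.path.join(workdir, "original.txt")
    link = os.path.join(workdir, "link.txt")
    with open(original, "w") as f:
        f.write("v1")
    os.link(original, link)  # second name for the same inode

    # In-place overwrite: the existing inode is truncated and rewritten,
    # so the change is visible through both names (point 1).
    with open(original, "w") as f:
        f.write("v2")
    with open(link) as f:
        after_in_place = f.read()

    # Replace-by-rename: a new inode takes over the name "original.txt",
    # while "link.txt" keeps pointing at the old one (point 2).
    tmp = original + ".tmp"
    with open(tmp, "w") as f:
        f.write("v3")
    os.replace(tmp, original)
    with open(link) as f:
        after_rename = f.read()

    same_inode = os.stat(original).st_ino == os.stat(link).st_ino
    return after_in_place, after_rename, same_inode
```

On a filesystem that supports hard links, the link reads "v2" after both steps and the two names no longer share an inode at the end.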

Inheritable credentials for directory file descriptors

Posted May 4, 2024 7:27 UTC (Sat) by donald.buczek (subscriber, #112892)

Two of your bullet points address aspects specific to NTFS. I don't think that flaws of an individual filesystem serve as a valid argument against hardlinks in general.

In one bullet point, you assume NFS is confused by hardlinks, which I don't believe to be true.

All your other points are based on the assumption that users of hardlinks lack understanding of their functionality. While it may be true that some users lack understanding, there are also applications that use hardlinks correctly and know very well how they work.

Although I don't accept your arguments, I fully agree with your last statement: "It's too late to get rid of hard links by now, but let's not add any more categories of link."

Inheritable credentials for directory file descriptors

Posted May 4, 2024 20:24 UTC (Sat) by epa (subscriber, #39769)

If copy-on-write, at the level of a whole file, were consistently supported and cleanly exposed to user space then it would remove one of the few use cases of hard links. The other missing piece would be ‘rename’ as a fundamental operation so you don’t need to link and unlink.

I guess you could also allow unlinking a file while still open, if an application particularly wants to create ‘invisible’ temporary files. I don’t see the point myself but I am sure somebody is doing it. Just not allowing two links to the same inode.

The directory names . and .. have to be handled specially and not be hard links as in classical Unix. I think that’s already the case on any modern system.

Inheritable credentials for directory file descriptors

Posted May 3, 2024 20:25 UTC (Fri) by wahern (subscriber, #37304)

> nwfs and nss (Novell filesystems) used to have a rather nifty feature: ACLs applied at a certain point would flow downwards unless blocked by another ACL mask. This means that setting access on huge file systems takes seconds and not minutes/hours.

This is similar to how OpenBSD unveil and Linux Landlock work, except the ACLs are per process and aren't durable. For each dentry encountered in the open path, check for attached RWX permissions or masks. Once you've reached the end of the path, if the accumulated unveil/Landlock permissions don't match the requested open operation, fail.
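The path walk described above can be sketched as a toy model. It is not how either unveil or Landlock is implemented: the rules table and function name are hypothetical, only grants are modeled (no masks), and permissions from every matching ancestor are simply added together.

```python
def access_allowed(rules: dict, path: str, requested: set) -> bool:
    """Walk an absolute path, accumulating the permissions attached to each
    directory entry on the way, then compare with what was requested."""
    granted = set(rules.get("/", set()))
    current = ""
    for part in (p for p in path.split("/") if p):
        current = current + "/" + part
        # Add whatever permissions are attached at this point in the tree.
        granted |= rules.get(current, set())
    # The open succeeds only if every requested permission was picked up.
    return requested <= granted
```

With rules = {"/srv": {"r"}, "/srv/data": {"w"}}, a read-write open of /srv/data/file is allowed, a write to /srv/other is not, and nothing under /etc is reachable at all.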

Inheritable credentials for directory file descriptors

Posted May 3, 2024 17:26 UTC (Fri) by flussence (subscriber, #85566)

It looks like this new mechanism needs source changes on both sender and receiver side which limits its usefulness, but if it didn't I could imagine stuffing credentials into /proc/$PID/{cwd,root}, in order to give the receiver access to a group's files without actually granting the process group membership.

The use case I had in mind is how to give multiple processes with their own uid/gid read access to a TLS keypair that updates frequently, without downtime, and without giving the entire process a group membership for one file (or worse, letting them have full root access and pinky promise to drop it at the correct time).

Inheritable credentials for directory file descriptors

Posted May 4, 2024 7:48 UTC (Sat) by donald.buczek (subscriber, #112892)

I don't like that idea with OA2_CRED_INHERIT access. I prefer that only the permissions of the opened inode itself be the canonical place for access checks, not the permissions of directories where the inode might have links.

As a side note, for the use case of lightweight sandboxing and others, I wish we had an O_TMPFILE feature for directories, but I see that this is nearly impossible to implement for several reasons.

Inheritable credentials for directory file descriptors

Posted May 5, 2024 8:53 UTC (Sun) by ringerc (subscriber, #3071)

This makes me nervous - the security and access stack is already so incredibly, insanely complex with so many different interactions.

What are the interactions between this new feature and the piles of functionality we have now? ACLs, AppArmor, SELinux, namespaces, you name it.

I'm not saying it's a bad idea - I don't have the expertise for that, and I suspect that someone who got this idea this far knows far far far more than me. But it still makes me nervous. Complexity is the enemy in security.