The file_operations structure gets smaller

By Jonathan Corbet
May 3, 2024
Kernel developers are encouraged to send their changes in small batches as a way of making life easier for reviewers. So when a longtime developer and maintainer hits the list with a 437-patch series touching 859 files, eyebrows are certain to head skyward. Specifically, this series from Jens Axboe is cleaning up one of the core abstractions that has been part of the Linux kernel almost since the beginning; authors of device drivers (among others) will have to take note.

The origin of struct file_operations

In the beginning, the Linux kernel lacked any sort of virtual filesystem layer. See, for example, the 0.01 implementation of read(), which contained explicit checks for each possible file-descriptor type. That approach worked to get an initial kernel to boot but, before long, Linus Torvalds realized that it would not scale well. As developers sought to add more device types, and to implement more than one filesystem type, the need for an abstraction layer became more urgent.

The Linux 0.95 release, which came out in March 1992, brought a number of changes, including a switch to the GPL license. It also added the first pieces of what was to become the kernel's virtual filesystem layer. A core piece of that layer was the first file_operations structure, defined, in its entirety, as:

    struct file_operations {
        int (*lseek) (struct inode *, struct file *, off_t, int);
        int (*read) (struct inode *, struct file *, char *, int);
        int (*write) (struct inode *, struct file *, char *, int);
    };

This structure contains the pointers to the functions needed to implement specific system calls on anything that can be represented by a file descriptor. Rather than use an extended if-then-else sequence to determine which type of file was being operated on, the kernel could just do an indirect call to the appropriate file_operations member. As might be expected, the most fundamental operations — reading, writing, and seeking — showed up here first. In early versions of the kernel, there wasn't much else that one could do with a file descriptor.
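The dispatch idea can be sketched outside the kernel. The following is a minimal user-space model, not kernel code: every name in it (demo_file, demo_file_ops, the "null" and "memory" file types) is hypothetical and exists only for illustration. The point is that the generic demo_read() and demo_write() entry points contain no per-type checks; they simply call through whatever table was attached to the file.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */
struct demo_file;

struct demo_file_ops {
    int (*read)(struct demo_file *f, char *buf, int count);
    int (*write)(struct demo_file *f, const char *buf, int count);
};

struct demo_file {
    const struct demo_file_ops *ops; /* chosen once, when the file is opened */
    char data[64];                   /* backing store used by the memory type */
    int len;                         /* number of valid bytes in data */
};

/* "Null" type: reads report end-of-file, writes are silently discarded. */
static int null_read(struct demo_file *f, char *buf, int count)
{
    (void)f; (void)buf; (void)count;
    return 0;
}

static int null_write(struct demo_file *f, const char *buf, int count)
{
    (void)f; (void)buf;
    return count;
}

/* "Memory" type: a write replaces the stored bytes, a read copies them out. */
static int mem_read(struct demo_file *f, char *buf, int count)
{
    int n = count < f->len ? count : f->len;
    memcpy(buf, f->data, (size_t)n);
    return n;
}

static int mem_write(struct demo_file *f, const char *buf, int count)
{
    int n = count < (int)sizeof(f->data) ? count : (int)sizeof(f->data);
    memcpy(f->data, buf, (size_t)n);
    f->len = n;
    return n;
}

static const struct demo_file_ops demo_null_ops = { null_read, null_write };
static const struct demo_file_ops demo_mem_ops = { mem_read, mem_write };

/* Generic entry points: no per-type checks, just an indirect call. */
static int demo_read(struct demo_file *f, char *buf, int count)
{
    if (count < 0 || !f->ops->read)
        return -1;
    return f->ops->read(f, buf, count);
}

static int demo_write(struct demo_file *f, const char *buf, int count)
{
    if (count < 0 || !f->ops->write)
        return -1;
    return f->ops->write(f, buf, count);
}
```

Adding a third file type to this model means defining one more table; demo_read() and demo_write() do not change, which is the scaling property the if-then-else approach lacked.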

The file_operations structure grew from there. The 1.0 version of this structure included ten members, implementing system calls like readdir(), ioctl(), and mmap(). The 2.0 version of struct file_operations had 13 members, and 2.2 added two more. Through all of this history, the read() and write() members remained the way to read from and write to a file descriptor, though their prototypes changed somewhat.

The plot thickens

The 2.4 release, made at the beginning of 2001, included a version of struct file_operations with these new members:

    ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *);
    ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *);

User-space developers often needed the ability to perform scatter/gather I/O — operations involving multiple segments of memory that needed to be transferred in a single operation. In response, the kernel gained support for readv() and writev() but, to properly support these system calls, the kernel needed to pass them down to the underlying implementations. The new members, which took an array of iovec structures containing an address (in user space) and size for each segment, were added for this purpose. For device drivers or filesystems that did not implement the new functions, the kernel would emulate them with a series of read() or write() calls instead.
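For readers who have not used scatter/gather I/O, here is a small sketch of what it looks like from user space. It relies only on the POSIX writev() call and struct iovec; the helper name demo_send_two() is made up for this example. Two buffers that live in different places are handed to the kernel in a single system call.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>   /* struct iovec, writev() */
#include <unistd.h>

/*
 * Hypothetical helper: write a header and a body that live in two
 * separate buffers with a single writev() call, rather than copying
 * them into one buffer or issuing two write() calls.
 */
static ssize_t demo_send_two(int fd, const char *hdr, const char *body)
{
    struct iovec iov[2];

    /* iov_base is not const-qualified, so a cast is needed here;
     * writev() only reads from these buffers. */
    iov[0].iov_base = (void *)hdr;
    iov[0].iov_len = strlen(hdr);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len = strlen(body);

    return writev(fd, iov, 2);
}
```

A caller might pass the write end of a pipe or a socket as fd. The return value is the total number of bytes written across both segments, or -1 on error.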

Subsequent work added many more members to struct file_operations, including other variants of read() and write(). aio_read() and aio_write(), used to implement the kernel's somewhat unloved asynchronous I/O mechanism, went into the 2.5.33 development release. splice_read() and splice_write(), implementing the splice() system call, were added for 2.6.17. Removal of file_operations members, like the removal of kernel code in general, was rare, but readv() and writev() were removed in 2.6.19 after all users were switched to use aio_read() and aio_write() instead.

By the 3.16 release, struct file_operations had grown to 27 members, including these additions, which indicated a new approach to I/O within the kernel:

    ssize_t (*read_iter) (struct kiocb *, struct iov_iter *);
    ssize_t (*write_iter) (struct kiocb *, struct iov_iter *);

Increasingly, I/O operations were being initiated from the kernel, not just from user space; they often involved multiple segments and needed to be executed asynchronously. The data buffers involved could be referenced in a number of ways. The iov_iter structure used to describe these more complex I/O operations looked like this at the time:

    struct iov_iter {
        int type;
        size_t iov_offset;
        size_t count;
        union {
            const struct iovec *iov;
            const struct bio_vec *bvec;
        };
        unsigned long nr_segs;
    };

The distinguishing feature of this structure is the type field. If it was ITER_IOVEC, then the iov union member contained an array of segments using user-space addresses. If it was, instead, ITER_KVEC, then the addresses were in kernel space. And if type was ITER_BVEC, then the bvec field pointed to an array of bio_vec structures (used to describe the segments of block I/O requests). An I/O API defined in this way could be called from a number of contexts and would work regardless of whether the operation was initiated from user space or from within the kernel.
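The role of the type field can be illustrated with a simplified user-space model of a tagged union. This is a sketch under stated assumptions, not the kernel's implementation: the demo_* names and the DEMO_ITER_* constants are hypothetical, the offset and count bookkeeping of the real structure is left out, and a "page" is modeled as a plain character pointer. The consumer, demo_gather(), never needs to know which kind of buffer it was handed.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified model; not the kernel's definitions. */
enum demo_iter_type { DEMO_ITER_IOVEC, DEMO_ITER_BVEC };

struct demo_iovec { const char *base; size_t len; };
struct demo_bvec  { const char *page; size_t offset; size_t len; };

struct demo_iter {
    enum demo_iter_type type;       /* selects the valid union member */
    union {
        const struct demo_iovec *iov;
        const struct demo_bvec *bvec;
    };
    unsigned long nr_segs;          /* number of entries in the array */
};

/* Resolve segment i to a plain (pointer, length) pair for either type. */
static const char *demo_segment(const struct demo_iter *it, unsigned long i,
                                size_t *len)
{
    switch (it->type) {
    case DEMO_ITER_IOVEC:
        *len = it->iov[i].len;
        return it->iov[i].base;
    case DEMO_ITER_BVEC:
        *len = it->bvec[i].len;
        return it->bvec[i].page + it->bvec[i].offset;
    }
    *len = 0;
    return NULL;
}

/* Gather up to max bytes from all segments into dst; returns bytes copied. */
static size_t demo_gather(char *dst, size_t max, const struct demo_iter *it)
{
    size_t copied = 0;

    for (unsigned long i = 0; i < it->nr_segs && copied < max; i++) {
        size_t len;
        const char *src = demo_segment(it, i, &len);

        if (len > max - copied)
            len = max - copied;     /* do not overrun the destination */
        if (len)
            memcpy(dst + copied, src, len);
        copied += len;
    }
    return copied;
}
```

Only demo_segment() inspects the type tag. That mirrors, in a very reduced form, how a set of helper functions can hide the differences between buffer types from the code that actually performs the I/O.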

The kiocb structure is used by the kernel to coordinate asynchronous I/O operations. Drivers are not required to implement asynchronous I/O (though they may not perform as well if they don't), but if they do implement it, they need the information in this structure. The use of struct kiocb reflects the fact that, among other goals, the new methods were intended to replace aio_read() and aio_write(), which were duly removed for the 4.0 release.

struct iov_iter everywhere

Over time, struct iov_iter has evolved and become rather more complex; see the 6.8 version for the details. The kernel has also accumulated a set of helpers that free code from dealing with that complexity much of the time. Meanwhile, struct file_operations in 6.8 is up to 32 callable members. But, through all of this change, read() and write() have remained essentially unchanged, even though they only handle the simplest of I/O operations in what has become a complicated world.

Axboe has decided that, perhaps, those two members have reached the end of their useful life:

10 years ago we added ->read_iter() and ->write_iter() to struct file_operations. These are great, as they pass in an iov_iter rather than a user buffer + length, and they also take a struct kiocb rather than just a file. Since then we've had two paths for any read or write - one legacy one that can't do per-IO hints like "This read should be non-blocking", they strictly only work with O_NONBLOCK on the file, and a newer one that supports everything the old path does and a bunch more.

Since read_iter() and write_iter() can do everything that read() and write() can do, it makes sense to simply remove the older members. The only problem is, of course, there is a lot of code that only implements read() and write() in the kernel; much of it is in drivers that may not have seen significant development (or even use) in years. Some of them surely are being used, though, and breaking them would undoubtedly increase the (already high) level of grumpiness on the net.

Many modules that use the older interface can, with some effort, be converted to use read_iter() and write_iter() instead, perhaps gaining functionality in the process. But there are a lot of these modules, and trying to understand every one of them well enough to do such a conversion is a path to madness, with little benefit. So, instead, Axboe started by implementing a set of helpers that emulates the new functions with a series of calls to read() or write(); that minimizes the amount of change to any given module while maximizing the chances that the results will be correct. See this patch as an example of what the simplest conversions look like.
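The general shape of such an emulation helper can be sketched in user space. What follows is not taken from the patch series: demo_write_iter_via_write(), struct demo_iter, and the demo_dev "device" are all hypothetical stand-ins, and the error-handling policy (report an error only if nothing was written, stop on a short write) is an assumption chosen to resemble ordinary write() semantics.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>  /* ssize_t */

/* Hypothetical stand-ins; these are not the kernel's types or helpers. */
struct demo_seg  { const char *base; size_t len; };
struct demo_iter { const struct demo_seg *segs; unsigned long nr_segs; };

/* Legacy-style method: exactly one contiguous buffer per call. */
typedef ssize_t (*demo_legacy_write)(void *dev, const char *buf, size_t len);

/*
 * Emulate an iterator-based write with one legacy call per segment.
 * Assumed policy: an error is returned only if nothing was written,
 * and a short write ends the loop early.
 */
static ssize_t demo_write_iter_via_write(demo_legacy_write write_fn,
                                         void *dev,
                                         const struct demo_iter *from)
{
    ssize_t total = 0;

    for (unsigned long i = 0; i < from->nr_segs; i++) {
        const struct demo_seg *seg = &from->segs[i];
        ssize_t n;

        if (seg->len == 0)
            continue;
        n = write_fn(dev, seg->base, seg->len);
        if (n < 0)
            return total > 0 ? total : n;
        total += n;
        if ((size_t)n < seg->len)
            break;
    }
    return total;
}

/* A toy device that appends to a buffer and counts how often it is called. */
struct demo_dev { char data[64]; size_t len; int calls; };

static ssize_t demo_dev_write(void *dev, const char *buf, size_t len)
{
    struct demo_dev *d = dev;
    size_t room = sizeof(d->data) - d->len;

    if (len > room)
        len = room;
    memcpy(d->data + d->len, buf, len);
    d->len += len;
    d->calls++;
    return (ssize_t)len;
}
```

Passing a two-segment iterator to this helper makes the toy device see two separate write calls. For a device that treats each call as a distinct record, that is observably different from a single call with the concatenated data, so whether a conversion splits or combines segments can matter.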

The final patch in the series removes read() and write() with a surprising lack of ceremony, given that they have been there for 32 years.

There have not been a lot of comments on the series; perhaps many developers are still waiting for the whole thing to download into their inboxes. Al Viro noted that some of the conversions might need to be done a bit more carefully. But nobody has objected to the overall concept, thus far.

For a series like this to be accepted, it will need to be split into more manageable chunks — which Axboe acknowledged at the outset. This set of changes does simplify the kernel, though, and it removes a fair amount of old code, so chances are that it will happen in some form, sooner or later. At that point, there will likely be a lot of out-of-tree modules that will need to be updated before they can be built on newer kernels. The good news is that developers can make those changes now and get ahead of the game.

The file_operations structure gets smaller

Posted May 4, 2024 12:38 UTC (Sat) by mchehab (subscriber, #41156) [Link]

> There have not been a lot of comments on the series; perhaps many developers are still waiting for the whole thing to download into their inboxes.

This patch series was sent only to LKML (and without a cover letter), meaning that neither driver maintainers nor driver mailing lists were notified. That's the case, for instance, for media, where patches sent to linux-media at vger are monitored via patchwork.linuxtv.org.

That likely explains why this series didn't receive many comments.

The file_operations structure gets smaller

Posted May 4, 2024 15:21 UTC (Sat) by intelfx (subscriber, #130118) [Link]

> and without a cover letter

Correct me if I'm wrong, but isn't this (https://lore.kernel.org/all/20240411153126.16201-1-axboe@...) the cover letter?

The file_operations structure gets smaller

Posted May 4, 2024 19:08 UTC (Sat) by axboe (subscriber, #904) [Link]

It was only sent with a limited scope since a) the full series only makes sense to see the full picture, actual patches will be sent separately, and b) it's a simplistic RFC where even things like commit messages aren't fully done yet. But having the full series out means I can reference that for smaller postings.

I haven't had time to follow up on this series yet, but it has been updated to each -rc as it gets released and issues sorted out with it.

The file_operations structure gets smaller

Posted May 11, 2024 19:21 UTC (Sat) by andy_shev (subscriber, #75870) [Link]

There is a handful of files touched by that patch bomb that I'm interested in reviewing, but the series neglected MAINTAINERS records. And there are questionable changes as well (like bloating up fs.h with yet another header and hence adding to dependency hell). Thanks to LWN's coverage for making this activity known...

The file_operations structure gets smaller

Posted May 4, 2024 18:21 UTC (Sat) by iabervon (subscriber, #722) [Link]

Al Viro's comment seems like a good topic for an article; there are files for which two separate write operations in a row have a different effect from a single write with the combined buffer, and there's something userspace can do that would come out differently with this series. I couldn't tell whether it was a single write getting split or multiple writes getting combined, but there's an interesting subtlety there.

The file_operations structure gets smaller

Posted May 4, 2024 18:45 UTC (Sat) by pbonzini (subscriber, #60935) [Link]

It's a single writev getting split into multiple calls to the .write member of the file_operations struct.

The file_operations structure gets smaller

Posted May 4, 2024 20:08 UTC (Sat) by iabervon (subscriber, #722) [Link]

To be specific, the original code for drivers without .write_iter would call .write multiple times for writev (in do_loop_readv_writev), but some of the patches (e.g. #39, which Viro points out) provide a .write_iter that combines them. I couldn't find any documentation that writev is supposed to be a batch of multiple operations (without any interleaved other operations from other tasks) rather than one operation with a pasted buffer, if it matters.

The file_operations structure gets smaller

Posted May 5, 2024 3:20 UTC (Sun) by dskoll (subscriber, #1630) [Link]

This could be unrelated, but the N_HDLC line discipline for TTYs used to ensure that write boundaries were preserved. If you did a write followed by another write followed by a read, the read would only get what the first write wrote, and you need to do a second read to get what the second write wrote. In other words, it preserved frame boundaries.

This was broken sometime between 4.19 and 5.11 as I reported.

The file_operations structure gets smaller

Posted May 10, 2024 17:43 UTC (Fri) by andy_shev (subscriber, #75870) [Link]

v4.19..v5.11 is quite a big range to say anything about that. You need to bisect.

The file_operations structure gets smaller

Posted May 6, 2024 8:11 UTC (Mon) by Kamiccolo (subscriber, #95159) [Link]

Woaaa, I very much enjoy these kinds of walks through the history and progress of development. Thanks a lot!


Copyright © 2024, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds