
LWN.net Weekly Edition for December 1, 2022

Welcome to the LWN.net Weekly Edition for December 1, 2022

This edition contains the following feature content:

  • Python and hashing None: a small convenience proposal raises larger questions about how new ideas are handled in the Python community.
  • Rust in the 6.2 kernel: the next round of Rust infrastructure patches.
  • Averting excessive oopses: why the kernel may soon panic after oopsing too many times.
  • Yet another try at the BPF program allocator: a new API for allocating executable memory in the kernel.
  • Microblogging with ActivityPub: a look at Mastodon and the other servers that make up the Fediverse.

This week's edition also includes these inner pages:

  • Brief items: Brief news items from throughout the community.
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Please enjoy this week's edition, and, as always, thank you for supporting LWN.net.

Comments (none posted)

Python and hashing None

By Jake Edge
November 30, 2022

The recent discussion of a proposed change to the Python language—the usual fare on the language's Ideas forum—was interesting, somewhat less for the actual feature under discussion than for the other issues raised. The change itself is a minor convenience feature that would provide a reproducible iteration order for certain kinds of sets between separate invocations of the interpreter. That is a pretty limited use case, and one that could perhaps be fulfilled in other ways, but the discussion also highlighted some potentially worrying trends in the way that feature ideas are handled in the Python community.

An idea

On November 16, Yoni Lavi posted his idea (in a post that is "temporarily hidden", for unclear reasons, but is still viewable) that the Python None singleton should consistently hash to the same value—even on separate runs of the interpreter. His goal is to be able to reproduce and verify successive runs of a complex application. Currently, those values are often, but not always, different:

    $ python -c "print(hash(None))"
    5928233778750
    $ python -c "print(hash(None))"
    5880283780414
For Python binaries built with address-space layout randomization (ASLR), as above, the address of the None object changes on each interpreter run. Without ASLR, a given CPython binary will repeatedly produce the same value. For CPython, the "object ID" (i.e. id()) of an object, in the general case, is its address, which is used to generate the hash() value. Those hash values are used for speeding up the lookup of dictionary entries, but they have some other effects as well.
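
The ASLR connection is easy to verify with the setarch utility, whose -R option disables address-space randomization for the command it runs:

    $ setarch -R python -c "print(hash(None))"
    $ setarch -R python -c "print(hash(None))"

Run this way, both invocations print the same value (whatever that value happens to be for the build in question).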

In particular, as Lavi described, the hash value determines the order of set iteration, so reproducing program behavior from run to run is not possible when None can be a member of the set. Beyond that, other types (e.g. tuple, collections.namedtuple, and frozen dataclasses.dataclass) that might be stored in a set can get "infected" by the non-reproducibility if they contain None fields. For example:

    $ python -c "print({ (1, None), (2, 3) })"
    {(2, 3), (1, None)}
    $ python -c "print({ (1, None), (2, 3) })"
    {(1, None), (2, 3)}
Because one of the tuples contains None, the order in the set can change between runs, but the same is not true if there are only simple values in the tuples. So it is not directly a problem with hash(None) that Lavi is having; the problem lies in exactly reproducing behavior (and, presumably, output) for program validation and debugging purposes when these somewhat complicated data structures are in use.

The hashes for some simple objects are reproducible, however. Integers hash to their own values, True and False hash to 1 and 0 respectively, floating-point values have reproducible hashes, and so on. Strings used to have reproducible hashes, but that led to a denial-of-service vulnerability via hash collisions, especially for web frameworks that process lots of untrusted strings. Strings (and bytes) now have random hash values based on a seed that gets set when the interpreter is initialized; those interested in reproducible string hashes can set PYTHONHASHSEED. But there is no equivalent for a constant hash(None).
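
Pinning the seed makes string hashes repeatable from run to run, while hash(None) still varies (the values themselves depend on the platform and Python version, so they are omitted here):

    $ PYTHONHASHSEED=0 python -c "print(hash('LWN'))"
    $ PYTHONHASHSEED=0 python -c "print(hash('LWN'))"

Both invocations print the same number; without the seed set, they would differ.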

Reaction

Steven D'Aprano cautioned that the order of elements in a set depends on more than just the hash values of its elements; it also depends on the order of operations on the set (some of which can be seen in our Python ordered set article from October). The ordering is also specific to the current CPython set implementation, which could change without warning:

Because sets are documented as being unordered, in theory the iteration order could even change from one call to another: str(a) == str(a) is not guaranteed to be true, even if the set a doesn't change from one call to the next. If it happens to be true right now, that's an accident of the implementation, not a promise made by the language.

A fixed value for hash(None) seemed "harmless enough" to Paul Moore, though he saw no reason for it to be an option; "We should either do it or not." He was a little concerned that there might be some security implication from making the change, since ASLR is done for security purposes. Lavi did not see much difference from the hash values of True and False and Petr Viktorin agreed that a constant hash(None) should not be a security problem. In fact, not leaking information about the base address being used for ASLR "might be a slight win, actually".

Lavi filed a GitHub issue and a pull request for a trivial patch that simply hardcodes the value of hash(None) to 0xbadcab1e. The issue entry describes the problem at more length:

CPython's builtin set classes, as do all other non-concurrent hash-tables, either open or closed, AFAIK, grant the user a certain stability property. Given a specific sequence of initialization and subsequent mutation (if any), and given specific inputs with certain hash values, if one were to "replay" it, the result set will be in the same observable state every time: not only have the same items (correctness), but also they would be retrieved from the set in the same order when iterated.

This property means that code that starts out with identical data, performs computations and makes decisions based on the results will behave identically between runs. For example, if based on some mathematical properties of the input, we have computed a set of N valid choices, they are given integer scores, then we pick the first choice that has maximal score. If the set guarantees the property described above, we are also guaranteed that the exact same choice will be made every time this code runs, even in case of ties. This is very helpful for reproducibility, especially in complex algorithmic code that makes a lot of combinatorial decisions of that kind.

Creating a set class that does have those properties—and maintains them between runs—is possible, of course, but there is a substantial performance penalty to doing so, Lavi said. The reproducibility he describes does not require the stricter guarantee that would come with some kind of ordered set either. Meanwhile, in a large code base using some alternate mechanism, someone could unknowingly use a regular set and reintroduce the problem all over again. While determinism between runs is a "rather niche" concern, the change he suggests is minimal and, seemingly, harmless.

But core developer Raymond Hettinger closed the issue with the explanation that it did not make sense: "The default hash for every object is its object id. There is nothing special about None in this regard." Back in the thread, Viktorin was a bit apologetic about encouraging Lavi with this idea, but noted that there are some potential downsides even to the small change suggested:

Assuming that a set with the same hashes and order of construction is relying [on] an implementation detail that might change in the future. It doesn't need to be a change in CPython itself – some extra caching or serialization might re-create the object and ruin the determinism, because set order is not important.

Making the None hash constant would lead more people down this path. That's the external cost of the change – definitely not zero.

The thread starts to go off the rails shortly thereafter, with the same complaints (and rebuttals) being raised multiple times. As is common when threads go awry, the participants are often talking past each other. There are certainly other ways to achieve what Lavi is trying to do, but they have their own costs, both in performance and in "enforcing a non-standard idiom over an entire codebase" as he put it.

From a high-level perspective, wanting determinism between invocations of the interpreter does not seem completely unreasonable, but it would rely on sets having guaranteed behavior that the Python core developers are not willing to grant—even if the hash of None was already a constant. D'Aprano summarized the discussion shortly before the thread was closed for being argumentative. He suggested turning off ASLR or using classes that implement their own __hash__() method as possible approaches, but warned that "it is still unsafe to assume that set iteration order will be predictable".
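
As a sketch of the second workaround, a program can route around hash(None) entirely with its own stand-in object; the Sentinel class below is purely illustrative and, per the warning above, it stabilizes hashes without guaranteeing iteration order:

    class Sentinel:
        """A None-like singleton with a run-to-run stable hash."""
        _instance = None

        def __new__(cls):
            if cls._instance is None:
                cls._instance = super().__new__(cls)
            return cls._instance

        def __hash__(self):
            return 0xbadcab1e  # the constant from Lavi's patch

        def __repr__(self):
            return "Sentinel"

    NOTHING = Sentinel()

Of course, as Lavi pointed out, every set in a large codebase would then have to consistently use NOTHING rather than None, which is its own enforcement problem.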

New threads

A few days after the original thread was closed, Lavi was back on November 27 with a request to reconsider the idea and reopen the pull request. It did not really add anything new to the debate, though he did attach a lengthy document further describing the problem and some of the issues surrounding it. He had been asked to try again (apparently on Discord), so he posted it to both the forum and to the largely moribund python-dev mailing list.

On python-dev, the thread ran much of the same course as the original forum post did, though there were some surprises, perhaps. Python creator Guido van Rossum said that he would be perfectly happy to see sets have the same ordering guarantees as dicts, if there are no performance degradations that result. Furthermore:

"Sets weren't meant to be deterministic" sounds like a remnant of the old philosophy, where we said the same about dicts -- until they became deterministic without slowing down, and then everybody loved it.

Meanwhile, he noted that hashing an empty tuple (hash(())) is stable over successive runs; the arguments against making hash(None) stable "feel like rationalizations for FUD". He thinks that there is a good argument for making hash(None) constant across interpreter runs.

Part of the problem in all of the threads is confusion about what Lavi is asking for. He does not require any specific ordering for sets, neither the insertion-order guarantee of dicts nor any other. The determinism he is looking for is already available in the language, even though it is not guaranteed, as long as None (or things containing it) is not used in the set. It is, in fact, hard to see how there would be any benefit to some hypothetical version of Python that did not produce the same order for set iteration if the order of operations constructing the set is the same—unless that order was being deliberately perturbed for some other reason.

Meanwhile, over in the forum thread, Moore immediately responded that he thought Lavi was wasting his own and everyone else's time. Lavi agreed and suggested the topic be deleted; that was not his last suggestion to drop the whole thing that would be ignored by others in the thread. Chris Angelico, who was sympathetic to the idea in the earlier thread, pointed out that there is something of a trend here:

Only because there's a general tendency to yell at ideas until they go away, which isn't exactly a good policy, but it's an effective way of wielding "status quo wins a stalemate" to win arguments. See the previous thread for details, or look at any of myriad other places where an idea was killed the same way.

He is referring to a blog post from Nick Coghlan entitled "Status quo wins a stalemate" that D'Aprano and others had mentioned. But, as Angelico noted, the key word is "stalemate"; "Not 'status quo wins any debate if we can just tire out everyone else'." But Moore said there was more to it than that, at least in this case; a core developer had reviewed the pull request and rejected it. Even though the change is "small and relatively [innocuous]", it was not important enough "to warrant overturning another core dev's decision" in his opinion.

Angelico further lamented the treatment of new ideas: "It's like posting an idea is treated as a gauntlet that has to be run - not of technical merit, but of pure endurance." Moore agreed that it is a problem, but said that the flip side is a problem as well:

But it's equally bad if people propose ideas with little or no justification, and no appreciation of the costs of their proposal. Making people "run the gauntlet" is a bad way of weeding out bad proposals, but we don't seem very good at finding a better way (I've seen far too many "I think it would be neat if Python did such-and-such" proposals that wasted way too much of people's time because they didn't get shut down fast enough or effectively enough).

That led to some discussion of said costs; D'Aprano posted a lengthy list, which Lavi then rebutted, but it is clear that no minds are being changed at this point. It seems unlikely that any core developer feels strongly enough to override Hettinger's rejection, but the discussion continues on. In part, that's because proposals sometimes have a different outcome after they have been discussed for "a while"—or if they periodically get resurrected until they finally succeed. Witness structural pattern matching, which had been proposed and discussed many times in different guises until it was adopted, or dict "addition", which had a similar path. As Angelico put it:

It's been proven that tenacity in the face of opposition is crucial to a proposal's success. Proven time and again with successful proposals, not just failed ones. Are you surprised, then, when one rejection doesn't kill a proposal instantly?

That's exactly my point. It's a gauntlet, not a technical discussion.

It is perhaps inconvenient for Lavi (and others), but the hash(None) "problem" is apparently not going away. It seems like there should be a straightforward "fix" or workaround for it, but wholesale changes, and subsequent enforcement of them, seem like the only way forward—at least if always running without ASLR is not an option. That is all rather unsatisfying, in truth. And the gauntlet remains for the next ideas that the community comes up with.

Comments (34 posted)

Rust in the 6.2 kernel

By Jonathan Corbet
November 17, 2022
The merge window for the 6.1 release brought in basic support for writing kernel code in Rust — with an emphasis on "basic". It is possible to create a "hello world" module for 6.1, but not much can be done beyond that. There is, however, a lot more Rust code for the kernel out there; it's just waiting for its turn to be reviewed and merged into the mainline. Miguel Ojeda has now posted the next round of Rust patches, adding to the support infrastructure in the kernel.

This 28-part patch series is focused on low-level support code, still without much in the way of abstractions for dealing with the rest of the kernel. There will be no shiny new drivers built on this base alone. But it does show another step toward the creation of a workable environment for the development of Rust code in the Linux kernel.

As an example of how stripped-down the initial Rust support is, consider that the kernel has eight different logging levels, from "debug" through "emergency". There is a macro defined for each level to make printing simple; screaming about an imminent crash can be done with pr_emerg(), for example. The Rust code in 6.1 defines equivalent macros, but only two of them: pr_info!() and pr_emerg!(); the macros for the other log levels were left out. The first order of business for 6.2 appears to be to fill in the rest of the set, from pr_debug!() at one end through pr_alert!() at the other. There is also pr_cont!() for messages that are pieced together from multiple calls. This sample kernel module shows all of the print macros in action.
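
Taken together, the calls look something like this fragment (adapted freely from the idea of the sample module; the messages are illustrative):

    pr_debug!("lowest severity\n");
    pr_info!("informational\n");
    pr_notice!("normal but significant\n");
    pr_warn!("warning\n");
    pr_err!("error\n");
    pr_crit!("critical\n");
    pr_alert!("action must be taken immediately\n");
    pr_emerg!("system is unusable\n");
    pr_cont!(" ...appended to the previous message\n");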

A rather more complex macro added in this series is #[vtable]. The kernel makes extensive use of structures full of pointers to functions; these structures are at the core of the kernel's object model. A classic example is struct file_operations, which is used to provide implementations of the many things that can be done with an open file. The functions found therein vary from relatively obvious operations like read() and write() through to more esoteric functionality like setlease() or remap_file_range(). Anything in the kernel that can be represented as an open file provides one of these structures to implement the operations on that file.

Operations structures like file_operations thus look a lot like Rust traits, and they can indeed be modeled as traits in Rust code. But the kernel allows any given implementation to leave out any functions that are not relevant; a remap_file_range() operation will make no sense in most device drivers, for example. In the kernel's C code, missing operations are represented by a null pointer; code that calls those operations will detect the null pointer and execute a default action instead. Null pointers, though, are the sort of thing that the Rust world goes out of its way to avoid, so representing an operations structure in Rust requires some extra work.

The #[vtable] macro is intended to perform the necessary impedance matching between C operations structures and Rust traits. Both the declaration of a trait and of any implementations will use this macro, so a trait definition will look like:

    #[vtable]
    pub trait Operations {
        /// Corresponds to the `open` function pointer in `struct file_operations`.
        fn open(context: &Self::OpenData, file: &File) -> Result<Self::Data>;
        // ...
    }

An implementation for a specific device looks like:

    #[vtable]
    impl kernel::file::Operations for some_driver {
        fn open(_data: &(), _file: &File) -> Result {
            Ok(())
        }
        // ...
    }

If this implementation is to be passed into the rest of the kernel, it must be turned into the proper C structure. Rust can create the structure, but it is hard-put to detect which operations have been implemented and which should, instead, be represented by a null pointer. The #[vtable] macro helps by generating a special constant member for each defined function; in the above example, the some_driver type would have a constant HAS_OPEN member set to true. The code that generates the C operations structure can query those constants (at compile time) and insert null pointers for missing operations; the details of how that works can be seen in this patch.
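
The effect, in a simplified sketch (this is not the macro's literal expansion), is as if the code had been written as:

    pub trait Operations {
        // Generated default: "not implemented" unless the impl overrides it.
        const HAS_OPEN: bool = false;
        // ...
    }

    impl kernel::file::Operations for some_driver {
        // Generated because this impl defines open().
        const HAS_OPEN: bool = true;
        // ...
    }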

The submission for 6.2 adds #[vtable] but does not include any uses of it. The curious can see it in use by looking at this large patch posted in August; searching for #[vtable] and HAS_ will turn up the places where this infrastructure is used.

Yet another new macro is declare_err!(), which can be used to declare the various error-code constants like EPERM. The 6.2 kernel will likely include a full set of error codes declared with this macro rather than the minimal set included in 6.1. There is also a mechanism to translate many internal Rust errors into Linux error codes.

The Rust Vec type implements an array that will grow as needed to hold whatever is put into it. Growing, of course, involves memory allocation, which can fail in the kernel. In 6.2, Vec as implemented in the kernel will likely have two methods called try_with_capacity() and try_with_capacity_in(). They act like the standard with_capacity() and with_capacity_in() Vec methods in that they preallocate memory for data to be stored later, but with the difference that they can return a failure code. The try_ variants will allow kernel code to attempt to allocate Vecs of the needed size and do the right thing if the allocation fails, rather than just calling panic() like the standard versions do.
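
A caller would look something like this sketch (the exact error type and signatures may differ in the version that is finally merged):

    // Fails gracefully, with an error, if the kernel cannot allocate:
    let mut buf: Vec<u8> = Vec::try_with_capacity(4096)?;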

One of the more confusing aspects of Rust for a neophyte like your editor is the existence of two string types: str and String; the former represents a borrowed reference to a string stored elsewhere, while the latter actually owns the string. The kernel's Rust support will define two variants of those, called CStr and CString, which serve the same function for C strings. Specifically, they deal with a string that is represented as an array of bytes and terminated with a NUL character. Rust code that passes strings into the rest of the kernel will need to use these types.
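
The Rust-for-Linux tree also carries a c_str! convenience macro that produces a &'static CStr from a literal, rejecting interior NUL bytes at compile time; a typical use might look like:

    use kernel::c_str;
    let name = c_str!("rust_device");  // a NUL-terminated &'static CStr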

The series ends with a grab-bag of components that will be useful for future additions. The dbg!() macro makes certain types of debugging easier. There is code for compile-time assertions and to force build errors. The Either type can hold an object that can be either one of two distinct types. Finally, the Opaque type is for structures used by the kernel that are never accessed by Rust code. Using this type can improve performance by removing the need to zero-initialize the memory holding it before calling the initialization function.

As can be seen, these patches are slowly building the in-kernel Rust code up so that real functionality can be implemented in Rust, but this process has some ground to cover yet. It's not clear whether more Rust code will be proposed for 6.2, or whether this is the full set. The pace of change may seem slow to developers who would like to start doing real work in Rust, but it does have the advantage of moving in steps that can be understood — and reviewed — by the kernel community. The Rust-for-Linux work has been underway for a few years already; getting up to full functionality may well take a while longer yet.

Comments (56 posted)

Averting excessive oopses

By Jonathan Corbet
November 18, 2022
Even a single kernel oops is never a good thing; it is an indication that something has gone badly wrong in the system somewhere and a straightforward recovery is not possible. But it seems that oopsing a large number of times has the potential to be even worse. To head off problems that might result from repeated oopsing, there is currently work afoot to put an upper limit on the number of times that the kernel can be allowed to oops before just giving up and rebooting.

An oops in the kernel is the equivalent of a crash in user space. It can come about for a number of reasons, including dereferencing a stray pointer, hardware problems, or a bug detected by checks within the kernel code itself. The normal response to an oops is to output a bunch of diagnostic information to the system log and kill the process that was running when the problem occurred.

The system as a whole, however, will continue on after an oops if at all possible. Killing the system would deprive the users of the ability to save any outstanding work and can also make problems much harder to debug than they would otherwise be. So the kernel will do its best to continue executing even when something has clearly gone badly wrong. An immediate result of that design decision is that any given system can oops more than once. Indeed, for some types of problems, multiple oopses are common and may continue until somebody gets fed up and reboots the system.

Jann Horn recently started to wonder whether perhaps the kernel should just give up and go into a panic (which will cause a reboot) if it oopses too many times. This could be a wise course of action in general; a kernel that is oopsing frequently is clearly not in a good condition and allowing it to continue could lead to problems like data corruption. But Horn had another concern: oopsing a system enough times might be a way to exploit security problems.

An oops, almost by definition, will leave an operation halfway completed; there is usually no way to clean up everything that might need cleaning when something has gone wrong in an unexpected place. So an oops might cause locks to be left in a held state or might lead to the failure to decrement counters that have been incremented. Counters are a particular concern; if an oops causes a counter to not be properly decremented, oopsing repeatedly might well become a way to overflow that counter, creating an exploitable situation.

To thwart attacks of this type, Horn wrote a patch putting an upper limit on the number of times the system can oops before it simply calls panic() and reboots. The limit was set to 10,000, but can be changed with the oops_limit command-line parameter.
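
A minimal sketch of the mechanism (the helper name here is illustrative, not the patch's; see the patch itself for the real code) might look like:

    static atomic_t oops_count = ATOMIC_INIT(0);

    static void count_oops(void)
    {
        /* Once the configured limit is hit, give up and reboot. */
        if (atomic_inc_return(&oops_count) >= READ_ONCE(oops_limit))
            panic("Oopsed too often (oops_limit is %d)", oops_limit);
    }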

One might well wonder whether oopsing the kernel repeatedly is a realistic way to exploit a kernel vulnerability. A kernel oops takes a bit of time, depending on a number of factors including the amount of data to be logged and the speed of the console device(s). The development community has put a vast amount of effort into optimizing many parts of the kernel, but speeding up oops handling has somehow never been a priority. To determine how long handling an oops takes, Horn wrote a special sort of benchmark:

In a quick single-threaded benchmark, it looks like oopsing in a vfork() child with a very short stack trace only takes ~510 microseconds per run when a graphical console is active; but switching to a text console that oopses are printed to slows it down around 87x, to ~45 milliseconds per run.

Based on that, he concluded that it would take between eight and 12 days of constant oopsing, in the best of conditions, to overflow a 32-bit counter that was incremented once for each oops. So it is not the quickest path toward a system exploit; it is also not the most discreet: "this is a *very* noisy and violent approach to exploiting the kernel". While there are almost certainly systems out there that can oops continuously for over a week without anybody noticing, they are probably relatively rare.

Even so, nobody seemed opposed to putting an upper limit on the number of oopses any given kernel can be expected to endure. Nobody even really felt the need to argue over the 10,000 number, though Linus Torvalds did note in passing that he would have preferred a higher limit. Alexander "Solar Designer" Peslyak suggested that, rather than creating a separate command-line parameter, Horn could just turn the existing panic_on_oops boolean parameter into an integer and use that. That idea didn't get too far, though, due to the number of places in the kernel that check that parameter and use it to modify their behavior now.

A few days later, Kees Cook posted an updated version of the patch (since revised), turning it into a six-part series. The behavior implemented by Horn remained unchanged, but there have been some additions, starting with a separate count to panic the system if the kernel emits too many warnings. Cook also concluded that, since the kernel was now tracking the number of oopses and warnings, that information could be provided to user space via sysfs, where it might be useful to monitoring systems.

No opposition to this change appears to be in the offing, so chances are that this patch set will find its way into the 6.2 kernel in something close to its current form. Thereafter, no kernel will be forced to put up with the indignity of constant oopsing for too long and, perhaps, some clever exploit might even be fended off. "Don't panic" might be good advice for galactic hitchhikers, but sometimes it is the right thing for the kernel to do.

Comments (24 posted)

Yet another try at the BPF program allocator

By Jonathan Corbet
November 28, 2022
The BPF subsystem, which allows code to be loaded into the kernel from user space and safely executed in the kernel context, is bound to create a number of challenges for the kernel as a whole. One might not think that allocating memory for BPF programs would be high on the list of problems, but life (and memory management) can be surprising. The attempts to do a better job of providing space for compiled BPF code have, to date, only been partially successful; now Song Liu is back with a new approach to finish the job.

Small, transient, and numerous

The problem with BPF programs is that they tend to be small, are often transient, and can be numerous. That, alone, would not be hard for the kernel to deal with; the slab allocators are highly tuned for the efficient allocation and freeing of small objects. But BPF programs, being executable code, must be stored in memory that allows execute access, and that complicates the picture.

Any memory that is both executable and writable presents an attractive target for attackers, so the kernel goes well out of its way to prevent that combination from occurring; some architectures prohibit it entirely. The kernel's own code is loaded at boot time, made read-only, and (usually) never changed again. Loadable modules, which require the addition of kernel code at run time, complicate things a bit, but modules are relatively large and tend to be stable. The kernel will load the modules it needs shortly after boot, and the set of loaded modules will rarely change thereafter. As a result, even if the handling of loadable modules is not optimal, things normally work well enough anyway.

As noted above, though, BPF programs can come and go frequently, and there can be a lot of small programs in the system. All of this would be fine in the absence of the prohibition on memory that is both writable and executable; that restriction requires that memory holding BPF programs, which are executable, be made non-writable. That, in turn, requires changing the permissions in the "direct map", the range of kernel address space that (on 64-bit systems) maps all of the system's physical memory. Even if direct-map addresses are not used to access BPF memory (as would happen if the vmalloc() family of allocators is used to obtain it), the existence of a writable direct mapping to executable code would create a potential vulnerability.

The kernel's direct map uses huge-page mappings (of 1GB size when possible). Huge-page mappings reduce the pressure on the system's translation lookaside buffer (TLB) and improve the performance of the system overall. If a portion of the direct map must be made read-only, though, then the huge page that contains it must be split into smaller pages, fragmenting the direct map with a measurable impact on performance. Doing that once or twice might not be a big problem but, in a system where BPF programs come and go frequently, the impact on the direct map can be severe.

The smallness of BPF programs also turns out to be a problem. In older kernels, each BPF program was loaded into its own (4KB) page, meaning that, often, most of the page was wasted. If many of these programs are loaded, that wasted memory starts to add up.

In February, Liu set out to solve these problems. The "bpf_prog_pack" allocator worked by allocating 2MB huge pages from the kernel, then handing out portions of those pages for BPF programs as they are loaded. The concentration of multiple BPF programs into huge pages addressed both problems: it minimized fragmentation of the direct map and reduced memory waste by packing BPF programs together in the same page. This allocator looked like a good solution and was quickly pulled into the mainline during the 5.18 merge window.

Unfortunately, a number of problems quickly surfaced, and much of the bpf_prog_pack functionality was backed out despite the fact that the source of some of the trouble was to be found in the memory-management subsystem. The allocator is still present in the kernel, but it uses 4KB "base" pages, so it does not help performance as much as it could.

Trying again

Liu's new proposal replaces bpf_prog_pack with a new allocator that addresses the complaints about the previous version and, once again, uses huge pages to hold BPF programs. That leads to improved performance:

Based on our experiments, we measured 0.5% performance improvement from bpf_prog_pack. This patchset further boosts the improvement to 0.7%. The difference is because bpf_prog_pack uses 512x 4kB pages instead of 1x 2MB page.

The use of 2MB pages is now possible as the result of fixing the related problems in the memory-management subsystem. This new allocator goes beyond the use of huge pages, though, and creates a new API for the management of transient, executable code in the kernel:

    void *execmem_alloc(unsigned long size, unsigned long align);
    void *execmem_fill(void *dst, void *src, size_t len);
    void execmem_free(void *addr);

Any kernel subsystem that needs to set up a segment of executable code can allocate the memory with execmem_alloc(). The memory that is returned will have read-only protection, so the caller cannot copy the code into it directly. Instead, execmem_fill() must be called to populate this memory with the executable text. On the x86 architecture (the only one that supports this mechanism now), the "text_poke" machinery will be used to safely copy the code while dodging the many race conditions that can present themselves when code is being modified. If a range of executable memory is no longer needed, it can be returned with execmem_free().
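
A caller would follow the allocate/fill/free pattern; in a sketch (error handling abbreviated, and prog is a hypothetical descriptor for the code being loaded):

    void *image = execmem_alloc(prog->size, PAGE_SIZE);

    if (!image)
        return -ENOMEM;
    /* The mapping is read-only; execmem_fill() does the actual writing: */
    execmem_fill(image, prog->insns, prog->size);
    /* ... hand image off to be executed; eventually: */
    execmem_free(image);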

The advantage of this new API is that it is not limited to just BPF programs; it can also potentially be used in other places where code is loaded into the kernel — specifically for loadable modules. That would improve the efficiency of those allocations while simultaneously reducing the number of code-loading implementations in the kernel. That seems like a significant benefit, but there is just one little problem: the module loader has not been changed to actually use this API, so there is no proof that it will work in that context.

Indeed, it almost certainly will not work for module loading yet, simply because there is no support for any architectures other than x86. Loading code into a running kernel is a tricky business, and the details of how it can be done safely vary widely from one architecture to the next. A number of architectures now implement at least parts of the text_poke API, which simplifies the task, but text_poke is not universal; arm64 does not support it, for example. Architectures also have differing requirements around the placement of data areas for modules; it may not work to put a module's BSS memory far away from its text, for example. All of this adds up to a number of potential headaches for anybody trying to actually use the new API for module loading.

Reviewers of this work would, understandably, like some assurance that the new API can work beyond BPF before accepting it; Mike Rapoport, for example, has asked for "at least some explanation how modules etc can use execmem_ APIs without breaking !x86 architectures". Rick Edgecombe responded with an assertion that other architectures could be supported with minor changes to the API, but questioned whether it is truly necessary to solve the whole problem at this point.

Luis Chamberlain has also expressed frustration at the lack of solid (and reproducible) data showing how this work improves system performance. He clearly sees some advantages overall, though, since one of his complaints is that the patch changelogs do not sufficiently highlight "the gains of this effort helping with the long term advantage of centralizing the semantics for permissions on memory". Liu has responded with a bit more data on TLB-miss improvement.

The benefits of the work seem clear, should it manage to not run into surprises like its predecessor. The biggest question with regard to merging would seem to be just how much work will be required to convince reviewers that this API can handle the module case. If a complete solution is required, the new BPF program allocator seems unlikely to land anytime soon. Since there are no user-space API issues to resolve, though, it should be possible to proceed with the BPF solution once reviewers are convinced that it does not actively lead in the wrong direction.

Comments (2 posted)

Microblogging with ActivityPub

November 29, 2022

This article was contributed by Jordan Webb

As of late, concerns about the future of Twitter have caused many of its users to seek alternatives. Amid this upheaval, an open-source microblogging service called Mastodon has received a great deal of attention. Mastodon is not reliant on any single company or central authority to run its servers; anyone can run their own. Servers communicate with each other, allowing people on different servers to send each other messages and follow each other's posts. Mastodon doesn't just talk to itself, though; it can exchange messages with anything that speaks the ActivityPub protocol. There are many such implementations, so someone who wants to deploy their own microblogging service enjoys a variety of choices.

ActivityPub is a W3C Recommendation that describes how servers can exchange messages in the form of JSON Activity Streams. The ActivityPub protocol is highly flexible and is used for all sorts of things, but the scope of this article is limited to software that uses it to provide a Twitter-like microblogging service. The idea of federated microblogging was pioneered by a project called StatusNet, which was the software that originally ran at identi.ca, an early Twitter alternative. Instances of StatusNet, and eventually other software, were federated with each other using a protocol called OStatus.

ActivityPub and OStatus are not interoperable, but they have some parentage in common; Evan Prodromou, the creator of StatusNet, subsequently created pump.io and changed identi.ca to use it. Though development on pump.io seems to have stalled, its API formed the basis of ActivityPub. Some projects continue to support both protocols; Mastodon removed support for OStatus in 2019.

All of the various servers running this software exist in a federation somewhat like email, colloquially known as "The Fediverse"; usernames are similar to Twitter handles, but also include a domain component that identifies the server hosting the account. For example, LWN has an account at @LWN@fosstodon.org; it is hosted on Fosstodon, which is a Mastodon server for people interested in free and open-source software.

Mastodon


Mastodon is a Ruby on Rails application; its source is available on GitHub and it is released under the AGPL 3.0. A working installation of Mastodon has a number of moving parts. Two different instances of the web application are needed: one to handle normal requests and a separate one for the streaming API that serves real-time updates to clients. It also needs a PostgreSQL database and a Redis server for caching. At least one instance of the Sidekiq job scheduler is needed to handle background jobs; busy servers may need to configure several of them in order to remain responsive under load. Elasticsearch is also needed to fully enable Mastodon's search capabilities, although it can be run without Elasticsearch at the cost of degraded functionality.

Running all of this requires significant resources; the Raspberry Pi Foundation hosts its own instance on a Pi 4, but it only plays host to a single account. Someone who intends to have more than a handful of users on their server would be wise to budget for something more powerful, as system requirements scale up relative to the number of incoming and outgoing messages that a server needs to process.

Mastodon's web interface appears to have taken most of its design cues from Twitter. It also offers a multi-column "advanced" interface, similar to TweetDeck. Mastodon has an official mobile application but it is oddly incomplete; in particular, it does not offer access to the federated timeline where posts from other servers may be viewed. Alternative clients are available to fill the gap; Metatext, tooot, and Tusky are more fully-featured alternatives. Note that while the ActivityPub standard specifies both client-to-server and server-to-server communications, Mastodon only implements the server-to-server side. Mastodon clients are not generic ActivityPub clients; they communicate with the server using a Mastodon-specific API.

Though it has received a lot of attention lately, Mastodon is not a new project; development began in 2016 and the first stable release was in 2017. Mastodon's development is sponsored by a number of companies along with donations via Patreon, where it currently receives about $31,000 per month.

Mastodon's creator, Eugen Rochko, runs the project on a "benevolent dictator" model and a number of forks have sprung up to add support for features that upstream is not interested in merging. The first feature that many Mastodon forks add is a way to increase the maximum length of posts; upstream Mastodon limits posts to 500 characters and offers no way to configure a higher limit, though the ActivityPub protocol imposes no such limitation and Mastodon has no problem displaying longer posts from other servers. Glitch Edition is a fork that tracks Mastodon's main branch closely and includes a completely overhauled user interface. Hometown is a more conservative fork, based on stable releases of Mastodon and focused on adding noninvasive quality-of-life improvements.

Pleroma


Pleroma is written in Elixir and its source is available on the project's own GitLab server; most of it is licensed under AGPL 3.0, but some assets and documentation are covered by Creative Commons licenses instead. Pleroma is known for being more lightweight than Mastodon, and it is significantly less complex from an operational perspective; the only additional service that it needs is a PostgreSQL database. Despite this, it also includes some features that Mastodon lacks; while Mastodon only permits marking a post as a "favorite", Pleroma extends ActivityPub's "Like Activity" to allow users to react to posts with an emoji, similar to Facebook and LinkedIn. It also includes a realtime chat feature. Unlike Mastodon, Pleroma allows the maximum post length to be changed and it defaults to a generous 5000 characters.

Pleroma's default web interface is much less obviously inspired by Twitter than Mastodon's, with a full-window background image and semi-transparent UI elements. It has no official mobile app of its own, but Pleroma implements much of the Mastodon API, so many clients intended for use with Mastodon will work with it. Pleroma's documentation includes a list of working clients.

Perhaps in part due to its ease of deployment, Pleroma has acquired a bad reputation in many communities. This is not for any technical reason, but because it frequently seems to be the software of choice for people aiming to abuse and harass people on other servers over their race, gender, sexual identity, or other characteristic. Many unmoderated "free speech maximalist" servers also run Pleroma, and some of its developers are active in those communities; these types of servers often come into conflict with (and are subsequently blocked by) servers with moderation policies that prohibit abusive content. LWN readers who do not wish to be exposed to such content are advised to steer clear of the "Featured Instances" on Pleroma's website.

Pleroma's commit history dates back to 2017; version 1.0.0 was tagged in 2019. Unlike Mastodon, Pleroma does not solicit donations to fund development, and claims no corporate sponsors. Pleroma has spawned at least one fork; Akkoma was inspired at least in part by differing views about content moderation. Akkoma adds a number of features to Pleroma, like support for search, integration with translation services, and improved domain blocking. Its developers have also added a code of conduct.

Misskey


Misskey is written in TypeScript; its source is available on GitHub and, like Mastodon and Pleroma, uses the AGPL 3.0 license. A deployment of Misskey weighs in somewhere between Mastodon and Pleroma; like Mastodon, it needs PostgreSQL and Redis, and can optionally make use of Elasticsearch, but it does not need separate processes for streaming APIs or background-job processing. Like Pleroma, Misskey allows administrators to configure the maximum post length (3000 is the default), and throws in a few more features that have no equivalent in Mastodon. Misskey originated custom emoji reactions; Pleroma's implementation was built to be compatible with it. Misskey also offers a cloud-storage feature called "Drive" that can be used to share files and images.

Misskey's core developers are from Japan, and it is most popular among Japanese-speaking communities. English translations are available for the user interface and documentation, though. Its web interface is more Twitter-like than Pleroma's, but more visually distinct from Twitter than Mastodon's. Misskey has no official mobile client, but unlike Pleroma, Mastodon clients can't be used with it because it does not implement the Mastodon API. At the time of this writing, the list of clients was missing from the English documentation, but one is available in Japanese.

Development of Misskey began in 2014, but the commit history of the current version only goes back to 2016; version 1.0.0 was tagged in 2018. It was initially built as a standalone bulletin board; support for ActivityPub was added later. Misskey accepts donations on Patreon and currently receives a bit more than $700 a month; it also apparently accepts corporate sponsors, although only one is listed. Misskey also has its share of forks; Calckey adds a long list of user-interface improvements, while FoundKey is focused on fixing bugs and making the code more maintainable.

Others

Mastodon, Pleroma, and Misskey are currently the three most popular choices for ActivityPub microblogging servers, but there are a number of other options. The AGPL 3.0 license seems to be unusually popular in this space; it is used by all of the projects listed below, unless otherwise mentioned.

  • GNU social is a PHP microblogging server with an interesting history. The initial version of GNU social was originally spun out of the code that powers Libre.fm, but current versions are direct descendants of the original StatusNet codebase, which has been extended to support ActivityPub. Development of GNU social is split into two branches, in two different places; version 2 appears to be in maintenance mode on NotABug.org, while version 3 is receiving frequent commits on Codeberg.
  • GoToSocial is a lightweight ActivityPub server written in Go. It is still in alpha, but rapidly maturing; source is available on GitHub. While Mastodon, Pleroma, and Misskey require PostgreSQL, GoToSocial also supports SQLite as an option for people who don't want to run a separate database server.
  • Takahē is another alpha-stage project, which just made its first numbered release. It is written in Python and stores its data in a PostgreSQL database. Takahē's distinguishing feature is its support for virtual hosting; other ActivityPub servers only support a single domain per instance, but Takahē can support multiple domains on a single instance. Source is available on GitHub under the three-clause BSD license.
  • Socialhome has an unusual user interface which owes more to Pinterest than to Twitter. It is written in Python and stores its data in PostgreSQL; it also needs Redis. Source is available on GitHub.
  • microblog.pub is a single-user ActivityPub server written in Python that stores its data in SQLite. In addition to ActivityPub, it supports a number of IndieWeb standards, including IndieAuth, Micropub, Webmention, and microformats. Source is available on SourceHut.
  • honk is another single-user ActivityPub server, with a minimalist aesthetic. It is written in Go and stores its data in SQLite. The license is described as "generally ISC or compatible." Source is available from the project's website, which is also a Mercurial repository.
  • Friendica is more focused on social networking than microblogging, with an interface and set of features that resemble Facebook more than Twitter. Friendica predates ActivityPub but interoperates with it, along with a number of other protocols. A collection of optional addons is available to connect Friendica to other services; a bidirectional gateway to Twitter is included. It is written in PHP, and stores its data in MySQL or MariaDB; source is available on GitHub.
  • Hubzilla is another PHP project from the original author of Friendica. While Hubzilla's website and documentation mostly highlight its support for the Zot protocol, it also interoperates with ActivityPub. Source is available on Framasoft's GitLab server. Hubzilla's authors have chosen to publish the source under the MIT license.

The year of the Mastodon?

Many media outlets are covering Mastodon as a "new alternative social network", but it is neither new, nor is it a network unto itself. Mastodon has been around since 2016 and some of the projects it interoperates with are even older. It is true that many of the ActivityPub-speaking servers on the internet are running Mastodon, and indisputable that it has been the primary beneficiary of the current wave of attention, but the collection of servers that make up the Fediverse is far from a monoculture; many thriving alternatives exist.

As a result of the uncertainty around Twitter, public servers have been inundated with new users. It remains to be seen if many of these people will stick around, but ActivityPub was a tried and proven solution with a healthy base of users long before the current spike in interest. There is no reason to think that an established standard with multiple robust implementations is going to go away any time soon, even if its current moment in the sun turns out to be fleeting.

Comments (49 posted)

Page editor: Jonathan Corbet

Inside this week's LWN.net Weekly Edition

  • Briefs: M1 GPU driver; BPF scheduler; Asahi Linux progress; Emacs 29 soon; FFmpeg guide; Gimp 3.0 soon; Quotes; ...
  • Announcements: Newsletters, conferences, security updates, patches, and more.

Copyright © 2022, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds