
Losing the magic


By Jonathan Corbet
December 5, 2022
The kernel project is now more than three decades old; over that time, a number of development practices have come and gone. Once upon a time, the use of "magic numbers" to identify kernel data structures was seen as a good way to help detect and debug problems. Over the years, though, the use of magic numbers has gone into decline; this patch set from Ahelenia Ziemiańska may be an indication that the reign of magic numbers is reaching its end.

A magic number is simply a specific constant value that is placed within a structure, typically as the first member, to identify the type of that structure. When structures are labeled in this way, in-kernel debugging code can check the magic number and raise the alarm if the expected value is not found, thus detecting problems related to type confusion or data corruption. These numbers can also be located in hexadecimal data dumps (stack contents, for example) to identify known data structures.
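
As a minimal sketch of the pattern (illustrative code only, not something taken from the kernel tree), a structure carries a known constant in a dedicated field, and code that is handed such a structure can verify the value before trusting anything else in it:

#include <linux/errno.h>
#include <linux/printk.h>

#define FOO_MAGIC	0x464f4f21	/* ASCII "FOO!" */

struct foo {
	unsigned int magic;	/* always FOO_MAGIC while the structure is live */
	int payload;
};

static int foo_process(struct foo *f)
{
	if (f->magic != FOO_MAGIC) {
		/* Clobbered, freed, or not a struct foo at all */
		pr_err("foo_process: bad magic %#x\n", f->magic);
		return -EINVAL;
	}
	/* ... safe to operate on the structure ... */
	return 0;
}

Zeroing the magic field when the structure is freed extends the same check to catch use-after-free situations as well.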

The use of magic numbers in the kernel appears to have had its origin in the filesystem code, where it was initially used to identify (and verify) the superblock in the disk image. Even the 0.10 kernel release included a test against SUPER_MAGIC (0x137f) to verify that the boot disk was, indeed, a Minix filesystem. Other filesystems came later, starting with the "first extended" (ext), which used 0x137d for its EXT_SUPER_MAGIC value in the 0.96c release in July 1992.

In the 0.99 release (December 1992), the sk_buff structure that is still used in the networking subsystem to hold packets was rather smaller than it is now, but it did gain a magic field to identify the queue a packet was expected to be in. Toward the middle of 1993, the 0.99.11 release acquired an updated kmalloc() implementation that sprinkled magic numbers around as a debugging aid. That release, incidentally, is also the one where an attempt was made to use C++ to build the kernel; that only lasted until 0.99.13, a couple of months later.

The use of magic numbers in the kernel grew slowly after that. The 1.1.13 release, in May 1994, added a file called MAGIC in the top-level directory to keep track of the various numbers in use; it listed eight such numbers. This file was, incidentally, nearly the first documentation file in the kernel beyond the basic installation information; the kernel would not gain a directory for documentation until 1.3.22 in 1995. In this new file, Ted Ts'o wrote:

It is a *very* good idea to protect kernel data structures with magic numbers. This allows you to check at run time whether (a) a structure has been clobbered, or (b) you've passed the wrong structure to a routine. This last is especially useful --- particlarly when you are passing pointers to structures via a void * pointer. The tty code, for example, does this frequently to pass driver-specific and line discipline-specific structures back and forth.

The file went on to ask developers to follow this practice for future additions to the kernel. (For the curious: the typo of "particularly" lasted until 1.1.42 in August 1994; otherwise that text persists to this day).

The 1.3.99 release saw the movement of MAGIC to Documentation/magic-number.txt, perhaps as part of a general cleaning-up prior to the imminent 2.0 release. Some developers, at least, had clearly taken Ts'o's advice; there were, at this point, 21 entries in that file. The 2.2.0 version (January 1999) held 51 entries. Magic numbers appeared to be an established kernel-development practice.

The 2.4.0 release came almost exactly two years later. The 2.4 version of magic-number.txt was, except for one small change, identical to the 2.2 version; no new magic numbers had been added. That doesn't necessarily reflect a change in development practices so much as the eternal habit of letting documentation go out of date. Some effort went into updating the file during the 2.5 development series, and the 2.6.0 version contained an even 100 magic numbers. For the rest of the 2.6.x series, though, the only changes were small tweaks and the removal of a couple of obsolete entries; Documentation/magic-number.txt began to shrink.

In fact, no additions to that file have been made in the entire Git history. In 2016 the file was converted to the RST format and moved into the nascent development-process manual. It mostly sat idle and unnoticed until earlier this year, when the file began to shrink; the 6.1 version of Documentation/process/magic-number.rst is down to 14 entries. Where has the magic gone?

This change is the result of Ziemiańska's work, which is aimed at removing this file entirely; the current patch set describes it as "a low-value historical relic at best and misleading cruft at worst". In that series, Ziemiańska systematically deletes the final entries, often removing the associated structure fields and magic-number checks in the code, until the file is empty; the final patch in the series simply deletes it. Chances are that it will not be missed.

There is no clear point where the development community made a collective decision to move away from the magic-number practice; it just sort of faded away. There are probably a few reasons behind this change. The kernel community has, for many years now, tried to use type-safe interfaces rather than passing void pointers around, making it less likely that the wrong structure type will be passed into a function. Developers spend less time staring at hex dumps of data, preferring more structured output, tracepoints, and interactive debuggers as ways of tracking down problems. Debugging features in the kernel's memory allocators mean that many sorts of memory-corruption issues will be caught directly. Magic numbers are just not as helpful as they once were.

Magic numbers will still have their place; they still, for example, can help filesystem code be confident that it is dealing with the correct type of filesystem image. Even there, though, the use of checksums for on-disk data structures should provide better protection against many types of problems. But, for the most part, kernel development has lost some of the magic it once had; as often happens, it slipped away when nobody was paying attention.

Index entries for this article
Kernel: Development model/Patterns



Losing the magic

Posted Dec 5, 2022 16:39 UTC (Mon) by developer122 (guest, #152928) [Link]

I initially thought this was something to do with the magic file used by the "file" utility.

Losing the magic

Posted Dec 5, 2022 19:03 UTC (Mon) by stop50 (subscriber, #154894) [Link]

Me too, probably because users have more contact with the magic file than with kernel internals.

Losing the magic

Posted Dec 5, 2022 17:01 UTC (Mon) by marcH (subscriber, #57642) [Link]

> The kernel community has, for many years now, tried to use type-safe interfaces rather than passing void pointers around, making it less likely that the wrong structure type will be passed into a function.

Some "void *" statistics would be great. They can be really hard to avoid in generic code and so-called "private data"

Losing the magic

Posted Dec 6, 2022 14:15 UTC (Tue) by sima (subscriber, #160698) [Link]

Tons of places have moved to container_of and struct embedded, and away from void*. It's not really more typesafe when you try to get things intentionally wrong, but it does help a lot with silly mistakes and "oops wrong offset" screwups.

But yeah relative amount of void* might be an interesting metric to see whether the conjecture that the kernel moved towards more type safety actually holds.

Losing the magic

Posted Dec 7, 2022 17:41 UTC (Wed) by abatters (✭ supporter ✭, #6932) [Link]

As an example of the kernel moving in the unsafe direction, the kernel has lots of special printk format specifiers for specific pointer types that are not typechecked by the compiler.

https://docs.kernel.org/core-api/printk-formats.html
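
For instance (a sketch, not code from the tree; %pI4 is one of the documented extensions and expects a pointer to a four-byte IPv4 address):

#include <linux/in6.h>
#include <linux/printk.h>

static void print_peer(const struct in6_addr *addr)
{
	/*
	 * Compiles without a warning: the compiler's format checking only
	 * knows that %p wants a pointer, so handing an IPv6 address to an
	 * IPv4 specifier goes unnoticed until the output looks wrong.
	 */
	pr_info("peer address: %pI4\n", addr);
}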

Losing the magic

Posted Dec 8, 2022 10:53 UTC (Thu) by geert (subscriber, #98403) [Link]

Sounds like a good task for sparse?

Losing the magic

Posted Jan 2, 2023 18:31 UTC (Mon) by tabberson (guest, #162965) [Link]

Can you please explain what you mean by "struct embedded"?

Losing the magic

Posted Jan 3, 2023 9:56 UTC (Tue) by geert (subscriber, #98403) [Link]

struct foo {
        ...
        struct bar embedded;
        ...
};
If you have a pointer to the "embedded" member of type "struct bar", you can convert it to a pointer to the containing "struct foo" using the container_of() macro.
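
A minimal sketch of how the two pieces fit together, reusing the structure names from the example above:

static struct foo *foo_from_bar(struct bar *p)
{
	/* Walk back from the embedded member to its containing structure. */
	return container_of(p, struct foo, embedded);
}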

Losing the magic

Posted Dec 5, 2022 20:58 UTC (Mon) by iabervon (subscriber, #722) [Link]

I think the innovation is really the macros that enforce type safety before discarding the type information. There are still plenty of structures that store a function with an argument and a pointer that has to match, where the structure can't validate them (and different instances of the structures don't match each other), but the macros for creating the structures ensure that, still having the type information, you could call the function with the argument without getting a type error. I don't think it was common knowledge until that time frame that it was possible to tell a C compiler "generate any warnings or errors for this expression, but don't generate code to evaluate it". (Or maybe it's still not common knowledge, but the kernel has all the macros needed to do it, and they're commonly used.)
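
A rough sketch of that kind of macro (loosely modeled on the kernel's typecheck(); the name here is made up):

/*
 * The pointer comparison is type-checked by the compiler (it warns if x
 * is not of the expected type), but its result is cast to void and
 * discarded, so the expression exists only for the type checker.
 */
#define typecheck_sketch(type, x)		\
({						\
	type __dummy;				\
	typeof(x) __dummy2;			\
	(void)(&__dummy == &__dummy2);		\
	1;					\
})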

Losing the magic

Posted Dec 5, 2022 21:40 UTC (Mon) by Sesse (subscriber, #53779) [Link]

Out of curiosity, what is the trick? Using sizeof?

Losing the magic

Posted Dec 5, 2022 22:06 UTC (Mon) by iabervon (subscriber, #722) [Link]

I think the original was (0 ? (*fn)(arg) : store_callback(fn, arg)), and using enough optimization that the direct call won't be generated. Once people demonstrated the benefit and possibility of doing this sort of testing, a later innovation was the sizeof trick to make the code really explicitly execute as "discard a small integer", which will be nothing following even unoptimized code generation (since even minimal register allocation will just discard the code).
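
A hedged sketch of both forms (names invented for illustration; assume the callback and store_callback() both return void so the ternary's arms have the same type):

struct cb_slot {
	void (*fn)(void *);
	void *arg;
};

void store_callback(struct cb_slot *slot, void (*fn)(void *), void *arg);

/*
 * 1. The never-taken branch: (*fn)(arg) is fully type-checked, but any
 *    optimization level discards the dead arm of the conditional.
 */
#define STORE_CB_TERNARY(slot, fn, arg)					\
	(0 ? (*(fn))(arg)						\
	   : store_callback((slot), (void (*)(void *))(fn), (arg)))

/*
 * 2. The sizeof form: sizeof never evaluates its operand, so even an
 *    unoptimized build emits no call, yet (fn)(arg) is still checked.
 */
#define STORE_CB_SIZEOF(slot, fn, arg) do {				\
	(void)sizeof(((fn)(arg), 0));					\
	store_callback((slot), (void (*)(void *))(fn), (arg));		\
} while (0)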

Losing the magic

Posted Dec 5, 2022 21:42 UTC (Mon) by khim (subscriber, #9252) [Link]

> Or maybe it's still not common knowledge, but the kernel has all the macros needed to do it, and they're commonly used.

It would be interesting to ask random C developers about that. While the answer is kinda “obvious” (every C programmer I ever knew used sizeof at one point or another, and it's pretty obvious that if you ask for the sizeof of some object there is no need to actually evaluate the expression that produces that object; you only need to know the type of that expression), I wouldn't be surprised to see lots of developers who just don't realize they have all the tools they need from a basic C course.

Losing the magic

Posted Dec 6, 2022 5:01 UTC (Tue) by NYKevin (subscriber, #129325) [Link]

Bear in mind, many of the people using C are not coming from a compsci or software engineering background, but from an embedded systems background. For them, C is like old-school javascript: You write ~10 lines of code to make the monkey dance (except it's, like, an animatronic monkey or some such thing, wired up to an Arduino). There is no operating system and you probably don't have a full libc, or possibly any libc. While I'm sure these are very smart people, it is not necessarily reasonable to expect them to know about the intricacies of compile-time vs. runtime evaluation, because that's simply not relevant to the work they're trying to do. They may never even use sizeof, because malloc is not typical in these environments (everything is statically allocated).

Losing the magic

Posted Dec 6, 2022 13:59 UTC (Tue) by khim (subscriber, #9252) [Link]

These guys are extra hopeless. Not because they couldn't learn proper discipline, but because they don't want to learn it.

Thankfully these days even tiny microcontrollers are powerful enough to run high-level languages, even python.

There is absolutely no need to [try to] teach them C.

Losing the magic

Posted Dec 6, 2022 14:07 UTC (Tue) by andi8086 (subscriber, #153876) [Link]

That is only your personal point of view. I appreciate the simplicity and readability of C very much. Although I DO USE sizeof :D and I certainly do no monkey dances, except maybe in hardware projects... According to your definition I am SEMI-Extra-Hopeless. In environments like bare metal, you need a lot of discipline, otherwise you don't understand your own code a few years later... or everything will just crash. It seems to me you lack a lot of experience!

Losing the magic

Posted Dec 6, 2022 14:38 UTC (Tue) by pizza (subscriber, #46) [Link]

Bare-metal embedded (not to mention the actual hardware) requires a _lot_ more discipline than most other software categories.

On average, you'll find embedded and hw folks a lot more vigorous when it comes to testing/validation, as fixing bugs after things have shipped can be prohibitively expensive.

Losing the magic

Posted Dec 6, 2022 15:24 UTC (Tue) by khim (subscriber, #9252) [Link]

> Bare-metal embedded (not to mention the actual hardware) requires a _lot_ more discipline than most other software categories.

It requires entirely different discipline, that's the issue.

> On average, you'll find embedded and hw folks a lot more vigorous when it comes to testing/validation, as fixing bugs after things have shipped can be prohibitively expensive.

Yes, but more often than not they do a bazillion tests and conclude that it's enough to be confident that the thing actually works as it should.

Often they are even right: hardware is hardware, it often limits the input to your program severely (which makes things like buffer overflows impossible simply because the laws of physics protect you). And hardware rarely behaves 100% like the specs say it would, thus without testing, math models wouldn't save you.

Software is thoroughly different: an adversary may control inputs so well, and do things which are so far beyond anything you may even imagine, that all these defenses built by folks with hardware experience, and their tests, are sidestepped without much trouble.

You need math, logic and rigorous rules to make things work. It's really interesting how the attitude of Linux kernel developers has slowly shifted from a hardware mindset to a software mindset as the fuzzing guys found more and more crazy ways to break what they had thought were well-designed and tested pieces of code.

Now they are even trying to use Rust as a mitigation tool. It would be interesting to see whether it will actually work or not: the Linux kernel sits between the hardware and software worlds, which means that pure math, logic and rigorous rules are not enough to make it robust.

Losing the magic

Posted Dec 6, 2022 17:18 UTC (Tue) by pizza (subscriber, #46) [Link]

In other words, you're saying we need more rigorous/detailed specifications for software.

...And you're the one going on about folks asking for O_PONIES?

Losing the magic

Posted Dec 6, 2022 17:58 UTC (Tue) by Wol (subscriber, #4433) [Link]

And if I've understood the "higher maths" correctly - I'm more a chemist/medical guy by education - the O_PONY I'm asking for is that signed multiplication be a group. Any group, I don't care, so long as when it doesn't overflow the result is what naive arithmetic would expect.

Because, on the principle of least surprise, it's a very unpleasant surprise to discover that multiplying two numbers could legally result in the computer squirting coffee up your nose ... :-)

Is there really anything wrong in asking for the result of computer operations to MAKE SENSE? (No, double-free and things like that - bugs through and through - clearly can't make sense. That's just common sense :-)

Cheers,
Wol

Losing the magic

Posted Dec 6, 2022 18:20 UTC (Tue) by khim (subscriber, #9252) [Link]

> And if I've understood the "higher maths" correctly - I'm more a chemist/medical guy by education - the O_PONY I'm asking for is that signed multiplication be a group. Any group, I don't care, so long as when it doesn't overflow the result is what naive arithmetic would expect.

Doesn't look that way to me. Compiler developers already acquiesced to these demands and provided a flag which makes clang and gcc treat signed integers that way.

Now you're arguing about a different thing: the “right to be ignorant”. You don't want to use the flag, you don't want to use the provided functions, you don't want to accept anything but complete surrender from guys who never promised you that your approach would work in the first place (it's not guaranteed to work even in C90, and gcc 2.95 from the last century already assumed you write correct code and don't overflow signed integers).
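
A sketch of what that flag and those functions presumably refer to: -fwrapv makes gcc and clang define signed overflow as two's-complement wrapping, and the checked-arithmetic builtins report overflow without ever performing an undefined operation:

#include <stdbool.h>

/*
 * Built with -fwrapv, a plain signed multiplication is defined to wrap.
 * Without that flag, __builtin_mul_overflow() still gives a well-defined
 * answer: it reports whether the mathematical result fits in 'int'.
 */
static bool mul_overflows(int a, int b)
{
	int product;

	return __builtin_mul_overflow(a, b, &product);
}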

> That's just common sense :-)

And that's precisely the problem. You ask for common sense but neither computers nor compilers have it.

They couldn't employ common sense during the optimisation because it's not possible to formally describe what “common sense” means!

Thus they use the next best substitute: the list of logical rules collected in the C standard.

> Is there really anything wrong in asking for the result of computer operations to MAKE SENSE? (No, double-free and things like that - bugs through and through - clearly can't make sense.

That's how specification is changed and how new switches are added. People employ their common sense and discuss things and arrive at some set of rules.

Similarly to how law is produced: people start with common sense, but common sense is different for different people thus we end up with certain set of rules which some people like, some people don't like, but all have to follow.

Only with the C standard the situation is both simpler and more complicated: the subject matter is much more limited, but the agent which does the interpretation doesn't have even vestigial amounts of common sense (law assumes that where there are contradictions or ambiguities a judge would use common sense; C language specification writers have no such luxury), thus you have to make the specification as rigid and strict as possible.

Losing the magic

Posted Dec 6, 2022 15:16 UTC (Tue) by khim (subscriber, #9252) [Link]

> It seems to me you lack a lot of experience!

I worked with embedded guys and even know a guy who spent an insane amount of time squeezing AES into 256 bytes on some 4-bit Samsung CPU.

I've seen how these folks behave.

> In environments like bare metal, you need a lot of discipline, otherwise you don't understand your own code a few years later... or everything will just crash.

The big trouble (and also what makes them truly hopeless) is that hardware is often buggy, it very much contains lots of UBs (especially if you use raw prototypes) but because it's a physical thing, the UBs are limited. Something doesn't become set when it should, or you need a delay, or if you do something too quickly (or too slowly!) there's a crash… but it rarely happens that an issue in one part of your device affects another, completely unrelated part (except if you are developing something like a modern 100-billion-transistor CPU/GPU… but that's not embedded and I'm not even sure you may classify what these people are doing as “hardware” nowadays).

They cope with hardware UBs with tests and, naïvely, try to apply the same approach to software. Which rarely ends well and just leads to O_PONIES ultimatums (which remain mostly ignored, because software is not hardware and the effects of UB may be thoroughly non-local).

Then they adopt the attitude of “these idiot compiler makers couldn't be trusted and we are right, thus we would just freeze the version of the compiler we are using”. Which, of course, leads to an inability to later reuse code written in a supposedly portable language (which greatly surprises their bosses).

It's a mess. The worst of all is the attitude we don't have time to sharpen the axe, we need to chop trees!

> I appreciate the simplicity and readability of C very much.

Unfortunately its simplicity is only skin-deep: its syntax is readable enough (if you forget the blunder with pointers to functions), but its semantics are, often, extremely non-trivial and very few understand them.

This goes back to the fact that C was never, actually, designed; it was cobbled together by similarly-minded folks, thus when people who knew how languages are supposed to be designed tried to clarify how C works, they could only do so much if they didn't want to create an entirely different language which doesn't support programs written before that point at all (which would defeat the purpose of the clarification work).

Losing the magic

Posted Dec 6, 2022 18:03 UTC (Tue) by Wol (subscriber, #4433) [Link]

> The big trouble (and also what makes them truly hopeless) is that hardware is often buggy, it very much contains lots of UBs (especially if you use raw prototypes) but because it's a physical thing, the UBs are limited. Something doesn't become set when it should, or you need a delay, or if you do something too quickly (or too slowly!) there's a crash… but it rarely happens that an issue in one part of your device affects another, completely unrelated part (except if you are developing something like a modern 100-billion-transistor CPU/GPU… but that's not embedded and I'm not even sure you may classify what these people are doing as “hardware” nowadays).

In other words, with C's assumption that UB is impossible, we now have a conundrum if we want to write Operating Systems in C!

Which has been my problem ALL ALONG. I want to be able to reason, SANELY, in the face of UB without the compiler screwing me over. If that's an O_PONY then we really are fscked.

Cheers,
Wol

Losing the magic

Posted Dec 6, 2022 18:47 UTC (Tue) by khim (subscriber, #9252) [Link]

> In other words, with C's assumption that UB is impossible, we now have a conundrum if we want to write Operating Systems in C!

Why would it be so? There is a lot of art built around how you can avoid UBs in practice, starting from switches which turn certain UBs into IBs (and thus make them safe to use) to sanitizers which [try to] catch UBs like race conditions, double-frees, or out-of-bounds array accesses.

If you accept the goal (ensure that your OS doesn't ever trigger UB) there are plenty of ways to achieve it. Here is an interesting article on the subject.

I, personally, did something similar on a smaller scale (not an OS kernel, but another security-critical component of the system). It ended up with one bug in the 10 years the system was in use (and that was related to a problem with the hardware specification).

But if you insist on your ability to predict what code with UBs would do… you can't write an Operating System in C that way (or, rather, you can, it's just that there are no guarantees that it will work).

> Which has been my problem ALL ALONG. I want to be able to reason, SANELY, in the face of UB without the compiler screwing me over.

Not in the cards, sorry. If your code can trigger UB then the only guaranteed fix is to change the code and make it stop doing that.

> If that's an O_PONY then we really are fscked.

Why? Rust pushes UBs into a tiny corner of your code and there is already enough research into how we can avoid UBs completely (by replacing them with markup which includes a proof that your code doesn't trigger any UBs). Here is a related (and very practical) project.

Of course even after all that we would have the issue of bugs in hardware, but that's an entirely different can of worms.

Losing the magic

Posted Dec 6, 2022 20:19 UTC (Tue) by Wol (subscriber, #4433) [Link]

> > In other words, with C's assumption that UB is impossible, we now have a conundrum if we want to write Operating Systems in C!

> Why would it be so? There is a lot of art built around how you can avoid UBs in practice, starting from switches which turn certain UBs into IBs (and thus make them safe to use) to sanitizers which [try to] catch UBs like race conditions, double-frees, or out-of-bounds array accesses.

I notice you didn't bother to quote what I was replying to. IF THE HARDWARE HAS UBs (those were your own words!), and the compiler assumes that there is no UB, then we're screwed ...

Cheers,
Wol

Losing the magic

Posted Dec 6, 2022 22:45 UTC (Tue) by khim (subscriber, #9252) [Link]

> IF THE HARDWARE HAS UBs (those were your own words!)

Not if. Hardware most definitely has UB. x86 has fewer UBs than most other architectures, but it, too, can provide mathematically impossible results!

On the hardware level, without help from the compiler! If used incorrectly, of course.

These UBs are the result of hardware optimizations instead. You cannot turn these off!

But you can find a series of articles explaining how one is supposed to work with all that right here, on LWN!

You have probably seen them already, but probably haven't realized what they are actually covering.

> if the compiler assumes that there is no UB, then we're screwed ...

Why? How? What suddenly happened? The compiler deals with these UBs precisely and exactly like any other UBs: it assumes they never happen.

And then the programmer is supposed to deal with all that in the exact same fashion as with any other UBs: by ensuring that the compiler's assumption is correct.

Losing the magic

Posted Dec 7, 2022 11:05 UTC (Wed) by farnz (subscriber, #17727) [Link]

It's worth noting that you're making the situation sound a little worse than it actually is.

The compiler's job is to translate your program from one language (say C) to another language (say x86-64 machine code), with the constraint that the output program's behaviour must be the same as the input program's behaviour. Effectively, therefore, the compiler's job is to translate defined behaviour in the source program into identically defined behaviour in the output program.

For source languages without undefined behaviour, this means that the compiler must know about the destination language's undefined behaviour and ensure that it never outputs a construct with undefined behaviour - this can hurt performance, because the compiler may be forced to insert run-time checks (e.g. "is the shift value greater than the number of bits in the input type, if so jump to special case").
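
Written out by hand in C, the inserted guard amounts to something like this (a sketch, assuming a hypothetical source language that defines an over-wide shift as producing zero):

#include <stdint.h>

/*
 * Language L defines x << n as 0 when n >= 32; the target's shift
 * instruction leaves that case undefined, so the compiler must emit
 * the equivalent of this guard (or prove that n < 32 and omit it).
 */
static uint32_t shift_left_defined(uint32_t x, uint32_t n)
{
	if (n >= 32)
		return 0;
	return x << n;
}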

For source languages with undefined behaviour, the compiler gets a bit more freedom; it can translate a source construct with UB to any destination construct it likes, including one with UB. This is fine, because the compiler hasn't added new UB to the program - it's "merely" chosen a behaviour for something with UB.

Losing the magic

Posted Dec 7, 2022 12:44 UTC (Wed) by khim (subscriber, #9252) [Link]

> It's worth noting that you're making the situation sound a little worse than it actually is.

You are mixing issues. Of course it's possible to make language without UB! There are tons of such languages: C#, Java, Haskell, Python…

But that's not what O_PONIES lovers want! They want the ability to “program to the hardware”. Lie to the compiler (because “they know better”), do certain manipulations to the hardware which the compiler has no idea about, and then expect that the code would still work.

That is impossible (and I probably underestimate the complexity of the task). It's as if a Java program opened /proc/self/mem, poked the runtime internals and then, when an upgrade broke it, its author demanded satisfaction and claimed that since his code worked in one version of the JRE it must work in all of them.

That is what happens when you “use UB to combat UB”. The onus is on you to support new versions of the compiler. Just like the onus is on you to support new versions of Windows if you use undocumented functions, or if you poke into Linux kernel internals via debugfs, and so on.

And Linux kernel developers are not shy when they say that when programs rely on such intricate internal details all bets are off. Even the O_PONIES term was coined by them, not by compiler developers!

> For source languages with undefined behaviour, the compiler gets a bit more freedom; it can translate a source construct with UB to any destination construct it likes, including one with UB. This is fine, because the compiler hasn't added new UB to the program - it's "merely" chosen a behaviour for something with UB.

Yes, but that's precisely what O_PONIES lovers object to. Just read the damn paper already. It doesn't even entertain the notion that programs can be written without the use of UBs for one minute. They just assert they would continue to write code with UBs (“write code for the hardware” since “C is a portable assembler”) and compilers have to adapt, somehow. Then they discuss how the compiler would have to deal with the mess they are creating.

You may consider that a concession of sorts (no doubt caused by the fact that you cannot avoid UBs in today's world because even bare hardware has UBs), but it's still not a discussable position because, instead of listing constructs which are allowed in the source program, they just want to blacklist certain “bad things”.

Because it doesn't work! Ask any security guy what he thinks about blacklists and you would hear that they are always only papering over the problem and just lead to “whack-a-mole” busywork. To arrive at some promises you have to whitelist good programs, not blacklist the bad ones!

Losing the magic

Posted Dec 7, 2022 14:18 UTC (Wed) by farnz (subscriber, #17727) [Link]

You're arguing a different point, around people who demand a definition of UB in their language of choice L, by analogy to another language M.

I'm saying that the situation is not as awful as it might sound; if I write in language L, and compile it to language M, it's a compiler bug if the compiler converts defined behaviour in L into undefined behaviour in M. As a result, when working in language L (whether that's C, Haskell, Python, Rust, JavaScript, ALGOL, Lisp, PL/I, Prolog, BASIC, Idris, whatever), I do not need to worry about whether or not there's UB in language M - I only need care about language L, because it's a compiler bug if the compiler translates defined behaviour language L into undefined behaviour in language M.

So, for example, if language L says that a left shift by more than the number of bits in my integer type always results in a zero value, it's up to the compiler to make that happen. If language M says that a left shift by more than the number of bits in my integer type results in UB, then the compiler has to handle putting in the checks (or proving that they're not needed) so that if I do have a left shift by more than the number of bits in my integer type, I get 0, not some undefined behaviour.

And this applies all the way up the stack if I have multiple languages involved; if machine code on my platform has UB (and it probably does in a high-performance CPU design), it makes no difference if I compile BASIC to Idris, Idris to Chicken Scheme, Chicken Scheme to C, C to LLVM IR and finally LLVM IR to machine code, or if I compile BASIC directly to machine code. Each compiler in the chain must ensure that all defined behaviour of the source language translates to identically defined behaviour in the destination language.

In other words, as you compile from language L to language M, the compiler can leave you with as much UB as you had before, or it can decrease the amount of UB present in language M, but it can never add UB. The only "problem" this leaves you with if you're the O_PONIES sort is that it means that defining what it actually means for UB to flow from language M to language L is tricky, because in the current world, UB doesn't flow that way, it only flows from language L to language M.

Losing the magic

Posted Dec 7, 2022 15:09 UTC (Wed) by khim (subscriber, #9252) [Link]

> In other words, as you compile from language L to language M, the compiler can leave you with as much UB as you had before, or it can decrease the amount of UB present in language M, but it can never add UB.

Of course it can add UB! Every language with manual memory management, without GC, adds UB related to that. On the hardware level there are no such UBs; memory is managed by the user when he adds or removes DIMMs, and there can never be any confusion about whether memory is accessible or not.

But Ada, C, Pascal and many other such languages add memory-management functions and then say “hey, if you freed memory then the onus is on you to make sure you don't try to use an object which no longer exists”.

The desire to do what you are talking about is what gave rise to the GC infestation and the abuse of managed code.

> The only "problem" this leaves you with if you're the O_PONIES sort is that it means that defining what it actually means for UB to flow from language M to language L is tricky, because in the current world, UB doesn't flow that way, it only flows from language L to language M.

UBs can flow in any direction and don't, actually, cause any problems as long as you understand what UB is: something that you are not supposed to do. If you understand what UB is and can list them — you can deal with them.

If you don't understand what UB is (the O_PONIES people) or don't understand where the UBs are (the Clément Bœsch case or, if we are talking about hardware, the Meltdown and Spectre case) then there's trouble.

Ignorance can be fixed easily. But attitude adjustments are hard. If someone believes it's his right to ignore a traffic light because that's how he drove for the last half-century in his small village, then it becomes a huge problem when someone like that moves to a big city.

Losing the magic

Posted Dec 7, 2022 16:21 UTC (Wed) by farnz (subscriber, #17727) [Link]

You're misunderstanding me still. If there is no UB in my source code, then there is also no UB in the resulting binary, absent bugs/faults in the compiler, the OS or the hardware.

Your examples are cases where I have UB in language L, I translate to language M, and I still have UB - in other words, no new UB has been introduced, but the existing UB has resulted in the output program having UB, too. The only gotcha is that the UB in the output program may surprise the programmer, since UB in the source language simply leaves the target language behaviour completely unconstrained.

There is never a case where I write a program in language L that is free of UB, but a legitimate compilation of that program to language M results in the program having UB. If this does happen, it's a bug - the compiler has produced invalid output, just as it's a bug for a C compiler to turn int a = 1 + 2; into int a = 4;.

In turn, this means that UB in language M does not create new UB in language L - the flow of UB is entirely one-way in this respect (there was UB in language L, when I compiled it, I ended up with a program that has UB in language M).

The only thing that people find tricky here is that they have a mental model of what consequences of UB are "reasonable", and what consequences of UB are "unreasonable", and get upset when a result of compiling a program from L to M results in the compiler producing a program in language M with "unreasonable" UB, when as far as they were concerned, the program in language L only had "reasonable" UB. But this is not a defensible position - the point of UB is that the behaviour of a program that executes a construct that contains UB is undefined, while "reasonable" UB is a definition of what behaviour is acceptable.

And here we come to the underlying fun with O_PONIES: Coming up with definitions for existing UB and pushing that through the standards process is hard work, and involves thinking about a lot of use cases for the language, not just your own, and getting agreement either on a set of allowable behaviours for a construct that's currently UB, or getting the standards process to agree that something should be implementation-defined (i.e. documented set of allowable behaviours from the compiler implementation). This is a lot of work, and involves getting a full understanding of why people want certain behaviours to be UB, rather than defined in a non-deterministic fashion.

Losing the magic

Posted Dec 7, 2022 17:28 UTC (Wed) by Wol (subscriber, #4433) [Link]

> And here we come to the underlying fun with O_PONIES: Coming up with definitions for existing UB and pushing that through the standards process is hard work, and involves thinking about a lot of use cases for the language, not just your own, and getting agreement either on a set of allowable behaviours for a construct that's currently UB, or getting the standards process to agree that something should be implementation-defined (i.e. documented set of allowable behaviours from the compiler implementation). This is a lot of work, and involves getting a full understanding of why people want certain behaviours to be UB, rather than defined in a non-deterministic fashion.

I don't know whether khim's English skills are letting him down, or whether he's trolling, but I think you've just encapsulated my view completely.

Multiplication exists in C. Multiplication exists in Machine Code. All I want is for the C spec to declare them equivalent. If the result is sane in C, then machine code has to return a sane result. If the result is insane in C, then machine code is going to return an insane result. Whatever, it's down to the PROGRAMMER to deal with.

khim is determined to drag in features that are on their face insane, like double frees and the like. I'm quite happy for the compiler to optimise on the basis of "this code is insane, I'm going to assume it can't happen (because it's a bug EVERYWHERE)". What I'm unhappy with is SchrodinUB where the EXACT SAME CODE may, or may not, exhibit UB depending on situations outside the control of the programmer (and then the compiler deletes the programmer's checks!).
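
The classic illustration of that complaint (a sketch, not anyone's real code) is an overflow check written after the operation it is meant to guard:

#include <limits.h>

int add_clamped(int a, int b)
{
	int sum = a + b;		/* undefined if it overflows */

	/*
	 * Because the compiler may assume the addition above did not
	 * overflow, it is entitled to treat this test as always false
	 * and delete it, check and all.
	 */
	if (b > 0 && sum < a)
		return INT_MAX;

	return sum;
}

/* A check performed before the operation survives optimization:
 *	if (b > 0 && a > INT_MAX - b)
 *		return INT_MAX;
 */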

And it's all very well khim saying "the compiler writers have given you an opt-out". But SchrodinUB should always be opt IN. Principle of "least surprise" and all that. (And actually, I get the impression Rust is like that - bounds checks and all that sort of thing are suppressed in runtime code I think I heard some people say. That's fine - actively turn off checks in production in exchange for speed IF YOU WANT TO, but it's a conscious opt-in.)

Cheers,
Wol

Losing the magic

Posted Dec 7, 2022 18:48 UTC (Wed) by khim (subscriber, #9252) [Link]

> All I want is for the C spec to declare them equivalent.

If that's really your desire then you sure found a funny way to achieve it.

But I'm not putting you with O_PONIES crowd. It seems you are acting out of ignorance not malice.

John Regehr tried to do what you are proposing to do — and failed spectacularly, of course.

But look here: My paper does not propose a tightening of the C standard. Instead, it tells C compiler maintainers how they can change their compilers without breaking existing, working, tested programs. Such programs may be compiler-specific and architecture-specific (so beyond anything that a standard tries to address), but that's no reason to break them on the next version of the same compiler on the same architecture.

Basically the O_PONIES lovers' position is the following: if language M (machine code) has UB then it's OK for L to have UB in that place, but if M doesn't have UB then it should be permitted to violate the rules of L and still produce a working program.

But yeah, that's probably a problem with me understanding English or with you having trouble explaining things.

> What I'm unhappy with is SchrodinUB where the EXACT SAME CODE may, or may not, exhibit UB depending on situations outside the control of the programmer

How is that compatible with this:

> khim is determined to drag in features that are on their face insane, like double frees and the like. I'm quite happy for the compiler to optimise on the basis of "this code is insane, I'm going to assume it can't happen (because it's a bug EVERYWHERE)".

I don't see why you say that this feature is insane. Let's consider a concrete example:

On beta versions of Windows 95, SimCity wasn’t working in testing. Microsoft tracked down the bug and added specific code to Windows 95 that looks for SimCity. If it finds SimCity running, it runs the memory allocator in a special mode that doesn’t free memory right away.

It looks as if your approach (“the EXACT SAME CODE may, or may not, exhibit UB depending on situations outside the control of the programmer”) very much does cover double frees, dangling pointers and other such things. It's even possible to make it work if you have enough billions in the bank and an obsession with backward compatibility.

The question: is that a billion well spent? Should we have a dedicated team which cooks up such patches for clang and/or gcc? Who would pay for it?

Without changing the spec (which people like Anton Ertl or Victor Yodaiken very explicitly say is not what they want), this would be the only alternative, I'm afraid.

> But SchrodinUB should always be opt IN.

Why? It's not part of the C standard, why should it affect good programs which are not abusing C?

> And actually, I get the impression Rust is like that - bounds checks and all that sort of thing are suppressed in runtime code I think I heard some people say.

Only integer overflow checks are disabled. If you try to divide by zero you still get a check and a panic if the divisor is zero.

But if you violate some other thing (e.g. if your program tries to use an uninitialized variable) all bets are still off.

Let's consider the following example:

#include <stdbool.h>

bool to_be_or_not_to_be() {
    int be;
    return be == 0 || be != 0;
}
With Rust you need to jump through hoops to use an uninitialized variable, but with unsafe it's possible:

use std::mem::MaybeUninit;

pub fn to_be_or_not_to_be() -> bool {
    let be: i32 = unsafe { MaybeUninit::uninit().assume_init() };
    return be == 0 || be != 0;
}

You may argue that what Rust is doing (removing the code which follows the to_be_or_not_to_be call and replacing it with an unconditional crash) is, somehow, better than what C is doing (claiming that the value of be == 0 || be != 0 is false).

But that would be a hard sell to an O_PONIES lover who was counting on getting true from it (like Rust did only a few weeks ago).

Yes, Rust is a better-defined language, no doubt about it. It has a smaller number of UBs and they are more sane. But C and Rust are cast in the same mold!

You either avoid UBs and have a predictable result, or you don't avoid them and end up with something strange… and there is absolutely no guarantee that a program which works today would continue to work tomorrow… you have to ensure your program doesn't trigger UB to cash in on that promise.

Losing the magic

Posted Dec 7, 2022 19:48 UTC (Wed) by pizza (subscriber, #46) [Link]

> And it's all very well khim saying "the compiler writers have given you an opt-out". But SchrodinUB should always be opt IN.

The thing is... they are! Run GCC without any arguments and you'll get -O0, ie "no optimization".

These UB-affected optimizations are only ever attempted if the compiler is explicitly told to try.

Now what I find hilarious are folks who complain about the pitfalls of modern optimization techniques failing on their code while simultaneously complaining "but without '-O5 -fmoar_ponies' my program is too big/slow/whatever". Those folks also tend to ignore or disable warnings, so.. yeah.

Losing the magic

Posted Dec 7, 2022 19:06 UTC (Wed) by khim (subscriber, #9252) [Link]

> If there is no UB in my source code, then there is also no UB in the resulting binary, absent bugs/faults in the compiler, the OS or the hardware.

We don't disagree there, but that's not what O_PONIES lovers are ready to accept.

> Your examples are cases where I have UB in language L, I translate to language M, and I still have UB - in other words, no new UB has been introduced, but the existing UB has resulted in the output program having UB, too.

Yes. Because that's what O_PONIES lovers demand to handle! They, basically, say that it doesn't matter whether L has UB or not. It only matters whether M has UB. If M doesn't have a “suitably similar” UB then a program in L must be handled correctly even if it violates the rules of language L.

Unfortunately, in practice it works only in two cases:

  1. If L and M are extremely similar (like machine code and assembler), or
  2. If the translator from L to M is so primitive that you can, basically, predict how precisely each construct from L maps to M (old C compilers).

> In turn, this means that UB in language M does not create new UB in language L - the flow of UB is entirely one-way in this respect (there was UB in language L, when I compiled it, I ended up with a program that has UB in language M).

Ah, got it. Yeah, in that sense it's a one-way street in the absence of bugs. Of course bugs may move things from M to L (see Meltdown and Spectre), but in the absence of bugs it's a one-way street, I agree.

> This is a lot of work, and involves getting a full understanding of why people want certain behaviours to be UB, rather than defined in a non-deterministic fashion.

And it's also explicitly not what O_PONIES lovers want. They explicitly don't want all that hassle, they just want the ability to write code in L with UB and get a working program. That is really pure O_PONIES, exactly like in that story with the Linux kernel.

The list of UBs in C and C++ is still patently insane, but that's a different issue. It would have been possible to tackle that issue if O_PONIES lovers actually wanted to alter the spec. That's not what they want. They want ponies.

Losing the magic

Posted Dec 8, 2022 11:11 UTC (Thu) by farnz (subscriber, #17727) [Link]

Yep - and the O_PONIES problem, when you reduce it to its core is simple. The standard permits non-deterministic behaviour (some behaviours are defined as "each execution of the program must exhibit one behaviour from the allowed list of behaviours", not as a single definite behaviour). The standard also permits implementation-defined behaviour - where the standard doesn't define how a construct behaves, but instead says "your implementation will document how it interprets this construct".

What the O_PONIES crowd want is to convert "annoying" UB in C and C++ to implementation-defined behaviour. There's a process for doing that - it involves going through the standards committees writing papers and convincing people that this is the right thing to do. The trouble is that this is hard work - as John Regehr has already demonstrated by making the attempt - since UB has been used by the standards committee as a way of avoiding difficult discussions about what is, and is not, acceptable in a standards-compliant compiler, and thus re-opening the discussion is going to force people to confront those arguments all over again.

Losing the magic

Posted Dec 7, 2022 10:04 UTC (Wed) by anselm (subscriber, #2796) [Link]

This goes back to the fact that C was never, actually, designed; it was cobbled together by similarly-minded folks, thus when people who knew how languages are supposed to be designed tried to clarify how C works, they could only do so much if they didn't want to create an entirely different language which doesn't support programs written before that point at all (which would defeat the purpose of the clarification work).

That, of course, applies to all programming languages which are in common use – especially to programming languages that have more than one implementation. (Life is a lot easier when a programming language has only one implementation and you can decree that the official behaviour of the language is whatever that implementation does in every case.)

C had been around for almost 20 years, with a considerable number of implementations on a wide variety of hardware architectures, when the first official C standard was published. Considering that, ANSI/ISO 9899-1990, within its limits, was a very important and remarkably useful document. It's easy to argue, with the benefit of 30-plus years' worth of hindsight, that C is a terrible language and the various C standards not worth the paper they're printed on, but OTOH as far as the Internet is concerned, C is still what makes the world go round – between the Linux kernel, Apache and friends, and for that matter many implementations of nicer languages than C, most of the code we're running on a daily basis remains written in C, and it will be a while yet before new languages like Rust get enough traction (and architecture support) to be serious competitors.

Of course C has its problems, but it is virtually guaranteed that any other language, once it has achieved the popularity and widespread use of C, will too (even if invented by “people who know how languages are supposed to be designed”). Certainly in the last 80 years of programming language design, and claims to the contrary notwithstanding, nobody has so far been able to come up with a systems programming language that has no problems at all, that runs everywhere, and that people are actually prepared to adopt.

Losing the magic

Posted Dec 7, 2022 12:04 UTC (Wed) by khim (subscriber, #9252) [Link]

> That, of course, applies to all programming languages which are in common use

Depends on how you would define “common use”, though. The first language which was actually designed is, arguably, Lisp (and it wasn't even designed to write real-world programs). It's still in use.

Also, the different versions of Algol were designed, as were Pascal, Ada, Haskell… Even Java, C#, and Go were designed to some degree! The goal of all these projects was to create something people can use to discuss how programs are created; features of the language were extensively discussed and rejected (or accepted) on that basis.

C and PHP, on the other hand, were never actually designed. C was created just by the pressing need to have something to rewrite the PDP-7-only OS to support the PDP-11, too. Later more machines were added and C was stretched and stretched till it started to break.

Only then did the committee start its work, and it stitched the language together to the best of its abilities, but because some cracks were so deep, some results were… unexpected.

> Life is a lot easier when a programming language has only one implementation and you can decree that the official behaviour of the language is whatever that implementation does in every case.

You can never do that. Look at languages with one implementation: PHP, Python (theoretically many implementations, but CPython is the defining one), or even Rust (although there are new implementations in development). Different versions may behave differently and you have to decide which one is “right” even if there are no other implementations.

Life is “simple” only when the language and its implementation never change (Lua comes close).

> most of the code we're running on a daily basis remains written in C

I wouldn't say so. Maybe in embedded, but in most other places C has been replaced with C++. And even if you say that C and C++ are the same language (which is true to some degree), then you would have to admit that most code today is not written in C; it's written in Python, Java, JavaScript or Visual Basic.

It was never true that C was the language which was used to the exclusion of everything else. And it wasn't even initially popular with OS writers: MacOS, for example, was written in Pascal, and early versions of Microsoft's development tools (assembler, linker, etc.) were written in Microsoft Pascal, too.

The success of UNIX made C popular, and this success wasn't even based on technical merits! Rather, AT&T was forced to refrain from selling software, thus it couldn't compete with sellers of UNIX.

It was always known that C is an awful language, since day one. It was just not obvious how awful it was till a sensible alternative arrived.

The approach was: “C is that horrible, awful thing, let's hide it from mere mortals”. Mostly because the IT industry bought the Kool-Aid of the GC-based solution to memory safety, which, of course, can only work if there is something under your language to provide the GC and the other important runtime pieces.

Most managed languages remained with a runtime written in C/C++ because “hey, it's used by professionals, they can deal with sharp corners”. Only Go avoided that, and it still needs some other language for the OS kernel, even in theory.

> Of course C has its problems, but it is virtually guaranteed that any other language, once it has achieved the popularity and widespread use of C, will too (even if invented by “people who know how languages are supposed to be designed”).

Oh, absolutely. Pascal was stretched, too, and it, too, got many strange warts when Borland, CodeGear, Borland, Embarcadero, and Idera were adding hot new features without thinking about how to integrate them.

Rust is definitely not immune: while its core is very good, the async approach is questionable, and chances are high that we will know, 10 or 20 years later, how to do it much better than Rust does it today.

But today it's unclear how to do it better, thus we have what we have.

> Certainly in the last 80 years of programming language design, and claims to the contrary notwithstanding, nobody has so far been able to come up with a systems programming language that has no problems at all, that runs everywhere, and that people are actually prepared to adopt.

That's just not possible. Languages come and go. C's lifespan was artificially extended by the invention and slow adoption of C++, though (when C++ was created it became possible to freeze C and say that if you want a modern language you can go use C++, and since C++ wasn't “ready” for so many years it was always easy to say “hey, don't worry, the next version would fix everything”). It's a bit funny and sad when you read that concepts complete C++ templates as originally envisioned 30+ years after the language was made, but oh, well… that's life.

Rust wasn't built in a vacuum, after all. It took many concepts developed in C++! RAII was invented there, ownership rules were invented there (only initially enforced by a style guide, not the compiler), and so on.

Only, at some point, it becomes obvious that the only way forward is to remove some things from the language — which, basically, makes it a different language (it's really hard to remove something from a popular language; recall the Python 3 saga). Here Graydon Hoare lists things which were either removed from Rust or (in some cases) never added.

Thus yes, it would be interesting to see what happens in 10-20 years when we need to remove something from Rust. Would we get Rust 2.0 (like what happened with Python) or an entirely different language (like what is happening with C)? Who knows.

But no, I don't expect Rust to live forever. Far from it. It's full of problems; we just have no idea how to solve these problems properly yet, thus we solve them the same way we solve them in C (ask the developer “to hold it right”).

Losing the magic

Posted Dec 5, 2022 21:57 UTC (Mon) by developer122 (guest, #152928) [Link]

Would be interesting to do an in-depth comparison with rust's type system.

Losing the magic

Posted Dec 5, 2022 22:28 UTC (Mon) by unixbhaskar (guest, #44758) [Link]

Well, the other day I stumbled on binfmt_misc, and the register file has a "magic" number field too :)

Losing the magic

Posted Dec 6, 2022 0:04 UTC (Tue) by magfr (subscriber, #16052) [Link]

That is an entirely different school, more closely related to file(1) than to the magic that is fading from the kernel sources.

Losing the magic

Posted Dec 6, 2022 6:44 UTC (Tue) by adobriyan (subscriber, #30858) [Link]

At this rate the kernel will never run out of magic!

$ find . -type f -name '*.[chS]' | xargs grep -e magic -i | wc -l
12342

Losing the magic

Posted Dec 7, 2022 1:14 UTC (Wed) by ejr (subscriber, #51652) [Link]

Isn't this the situation from which Rusgocamkell will save us?

Sorry. Couldn't resist the snark. Run-time-ish methods to ensure compatibility likely are good at the correct abstraction points. I've used eight-character strings to detect (integer) endianness as a trivial example. Not zero terminated.

And losing magic for openness isn't bad necessarily. Just a tad sad for ex-wizards.

Losing the magic

Posted Dec 13, 2022 6:02 UTC (Tue) by tytso (subscriber, #9993) [Link]

The use of magic numbers is something that I learned from Multics. One advantage of structure magic numbers is that it also provides protection against use-after-free bugs, since you can zero the magic number before you free the structure. And even if you don't, when the memory gets reused, if everyone uses the magic number scheme where the first four bytes contain a magic number, then it becomes a super-cheap defense against a certain class of bugs, without needing to rely on things like KMSAN, which (a) is super-heavyweight and so won't be used on production kernels, and (b) didn't exist in the early days of Linux.

Like everything, it's a trade-off. Yes, there is overhead associated with magic numbers. But it's not a lot of overhead (and it's certainly cheaper than KMSAN!), and the ethos of "trying to eliminate an entire set of bugs", which is well accepted as a way of making the kernel more secure, is something that could be applied to magic numbers as well.

I still use magic numbers in e2fsprogs, where the magic number is generated using the com_err library (another Multicism), in which the top 24 bits identify the subsystem and the low 8 bits are the error code for that subsystem. This means it's super easy to do things like this:

In lib/ext2fs/ext2fs.h:

#define EXT2_CHECK_MAGIC(struct, code) \
	  if ((struct)->magic != (code)) return (code)

In lib/ext2fs/ext2_err.et.in:

	error_table ext2

ec	EXT2_ET_BASE,
	"EXT2FS Library version @E2FSPROGS_VERSION@"

ec	EXT2_ET_MAGIC_EXT2FS_FILSYS,
	"Wrong magic number for ext2_filsys structure"

ec	EXT2_ET_MAGIC_BADBLOCKS_LIST,
	"Wrong magic number for badblocks_list structure"

The compile_et program generates ext2_err.h and ext2_err.c, for which ext2_err.h will have definitions like this:

#define EXT2_ET_BASE                             (2133571328L)
#define EXT2_ET_MAGIC_EXT2FS_FILSYS              (2133571329L)
#define EXT2_ET_MAGIC_BADBLOCKS_LIST             (2133571330L)
...
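
(As an aside, a small standalone sketch, not taken from e2fsprogs, showing how those generated constants embody the 24/8 split described above; the variable names here are made up, and com_err's own error_message() is the supported way to turn such codes into strings:)

#include <stdio.h>

/* Constants as generated by compile_et (see above). */
#define EXT2_ET_BASE                 (2133571328L)
#define EXT2_ET_MAGIC_EXT2FS_FILSYS  (2133571329L)

int main(void)
{
	long code = EXT2_ET_MAGIC_EXT2FS_FILSYS;
	long table = code & ~0xFFL;	/* top 24 bits: the error table (subsystem) */
	long offset = code & 0xFFL;	/* low 8 bits: index within that table */

	/* Prints "table base 0x7f2bb700, offset 1"; the base equals EXT2_ET_BASE. */
	printf("table base 0x%lx, offset %ld\n", table, offset);
	return 0;
}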

Then in various library functions:

errcode_t ext2fs_dir_iterate2(ext2_filsys fs,
			      ext2_ino_t dir,
...
{
	EXT2_CHECK_MAGIC(fs, EXT2_ET_MAGIC_EXT2FS_FILSYS);
        ...

And of course:

void ext2fs_free(ext2_filsys fs)
{
	if (!fs || (fs->magic != EXT2_ET_MAGIC_EXT2FS_FILSYS))
		return;
       ...
	fs->magic = 0;
	ext2fs_free_mem(&fs);
}

Callers of ext2fs library functions then will do things like this:

 	errcode_t    retval;

	retval = ext2fs_read_inode(fs, ino, &file->inode);
	if (retval)
		return retval;
or in application code:
		retval = ext2fs_read_bitmaps (fs);
		if (retval) {
			printf(_("\n%s: %s: error reading bitmaps: %s\n"),
			       program_name, device_name,
			       error_message(retval));
			exit(1);
		}

This scheme has absolutely found bugs. Given that there is a full set of regression tests run via "make check", I've found that this kind of software-engineering practice increases developer velocity and reduces my stress when I code, since when I do make a mistake it generally gets caught really quickly.

Personally, I find this coding discipline easier to understand and write than Rust, and more performant than using things like valgrind and MSan. Of course, I use those tools too, but if I can catch bugs early, my experience is that it allows me to generate code much more quickly and reliably.

Shrug. Various programming styles go in and out of fashion. And structure magic numbers goes all the way back to the 1960's (Multics was developed as a joint project between MIT, GE, and Bell Labs starting in 1964).

Losing the magic

Posted Dec 13, 2022 6:19 UTC (Tue) by Fowl (subscriber, #65667) [Link]

Is it still 'magic' if it's a vtable pointer? ;p

Losing the magic

Posted Dec 13, 2022 6:27 UTC (Tue) by tytso (subscriber, #9993) [Link]

And by the way.... the com_err library is not just used by e2fsprogs. It's also used by Kerberos, as well as a number of other projects that were developed at MIT's Project Athena[1] (including Zephyr[2], Moira[3], Hesiod[4], Discuss[5], etc.)

[1] http://web.mit.edu/saltzer/www/publications/atp.html
[2] http://web.mit.edu/saltzer/www/publications/athenaplan/e....
[3] http://web.mit.edu/saltzer/www/publications/athenaplan/e....
[4] http://web.mit.edu/saltzer/www/publications/athenaplan/e....
[5] http://www.mit.edu/afs/sipb/project/www/discuss/discuss.html

Losing the magic

Posted Dec 13, 2022 12:11 UTC (Tue) by mathstuf (subscriber, #69389) [Link]

I run my userspace under `MALLOC_CHECK_=3` and `MALLOC_PERTURB_=…` (updated occasionally by a user timer unit) to catch things like this. Is some kind of "memset-on-kfree" mechanism not suitable for debugging the entire kernel for use-after-free while also being far less heavy than KMSAN?

I ask because some day a very smart compiler might see that the write of `fs->magic = 0;` is dead, given the immediate free afterwards, and optimize it out since observing it would be UB. Additionally, while it protects against UAF in ext2 code, non-ext2 code that gets its hands on the pointer and somehow doesn't have the magic-checking logic is just as dead (I have no gauge on how "likely" this is in the design's use of pointers).

Losing the magic

Posted Dec 13, 2022 13:08 UTC (Tue) by excors (subscriber, #95769) [Link]

> I ask because some day, a very smart compiler might see that dead write of `fs->magic = 0;` given the immediate free afterwards and optimize it out as UB to observe.

That day was at least 8 years ago. GCC 4.9 with -O1 will optimise away the writes, defeating this attempt at memory protection, because ext2fs_free_mem is an inline function so the compiler knows the object is passed to free() and can no longer be observed. See e.g. https://godbolt.org/z/nWYEa34a6

I guess the cheapest way to prevent that is to insert a compiler barrier (`asm volatile ("" ::: "memory")`) just after writing to fs->magic, to prevent the compiler making assumptions about observability of memory.
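
(A standalone toy version of that pattern, with a made-up structure and magic value, just to show where the barrier would sit:)

#include <stdlib.h>

#define OBJ_MAGIC 0x0b1ec7ed

struct obj {
	int magic;
	int payload;
};

void obj_free(struct obj *o)
{
	if (!o || o->magic != OBJ_MAGIC)
		return;
	o->magic = 0;
	/* Compiler barrier: without it, GCC/Clang at -O1 and above may treat
	 * the store above as dead and drop it, since the object is freed on
	 * the next line. */
	asm volatile ("" ::: "memory");
	free(o);
}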

Losing the magic

Posted Dec 13, 2022 14:47 UTC (Tue) by adobriyan (subscriber, #30858) [Link]

This is the use case for "volatile": *(volatile int *)&fs->magic = 0;

Losing the magic

Posted Dec 14, 2022 2:46 UTC (Wed) by nybble41 (subscriber, #55106) [Link]

Even then, couldn't the compiler just set it back after the volatile write, but before the memory is freed? Seeing as how it's unobservable and all. Perhaps it decides to use that "dead" memory as scratch space for some other operation.

Losing the magic

Posted Dec 14, 2022 16:47 UTC (Wed) by adobriyan (subscriber, #30858) [Link]

I don't think so. "volatile" means "a load/store instruction must appear somewhere in the instruction stream", which is what's needed.

Losing the magic

Posted Dec 14, 2022 17:10 UTC (Wed) by mathstuf (subscriber, #69389) [Link]

It means that the value being represented is not trackable in the C abstract machine, and therefore no assumptions can be made about it. Because no assumptions can be made, optimizers are hard-pressed to do much of anything with it, since the "as-if" rule is likely impossible to apply accurately.

However, given that this is trivially detectable as about-to-be-freed memory, I don't know whether whatever rules exist around "volatile values living in C-maintained memory" might allow even this to be seen as a dead store, unobservable because UAF == UB.

Losing the magic

Posted Dec 14, 2022 19:31 UTC (Wed) by farnz (subscriber, #17727) [Link]

If I'm reading the standard correctly, the compiler has to output the stores, because it is possible that the program has shared that memory with an external entity using mechanisms outside the scope of the standard. What the implementation does after the memory is freed is not specified (although the implementation is allowed to assume that the memory is no longer shared with an external entity at this point), and in theory a sufficiently malicious implementation could undo those final stores after you called free, but before the memory is reused.

In practice, I don't think this is a significant concern for tricks intended to help with debugging. It is for security-oriented code, but that's not the case here.

Losing the magic

Posted Dec 14, 2022 23:25 UTC (Wed) by mathstuf (subscriber, #69389) [Link]

Can the C abstract machine really say that memory it obtains through `malloc` has some other magical property? Wouldn't that require you to get "lucky" with what `malloc` gives you in the first place to have that address space "mean something" to some other part of the system?

Maybe the kernel gets away with it by "hiding" behind non-standard allocation APIs…

Losing the magic

Posted Dec 15, 2022 10:47 UTC (Thu) by farnz (subscriber, #17727) [Link]

It's more that you can have an external observer outside the abstract machine, but able to understand abstract machine pointers; I can, in theory, store a pointer from malloc in a way that allows the external observer to reach into the abstract machine and read the malloc'd block. I can also have the external observer be looking not at the addresses, but at the pattern of data written into the block (just as in hardware, it's not unknown to have chips only connected to the address bus, and to rely on the pattern of address accesses to determine what to do).

The compiler is not allowed to make assumptions about what the external environment can, or cannot, see, and thus has to assume that any write to a volatile is visible in an interesting fashion.

Losing the magic

Posted Dec 14, 2022 18:33 UTC (Wed) by excors (subscriber, #95769) [Link]

> Even then, couldn't the compiler just set it back after the volatile write, but before the memory is freed? Seeing as how it's unobservable and all. Perhaps it decides to use that "dead" memory as scratch space for some other operation.

It probably could, but I don't think it's particularly fruitful to consider what the compiler 'could' do, because the goal of the magic numbers here is to detect use-after-free bugs, i.e. we're interested in the practical behaviour of a situation that the standard says is undefined behaviour. We're outside the scope of the standard, so all we can do is look at what GCC/Clang actually will do.

If there is no barrier or volatile, and some optimisations are turned on, they demonstrably will delete the write-before-free. With barrier or volatile, it appears (in my basic testing) they don't delete it, so the code will behave as intended - that doesn't prove they'll never delete it, but I can't immediately find any examples where that trick fails, and intuitively I think it'd be very surprising if it didn't work, so I'd be happy to make that assumption until shown a counterexample.

(The same issue comes up when trying to zero a sensitive buffer before releasing it, to reduce the risk of leaking its data when some other code has a memory-safety bug - you need to be very careful that the compiler doesn't remove all the zeroing code, and you can't look to the C standard for an answer.)
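
(One commonly used workaround, sketched here with made-up names: call memset through a volatile function pointer, so the compiler cannot prove the call is unobservable and elide it. glibc and the BSDs also provide explicit_bzero() for this purpose, and C23 adds memset_explicit().)

#include <string.h>

/* Volatile function pointer: the compiler has to perform the indirect call,
 * so the zeroing is not optimized away even though the buffer is about to
 * be released. */
static void *(*volatile secure_memset)(void *, int, size_t) = memset;

void wipe(void *buf, size_t len)
{
	secure_memset(buf, 0, len);
}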


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds