|
|
Subscribe / Log in / New account

Python and hashing None

Please consider subscribing to LWN

Subscriptions are the lifeblood of LWN.net. If you appreciate this content and would like to see more of it, your subscription will help to ensure that LWN continues to thrive. Please visit this page to join up and keep LWN on the net.

By Jake Edge
November 30, 2022

The recent discussion of a proposed change to the Python language—the usual fare on the language's Ideas forum—was interesting, somewhat less for the actual feature under discussion than for the other issues raised. The change itself is a minor, convenience feature that would provide a reproducible iteration order for certain kinds of sets between separate invocations of the interpreter. That is a pretty limited use case, and one that could perhaps be fulfilled in other ways, but the discussion also highlighted some potentially worrying trends in the way that feature ideas are handled in the Python community.

An idea

On November 16, Yoni Lavi posted his idea (in a post that is "temporarily hidden", for unclear reasons, but is still viewable) that the Python None singleton should consistently hash to the same value—even on separate runs of the interpreter. His goal is to be able to reproduce and verify successive runs of a complex application. Currently, those values are often, but not always, different:

    $ python -c "print(hash(None))"
    5928233778750
    $ python -c "print(hash(None))"
    5880283780414
For Python binaries built with address-space layout randomization (ASLR), as above, the address of the None object changes on each interpreter run. Without ASLR, a given CPython binary will repeatedly produce the same value. For CPython, the "object ID" (i.e. id()) of an object, in the general case, is its address, which is used to generate the hash() value. Those hash values are used for speeding up the lookup of dictionary entries, but they have some other effects as well.

In particular, as Lavi described, the hash value determines the order of set iteration, so reproducing program behavior from run to run is not possible when None can be a member of the set. Beyond that, other types (e.g. tuple, collections.namedtuple, and frozen dataclasses.dataclass) that might be stored in a set can get "infected" by the non-reproducibility because they contain None fields. For example:

    $ python -c "print({ (1, None), (2, 3) })"
    {(2, 3), (1, None)}
    $ python -c "print({ (1, None), (2, 3) })"
    {(1, None), (2, 3)}
Because one of the tuples contains None, the order in the set can change between runs, but the same is not true if there are only simple values in the tuples. So it is not directly a problem with hash(None) that Lavi is having; the problem lies in exactly reproducing behavior (and, presumably, output) for program validation and debugging purposes when these somewhat complicated data structures are in use.

The hashes for some simple objects are reproducible, however. Numbers always hash to their values, True and False are special (1 and 0, respectively), floating-point values have reproducible hashes, and so on. Strings used to have reproducible hashes, but that led to a denial-of-service vulnerability via hash collisions, especially for web frameworks that process lots of untrusted strings. Strings (and bytes) now have random hash values based on a seed that gets set when the interpreter is initialized; those interested in reproducible hashes can set PYTHONHASHSEED, however. But there is no equivalent for a constant hash(None).

Reaction

Steven D'Aprano cautioned that the order of elements in a set depends on more than just the hash values of its elements; it also depends on the order of operations on the set (some of which can be seen in our Python ordered set article from October). The ordering is also specific to the current CPython set implementation, which could change without warning:

Because sets are documented as being unordered, in theory the iteration order could even change from one call to another: str(a) == str(a) is not guaranteed to be true, even if the set a doesn't change from one call to the next. If it happens to be true right now, that's an accident of the implementation, not a promise made by the language.

A fixed value for hash(None) seemed "harmless enough" to Paul Moore, though he saw no reason for it to be an option; "We should either do it or not." He was a little concerned that there might be some security implication from making the change, since ASLR is done for security purposes. Lavi did not see much difference from the hash values of True and False and Petr Viktorin agreed that a constant hash(None) should not be a security problem. In fact, not leaking information about the base address being used for ASLR "might be a slight win, actually".

Lavi filed a GitHub issue and a pull request for a trivial patch that simply hardcodes the value of hash(None) to 0xbadcab1e. The issue entry describes the problem at more length:

CPython's builtin set classes, as do all other non-concurrent hash-tables, either open or closed, AFAIK, grant the user a certain stability property. Given a specific sequence of initialization and subsequent mutation (if any), and given specific inputs with certain hash values, if one were to "replay" it, the result set will be in the same observable state every time: not only have the same items (correctness), but also they would be retrieved from the set in the same order when iterated.

This property means that code that starts out with identical data, performs computations and makes decisions based on the results will behave identically between runs. For example, if based on some mathematical properties of the input, we have computed a set of N valid choices, they are given integer scores, then we pick the first choice that has maximal score. If the set guarantees the property described above, we are also guaranteed that the exact same choice will be made every time this code runs, even in case of ties. This is very helpful for reproducibility, especially in complex algorithmic code that makes a lot of combinatorial decisions of that kind.

Creating a set class that does have those properties—and maintains them between runs—is possible, of course, but there is a substantial performance penalty to doing so, Lavi said. The reproducibility he describes does not require the stricter guarantee that would come with some kind of ordered set either. Meanwhile, in a large code base using some alternate mechanism, someone could unknowingly use a regular set and reintroduce the problem all over again. While determinism between runs is a "rather niche" concern, the change he suggests is minimal and, seemingly, harmless.

But core developer Raymond Hettinger closed the issue with the explanation that it did not make sense: "The default hash for every object is its object id. There is nothing special about None in this regard." Back in the thread, Viktorin was a bit apologetic about encouraging Lavi with this idea, but noted that there are some potential downsides even to the small change suggested:

Assuming that a set with the same hashes and order of construction is relying an implementation detail that might change in the future. It doesn't need to be a change in CPython itself – some extra caching or serialization might re-create the object and ruin the determinism, because set order is not important.

Making the None hash constant would lead more people down this path. That's the external cost of the change – definitely not zero.

The thread starts to go off the rails shortly thereafter, with the same complaints (and rebuttals) being raised multiple times. As is common when threads go awry, the participants are often talking past each other. There are certainly other ways to achieve what Lavi is trying to do, but they have their own costs, both in performance and in "enforcing a non-standard idiom over an entire codebase" as he put it.

From a high-level perspective, wanting determinism between invocations of the interpreter does not seem completely unreasonable, but it would rely on sets having guaranteed behavior that the Python core developers are not willing to grant—even if the hash of None was already a constant. D'Aprano summarized the discussion shortly before the thread was closed for being argumentative. He suggested turning off ASLR or using classes that implement their own __hash__() method as a possible approach, but warned that "it is still unsafe to assume that set iteration order will be predictable"

New threads

A few days after the original thread was closed, Lavi was back on November 27 with a request to reconsider the idea and reopen the pull request. It did not really add anything new to the debate, though he did attach a lengthy document further describing the problem and some of the issues surrounding it. He had been asked to try again (apparently on Discord), so he posted it to both the forum and to the largely moribund python-dev mailing list.

On python-dev, the thread ran much of the same course as the original forum post did, though there were some surprises, perhaps. Python creator Guido van Rossum said that he would be perfectly happy to see sets have the same ordering guarantees as dicts, if there are no performance degradations that result. Furthermore:

"Sets weren't meant to be deterministic" sounds like a remnant of the old philosophy, where we said the same about dicts -- until they became deterministic without slowing down, and then everybody loved it.

Meanwhile, he noted that hashing an empty tuple (hash(())) is stable over successive runs; the arguments against making hash(None) stable "feel like rationalizations for FUD". He thinks that there is a good argument for making hash(None) constant across interpreter runs.

Part of the problem in all of the threads is confusion about what Lavi is asking for. He does not require any specific ordering for sets, not the insertion-order guarantee of dicts nor any other. The determinism he is looking for is already available in the language, even though it is not guaranteed, as long as None (or things containing it) is not used in the set. It is, in fact, hard to see how there would be any benefit to some hypothetical version of Python that did not produce the same order for set iteration if the order of operations constructing the set are the same—unless that order was being deliberately perturbed for some other reason.

Meanwhile, over in the forum thread, Moore immediately responded that he thought Lavi was wasting his own and everyone else's time. Lavi agreed and suggested the topic be deleted; that was not his last suggestion to drop the whole thing that would be ignored by others in the thread. Chris Angelico, who was sympathetic to the idea in the earlier thread, pointed out that there is something of a trend here:

Only because there's a general tendency to yell at ideas until they go away, which isn't exactly a good policy, but it's an effective way of wielding "status quo wins a stalemate" to win arguments. See the previous thread for details, or look at any of myriad other places where an idea was killed the same way.

He is referring to a blog post from Nick Coghlan entitled "Status quo wins a stalemate" that D'Aprano and others had mentioned. But, as Angelico noted, the key word is "stalemate"; "Not 'status quo wins any debate if we can just tire out everyone else'." But Moore said there was more to it than that, at least in this case; a core developer had reviewed the pull request and rejected it. Even though the change is "small and relatively [innocuous]", it was not important enough "to warrant overturning another core dev's decision" in his opinion.

Angelico further lamented the treatment of new ideas: "It's like posting an idea is treated as a gauntlet that has to be run - not of technical merit, but of pure endurance." Moore agreed that it is a problem, but the flipside is one as well:

But it's equally bad if people propose ideas with little or no justification, and no appreciation of the costs of their proposal. Making people "run the gauntlet" is a bad way of weeding out bad proposals, but we don't seem very good at finding a better way (I've seen far too many "I think it would be neat if Python did such-and-such" proposals that wasted way too much of people's time because they didn't get shut down fast enough or effectively enough).

That led to some discussion of said costs; D'Aprano posted a lengthy list, which Lavi then rebutted, but it is clear that no minds are being changed at this point. It seems unlikely that any core developer feels strongly enough to override Hettinger's rejection, but the discussion continues on. In part, that's because proposals sometimes have a different outcome after they have been discussed for "a while"—or if they periodically get resurrected until they finally succeed. Witness structural pattern matching, which had been proposed and discussed many times in different guises until it was adopted, or dict "addition", which had a similar path. As Angelico put it:

It's been proven that tenacity in the face of opposition is crucial to a proposal's success. Proven time and again with successful proposals, not just failed ones. Are you surprised, then, when one rejection doesn't kill a proposal instantly?

That's exactly my point. It's a gauntlet, not a technical discussion.

It is perhaps inconvenient for Lavi (and others), but the hash(None) "problem" is apparently not going away. It seems like there should be a straightforward "fix" or workaround for it, but wholesale changes, and subsequent enforcement of them, seem like the only way forward—at least if always running without ASLR is not an option. That is all rather unsatisfying, in truth. And the gauntlet remains for the next ideas that the community comes up with.


Index entries for this article
PythonIdeas
PythonNone
PythonSets


(Log in to post comments)

Python and hashing None

Posted Dec 1, 2022 0:42 UTC (Thu) by Fishwaldo (subscriber, #47595) [Link]

Preface: I didn’t go read any of the linked threads, so I could be way off here. If I am, apologies

But - what the? At least from reading this article, this is a simple change, no known regression, and you get “something” (deterministic ordering for None as an implementation detail) for “nothing”.

But it’s rejected because why?

And if None really does hash to the ASRL base address, I can see a bunch of “security researchers” having a new vector to explore.

Python and hashing None

Posted Dec 1, 2022 1:19 UTC (Thu) by Paf (subscriber, #91811) [Link]

I think the core objection was that it creates a behavior (and so, an API) expectation where one does not exist today, so it restricts possible future choices (or it could). It spans out in to all of the details of whether and how much that is true/matters in this case.

I’m not trying to take a position on the specifics, but that’s the abstract concern I think.

I find myself stuck a bit here! I don’t have a strong feel either direction. Or maybe it’s that I have several feels in *different* directions.

Python and hashing None

Posted Dec 4, 2022 0:52 UTC (Sun) by nybble41 (subscriber, #55106) [Link]

Yes, that does seem to be the core of the issue. My own modest suggestion would be: Go ahead and make hash(None) a constant—it's only logical that "plain old data" values which are compared based on their content should have deterministic hashes—but at the same time add explicit non-determinism to the ordering of items in sets, for example by permuting the hash of each item in the set with a per-instance constant (based on PYTHONHASHSEED if set). This way you at least have the *potential* for data structures based on hashes which provide a stable ordering when None values are present, without creating any expectation that the built-in set type is such a data structure.

Python and hashing None

Posted Dec 1, 2022 2:44 UTC (Thu) by devkev (subscriber, #74096) [Link]

I also haven't read any of the links, but I find myself wondering why PYTHONHASHSEED isn't considered an easy solution here.

It already exists precisely to solve this problem (repeatable hashing) for strings. So why not have it similarly control None's hashing? If it's unset or 0, then you get the current behaviour. If it's anything else, then hash(None) returns a fixed value (possibly based on the seed).

I'm struggling to see any downsides of this approach, especially since it's opt-in...

Python and hashing None

Posted Dec 1, 2022 5:24 UTC (Thu) by Paf (subscriber, #91811) [Link]

Yes, I’ve been wondering about that myself. It seems ideal and simple. A reasonable compromise to say the least.

Python and hashing None

Posted Dec 1, 2022 7:36 UTC (Thu) by NYKevin (subscriber, #129325) [Link]

I tend to imagine that the concern is, if we give people deterministic hash(None) this week, then next week they will ask for deterministic hash(object()) or hash(MyClass()). And that's a whole different kettle of fish.

Python and hashing None

Posted Dec 1, 2022 8:19 UTC (Thu) by geert (subscriber, #98403) [Link]

Anyone trying to implement Merkle Trees (e.g. a git-like system) using python hashes?

Python and hashing None

Posted Dec 1, 2022 9:05 UTC (Thu) by NYKevin (subscriber, #129325) [Link]

Python's hash() is (mostly) isomorphic to Java's Object.hashCode(). It is not a cryptographic hash:

* Small integers hash to themselves (except -1 due to a quirk of the API), so they have no preimage resistance.
* hash(-1) == -2 == hash(-2), so that's a free collision you can put anywhere that negative integers are valid.
* Strings hash nondeterministically unless you set a command line flag or environment variable.
* Most (other) objects hash nondeterministically if ASLR is enabled.
* Even if ASLR is disabled, hashing is dependent on memory layout, which probably varies with operating environment.
* hash() has not been cryptographically studied and probably has very weak or nonexistent collision and/or preimage resistance in general.
* The purpose of hash() is to make efficient hash tables. Using it for any other purpose is not guaranteed to do anything useful.

As a consequence, using hash() to implement Merkle trees is unlikely to accomplish much of value, since you will not have meaningful tamper resistance, and they can't even be serialized out to disk (or network) because of the nondeterminism (it wouldn't validate on another run or machine because of all the nondeterminism).

Python and hashing None

Posted Dec 1, 2022 12:57 UTC (Thu) by devkev (subscriber, #74096) [Link]

It is a whole different kettle of fish, because None is a special singleton with special semantics, whereas object() and MyClass() aren't. I don't see how "but you did it for None last week" would be a compelling argument for "please do it for object() now".

Also, the MyClass case is trivial, because it can easily have its own stable implementation of __hash__(), if that's what its author wants. This approach could be a solution to the original problem, except that converting all occurrences of None to a special StableNone (with the desired __hash__() behaviour) isn't really feasible (and it's not possible to override the __hash__() for None itself, AFAIK).

Python and hashing None

Posted Dec 1, 2022 13:36 UTC (Thu) by hkario (subscriber, #94864) [Link]

Yup, object() is differnt, and having id(object()) != id(object()) is not only very much intentional, it's necessary for hashes to work in python, those are different and separate objects being compared, so they really should have different ids.

Python and hashing None

Posted Dec 1, 2022 18:54 UTC (Thu) by dtlin (subscriber, #36537) [Link]

With garbage collection,
id(object()) == id(object())
is entirely possible, since the first object is unreferenced, so it may be destroyed before the second object is created, and the second object could be allocated at same address.

Python and hashing None

Posted Dec 1, 2022 19:20 UTC (Thu) by hkario (subscriber, #94864) [Link]

That's why I wrote "should have different hashes" and not "must have different hashes".
For hash tables to work objects need to have hash values unique enough, not globally unique.

Python and hashing None

Posted Dec 1, 2022 18:03 UTC (Thu) by NYKevin (subscriber, #129325) [Link]

None is actually not all that special. There are only two differences between None and object():

* None has a special name and repr().
* None is falsey instead of truthy.

Everything else is just a convention. Why add the behavior of hash() to that very short list?

Python and hashing None

Posted Dec 2, 2022 1:20 UTC (Fri) by devkev (subscriber, #74096) [Link]

The specialness is indeed mostly because of the strong usage conventions - None is a well-known singleton, with ecosystem-wide expectations on how it should be used.

Anyone who wants stable-None-hashing behaviour could certainly create and use an alternative 'StableNone' in their code. But the strong conventions means it's easy to forget to use it, and accidentally use None instead.

It's also a problem if your code needs to work with other code that use None in the normal way (eg. libraries, and/or your code is a library). eg. you might need stuff like this throughout your code:

    y = somethingThatUsesNoneNormally(x if x is not StableNone else None)
    if y is None:
        y = StableNone
Yes, you could wrap/hide these in helper functions, but you still have to remember to use them everywhere they need to be, aka "sprinkle the magic pixie dust", ie. this is only marginally better:
    y = fixNone(somethingThatUsesNoneNormally(fixNone(x))
The question then becomes, why should the application programmer carry this burden? Why do they require a separate StableNone, when it serves the same purpose as None? Why can't the programmer opt-in to tweaked None hashing? There's already a lever to tweak hashing behaviour when warranted.

Python and hashing None

Posted Dec 3, 2022 7:10 UTC (Sat) by NYKevin (subscriber, #129325) [Link]

There are a lot of built-in objects that inherit object.__hash__():

* Module objects
* Function objects
* Bound methods
* Various technical types (such as code objects, frame objects, etc.)

There are also a ton of such classes defined throughout the standard library (and which the average user is much more likely to run into). For example, as far as I can tell, all objects that you can obtain from the sqlite3 module inherit object.__hash__() (e.g. connections, cursors, etc.).

If you really want predictable hashing, you cannot stop at None. You have to monkey-patch object.__hash__() *anyway*, so you might as well just do that and stop asking upstream to modify the language, which never promised predictable hashing in the first place.

Python and hashing None

Posted Dec 3, 2022 10:42 UTC (Sat) by devkev (subscriber, #74096) [Link]

I think we will have to agree to disagree.

I'm happy to explain to anyone who is looking for consistent hashing of arbitrary objects - let alone consistent ordering of arbitrary objects in _unordered_ sets - that that expectation is unreasonable.

But, although None is technically an object, I don't think it's right to consider it an _arbitrary_ object, on an equal footing with a module object or an sqlite3 cursor object. Since it serves the purpose of a nil/null placeholder value, which can be used in place of anything else, I think it's much more productive to consider it as a special kind of basic type, eg. more in line with bool than object. In fact None really is a lot like True and False (which do have fixed hashes), except it's "unary" with only one possible value, rather than two binary/boolean values.

Based on all this I think it's totally fine to consider None's hashing needs completely independently from the hashing needs of arbitrary objects.

Python and hashing None

Posted Dec 4, 2022 1:45 UTC (Sun) by NYKevin (subscriber, #129325) [Link]

> I'm happy to explain to anyone who is looking for consistent hashing of arbitrary objects - let alone consistent ordering of arbitrary objects in _unordered_ sets - that that expectation is unreasonable.

It is my position that this expectation is unreasonable with respect to all objects, other than small integers (which have identity hashing for performance reasons) and other numerical types. True and False have fixed hashes because True == 1 and False == 0, equal objects must hash to the same value, and all small integers hash to themselves (except -1), so it follows that True and False must also hash to 1 and 0 respectively. Larger integers and non-NaN floats are hashed in a more conventional fashion, and the hashing algorithm is defined in such a way that it reduces to an identity function for small integers, rather than having some arbitrary cutoff where an entirely different algorithm is invoked. See [1] for a full explanation of how this actually works. IMHO it might make sense to add some randomized salt to the numerical hash, but as far as I'm aware, nobody has actually managed to exploit it. You'd probably have to do some additional work to ensure that the salt is different depending on the width of the number (or else you'd have exactly the same overall set of hash collisions within the universe of ints), so that's probably not worth the complexity.

All other commonly-encountered objects, including str and all NaN-valued floats, either hash to an unpredictable value, hash to a value based on the hashes of their constituent parts (e.g. tuples), or do not hash at all. There is a weird environment variable to allow consistent hashing of str, solely because it accidentally happened to have consistent hashing once upon a time and a small number of applications may have unwisely depended on that functionality. With None, there is no compelling backcompat story, so there's no reason to extend this to any other type, including NoneType. There's also no way to prevent NaN hashing from being pointer-based, and IMHO if you're going to include None, you should at least be including that as well.

[1]: https://github.com/python/cpython/blob/f4c03484da59049eb6...

Python and hashing None

Posted Dec 4, 2022 1:52 UTC (Sun) by NYKevin (subscriber, #129325) [Link]

> IMHO it might make sense to add some randomized salt to the numerical hash, but as far as I'm aware, nobody has actually managed to exploit it. You'd probably have to do some additional work to ensure that the salt is different depending on the width of the number (or else you'd have exactly the same overall set of hash collisions within the universe of ints), so that's probably not worth the complexity.

It should also be noted that this would make it more difficult for end users to implement custom numerical types, because those are required to use the same hashing algorithm as the standard library, for compatibility reasons. So it is probably infeasible anyway.

Python and hashing None

Posted Dec 4, 2022 12:23 UTC (Sun) by SLi (subscriber, #53131) [Link]

It seems to me that what Lavi is asking for here is nothing that could be called a "guarantee" that holds between any different releases of Python, for example. I bet he would be reasonably happy if the hash changed once a week.

The stated use case seems to be "I ran a script and it broke; if I add a print to debug it, it behaves differently, because Python makes this harder for absolutely no benefit". And None _is_ a very common value for e.g. dataclasses; it even has its own Optional mypy type because of this. It seems mildly silly that optional types do not have deterministic hashes, but the major point here, IMO, is not the determinism of the hash but just not making things harder than they should be for no benefit. A bit like when you specify that a device should be able to withstand certain kind of electric spikes, you usually still won't go and add them to the input just because.

Python and hashing None

Posted Dec 1, 2022 14:32 UTC (Thu) by epa (subscriber, #39769) [Link]

It seems like the sort of thing that would have been handled much better under the old benevolent dictator model.

Python and hashing None

Posted Dec 1, 2022 9:28 UTC (Thu) by ballombe (subscriber, #9523) [Link]

If you do not like ASLR during test, you can disable it locally with 'setarch -R'.
This seems much simpler than changing python.

Python and hashing None

Posted Dec 1, 2022 18:28 UTC (Thu) by Nahor (subscriber, #51583) [Link]

As stated in the article:

>He suggested turning off ASLR [...], but warned that "it is still unsafe to assume that set iteration order will be predictable"

The fact that it currently works without ASLR is just a "happy accident", and there is no guarantee that it will remain true in the future.

I would also add that this workaround does work when the goal is to reproduce a bug report, unless the reporter also had ASLR disabled.

Python and hashing None

Posted Dec 1, 2022 18:40 UTC (Thu) by sbelmon (subscriber, #55438) [Link]

I'm surprised the discussion doesn't center more on security.

If the inputs are under attacker control (e.g., queries from the network), then any predictable/stable hash immediately leads to a DoS attack vector; just create zillions of things that hash to the same entry, and the hash table degrades to a link list or whatever.

Yes, there are languages like Java where the design bakes this in (Object.hashCode()), and that's broadly recognized as a bad idea.

Python and hashing None

Posted Dec 1, 2022 21:34 UTC (Thu) by rra (subscriber, #99804) [Link]

This is a bit irrelevant to having a stable hash for None, though, I think. That would apply to a stable hash for something like strings, but the whole point of a set is that an object can only be in there once, so fixing the hash for None cannot create the problem you describe.

Python and hashing None

Posted Dec 18, 2022 12:45 UTC (Sun) by sammythesnake (guest, #17693) [Link]

To agree with you, None hashes stably *within* a process, so it's a collision already if your crafted input can fill a hash table with None values - making that hash stable across processes wouldn't really make any difference there.

To *disagree* with you, if you know the hash of None and the algorithm for hashing a string, you could potentially craft *strings* that hash to the same value, potentially with interesting consequences.

Ironically, that would include making a hash-ordered collection (that includes stable-hashed-None) suddenly become non deterministic because None and various strings could be interleaved non-deterministically...

IMVHO (I don't really use python deeply enough to be affected by this) it would seem that making the hash of None depend only on PYTHONHASHSEED would make everyone happy (except those that object in principle - colour me indifferent) It's unstable by default, but stable for those who want it to be with no performance cost...

Python and hashing None

Posted Dec 2, 2022 19:29 UTC (Fri) by lucaswiman (subscriber, #157802) [Link]

`None` having a fixed hash would not allow a DOS attack, because of hash randomization. Even if you knew the hash of `None` in a set, you couldn't generate strings which would reliably collide with that hash. And you cannot generate a bunch of non-equal `None` values since it's a singleton. If you could generate strings which collide with a given hash, that would itself be a security problem since you could execute the same DOS attack regardless of the hash value of `None`.

That feature can be disabled for testing purposes with the PYTHONHASHSEED environment variable, but no one should be processing untrusted input with hash randomization turned off.

Python and hashing None

Posted Dec 2, 2022 20:15 UTC (Fri) by sbelmon (subscriber, #55438) [Link]

Good! Thanks for the info. I was under the impression that the overall goal went beyond None, but I must have misread :-)

Python and hashing None

Posted Dec 2, 2022 22:58 UTC (Fri) by floppus (guest, #137245) [Link]

That's true (though Python's design is very similar to Java's - I'm not sure if you're trying to draw a distinction there.)

But int, float, tuple, frozenset, and countless other built-in and non-built-in classes are already using deterministic hash functions. If there's a security issue, then tuples of integers having fully deterministic hashes is much more of an issue than whether None has different hashes in different Python processes.

I'd argue that if hash-collision-based DoS is a concern, then that's something that's most easily addressed by the consumers of __hash__() rather than by the producers. Let int, float, Decimal, None, Ellipsis, and all the rest return deterministic values, and let dict and set and tuple be responsible for mixing in salt if the application requires it.

Python and hashing None

Posted Dec 4, 2022 5:04 UTC (Sun) by NYKevin (subscriber, #129325) [Link]

> I'd argue that if hash-collision-based DoS is a concern, then that's something that's most easily addressed by the consumers of __hash__() rather than by the producers. Let int, float, Decimal, None, Ellipsis, and all the rest return deterministic values, and let dict and set and tuple be responsible for mixing in salt if the application requires it.

I can't think of a sane way to do that.

If you hash the value first, and then apply a salt afterwards, then your salt changes the value of the hash, but it does not change which values collide with each other. That is, if hash(x) == hash(y), then salt(hash(x)) == salt(hash(y)), and you have exactly the same DoS as before. So we can't apply salt to the output of hash(). We must salt the input.

But how do we do that? We don't know the type of the object to be hashed (dict and set are generic containers that are compatible with any type of element), and even if we did, it might not be straightforward to construct a "salted" version of it. At first, I thought we could do hash((x, salt)), but that doesn't actually work, because the tuple __hash__() implementation just hashes each of its elements and mixes the hash values together, so you would have the same problem as salt(hash(x)).

You could hard-code some kind of salting just for str, but at that point, it makes more sense to modify str.__hash__() instead (which is exactly the path that CPython chose to take).

Python and hashing None

Posted Dec 1, 2022 19:45 UTC (Thu) by xi0n (subscriber, #138144) [Link]

Given the galloping feature creep in the recent versions of Python, one could be forgiven for thinking that the so-called “gauntlet” may not be arduous enough, actually.

Python and hashing None

Posted Dec 2, 2022 10:04 UTC (Fri) by andrewsh (subscriber, #71043) [Link]

Now, on to the next issue: hashing NaN.

Python and hashing None

Posted Dec 2, 2022 19:33 UTC (Fri) by lucaswiman (subscriber, #157802) [Link]

I think all of the millions of nan values have consistent hash values when hash randomization is turned off.

Python and hashing None

Posted Dec 4, 2022 2:05 UTC (Sun) by NYKevin (subscriber, #129325) [Link]

Nope: https://github.com/python/cpython/blob/f4c03484da59049eb6...

For those that can't or don't want to read code on GitHub: It checks whether the value is finite, and then whether it is infinite, and if it's neither (i.e. if it's a NaN), it calls _Py_HashPointer() and returns the result as the hash. So if you have ASLR enabled, your NaNs are unpredictable, and there's no flag or environment variable you can pass to change that.

Python and hashing None

Posted Dec 9, 2022 19:46 UTC (Fri) by yonillasky (guest, #162612) [Link]

Raymond changed it to work this way fairly recently:
https://github.com/python/cpython/pull/25493/commits/b9cd...

The rationale seems to be that NaN != NaN, so you may have many NaN keys in a set or dict, and end up with quadratic performance that way. Given a constant hash, they would all collide.


Copyright © 2022, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds