This paper by Martin Farach-Colton, Andrew Krapivin and William Kuszmaul shook the computer science interwebs last year as it proposed a much faster approach to hash maps. An important disclaimer: unless you are far better versed in data structures than I am, you’re forgiven for believing that this paper revolutionises mainstream hash map implementations like the one Java uses. My understanding is that it doesn’t. The important detail here is “open addressing” (I was today years old when I learned about this). The hash maps I grew up with are arrays of lists: a key is hashed to an array index, and then a linear search through the list at that index finds (or doesn’t find) the value associated with the key. Not so hash maps with open addressing! This peculiar sub-species uses a hash function just like “old” hash maps, but stores a <key, value> pair directly in the hashed array slot. In case of a collision, the algorithm moves on to an (algorithmically dictated) alternative slot.
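To make the distinction concrete, here is a toy open-addressing map using the simplest possible collision strategy, linear probing (step to the next slot until a free one turns up). This is a minimal sketch for illustration only, not the paper’s algorithm, and the class name is my own:

```python
# Toy open-addressing hash map with linear probing. On a collision we
# walk to the next slot instead of appending to a per-slot list.
class LinearProbingMap:
    def __init__(self, capacity=8):
        # Each slot holds a (key, value) pair or None. The table must
        # always keep at least one empty slot, or probing would loop forever.
        self.slots = [None] * capacity

    def _probe(self, key):
        # Start at the hashed index; on collision, step forward (wrapping
        # around) until we hit the key itself or an empty slot.
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        return i

    def put(self, key, value):
        self.slots[self._probe(key)] = (key, value)

    def get(self, key):
        slot = self.slots[self._probe(key)]
        return slot[1] if slot else None
```

A chaining map would instead keep a list in each slot; here the probe walk itself *is* the collision handling, which is why the walk length (and hence the probing strategy) is the whole game.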
Apparently there was a long-standing conjecture about the computational efficiency of open addressing hash maps. This paper disproves it, establishes a much lower bound, and describes an algorithm that can serve as the basis for a reference implementation. Previously, as the hash array filled up, one would have to take longer and longer walks until a free slot was found. The computational complexity was Θ(1/δ): the “big theta” denotes a tight bound, meaning more or less “exactly”, and δ is the fraction of slots still empty. The new algorithm clocks in at O(log(1/δ)), which is an upper bound. Quite impressive!
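To get a feel for the gap between the two bounds, here is a quick back-of-the-envelope comparison for increasingly full tables. The constant factors hidden in Θ(·) and O(·) are ignored, so these are shapes, not exact probe counts:

```python
import math

# Compare the old tight bound 1/δ with the new upper bound log(1/δ)
# for a table that is (1 - δ) full. Base-2 log chosen arbitrarily;
# the O(·) hides the base anyway.
for delta in (0.5, 0.1, 0.01, 0.001):
    old = 1 / delta
    new = math.log2(1 / delta)
    print(f"{1 - delta:>6.1%} full: 1/δ = {old:6.0f}   log(1/δ) ≈ {new:5.1f}")
```

At 99.9% full, the old bound suggests on the order of 1000 probes while the new one suggests around 10: the fuller the table, the more dramatic the improvement.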
Two methods are presented for finding an empty slot / the value associated with a key: elastic hashing and funnel hashing. I found them similar enough in the big picture (and probably didn’t understand the differences sufficiently) that I’ll describe them together. The basic idea is to split the array into logical sub-arrays of varying sizes. Imagine a parking garage with multiple floors: the higher up you go, the fewer parking slots each floor has. The probing algorithm starts at the ground floor and tries to find a slot there. Counter-intuitively, the more slots are free on a floor, the pickier the algorithm becomes. When many slots are free, it statistically expects to find a free slot quickly; if that fails, the hash function evidently isn’t working well for this key, so the algorithm abandons the current floor and moves on to the next one. This pattern continues for a while, possibly up several more floors, but at some point the algorithm is fed up, drops back to the ground floor and semi-naively tries to find the next best slot. A couple of important qualifications: first, the maths works out so that lower floors fill up first, while upper floors, despite being smaller, tend towards lower occupancy rates. Second, only the elastic hashing method falls back to lower floors; the funnel hashing method just keeps moving up.
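The parking-garage picture can be sketched in code. This is a deliberately simplified, funnel-hashing-flavoured illustration: the floor sizes, the probe budget per floor, and all the names are my own choices, not the paper’s exact construction.

```python
# Simplified multi-level probing in the spirit of funnel hashing:
# geometrically shrinking "floors", a few probe attempts per floor,
# then give up on that floor and move up.

def make_floors(capacity, levels=4):
    # Split the capacity into floors of roughly halving size,
    # e.g. 64 -> floors of 32, 16, 8 and 8 slots.
    floors, remaining = [], capacity
    for _ in range(levels - 1):
        size = remaining // 2
        floors.append([None] * size)
        remaining -= size
    floors.append([None] * remaining)
    return floors

def insert(floors, key, value, probes_per_floor=3):
    for level, floor in enumerate(floors):
        for attempt in range(probes_per_floor):
            # Re-hash per (level, attempt) so each probe lands elsewhere.
            i = hash((key, level, attempt)) % len(floor)
            if floor[i] is None:
                floor[i] = (key, value)
                return (level, i)
    # A real implementation has a fallback here; the toy just bails out.
    raise RuntimeError("all floors exhausted")
```

Because each floor only grants a fixed probe budget before the search escalates, no single insertion can degenerate into the long linear walk that plagued classic open addressing.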
Elastic vs funnel hashing
Elastic hashing is the “star” of the paper because its average-case performance hits the theoretical optimum.
| Metric | Complexity | What it means |
|---|---|---|
| Amortised expected | O(1) | If you fill the entire table, the average work per item is a small constant (like 2 or 3 probes). |
| Worst-case expected | O(log δ⁻¹) | Even for the unluckiest item in a 99% full table, you’ll only check about log(100) ≈ 7 spots. |
| Insertion time | O(log δ⁻¹) | The time to put an item in grows logarithmically in 1/δ, i.e. in how full the table is. |
Funnel hashing is slightly slower on average than elastic hashing, but it is much more consistent and easier to implement. It doesn’t quite hit the O(1) amortised mark, but it stays very close.
| Metric | Complexity | What it means |
|---|---|---|
| Amortised expected | O(log δ⁻¹) | The average search is logarithmic. If the table is 99% full, expect ~7 probes. |
| Worst-case | O(log² δ⁻¹) | With very high probability, no single search will take more than roughly (log 100)² probes. |
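The “~7 probes” figures in both tables come from plugging δ = 0.01 (a 99% full table) into the bounds. Made concrete (base-2 log; the constants hidden in the O(·) mean these are ballpark figures, not guarantees):

```python
import math

# For a 99% full table, δ = 0.01:
delta = 0.01
expected = math.log2(1 / delta)   # funnel hashing, amortised expected
worst = expected ** 2             # funnel hashing, worst case
print(f"expected ≈ {expected:.1f} probes, worst-case ≈ {worst:.0f} probes")
```

So funnel hashing trades a slightly worse average (≈7 probes instead of a small constant) for a strong worst-case ceiling of a few dozen probes, even at 99% occupancy.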