This article is interesting but... weird. It seems to assume a bunch of things, and then states some conclusions based on that, but then also assumes the value of the conclusions is obvious.
> A perfect hash function is one that is collision-free. By implication, the hash must be at least as many bytes as the key and the function is theoretically reversible, though not always tractably so.
This is true because of the pigeonhole principle, but most people won't assume that the key size is equivalent to the domain size. Usually when using perfect hashing, you are hashing (much) fewer elements than the total range a key can represent.
A perfect hash function that uniquely assigns hash values to the eight items you need to store, but gives you back integers anywhere in the 32 bit range isn't super helpful. Or, at least, it's not obvious to me why it would be. They do state the algorithm can be scaled down to 8 or 16 bits, so I guess there's that.
> Because the hash is no smaller than the key, the primary use case is randomizing small values like integral types.
This just doesn't make sense to me. The key can be much larger than the hash, even with perfect hashing, as long as the number of keys in your data set is smaller than the maximum hash value.
The author says "primary use case" but what they really mean is that this isn't what most people would consider a perfect hash at all. It's just an integer permuter. In other words, here is an even faster implementation of what the author would describe as a perfect hash over small integer keys:
uint32_t hash(uint32_t key) { return key; }
This also guarantees there are no collisions so is a "perfect hash". It's just not a very useful one.
The author assumes the reader knows the value of randomizing your integer keys. I'm not sure what that value is in cases where you know you don't have collisions anyway.
> The author assumes the reader knows the value of randomizing your integer keys. I'm not sure what that value is in cases where you know you don't have collisions anyway.
If you're going to be doing writes of your data in order of an integer sequence number, and you're writing to a multi-node storage system that partitions/shards by primary-key range, then you're going to get worst-case performance if you use your integer sequence numbers directly as the primary keys, since you're queuing all your writes onto one shard.
Instead, you want to permute your integer sequence numbers so that successive integers become evenly spread across the range partitions. This (along with some write duplication and verify-on-read) is the foundation of Dynamo-based "scale-free" storage systems (DynamoDB+S3, Riak+CS, etc.)
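To make the permute-before-sharding idea concrete, here is a minimal sketch. The constant and the shard count are illustrative (the constant is xxHash's PRIME32_1), not what any particular Dynamo-style system actually uses; the point is just that any odd multiplier is bijective mod 2^32, so sequence numbers never collide but do get spread across the key range.

```c
#include <stdint.h>

/* Spread sequential IDs across range-partitioned shards by multiplying
 * with an odd constant. Odd multipliers are bijective mod 2^32, so no
 * two sequence numbers ever map to the same key. */
uint32_t spread(uint32_t seq) {
    return seq * 0x9E3779B1u;   /* illustrative odd constant (xxHash PRIME32_1) */
}

/* With 16 range partitions, the shard is just the top 4 bits. */
uint32_t shard_of(uint32_t seq) {
    return spread(seq) >> 28;
}
```

Consecutive sequence numbers now land on different shards instead of queuing on one.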
It is trivial to generate a mostly-randomized bijective mapping of small integers to small integers with an xor and a little byte swapping, which is what the author does with a built-in AES instruction.
The OP's point is that this isn't a particularly interesting problem. The interesting problem is generating efficient bijective mappings of large integers to very small integers.
I wasn't attempting to argue the OP's point, merely answering the question posed: "what [is the] value [of randomizing your integer keys] in cases where you know you don't have collisions."
The identity function is a perfect hash function, no quotes needed. Several programming languages use it as the default function for hashing integers. But it doesn't have good avalanche behavior, which is important for some use cases; a hash with strong avalanche could, for example, be used to make guessing URLs harder.
Yeah the terminology is a bit misleading, it would be better to call this a "bijective" or "invertible" integer hash function, or mixer.
It's really nice work though! Invertible mixers are usually building blocks for more general hash functions; for example, MurmurHash3 uses a 64-bit mixer of this kind.
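For reference, this is MurmurHash3's 64-bit finalizer (fmix64) as it appears in the reference implementation: an invertible mixer built from xorshifts and multiplies by odd constants.

```c
#include <stdint.h>

/* MurmurHash3's fmix64 finalizer. Every step is invertible (xorshift
 * and odd-constant multiply), so the whole function is a bijection
 * on 64-bit integers. */
uint64_t fmix64(uint64_t k) {
    k ^= k >> 33;
    k *= 0xff51afd7ed558ccdULL;
    k ^= k >> 33;
    k *= 0xc4ceb9fe1a85ec53ULL;
    k ^= k >> 33;
    return k;
}
```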
A difference is that most common mixers are not collision-free. For some applications of runtime hashing using ordinary hash functions, where pseudo-randomness is important regardless, being able to prove the hash function is collision-free enables useful optimizations without any prior knowledge of the keys to be hashed.
The genesis of this was database internals, which have many use cases for a strongly randomized bijective function over integral types. It is used in places where most people would use an ordinary hash function.
This particular hash is not trivially invertible, since internal AES state necessary to reverse it is discarded.
It should be obvious here that the domain is e.g. integral types on computers. The term is used in many places to denote collision-free quasi-randomization, which technically is a strict subset of perfect hash functions, and I chose it to align with that common usage.
> A perfect hash function is one that is collision-free. By implication, the hash must be at least as many bytes as the key
I don't understand this lemma. An indexed array is a degenerate example in which the "hash" (pointer) is presumably a lot smaller than the key (index).
BTW, for those not familiar with perfect hash functions: they are great for things like symbol tables that are immutable at runtime. You can make a very fast lookup that requires no probing (bucket size is always one), which also eliminates a lot of implementation complexity in both code and data structures.
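A minimal sketch of what generating such a table at build time can look like. The keyword set, table size, and polynomial hash here are all toy choices for illustration; real generators like gperf are far more sophisticated, but the shape is the same: search for parameters until the fixed key set maps collision-free.

```c
/* Toy build-time perfect hash search: brute-force a multiplier for a
 * polynomial string hash until it is collision-free on a fixed key set. */
static const char *KEYS[] = {"if", "else", "while", "for", "return"};
enum { NKEYS = 5, TABLE = 16 };

static unsigned poly_hash(const char *s, unsigned a) {
    unsigned h = 0;
    for (; *s; s++)
        h = h * a + (unsigned char)*s;
    return h % TABLE;
}

/* Returns a multiplier that gives NKEYS distinct slots, or 0 if the
 * search range is exhausted. */
unsigned find_perfect(void) {
    for (unsigned a = 1; a < 10000; a++) {
        int used[TABLE] = {0}, ok = 1;
        for (int i = 0; i < NKEYS; i++) {
            unsigned h = poly_hash(KEYS[i], a);
            if (used[h]) { ok = 0; break; }
            used[h] = 1;
        }
        if (ok) return a;
    }
    return 0;
}
```

At runtime the lookup is then a single hash plus one string compare, with no probing.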
> An indexed array is a degenerate example in which the "hash" (pointer) is presumably a lot smaller than the key (index).
No, it isn't.
The "0" in arr[0] is (in most cases) a 32-bit signed integer. If you're using a signed 32-bit int to index your array, you're using a hash length of 32 bits. You won't be able to index more than 2.1 billion unique items without changing your data type. Even if you're not using all those bits, you're still using 32 bits to index them. The fact that you're storing something that may have more than 32 bits of information is irrelevant, because your array (a naive hash) doesn't consider them real values.
If you want to store a set of N elements in an array or perfectly hash a set of N elements, then the set of all possible indexes or hashes must have at least N elements.
It's the lemma I challenge, not the common case on certain of today's architectures. The use of the word "must" implies ∀, which is plainly not the case.
A really common pattern in many programs is to associate metadata with something using a unique key.
In cases where you can't associate that metadata directly using a pointer (by embedding the pointer in a structure used as the key) or a direct index into a metadata array, a perfect hash generated at build time is about as good as you're going to get.
The other thing is that often this is easy to bolt on after the fact. As an example, in network devices it is common for a syslog mnemonic to be associated with an instance of a rate limiter. A good solution is to compile your log messages using an errmsg definition and compiler tool (which also allows you to build documentation) but this requires a lot of up-front planning and build orchestration and makes the code awkward in some circumstances. An alternative is that you allow people to declare their mnemonic (and, for example, rate limit) in-place and then allow the logger library to use that to look up other properties that are joined from other sources (if there are any) and glean the structures that need to be created.
Since extracting the full set of mnemonics is necessary anyway (for tripwires, for documentation, ...) you can now build the hash function after the fact. This definitely simplifies the build considerations (and avoids unnecessary rebuilds when, for example, only documentation changes).
The after-the-factness-but-with-low-overhead part is what’s nice about it. Yes, it’s better to structure things up front to make the system optimal but it’s nice to be able to get close.
Actually, I might have misunderstood you. In your example, the index in the array is treated like the hash of the item you've put in there, why are we suddenly considering pointers?
Coincidentally, I was just testing an actual perfect hash function generator https://github.com/rizkg/BBHash. It requires 2-3 bits per entry irrespective of key or table size. They show that they can build the table for 10^12 entries.
One problem is that identifying keys that aren't in the table can be as expensive as storing all the keys. I was thinking that it would be interesting to build two tables with different hash functions for the same set of keys and then record the mapping between them. This would double the table size and add n log n bits to record the relationship, but it would let you make probabilistic (but pretty confident) estimates of what keys weren't in the table, all with a fixed number of bits per key.
To clarify, if you map a key into the table with both functions, by construction it will map to a position in each table that are linked together. If a key maps to unlinked positions in each table then we know with probability ~(1-1/n) that the key isn't in the input to the table. When keys are big and n is big you could be very confident in this estimate and save a huge amount of space over typical structures to do this.
> The algorithms presented below are much faster than most non-perfect hash functions on recent CPUs.
This is a very vague statement. I would like to see some numbers, as well as real-world use cases.
Last month there was a thread about perfect hashing in the postgres parser, but I wonder if a DFA / trie is better, since you don't have to always look at every byte.
This is a nice simple mixing function on newer Intel chips.
But since the XOR doesn't increase entropy, I think you can save a cacheline load of the 0xDEADBEEF constant by passing _mm_setzero_si128() instead of _mm_set1_epi32(0xDEADBEEF).
A good blog post about xor-shift-multiply reversible hash functions: https://nullprogram.com/blog/2018/07/31/
The inverse functions are easily found by computing the modular inverses of the odd multiplicative factors.
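Concretely, the modular inverse of an odd 64-bit multiplier can be computed with a few Newton/Hensel iterations; each step doubles the number of correct low bits, so five iterations suffice for 64 bits.

```c
#include <stdint.h>

/* Modular inverse of an odd 64-bit constant, mod 2^64. Starting guess
 * x = a is correct to 3 bits (a*a ≡ 1 mod 8 for odd a); each Newton
 * step x' = x*(2 - a*x) doubles the correct bits: 3→6→12→24→48→96. */
uint64_t inv64(uint64_t a) {
    uint64_t x = a;
    for (int i = 0; i < 5; i++)
        x *= 2 - a * x;
    return x;
}
```

With this, the inverse of an xorshift-multiply step is just a multiply by `inv64(c)` (plus undoing the xorshift).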
Real perfect hashing is more about hashing a non-contiguous set of elements with no collisions, unlike the set of all integers that fit in n bits.
A different problem is Minimal Perfect Hashing. In addition to being injective, the function should have a minimal image of the input set: the set of keys is mapped to consecutive integers without vacant slots.
The implementations in optimal space are impractical, but some libraries construct such functions with a bit more memory (2-3 bits/key), like emphf or BBHash (both on GitHub). The latter uses a series of 1-hash Bloom filters where keys that collide on one layer are removed to be handled by deeper layers. The overall rank() of the bit where a key ultimately lands gives its hash.
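A toy sketch of that layered construction (illustrative only — the real BBHash library has a different API, a compact rank structure, and tuned hashing; the mixer, sizes, and names here are made up). Keys that land alone at a layer get their bit set; keys that collide fall through to the next layer, and a key's final hash is the rank of its bit across all layers.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define MAXN   64   /* illustrative capacity: at most MAXN keys */
#define LEVELS 32

/* Made-up level-seeded mixer; stands in for BBHash's seeded hash. */
static uint32_t mix(uint32_t k, uint32_t level) {
    uint32_t h = k * 0x9E3779B1u + level * 0x85EBCA77u;
    h ^= h >> 16; h *= 0x7FEB352Du; h ^= h >> 15;
    return h;
}

typedef struct {
    uint8_t  bits[LEVELS][2 * MAXN];  /* gamma = 2 slots per remaining key */
    uint32_t m[LEVELS];               /* layer sizes */
} Mph;

void mph_build(Mph *f, const uint32_t *keys, size_t n) {
    uint32_t rem[MAXN], nxt[MAXN];
    size_t nrem = n;                  /* assumes n <= MAXN */
    memset(f, 0, sizeof *f);
    memcpy(rem, keys, n * sizeof *keys);
    for (uint32_t lvl = 0; lvl < LEVELS && nrem; lvl++) {
        uint32_t m = 2 * (uint32_t)nrem;
        uint8_t count[2 * MAXN] = {0};
        size_t nn = 0;
        f->m[lvl] = m;
        for (size_t i = 0; i < nrem; i++) {
            uint32_t p = mix(rem[i], lvl) % m;
            if (count[p] < 2) count[p]++;
        }
        for (size_t i = 0; i < nrem; i++) {
            uint32_t p = mix(rem[i], lvl) % m;
            if (count[p] == 1) f->bits[lvl][p] = 1;  /* landed alone: done */
            else nxt[nn++] = rem[i];                 /* collided: next layer */
        }
        memcpy(rem, nxt, nn * sizeof *nxt);
        nrem = nn;
    }
}

/* Hash = rank of the key's set bit over all layers; -1 if unresolved.
 * (A linear rank scan here; the real thing uses a succinct rank index.) */
int mph_query(const Mph *f, uint32_t key) {
    int rank = 0;
    for (uint32_t lvl = 0; lvl < LEVELS && f->m[lvl]; lvl++) {
        uint32_t p = mix(key, lvl) % f->m[lvl];
        if (f->bits[lvl][p]) {
            for (uint32_t j = 0; j < p; j++) rank += f->bits[lvl][j];
            return rank;
        }
        for (uint32_t j = 0; j < f->m[lvl]; j++) rank += f->bits[lvl][j];
    }
    return -1;
}
```

Note that, as discussed above, querying a key that was never inserted can still return a slot — the structure cannot tell foreign keys apart without storing the keys themselves.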
As pointed out by some of you, this looks more like permutation of sequences. Also, since a symmetric cipher is used, I'm surprised the author didn't mention Black and Rogaway [1].
Their algorithm is a permutation (int -> int) that works on a domain of any size up to a limit. A typical application for this is encrypting credit card numbers so that the ciphertext still looks like a credit card number (non-trivial because the size of domain isn't a power of two) or efficiently shuffling sequences, randomly in appearance but repeatably if you know the seed.
For instance, this is used by Masscan to randomize the order in which IP addresses and ports are scanned [2]. I've built a Python package that could help you use this algorithm [3] (mostly for fun, but maybe that's useful, let me know :)).
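A sketch of the cycle-walking trick at the heart of the Black–Rogaway construction (the inner permutation here is a toy affine map, not AES or a real cipher): to restrict a permutation on a power-of-two range to an arbitrary domain [0, n), keep re-applying it until the value lands back inside the domain. Because P is a permutation and the walk starts inside [0, n), it always terminates and the restriction is itself a bijection on [0, n).

```c
#include <stdint.h>

typedef uint32_t (*perm_fn)(uint32_t);

/* Restrict permutation P on [0, 2^k) to [0, n): walk the cycle until
 * we land back inside the target domain. Terminates because the cycle
 * through P must eventually return to x, which is < n. */
uint32_t cycle_walk(uint32_t x, uint32_t n, perm_fn P) {
    uint32_t y = P(x);
    while (y >= n)
        y = P(y);
    return y;
}

/* Toy permutation on [0, 8): odd multiplier plus offset, mod 8. */
static uint32_t demo_perm(uint32_t x) { return (5 * x + 3) & 7; }
```

For example, `cycle_walk(x, 6, demo_perm)` permutes [0, 6) even though 6 is not a power of two — the same idea scales up to credit card numbers with a cipher as P.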
And to think I was excited to finally get a good explanation of what people usually mean when they say "perfect hashing" (i.e., given a set of keys, construct a hash function that assigns each key a unique hash so as to avoid collisions, with a hash range that typically has little overhead over the number of keys).
If someone has good resources on that, I'm still interested. I found it hard to fully internalize the material I came across when looking a while back. In particular, I'm confused about the complexity bounds (space and construction time) of such a hash construction process.
As already mentioned, what he describes should be called a bijective mixing function.
Now when I am given n different values, I can mix them with his function and will get n other pairwise different values. That helps me not at all in practice with regard to the task of perfect hashing.
Technically I would say he does indeed describe a perfect hash function, but one of questionable quality.
This article is about integer permutations, not perfect hashing. That said, integer permutations have many nice applications. One is in hash tables with integer keys: one can store the "hash code" in lieu of the key, since permutations are reversible.
A perfect hash function has no collisions. Typically, people are interested in perfect hashes constructed on types whose full range cannot be represented in the number of output bits, but where the cardinality of the input set is less than or equal to the output's cardinality.
What the author is describing here is more commonly called a mixing function. There are many fast mixing functions. https://gist.github.com/badboy/6267743
I really dislike posts like this one. For technical stuff like this, people should start out with definitions so everyone knows what the author means by, for example, "perfect hashing".
> By implication, the hash must be at least as many bytes as the key and the function is theoretically reversible
No. At least not by the standard definition[1]. In standard context, the problem of constructing a perfect hash function is, given (ahead of time) a list of items, you find a hash function that can map them to a smaller "range" without collision.
Encryption algorithms must be reversible, which means each of the internal sub-block operations they are composed from must also be reversible. AES is amenable to isolating reversibility in sub-block ranges.
It is easy to exhaustively test for collisions in the key space for smaller keys and to do thorough statistical tests for larger keys where enumerating all keys is not possible.
The first sentence of the article explains what a perfect hash function is: "A perfect hash function is one that is collision-free." Essentially, a perfect hash function is one where no two domain elements map to the same range element.
If you want to explore it further, you can search for injective functions and one-to-one mappings.
An example of a function that produces collisions is the ABS function.
ABS( 2 ) = 2.
ABS( -2 ) = 2.
Two domain elements ( 2 and -2 ) map to the same range element ( 2 ).