Universal hashing is a randomized algorithm for selecting a hash function F with the following property: for any two distinct inputs x and y, the probability that F(x)=F(y) (i.e., that there is a hash collision between x and y) is the same as if F was a random function. Thus, if F has function values in a range of size r, the probability of any particular hash collision should be at most 1/r. There are universal hashing methods that give a function F that can be evaluated in a handful of computer instructions.
Contents |
Introduction
Hashing was originally used to implement hash tables, taking an input such as a string and returning an index into the table for an object corresponding to the input. Since then, hashing has found other uses. For example, two inputs might be compared by checking to see if their hash values are the same. In general, a hash function is a function from a set of possible inputs, U, to a set of outputs, which is usually taken to be
for some N.
However, hash functions are many-to-one mappings. That is, they are not injective. If we are very unfortunate, we might have a set of inputs which all map to the same value. Universal hashing is all about making it an extremely unlikely event that the possible hash values are not equally used.
Universal hashing
When trying to prove that a hashing algorithm works well, one needs to show that it successfully distributes the input values it is given and they aren't all mapped to the same value. But, for any given hash function h, we know that there are some input values for which it doesn't work well, so we cannot expect to prove anything of that sort for an explicitly given h.
The solution to this problem is to select h from some class of functions at random, ensuring that any given input is unlikely to be a bad input for our chosen hash. The obvious class of such functions is the set of functions from the set of possible input keys, K, to the set of outputs,
.
In this case, it is a theorem that the expected time required to perform any m operations (insert, delete, lookup) on a hash-table based on h is of order O(m)[1].
Unfortunately, this set is very large so picking a random function from it is computationally very expensive. Moreover, the function would also (almost certainly) be very complicated to describe, negating any speed benefits that the hash function might have given.
Fortunately, to prove the theorem above, the only property of the set of functions used was the fact that, for inputs
, the probability that h(x) = h(y) is at most 1 / N, a property that we can guarantee in practice.
Definition
Let K be a set of keys, V be a set of values and H be a set of functions from K to V. Then H is called a 2-universal family of hash functions if, for all x,y in K with
,
Note that this is the property that was described above[2].
A more stringent requirement is that, for all distinct x,y in K and x',y' in V,
If this holds then H is called a strongly 2-universal family. Without qualification, the latter definition is probably intended. This definition can be generalised to a k-universal family, picking k distinct elements of K and of V and requiring that the probability is | V | − k.
Examples
Suppose
and
where NK > NV. Then choose a prime p > NK. Define
and
with
and
. Then H is a 2-universal family of hash functions from K to V.
To see why, first observe that if h(x) = h(y), then
. This can happen in at most p(p − 1) / NV ways by first choosing one of the p choices for g(x) and then choosing one of the
choices for g(y) that are equivalent to our choice for g(x) modulo NV (where the equality follows because p is prime and p > NV). If g(x) = g(y), then it has nothing to do with a and b because
- g(x) = g(y)




.
Since
is a field, b has an additive inverse and
has a multiplicative inverse, so the above steps are valid.
Because there are p − 1 choices for a and p choices for b, the probability that two hashing functions in this family will hash to the same value when
is at most
, which means that this family of hash functions is 2-universal.
Uses
Universal hashing has numerous uses in computer science, for example in cryptography and in implementations of hash tables. Since the function is randomly chosen, an adversary hoping to create many hash collisions is unlikely to succeed.
See also
References
- ^ Carter and Wegman, p146 or Miltersen, p3
- ^ Carter and Wegman used the notation universal2 for this definition
- Knuth, Donald Ervin (1998). [The Art of Computer Programming], Vol. III: Sorting and Searching (2e ed.). Reading, Mass ; London: Addison-Wesley. ISBN 0-201-89685-0.
- Carter and Wegman (1977). "Universal Classes of Hash Functions". Journal of Computer and System Sciences 18 (2): 143–154. doi:.
- Miltersen, Peter Bro. "Universal Hashing" (PDF). Archived from the original on 24th June 2009. http://www.webcitation.org/5hmOaVISI.
This entry is from Wikipedia, the leading user-contributed encyclopedia. It may not have been reviewed by professional editors (see full disclaimer)

![\Pr_{h \in H}[h(x) = h(y)] \le \frac{1}{|V|} .](http://wpcontent.answers.com/math/5/d/a/5da2e492f1b2dec990da0441d4bb2750.png)
![\Pr_{h \in H}[h(x) = x' \mbox{ and } h(y) = y'] = \frac{1}{|V|^2} .](http://wpcontent.answers.com/math/1/4/e/14ed13c50662c26d32dcfb1a05a4477b.png)




