Share on Facebook Share on Twitter Email
Answers.com

Universal hashing

 
Wikipedia: Universal hashing

Universal hashing is a randomized algorithm for selecting a hash function F with the following property: for any two distinct inputs x and y, the probability that F(x)=F(y) (i.e., that there is a hash collision between x and y) is the same as if F was a random function. Thus, if F has function values in a range of size r, the probability of any particular hash collision should be at most 1/r. There are universal hashing methods that give a function F that can be evaluated in a handful of computer instructions.

Contents

Introduction

Hashing was originally used to implement hash tables, taking an input such as a string and returning an index into the table for an object corresponding to the input. Since then, hashing has found other uses. For example, two inputs might be compared by checking to see if their hash values are the same. In general, a hash function is a function from a set of possible inputs, U, to a set of outputs, which is usually taken to be \{1,\dots,N\} for some N.

However, hash functions are many-to-one mappings. That is, they are not injective. If we are very unfortunate, we might have a set of inputs which all map to the same value. Universal hashing is all about making it an extremely unlikely event that the possible hash values are not equally used.

Universal hashing

When trying to prove that a hashing algorithm works well, one needs to show that it successfully distributes the input values it is given and they aren't all mapped to the same value. But, for any given hash function h, we know that there are some input values for which it doesn't work well, so we cannot expect to prove anything of that sort for an explicitly given h.

The solution to this problem is to select h from some class of functions at random, ensuring that any given input is unlikely to be a bad input for our chosen hash. The obvious class of such functions is the set of functions from the set of possible input keys, K, to the set of outputs, \{1,\dots,N\}.

In this case, it is a theorem that the expected time required to perform any m operations (insert, delete, lookup) on a hash-table based on h is of order O(m)[1].

Unfortunately, this set is very large so picking a random function from it is computationally very expensive. Moreover, the function would also (almost certainly) be very complicated to describe, negating any speed benefits that the hash function might have given.

Fortunately, to prove the theorem above, the only property of the set of functions used was the fact that, for inputs x\neq y, the probability that h(x) = h(y) is at most 1 / N, a property that we can guarantee in practice.

Definition

Let K be a set of keys, V be a set of values and H be a set of functions from K to V. Then H is called a 2-universal family of hash functions if, for all x,y in K with x \neq y,

\Pr_{h \in H}[h(x) =  h(y)] \le \frac{1}{|V|} .

Note that this is the property that was described above[2].

A more stringent requirement is that, for all distinct x,y in K and x',y' in V,

\Pr_{h \in H}[h(x) = x' \mbox{ and } h(y) = y'] = \frac{1}{|V|^2} .

If this holds then H is called a strongly 2-universal family. Without qualification, the latter definition is probably intended. This definition can be generalised to a k-universal family, picking k distinct elements of K and of V and requiring that the probability is | V | k.

Examples

Suppose K=\{0,\dots,N_K - 1\} and V=\{0,\dots,N_V - 1\} where NK > NV. Then choose a prime p > NK. Define

 f(x)= g(x)\ \bmod{N_V} and
g(x)=((ax + b)\ \bmod{p})

with a,b \in \mathbb{Z}/p\mathbb{Z} and a \neq 0. Then H is a 2-universal family of hash functions from K to V.

To see why, first observe that if h(x) = h(y), then g(x) \equiv g(y) \pmod{N_V}. This can happen in at most p(p − 1) / NV ways by first choosing one of the p choices for g(x) and then choosing one of the \lfloor p/N_V \rfloor = \lceil p/N_V \rceil - 1 \leq (p-1)/N_V choices for g(y) that are equivalent to our choice for g(x) modulo NV (where the equality follows because p is prime and p > NV). If g(x) = g(y), then it has nothing to do with a and b because

g(x) = g(y)
\iff ax + b \equiv ay + b \pmod{p}
\iff ax + b + -b \equiv ay + b + -b \pmod{p}
\iff ax \equiv ay \pmod{p}
\iff a^{-1}ax \equiv a^{-1}ay \pmod{p}
\iff x \equiv y \pmod{p}.

Since \mathbb{Z}/p\mathbb{Z} is a field, b has an additive inverse and a \neq 0 has a multiplicative inverse, so the above steps are valid.

Because there are p − 1 choices for a and p choices for b, the probability that two hashing functions in this family will hash to the same value when x \neq y is at most \frac{\frac{p(p-1)}{n}}{p(p-1)} = \frac{1}{n}, which means that this family of hash functions is 2-universal.

Uses

Universal hashing has numerous uses in computer science, for example in cryptography and in implementations of hash tables. Since the function is randomly chosen, an adversary hoping to create many hash collisions is unlikely to succeed.

See also

References

  1. ^ Carter and Wegman, p146 or Miltersen, p3
  2. ^ Carter and Wegman used the notation universal2 for this definition

Search unanswered questions...
Enter a question here...
Search: All sources Community Q&A Reference topics
 
 

 

Copyrights:

Wikipedia. This article is licensed under the Creative Commons Attribution/Share-Alike License. It uses material from the Wikipedia article "Universal hashing" Read more