Share on Facebook Share on Twitter Email
Answers.com

Rolling hash

 
Wikipedia: Rolling hash

A rolling hash is a hash function where the input is hashed in a window that moves through the input.

A few hash functions allow a rolling hash to be computed very quickly -- the new hash value is rapidly calculated given only the old hash value, the old value removed from the window, and the new value added to the window -- similar to the way a moving average function can be computed much more quickly than other low-pass filters.

One of the main applications is the Rabin-Karp string search algorithm, which uses the rolling hash described below.

Another popular application is rsync program which uses Mark Adler's adler-32 checksum as its rolling hash.

Contents

Rabin-Karp rolling hash

The Rabin-Karp string search algorithm is normally used with a very simple rolling hash function that only uses multiplications and additions:

H = c1ak − 1 + c2ak − 2 + c3ak − 3 + ... + cka0 where a is a constant and c1,...,ck are the input characters.

In order to avoid manipulating huge H values, all math is done modulo n. The choice of a and n is critical to get good hashing; see linear congruential generator for more discussion.

Removing and adding chars simply involves adding or subtracting the first or last term. Shifting all chars by one position to the left requires multiplying the entire sum H by a. Shifting all chars by one position to the right requires dividing the entire sum H by a. Note that in modulo arithmetic, a can be chosen to have a multiplicative inverse a − 1 by which H can be multiplied to get the result of the division without actually performing a division.

Cyclic polynomial

Hashing by cyclic polynomial[1]—sometimes called Buzhash—is also simple, but it has the benefit of avoiding multiplications, using barrel shifts instead. It presumes that there is some hash function h from characters to integers in the interval [0,2L). This hash function might be simply an array or a hash table mapping characters to random integers. Let the function s be a cyclic binary rotation (or barrel shift): it rotates the bits by 1 to the left, pushing the latest bit in the first position. E.g., s(10011) = 00111. Let \oplus be the bit-wise exclusive or. The hash values are defined as

 H = s^{k-1}(h( c_1 )) \oplus s^{k-2}( h( c_2) )  \oplus \ldots \oplus  s( h( c_{k-1}) ) \oplus   h( c_k)

where the multiplications by powers of two can be implemented by binary shifts. The result is a number in [0,2L).

Computing the hash values in a rolling fashion is done as follows. Let H be the previous hash value. Rotate H once: H\leftarrow s(H). If c1 is the character to be removed, rotate it k times: sk(h(c1)). Then simply set

H\leftarrow H \oplus s^{k}(h( c_1 )) \oplus h(c_{k+1})

where ck + 1 is the new character.

Independence

At best, rolling hash values are pairwise independent[2]. Similarly, at best the (randomized) Rabin-Karp rolling hash values are independent. Hashing n-grams by cyclic polynomials achieves pairwise independence: simply keep the first Ln + 1 bits.

Computational complexity

All rolling hash functions are linear in the number of characters, but their complexity with respect to the length of the window (k) varies. Rabin-Karp rolling hash requires the multiplications of two k-bit numbers, integer multiplication is in O(k \log k 2^{O(\log^*k)})[3]. Hashing n-grams by cyclic polynomials can be done in linear time [2].

Software

Footnotes

  1. ^ Jonathan D. Cohen, Recursive hashing functions for n-grams, ACM Trans. Inf. Syst. 15 (3), 1997
  2. ^ a b Daniel Lemire, Owen Kaser: Recursive n-gram hashing is pairwise independent, at best, arXiv:0705.4676
  3. ^ M. Fürer, Faster integer multiplication, in: STOC ’07, 2007, pp. 57–66.

Search unanswered questions...
Enter a question here...
Search: All sources Community Q&A Reference topics
 
 

 

Copyrights:

Wikipedia. This article is licensed under the Creative Commons Attribution/Share-Alike License. It uses material from the Wikipedia article "Rolling hash" Read more