Problem Detail: The Rabin–Karp string matching algorithm requires a hash function which can be computed quickly. A common choice is $$ h(x_0 \ldots x_n) = \sum_{i=0}^n b^i x_i, $$ where $b$ is prime (all computations are modulo $2^w$, where $w$ is the width of a machine word). Why is it important for $b$ to be prime?
Asked By : sdream
Answered By : Pseudonym
A quick recap first. We are looking for a pattern $P[1 \ldots m]$ in a string $S[1 \ldots n]$. The Rabin–Karp algorithm does this by defining a hash function $h$. We compute $h(P)$ (that is, the hash of the pattern) and compare it to $h(S[1 \ldots m])$, $h(S[2 \ldots m+1])$ and so on. If we find a matching hash, then we have a potential matching substring.

The efficiency of the algorithm depends on the ability to compute $h(S[r+1 \ldots s+1])$ efficiently from $h(S[r \ldots s])$. This is called a “rolling hash”. Note that any efficient rolling hash function will do, and it’s still Rabin–Karp. The question that you’re asking about is one particular choice of hash function, where you use: $$h(S[r \ldots s]) = \sum_{i=r}^{s} S[i]\, p^{s-i} \bmod q$$ where $p$ is a prime number with roughly the same order of magnitude as the size of the character set, and $q$ is another prime number which defines the cardinality of the range of the hash function, typically of the same order of magnitude as a machine word divided by the character set size.

If I’m reading it correctly, you’re asking why $q$ has to be prime. In fact, this is a more general question: in a lot of the old (and current) literature on hashing, the advice is that the hash function should be taken modulo a prime number (e.g. hash tables should have a prime size). For a hash function to be as useful as possible, its range needs to be relatively uniform, even when its domain is not. Natural language text (say) doesn’t have a uniform frequency distribution, but the hash values should be uniformly distributed. If $q$ is a prime number, then a lot of other numbers are relatively prime to it, and in particular so is the sum (especially if $p$ is also prime!). This makes the frequency distribution of the hash values more uniform, even though the hash function itself is relatively weak.

It’s important to understand that we do this because the hash function is weak. If the hash function were stronger, taking the remainder when divided by a prime would not be necessary; you could, for example, take the remainder when divided by a power of two, which would be a much cheaper bit-mask operation. However, it’s hard to design strong rolling hash functions which are cheap enough to be computed for every input character in the Rabin–Karp algorithm.

Something that’s worth pointing out is that this “remainder of a prime” technique used to be common in many hashing applications, but it is inadvisable on modern hardware. The advice made sense once upon a time, because while the final integer division instruction was always expensive, so were the operations used to compute the hash function, such as integer multiplication. On modern CPUs, an integer division is much more expensive than even an integer multiplication. Modern carry-save-adder multipliers are fully pipelined, so you can have several such instructions executing at once. Modern dividers use the SRT or Goldschmidt algorithms, which are multi-cycle and impossible to pipeline. Goldschmidt dividers also tie up the multiplication unit, making the performance hit even greater. I’ve had programs where this division instruction was the bottleneck, and the annoying part was that it was hidden inside the standard library.

On a modern CPU, it is worth using a more sophisticated hash function built out of fully pipelineable operations (e.g. multiplies or even table look-ups) and using hash tables whose sizes are powers of two, so the modulo operation is a bit mask. Do anything to avoid that division operation. Just not for Rabin–Karp.
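To make the rolling-hash update concrete, here is a minimal Python sketch of Rabin–Karp using the hash defined above. The parameter values ($p = 101$, $q = 2^{31}-1$) and the function name are illustrative choices for this sketch, not canonical ones.

```python
def rabin_karp(pattern: str, text: str, p: int = 101, q: int = 2**31 - 1) -> list[int]:
    """Return the start indices of all occurrences of `pattern` in `text`,
    using the rolling hash h(S[r..s]) = sum S[i] * p^(s-i) mod q."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    # p^(m-1) mod q: the coefficient of the window's leftmost character.
    high = pow(p, m - 1, q)

    # Hash the pattern and the first window of the text.
    hp = ht = 0
    for i in range(m):
        hp = (hp * p + ord(pattern[i])) % q
        ht = (ht * p + ord(text[i])) % q

    matches = []
    for r in range(n - m + 1):
        # A matching hash only signals a *potential* match; verify it.
        if hp == ht and text[r:r + m] == pattern:
            matches.append(r)
        if r + m < n:
            # Roll the window: drop text[r], shift, append text[r + m].
            ht = ((ht - ord(text[r]) * high) * p + ord(text[r + m])) % q

    return matches
```

For example, `rabin_karp("aba", "abacabadaba")` returns `[0, 4, 8]`. The key point is the last line of the loop: each window’s hash is derived from the previous one in O(1), rather than rehashing all $m$ characters.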
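And to illustrate the closing point about table sizes, here is a sketch of the two reduction styles, with illustrative sizes. In a compiled language the first compiles to a division instruction and the second to a single AND; Python hides the cost difference, so this only shows the shape of the two reductions.

```python
def bucket_prime(h: int, size: int = 1_000_003) -> int:
    # Classic advice: a prime-sized table, which costs an integer division.
    return h % size

def bucket_pow2(h: int, size: int = 1 << 20) -> int:
    # Power-of-two table: the reduction is a bit mask, which is only safe
    # if the hash function itself already mixes its bits well.
    assert size & (size - 1) == 0, "size must be a power of two"
    return h & (size - 1)
```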
Best Answer from StackExchange
Question Source : http://cs.stackexchange.com/questions/28019