[Solved]: What exactly is a hash function?

Problem Detail: I have no idea how I managed to get this far in life without ever really grasping this but as it happens I’m still very confused on the concept of a hash function. I did some googling/wikipedia-ing, and here’s what I get:

  • hash tables are nice because the index can be stored in a smaller array
  • hash functions are kind of like uniform random number generators that always have the same seed but not really (and I don’t get why)

But here are the parts that are still very ambiguous to me:

  • Why use a hash function and not an RNG with a set seed and array size?
  • Why is it ok to have “collisions”, which from what I understand is two keys pointing to the same index? I mean don’t we rely on that stuff to be reliable?
  • Why do we mod with array size at the end? Can’t we just restrict the range of the hash function?
  • what is the difference between a hash table and a pointer?
  • If we know the number of unique keys, why not just use direct addressing? What is so great about this uniform distribution of keys?
  • in the case that the number of unique keys are not known, exactly how is that calculated in the end? It seems like something really clever is supposed to happen here but short of just counting the number of entries in the table (and somehow accounting for collisions) I don’t quite see it. Also, shouldn’t we count the number of unique keys before even writing the hash function, since we have to determine the array size?

Thanks for any insight!

Asked By : whyyes

Answered By : Yuval Filmus

A hash function is a pseudorandom function with a constant range. Ideally, one would like two central properties:

  1. The hash function should be easy (fast) to compute.
  2. The probability that two inputs $x,y$ hash to the same value is roughly $2^{-n}$, where $n$ is the output length.

The second property isn’t stated rigorously. There are several ways to state it rigorously:

  1. We can consider an ensemble (collection) of hash functions, and then the probability is over the choice of the hash function.
  2. We can consider some distribution on pairs $x,y$.

The main difference between hash functions and pseudorandom number generators is that a hash function gives a unique value for each input. This is important for applications such as hash tables and message verification:

  1. In hash tables, a hash function is used to choose the location at which an input is put. When you search for the same input, it is important that the hash function puts you in the same location. The pseudorandom property ensures that none of the cells gets to keep too many elements.
  2. In message verification, you are sent separately a message and its hash, and you verify the integrity of the message by comparing its hash to the one given to you. Here the pseudorandom property ensures that it is hard to cheat.

Now to address your specific questions: Why use a hash function and not an RNG with a set seed and array size? A hash function has the property that each input is hashed to the same value, which is important in some applications (see above). One way to guarantee the pseudorandom property is to have an ensemble of pseudorandom functions, and choose one at random; this is like choosing a seed at random. In practice, the seed is chosen at random once and for all, since this simplifies things. Choosing the random seed ahead of time is OK for many applications. The main drawback is that the adversary gets to see the seed ahead of time, and can use this knowledge to try to cheat. When the hash function is easy to “break”, this can actually cause problems. For example, it can allow Denial of Service attacks by sending some server some data which is all hashed to the same index, thus slowing things. In settings when there is no adversary (say, a hash table used in your own program), a hash function with a fixed seed doesn’t suffer from this drawback. Why is it ok to have “collisions”, which from what I understand is two keys pointing to the same index? I mean don’t we rely on that stuff to be reliable? Collisions are unavoidable, unless you know the input ahead of time. Indeed, if you have more potential inputs than possible outputs (which is usually the case for hash functions), then collisions are bound to occur. Consider the idealized hash function in which the hash of any given input is chosen completely randomly (but is fixed); such a function is not efficiently computable, but otherwise is an excellent hash function. Even such a function will have collisions. The goal in constructing a hash function is to imitate this behavior while keeping the function easily computable. Why do we mod with array size at the end? Can’t we just restrict the range of the hash function? You don’t want to construct a specialized hash function for each array size. Rather, you obtain such a hash function by taking a canonical hash function and then modding with the array size. Another way to look at it is that you are restricting the range of the hash function. You constructed your hash function with restricted range by taking a canonical hash function and reducing its output modulo the array size. What is the difference between a hash table and a pointer? A hash table is a data structure for storing entries (whatever they are), supporting operations such as insertion, deletion, and lookup. This is rather different from a pointer. If we know the number of unique keys, why not just use direct addressing? What is so great about this uniform distribution of keys? Even if you know the number of unique keys in advance, you don’t necessarily know what they are, and without this information you can’t use direct addressing. Even if you did have a list of unique keys in advance, it is not always easy to construct an efficient direct addressing scheme. Imagine that the input is the set of all binary trees on at most 20 vertices. How do you map such a binary tree to an integer? While such schemes are definitely possible, often it is faster to use a hash function; the overhead due to collisions is negligible compared to the cost of computing a direct addressing scheme. In the case that the number of unique keys are not known, exactly how is that calculated in the end? It seems like something really clever is supposed to happen here but short of just counting the number of entries in the table (and somehow accounting for collisions) I don’t quite see it. Also, shouldn’t we count the number of unique keys before even writing the hash function, since we have to determine the array size? If you don’t know ahead of time how many unique keys you have, then you use a dynamic hash table instead of a static one. A dynamic hash table can grow in size, though this is an expensive operation. The hash table is initialized at some fixed size, and once it gets too full, it grows in size; if it gets too empty, it shrinks. For an example, look up the Python implementation of dictionaries (dictionaries are the python name for hash tables).

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/41964

Leave a Reply