What exactly (and precisely) is “hash?”

Problem Detail: I have heard the word “hash” being used in different contexts (all within the world of computing) with different meanings. For example, in the book Learn Python the Hard Way, in the chapter on dictionaries it is said “Python calls them “dicts.” Other languages call them “hashes.”” So, are hashes dictionaries? The other common usage of the word is in relation to encryption. I have also heard (& read) people using the word “hash” as a specific function within high-level programing. So, what exactly is it? Can anyone (with time and who is knowledgeable) kindly explain the nitty-gritties of “hash (or hashes)?”

Asked By : gracedlamb

Answered By : usul

The Wikipedia article on hash functions is very good, but I will here give my take.

What is a hash?

“Hash” is really a broad term with different formal meanings in different contexts. There is not a single perfect answer to your question. I will explain the general underlying concept and mention some of the most common usages of the term. A “hash” is a function $h$ referred to as hash function that takes as input objects and outputs a string or number. The input objects are usually members of basic data types like strings, integers, or bigger ones composed of other objects like user defined structures. The output is a typically a number or a string. The noun “hash” often refers to this output. The verb “hash” often means “apply a hash function”. The main properties that a hash function should have are:

  1. It should be easy to compute and
  2. The outputs should be relatively small.

Example:

Say we want to hash numbers in the range from 0 to 999,999,999 to number between 0 and 99. One simple hash function can be $h(x) = x mod 100$.

Common additional properties:

Depending on use case we might want the hash function to satisfy additional properties. Here are some common additional properties:

  1. Uniformity: Often we want the hashes of objects to be distinct. Moreover we may want the hashes to be “spreading-out”. If I want to hash some objects down into 100 buckets (so the output of my hash function is a number from 0-99), then I am usually hoping that about 1/100 objects land in bucket 0, about 1/100 land in bucket 1, and so on.
  2. Cryptographic collision resistance: Sometimes this is taken even farther, for instance, in cryptography I may want a hash function such that it is computationally difficult for an adversary to find two different inputs that map to the same output.
  3. Compression: I often want to hash arbitrarily-large inputs down into a constant-size output or fixed number of buckets.
  4. Determinism: I may want a hash function whose output doesn’t change between runs, i.e. the output of the hash function on the same object will always remain the same. This may seem to conflict with uniformity above, but one solution is to choose the hash function randomly once, and not change it between runs.

Some applications

One common application is in data structures such as a hash table, which are a way to implement dictionaries. Here, you allocate some memory, say, 100 “buckets”; then, when asked to store an (key, value) pair in the dictionary, you hash the key into a number 0-99, and store the pair in the corresponding bucket in memory. Then, when you are asked to look up a key, you hash the key into a number 0-99 with the same hash function and check that bucket to see if that key is in there. If so, you return its value. Note that you could also implement dictionaries in other ways, such as with a binary search tree (if your objects are comparable). Another practical application is checksums, which are ways to check that two files are the same (for example, the file was not corrupted from its previous version). Because hash functions are very unlikely to map two inputs to the same output, you compute and store a hash of the first file, usually represented as a string. This hash is very small, maybe only a few dozen ASCII characters. Then, when you get the second file, you hash that and check that the output is the same. If so, almost certainly it is the exact same file byte-for-byte. Another application is in cryptography, where these hashes should be hard to “invert” — that is, given the output and the hash function, it should be computationally hard to figure out the input(s) that led to that output. One use of this is for passwords: Instead of storing the password itself, you store a cryptographic hash of the password (maybe with some other ingredients). Then, when a user enters a password, you compute its hash and check that it matches the correct hash; if so, you say the password is correct. (Now even someone who can look and find out the hash saved on the server does not have such an easy time pretending to be the user.) This application can be a case where the output is just as long or longer than the input, since the input is so short.

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/55471

Leave a Reply