CodingBison - Hash Tables: Introduction

A hash table is a popular data structure for providing a faster search -- a well-designed hash table can provide an average lookup complexity of O(1). A hash table data structure uses a hashing function that maps a given key to one of the indexes (also referred to as slots) in the hash-table.

For a hash-table, indices are identified using an integer and the hashing function converts the passed key (which can be string, number, etc) into an integer that points to an index in the hash table. In other words, if the hash-table has size S and if the first index is indexed as 0, then we can find the table index, y, such that y would be greater than or equal to 0 but but less than S.

An example of the hash() function definition that takes an integer key can be written using the mod (or the modulo) function. Thus, if M is 97 and if we want to store the key 23, then the index would be 23.

 y = hash(key) 
 where  0 <= y < S.

 Example:
 hash(x) = x mod M
 where  M is a constant.

If the key is a string, instead of an integer, even then we need to map the string to one of the indices of the hash table. We can do this in several ways. For example, a simple (but not necessarily the most optimal) way would be to add all the integer values for each character present in the string and then use the sum as an input to the modulo-based function. Here is an example that does that.

 #include <stdio.h>

 #define HASH_TABLE_SIZE 10

 int hash (char *string_key, int size) {
     int ret_hash = 0;
     char *c = string_key;

     while (*c != '\0') {
         ret_hash += (*(c++))/2;
     }   
     return (ret_hash % size);
 }

 int main (int argc, char *argv[]) {
     int i = 1;
     int hash_value;

     while (i < argc) {
         hash_value = hash(argv[i], 10);
         printf("The hash value for string '%s' is \t\t %d \n", 
             argv[i], hash_value);
         i++;
     }   
     return 0;
 }

Assuming that the output file of the above command is "hash", here are the hash mappings for various input strings. It appears that the output is distributed throughout the range. But then the above data set is too small to conclude this!

 user@codingbison $ gcc hash.c -o hash 
 user@codingbison $ ./hash "New York" "South Dakota" "North Carolina" "North Dakota" \
         "South Carolina" "Maryland" "New Mexico" "Rhode Island" "New Jersey"  \
         "New Hampshire" "Massachusetts"
 The hash value for string 'New York' is          1 
 The hash value for string 'South Dakota' is          9 
 The hash value for string 'North Carolina' is        4 
 The hash value for string 'North Dakota' is          1 
 The hash value for string 'South Carolina' is        2 
 The hash value for string 'Maryland' is          4 
 The hash value for string 'New Mexico' is        3 
 The hash value for string 'Rhode Island' is          3 
 The hash value for string 'New Jersey' is        6 
 The hash value for string 'New Hampshire' is         9 
 The hash value for string 'Massachusetts' is         4 
 user@codingbison $

Collisions

For hash tables, an import consideration is that it should be able to accommodate collisions. A collision happens when we have two or more elements that are indexed to the same slot. When collisions occur, we can insert the new key into the same index that is preoccupied by using a list/chain that sits in that index. Alternatively, we can look for another empty index in the hash table -- clearly, we need to keep looking for an empty index until we find one. The former case, where multiple entries can reside in the same index, is referred as Direct-addressing or Chain-based Hash table. The latter case, when each index can have at most one entry, is referred as Open-addressing based hash table.

Collisions can potentially increase the search complexity of hash tables and so a good hashing function must be designed to reduce these collisions. The worst-case scenario can occur when, for chain-based Hash table, all the keys get mapped to the same index. Thus, all the elements would sit in the same index in the form of a linked list and we would need to search all the elements. This inefficient behavior would mean a lookup complexity of O(N). For open-addressing system, the worst case can happen, if we are looking for the last element that gets inserted in the table -- we might need to keep looking for an empty index till we find the last remaining empty index -- the lookup complexity would once again be O(N).

It is easy to see that if the size of the hash table (S) is smaller than the number of entries (N), then collisions are bound to happen. For an ideal hashing function, if S where to be same as N, then each index would have one entry. In reality, this may not be true. Even if N is smaller than S, collisions can still occur. The ratio of N and S does provide an insight into an average number of collisions. This ratio is often known as load factor of the hash table. For a chain-based hash table, if N were to be k * S and we have an ideal hashing function, then each index would have exactly k entries. Thus, the run time complexity of a lookup would be O(k) which is same as that of O(1).

Load factor is an important consideration for estimating the amount of collisions in a hash table. In an ideal world, the load factor should be always less than 1.0. In fact, its recommended upper limit is around 0.7. If the universe of the keys that are to be mapped to the hash table are known in advance, then the size of the hash table can be fixed from the get-go! However, this may not always be possible. Hence, for a given size, software systems might see a gradual increase in their load factor. To keep their load factor low, these systems can grow the size of the hash table periodically. Typically, one can create a new larger hash table and then copy the key/value pair from the old table by mapping each key in the new table. The earlier mappings may not work since often the size of the hash table itself is taken into account in the hashing function.