CodingBison - Hash Tables: Chain Based (Advanced Implementation)

Direct-addressing or using chains (can be implemented as linked lists) at each index in the hash table, is a common way to deal with collisions. In other words, the first entry for a given index sits as the first entry in that index in the hash-table. Subsequent entries that get mapped to the same index are appended to this entry leading to a chain of entries all belonging to the same index.

The following figure shows a chain-based hash table. In this figure, some of the indexes of the hash-table have more than one entries since the hashing function returns the same index for all those elements. In that case, they use the "next" pointer to build a chain rooted at the common index of all those elements. For example, all the three entries of "South Dakota", "North Dakota", and "Massachusetts" have the same hashing index of 3.

    Index: 0            1               2              3             4            5
       -------------------------------------------------------------------------------
      |  Empty    |  'New York'  |  'Maryland'  | 'South Dakota' |  Empty  |    Empty |
       -------------------------------------------------------------------------------
                        |               |              |
                        | next          | next         | next
                        V               V              V
                 'New Hampshire'       NULL       'North Dakota'
                                                       |  
                                                       | next
                                                       V
                                                 'Massachusetts'

                             Figure: Chain based Hash Table

The performance of a direct-addressed hash table depends upon the design of its hashing function -- for a well-designed hashing function, collisions would be minimum. For a badly-designed hashing function, collisions can be large and in the worst case, all the entries can sit in the same index! Assuming a linked list implementation for accommodating collisions, one would have to traverse the entire linked list to find the last element, leading to a run-time complexity of O(N). To avoid this inefficiency, one can consider alternative ways of chaining elements. As an example, instead of a simple linked list, one can consider using a balanced binary search tree, for which the worst run-time complexity of O(logN) can be better than O(N).

Advanced Implementation

Let us now get started with a generic implementation of hash tables! This example provided focuses on elements that make the implementation production-style software. If you are looking for a simpler implementation for the same, we have provided one here: Simple Implementation for Chained Hash Tables. In fact, if you are trying to learn hash tables for the first time, we strongly recommend that you start with the above page.

Our implementation starts with a data-structure for elements sitting at each index (or slot) in the hash table. For each element, the structure provides support for three things: key, value, and a linked list to add more elements in the case of a collision. Equally important, the implementation offers APIs that have various hash table operations like insertion, lookup (search), deletion, etc.

For the sake of reuse and modularity, we divide our implementation into three files: (a) a header file that publishes the data-structures and the APIs, (b) a C file that has implementation of the published APIs, and (d) another C file that has the main() function and the file that invokes various hash table APIs. The last two files would need to include the header file to get definitions for both data-structures and the API signatures. Before we go further, here is the list of the APIs.

 hashtable_init():   Initialize the hashtable
 hashtable_free():   Free the entire hash table
 hashtable_insert(): Insert an element (or update existing one) with the passed key/value pair
 hashtable_delete(): Delete an existing element that matches the passed key
 hashtable_search(): Look for an element using the passed key

Let us first start with the header file ("hashtable_chain.h"). We provide it below.

 #ifndef __HASHTABLE_CHAIN_H___
 #define __HASHTABLE_CHAIN_H___

 typedef enum hashtable_key_type_t {
     HASHTABLE_KEY_TYPE_INTEGER = 1, 
     HASHTABLE_KEY_TYPE_STRING, 
 } hashtable_key_type;

 typedef struct hash_node_ {
     void   *key; 
     void   *value; 
     struct hash_node_ *next;
 } hash_node;

 typedef struct hashtable_t_ {
     int       key_type;
     int       table_size;
     hash_node **hashtable;
 } hashtable_t;

 void hashtable_init(hashtable_t *hashtable, hashtable_key_type key_type, int table_size); 
 void hashtable_free(hashtable_t *hashtable);
 int hashtable_insert(hashtable_t *hashtable, char *key, void *value);
 void hashtable_delete(hashtable_t *hashtable, void *delete_key); 
 int hashtable_search(hashtable_t *hashtable, void *search_key); 
 #endif /* __HASHTABLE_CHAIN_H___ */

The main two data structures defined in the above header file are hash_node and hashtable_t. It is the hash_node data structure that holds the key/value pair along with the "next" pointer for supporting the linked list. The hashtable_t data structure supports parameters that represent properties related to the overall hash table. These properties are key_type used for the hash table, the table_size for the hash table, and the hashtable. The hashtable_t encapsulates these properties and thus, allows each application to have its own parameters. Without that, these values would have a global scope and thus, the implementation cannot be reused by multiple applications. Accordingly, each APIs accepts a pointer to hashtable_t.

To accommodate any type of data for both key/value input, the implementation keeps both of them as "void *" type. However, our implementation currently allows the key to be either of integer type or string type. To do that, it defines an enum that supports both. Based on the type of key that the application would like to use, they can pass the value of enum and indicate if they would like the keying scheme to be an integer or a string. While we use the key as integer or string, it is easy to extend it to additional types.

Our next file ("hashtable_chain.c" provided below) contains the the actual implementation of the above APIs. Here is the implementation file.

 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include "hashtable_chain.h"

 int hashtable_get_index (hashtable_t *hashtable, void *key) {
     int ret_hash = 0;
     char *c; 

     if (!hashtable || !key) {
         return -1;
     }

     switch (hashtable->key_type) { 
         case HASHTABLE_KEY_TYPE_STRING:
             c = (char *)key;
             while (*c != '\0') {
                 ret_hash += *(c++);
             }
             return (ret_hash % hashtable->table_size);
         case HASHTABLE_KEY_TYPE_INTEGER:
             return ((*(int *)key) % hashtable->table_size);
         default:
             printf("Unsupported key-type");
             return 0;
     }
 }

 int hashtable_key_compare (hashtable_t *hashtable, const void *value1, const void *value2) {
     switch (hashtable->key_type) { 
         case HASHTABLE_KEY_TYPE_STRING:
             return (strcmp((char *)value1, (char *)value2));
         case HASHTABLE_KEY_TYPE_INTEGER:
             if ( *(int *)value1 < *(int *)value2 ) {
                 return -1;
             } else if ( *(int *)value1 > *(int *)value2 ) {
                 return 1;
             } else {
                 return 0;
             }
         default:
             printf("Unsupported key-type");
             return 0;
     }
 }

 void hashtable_init (hashtable_t *hashtable, hashtable_key_type key_type, int table_size) {
     if (!hashtable || !table_size) {
         return;
     }
     hashtable->hashtable = (hash_node **)calloc(table_size, (sizeof(hash_node*)));
     hashtable->key_type = key_type; 
     hashtable->table_size = table_size;
 }

 void hashtable_free (hashtable_t *hashtable) {
     hash_node *next_elem, *prev_elem; 
     int counter = 0;

     if (!hashtable || !hashtable->table_size) {
         return;
     }

     while (counter < hashtable->table_size) {
         if (hashtable->hashtable[counter] == NULL) {
             counter++;
             continue;
         } else {
             next_elem = hashtable->hashtable[counter];
             if (hashtable->key_type == HASHTABLE_KEY_TYPE_STRING) {
                 free(next_elem->key); 
             }
             next_elem = next_elem->next;

             while (next_elem) {
                 prev_elem = next_elem;
                 next_elem = next_elem->next;
                 if (hashtable->key_type == HASHTABLE_KEY_TYPE_STRING) {
                     free(next_elem->key); 
                 }
                 free(prev_elem);
             }
         }
         hashtable->hashtable[counter] = NULL;
         counter++;
     }
     free(hashtable->hashtable); 
     hashtable->table_size = 0;
 }

 int hashtable_insert (hashtable_t *hashtable, char *key, void *value) {
     hash_node *elem, *new_elem, *last_elem;
     int hash_index;

     /* Get the hash index */
     hash_index = hashtable_get_index(hashtable, key);
     if (hash_index == -1) {
         return -1;
     }

     /* Iterate through the linked list */
     for (elem = hashtable->hashtable[hash_index]; elem != NULL; elem = elem->next) {
         if (hashtable_key_compare(hashtable, elem->key, key) == 0) {
             /* An element exists with the same key. Update the value and return. */
             elem->value = value;
             return 0;
         }
         /* Get hold of the last element */
         if (elem->next == NULL) {
             last_elem = elem;
         }
     }

     new_elem = (hash_node *)malloc(sizeof(hash_node)); 
     if (hashtable->key_type == HASHTABLE_KEY_TYPE_STRING) {
         /* We add 1 char to hold NULL-terminator for the string */
         char *temp_key;
         temp_key = (char *)malloc((strlen(key) + 1)* sizeof(char) + 1); 
         memcpy(temp_key, key, strlen(key));
         new_elem->key = temp_key;
     } else if (hashtable->key_type == HASHTABLE_KEY_TYPE_INTEGER) {
         new_elem->key = key;
     } else {
         printf("%s Undefined keytype (%d)\n", __FUNCTION__, hashtable->key_type);
     }
     new_elem->value = value;
     new_elem->next = NULL;

     if (hashtable->hashtable[hash_index] == NULL) {
         /* This is the first entry in this index */ 
         hashtable->hashtable[hash_index] = new_elem; 
     } else {
         last_elem->next = new_elem;
     }
     return 0;
 }

 void hashtable_delete (hashtable_t *hashtable, void *delete_key) {
     hash_node *next_elem, *prev_elem; 
     int hash_index, counter = 0;

     /* Get the hash index */
     hash_index = hashtable_get_index(hashtable, delete_key);
     if (hash_index == -1) {
         return;
     }

     /* If the index is empty, return */
     if (hashtable->hashtable[hash_index] == NULL) {
         return;
     }

     /* See if the first elem itself has the key that we are looking for */
     if (hashtable_key_compare(hashtable,
             hashtable->hashtable[hash_index]->key, delete_key) == 0) {
         next_elem = hashtable->hashtable[hash_index]->next;
         if (hashtable->key_type == HASHTABLE_KEY_TYPE_STRING) {
             free(hashtable->hashtable[hash_index]->key);
         }
         free(hashtable->hashtable[hash_index]);
         hashtable->hashtable[hash_index] = next_elem;
         return;
     }

     /* We need to traverse the linked list to find what we are looking for */
     prev_elem = hashtable->hashtable[hash_index];
     next_elem = hashtable->hashtable[hash_index]->next;
     while (next_elem) {
         if (hashtable_key_compare(hashtable, next_elem->key, delete_key) == 0) {
             prev_elem->next = next_elem->next;
             if (hashtable->key_type == HASHTABLE_KEY_TYPE_STRING) {
                 free(next_elem->key); 
             }
             free(next_elem);

             return;
         }
         prev_elem = next_elem;
         next_elem = next_elem->next;
     }
     return;
 }

 int hashtable_search (hashtable_t *hashtable, void *search_key) {
     hash_node *elem; 
     int hash_index, counter = 0;

     /* Get the hash index */
     hash_index = hashtable_get_index(hashtable, search_key);
     if (hash_index == -1) {
         return -1;
     }

     /* Iterate through the linked list */
     for (elem = hashtable->hashtable[hash_index]; elem != NULL; elem = elem->next) {
         if (hashtable_key_compare(hashtable, elem->key, search_key) == 0) {
             return 1;
         }   
     }   
     return -1; 
 }

In the above file, the hashtable_get_index() is the key implementation function since it is this function that calculates the hash table index (slot) pertaining to a key. This function calculates the index based on the key_type. This function is used by various operations. When adding an element to the hash table, we need to get the index, where we need to add the element. When looking for an element in the hash table, we need to get the index first and then go to that index to check if the element exists. When deleting an element from the hash table, we need to get the index of the element being deleted.

Following hashtable_get_index(), we have hashtable_key_compare() that compares two keys. This function checks if the key size is an integer or a string. If the key type is string, then it relies on strcmp() to do the comparison . If the key type is integer, then it uses a simple integer comparison, instead.

The next two functions are hashtable_init() and the hashtable_free(). The hashtable_init() does various hash table level init operations like setting key_type, comparison function, allocating memory for the hashtable, etc. The hashtable_free() function is called to free up all the malloced resources. It should be called in the end, once we decide to delete the hash table itself.

The hashtable_insert() function in the above code allows us to add an element to the hash table. For insertion, we use the fact that the hash table itself is built using an array of "hash_node" pointers. To make it more efficient, for string-based keys, we do a malloc instead of storing values from the stack; using a static size for the value could mean unused memory if the string length of the values are less than the allocated limit. Needless to say, we need to call free() once we delete this node or once we are done with the table itself!

The next function hashtable_delete() does pretty much the opposite of hashtable_insert(). For deletion, we start by getting the hash table index for the input key. Once we have the index, we verify if the index is empty. If it is not, then we traverse the linked list to locate the element -- we use the hashtable_key_compare() function to compare the passed key to that of the current element. If found, we delete the element.

The next function (hashtable_search()) is at the heart of the hash table implementation since it is this function that supports search (lookup). Given an input key, we use the hashing function to get the index, then do a lookup (hopefully of O(1) complexity) to see if the element is present in that index. If the first element is not what we are searching for, then we traverse the linked list till we find the element. If we still could not find the element, then that element clearly does not exist in the hash table.

Next, we provide the file that contains the main() function and ties all of the pieces together ("main_hashtable_chain.c").

 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include "hashtable_chain.h"

 #define HASHTABLE_SIZE_FOR_ME 10

 void hashtable_print_for_me (hashtable_t *hashtable) {
     hash_node *elem; 
     int counter = 0;

     printf("Printing the Hash-table: \n");
     if (!hashtable || !hashtable->table_size) { 
         return;
     }   

     for (counter = 0; counter < hashtable->table_size; counter++) {
         if (hashtable->hashtable[counter] == NULL) {
             printf("[Index: %2d]: [Key: Empty]\n", counter);
         } else {
             printf("[Index: %2d]: [Key: '%s', Value: %d]", counter, 
                              (char *)hashtable->hashtable[counter]->key,
                              *(int *)hashtable->hashtable[counter]->value); 
             for (elem = hashtable->hashtable[counter]->next; 
                             elem != NULL; 
                             elem = elem->next) {
                 printf("\n             [Key: '%s', Value: %d]",
                             (char *)elem->key, 
                             *( int*)elem->value); 
             }
             printf("\n");
         }
     }
 }

 int main (int argc, char *argv[]) {
     hashtable_t hashtable;
     char *freeway_names[] = {"Dwight D. Eisenhower Highway", "Nimitz Freeway", 
                              "Bayshore Freeway", "Santa Ana Freeway", 
                              "Junipero Serra Freeway"};
     int freeway_numbers[] = {80, 880, 101, 5, 280};
     int len = sizeof(freeway_numbers)/sizeof(freeway_numbers[0]);
     int random_counter, found, i;

     /* Initialize the hash table */
     hashtable_init(&hashtable, HASHTABLE_KEY_TYPE_STRING, HASHTABLE_SIZE_FOR_ME); 

     /* Add all the keys */
     for (i = 0; i < len; i++) {
         hashtable_insert(&hashtable, freeway_names[i], (void *)&freeway_numbers[i]);
     }

     /* Print the hash table */
     hashtable_print_for_me(&hashtable); 

     /* Let us try to search one of the strings passed above */
     random_counter = random() % len;
     printf("\nLet us search the key: '%s'\n", freeway_names[random_counter]); 
     found = hashtable_search(&hashtable, freeway_names[random_counter]);
     printf("Found: %d \n", found);

     random_counter = random() % len;
     printf("\nLet us delete the key: '%s'\n\n", freeway_names[random_counter]);
     hashtable_delete(&hashtable, freeway_names[random_counter]);
     hashtable_print_for_me(&hashtable);

     /* We are done, let us free the resources */
     hashtable_free(&hashtable); 
 }

The above main() function starts by calling hashtable_init() to initialize the hash table. Using this API, it passes a pointer to the hash table, the key type (string), and the table size (in this case, 10). We keep size of the hash table very small since that would allow us to see collisions even with small number of entries. When using it for production, this value should be kept higher.

Once the init is done, the main() function uses two input arrays (freeway_names[] and freeway_numbers[]) as key/value pairs. Next, it uses hashtable_insert() to add these values to the hash table. Once we have added all the elements to the table, we print the entire hash table (which is scalable in this case since the table size is small). Next, we randomly pick one of the input strings and then use hashtable_search() to verify if it is present in the hash table. Likewise, we randomly pick one of the input strings and then use hashtable_delete() to delete it from the hash table. To verify the deletion, we print the hash table again.

Lastly, since the key/value types are "void *", the core implementation does not offer the print function -- it is left to the application to do the walk and print the elements. Accordingly, the main() function also adds its own print function (hashtable_print_for_me()). Please note that we do this in a sequential manner. For large amount of entries, using this approach to list all the elements stored might be inefficient because some of the indices would be empty and we would unnecessary have to traverse them. An astute reader must have figured out that we cannot use this approach to print all the stored values in a sorted manner. However, that is okay, since the objective of a hash table is not to keep elements in a sorted manner but to do a faster (random) lookup. We provide the hashtable_print_for_me() function merely to show how chained approach stores all the elements and to list out elements, if we are doing debugging.

With the code all ready to go, it is time to compile it and run it. To do that, we would need to compile both the implementation file ("hashtable_chain.c") and the file that contains the main() function ("main_hashtable_chain.c") file. Not including either one of them would lead to compilation error. From the output (see below), we see that even when we have the number of values to be added less than the hash table size, we still see collisions. Some of the indices end up seeing collisions and hence pay the price while a large fraction of the hash table entries remain empty! This demonstrates the significance of a good hashing function in designing an efficient hash table.

 user@codingbison $ gcc hashtable_chain.c main_hashtable_chain.c -o hashtable_chain
 user@codingbison $ ./hashtable_chain
 Printing the Hash-table: 
 [Index:  0]: [Key: 'Nimitz Freeway', Value: 880]
              [Key: 'Junipero Serra Freeway', Value: 280]
 [Index:  1]: [Key: Empty]
 [Index:  2]: [Key: 'Santa Ana Freeway', Value: 5]
 [Index:  3]: [Key: Empty]
 [Index:  4]: [Key: 'Bayshore Freeway', Value: 101]
 [Index:  5]: [Key: 'Dwight D. Eisenhower Highway', Value: 80]
 [Index:  6]: [Key: Empty]
 [Index:  7]: [Key: Empty]
 [Index:  8]: [Key: Empty]
 [Index:  9]: [Key: Empty]

 Let us search this input string: 'Santa Ana Freeway'
 Found: 1 

 Let us delete this input string: 'Nimitz Freeway'

 Printing the Hash-table: 
 [Index:  0]: [Key: 'Junipero Serra Freeway', Value: 280]
 [Index:  1]: [Key: Empty]
 [Index:  2]: [Key: 'Santa Ana Freeway', Value: 5]
 [Index:  3]: [Key: Empty]
 [Index:  4]: [Key: 'Bayshore Freeway', Value: 101]
 [Index:  5]: [Key: 'Dwight D. Eisenhower Highway', Value: 80]
 [Index:  6]: [Key: Empty]
 [Index:  7]: [Key: Empty]
 [Index:  8]: [Key: Empty]
 [Index:  9]: [Key: Empty]
 user@codingbison $