CodingBison

Chain-based hash tables use chains (can be implemented as linked lists) at each index to handle collisions; such hash tables are also known as Direct-addressing based hash tables. For these hash tables, the first entry for a given index sits as the first entry in that index in the hash-table. Subsequent entries that get mapped to the same index are appended to this entry leading to a chain of entries all belonging to the same index.

The following figure shows a chain-based hash table. In this figure, some of the indexes of the hash-table have more than one entries since the hashing function returns the same index for all those elements. In that case, they use the "next" pointer to build a chain rooted at the common index of all those elements. For example, all the three entries of "South Dakota", "North Dakota", and "Massachusetts" have the same hashing index of 3.

    Index: 0            1               2              3             4            5
       -------------------------------------------------------------------------------
      |  Empty    |  'New York'  |  'Maryland'  | 'South Dakota' |  Empty  |    Empty |
       -------------------------------------------------------------------------------
                        |               |              |
                        | next          | next         | next
                        V               V              V
                 'New Hampshire'       NULL       'North Dakota'
                                                       |  
                                                       | next
                                                       V
                                                 'Massachusetts'

                             Figure: Chain based Hash Table

The performance of a direct-addressed hash table depends upon the design of its hashing function -- for a well-designed hashing function, collisions would be minimum. For a badly-designed hashing function, collisions can be large and in the worst possible case, all the entries can sit in the same index. Assuming a linked list implementation for accommodating collisions, one would have to traverse the entire linked list to find the last element, leading to a run-time complexity of O(N). To avoid this inefficiency, one can consider alternative ways of chaining elements. For example, instead of a simple linked list, one can consider using a balanced binary search tree, for which the worst run-time complexity of O(logN) can be better than O(N).

Implementation

Let us get started with the implementation! In this section, we provide an implementation of a bare-bone hashing system with the intent to demonstrate the basic elements involved in design of a hash table. The reader is welcome to try different (and more optimal!) hashing functions or different implementations of the hash table itself.

The below-provided implementation contains both data-structure for the elements sitting inside the hash table as well as the functions for doing various operations on a hash-table like insertion, lookup (search), deletion, etc.

The implementation (provided below) begins by defining structure for each node in the hash table. Each node has a key, a value, and a next element, which acts as the linked-list "chain" between itself and the next element sitting at the same index; both key and value members are strings. For the sake of simplicity, we keep the value as NULL and focus on key-based operations. We keep size of the hash table very small since that would allow us to see collisions even with small number of entries. For industrial-grade software, this value should be kept higher.

 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>

 #define HASHTABLE_SIZE 10

 typedef struct hash_node {
     char *key; 
     char *value;
     struct hash_node *next;
 } hash_node;

 hash_node *hashtable[HASHTABLE_SIZE];

 int hashtable_get_index (char *string_key, int size) {
     int ret_hash = 0;
     char *c = string_key;

     while (*c != '\0') {
         ret_hash += *(c++);
     }
     return (ret_hash % size);
 }

 void hashtable_init (void) {
     int counter = 0;

     for (counter = 0; counter < HASHTABLE_SIZE; counter++) {
         hashtable[counter] = NULL;
     }
 }

 void hashtable_free (void) {
     hash_node *next_elem, *prev_elem; 
     int counter = 0;

     while (counter < HASHTABLE_SIZE) {
         if (hashtable[counter] == NULL) {
             counter++;
             continue;
         } else {
             next_elem = hashtable[counter];
             free(next_elem->key);
             next_elem = next_elem->next;

             while (next_elem) {
                 prev_elem = next_elem;
                 next_elem = next_elem->next;
                 free(prev_elem->key);
                 free(prev_elem);
             }
         }
         counter++;
     }
 }

 int hashtable_insert (char *key, char* value) {
     hash_node *elem, *new_elem, *last_elem;
     int hash_index;
     char *temp_val; 

     hash_index = hashtable_get_index(key, HASHTABLE_SIZE);
     if (hash_index == -1) {
         return -1;
     }

     /* Iterate through the linked list */
     for (elem = hashtable[hash_index]; elem != NULL; elem = elem->next) {
         if (strcmp(elem->key, key) == 0) {
             /* An element exists with the same key. Update the value and return. */
             elem->value = value;
             return 0;
         }
         /* Get hold of the last element */
         if (elem->next == NULL) {
             last_elem = elem;
         }
     }

     new_elem = (hash_node *)malloc(sizeof(hash_node)); 
     temp_val = (char *)malloc(strlen(key) * sizeof(char) + 1);
     memcpy(temp_val, key, strlen(key));
     new_elem->key = temp_val;
     new_elem->value = value;
     new_elem->next = NULL;

     if (hashtable[hash_index] == NULL) {
         /* This is the first entry in this index */ 
         hashtable[hash_index] = new_elem; 
     } else {
         last_elem->next = new_elem;
     }
     return 0;
 }

 void hashtable_delete (char *search_entry) {
     hash_node *next_elem, *prev_elem; 
     int hash_index;

     /* Get the hash key */
     hash_index = hashtable_get_index(search_entry, HASHTABLE_SIZE);

     /* If the index is empty, return */
     if (hashtable[hash_index] == NULL) {
         return;
     }

     /* See if the first elem itself has the key that we are looking for */
     if (strcmp(hashtable[hash_index]->key, search_entry) == 0) {
         next_elem = hashtable[hash_index]->next;
         free(hashtable[hash_index]->key);
         free(hashtable[hash_index]);
         hashtable[hash_index] = next_elem;
         return;
     }

     /* We need to traverse the linked list to find what we are looking for */
     prev_elem = hashtable[hash_index];
     next_elem = hashtable[hash_index]->next;
     while (next_elem) {
         if (strcmp(next_elem->key, search_entry) == 0) {
             prev_elem->next = next_elem->next;
             free(next_elem->key);
             free(next_elem);
             return;
         }
         prev_elem = next_elem;
         next_elem = next_elem->next;
     }
     return;
 }

 int hashtable_search (char *search_entry) {
     hash_node *elem; 
     int hash_index;

     /* Get the hash key */
     hash_index = hashtable_get_index(search_entry, HASHTABLE_SIZE);

     /* Iterate through the linked list */
     for (elem = hashtable[hash_index]; elem != NULL; elem = elem->next) {
         if (strcmp(elem->key, search_entry) == 0) {
             return 1;
         }
     }
     return -1;
 }

 void hashtable_print (void) {
     hash_node *elem; 
     int counter = 0;

     printf("====== Printing the Hash Table ============ \n");
     for (counter = 0; counter < HASHTABLE_SIZE; counter++) {
         if (hashtable[counter] == NULL) {
             printf("Index %2d:  Key: Empty \n", counter);
         } else {
             printf("Index %2d:", counter);
             for (elem = hashtable[counter]; elem != NULL; elem = elem->next) { 
                 printf("  Key: '%s'", elem->key); 
             }
             printf("\n");
         }
     }
 }

 int main (int argc, char *argv[]) {
     char *freeway_names[] = {"Dwight D. Eisenhower Highway", "Nimitz Freeway", 
                              "Bayshore Freeway", "Santa Ana Freeway", 
                              "Junipero Serra Freeway"};
     int len = sizeof(freeway_names)/sizeof(freeway_names[0]);
     int random_counter, found, i;

     /* Need to initialize the hash table */
     hashtable_init(); 

     /* Add all the keys */
     for (i = 0; i < len; i++) {
         hashtable_insert(freeway_names[i], NULL);
     }

     /* Print the hash table */
     hashtable_print(); 

     /* Let us try to search one of the strings passed above */
     random_counter = random() % len;
     printf("\nLet us search this input string: '%s'\n", freeway_names[random_counter]); 
     found = hashtable_search(freeway_names[random_counter]); 
     printf("Found: %d \n", found);

     random_counter = random() % len;
     printf("\nLet us delete this input string: '%s'\n\n", freeway_names[random_counter]);
     hashtable_delete(freeway_names[random_counter]);
     hashtable_print();

     /* We are done, let us free the resources */
     hashtable_free(); 
 }

In the above code, the first function is the hashtable_get_index() and this function is one of the key tasks for a hash table. Given an input string, we use the hashing function to get the index (or the slot) in the hash-table. This is used by various operations. When adding an element to the hash table, we need to get the index, where we need to add the element. When looking for an element in the hash table, we need to get the index first and then go to that index to check if the element exists. When deleting an element from the hash table, we need to get the index of the element that is being deleted.

The next two functions are hashtable_init() and the hashtable_free(). The hashtable_init() function initializes the elements of the hashtable and applications should call this before using other hashtable functions. The hashtable_free() function is called to free up all the malloced resources. It should be called in the end when we decide to delete the hash table itself.

The hashtable_insert() function allows us to add an element to the hash table. For insertion, we use the fact that the hash table itself is built using an array of "hash_node" pointers. To make it more efficient, we define the value stored as a pointer and before copying the passed string, we need to do a malloc; using a static size for the value could mean unused memory if the string length of the values are less than the allocated limit. Needless to say, we need to call free() once we delete this node or once we are done with the table itself!

The next function hashtable_delete() does pretty much the opposite of hashtable_insert(). For deletion, we (once again) start by getting the hash table index for the input string. Once we have the index, we verify if the index is empty. If it is not, then we traverse the linked list to locate the element and to delete it subsequently.

The next function (hashtable_search()) is at the heart of the hash table implementation since it is this function that supports search (lookup). Given an input string, we use the hashing function to get the index, then do a lookup (hopefully of O(1) complexity) to see if the element is present in that index. If the first element is not what we are searching for, then we traverse the linked list until we find the element. If we still do not find the element, then that element does not exist in the hash table.

The last hash table related function (hashtable_print()) prints elements in the hash-table. Please note that we do this in a sequential manner. For large amount of entries, using this approach to list all the elements stored might be inefficient because some of the indices would be empty and we would unnecessary have to traverse them. An astute reader must have figured out that we cannot use this approach to print all the stored values in a sorted manner. However, that is okay, since the objective of a hash table is not to keep elements in a sorted manner but to do a faster (random) lookup. We provide the hashtable_print() function merely to show how chained approach stores all the elements and to list out elements, if we are doing debugging.

In the end we use main() function to stitch all of the earlier functions together! Once the main() function is done with the init routine (hashtable_init()), it uses an input array of strings (freeway_names[]) and then insert these strings into the hash table. Once we have added all the elements to the table, we print the entire hash table (which is scalable in this case since the table size is small). Next, we randomly pick one of the input strings and then verify if it is present in the hash table. Likewise, we randomly pick one of the input strings and then delete it from the hash table. To verify the deletion, we print the hash table again.

With the code all ready to go, we go ahead and compile it and run it. From the output (see below), we see that even when we have the number of values to be added less than the hash table size, we still see collisions. Some of the indices end up seeing collisions and hence pay the price while a large fraction of the hash table entries remain empty! This demonstrates the significance of a good hashing function in designing an efficient hash table.

 user@codingree $ gcc hashtable_chain.c -o hashtable_chain
 user@codingree $ ./hashtable_chain
 ====== Printing the Hash Table ============ 
 Index  0:  Key: 'Nimitz Freeway'  Key: 'Junipero Serra Freeway'
 Index  1:  Key: Empty 
 Index  2:  Key: 'Santa Ana Freeway'
 Index  3:  Key: Empty 
 Index  4:  Key: 'Bayshore Freeway'
 Index  5:  Key: 'Dwight D. Eisenhower Highway'
 Index  6:  Key: Empty 
 Index  7:  Key: Empty 
 Index  8:  Key: Empty 
 Index  9:  Key: Empty 

 Let us search this input string: 'Santa Ana Freeway'
 Found: 1 

 Let us delete this input string: 'Nimitz Freeway'

 ====== Printing the Hash Table ============ 
 Index  0:  Key: 'Junipero Serra Freeway'
 Index  1:  Key: Empty 
 Index  2:  Key: 'Santa Ana Freeway'
 Index  3:  Key: Empty 
 Index  4:  Key: 'Bayshore Freeway'
 Index  5:  Key: 'Dwight D. Eisenhower Highway'
 Index  6:  Key: Empty 
 Index  7:  Key: Empty 
 Index  8:  Key: Empty 
 Index  9:  Key: Empty 
 user@codingbison $ 

Making Things More Generic!

While the above implementation is a pretty good start for understanding the inner workings of hash tables, we would like to note that we can certainly make it more generic. If we were to do that, then we would need to update the above program in several ways.

First, the above implementation assumes that the key would always be a string. A generic implementation should allow the key to be of other types as well and should allow any type of data to be stored in the value field. Third, the above implementation has all the code (data-structure definition, function definitions, and the main() function) all sitting in the same file. This single file cannot be reused by various applications. For that, we would need to split the file into a header file that publishes the data-structures and a C file that has implementation of the published APIs. Lastly, we would probably need a new structure to encapsulate hash table, which is a global and limits its scope. If needed, this structure can also carry various useful hash table related input like table_size, etc.

If you are curious, you can see the complete program for has these generic elements here: Advanced Implementation for Chained Hash Tables.





comments powered by Disqus