CodingBison - Hash Tables: Open Addressing (Advanced Implementation)

Open-addressing based hash tables avoid collisions by continuously probing till they find an empty index in the table. Given an input string/number, we find a hash table index. If the index in the hash table is full, then we find another index; if this index is also full, then we search for another index -- we keep finding newer indices till we find an empty index. Since all entries in the hash table sit in the table itself, the load factor for the hash table is always less than 1.0. In fact, if the load factor starts to approach 1.0, then the number of indices visited before one can find an empty index starts to increase and the performance begins to degrade. Typically, the load factor for an open-addressing scheme should be less than 0.80.

In the following figure,w e demonstrate an open-addressing based hash table. The figure adds a new key ("New Hampshire") into the table. It continues to probe till it finds an index that is empty.

                 'New Hampshire'     
                     |
                     |

   Index:  0         1               2              3             4         5
   ------------------------------------------------------------------------------
  |  Empty    |  'New York'  |  'Maryland'  | 'South Dakota' |  Empty  |  Empty  |
   ------------------------------------------------------------------------------

             Hashing returns 1, but that is occupied. So, need to find a new index.


                                              'New Hampshire'     
                                                    |
                                                    |

   Index:  0         1               2              3             4         5
   ------------------------------------------------------------------------------
  |  Empty    |  'New York'  |  'Maryland'  | 'South Dakota' |  Empty  |  Empty  |
   ------------------------------------------------------------------------------

             Next Hashing returns 3, but that is also occupied. So, need to find a new index.


                                                                        'New Hampshire'     
                                                                               |
                                                                               |

   Index:  0         1               2              3             4            5
   --------------------------------------------------------------------------------------
  |  Empty    |  'New York'  |  'Maryland'  | 'South Dakota' |  Empty  |  'New Hampshire'|
   --------------------------------------------------------------------------------------

             Next Hashing returns 5 and since it is not occupied, it can sit there. 


                             Figure: Open-addressing based Hash Table

One necessary hack to the above approach is the case when an entry is deleted in the middle of the probe sequence. In other words, for inserting an element, let us say we find that we have to go past a set of occupied indices before we find an empty node, 'e'. Now, let the probe sequence comprise of the following table indices: {m, i, j, p, e}. In this sequence, let us say all indices are occupied. Now, let us say one were to delete the element sitting at index, 'p'. This deletion can break the probe chain. This is because, if one were to search for the element sitting at index 'e', then one can go from 'm' to 'i' to 'j' and then the value at 'p' would be NULL. This NULL value would indicate that there are no more elements beyond this index in the probe sequence. As a result, one would never read the elements sitting at index 'e'. To avoid this, we add at the index 'p' a dummy deleted value and whenever the value of an index is equal to this dummy value, it means that it is not at the end of the chain and we should continue to look for other links in the probe sequence.

Advanced Implementation

Let us now provide an implementation for open-addressing. Our implementation is packaged as a header file, an implementation file, and a main file that uses the APIs defined in the header file. This allows us to reuse the implementation for several applications, which is not possible with the earlier monolithic implementation.

One side-note. If you are looking for a simpler implementation, then you can find that here: Simple Implementation for Open-Addressing Based Hash Tables. In fact, if you are trying to learn hash tables for the first time, we strongly recommend to start with the above page.

The implementation provides both data-structures and APIs for handling various hash table operations. In terms of APIs, here is the set of tasks that the implementation needs to provide.

 hashtable_init():   Initialize the hashtable with hashtable-level parameters
 hashtable_free():   Free the entire hash table
 hashtable_insert(): Insert (or update) an element with the passed key/value 
 hashtable_delete(): Delete an existing element that matches the passed key 
 hashtable_search(): Look for an element using the passed key

With that, let us begin with the header file. As we can see, the hash_node contains both key and value members and both of them are of "void *" type. The header file also adds a new enum hashtable_key_type that supports two types of keys: integer and string. If needed, one can easily extend it to add other types as well. Next, the newly added structure, hashtable_t, encapsulates various variables related to hash tables. Lastly, it also publishes the above APIs.

 #ifndef __HASHTABLE_OPEN_ADDRESSING_H___
 #define __HASHTABLE_OPEN_ADDRESSING_H___

 typedef enum hashtable_key_type_t {
     HASHTABLE_KEY_TYPE_INTEGER = 1, 
     HASHTABLE_KEY_TYPE_STRING, 
 } hashtable_key_type;

 typedef struct hash_node_ {
     void   *key; 
     void   *value; 
 } hash_node;

 typedef struct hashtable_t_ {
     hashtable_key_type  key_type;
     int                 table_size;
     hash_node           **hashtable;
     hash_node           *deleted_hash_node;
 } hashtable_t;

 void hashtable_init(hashtable_t *hashtable, hashtable_key_type key_type, int table_size); 
 void hashtable_free(hashtable_t *hashtable);
 int hashtable_insert(hashtable_t *hashtable, char *key, void *value);
 void hashtable_delete(hashtable_t *hashtable, void *delete_key); 
 int hashtable_search(hashtable_t *hashtable, void *search_key); 
 #endif /* __HASHTABLE_OPEN_ADDRESSING_H___ */

In the above header file, all of the APIs accept a pointer to the hashtable_t structure. This structure is used for encapsulating these two in a new data-structure, hashtable_t. The new data structure also packs a member (key_type) for specifying the type of key. This structure also contains another member deleted_hash_node -- we use this dummy pointer to mark nodes that get deleted in the middle of the probe sequence.

The next file in this trilogy is the implementation file ("hashtable_open_addressing.c") that contains implementation of the APIs published by the header file.

 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include "hashtable_open_addressing.h"

 int hashtable_get_index (hashtable_t *hashtable, void *key) {
     int ret_hash = 0;
     char *c; 

     if (!hashtable || !key) {
         return -1;
     }

     if (hashtable->key_type == HASHTABLE_KEY_TYPE_INTEGER) {
         return ((*(int *)key) % hashtable->table_size);
     }

     c = (char *)key;
     while (*c != '\0') {
         ret_hash += *(c++);
     }
     return (ret_hash % hashtable->table_size);
 }

 int hashtable_key_compare (hashtable_t *hashtable, const void *value1, const void *value2) {
     switch (hashtable->key_type) { 
         case HASHTABLE_KEY_TYPE_STRING:
             return (strcmp((char *)value1, (char *)value2));
         case HASHTABLE_KEY_TYPE_INTEGER:
             if ( *(int *)value1 < *(int *)value2 ) {
                 return -1;
             } else if ( *(int *)value1 > *(int *)value2 ) {
                 return 1;
             } else {
                 return 0;
             }
         default:
             printf("Unsupported key-type");
             return 0;
     }
 }

 void hashtable_init (hashtable_t *hashtable, hashtable_key_type key_type, int table_size) { 
     if (!hashtable || !table_size) {
         return;
     }
     hashtable->hashtable = (hash_node **)calloc(table_size, (sizeof(hash_node*)));
     hashtable->key_type = key_type; 
     hashtable->table_size = table_size;
     hashtable->deleted_hash_node = (hash_node *)malloc((sizeof(hash_node)));
 }

 void hashtable_free (hashtable_t *hashtable) {
     int counter = 0;

     if (!hashtable || !hashtable->table_size) {
         return;
     }

     while (counter < hashtable->table_size) {
         if ((hashtable->hashtable[counter] == NULL) ||
                   (hashtable->hashtable[counter] == hashtable->deleted_hash_node) ) {
             counter++;
             continue;
         } else {
             free(hashtable->hashtable[counter]->key);
             free(hashtable->hashtable[counter]);
         }
         hashtable->hashtable[counter] = NULL;
         counter++;
     }
     free(hashtable->deleted_hash_node); 
     free(hashtable->hashtable); 
     hashtable->table_size = 0;
 }

 int hashtable_insert (hashtable_t *hashtable, char *key, void *value) {
     hash_node *new_elem;
     int start_index, hash_index; 

     start_index = hash_index = hashtable_get_index(hashtable, key);
     if (hash_index == -1) {
         return -1;
     }

     printf("[%s] Probe sequence: %d ", __FUNCTION__, hash_index);
     while (hashtable->hashtable[hash_index] && 
             (hashtable->hashtable[hash_index] != hashtable->deleted_hash_node)) {
         /* If the key exists, then update the value and return */
         if (hashtable_key_compare(hashtable, 
                 hashtable->hashtable[hash_index]->key, key) == 0) {
             hashtable->hashtable[hash_index]->value = value;
             return 0;
         }
         hash_index++;
         if (start_index == hash_index) {
             /* Back to where we started. Return */
             return -1;
         }
         if (hash_index == hashtable->table_size)
             hash_index = 0;
         printf("%d ", hash_index);
     }
     printf("\n");

     new_elem = (hash_node *)malloc(sizeof(hash_node)); 
     if (hashtable->key_type == HASHTABLE_KEY_TYPE_STRING) {
         /* We add 1 char to hold NULL-terminator for the string */
         char *temp_key;
         temp_key = (char *)malloc((strlen(key) + 1)* sizeof(char) + 1); 
         memcpy(temp_key, key, strlen(key));
         new_elem->key = temp_key;
     } else if (hashtable->key_type == HASHTABLE_KEY_TYPE_INTEGER) {
         new_elem->key = key;
     } else {
         printf("%s Undefined keytype (%d)\n", __FUNCTION__, hashtable->key_type);
     }
     new_elem->value = value;

     hashtable->hashtable[hash_index] = new_elem; 
     return 0;
 }

 void hashtable_delete (hashtable_t *hashtable, void *delete_key) {
     int hash_index;

     /* Get the hash index */
     hash_index = hashtable_get_index(hashtable, delete_key);
     if (hash_index == -1) {
         return;
     }

     printf("[%s] Probe sequence: ", __FUNCTION__);
     /* Start the probe */
     while (hashtable->hashtable[hash_index]) {
         printf("%d ", hash_index);
         if ((hashtable->hashtable[hash_index] != hashtable->deleted_hash_node)  &&
             (hashtable_key_compare(hashtable, 
                     hashtable->hashtable[hash_index]->value, delete_key) == 0)) {
             free(hashtable->hashtable[hash_index]->value);
             free(hashtable->hashtable[hash_index]);
             hashtable->hashtable[hash_index] = hashtable->deleted_hash_node;
             printf("\n");
             return;
         }
         hash_index++;
         if (hash_index == hashtable->table_size) {
             hash_index = 0;
         }
     }
     printf("\n");
     return; 
 }

 int hashtable_search (hashtable_t *hashtable, void *search_key) {
     int hash_index;

     /* Get the hash index */
     hash_index = hashtable_get_index(hashtable, search_key);
     if (hash_index == -1) {
         return -1;
     }

     printf("[%s] Probe sequence: ", __FUNCTION__);
     /* Start the probe */
     while (hashtable->hashtable[hash_index]) {
         printf("%d ", hash_index);
         if (hashtable->hashtable[hash_index] == NULL) {
             return -1;
         } else {
             if (hashtable_key_compare(hashtable, hashtable->hashtable[hash_index]->key, 
                                         search_key) == 0) { 
                 return 1;
             }
         }
         hash_index++;
         if (hash_index == hashtable->table_size) {
             hash_index = 0;
         }
     }
     printf("\n");
     return -1;
 }

The first function in the above file is the hashtable_get_index() and this function is one of the key tasks of the hash table. Given an input key, the function looks at the key type. Currently, it supports integer and string; extending it to other types should be fairly straightforward. If the key is an integer, then it takes the modulus of that with respect to the table size. If the key is string, then it gets the sum of value of individual characters -- once the sum is calculated, once again it takes the modulus of that with respect to the table size. This function is used by various operations: insert, search, and delete. Since there is no need for applications to know the index, we do not add this to the header file.

Following hashtable_get_index(), we have hashtable_key_compare() that compares two keys. This function checks if the key size is an integer or a string. If the key type is string, then it relies on strcmp() to do the comparison . If the key type is integer, then it uses a simple integer comparison, instead.

The function hashtable_insert() starts by finding an index in the table and then inserting into the list, it keeps finding new entry until it finds an index that is neither NULL nor has the address of the dummy pointer as deleted_hash_node. We need this dummy pointer to handle deleted values (the deleted_hash_node field of the hashtable_t data-structure). For an index where we delete the entry, we simply store the address of this structure. Since the address of the structure would be unique, it would ensure that it never becomes same as any other valid stored element. Please note that hashtable_insert() also acts as an update function -- if the key/value exists, then this function simply updates the existing value to the new value.

We should note that with this approach, the location of the index after probing j times would be "index = hash(input_string, HASH_TABLE_SIZE) + j". In other words, the index is a function of both the input string and the number of probing. This approach belongs to a methodology known as "double hashing", where the hash table index is calculated using both the input string and the count of probing. In our case, we merely use j, but advanced techniques use j as an input for yet another hashing function. Thus, the key can be "index = hash1(input_string, HASH_TABLE_SIZE + hash2(j)".

Function hashtable_delete() follows a similar logic. Given the input key, we find the first index in the hash table. Next, we keep continuing along the probe till we locate the element we are looking for. Once we find it, we simply free all the resource and then make the index point to "deleted_hash_node".

For hashtable_search(), we search the hash table means by using the probe sequence till we find the element (we use compare_func() callback to do the comparison) that we are searching for. If in the sequence the probe table becomes NULL, then we simply return. We follow the same probing sequence as before: we get the first index using a hashing function and then keep incrementing the index by one for each successive probe.

The last file in this "trilogy" is the file that mimics an application. This file ("main_hashtable_open_addressing.c" provided below) has the main() function and that calls the new APIs provided by this implementation to trigger add, delete, and search functions.

 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
 #include "hashtable_open_addressing.h"

 #define HASHTABLE_SIZE_FOR_ME 10

 void hashtable_print_for_me (hashtable_t *hashtable) {
     int counter = 0;

     printf("\n====== Printing the Hash Table ============ \n");
     while (counter < HASHTABLE_SIZE_FOR_ME) {
         if (hashtable->hashtable[counter] == NULL) {
             printf("[Index %2d]: [Key: Empty]\n", counter);
         } else if (hashtable->hashtable[counter] == hashtable->deleted_hash_node) {
             printf("[Index %2d]: [Key: Deleted]\n", counter);
         } else {
             printf("[Index %2d]: [Key: '%s', Val: %d]\n", counter, 
                             (char *)hashtable->hashtable[counter]->key,
                             *(int *)hashtable->hashtable[counter]->value); 
         }
         counter++;
     }
 }

 int main (int argc, char *argv[]) {
     hashtable_t hashtable;
     char *freeway_names[] = {"Dwight D. Eisenhower Highway", "Nimitz Freeway", 
                              "Bayshore Freeway", "Santa Ana Freeway", 
                              "Junipero Serra Freeway"};
     int freeway_numbers[] = {80, 880, 101, 5, 280};
     int len = sizeof(freeway_numbers)/sizeof(freeway_numbers[0]);
     int random_counter, found, i;

     /* Need to initialize the hash table */
     hashtable_init(&hashtable, HASHTABLE_KEY_TYPE_STRING, HASHTABLE_SIZE_FOR_ME);

     /* Add all the keys */
     for (i = 0; i < len; i++) {
         hashtable_insert(&hashtable, freeway_names[i], (void *)&freeway_numbers[i]);
     }

     /* Print the hash table */
     hashtable_print_for_me(&hashtable); 

     /* Let us try to search one of the strings passed above */
     random_counter = random() % len;
     printf("\nLet us search this input string: '%s' \n", freeway_names[random_counter]); 
     found = hashtable_search(&hashtable, freeway_names[random_counter]);
     printf("Found: %d \n", found);

     random_counter = random() % len;
     printf("\nLet us delete this input string: '%s' \n", freeway_names[random_counter]);
     hashtable_delete(&hashtable, freeway_names[random_counter]);
     hashtable_print_for_me(&hashtable);

     /* We are done, let us free the resources */
     hashtable_free(&hashtable); 
 }

This file starts with a print function (hashtable_print_for_me()) for the hash table. Printing the hash table is quite simple. One simply traverses each index in the hash table and prints it. Care is taken to not print an index if it is NULL or has been previously removed (which is indicated by it being a &deleted_hash_node pointer). Once we are done with all the operations, we call hashtable_free() to release all the malloced resources.

With all the the three files ready, let us now compile them and see the output. To do that, we now need to pass both C files ("hashtable_open_addressing.c" and "main_hashtable_open_addressing.c") to gcc. When using this with a new application, let us say "new_application.c", once again, we would need to "hashtable_open_addressing.c" with the new application file ("new_application.c"). As we can see, this allows us to reuse this code.

 user@codingbison $ gcc hashtable_open_addressing.c main_hashtable_open_addressing.c -o hashtable_open_addressing
 user@codingbison $ ./hashtable_open_addressing
 [hashtable_insert] Probe sequence: 5 
 [hashtable_insert] Probe sequence: 0 
 [hashtable_insert] Probe sequence: 4 
 [hashtable_insert] Probe sequence: 2 
 [hashtable_insert] Probe sequence: 0 1 

 ====== Printing the Hash Table ============ 
 [Index  0]: [Key: 'Nimitz Freeway', Val: 880]
 [Index  1]: [Key: 'Junipero Serra Freeway', Val: 280]
 [Index  2]: [Key: 'Santa Ana Freeway', Val: 5]
 [Index  3]: [Key: Empty]
 [Index  4]: [Key: 'Bayshore Freeway', Val: 101]
 [Index  5]: [Key: 'Dwight D. Eisenhower Highway', Val: 80]
 [Index  6]: [Key: Empty]
 [Index  7]: [Key: Empty]
 [Index  8]: [Key: Empty]
 [Index  9]: [Key: Empty]

 Let us search this input string: 'Santa Ana Freeway' 
 [hashtable_search] Probe sequence: 2 Found: 1 

 Let us delete this input string: 'Nimitz Freeway' 
 [hashtable_delete] Probe sequence: 0 1 2 

 ====== Printing the Hash Table ============ 
 [Index  0]: [Key: 'Nimitz Freeway', Val: 880]
 [Index  1]: [Key: 'Junipero Serra Freeway', Val: 280]
 [Index  2]: [Key: 'Santa Ana Freeway', Val: 5]
 [Index  3]: [Key: Empty]
 [Index  4]: [Key: 'Bayshore Freeway', Val: 101]
 [Index  5]: [Key: 'Dwight D. Eisenhower Highway', Val: 80]
 [Index  6]: [Key: Empty]
 [Index  7]: [Key: Empty]
 [Index  8]: [Key: Empty]
 [Index  9]: [Key: Empty]
 user@codingbison $

The output (provided above ) shows that when we have added 4 elements in the hash-table, we see that the probing sequence kicks in. When adding the fifth element, we get 0 as the first index. But 0 as an index already holds a value and so we start the probe and then check the next index which is 1. Since the index 1 is empty, we put the fifth element there.

It is important to note that as the number of elements stored increases, the length of probing increases. The number of probes for one of the instances increases to 5, which is half of the total number of indices in the table. If there were a large number of elements in the table, this can lead to a big overhead both when trying to locate an empty index or when trying to search an element! Thus, it is very important that the load factor for an open-addressing hash table should not be allowed to approach 1.0.