CodingBison - Hash Tables: Open-addressing

Open-addressing based hash tables avoid collisions by continuously probing till they find an empty index in the table. Given an input string/number, we find a hash table index. If the index in the hash table is full, then we find another index; if this index is also full, then we search for another index -- we keep finding newer indices till we find an empty index. Since all entries in the hash table sit in the table itself, the load factor for the hash table is always less than 1.0. In fact, if the load factor starts to approach 1.0, then the number of indices visited before one can find an empty index starts to increase and the performance begins to degrade. Typically, the load factor for an open-addressing scheme should be less than 0.80.

In the following figure,w e demonstrate an open-addressing based hash table. The figure adds a new key ("New Hampshire") into the table. It continues to probe till it finds an index that is empty.

                 'New Hampshire'     
                     |
                     |

   Index:  0         1               2              3             4         5
   ------------------------------------------------------------------------------
  |  Empty    |  'New York'  |  'Maryland'  | 'South Dakota' |  Empty  |  Empty  |
   ------------------------------------------------------------------------------

             Hashing returns 1, but that is occupied. So, need to find a new index.


                                              'New Hampshire'     
                                                    |
                                                    |

   Index:  0         1               2              3             4         5
   ------------------------------------------------------------------------------
  |  Empty    |  'New York'  |  'Maryland'  | 'South Dakota' |  Empty  |  Empty  |
   ------------------------------------------------------------------------------

             Next Hashing returns 3, but that is also occupied. So, need to find a new index.


                                                                        'New Hampshire'     
                                                                               |
                                                                               |

   Index:  0         1               2              3             4            5
   --------------------------------------------------------------------------------------
  |  Empty    |  'New York'  |  'Maryland'  | 'South Dakota' |  Empty  |  'New Hampshire'|
   --------------------------------------------------------------------------------------

             Next Hashing returns 5 and since it is not occupied, it can sit there. 


                             Figure: Open-addressing based Hash Table

One necessary hack to the above approach is the case when an entry is deleted in the middle of the probe sequence. In other words, for inserting an element, let us say we find that we have to go past a set of occupied indices before we find an empty node, 'e'. Now, let the probe sequence comprise of the following table indices: {m, i, j, p, e}. In this sequence, let us say all indices are occupied. Now, let us say one were to delete the element sitting at index, 'p'. This deletion can break the probe chain. This is because, if one were to search for the element sitting at index 'e', then one can go from 'm' to 'i' to 'j' and then the value at 'p' would be NULL. This NULL value would indicate that there are no more elements beyond this index in the probe sequence. As a result, one would never read the elements sitting at index 'e'. To avoid this, we add at the index 'p' a dummy deleted value and whenever the value of an index is equal to this dummy value, it means that it is not at the end of the chain and we should continue to look for other links in the probe sequence.

Implementation

Let us now see a simple implementation for open-addressing.

Our implementation provides both data-structure to represent each element in the hash table. The structure essentially contains two fields: key and value. This way, the user of this implementation can store values based on a key. For the sake of simplicity, we keep the value as NULL and focus on key-based operations. Next, we also have functions that provide insert (hashtable_insert()), delete (hashtable_delete()), search (hashtable_search()), and print (hashtable_print()) functionalities. In addition, we also have two function for initialization of the hash table (hashtable_init()) and freeing of the hash table (hashtable_free()).

We provide the implementation file below. Following that, we would provide text to explain various pieces of the implementation.

 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>

 #define HASHTABLE_SIZE 10

 typedef struct hash_node_ {
     char *key; 
     char *value;
 } hash_node;

 hash_node *hashtable[HASHTABLE_SIZE];

 /* Dummy entry for all the deleted nodes */
 hash_node deleted_hash_node; 

 int hashtable_get_index (char *string_key, int size) {
     int ret_hash = 0;
     char *c = string_key;

     while (*c != '\0') {
         ret_hash += *(c++);
     }
     return (ret_hash % size);
 }

 void hashtable_init (void) {
     int counter = 0;

     for (counter = 0; counter < HASHTABLE_SIZE; counter++) {
         hashtable[counter] = NULL;
     }
 }

 void hashtable_free (void) {
     int counter = 0;

     while (counter < HASHTABLE_SIZE) {
         if ((hashtable[counter] == NULL) ||
             (hashtable[counter] == &deleted_hash_node) ) {
             counter++;
             continue;
         } else {
             free(hashtable[counter]->key);
             free(hashtable[counter]);
         }
         counter++;
     }
 }

 int hashtable_insert (char *key, char *value) {
     hash_node *new_element;
     char *temp_val; 
     int hash_index = 0;
     int start_index;

     /* Get the hash key */
     start_index = hash_index = hashtable_get_index(key, HASHTABLE_SIZE);
     if (hash_index == -1) {
         return -1;
     }

     printf("[%s] Probe sequence: %d ", __FUNCTION__, hash_index);
     while (hashtable[hash_index] && (hashtable[hash_index] != &deleted_hash_node)) {
         /* If the key exists, then update the value and return */
         if (strcmp(hashtable[hash_index]->key, key) == 0) {
             hashtable[hash_index]->value = value;
             return 0;
         }
         hash_index++;
         if (start_index == hash_index) {
             /* Back to where we started. Return */
             return -1;
         }
         if (hash_index == HASHTABLE_SIZE) {
             hash_index = 0;
         }
         printf("%d ", hash_index);
     }
     printf("\n");

     new_element = (hash_node *)malloc(sizeof(hash_node)); 
     temp_val = (char *)malloc(strlen(key) * sizeof(char) + 1);
     memcpy(temp_val, key, strlen(key));
     new_element->key = temp_val;
     new_element->value = value;

     hashtable[hash_index] = new_element; 
     return 0;
 }

 void hashtable_delete (char *search_entry) {
     int hash_index;

     /* Get the hash key */
     hash_index = hashtable_get_index(search_entry, HASHTABLE_SIZE);
     if (hash_index == -1) {
         return; 
     }

     printf("[%s] Probe sequence: ", __FUNCTION__);
     /* Start the probe */
     while (hashtable[hash_index]) {
         printf("%d ", hash_index);
         if ((hashtable[hash_index] != &deleted_hash_node) &&
             (strcmp(hashtable[hash_index]->key, search_entry) == 0)) {
             free(hashtable[hash_index]->key);
             free(hashtable[hash_index]);
             hashtable[hash_index] = &deleted_hash_node;
             printf("\n");
             return;
         }
         hash_index++;
         if (hash_index == HASHTABLE_SIZE) {
             hash_index = 0;
         }
     }
     printf("\n");
     return;
 }

 int hashtable_search (char *search_entry) {
     hash_node *next_element; 
     int counter = 0;
     int hash_index;

     /* Get the hash key */
     hash_index = hashtable_get_index(search_entry, HASHTABLE_SIZE);

     printf("[%s] Probe sequence: ", __FUNCTION__);
     /* Start the probe */
     while (hashtable[hash_index]) {
         printf("%d ", hash_index);
         if (hashtable[hash_index] == NULL) {
             return -1;
         } else {
             if (strcmp(hashtable[hash_index]->key, search_entry) == 0) {
                 return 1;
             }
         }
         hash_index++;
         if (hash_index == HASHTABLE_SIZE) {
             hash_index = 0;
         }
     }
     printf("\n");
     return -1;
 }

 void hashtable_print (void) {
     int counter = 0;

     printf("\n====== Printing the Hash Table ============ \n");
     for (counter = 0; counter < HASHTABLE_SIZE; counter++) {
         if (hashtable[counter] == NULL) {
             printf("index %d:\t key: Empty \n", counter);
         } else if (hashtable[counter] == &deleted_hash_node) {
             printf("index %d:\t key: Deleted \n", counter);
         } else {
             printf("index %d:\t key: '%s'\n", counter, hashtable[counter]->key);
         }
     }
 }

 int main (int argc, char *argv[]) {
     char *freeway_names[] = {"Dwight D. Eisenhower Highway", "Nimitz Freeway", 
                              "Bayshore Freeway", "Santa Ana Freeway", 
                              "Junipero Serra Freeway"};
     int len = sizeof(freeway_names)/sizeof(freeway_names[0]);
     int random_counter, found, i;

     /* Need to initialize the hash table */
     hashtable_init(); 

     /* Add all the keys */
     for (i = 0; i < len; i++) {
         hashtable_insert(freeway_names[i], NULL);
     }  

     /* Print the hash table */
     hashtable_print(); 

     /* Let us try to search one of the strings passed above */
     random_counter = random() % len;
     printf("\nLet us search this input string: '%s' \n", freeway_names[random_counter]); 
     found = hashtable_search(freeway_names[random_counter]); 
     printf("Found: %d \n", found);

     random_counter = random() % len;
     printf("\nLet us delete this input string: '%s' \n", freeway_names[random_counter]);
     hashtable_delete(freeway_names[random_counter]);
     hashtable_print();

     /* We are done, let us free the resources */
     hashtable_free(); 
 }

In the above code, the first function is the hashtable_get_index() and this function is one of the key tasks of a hash table. Given an input string, we use the hashing function to get the index (or the slot) in the hash-table. This is used by various operations: insert, search, and delete.

Two important functions are hashtable_init() and hashtable_free(). The hashtable_init() initializes all the elements of the hash table. It is a bad idea to start using the hashtable without calling this init function! The hashtable_function() frees all the nodes present in the hash table. We should call it once we are done using the hash table.

The function hashtable_insert() starts by finding an index in the table and then inserting into the list, it keeps finding new entry until it finds an index that is neither NULL nor has the address of a special pointer; we keep this pointer as deleted_hash_node. We need this dummy pointer to handle deleted values. For an index where we delete the entry, we simply store the address of this structure. Since the address of the structure would be unique, it would ensure that it never becomes same as any other valid stored element. Once we locate the index, we add the node. Also, if the node already exists, then hashtable_insert() simply updates the value of the node.

We should note that with this approach, the location of the index after probing j times would be "index = hash(input_string, HASH_TABLE_SIZE) + j". In other words, the index is a function of both the input string and the number of probing. This approach belongs to a methodology known as "double hashing", where the hash table index is calculated using both the input string and the count of probing. In our case, we merely use j, but advanced techniques use j as an input for yet another hashing function. Thus, the key can be "index = hash1(input_string, HASH_TABLE_SIZE + hash2(j)".

Printing the hash table is fairly simple (hashtable_print()). One simply traverses each index in the hash table and prints it. We should take care not to print an index if it is NULL or has been previously removed (which is indicated by it being a &deleted_hash_node pointer).

For hashtable_search(), we search the hash table means by using the probe sequence till we find the element that we are searching for. If in the sequence the probe table becomes NULL, then we simply return. We follow the same probing sequence as before: we get the first index using a hashing function and then keep incrementing the index by one for each successive probe.

Function hashtable_delete() follows a similar logic. Given the input key, we find the first index in the hash table. Next, we keep continuing along the probe till we locate the element we are looking for. Once we find it, we simply free all the resource and then make the index point to "&deleted_hash_node".

In the end, we provide the main() function that call various hash table operations. Once it is done with these operations, it calls hashtable_free() to release all the malloced resources. Next, we compile the above program and run it. Here is the output.

 user@codingbison $ gcc hashtable_open_addressing.c  -o hashtable_open_addressing
 user@codingbison $ ./hashtable_open_addressing
 [hashtable_insert] Probe sequence: 5 
 [hashtable_insert] Probe sequence: 0 
 [hashtable_insert] Probe sequence: 4 
 [hashtable_insert] Probe sequence: 2 
 [hashtable_insert] Probe sequence: 0 1 

 ====== Printing the Hash Table ============ 
 index 0:     key: 'Nimitz Freeway'
 index 1:     key: 'Junipero Serra Freeway'
 index 2:     key: 'Santa Ana Freeway'
 index 3:     key: Empty 
 index 4:     key: 'Bayshore Freeway'
 index 5:     key: 'Dwight D. Eisenhower Highway'
 index 6:     key: Empty 
 index 7:     key: Empty 
 index 8:     key: Empty 
 index 9:     key: Empty 

 Let us search this input string: 'Santa Ana Freeway' 
 [hashtable_search] Probe sequence: 2 Found: 1 

 Let us delete this input string: 'Nimitz Freeway' 
 [hashtable_delete] Probe sequence: 0 

 ====== Printing the Hash Table ============ 
 index 0:     key: Deleted 
 index 1:     key: 'Junipero Serra Freeway'
 index 2:     key: 'Santa Ana Freeway'
 index 3:     key: Empty 
 index 4:     key: 'Bayshore Freeway'
 index 5:     key: 'Dwight D. Eisenhower Highway'
 index 6:     key: Empty 
 index 7:     key: Empty 
 index 8:     key: Empty 
 index 9:     key: Empty 
 user@codingbison $

The output (provided below) shows that when we have added 4 elements in the hash-table, we see that the probing sequence kicks in. When adding the fifth element, we get 0 as the first index. But 0 as an index already holds a value and so we start the probe and then check the next index which is 1. Since the slot with index 1 is empty, we put the fifth element there.

It is important to note that as the number of elements stored increases, the length of probing increases. The number of probes for one of the instances increases to 5, which is half of the total number of indices in the table. If there were a large number of elements in the table, this can lead to a big overhead both when trying to locate an empty index or when trying to search an element! Thus, it is very important that the load factor for an open-addressing hash table should not be allowed to approach 1.0.

Making Things More Generic!

The above implementation is a good start to understand how probe-based hash tables work. However, for it to work in a production-style setup, we would need to replace some of its simplistic pieces. Let us briefly consider what it would take to do that.

To make the implementation more generic, we would need to do the following changes. First things first, while the earlier one allowed us to store keys as strings, a generic implementation should support storing other key types as well and should also provide a framework such that it can be extended to additional key types. Second, a generic implementation should be packaged as a header file and an implementation file. This modularity would allow us to reuse the implementation for several applications. Third, the earlier implementation has several global variables like hashtable and deleted_hash_node. A generic implementation should encapsulate these in a new data-structure with a scope that is limited only to the calling application and not global.

To keep things simple, we have (intentionally) not incorporated these pieces into the above code. If you would like to see an implementation that is more generic, then you can visit our following page: Advanced Implementation for Open-Addressed Based Hash Tables.