CodingBison

Programming languages provide certain predefined data types like integer (for storing whole numbers), float (for storing decimal numbers), char/string (for storing alphabets and text) etc. To make these data types more useful, programming languages also provide certain operations that can be performed on them; for instance addition, multiplication, division, etc.

You may ask: why do we need data-structures? As data grows, maintaining the data becomes more and more difficult. All the basic operations like adding new data, deleting existing data, retrieving data in entirety, retrieving a specific data from the whole chunk becomes more and more difficult.

Accordingly, with the kind of data we encounter everyday, the built-in simple data types are often not good enough! So, depending on the data we operate on, we need to build customized data types. This is where data-structures come into the picture. A data-structure is essentially an efficient/organized way of storing and accessing data.

To understand data-structures better, let us use an an example. Let us say, we have the following list of employees for a hypothetical organization.


Table: List of Employees in an Organization (Example)
emp_idnamerolelocation
1RonAdminOttawa
2ChinnEngineerBeijing
3RaghuEngineerBangalore
4MichaelManagerSydney
----------------------
----------------------
9999SteveCEOSan Jose

We can use the above example to understand how data is stored in the memory. Any kind of data in a computer's memory is stored sequentially. If we were to store this data as is, then this is how it would look.

      -------------------------------------------------------------------------------
     | 1 | Ron | Admin | Ottawa | 2 | Ching | Engineer | Beijing | 3 |..... San Jose.|
      -------------------------------------------------------------------------------

                             Figure: Storing data sequentially 

Now, imagine if some one tries to retrieve/update the information of the employee whose ID 2015. How would one do that? From the starting point, one would need to look at every field in the memory sequentially, looking for the field type emp_id 2015. In the flat memory, we do not know 1,2,3 .. 9999 belongs to emp_id and next few fields gives us the info about that employee. So, we would have to walk a long way in the "memory lane" (pun intended) before we get to the employee record for the employee with emp_id as 2015. Not exactly efficient!

Let us now organize the data slightly better and see if it makes a difference. For that, we can use an array data-structure. Array is a simple (usually, built-in) data-structure that satisfies most of our requirements. It is a data type that is supported by most of the programming languages. For our data, we can store a pointer to the employee data in an array. The value (emp_id) stored at index would be the key and with that, retrieving the data for a particular key (emp_id) would happen at constant time.

                         --------------------------------
                         | 1 | 2 | 3 | 4 | -------- | 9999 |
                         --------------------------------
            Array Indices: 0   1   2   3              9998
                           |   |
                           |   |
                           |   |
                           |   |
                           |  -------------------------------
                           |  | 2, Chinn, Engineer, Beijing |
                           |  -------------------------------
                           |
                          -------------------------
                          | 1, RON, Admin, Ottawa |
                          -------------------------

                             Figure: Storing data as an Array 

Of course, if you are thinking why bother with pointers, then you can always define a custom data-structure and define the array to hold these custom data-structures. With that, using the pointers would not be necessary and the solution would start to look better. For more on how you can build an array with data-structures, please visit our data structures in C section.

In essence, by using an appropriate data-structure and organizing the given data in a custom way, we can overcome the drawbacks of storing the data sequentially in a flat memory. We cannot always rely one particular data-structure for all kinds of data. Choosing the right data-structure is very important and it depends on the requirements we have. One of the most important tasks of a software engineer is to come up with efficient custom-made data structures for various applications.

Basic Operations

In terms of basic operations, a data-structure would (usually) need to support the following operations: (1) adding an element, (2) deleting an element, and (3) retrieving/modifying an element. Since a data-structure can accommodate any kind of data, selecting a data-structure can depend on various factors: memory usage, more frequently used operation, cost of an operation, etc.

Before we go further, let us touch upon the basic metric used to evaluate data-structure and their associated algorithms that support the above operations: time-complexity (running time). To calculate time-complexity, we can look a the pseudo-code (or an actual code, if that is available) and analyze the compute operations for the basic operations for that data-structure. Time-complexity is expressed in terms of a notation, called the Big O notation. This notation is based on the input size (let us say, N) of the data and expresses the time we would need in terms of N. Equally important, Big O notation excludes any constant factors that are associated with the number of steps. For example, the Big O notation is same, if it takes 2*N steps or 10*N steps and the value would be O(N).

Here are some of the commonly encountered Big O time-complexity types.


Table: Common Time-Complexity Values in Big O Notation
Time-ComplexityValueDescription
O(1)Constant timeTime taken by the algorithm is constant and does not change with the size of the data set
O(logN) or O(NlogN)Logarithmic timeTime taken increases at logN (or NlogN) rate
O(N)Linear timeTime taken increases linearly as the data size increases
O(N^2)Quadratic timeTime taken increases as square of the data size

Requirement Scenarios

Let us move on and see some requirement scenarios where using an appropriate data-structure will make our life easy.

Requirement: Memory is not a concern. Operations should take constant time.

In the above example, where we know the index of each employee in the array,we can have a constant (lookup) time. But note that this is a special case -- we cannot always know the data index of a record in every use-case. Hence, Ideally we would like to know the index of an array for a given input. A data-structure that does exactly this is called "Hash table". Another important/integral part of "Hash table" data-structure is "Hash function". A hash function is the one which gives the index/key for a given input. Hash tables can be implemented in many ways, one of them is by using an array. We will study hash tables and hash function in detail in the later sections.

Requirement: Memory is the main concern. Operations need not take constant time.

Though arrays provide great flexibility, their disadvantage is that the size and the memory that is consumed by an array should be pre-defined -- this can lead to memory wastage if the data is not as large as expected. This can be addressed using the concept of "dynamic allocation", where memory is allocated as and when the data arrives. With dynamic allocation, we can make sure the memory is not wasted. Linked list is one of the data-structures where the memory is allocated dynamically, but the operations (add, delete, modify) are expensive. More information about linked list is available in the Linked List section of the data-structures module.

Requirement: Memory is a concern. Operations should be efficient.

In Big O notation, the time taken by the operations on a linked list is O(N), where N is the size of the list. In other words as the data size increases, the time also increases linearly. We also have other data-structures where the memory is allocated dynamically and the operations are also done more efficiently than a linked list. Binary trees, AVL etc are some of the examples. Unlike linked list, the operations are done in log(n) time. The binary search tree data-structure is explained in detail in a later section.

We have a wide variety of data-structure available and every data-structure will have to support the basic operations like add, delete, modify. Usually any data-structure comes with certain operations, and commonly referred as an ADT (Abstract data type). Implementation of these operations are hidden, hence these data-structures are called ADT. Here are few examples of ADT. We will see a lot more of these in further sections.


Table: Common Data-Structure Operations
Abstract Data Types (ADT)Operations
ListAdd, delete, get
StackPush, pop etc
Queueinsert, delete etc
Treeinsert, delete, traverse

Data-Structures in C

C provides the ability to build data-structures by combining the built-in data types (like int, char, etc) to create an encapsulation that is customized for the application requirement. In fact, C data-structures are so versatile that they can include other data-structures as well! Due to this ability of binding together storage of various types, data-structures play an important role in C programming.

We can use the "struct" keyword to define a data-structure in C. Following this keyword, we need to provide the name of the structure. Next, we should list data of various types, each separated by a semi-colon; each of these individual data types are referred to as data-structure members. In the end, we enclose them all within braces. Do not forget the semi-colon at the end of the structure definition; if you do forget, then the compiler would remind you!

 struct employee {
     int    employee_id;
     char   employee_name[100];
     short  employee_role;
     char   employee_location[200];
 };

The above structure contains members (or fields) that store various information about employees. It uses an integer for storing employee ID, character arrays to store both name and their location, and stores a short to store the role type. The role type could also be an C enum type. Thus, with this, we can have one data type (the new employee type) to represent one employee. This allows a much tighter encapsulation and has the advantage of almost being an inbuilt type.

In this module, we introduce various data-structures and accompany them with their implementations in C programming language. If you are not familiar with C or you would like to refresh your C skills, please feel free to visit our C module.

For each of our data-structure, we provide two implementations. The first implementation is a simple one, which is what we would recommend if you are trying to learn the basics for the first-time. Typically, the first implementation has a simpler storage -- an integer or a string. In addition, we also provide an advanced implementation for each. You would find our discussion of an advanced implementation towards the end of each page (and a link that would take you there). The advanced implementation is provided in terms of a header file, an implementation, and an application file that mimics the user of the data-structure infra. Further more, the advanced implementation would typically have a "void *" storage so that it allows applications to store any types of data.





comments powered by Disqus