NUS HomeData Struct+
Home | Up


Last updated on: 10 November 2008 06:01:34 PM

1. Heap - the best Priority Queue
2. Binary Search Tree
3. Dictionary/Hash Table
4. Union-Find Disjoint Set / Set STL
5. Augmenting Data Structures

6. More sophisticated data structure

I define "advanced data structure" as follows: even if you do not use any of these, you can still solve your problem, however, in some cases, using this advanced data structure will make your solution more efficient.

1. Heap - the best Priority Queue

A heap is a complete binary tree that fulfill the Heap property.

Heap property: Every node's value is less than both of its children's values (Min Heap tree). Mathematically represented as: For all i, i != root, A[Parent(i)] <= A[i]. For Max Heap, the heap property is just the reverse from the Min Heap: A[Parent(i)] >= A[i].

Maintaining the heap property is called Heapify.

For this section, assume that we use Min Heap.


Figure 1. A Min Heap

Heap Representation

If the tree is filled in from left to right level by level (recall the definition of Complete Tree), then the heap can be stored in an array, which is just the top level from left to right, the second level, and so on.

The heap given in Figure 1 would be: 3 5 9 6 12 13 10 8 11

In this representation, the node at position i has a left children at 2i and right children at 2i+1 (assuming 1-based indexing), and the parent of i is truncate (i/2).

We can calculate this index very quickly using binary representation.

To calculate 2i, we can simply shift the binary representation of i one bit left
To calculate 2i+1, we can simply shift the binary representation of i one bit left and add 1 at the rightmost bit
To calculate truncate(i/2), we can simply shift the binary representation of i one bit right

How to add and remove things to a Heap?

To add an element, put it at the end of the array (leaf) and try to restore the Heap property (see above). Now, as long as it's below its current parent, swap it with its parent. For example, to add the number 4, the heap array would go through the following states:

index | 1  2  3  4  5  6  7  8  9 10, key=4
step1
| 3  5  9  6 12 13 10  8 11  4 => i=10, Parent(i)=5 => key(4)<12
step2 | 3  5  9  6
12 13 10  8 11 12 => A[10]=A[5]
step3 | 3 
5  9  6 12 13 10  8 11 12 => i=5 , Parent(i)=2 => key(4)<5
step4 | 3 
5  9  6  5 13 10  8 11 12 => A[5] =A[2]
step5 |
3  5  9  6  5 13 10  8 11 12 => i=2 , Parent(i)=1 => key( 4)>3
step6 | 3 
4
  9  6  5 13 10  8 11 12 => A[2] =key

Pseudo code for Heap-Insert

Heap-Insert(A,key)
  heap-size(A)++
  i <- heap-size[A]
  while (i>1 and A[Parent(i)] > key)
    A[i] <- A[Parent(i)]
    i <- Parent(i)
  A[i] <- key

Deleting an element is also relatively easy. Take the last element in the array and replace the element you wish to delete with it and then restore the Heap property (see above). While one of its children is less than it, pick the children with smaller index and swap with it. For example, to delete 11:

index |  1  2  3  4  5  6  7  8, to i=1 (11)
step1
| 11  5  9  6 12 13 10  8 => exchange A[i] with A[heap-size(a)]
step2
8  5  9  6 12 13 10 11 => compare i=1 with children (8>5 || 8<9)
step3
5  8  9  6 12 13 10  8 => exchange A[1] with A[2]
step4 |  5  
8  9  6 12 13 10  8 => compare i=2 with children (8>6 || 8<12)
step5 |  5 
6  9  8 12 13 10 11 => exchange A[2] with A[4]
step6 |  5  6  9 
8 12 13 10 11
=> compare i=4 with its children (8<11), stop

Pseudo code for Heap-Delete

Heap-Delete(A,i)
  exchange (A[i],A[heap-size(A)])
  while (i<heap-size(A) and (A[i]>A[leftChild(i)] or A[i]>A[rightChild(i)]))
    if (A[i]>A[leftChild(i)])
      exchange(A[i],A[leftChild(i)])
    else
      exchange(A[i],A[rightChild(i)])

How to alter a value?

To alter a value upwards, change the value, and swap with its parents as long as necessary. To alter a value downwards, change the value, and swap with the smaller of its two children as long as necessary.

Why Heap?

Heap is a very efficient Priority Queue.
Heap can answer these problems very fast (Assume for Min Heap):

  1. What is the minimum key?
    the root. O(1)

  2. Delete the minimum key!
    remove root, exchange a leaf with root, restore Heap property. O(log N)

  3. I want to insert a value?
    insert at the leaf, restore Heap property. O(log N)

  4. Is the heap empty?
    root empty? O(1)

Heap Sort

Because Heap is quite efficient, we can use Heap to do a quick sorting in O(N*log N) time. Simply put all elements to a heap, then repeatedly restore the heap property. The final configuration will be a sorted heap.

Heap Sort can sort in place, this beats Merge Sort algorithm which needs additional array. The O(N*log N) worst case running time also beats standard sorting algorithms such as Insertion, Selection or Bubble sort.
However, even though Heap Sort worst case running time is O(N*log N), Quick Sort is generally outperforms Heap Sort for average case, therefore I always prefer Quick Sort than any other sorting algorithms.

You should only use Heap + Heap Sort if and only if you want to create an efficient Priority Queues. Do not force yourself using Heap Sort only to achieve O(N*Log N) worst case running time since Heap implementation is much more complicated than other simple data structure.

Heap Sort Implementation

First we build a Heap using Build-Heap algorithm that runs in O(N) time, then restore Heap Property (Heapify - O(log N)) for each nodes of Heap A[1..(N-1)].

Note: Some call Heapify as Percolate Up/Down, and Build-Heap as fixHeap, anyway this is the implementation of Heap Sort according to Introduction to Algorithms book and I prefer this way :-)

Heapify

Heapify(A,i)
  i <- LeftChild(i)
  r <- RightChild(i)

  if l <= heap-size(A) and A[l] > A[i] then
    largest <- l
  else
    largest <- i

  if r <= heap-size(A) and A[r] > A[largest] then
    largest <- r

  if largest != i then
    exchange(A[i],A[largest])
    Heapify(A,largest)

Heapify sample run

Initial Heap, call Heapify(A,3)

             27
     17              *3
  16     13      10       1
 5  7  12   4   8   9   0

Call Heapify(A,6)

             27
     17              10
  16     13      *3       1
 5  7  12   4   8   9   0

Call Heapify(A,13), base case, Heapify stop, creating this Final Heap

             27
     17              10
  16     13       9       1
 5  7  12   4   8   3   0

Build-Heap

Build-Heap(A)
  heap-size[A] <- length(A)
  for i <- truncate(length(A)/2) downto 1
    Heapify(A,i)

Build-Heap sample run

* is the current index
Initial Semi-Heap, i=4

              5
      3              17
 *10     84      19       6 
22   9

Semi-Heap 2, i=3

              5
      3             *17
  22     84      19       6 
10   9

Semi-Heap 3, i=2

              5
     *3              19
  22     84      17       6 
10   9

Semi-Heap 4, i=1

             *5
     84              19
  22      3      17       6 
10   9

Final Heap

             84
     22              19
  10      3      17       6 
 5   9

HeapSort

This interesting property of Heap can be used to build another O(n log n) sorting algorithm, Heap Sort. Details of Heap Sort can be found in sorting section.


TIPS: UTILIZING C++ PRIORITY QUEUE STL

Seems it will be hard for us to implement the heap data structure correctly during contest time... fortunately, C++ has priority_queue STL, which by default (if I'm not mistaken), use heaps as the underlying data structure. So, why bother creating your own heap data structure if what you need is a generic priority_queue.

Here is a quick sample of using C++ priority_queue STL, refer to Internet for a more complete reference.


#include <stdio.h>

// this is where priority_queue implementation resides
#include <queue>

// use this to avoid specifying "std::" everywhere
using namespace std;

void main() {
  // just do this, write priority_queue<the type you want,
  // in this case, integer> and the PQ name
  priority_queue<int> pq;

  // try inserting 7 different integers, not ordered
  pq.push(3); pq.push(1); pq.push(2);
  pq.push(7); pq.push(6); pq.push(5);
  pq.push(4);

  // it will come out sorted... see that we must do
  // top() and pop(), we can't combine them
  while (!pq.empty()) {
    printf("%d ",pq.top());
    pq.pop();
  }
  printf("\n");
}

References:
-. CLRS chapter 6, Heapsort

2. Binary Search Tree

Binary Search Tree (BST) enable you to search a collection of objects (each with a real or integer value) quickly to determine if a given value exists in the collection.

Basically, a binary search tree is a node-weighted, rooted binary ordered tree. That collection of adjectives means that each node in the tree might have no child, one left child, one right child, or both left and right child. In addition, each node has an object associated with it, and the weight of the node is the value of the object.

The binary search tree also has the property that each node's left child and descendants of its left child have a value less than that of the node, and each node's right child and its descendants have a value greater or equal to it.


Binary Search Tree

The nodes are generally represented as a structure with four fields, a pointer to the node's left child, a pointer to the node's right child, the weight of the object stored at this node, and a pointer to the object itself. Sometimes, for easier access, people add pointer to the parent too.

Why are Binary Search Tree useful?

Given a collection of n objects, a binary search tree takes only O(height) time to find an objects, assuming that the tree is not really poor (unbalanced), O(height) is O(log n). In addition, unlike just keeping a sorted array, inserting and deleting objects only takes O(log n) time as well. You also can get all the keys in a Binary Search Tree in a sorted order by traversing it using O(n) inorder traversal.

Variations on Binary Trees

There are several variants that ensure that the trees are never poor. Splay trees, Red-black trees, B-trees, and AVL trees are some of the more common examples. They are all much more complicated to code, and random trees are generally good, so it's generally not worth it.

Tips: If you're concerned that the tree you created might be bad (it's being created by inserting elements from an input file, for example), then randomly order the elements before insertion.


TIPS: List of some useful C++ STL

Some STL that may be useful for programming contest: vector, deque, list, slist, bit_vector, set, map, multiset, multimap, stack, queue, priority_queue, bitset.

You may want to refer to this website for complete detail.

References:
-. CLRS chapter 12, Binary Search Tree
-. CLRS chapter 13, Red Black Tree

3. Dictionary / Hash Table

A dictionary, or hash table, stores data with a very quick way to do lookups. Let's say there is a collection of objects and a data structure must quickly answer the question: 'Is this object in the data structure?' (e.g., is this word in the dictionary?). A hash table does this in less time than it takes to do binary search.

The idea is this: find a function that maps the elements of the collection to an integer between 1 and x (where x, in this explanation, is larger than the number of elements in your collection). Keep an array indexed from 1 to x, and store each element at the position that the function evaluates the element as. Then, to determine if something is in your collection, just plug it into the function and see whether or not that position is empty. If it is not check the element there to see if it is the same as the something you're holding,

For example, presume the function is defined over 3-character words, and is (first letter + (second letter * 3) + (third letter * 7)) mod 11 (A=1, B=2, etc.), and the words are 'CAT', 'CAR', and 'COB'. When using ASCII, this function takes 'CAT' and maps it to 3, maps 'CAR' to 0, and maps 'COB' to 7, so the hash table would look like this:

0: CAR
1
2
3: CAT
4
5
6
7: COB
8
9
10

Now, to see if 'BAT' is in there, plug it into the hash function to get 2. This position in the hash table is empty, so it is not in the collection. 'ACT', on the other hand, returns the value 7, so the program must check to see if that entry, 'COB', is the same as 'ACT' (no, so 'ACT' is not in the dictionary either). In the other hand, if the search input is 'CAR', 'CAT', 'COB', the dictionary will return true.

Why are Hash Tables useful?

Hash tables enable, with a little bit of memory cost, programs to perform lookups in almost constant work. Generally, the program must evaluate the function and then possibly compare the looked up element to an entry in the table.

Collision Handling

This glossed over a slight problem that arises. What can be done if two entries map to the same value (e.g., I wanted to add 'ACT' and 'COB')? This is called a collision. There are couple ways to correct collisions, but this document will focus on one method, called chaining.

Instead of having one entry at each position, maintain a linked list of entries with the same hash value. Thus, whenever an element is added, find its position and add it to the beginning (or tail) of that list. Thus, to have both 'ACT' and 'COB' in the table, it would look something like this:

0: CAR
1
2
3: CAT
4
5
6
7: COB -> ACT
8
9
10

Now, to check an entry, all elements in the linked list must be examined to find out the element is not in the collection. This, of course, decreases the efficiency of using the hash table, but it's often quite handy.

Hashers Training

There are two basic things you need to consider to avoid collisions. The first is easy: have a large hash table. Assuming a reasonable hash function, this reduces collisions just for probabilistic reasons.

The more subtle, and often forgotten, thing to do to avoid collisions is pick a good hash function. For example, taking the three letter prefix as the hash value for a dictionary would be very bad. Under this hash function, the prefix 'CON' would have a huge number of entries. Pick a function where two elements are unlikely to map to the same value:

  1. Create a relatively huge value and mod it with the size of your table (this works especially well if your hash table is a prime size).

  2. Primes are your friends. Multiply by them.

  3. Try to have small changes map to completely different locations.

  4. You don't want to have two small changes cancel each other out in your mapping function (a transposition, for example).

  5. This is a whole field of study, and you could create a 'Perfect Hash Function' that would give you no collisions, but, for your purposes, that's entirely too much work. Pick something that seems fairly random, and see if it works; it probably will.

Hash Table variations

It is often quite useful to store more information that just the value. One example is when searching a small subset of a large subset, and using the hash table to store locations visited, you may want the value for searching a location in the hash table with it.

Even a small hash table can improve runtime by drastically reducing your search space. For example, keeping a dictionary hashed by the first letter means that if you wanted to search for a word, you would only be looking at words that have the same first letter.

Tips: Use C++ STL map for dictionary :D

References:
-. CLRS chapter 11, Hashing

4. Union-Find Disjoint Set

Sometimes we want to represent a set. That is... something like this: item a,b,c belong to set 1, item d,e belong to set 2... You are asked the following questions:

is a and b belong to the same set? Yes
is a and d belong to the same set? No
I want to union set 1 and set 2, can you do it quickly?...

Whenever you need to deal with sets... remember the term: Union-Find. This is a very sophisticated data structure for dealing with set. I won't elaborate much in this page, but I suggest you to read CLR chapter 22 for a very complete detail. Here, I will just give you a working C/C++ implementation for Union-Find data structure. This version use integer as set identifier.

Use make_set(x) to create the initial set with id x, use find_set(x) to find which set item x located, and use union_set(x,y) to merge two sets containing item x and y.

#define MAX 1000 // adjust this properly (the max size of your set)

/* UNION-FIND library */
int p[MAX],rank[MAX];

// make a new set ID x
void make_set(int x) {
  p[x] = x;
  rank[x] = 0;
}

// don't call link directly, rather, use union_set
void link(int x,int y) {
  if (rank[x] > rank[y])
    p[y] = x;
  else {
    p[x] = y;
    if (rank[x] == rank[y])
      rank[y] = rank[y] + 1;
  }
}

// find the set ID of item x
int find_set(int x) {
  if (x != p[x])
    p[x] = find_set(p[x]);
  return p[x];
}

// union two set containing item x and item y
// see, this one calls find_set first!
void union_set(int x,int y) {
  link(find_set(x),find_set(y));
}


TEST YOUR UNION-FIND KNOWLEDGE

Solve UVa problems related with Union-Find:

459 - Graph Connectivity
793 - Network Connections
10227 - Forests
10507 - Waking up brain
10583 - Ubiquitous Religion
10608 - Friends

or any other problems that indirectly need the usage of set data structure. :)

Tips: Use C++ STL multimap for set :D

References:
-. CLRS chapter 21, Data Structures for Disjoint Set

5. Augmenting Data Structures

Sometimes, the operations need to be performed more effectively. Some of them can be supported by tweaking the textbook data structure to support the required operations. I don't have much knowledge yet, please see the reference for more details.

References:
-. CLRS chapter 14, Augmenting Data Structures

6. More sophisticated data structure

They are, but not limited to: Red Black Tree, Splay Tree, B-Tree, Binomial Heap, Fibonacci Heap, etc
These will be written later when I have time :$

References:
-. CLRS chapter 13, Red Black Tree
-. CLRS chapter 18, B-Trees
-. CLRS chapter 19, Binomial Heaps
-. CLRS chapter 20, Fibonacci Heaps

What's Next?

With these advanced data structures, we can solve several difficult problems efficiently. :).
You almost have all the required knowledge to be a competitive programmers.

Next topic is on several miscellaneous stuffs that you may be interested. Stay tuned :D


This document, 11_advanced_data_structures.html, has been accessed 14268 times since 26-Jul-04 20:31:59 SGT. This is the 7th time it has been accessed today.

A total of 9418 different hosts have accessed this document in the last 1591 days; your host, 38.103.63.56, has accessed it 1 times.

If you're interested, complete statistics for this document are also available, including breakdowns by top-level domain, host name, and date.