NUS HomeSearch
Home | Up


Last updated on: 10 November 2008 06:01:28 PM

1. Searching as fundamental problem in Computer Science
2. Selection Problem
   a. Expected Linear Time Select (Practical)
   b. Worst Case Linear Time Select (Theoretical interest only)
3. Dictionary Problem
   a. Hash Table
   b. Resolving Collision 1: Separate Chaining
   c. Resolving Collision 2: Open Addressing
   d. Issues in Hashing

References:
-. CLRS chapter 9, Median and Order Statistic
-. CLRS Chapter 12, Hash table

1. Searching as fundamental problem in Computer Science

Searching is very important in computer science. It is very closely related to sorting. We usually search after sorting the data. Remember Binary Search? This is the best example of an algorithm which utilizes both Sorting and Searching.

Search algorithm depends on the data structure used. If we want to search a graph, then we have a set of well known algorithms such as DFS, BFS, etc... click here to read more about searching a graph.

2. Selection Problem

Let's define the meaning of 'Order Statistics'
The ith order statistic in a set of n elements is the ith smallest element, thus:
Minimum number is the 1st order statistic
Maximum number is the nth order statistic
Median is the n/2 order statistic (there are 2 medians if n even)

Searching the minimum or maximum number of an arbitrary array can be done in n-1 comparisons. This is the lower bound, i.e. we cannot do better than this (if you can, then there must be something wrong with your algorithm or you use a different assumption in the input data)

But, in order to find other order statistic is difficult.
You can use a naive method in order O(N^2),
or you can sort the array first O(N*log N), then apply binary search O(log N) ~~ O(N*log N)

2.a. Expected Linear Time Selection

Key idea: use Partition (Randomized version) from Quick Sort

You only need to check 1 sub array after each pass ("Decrease and Conquer")
By Master theorem, this will yield O(N) expected linear time.

q = RandomizedPartition(A,p,r);
After performing this RandomizedPartition, element A[q] will be in it's correct position.
and the order statistic of q is q-p+1 (note: p may not start from 0)
there are only 3 possibility:
A[q] is the desired order statistic, or
the desired order statistic is inside left sub partition, or
the desired order statistic is inside right sub partition

Therefore, we can design RandomizedSelect like this:

RandomizedSelect(A,p,r,i) {
  if (p == r) then return A[p]; // assume if p == r, then i must be 1
  q = RandomizedPartition(A,p,r);
  k = q - p + 1;
  if (i == k) then return A[q];
  if (i < k) then
    return RandomizedSelect(A,p,q-1,i);
  else
    return RandomizedSelect(A,q+1,r,i-k); // why i-k? think =)
}

 
If you are interested with the analysis, you can refer to CLR chapter 10. This algorithm run in expected linear time, worst case O(N^2), best case O(N).

Note:

Do you realize that this algorithm actually swaps elements in the original array? (due to the usage of RandomizedPartition ... and do you realize that these swaps put pivots in the correct final position, thus making the array more sorted...)

However, sometimes this is not a desired side effect... something like: "I want to know element with rank i, but please don't disturb my original array..."

To remedy this, you can do O(n) copy of the original array to another temporary array. Then when doing the RandomizedSelect, you perform the algorithm on the temporary array, thus preserving the original array.

O(n) + Expected linear time is still linear time =), plus 1 more O(n) array to store temp values...

2.b. Worst Case Linear Time Selection

Key idea: Same as above, but ensure that the RandomizedPartition use good pivot.

Select(i,n,A) {
1. Divide n elements into groups of 5 // I'm unsure, why 5??
2. Find median of each group
3. Use Select() recursively to find median x of the Ceil(N/5) medians
4. Partition the n elements around x. Let k = rank(x)
5. if (i==k) then return x
6. if (i < k) then
7.   use Select() recursively to find ith smallest item in first partition
8. else
9.   use Select() recursively to find (i-k)th smallest item in last partition
}

Line 4-9 is exactly the same as RandomizedSelect. Only line 1-3 differs. From this 3 lines, we can get a better approximate for the median (note: NOT the exact median). By partitioning around this "median", we can ensure that we can equally divide the partition into half. A worst-case analysis (refer to CLR chapter 10) shows that this algorithm is O(N) in worst case.

However, due to the complexity of the first 3 lines, this algorithm is usually not practical to be implemented. (Maybe I'm wrong, but I found that it is not easy to implement the first 3 lines...)

3. Dictionary Problem

A lot of problems in real life need a data structure that can Insert, Delete and Search element very fast, preferably in O(1) ~~ expected constant time, such as:
-. Symbol tables in compiler
-. Internet Search Engines
-. Computer Dictionary (such as Babylon, www.babylon.com), etc

3.a. Hash Table

Note: Please refer to the section of hash table data structure. This section view hash table as a tool to solve Dictionary Problem.

Formally:
Given a table T and a record x with key and data, do:
  Insert(T,x)
  Delete(T,x)
  Search(T,x)
all in O(1) expected constant time

First, we have to assume that the keys are distinct (otherwise this can't be a key - by Database Design definition) and the range of keys is integer from [0..m-1]. We also define a as load factor where a = n/m where n = number of keys and m = number of available slots.

The naive way is to map the whole [0..m-1] to array T[0..m-1], this is called Direct Addressing. Unfortunately we can't do this since m can be very big and we don't have memory that big. And if we have all the memories in the world, we don't have time to process that big memory anyway...

So, we define a notion of hashing, where we map keys to smaller range.
Formally, hash function is a function that maps Universal keys U to a set of small range {0,..,m-1}, such that each element with key k will be stored in T[h(k)].

What happen if the hash function maps the key K1 and K2 into into the same value, i.e. h(k1) == h(k2)? This problem is called collision

3.b. Resolving Collision 1: Separate Chaining

Key idea: put elements that hash to the same slot in a linked list
in Separate chaining,
a is the average length of each chain

Analysis:
Unsuccessful search is O(1+
a) on average
Successful search is O(1+
a/2) == O(1+a) on average

Separate Chaining works best if n=O(m), i.e. the number of keys n is proportional to the number of slots m in the table, in this case a = 1.

Separate Chaining can be used even if a > 1, and separate chaining asymptotically better than 2nd approach (Open Addressing), but each of this approach has good and bad side anyway.

3.c. Resolving Collision 2: Open Addressing

Key idea:
-. All elements are stored in the hash table itself,
-. If the slot is already filled during insertion, then try another slot until an open slot is
    found (this is called probing, can be linear, quadratic or double hash)
-. To search, follow the same probe sequence as insert,
    if reach element with the correct key, return it
    if reach empty slot, then this element is not found
-. How To delete?

Analysis:
Unsuccessful search is O(1/(1-
a)) on average
Successful search is O(1/
a * ln (1/(1-a))) on average

Open Addressing can't be used if a > 1. In fact, if this load factor approaches 1, the performance of open addressing degrades so badly...

Again... for the complete analysis, refer to CLR.

3.d. Issues in Hashing

We have to choose the best hash key for a given Dictionary Problem, this is a very crucial Database Design issues, refer to your database book for "key selection criteria".

After choosing the best hash key, we must determine the best hash function too. We want a hash function that can distribute keys uniformly into slots and independent on input data. Refer to CLR or other books regarding good hash function.

What's Next?

After sorting and searching, let's move on to a more advanced problem solving techniques, Greedy paradigm and Dynamic Programming in the next two chapters.


This document, 6_searching.html, has been accessed 8257 times since 26-Jul-04 20:32:28 SGT. This is the 2nd time it has been accessed today.

A total of 5146 different hosts have accessed this document in the last 1594 days; your host, 38.103.63.56, has accessed it 2 times.

If you're interested, complete statistics for this document are also available, including breakdowns by top-level domain, host name, and date.