|
Last updated on:
10 November 2008 06:01:28 PM
1.
Searching
as fundamental problem in Computer Science
2. Selection Problem
a. Expected Linear Time Select
(Practical)
b. Worst Case Linear Time
Select (Theoretical interest only)
3. Dictionary Problem
a. Hash Table
b. Resolving Collision
1: Separate Chaining
c. Resolving Collision
2: Open Addressing
d. Issues in Hashing
References:
-. CLRS chapter 9, Median and Order
Statistic
-. CLRS Chapter 12, Hash table
1.
Searching
as fundamental problem in Computer Science
Searching
is very important in computer science. It is very closely related to sorting.
We usually search after sorting the data. Remember
Binary Search? This is the best example
of an algorithm which utilizes both Sorting and Searching.
Search algorithm
depends on the data structure used. If we want to search a graph, then we
have a set of well known algorithms such as DFS, BFS, etc... click here
to read more about searching a graph.
2. Selection Problem
Let's define the meaning of 'Order Statistics'
The ith order statistic in a set of n elements is the ith smallest element,
thus:
Minimum number is the 1st order statistic
Maximum number is the nth order statistic
Median is the n/2 order statistic (there are 2 medians if n even)
Searching the minimum or maximum number of
an arbitrary array can be done in n-1 comparisons. This is the lower bound,
i.e. we cannot do better than this (if you can, then there must be something
wrong with your algorithm or you use a different assumption in the input
data)
But, in order to find other order statistic
is difficult.
You can use a naive method in order O(N^2),
or you can sort the array first O(N*log N), then apply binary search O(log
N) ~~ O(N*log N)
2.a.
Expected Linear Time Selection
Key idea: use Partition (Randomized version)
from Quick Sort
You only need to check 1 sub array after each
pass ("Decrease and Conquer")
By Master theorem, this will yield O(N) expected linear time.

q = RandomizedPartition(A,p,r);
After performing this RandomizedPartition, element A[q] will be in it's
correct position.
and the order statistic of q is q-p+1 (note: p may not start from 0)
there are only 3 possibility:
A[q] is the desired order statistic, or
the desired order statistic is inside left sub partition, or
the desired order statistic is inside right sub partition
Therefore, we can design RandomizedSelect
like this:
RandomizedSelect(A,p,r,i) {
if (p == r) then return A[p]; // assume if p == r, then i must be 1
q = RandomizedPartition(A,p,r);
k = q - p + 1;
if (i == k) then return A[q];
if (i < k) then
return RandomizedSelect(A,p,q-1,i);
else
return RandomizedSelect(A,q+1,r,i-k); // why i-k? think =)
}
If you are interested with the analysis, you can refer to CLR chapter 10.
This algorithm run in expected linear time, worst case O(N^2), best case
O(N).
Note:
Do you realize that this algorithm actually
swaps elements in the original array? (due to the usage of RandomizedPartition
... and do you realize that these swaps put pivots in the correct final
position, thus making the array more sorted...)
However, sometimes this is not a desired side
effect... something like: "I want to know element with rank i, but please
don't disturb my original array..."
To remedy this, you can do O(n) copy of the
original array to another temporary array. Then when doing the RandomizedSelect,
you perform the algorithm on the temporary array, thus preserving the original
array.
O(n) + Expected linear time is still linear
time =), plus 1 more O(n) array to store temp values...
2.b.
Worst Case Linear Time Selection
Key idea: Same as above, but ensure that the
RandomizedPartition use good pivot.
Select(i,n,A) {
1. Divide n elements into groups of 5 // I'm unsure, why 5??
2. Find median of each group
3. Use Select() recursively to find median x of the Ceil(N/5) medians
4. Partition the n elements around x. Let k = rank(x)
5. if (i==k) then return x
6. if (i < k) then
7. use Select() recursively to find ith smallest item in first
partition
8. else
9. use Select() recursively to find (i-k)th smallest item in
last partition
}
Line 4-9 is exactly the same as RandomizedSelect.
Only line 1-3 differs. From this 3 lines, we can get a better approximate
for the median (note: NOT the exact median). By partitioning around this
"median", we can ensure that we can equally divide the partition into half.
A worst-case analysis (refer to CLR chapter 10) shows that this algorithm
is O(N) in worst case.
However, due to the complexity of the first
3 lines, this algorithm is usually not practical to be implemented. (Maybe
I'm wrong, but I found that it is not easy to implement the first 3 lines...)
3. Dictionary
Problem
A lot of problems in real life need a data
structure that can Insert, Delete and Search element very fast, preferably
in O(1) ~~ expected constant time, such as:
-. Symbol tables in compiler
-. Internet Search Engines
-. Computer Dictionary (such as Babylon,
www.babylon.com), etc
3.a. Hash Table
Note: Please refer to the section of
hash table data structure. This
section view hash table as a tool to solve Dictionary Problem.
Formally:
Given a table T and a record x with key and data, do:
Insert(T,x)
Delete(T,x)
Search(T,x)
all in O(1) expected constant time
First, we have to assume that the keys are
distinct (otherwise this can't be a key - by Database Design definition)
and the range of keys is integer from [0..m-1]. We also define
a as load
factor where a
= n/m where n = number of keys and m = number of available slots.
The naive way is to map the whole [0..m-1]
to array T[0..m-1], this is called Direct Addressing. Unfortunately
we can't do this since m can be very big and we don't have memory that big.
And if we have all the memories in the world, we don't have time to process
that big memory anyway...
So, we define a notion of hashing, where we
map keys to smaller range.
Formally, hash function is a function that maps Universal keys U to a set
of small range {0,..,m-1}, such that each element with key k will be stored
in T[h(k)].
What happen if the hash function maps the
key K1 and K2 into into the same value, i.e. h(k1) == h(k2)? This problem
is called collision
3.b.
Resolving Collision 1:
Separate Chaining
Key idea: put elements that hash to the same
slot in a linked list
in Separate chaining, a
is the average length of each chain
Analysis:
Unsuccessful search is O(1+a)
on average
Successful search is O(1+a/2)
== O(1+a)
on average
Separate Chaining works best if n=O(m), i.e.
the number of keys n is proportional to the number of slots m in the table,
in this case a
= 1.
Separate Chaining can be used even if
a
> 1, and separate chaining asymptotically better than 2nd approach (Open
Addressing), but each of this approach has good and bad side anyway.
3.c.
Resolving Collision 2:
Open Addressing
Key idea:
-. All elements are stored in the hash table itself,
-. If the slot is already filled during insertion, then try another slot
until an open slot is
found (this is called probing, can be linear, quadratic or double
hash)
-. To search, follow the same probe sequence as insert,
if reach element with the correct key, return it
if reach empty slot, then this element is not found
-. How To delete?
Analysis:
Unsuccessful search is O(1/(1-a))
on average
Successful search is O(1/a
* ln (1/(1-a)))
on average
Open Addressing can't be used if
a > 1.
In fact, if this load factor approaches 1, the performance of open addressing
degrades so badly...
Again... for the complete analysis, refer
to CLR.
3.d. Issues in
Hashing
We have to choose the best hash key for a
given Dictionary Problem, this is a very crucial Database Design issues,
refer to your database book for "key selection criteria".
After choosing the best hash key, we must
determine the best hash function too. We want a hash function that can distribute
keys uniformly into slots and independent on input data. Refer to CLR or
other books regarding good hash function.
What's Next?
After sorting and searching, let's move on
to a more advanced problem solving techniques, Greedy paradigm and Dynamic
Programming in the next two chapters.
This document, 6_searching.html, has been accessed 8257 times since 26-Jul-04 20:32:28 SGT.
This is the 2nd time it has been accessed today.
A total of 5146 different hosts have accessed this document in the
last 1594 days; your host, 38.103.63.56, has accessed it 2 times.
If you're interested, complete statistics for
this document are also available, including breakdowns by top-level
domain, host name, and date.
|