NUS HomeSort
Home | Up


Last updated on: 10 November 2008 06:01:29 PM

1. Introduction to Sorting
2. Comparison-based sorting algorithms
   a. Bubble Sort
   b. Insertion Sort
   c. Selection/Minimal/Maximal Sort
   d. Merge Sort
   e. Quick Sort and tips & trick using C/C++ qsort library
   f. Heap Sort
3. Linear-time Sorting
   a. Lower bound of comparison-based sort is
W(n log n)
   b. Counting Sort
   c. Radix Sort
   d. Bucket Sort

References:
-. CLRS chapter 1 [Insertion Sort,Merge Sort],
-. CLRS Chapter 6 [Heapsort],
-. CLRS Chapter 7 [Quicksort],
-. CLRS Chapter 8 [Sorting in Linear Time]

1. Introduction to Sorting

Definition of a sorting problem:

Input:
A sequence of N numbers (a1,a2,...,aN)
Output:
A permutation (a1',a2',...,aN') of the input sequence such that a1' <= a2' <=... <= aN'

Sorting is very important in computer science, and of course, in the context of this website, very important for programming contests. Why? because sorted items are much more easier to be searched. That's why these two terms, "sorting" & "searching" usually come together. After sort, then search..., or formally said, sorting is commonly used as the initialization step for another algorithm.

Things to be considered in Sorting

These are the difficulties in sorting that can also happen in real life:

Size of the list to be ordered is the main concern. Sometimes, the computer memory is not sufficient to store all data. You may only be able to hold part of the data inside the computer at any time, the rest will probably have to stay on disc or tape. This is known as the problem of external sorting. However, rest assured, that almost all programming contests problem size will never be extremely big such that you need to access disc or tape to perform external sorting... (such hardware access is usually forbidden during contests).

Another problem is the stability of the sorting method. Example: suppose you are an airline. You have a list of the passengers for the day's flights. Associated to each passenger is the number of his/her flight. You will probably want to sort the list into alphabetical order. No problem... Then, you want to re-sort the list by flight number so as to get lists of passengers for each flight. Again, "no problem"... - except that it would be very nice if, for each flight list, the names were still in alphabetical order. This is the problem of stable sorting.

To be a bit more mathematical about it, suppose we have a list of items {xi} with xa equal to xb as far as the sorting comparison is concerned and with xa before xb in the list. The sorting method is stable if xa is sure to come before xb in the sorted list.

Finally, we have the problem of key sorting. The individual items to be sorted might be very large objects (e.g. complicated record cards). All sorting methods naturally involve a lot of moving around of the things being sorted. If the things are very large this might take up a lot of computing time -- much more than that taken just to switch two integers in an array. In this kind of case the best approach is the leave the actual data alone and instead work with a list of 'markers' for the data. You shuffle the markers, not the data itself. If you really want to, you can reorder the actual data after you have determined its ordering.

2. Comparison-based sorting algorithms

Comparison-based sorting algorithms involves comparison between two object a and b to determine one of the three possible relationship between them: less than, equal, or greater than. These sorting algorithms are dealing with how to use this comparison effectively, so that we minimize the amount of such comparison. Lets start from the most naive version to the most sophisticated comparison-based sorting algorithms.

Note: I assume that we will use '0' as the first index of an array, not '1' !!!

a. Bubble Sort

Speed: O(n^2), extremely slow
Space: The size of initial array
Coding Complexity: Simple

This is the simplest and (unfortunately) the worst sorting algorithm. This sort will do double pass on the array and swap 2 values when necessary.

Bubble sort - basic idea

Repeatedly sweep through array, exchanging adjacent pairs of elements that are out of order.

Bubble sort pseudo code (Note: there are some variant of the pseudo code given below, however, the implementation detail is up to you).

BubbleSort(A)
  for i <- length[A]-1 down to 1
    for j <- 0 to i-1
      if (A[j] > A[j+1]) // change ">" to "<" to do a descending sort
        temp <- A[j]
        A[j] <- A[j+1]
        A[j+1] <- temp

Slow motion run of Bubble Sort (Bold == sorted region):

5 2 3 1 4
2 3 1 4 5
2 1 3 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
>> done


TEST YOUR BUBBLE SORT KNOWLEDGE

Solve UVa problems related with Bubble sort:

299 - Train Swapping
612 - DNA Sorting

10327 - Flip Sort

b. Insertion Sort

Speed: O(n^2), extremely slow, but slightly faster than bubble sort in partially sorted array
Space: The size of initial array
Coding Complexity: Simple

Insertion sort - basic idea

Maintain "sorted region" at beginning of array.
Repeatedly "grow" sorted region by inserting first element from unsorted region into array.

This algorithm will run better in a partially sorted array, because it doesn't do swapping if the elements are already partially sorted.

Insertion sort pseudo code:

InsertionSort(A)
  for j <- 1 to length[A]-1
    key <- A[j] // insert A[j] into the sorted sequence A[1..j-1]
    i <- j - 1

    while (i>=0 and A[i]>key)
      A[i+1] <- A[i]
      i <- i - 1

    A[i+1] <- key

Insertion sort in JAVA:

 static void InsertionSort() {
   for (int j=1;j<=totalElements-1;j++) {
    int key = A[j]; // insert A[j] into the sorted sequence A[1..j-1]
    int i = j - 1;

    while (i>=0 && A[i]>key) {
      A[i+1] = A[i];
      i = i - 1;
    }

    A[i+1] = key;
  }

Slow motion run of Insertion Sort (Bold == position of index j):

5 2 4 6 1 3
2 5 4 6 1 3
2 4 5 6 1 3
2 4 5 6 1 3
1 2 4 5 6 3
1 2 3 4 5 6 >> done

c. Selection/Minimal/Maximal Sort

Speed: O(n^2), extremely slow
Space: The size of initial array
Coding Complexity: Simple

Selection sort - basic idea

Move the smallest element to a[0],
Move the second smallest to a[1],

Move the (n-1)st smallest to a[n-2].

Another approach in sorting, yet it is still a slow O(n^2) algorithm.

Selection sort pseudo code:

SelectionSort(A)
  for i <- length[A]-1 downto 0
    for j <- 0 to i
      if (A[j]>MAX) // for descending, change this to MIN
        MAX=A[j]
        MAX_ID=j

    temp=A[i]
    A[i]=A[MAX_ID]
    A[MAX_ID]=temp

Slow motion run of Selection Sort (Bold == sorted region):

5 1 3 2 4
4 1 3 2 5
1 3 2 4 5
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
>> done

d. Merge Sort

Speed: O(n Log n), similar speed with Quick Sort, pure n log n algorithm
Space: 2*the size of initial array
Coding Complexity: Complex, using Divide & Conquer approach

Merge Sort - basic idea

Sort first half of the array.
Sort second half of the array.
Merge the result.

The bad side of this sorting algorithm is this algorithm need 2*the size of initial array to perform it's task. This can be very inefficient from memory-Point of View.

Slow motion run of Merge Sort (Bold == merge step - where the actual sorting happen):

Original Array: 5 1 3 2 4

LHS: 5 1 3 -> LHS: 5 1 -> LHS: 5 -> MERGE: 1 5
                       \> RHS: 1 /          \
           \> RHS: 3 ---------------------> MERGE: 1 3 5
RHS: 2 4 -> LHS: 2                                  \
         \> RHS: 4 -> MERGE: 2 4 --------------------------> MERGE: 1 2 3 4 5


Note: Merge Sort can be slightly modified to count inversion index (i.e. bubble sort swaps) in O(n log n) time. When the right partition goes in first in Merge step, that means all the remaining items in left partition < first element in right partition (i.e. you have to swap with all of them).

e. Quick Sort

Speed: O(n log n), one of the best sorting algorithm.
Space: The size of initial array
Coding Complexity: Complex, using Divide & Conquer approach

One of the best sorting algorithm known. Quick sort use Divide & Conquer approach and partition each subset. Partitioning a set is to divide a set into a collection of mutually disjoint sets. This sort is much quicker compared to "stupid but simple" bubble sort. Quick sort was invented by C.A.R Hoare. A good text regarding Quick Sort can be found in [CLR], chapter 8.

Quick Sort - basic idea

Partition the array in O(n)
Recursively sort left array in O(log2 n) best/average case

Recursively sort right array in O(log2 n) best/average case

Quick Sort facts:

Worst case time: O(n^2).
Average case (if all orderings equally likely): O(n log n).
Average case (if random pivot, worst-case input): O(n log n).
Fast in practice due to in-site sorting unlike merge sort.
There are many pivot selection algorithm for Quick Sort (different execution time).

Quick sort pseudo code:

QuickSort(A,p,r)
  if p < r
    q <- Partition(A,p,r)
    QuickSort(A,p,q)
    QuickSort(A,q+1,r)

Quick sort in JAVA, using Old Partition Algorithm:

  static int Partition(int l,int r) {
    int x = A[l]; // pivot
    int i = l-1; // set outside boundary first
    int j = r+1; // set outside boundary first
    
    while (true) {
      while (true) {
        j--;
        if (A[j] <= x) break;
      }
        
      while (true) {
        i++;
        if (A[i] >= x) break;
      }
      if (i<j) {
        int temp=A[i];
        A[i]=A[j];
        A[j]=temp;
      }
      else
        return j;
    }
  }

  static void QuickSort(int l,int r) { 
    if (l < r) {
       int q = Partition(l,r);
       QuickSort(l,q);
       QuickSort(q+1,r);
    }
  }
      
 Slow motion run of Quick Sort, using standard pivot method (select the first item as pivot):

4 7 1 2 3 10 8 - Pivot=4
3 7 1 2 4 10 8
3 2 1 7 4 10 8 -> first pass of Partition, split at 1

3 2 1 - Pivot=3                7 4 10 8 - Pivot=7
1 2 3 - split at 2             4 7 10 8 - split at 7

1 2 - Pivot=1    3-base case   4-base case   7 10 8 - Pivot 7
1 2 - split at 1                             7 10 8 - Pivot 10

1-2-base case                                7-base case   10 8 - Pivot 10
                                                           8
10 - split at 8
                                                           8-10-base case
Final result: 1 2 3 4 7 8 10


Quick Sort for C/C++ User

C/C++ standard library <stdlib.h> contains qsort function.

This is not the best quick sort implementation in the world but it fast enough and VERY EASY to be used... therefore if you are using C/C++ and need to sort something, you can simply call this built in function:

qsort(<arrayname>,<size>,sizeof(<elementsize>),compare_function);

The only thing that you need to implement is the compare_function, which takes in two arguments of type "const void", which can be cast to appropriate data structure, and then return one of these three values:

negative, if a should be before b
0, if a equal to b
positive, if a should be after b

With these three possibilities, you can virtually compare anything, starting from simply sort a list of integers, sort a list of strings, to a more complex stuff such as sorting floating point numbers, sorting records based on key, and finally... multi field sorting :)... a very advanced technique for sorting in my opinion :)

1. Comparing a list of integers

simply cast a and b to integers
if x < y, x-y is negative, x == y, x-y = 0, x > y, x-y is positive
x-y is a shortcut way to do it :)
reverse *x - *y to *y - *x for sorting in decreasing order

int compare_function(const void *a,const void *b) {
  int *x = (int *) a;
  int *y = (int *) b;
  return *x - *y;
}

2. Comparing a list of strings

for comparing string, you need strcmp function inside string.h lib.
strcmp will by default return -ve,0,ve appropriately... to sort in reverse order, just reverse the sign returned by strcmp

#include <string.h>

int compare_function(const void *a,const void *b) {
   return (strcmp((char *)a,(char *)b));
}

3. Comparing floating point numbers

In comparing floating points, you must be careful. You can't simply use the technique listed in number 1 above, why? because compare_function returns an integer. Simply substracting two floating points and cast them as integer will causes precision errors. For example: 0.5 and 0.6, if you simply use number 1 technique above, you'll return 0.5 - 0.6 which is -0.1, which when converted to integer yields 0... wrong answer rite...

int compare_function(const void *a,const void *b) {
  double *x = (double *) a;
  double *y = (double *) b;
  // return *x - *y; // this is WRONG...
  if (*x < *y) return -1;
  else if (*x > *y) return 1;
  return 0;
}

4. Comparing records based on a key

Sometimes you need to sort a more complex stuffs, such as record. Here is the simplest way to do it using qsort library

typedef struct {
  int key;
  double value;
} the_record;

int compare_function(const void *a,const void *b) {
  the_record *x = (the_record *) a;
  the_record *y = (the_record *) b;
  return x->key - y->key;
}

5. Multi field sorting, advanced sorting technique :)

Sometimes sorting is not based on one key only... For example sorting birthday list... First you sort by month, then if the month ties, sort by date (obviously), then finally by year...

For example I have an unsorted birthday list like this:

24 - 05 - 1982 - Steven
24 - 05 - 1980 - Cecilia Cheung
31 - 12 - 1999 - End of 20th century
01 - 01 - 0001 - Start of modern calendar

I will have a sorted list like this:

01 - 01 - 0001 - Start of modern calendar
24 - 05 - 1980 - Cecilia Cheung
24 - 05 - 1982 - Steven
31 - 12 - 1999 - End of 20th century

To do multi field sorting like this, traditionally one will choose multiple sort using sorting algorithm which has "stable-sort" property. This is outdated...

The better way to do multi field sorting is to modify the compare_function in such a way that you break ties accordingly... I'll give you an example using birthday list again.

typedef struct {
  int day,month,year;
  char *name;
} birthday;

int compare_function(const void *a,const void *b) {
  birthday *x = (birthday *) a;
  birthday *y = (birthday *) b;
  if (x->month != y->month) // months different
    return x->month - y->month; // sort by month
  else { // months equal..., try the 2nd field... day
    if (x->day != y->day) // days different
      return x->day - y->day; // sort by day
    else // days equal, try the 3rd field... year
      return x->year - y->year; // sort by year
  }
}

TEST YOUR MULTI FIELD SORTING KNOWLEDGE

Solve UVa problems related with multi field sorting:

10194 - Football (aka Soccer)

Note about Partition:

Partition is the heart of Quick Sort algorithm. In [CLR], we've learnt about the first Partition algorithm used by C.A.R. Hoare when he first invented this algorithm. This partition algorithm is still O(N) but not really cache friendly, since it start swapping from leftmost & rightmost.

In [CLRS], the 2nd edition of the book, we are introduced with a better way to do Partition. First, we select an element as a pivot (black), then we grow blue and red region such that all elements in blue <= pivot and all elements in red >= pivot


Source: CS3230 Lecture Note 2

Pseudo code:
1). Set i = p-1 and j = p
2). Increment j until A[j] <= pivot
3). Increment i; swap A[i] and A[j]
4). Repeat steps 2) to 3) until j = r
5). Swap A[i+1] and A[r]
6). Return i+1 // this is q - final position of pivot

This variant of Partition algorithm is considered better because it is more cache friendly by maximizing locality, since when an element is in the wrong partition, it will be swapped with the nearest element that is also in the wrong partition. i.e., this Partition algorithm grows blue and red region from left, where in previous algorithm, the red partition is located on the rightmost. The new one also easier to analyze since it only uses one comparison function in line 2.

f. Heap Sort

Speed: O(N*log N)
Space: The size of initial array
Coding Complexity: Complex, using advanced data structure, a Heap
.

Refer to Data Structures section regarding Heap.

Heap Sort - basic idea

Build the Heap from array A
We know the root of the Heap is the highest value, swap with the last element
Now we are sure that the last element has the highest value
Then we decrease the heap size by 1
And do Heapify in O(log N) without considering the last element
Repeat this process until Heap size become 1.

Quick sort pseudo code:

HeapSort(A)
  Build-Heap(A)
  for i <- length(A) downto 2
    exchange(A[1],A[i])
    heap-size[A]--
    Heapify(A,1)

Heap Sort sample run

Trace the following Heap step by step
Initial, unsorted array A={5,13,2,25,7,17,20,8,4}

              5
     13               2
  25      7      17      20 
 8   4

After Build-Heap

             25
     13              20
   8      7      17       2
 5   4

* denotes the sorted region
Pass 1

              4
     13              20
   8      7      17       2
 5   25*

Pass 2

              5
     13              17
   8      7       4       2
20*  25*

Pass 3

              2
     13               5
   8      7       4      17*
20*  25*

Pass 4

              4
      8               5
   2      7      13*      17*
20*  25*

Pass 5

              4
      7               5
   2      8*     13*      17*
20*  25*

Pass 6

              2
      4               5
   7*     8*     13*      17*
20*  25*

Pass 7

              2
      4               5*
   7*     8*     13*      17*
20*  25*

Pass 8 - Finish

              2
      4*              5*
   7*     8*     13*      17*
20*  25*

Sorted array = {2,4,5,7,8,13,17,20,25}

3. Linear-time Sorting

a. Lower bound of comparison-based sort is W(n log n)

The sorting algorithms that we see above are comparison-based sort, they use comparison function such as <, <=, =, >, >=, etc to compare 2 elements. We can model this comparison sort using decision tree model, and we can proof that the shortest height of this tree is W(n log n). Details can be found in CLRS chapter 8.

In the next section, we will see three non-comparison based sorting algorithm which able to runs in linear time for special cases.

b. Counting Sort

For Counting Sort, we assume that the numbers are in the range [0..k], where k is at most O(n). We set up a counter array which counts how many duplicates inside the input, and the reorder the output accordingly, without any comparison at all. Complexity is O(n+k). Refer to CLRS chapter 8 for more details.

c. Radix Sort

For Radix Sort, we assume that the input are n d-digits number, where d is reasonably limited.

Radix Sort will then sorts these number digit by digit, starting with the least significant digit to the most significant digit. It usually use a stable sort algorithm to sort the digits, such as Counting Sort above.

Example:

input:
321
257
113
622

sort by third (last) digit:
321
622
113
257
after this phase, the third (last) digit is sorted.

sort by second digit:
113
321
622
257
after this phase, the second and third (last) digit are sorted.

sort by second digit:
113
257
321
622
after this phase, all digits are sorted.

For a set of n d-digits numbers, we will do d pass of counting sort which have complexity O(n+k), therefore, the complexity of Radix Sort is O(d(n+k)). However, we need to balance the n and d factor wisely. Refer to CLRS chapter 8 for more details.

d. Bucket Sort

For Bucket Sort, we assume that the input distribution is in the range [0,1). Set up n buckets that contain range [i/n..i+1/n) for each i = 0 to n. Then do a linear scan through the input and put each elements to appropriate buckets. If the input are uniformly distributed, we can expect that each bucket will hold very small number of elements, which can be sorted easily using any sorting algorithm, say, Insertion Sort. The overall complexity will be just O(n) in this case. Refer to CLRS chapter 8 for more details.

What's Next?

You now know how to do sorting, no matter how complex the things that you need to sort is (using C qsort library). Now you'll learn the complement of sorting, which is searching, at the next chapter.


This document, 5_sorting.html, has been accessed 23664 times since 28-Feb-28 23:35:10 SGT. This is the 3rd time it has been accessed today.

A total of 14931 different hosts have accessed this document in the last -6685 days; your host, 38.107.191.105, has accessed it 1 times.

If you're interested, complete statistics for this document are also available, including breakdowns by top-level domain, host name, and date.