| 
   1 
   | 
  
  
  
  
   - Week 2 			Min-Yen KAN
 
   - (Self-Study Module)
 
   
   
   | 
 
 
  | 
   2 
   | 
  
  
  
  
   - To encode English text, we need 26 lower case letters, 26 upper case
       letters, and a handful of punctuation
 
   - We can get by with 64 characters (6 bits) in all
 
   - Each character is therefore 6 bits wide
 
   - We can do better, provided:
 
   
    - Some characters are more frequent than others
 
    - Characters may be different bit widths, so that for example, e use only
        one or two bits, while x uses several
 
    - We have a way of decoding the bit stream
 
    
     - Must tell where each character begins and ends
 
     
    
   
   
   | 
 
 
  | 
   3 
   | 
  
  
  
  
   - 	A = 0
 
              B = 100 
              C = 1010 
              D = 1011 
              R = 11 
   - 	ABRACADABRA = 01001101010010110100110
 
   - This is eleven letters in 23 bits
 
   - A fixed-width encoding would require 3 bits for five different letters,
       or 33 bits for 11 letters
 
   - Notice that the encoded bit string can be decoded!
 
   
   
   | 
 
 
  | 
   4 
   | 
  
  
  
  
   - In this example, A was the most common letter
 
   - In ABRACADABRA:
 
   
    - 5 As	code for A is 1 bit long
 
    - 2 Rs	code for R is 2 bits long
 
    - 2 Bs	code for B is 3 bits long
 
    - 1 C	code for C is 4 bits long
 
    - 1 D	code for D is 4 bits long
 
    
   
   
   | 
 
 
  | 
   5 
   | 
  
  
  
  
   - For each encoding unit (letter, in this example), associate a frequency
       (number of times it occurs)
 
   
    - You can also use a percentage or a probability
 
    
   - Create a binary tree whose children are the encoding units with the
       smallest frequencies
 
   
    - The frequency of the root is the sum of the frequencies of the leaves
 
    
   - Repeat this procedure until all the encoding units are in the binary
       tree
 
   
   
   | 
 
 
  | 
   6 
   | 
  
  
  
  
   - Assume that relative frequencies are:
 
   
    - A: 40
 
    - B: 20
 
    - C: 10
 
    - D: 10
 
    - R: 20
 
    
   - Smallest number 
 
              are 10 and 10  
              (C and D), 
   
   
   
   | 
 
 
  | 
   7 
   | 
  
  
  
  
   - C and D have already been used, and the new node above them (call it C+D)
       has value 20
 
   - The smallest values
 
              are B, C+D,  
              and R,  
              all of which  
              have value 20 
   
   
   
   | 
 
 
  | 
   8 
   | 
  
  
  
  
   - The smallest values is R, while A and B+C+D all have value 40
 
   - Connect R to
 
              either of the  
              others 
    
    
   
   
   | 
 
 
  | 
   9 
   | 
  
  
  
  
   - Connect the final two nodes
 
   
   
   | 
 
 
  | 
   10 
   | 
  
  
  
  
   - Assign 0 to left branches, 1 to right branches
 
   - Each encoding is a path from the root
 
   - A = 0
 
              B = 100 
              C = 1010 
              D = 1011 
              R = 11 
   - Each path terminates at a leaf
 
   - Do you see why encoded strings are decodable?
 
   
   
   | 
 
 
  | 
   11 
   | 
  
  
  
  
   - A = 0			B = 100
 
              C = 1010			D = 1011 
              R = 11 
    
    
   - No bit string is a prefix of any other bit string
 
   - For example, if we added E=01, then A (0) would be a prefix of E
 
   - Similarly, if we added F=10, then it would be a prefix of three other
       encodings (B=100, C=1010, and D=1011)
 
   - The unique prefix property holds because, in a binary tree, a leaf is
       not on a path to any other node
 
   
   
   | 
 
 
  | 
   12 
   | 
  
  
  
  
   - Is encoding practical for long texts or short ones?
 
   
    - Short: impractical
 
    
     - To decode it, you would need the code table
 
     - The code table is bigger than the message
 
     
    - Long: practical
 
    
     - The encoded string is large relative to the code table
 
     
    
   
   
   |