| Week 2 Min-Yen KAN | |
| (Self-Study Module) |
| To encode English text, we need 26 lower case letters, 26 upper case letters, and a handful of punctuation | |||
| We can get by with 64 characters (6 bits) in all | |||
| Each character is therefore 6 bits wide | |||
| We can do better, provided: | |||
| Some characters are more frequent than others | |||
| Characters may be different bit widths, so that for example, e use only one or two bits, while x uses several | |||
| We have a way of decoding the bit stream | |||
| Must tell where each character begins and ends | |||
| A = 0 B = 100 C = 1010 D = 1011 R = 11 |
|
| ABRACADABRA = 01001101010010110100110 | |
| This is eleven letters in 23 bits | |
| A fixed-width encoding would require 3 bits for five different letters, or 33 bits for 11 letters | |
| Notice that the encoded bit string can be decoded! |
| In this example, A was the most common letter | ||
| In ABRACADABRA: | ||
| 5 As code for A is 1 bit long | ||
| 2 Rs code for R is 2 bits long | ||
| 2 Bs code for B is 3 bits long | ||
| 1 C code for C is 4 bits long | ||
| 1 D code for D is 4 bits long | ||
| For each encoding unit (letter, in this example), associate a frequency (number of times it occurs) | ||
| You can also use a percentage or a probability | ||
| Create a binary tree whose children are the encoding units with the smallest frequencies | ||
| The frequency of the root is the sum of the frequencies of the leaves | ||
| Repeat this procedure until all the encoding units are in the binary tree | ||
| Assume that relative frequencies are: | ||
| A: 40 | ||
| B: 20 | ||
| C: 10 | ||
| D: 10 | ||
| R: 20 | ||
| Smallest number are 10 and 10 (C and D), |
||
| connect those | ||
| C and D have already been used, and the new node above them (call it C+D) has value 20 | ||
| The smallest values are B, C+D, and R, all of which have value 20 |
||
| Connect any two of these |
||
| The smallest values is R, while A and B+C+D all have value 40 | |
| Connect R to either of the others |
|
| Connect the final two nodes |
| Assign 0 to left branches, 1 to right branches | |
| Each encoding is a path from the root | |
| A = 0 B = 100 C = 1010 D = 1011 R = 11 |
|
| Each path terminates at a leaf | |
| Do you see why encoded strings are decodable? |
| A = 0 B = 100 C = 1010 D = 1011 R = 11 |
|
| No bit string is a prefix of any other bit string | |
| For example, if we added E=01, then A (0) would be a prefix of E | |
| Similarly, if we added F=10, then it would be a prefix of three other encodings (B=100, C=1010, and D=1011) | |
| The unique prefix property holds because, in a binary tree, a leaf is not on a path to any other node |
| Is encoding practical for long texts or short ones? | |||
| Short: impractical | |||
| To decode it, you would need the code table | |||
| The code table is bigger than the message | |||
| Long: practical | |||
| The encoded string is large relative to the code table | |||