
Floating-Point Representation

As you can see, fixed-point representation has a limited range. This limitation is the result of having the binary point in a fixed location. The alternative, called floating-point representation, does not fix the location of the binary point, which allows us to represent numbers that are very large or very small in magnitude.

The basic idea behind the representation is not totally new: it is basically scientific notation in binary. It is standardised as the IEEE 754 floating-point representation.

Representation

The scientific notation can be represented as

<sign> <mantissa> × <base>^<exponent>

Before we go into the details, let us consider a few simple scientific notations that we (hopefully) have learnt in high-school (and hopefully still remember).

  • 1.23 = +0.123 × 10^1
    • <sign>: +
    • <base>: 10
    • <mantissa>: 0.123
    • <exponent>: 1
  • 23000000000000000000000 = +0.23 × 10^23
    • <sign>: +
    • <base>: 10
    • <mantissa>: 0.23
    • <exponent>: 23
  • 0.00000000000000000000000000000000000005 = +0.5 × 10^-37
    • <sign>: +
    • <base>: 10
    • <mantissa>: 0.5
    • <exponent>: -37
  • -0.00000000000000000000000000002397 = -0.2397 × 10^-28
    • <sign>: -
    • <base>: 10
    • <mantissa>: 0.2397
    • <exponent>: -28

Since computers work in binary, we need to adapt this scientific notation to binary. This is done by making all of the following changes.

  • Represent the <mantissa> as a normalised binary instead of normalised decimal.
  • Use the value of 2 as the <base>.
  • Represent the <exponent> as binary that allows negative number.

<sign> <mantissa> × 2^<exponent>

Floating-Point

Since the base is always 2, we can exclude it from our binary representation. As such, the representation can be summarised as follows. There are two basic formats:

float (32 bits):

  • <sign>: 1-bit
  • <mantissa>: 23-bit (normalised)
  • <exponent>: 8-bit (Excess-127)

double (64 bits):

  • <sign>: 1-bit
  • <mantissa>: 52-bit (normalised)
  • <exponent>: 11-bit (Excess-1023)

We will focus on single precision, but the extension to double precision is straightforward.
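
To make the layout concrete, here is a minimal C sketch (assuming a typical platform where float is the 32-bit IEEE 754 single-precision format) that copies the bits of a float into an integer and masks out the three fields:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        float f = -6.5f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);          /* reinterpret the 32 bits of the float */

        unsigned sign     = bits >> 31;          /* 1 bit                          */
        unsigned exponent = (bits >> 23) & 0xFF; /* 8 bits, Excess-127             */
        unsigned mantissa = bits & 0x7FFFFF;     /* 23 bits, hidden bit not stored */

        printf("sign = %u, exponent = %u (actual %d), mantissa = 0x%06X\n",
               sign, exponent, (int)exponent - 127, mantissa);
        return 0;
    }

For -6.5 this prints sign = 1, exponent = 129 (actual 2), mantissa = 0x500000, which matches the worked example later in this section.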

Memorise

An easy way to memorise this is to remember only two values:

  1. The total number of bits.
  2. The number of bits for the <exponent>.

You should already be very familiar with the first after working with the C language for a while now. The rest can be inferred from these two numbers, as the short check after the list below also shows. Consider the single-precision number: a total of 32 bits and an 8-bit <exponent>.

  • <sign>: Always 1 bit; there are only 2 states (+ or -), so 1 bit is always enough.
  • <mantissa>: Since the total is 32 bits and we have an 8-bit <exponent>, we can compute the mantissa as 32 - 8 - 1 = 23 bits. We subtract 1 bit for the <sign>.
  • <exponent>: We already know that it takes 8 bits. What we need to work out is the excess representation. Since we try to distribute positive and negative values roughly evenly, we have two choices: Excess-(2^(n-1)) or Excess-(2^(n-1) - 1). In this case, we take the latter as we favour positive numbers over negative numbers. So, we can infer that it should be Excess-(2^(8-1) - 1), which is Excess-127.
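
As a small sanity check, the sketch below (for single precision) derives the remaining values from just the two memorised numbers:

    #include <stdio.h>

    int main(void) {
        int total_bits    = 32;  /* memorised */
        int exponent_bits = 8;   /* memorised */
        int sign_bits     = 1;   /* always 1  */

        int mantissa_bits = total_bits - exponent_bits - sign_bits;  /* 23 */
        int excess        = (1 << (exponent_bits - 1)) - 1;          /* 2^(8-1) - 1 = 127 */

        printf("mantissa: %d bits, Excess-%d\n", mantissa_bits, excess);
        return 0;
    }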

Normalised Mantissa

Before we go into the example, there is a trick that is used to virtually extend the number of bits encoded in <mantissa> by 1 bit. First, note that unlike decimal scientific notation, binary has only two symbols: 0 and 1. As such, the normalised form is either 0._ or 1._.

To increase the number of encoded bits by 1, we simply assume that the mantissa is always of the form 1.<recorded mantissa>. This assumption, that we always have 1._, is called the normalised mantissa. Because of it, we do not have to record this hidden bit in our <recorded mantissa>.


Note that we cannot do this normalisation in decimal scientific notation because there are too many possible leading digits besides 0 (any of 1 to 9). In fact, this trick only works in binary scientific notation, since we have only two symbols.
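
The following C sketch (assuming a normal, non-zero value and a 32-bit IEEE 754 float) shows the hidden bit being put back when a value is decoded; the constant 8388608 is 2^23, the scale of the 23 recorded mantissa bits:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <math.h>

    int main(void) {
        float f = -6.5f;
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);

        unsigned sign     = bits >> 31;
        int      exponent = (int)((bits >> 23) & 0xFF) - 127;  /* remove the excess   */
        unsigned mantissa = bits & 0x7FFFFF;                   /* <recorded mantissa> */

        /* 1.<recorded mantissa>: the leading 1 is assumed, never stored */
        double value = ldexp(1.0 + mantissa / 8388608.0, exponent);
        if (sign) value = -value;

        printf("%f\n", value);  /* prints -6.500000 */
        return 0;
    }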

Excess Exponent

Another part that needs some explanation is the use of the excess representation for the exponent. Consider the kinds of operations that we often do in a program. Most programs perform additions (or subtractions) rather than multiplications (or divisions).

Try it out on your own: add two numbers in scientific notation, for instance 1.23 and 23000000000000000000000. You will find that the smaller number has to be rewritten to use the larger exponent before the mantissas can be added. Hence, comparing two exponents is an important operation in floating-point arithmetic, and the excess representation makes this comparison simple: exponents can be compared as plain unsigned numbers.
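
This goes further than just exponents. The sketch below (for 32-bit floats, ignoring special values such as NaN) shows that for two positive floats, comparing the whole raw bit patterns as unsigned integers gives the same ordering as comparing the floats themselves, because the Excess-127 exponent sits in the high-order bits and needs no separate sign handling.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    static uint32_t bits_of(float f) {
        uint32_t b;
        memcpy(&b, &f, sizeof b);
        return b;
    }

    int main(void) {
        float small = 1.23f;
        float big   = 2.3e22f;  /* 23000000000000000000000 */

        /* Both comparisons print 1: unsigned comparison of the raw bits
           agrees with the floating-point comparison for positive values. */
        printf("%d\n", small < big);
        printf("%d\n", bits_of(small) < bits_of(big));
        return 0;
    }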

Example

Consider the number (-6.5)_10. How do we represent this number in float? First, we convert the number into binary.

(-6.5)_10 = (-110.1)_2

Second, we normalise this binary scientific notation while keeping the hidden bit.

(-110.1)_2 = (-1.101)_2 × 2^2

Next, we find the <sign>, <exponent> and <mantissa>.

  • <sign>: 1 (i.e., negative)
  • <exponent>:
    • Initial value: (2)_10
    • Excess-127: (10000001)_2 (equivalent to 2 + 127 = 129)
  • <mantissa>:
    • Initial value: (1.101)_2
    • Recorded (hidden bit dropped): (101)_2
    • Append 0 until the required number of bits: (10100000000000000000000)_2

Combining everything, we get:

<sign> <exponent> <mantissa>
1 10000001 10100000000000000000000

We often also want to write it in hexadecimal for easier reading.

(1 10000001 10100000000000000000000)_2 = (C0D00000)_16
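
We can double-check this encoding with a short C sketch (again assuming float is the 32-bit IEEE 754 format on the machine):

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        uint32_t bits = 0xC0D00000u;  /* 1 10000001 10100000000000000000000 */
        float f;
        memcpy(&f, &bits, sizeof f);
        printf("%f\n", f);            /* prints -6.500000 */

        float g = -6.5f;
        memcpy(&bits, &g, sizeof bits);
        printf("0x%08X\n", (unsigned)bits);  /* prints 0xC0D00000 */
        return 0;
    }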

Exercises

Decimal to Floating-Point

Convert (-36.03125)_10 to IEEE 754 single-precision floating-point format. Write your answer in hexadecimal.

0xC2102000

Steps

First, we convert the number into binary.

(-36.03125)_10 = (-100100.00001)_2

Second, we normalise this binary scientific notation while keeping the hidden bit.

(-100100.00001)_2 = (-1.0010000001)_2 × 2^5

Next, we find the <sign>, <exponent> and <mantissa>.

  • <sign>: 1 (i.e., negative)
  • <exponent>:
    • Initial value: (5)_10
    • Excess-127: (10000100)_2 (equivalent to 5 + 127 = 132)
  • <mantissa>:
    • Initial value: (1.0010000001)_2
    • Recorded (hidden bit dropped): (0010000001)_2
    • Append 0 until the required number of bits: (00100000010000000000000)_2

Convert to hexadecimal

(1 10000100 00100000010000000000000)_2

= (1100 0010 0001 0000 0010 0000 0000 0000)_2

= (C 2 1 0 2 0 0 0)_16

= (C2102000)_16
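
As before, a short C sketch (under the same assumption that float is the 32-bit IEEE 754 format) can confirm the answer:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void) {
        float f = -36.03125f;  /* exactly representable, so no rounding is involved */
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);
        printf("0x%08X\n", (unsigned)bits);  /* prints 0xC2102000 */
        return 0;
    }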