Representing non-integer numbers in hardware is a real challenge when we are limited by both area and precision. The usual way to represent them is the IEEE 754 floating point standard, but the amount of storage that standard needs is far too large for our purposes, so we came up with our own format for storing fixed point numbers.
The diagram above is referred to in the explanations that follow.
We usually use the format below to represent fixed point numbers:
Qx.y
x: Number of bits for the whole number part
y: Number of bits for the fractional part (the “decimal” bits)
- We need a total of x + y bits to represent this format
- Q2.3 has 2 whole number bits and 3 fractional bits, so the binary point sits immediately above the 3rd bit from the LSB.
- This requires 5 bits to represent
- The diagram above has the format Q3.4
- It requires 7 bits to represent
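As a rough illustration of the Qx.y format, here is a minimal Python sketch that converts a non-negative value to its Qx.y bit pattern and back. The helper names `to_q` and `from_q` are made up for this example, and truncating (rather than rounding) the extra fractional bits is an assumption, not something the format itself prescribes.

```python
def to_q(value, x, y):
    """Encode a non-negative value as an unsigned Qx.y code (hypothetical helper)."""
    code = int(value * (1 << y))          # scale by 2^y, then truncate (assumed policy)
    if not 0 <= code < (1 << (x + y)):    # the code must fit in x + y bits
        raise OverflowError(f"{value} does not fit in Q{x}.{y}")
    return code


def from_q(code, y):
    """Decode an unsigned Qx.y code back to a Python float."""
    return code / (1 << y)


code = to_q(2.625, 2, 3)       # Q2.3 needs 2 + 3 = 5 bits
print(format(code, "05b"))     # 10101, read as 10.101 in binary
print(from_q(code, 3))         # 2.625
```

The stored value is just an integer; the position of the binary point lives only in the format name.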
SQx.y
This is the same as the format above, but with an extra sign bit to represent signed numbers.
So, we need a total of x + y + 1 bits to represent this type of number.
- SQ2.3 requires 6 bits, and the binary point sits immediately above the 3rd bit from the LSB (same as above).
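The signed format can be sketched the same way. The text above only says that there is an extra sign bit, so this example assumes a two's complement encoding; a sign-magnitude encoding would also fit the description. The names `to_sq` and `from_sq` are hypothetical.

```python
def to_sq(value, x, y):
    """Encode a value as an SQx.y bit pattern (two's complement is an assumption)."""
    total = x + y + 1                      # one extra bit for the sign
    code = int(round(value * (1 << y)))    # scale by 2^y and round to an integer
    if not -(1 << (x + y)) <= code < (1 << (x + y)):
        raise OverflowError(f"{value} does not fit in SQ{x}.{y}")
    return code & ((1 << total) - 1)       # keep only the low `total` bits


def from_sq(bits, x, y):
    """Decode an SQx.y two's complement bit pattern back to a float."""
    total = x + y + 1
    if bits & (1 << (total - 1)):          # sign bit set: value is negative
        bits -= 1 << total
    return bits / (1 << y)


bits = to_sq(-1.5, 2, 3)       # SQ2.3 needs 2 + 3 + 1 = 6 bits
print(format(bits, "06b"))     # 110100
print(from_sq(bits, 2, 3))     # -1.5
```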
Divide and Conquer Method in Solving Division and N-th Root of a Number
- In solving division, we have A/B = C.
- We move B to the other side of the equation ending up with A = CB.
- Our goal is to guess what C is.
- We start guessing C by assuming that its MSB is 1.
- Then, we multiply the “assumed C” by B and compare the result with A.
- If the result is less than or equal to A, the MSB of C is indeed 1; otherwise, we set the MSB of C to 0.
- We then make the same guess for the next bit, keeping the bits already decided, and continue until we reach the last bit.
- There are 2 conditions under which we can stop this operation:
- When A = CB is achieved during our assumptions.
- When we reach the LSB of C.
The diagram above illustrates the idea that I’m trying to explain.
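The bit-by-bit guessing described above can be sketched in Python as follows. This is only an illustration on plain integers, not a description of the actual hardware; the function name `divide_by_guessing` and the `bits` parameter (the width of C) are assumptions made for the example.

```python
def divide_by_guessing(a, b, bits):
    """Return the largest `bits`-wide integer c such that c * b <= a."""
    c = 0
    for i in reversed(range(bits)):   # from the MSB of C down to its LSB
        trial = c | (1 << i)          # assume the current bit is 1
        product = trial * b
        if product <= a:              # guess was right: keep the bit
            c = trial
            if product == a:          # A = CB reached, stop early
                break
        # otherwise the bit stays 0 and we move on to the next one
    return c


print(divide_by_guessing(42, 6, 4))   # 7
print(divide_by_guessing(45, 6, 4))   # 7, the remainder is dropped

# For y fractional bits in the quotient, pre-shift A by y:
# 5 / 2 with a Q3.2 result -> code 10, i.e. 10 / 4 = 2.5
print(divide_by_guessing(5 << 2, 2, 5))   # 10
```

If the true quotient does not fit in `bits` bits, this sketch simply returns all ones, so the caller has to size C appropriately.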
N-th Root of a Number
- This is very similar to the method used for division.
- We have A^(1/B) = C.
- We move the power to the other side of the equation, ending up with A = C^B.
- Now, we start guessing C again.
- The guessing method is the same as the one we used for division, except that we compare C^B (instead of CB) with A.
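The same guessing loop, adapted to roots, might look like the sketch below. As with division, this is an integer-only illustration under assumed names (`nth_root_by_guessing`, `bits`); the only change is that the trial value is raised to the power B before being compared with A.

```python
def nth_root_by_guessing(a, b, bits):
    """Return the largest `bits`-wide integer c such that c ** b <= a."""
    c = 0
    for i in reversed(range(bits)):   # from the MSB of C down to its LSB
        trial = c | (1 << i)          # assume the current bit is 1
        power = trial ** b
        if power <= a:                # guess was right: keep the bit
            c = trial
            if power == a:            # A = C^B reached, stop early
                break
    return c


print(nth_root_by_guessing(27, 3, 4))   # 3, since 3^3 = 27
print(nth_root_by_guessing(30, 2, 4))   # 5, since 5^2 = 25 <= 30 < 36
```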
Operating with Fixed Point Numbers
- Multiplication isn’t a problem, because the numbers of whole number bits and of fractional bits of the two operands simply add up.
- E.g. Q2.3 * Q4.5 = Q6.8
- Addition and subtraction are more of a problem: the binary points of the two operands have to be aligned by shifting before we operate on them.
- How far we shift in this situation usually depends on the precision needed by the design.
- All bits used to represent the whole number should be kept, since dropping any of them would change the result significantly.
- The number of fractional bits can be chosen however the designer wants: more bits provide better resolution at the cost of larger area, and vice versa.
- Adding two fixed point numbers gives a result with at most max(whole number bits of number 1, whole number bits of number 2) + 1 whole number bits; the extra bit holds the carry.
- E.g. Q3.4 + Q2.3 = Q4.x
- Subtracting two fixed point numbers gives a result with max(whole number bits of number 1, whole number bits of number 2) whole number bits, as subtraction produces no carry (assuming the result is not negative).
- E.g. Q3.4 - Q2.3 = Q3.x
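The format bookkeeping for these operations can be sketched in Python by carrying each number as an integer code together with an (x, y) pair. All helper names here are hypothetical. For addition and subtraction this sketch keeps the larger of the two fractional widths, which is only one possible choice for the "x" in Q4.x and Q3.x; a real design might keep fewer bits.

```python
def q_mul(a, fmt_a, b, fmt_b):
    """Multiply two unsigned codes; whole and fractional bit counts add up."""
    (xa, ya), (xb, yb) = fmt_a, fmt_b
    return a * b, (xa + xb, ya + yb)


def align(a, fmt_a, b, fmt_b):
    """Shift the operand with fewer fractional bits so the binary points line up."""
    ya, yb = fmt_a[1], fmt_b[1]
    y = max(ya, yb)                   # assumed policy: keep every fractional bit
    return a << (y - ya), b << (y - yb), y


def q_add(a, fmt_a, b, fmt_b):
    """Add two unsigned codes; one extra whole number bit holds the carry."""
    a2, b2, y = align(a, fmt_a, b, fmt_b)
    return a2 + b2, (max(fmt_a[0], fmt_b[0]) + 1, y)


def q_sub(a, fmt_a, b, fmt_b):
    """Subtract two unsigned codes; no extra whole number bit is needed."""
    a2, b2, y = align(a, fmt_a, b, fmt_b)
    if a2 < b2:
        raise ValueError("negative result needs a signed format")
    return a2 - b2, (max(fmt_a[0], fmt_b[0]), y)


# 2.5 in Q2.3 is code 20; 3.25 in Q4.5 is code 104
prod, prod_fmt = q_mul(20, (2, 3), 104, (4, 5))
print(prod_fmt, prod / 2 ** prod_fmt[1])       # (6, 8) 8.125

# 5.75 in Q3.4 is code 92; 2.5 in Q2.3 is code 20
total, total_fmt = q_add(92, (3, 4), 20, (2, 3))
print(total_fmt, total / 2 ** total_fmt[1])    # (4, 4) 8.25
diff, diff_fmt = q_sub(92, (3, 4), 20, (2, 3))
print(diff_fmt, diff / 2 ** diff_fmt[1])       # (3, 4) 3.25
```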
Issues with Fixed Point Numbers
- Usually, designs use 16 bits to represent fixed point numbers.
- We have to keep track of operations such as chained multiplications, as they quickly blow up the number of bits required to represent the whole number part of the result.
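To get a feel for how quickly the width grows, the sketch below applies the multiplication rule repeatedly. The Q4.12 split of a 16 bit word is an assumption chosen for the example, and `chained_product_format` is a hypothetical helper.

```python
def chained_product_format(fmt, operands):
    """Format of the full-precision product of `operands` values of format `fmt`."""
    x, y = fmt
    return x * operands, y * operands   # bit counts add up at every multiplication


for n in range(1, 5):
    x, y = chained_product_format((4, 12), n)
    print(f"{n} operand(s): Q{x}.{y}, {x + y} bits")
```

With these assumed numbers, the product of four 16 bit operands already needs 64 bits at full precision, so bits have to be dropped somewhere along the chain.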