Can someone explain how binary floating point addition works?

Thread Starter

ThatComputerGuy

Joined Jun 13, 2018
8
Hi,

I am currently making a 16-bit CPU in Logisim and am working on the ALU. It has all the basic functions and works just fine, but I can't go any higher than 65535 or lower than -65535, and I also can't represent values between the integers, like 0.2.

I can add and subtract in binary with no problem, but I just can't understand floating point. If anyone could teach me how it works and give me an example, I'd be very happy :)

thanks in advance,

ThatComputerGuy
 

MrChips

Joined Oct 2, 2009
22,089
Floating point (FP) is like scientific notation.
For example, 1234.56 could be written as 1.23456 x 10^3
and 0.000123456 would be 1.23456 x 10^-4

Hence you now need two fields: one for the exponent (3 or -4 in the examples above) and another for the mantissa (1.23456).

In binary, the mantissa is always reduced to the form 1.XXXXXX.
This process is called normalization.
Since the most significant bit is always 1, we don't need to store it (i.e. it is always implied), which frees up space for one more bit of precision.
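
To make the implied bit concrete, here is a minimal Python sketch. The 10-bit fraction width and the helper name `normalize` are assumptions chosen for illustration; nothing above fixes a particular format. It only handles positive numbers and truncates instead of rounding.

```python
import math

FRACTION_BITS = 10  # assumed width of the stored mantissa field


def normalize(value):
    """Split a positive number into (exponent, stored fraction bits).

    The value is approximately 1.f * 2**exponent, and only f is stored.
    """
    if value <= 0:
        raise ValueError("this sketch only handles positive numbers")
    m, e = math.frexp(value)   # value == m * 2**e with 0.5 <= m < 1
    m, e = m * 2, e - 1        # rescale so that 1 <= m < 2
    # Drop the leading 1 and keep FRACTION_BITS bits (truncating, no rounding).
    fraction = int((m - 1) * (1 << FRACTION_BITS))
    return e, fraction


# 5.75 is 101.11 in binary, i.e. 1.0111 * 2**2
exponent, fraction = normalize(5.75)
print(exponent, format(fraction, "010b"))  # 2 0111000000
```

Note that the stored bits `0111000000` do not contain the leading 1; it has to be put back before doing any arithmetic.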

How do you add two FP numbers?
The two numbers must have the same exponent before you can add.
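To do this, take the number with the smaller exponent, shift its mantissa right by one bit (i.e. divide it by two) and add 1 to its exponent. Do this until the exponents match. Remember to put the implied leading 1 back before shifting. Obviously, what is happening here is that you lose precision in the smaller number.
When the smaller number has been adjusted to match the larger number, you can go ahead and add the two mantissas. If the sum comes out as 2 or more (i.e. of the form 1X.XXXXX), shift it right once more and add 1 to the exponent to get back to the normalized form.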
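
The alignment and add steps can be sketched in Python roughly as follows. This is an illustrative sketch, not a full implementation: it assumes a 10-bit fraction field, handles only positive normalized numbers, truncates instead of rounding, and the names `fp_add` and `to_float` are made up for this example. It also renormalizes the sum when it reaches 2.0 or more, which is needed to get back to the 1.XXXXXX form.

```python
FRACTION_BITS = 10                # assumed fraction width
HIDDEN_BIT = 1 << FRACTION_BITS   # the implied leading 1, made explicit


def fp_add(a, b):
    """Add two positive numbers given as (exponent, fraction) pairs."""
    (exp_a, frac_a), (exp_b, frac_b) = a, b
    # Restore the implied leading 1 so each mantissa is 1.xxx in fixed point.
    man_a, man_b = HIDDEN_BIT | frac_a, HIDDEN_BIT | frac_b
    # Make 'a' the operand with the larger exponent.
    if exp_a < exp_b:
        exp_a, man_a, exp_b, man_b = exp_b, man_b, exp_a, man_a
    # Align: one right shift of the smaller operand per exponent step.
    man_b >>= exp_a - exp_b
    total = man_a + man_b
    exponent = exp_a
    # Renormalize if the sum reached 2.0 or more.
    if total >= HIDDEN_BIT << 1:
        total >>= 1
        exponent += 1
    # Strip the leading 1 again before storing the fraction.
    return exponent, total & (HIDDEN_BIT - 1)


def to_float(x):
    """Convert an (exponent, fraction) pair back to a Python float."""
    exponent, fraction = x
    return (HIDDEN_BIT | fraction) / HIDDEN_BIT * 2.0 ** exponent


a = (2, 448)    # 1.0111b * 2**2  = 5.75
b = (-1, 0)     # 1.0b    * 2**-1 = 0.5
print(fp_add(a, b), to_float(fp_add(a, b)))   # (2, 576) 6.25
print(to_float(fp_add((10, 0), (-1, 0))))     # 1024.0
```

The last line adds 1024 and 0.5 and still gets 1024.0: the smaller operand is shifted right until nothing is left of it, which is the loss of precision described above.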

All of this sounds like a lot of work. Yes, it is. The solution is to resort to pre-written FP SW libraries or use HW co-processors to do the hard work for you.

A very acceptable option in many applications is to do fixed-point arithmetic using integers.
For example, if you have to create an application that displays temperatures to one or two decimal places, you can store all your values as integers at 100x the actual value and scale the results back when you display them.
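
A small Python sketch of that idea, assuming a scale factor of 100 (two decimal places). The helper names `format_fixed` and `fixed_mul` are hypothetical and only here for illustration.

```python
SCALE = 100  # values are stored as integer hundredths


def format_fixed(raw):
    """Render a scaled integer as a decimal string with two places."""
    sign = "-" if raw < 0 else ""
    whole, frac = divmod(abs(raw), SCALE)
    return f"{sign}{whole}.{frac:02d}"


def fixed_mul(a, b):
    """Multiply two scaled values; the product carries SCALE twice."""
    return (a * b) // SCALE


reading_a = 2137   # 21.37 degrees
reading_b = 25     #  0.25 degrees
print(format_fixed(reading_a + reading_b))   # 21.62
print(format_fixed(fixed_mul(250, 120)))     # 3.00  (2.50 * 1.20)
```

Addition and subtraction are plain integer operations, so an integer ALU can do them directly. Only multiplication and division need the extra rescaling step.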
 

BobTPH

Joined Jun 5, 2013
2,584
A floating point arithmetic unit will be much more complex than your entire CPU. Software floating point is the usual solution for simple processors, including pretty much all small microcontrollers.

Bob
 

WBahn

Joined Mar 31, 2012
26,295
MrChips said:
Floating point (FP) is like scientific notation.
For example, 1234.56 could be written as 1.23456 x 10^3
and 0.000123456 would be 1.23456 x 10^-4

Hence you now need two fields, one for the exponent, for example 3 or -4, and another for the mantissa, 1.23456

In binary, the mantissa is always reduced to the form 1.XXXXXX.
This process is called normalization.
Since the most significant bit is always 1, we can ignore this (i.e. it is always implied) and hence have space for one more bit.

The problem with this is that it means that you can't represent zero.

There were floating point representations for which this was the case. But most people agreed that this was unacceptable and so they found ways to represent zero somehow.

The IEEE-754 standard deals with this by specifying that the smallest exponent pattern represents the same exponent as the next smallest, but that the most significant bit is now assumed to be 0. Zero is then simply this exponent pattern with an all-zero mantissa. This is called "denormalization" or "gradual underflow".
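
As an illustration, here is a minimal Python sketch that decodes a 16-bit pattern using that rule. It assumes the IEEE-754 half-precision layout (1 sign bit, 5 exponent bits, 10 fraction bits, bias 15), picked here only because it fits a 16-bit word. The function name `decode_half` is made up, and infinities and NaNs (exponent field all ones) are deliberately not handled.

```python
FRACTION_BITS = 10
BIAS = 15


def decode_half(bits):
    """Decode a 16-bit half-precision pattern (no infinity/NaN handling)."""
    sign = -1.0 if bits >> 15 else 1.0
    exp_field = (bits >> FRACTION_BITS) & 0x1F
    fraction = bits & 0x3FF
    if exp_field == 0:
        # Smallest exponent pattern: same exponent as exp_field == 1,
        # but the leading bit is taken to be 0 instead of 1.
        return sign * (fraction / 1024) * 2.0 ** (1 - BIAS)
    # Normal case: the implied leading bit is 1.
    return sign * (1 + fraction / 1024) * 2.0 ** (exp_field - BIAS)


print(decode_half(0x3C00))   # 1.0
print(decode_half(0x0000))   # 0.0
print(decode_half(0x0400) == 2.0 ** -14)   # True, smallest normal number
print(decode_half(0x0001) == 2.0 ** -24)   # True, smallest denormal
```

With this rule the all-zero pattern decodes to exactly 0.0, and the values between zero and the smallest normal number are filled in evenly rather than left as a gap.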

Which only goes to underscore that you probably do NOT want to get into developing hardware that deals with this.


First option: Use someone else's floating point coprocessor (if one exists that is usable with your processor)
Second option: Use someone else's IEEE-754 emulation libraries (or non-IEEE-754 if you have to).
Third option: Write your own floating point emulation library.
Fourth option: Develop your own floating point hardware.

For an MCU, your best bet is probably going to be Option #2.
 

Thread Starter

ThatComputerGuy

Joined Jun 13, 2018
8
BobTPH said:
A floating point arithmetic unit will be much more complex than your entire CPU. Software floating point is the usual solution for simple processors. For instance, pretty much all microcontrollers.

Bob

Can you maybe tell me more about this software floating point?
 