Addition of IEE754 half precision floating point numbers

heimdall01 · Sep 19, 2016

Hi everyone , i'm trying to understand how to add two numbers in IEE754 half precision format , i haven't found a good reference/tutorial on this , so i've tried to follow the 32 bit floating point addition procedure , but at some point i'm stucked , what i know is:

Half precision (16 bit) IEEE 754 floating point number bit configuration is:

1 bit for sign 5 bit for exponent bit and 10 bit for mantissa.
i have two numbers:

1110001000000001
0000001100001111

so i rewrited them as

1 11000 1000000001
0 00000 1100001111

i have to convert them in normalized binary scientific notation (i don't know if i'm right):
- Convert the exponent of the first number in decimal: 24
- Subtract bias (that is 15) so i have: 24-15 = 9 so the exponent is 9
so the mantissa is : 1.000000001 * 2^9

But i don't know how to continue , can someone explain me the correct process for solving this , or link me a good reference/title of a book for understanding this procedure? i'm searching on Google and trying to understand on different books , but i can't clarify this procedure to myself.

MrAl · Sep 19, 2016

Hello,

Way back maybe in the 1990's i made routines that handle numbers in a format similar to that.
What i remember is that to do addition the exponents had to be checked and if different one number had to be 'shifted' to match the other number. By shifted i mean you can either shift the number or just start adding using an offset. Some numbers wont add because for example 1e19 and 1e0 the 1e0 is too insignificant so we are left with just 1e19 again. If there is a carry the exponent gets adjusted.

atferrari · Sep 19, 2016

Not sure if precisely what you are looking for but Microchip published several libraries with code. Maybe 15 years ago +++.

At that time they seemed worried to explain things in detail and were well commented.

I know there was an error in the code for integer math, not floating ones.

heimdall01 · Sep 19, 2016

Ok , but if check the exponent of second number, it is : 0000 sp if i subract 15 from that it is -15 , so what i have to do for make the two exponent match? (i do not have to implement that in code, it''s an exercise on paper that i'm trying to understand ) so what i have to do to match their exponents? i know i have to take the smaller exponent: -15 and make it match with 9 , i have to shift the comma?

heimdall01 · Sep 19, 2016

atferrari said:
Not sure if precisely what you are looking for but Microchip published several libraries with code. Maybe 15 years ago +++.

At that time they seemed worried to explain things in detail and were well commented.

I know there was an error in the code for integer math, not floating ones.

i'm not looking for libraries , it is an exercise that i'm doing on paper , but i can't find any reference/books/tutorial that explains how to add two numbers in IEEE754 half precision format , sorry if my question is a bit confusing.

MrChips · Sep 19, 2016

In order to ADD two FP numbers the exponents must be the same.
To begin, we assume the FP numbers are normalized.
Next we make the exponents equal by increasing the lower of the two exponents,
i.e. every time we increase the exponent by 1, we divide the matissa by 2 (i.e. shift right by 1 bit), assuming the exponent is in radix-2.
Once the exponents are the same, we can go ahead and add the mantissa portions.

heimdall01 · Sep 19, 2016

MrChips said:
In order to ADD two FP numbers the exponents must be the same.
To begin, we assume the FP numbers are normalized.
Next we make the exponents equal by increasing the lower of the two exponents,
i.e. every time we increase the exponent by 1, we divide the matissa by 2 (i.e. shift right by 1 bit), assuming the exponent is in radix-2.
Once the exponents are the same, we can go ahead and add the mantissa portions.

So assuming that i have these two numbers:

1 11000 1000000001

0 00000 1100001111

if i convert the exponent of the first number in decimal , it is equal to 24 , and the second is 0 , so in order to make the exponent equal , in this case i have to shift the second number to the right of 23 positions?

MrAl · Sep 19, 2016

Hi,

You can start by reducing the problem to less binary digits. If you can get 8 bits to work then you can get 16 bits or 32 bits to work or whatever you want.

I did forget to mention one thing, and that is that the format i am used to always has a leading 1, with the exponent adjusted to make up for any shifting needed to get that 1 there. That means if you add a positive number to a negative number you might end up with 0001 for example (reduced format) and you will have to adjust the exponent to get that 1 into the MSB position. Since that has to be shifted left 3 times the exponent has to be -3, so we would have something like 1000 with an exponent of -3 however that is bring represented.
So before the addition you shift one number if needed then do the addition bit by bit, and after the addition you adjust if necessary.

MrAl · Sep 19, 2016

heimdall01 said:
So assuming that i have these two numbers:

1 11000 1000000001

0 00000 1100001111

if i convert the exponent of the first number in decimal , it is equal to 24 , and the second is 0 , so in order to make the exponent equal , in this case i have to shift the second number to the right of 23 positions?

As i said before, if you shift out of range of the other number within the storage space, you ignore the second number. That is because of finite register size truncation.

If you happen to shift just one binary digit outside of the storage space, then you have the option to round. Since rounding is another issue, you should probably look into that. You may want to round up because 0.1 binary represents 1/2, or you may want to round every other result, or you may want to round using a random process.
When i did mine i just rounded up but it depends on what you want to use this system for.

heimdall01 · Sep 19, 2016

MrAl said:
As i said before, if you shift out of range of the other number within the storage space, you ignore the second number. That is because of finite register size truncation.

If you happen to shift just one binary digit outside of the storage space, then you have the option to round. Since rounding is another issue, you should probably look into that. You may want to round up because 0.1 binary represents 1/2, or you may want to round every other result, or you may want to round using a random process.
When i did mine i just rounded up but it depends on what you want to use this system for.

First i want to apologize if i'm boring you, i'm just trying to understand applying this to an exercise , and thank you for your patience.
I just want to check if i understood:

- if i want to add two ieee 754 numbers i have to make exponents equal
so with the numbers of the question above i will do:

1 11000 1000000001
Decimal value of the exponent is 24 , and the number is negative

0 00000 1100001111
Decimal value of the exponent is 0? an this is a positive number

Now i need to make 0 equal to 24?
so i shift the number of 23 position and it becomes:

0.000000000000000000000000000001100001111? is that possible?

MrChips · Sep 19, 2016

Don't forget that the exponent is also signed (or maybe offset). You have to check the IEEE format.

MrAl · Sep 19, 2016

heimdall01 said:
First i want to apologize if i'm boring you, i'm just trying to understand applying this to an exercise , and thank you for your patience.
I just want to check if i understood:

- if i want to add two ieee 754 numbers i have to make exponents equal
so with the numbers of the question above i will do:

1 11000 1000000001
Decimal value of the exponent is 24 , and the number is negative

0 00000 1100001111
Decimal value of the exponent is 0? an this is a positive number

Now i need to make 0 equal to 24?
so i shift the number of 23 position and it becomes:

0.000000000000000000000000000001100001111? is that possible?

Hi,

Look at it in base 10 first.

If we have 1.2e2 and 1.0e3 what do we have to do to add them?
1.2e2 is 120
1.0e3 is 1000
If we tried to add them in the columns they appear, we'd get 2200 whcih cant be right.
We have to shift 120 to the right one digit so it becomes 0120, then we can add the columns:
0120
1000
1120 answer

Now if our storage register space allows 123000 and 100000 and we have to shift that 100000 to the right 6 times, we end up with 000000.1 and that takes too many digits, so if we truncate we get:
123000+000000=123000

and that is because our storage does not allow that extra digit that would have been 123000.1 because that ".1" takes too much space. We lose accuracy like that but that is the way finite register space works.

So in your example if you have only 10 bits to work with and you have to shift the other number over 23 places, you end up shifting it right out of the register and into no where's land, so you get zero, and zero plus anything else equals that anything else again with no change.

You should probably do some easier examples first. For example add:
00000 1000000000
00001 1000000000

In order to get the first number to have the same exponent, we have to shift it over 1 digit:
00001 0100000000

and now we can add:
00001 0100000000
00001 1000000000
00001 1100000000 answer

Another:
00000 1000000000
01001 1000000000

with shifted first number:
01001 0000000001
01001 1000000000
01001 1000000001 answer

and another:
00000 1000000000
01010 1000000000

with shift:
01010 0000000000 (1 at the end lost to truncation)
01010 1000000000
01010 1000000000 answer

With subtraction:
1 01000 10000000000
0 01001 10000000000

with shift:
1 01001 0100000000
0 01001 1000000000
0 01001 0100000000
and here we have a leading zero so we shift left once to get:
0 01000 1000000000 answer

This looks correct but check it over to make sure, but in any case that's the basic idea.

And also, you dont have to actually shift the number as we did above, unless you really want to do it that way. the faster way is to just shift the starting bit selector. So with these two:
00000 1000000000 [1]
00001 1000000000 [1]

the "bit selectors" are both 1 above, and since we "shift" the top number we change it to:
00001 1000000000 [2]
00001 1000000000 [1]

now when your code accesses the top number it does not retrieve the leading '1' until the loop index gets to 2. There are other ways to do that though if you care to.

I am going by memory here, but these are the general ideas.

WBahn · Sep 19, 2016

First off, half-precision is not intended for actual computation, but rather for data storage. It was developed by NVidia and incorporated into the IEEE-754 standard much later. At least that is my understanding. Actual computations are done by converting it to single-precision first. But if you are just doing things on paper, then that it beside the point.

You don't need to worry about the exponent bias (which is 15). But you need to make the exponents match before you can add the significands. You do this by taking the smaller number and progressively multiplying and dividing it by two (to maintain the value). You multiply it by two by incrementing the exponent and you divide it by two by right shifting the significand. But one thing you have completely overlooked is the implied leading 1 in the normalized representation. That 1 becomes explicit as soon as you denormalize it. That means that you will likely end up adding/subtracting a normalized and a denormalized value, which is not hard, you just have to do the bookkeeping. The easiest way to deal with this is to denormalize both values up front, do the math, and then renormalize at the end. But you need to take care to preserve the precision when you do this, which requires at least one additional bit.

Note that, internally in Intel processors, floating point values are operated on in an 80-bit format regardless of the precision in which they are stored making the concern over the additional bit superfluous.

MrAl · Sep 19, 2016

WBahn said:
First off, half-precision is not intended for actual computation, but rather for data storage. It was developed by NVidia and incorporated into the IEEE-754 standard much later. At least that is my understanding. Actual computations are done by converting it to single-precision first. But if you are just doing things on paper, then that it beside the point.

You don't need to worry about the exponent bias (which is 15). But you need to make the exponents match before you can add the significands. You do this by taking the smaller number and progressively multiplying and dividing it by two (to maintain the value). You multiply it by two by incrementing the exponent and you divide it by two by right shifting the significand. But one thing you have completely overlooked is the implied leading 1 in the normalized representation. That 1 becomes explicit as soon as you denormalize it. That means that you will likely end up adding/subtracting a normalized and a denormalized value, which is not hard, you just have to do the bookkeeping. The easiest way to deal with this is to denormalize both values up front, do the math, and then renormalize at the end. But you need to take care to preserve the precision when you do this, which requires at least one additional bit.

Note that, internally in Intel processors, floating point values are operated on in an 80-bit format regardless of the precision in which they are stored making the concern over the additional bit superfluous.

Hi,

That one additional bit is the 'carry' bit.

Back when i did mine, i used a then unheard of number of bits, like 256 and something like 16 bits or more for the exponent alone. I didnt want to have to worry about precision for some calculations that involved lots of numbers like in a matrix. I think i still have the routine somewhere (it was in asm) but i've done other things since then that work the same that dont require asm coding and vector calls. It got a lot easier with the higher level languages. Those asm routines were cool though and i found that taking the square root in binary is actually much easier than in base 10. I also found i needed a base 10 number cruncher to calculate the display digits unless i stored a bunch of constants...cant remember that part very well now though. I was then working with a computer who's date could only go up to 1987 without a serious rewrite of the disk operating system, which i ended up doing later anyway

WBahn · Sep 19, 2016

MrAl said:
Hi,

That one additional bit is the 'carry' bit.

There are two contexts for the "additional bit" I was referring to, and neither of them is a carry bit. It might have been in YOUR floating point representation, but not in the IEEE-754 representation.

The IEEE-754 standard normalizes the significand so that there is exactly one non-zero digit to the left of the radix point (the same way that we normalize decimal numbers when use scientific notation). Since there is only one non-zero digit in binary, that means that the digit to the left of the radix point has to be a 1. So why store it? Hence the representation uses an implied 1 that is not stored, but is very, very, very much a part of the value being represented.

If the a W-bit representation using W_E bits for the exponent, then the representation can be partitioned into three unsigned integers, S:E:M, in which S is 1 bit, E is W_E bits, then the value represented is encoded as follows:

\(
value \; = \; (-1)^S \, \(1 \, + \, \frac{M}{2^{W_M}}\) \, 2^{\(E \, - \, bias\)}
\)

Where

\(
W_M \; = \; W \; - \; (W_E \; + \; 1)
\)
and

\(
bias \; = \; 2^{(W_E \; - \; 1)} \; - \; 1
\)

This is for the normalized representations, which are those patterns for which

\(
0 \; < \; E \; < \; 2^{W_E} \; - \; 1
\)

If E = 0, then the denormalized representation is used, which permits the exact representation of zero (something that the normalized representation is not capable of precisely because of the implied 1).

\(
value \; = \; (-1)^S \, \(\frac{M}{2^{W_M}}\) \, 2^{\(1 \, - \, bias\)}
\)

If E = (2^W_E) - 1, then the 'value' is one of the three possible flag values. If M = 0, then the value is either positive or negative infinity (as determined by the sign bit). If M <> 0, then the value is NaN (not a number).

MrAl · Sep 19, 2016

Hi again,

It is interesting though that if we wanted to do this with an integer addition, we'd have to use the carry flag and that's actually good because without that we'd need an N+1 register length which would really mean we could only add N-1 bits.

It's been a long time since i did these routines though, but they were interesting to do.

atferrari · Sep 19, 2016

heimdall01 said:
i'm not looking for libraries , it is an exercise that i'm doing on paper , but i can't find any reference/books/tutorial that explains how to add two numbers in IEEE754 half precision format , sorry if my question is a bit confusing.

From what I recall every application note had a good explanation preceding the code. And then the comments. I learnt things from there. Feel free to ignore them.

Think of them as kind of the reference you asked for.

WBahn · Sep 19, 2016

MrAl said:
Hi again,

It is interesting though that if we wanted to do this with an integer addition, we'd have to use the carry flag and that's actually good because without that we'd need an N+1 register length which would really mean we could only add N-1 bits.

It's been a long time since i did these routines though, but they were interesting to do.

How would you use the carry flag? You have two numbers and, for simplicity, let's say that they are both normalized because both had the same exponent. You have an implied leading 1 on both of them. When you add the portion that you DO have stored, it may or may not produce a carry INTO the portion that is NOT stored. But the portion that is NOT stored WILL produce a DIFFERENT carry. If one of them is not normalized (which will be the case if the exponents are not the same), then you may or may not produce a carry INTO the portion that is not stored and you may or may not produce a carry OUT of the portion that is not stored. So how are you going to use the carry flag?

MrAl · Sep 19, 2016

Hello again,

I am not sure what you mean, maybe you can provide a quick example or two using say just 4 bits to keep it simple.

The carry bit in many CPU's is made for just that, when there is a carry out of the most significant place. Similarly the borrow bit is used for subtractions.

We never add two numbers that are not of the same base, just as we dont for a^3+a^4, but we do for a^3+a^3.

In the integer mode, we would have a number like 1001 and 1000 for example, where the addition results in a carry out of the MSB so we get:
0 1001
0 1000
1 0001

where the left most bit is the carry bit. Makes sense doesnt it?
Most likely with the system you are talking about we have to include the leading 1 first.
I would also guess that if you had an implied lead 1 then that would be for both numbers, and that would always produce a leading 10 always for addition, of which the second digit may change later. I'll wait on this though.

WBahn · Sep 19, 2016

MrAl said:
Hello again,

I am not sure what you mean, maybe you can provide a quick example or two using say just 4 bits to keep it simple.

Let's use an 8-bit representation having three bits for the exponent.

I want to add the values represented by the patterns 00101010 (0x2A) and 01010101 (0x55).

The bias is 3, so the value represented by the first pattern is

S = 0b = 0
E = 010b = 2 => exponent is -1
M = 1010b => 10

V1 = (-1)^0 * (1 + 10/(2^4)) * (2^(-1)) = 26/32 = 13/16

The second pattern is

S = 0b = 0
E = 101b = 5 => exponent is 2
M = 0101b => 5

V1 = (-1)^0 * (1 + 5/(2^4)) * (2^(2)) = 21/4

Adding the two together yields

13/16 + 21/4 = 97/16

The patter that represents this is

97/16 = (97/16) / 4 * 4
97/16 = (97/64) * (2^2)
97/16 = (1 + 33/64) * (2^2)
97/16 = (1 + 8.28/(2^16)) * (2^2)

So now we can pick off the components:

S = 0 = 0b;
E = 2 + bias = 5 = 101b
M = 8.25 rounded to 8 = 1000b

Our resulting pattern is 01011000 (0x58)

To accomplish this in the machine, we had to denormalize the second pattern, which means that we must explicitly store whether the leading bit is a one or a zero.

V1 = 0|010|1010 => 0|010|11010 => 0|101| 00011
V2 = 0|101|0101 => 0|101|10101

NOW we can add the two FIVE bit significands to get 11000 leaving us with a denormalized result of

V = V1 + V2 = 0|101|11000

Since this have a FIVE bit significand with a leading one, we can normalize it by removing the msb of the significand

V = V1 + V2 = 0|101|1000 = 01011000 (0x58)

But what if we had use patterns 0x2A and 0x5F?

The after doing the addition we would have

V = V1 + V2 = 0|101|100010

Which has a SIX bit significand, so we need to increment the exponent by one to reduce it to a five bit significant.

V = V1 + V2 = 0|110|10001

And only THEN can we normalize the result to get

V = V1 + V2 = 0|110|0001 = 0x61

So if we map things so that we can use the carry bit, it has to serve the purpose of detecting overflow from the FIVE bit significand. This means that we can't also use it some how to hold the ADDITIONAL bit that comes from the implied 1 -- we need to explicitly represent it.

The carry bit in many CPU's is made for just that, when there is a carry out of the most significant place. Similarly the borrow bit is used for subtractions.

So? The significacand is not stored in the most significant part of the number -- that is where the exponent is stored (other than the sign bit as the msb itself).

We never add two numbers that are not of the same base, just as we dont for a^3+a^4, but we do for a^3+a^3.

In the integer mode, we would have a number like 1001 and 1000 for example, where the addition results in a carry out of the MSB so we get:
0 1001
0 1000
1 0001

where the left most bit is the carry bit. Makes sense doesnt it?

No. You are ignoring the exponent which is located between the sign bit and the significand. You are also ignoring the implied 1.

Most likely with the system you are talking about we have to include the leading 1 first.

Well, then that's an ADDITIONAL bit that is NOT the carry out, isn't it?

I would also guess that if you had an implied lead 1 then that would be for both numbers, and that would always produce a leading 10 always for addition, of which the second digit may change later. I'll wait on this though.

No, it will NOT always result in a leading 10 for addition because it could also result in a leading 11. Also, you only have an implied 1 for both numbers if the exponents START OUT the same. If they don't, then only ONE of the two numbers will be normalized so that one will have an implied 1 and other will have an implied 0, thus you could also end up with a leading 01.

Thread starter	Similar threads	Forum	Replies	Date
	Can a snubber addition work to prevent arcing when switching DC ?	General Electronics Chat	4	Jun 11, 2025
T	What can be added in addition to a 24VDC Smart GPS that loses power momentarily while changing directions.....	Microcontrollers	3	Dec 18, 2024
	Issue with 2-Digit BCD Adder-Subtractor Circuit Using 74LS83	General Electronics Chat	1	Nov 24, 2024
P	Add addition power supply to multi-voltage PS for cct board HDD.	Power Electronics	4	Jun 12, 2023
	Op Amp circuit for analog computing (addition) won't work	General Electronics Chat	12	Sep 23, 2022

Addition of IEE754 half precision floating point numbers

Join our Engineering Community! Sign-in with:

Addition of IEE754 half precision floating point numbers

heimdall01

MrAl

atferrari

heimdall01

heimdall01

MrChips

heimdall01

MrAl

MrAl

heimdall01

MrChips

MrAl

WBahn

MrAl

WBahn

MrAl

atferrari

WBahn

MrAl

WBahn

You May Also Like

New Connectors Focus on AI Memory, Rugged Power, and Sealed USB Designs

Cirrus Logic Expands Pro Audio Converter Lineup With New 32-Bit Devices

Broadcom Unveils Integrated Wi-Fi 8- and 50G PON Gateway-ICs for the AI Era

The Shift to Edge AI Requires Connector Choices to be Re-Examined