problem with representing real numbers in binary system

Thread Starter

Ghina Bayyat

Joined Mar 11, 2018
129
I have a question; can you please help me?
I know that real numbers are represented using floating point and that negative numbers are represented using two's complement.
I know there are other methods, but these are the most common, and I know how to represent real and negative numbers using each of them. My question is:
What if I have a negative real number, like -3.5 or -4.25?
How would I represent it? Is floating point used for these numbers, since its leftmost bit is the sign bit, or is there another way to represent them? Maybe a mix of the two methods, or something like that?
 

jpanhalt

Joined Jan 18, 2008
9,420
Source: http://cstl-csm.semo.edu/xzhang/Class Folder/CS280/Workbook_HTML/FLOATING_tut.htm
A 1 bit indicates a negative number, and a 0 bit indicates a positive number. Before a floating-point binary number can be stored correctly, its mantissa must be normalized. ... The exponent expresses the number of positions the decimal point was moved left (positive exponent) or moved right (negative exponent).
Of course, you can also use fixed point or integer math with both types of numbers.
 

Ian Rogers

Joined Dec 12, 2012
774
I tend to use fixed point. The trade-off is a fixed decimal point.

3.50 is just stored as 350; the position of the decimal point is known and inserted when needed. -3.50 is stored the same way, but with the sign bit used.
 

Thread Starter

Ghina Bayyat

Joined Mar 11, 2018
129
So you are saying that two's complement cannot be used to represent real numbers and is only for negative numbers?
 

jpanhalt

Joined Jan 18, 2008
9,420
Here's a 2's complement table:
View attachment 208342

Notice the range of real values. Try working with it to understand the concept rather than simply assuming what can and cannot be done.

EDIT: Here's a link to the original document: https://datasheets.maximintegrated.com/en/ds/MAX31856.pdf
 

Thread Starter

Ghina Bayyat

Joined Mar 11, 2018
129
Thanks, but the link is not opening.
Anyway, if two's complement can be used to represent real numbers, can you please explain how, with an example? Let's say -3.25. How can I write this number using two's complement?
 

BobTPH

Joined Jun 5, 2013
2,391
If you are using fixed point with two decimal places, -3.25 would be represented the same way as the integer -325.

Floating point formats typically use sign magnitude.
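
The difference between the two representations can be checked with a short sketch. It assumes IEEE 754 single precision for the floating-point case and a 16-bit register for the fixed-point case; the helper names `float32_fields` and `fixed_2dp_int16` are made up for illustration.

```python
import struct

def float32_fields(x):
    """Split x (rounded to IEEE 754 single precision) into (sign, exponent, fraction)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                 # 1 bit: 0 = positive, 1 = negative
    exponent = (bits >> 23) & 0xFF    # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF        # 23 stored fraction bits
    return sign, exponent, fraction

def fixed_2dp_int16(x):
    """Scale by 100 (two decimal places) and return the 16-bit two's complement pattern."""
    return round(x * 100) & 0xFFFF

# Sign-magnitude: +3.25 and -3.25 differ only in the sign bit.
print(float32_fields(3.25))    # (0, 128, 5242880)
print(float32_fields(-3.25))   # (1, 128, 5242880)

# Two's complement fixed point: -3.25 is stored exactly like the integer -325.
print(hex(fixed_2dp_int16(-3.25)))   # 0xfebb
print(hex(-325 & 0xFFFF))            # 0xfebb
```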

Bob
 

jpanhalt

Joined Jan 18, 2008
9,420
The link opens fine for me. Just search for and open the datasheet for the Maxim MAX31856 thermocouple amplifier.
 

MrChips

Joined Oct 2, 2009
21,126
Fixed point representation

We will use 4 bits for integer bits and 4 bits for fractional bits.

0 = 0000 0000
8 = 1000 0000
4 = 0100 0000
2 = 0010 0000
1 = 0001 0000
0.5 = 0000 1000
0.25 = 0000 0100
0.125 = 0000 0010
0.0625 = 0000 0001

The value of each bit going from left to right is as follows (for signed two's complement values, the leftmost bit counts as -8 instead of +8):
2 ^ 3 = 8
2 ^ 2 = 4
2 ^ 1 = 2
2 ^ 0 = 1
2 ^ -1 = 0.5
2 ^ -2 = 0.25
2 ^ -3 = 0.125
2 ^ -4 = 0.0625

-1 = 1111 0000
-0.5 = 1111 1000

3.25 = 0011 0100
-3.25 = 1100 1100

You may wish to think of the value as being scaled by 16.
For example,
0011 0100 = 52 = 3.25 x 16
1100 1100 = -52 = -3.25 x 16

If the MSB is 1, take the straight binary value and subtract 256, then divide by 16
For example,
1100 1100 = 204
Subtract 256
204 - 256 = -52
Divide by 16
-52 / 16 = -3.25
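
The procedure above can be sketched in a few lines. This is only an illustration of the 4.4 format described in this post; the function names `encode_q4_4`, `decode_q4_4` and `show` are made up here.

```python
FRAC_BITS = 4            # 4 fractional bits, so values are scaled by 16
SCALE = 1 << FRAC_BITS

def encode_q4_4(x):
    """Return the 8-bit two's complement pattern for x in signed 4.4 fixed point."""
    raw = round(x * SCALE)
    if not -128 <= raw <= 127:
        raise OverflowError("value does not fit in signed 4.4 format")
    return raw & 0xFF

def decode_q4_4(byte):
    """If the MSB is 1, subtract 256 from the straight binary value; then divide by 16."""
    raw = byte - 256 if byte & 0x80 else byte
    return raw / SCALE

def show(byte):
    """Format a byte as two groups of four bits."""
    return f"{byte >> 4:04b} {byte & 0xF:04b}"

print(show(encode_q4_4(3.25)))    # 0011 0100
print(show(encode_q4_4(-3.25)))   # 1100 1100
print(decode_q4_4(0b11001100))    # -3.25
```

With one bit pattern reserved for the sign, this format covers -8.0 to +7.9375 in steps of 0.0625.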
 

MrChips

Joined Oct 2, 2009
21,126
Quoted question to @MrChips and @Ian Rogers: Not to derail this thread, just two questions (I have never used fixed point, only integer math). When should I use it instead of integers? How do you express 1783.487 in fixed-point format? Thanks.

It comes down to the range of numbers (minimum value to maximum value) and the precision desired.
1783487 would require 21 bits (22 with a sign bit). You would be better off going to floating-point representation.

If you can accept 1 decimal place
1783.5 can be represented as 17835 / 10
With 16 bits you can have a range of -3276.8 to +3276.7 with 0.1 resolution.

As a general rule I do not use floating point.
I use scaled decimal as in the example above. All numbers are scaled by a suitable scaling factor, for example, x10, x40, x100, x200, x800, x1600
Numbers are displayed in decimal format and the decimal point is inserted in the correct place.
This is far more efficient than using floating point.

btw, fixed point arithmetic is still integer math. It uses only integer math.
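
A minimal sketch of the x10 scaled-decimal idea, assuming a signed 16-bit storage type. The names `to_scaled` and `format_scaled` are hypothetical helpers, not part of any library.

```python
SCALE = 10                           # one decimal place
INT16_MIN, INT16_MAX = -32768, 32767

def to_scaled(x):
    """Convert a value to its x10 integer form, checking that it fits in 16 bits."""
    raw = round(x * SCALE)
    if not INT16_MIN <= raw <= INT16_MAX:
        raise OverflowError("value does not fit in a 16-bit x10 scaled integer")
    return raw

def format_scaled(raw):
    """Insert the decimal point only when the number is displayed."""
    sign = "-" if raw < 0 else ""
    whole, tenths = divmod(abs(raw), SCALE)
    return f"{sign}{whole}.{tenths}"

print(to_scaled(1783.487))                        # 17835
print(format_scaled(to_scaled(1783.487)))         # 1783.5
print(format_scaled(INT16_MIN))                   # -3276.8
print(format_scaled(INT16_MAX))                   # 3276.7

# Addition is plain integer addition on the scaled values.
print(format_scaled(to_scaled(1.5) + to_scaled(2.3)))   # 3.8
```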
 

Ian Rogers

Joined Dec 12, 2012
774
Once upon a time, floating point was cumbersome and slow, and it took half your memory. All my products use trig, lots of trig. I couldn't fit all the code into a PIC18F, so we used fixed point. We only needed two decimal places, but calculated with three.

1783.487 is just 1783487; you only have to remember where the decimal point is. For example, 4 / 3 = 1 with integer math, but 400 / 3 = 133. Print the result with two decimal places and you get 1.33, which is correct.
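
The same arithmetic as a small sketch, restricted to non-negative values for simplicity. `scaled_divide` and `insert_point` are illustrative names, not an existing API.

```python
def scaled_divide(numerator, denominator, places=2):
    """Integer division that keeps `places` decimal digits by pre-scaling the numerator."""
    return (numerator * 10 ** places) // denominator

def insert_point(raw, places=2):
    """Place the decimal point `places` digits from the right (non-negative raw, places > 0)."""
    digits = str(raw).rjust(places + 1, "0")
    return f"{digits[:-places]}.{digits[-places:]}"

print(4 // 3)                              # 1
print(scaled_divide(4, 3))                 # 133
print(insert_point(scaled_divide(4, 3)))   # 1.33
print(insert_point(1783487, places=3))     # 1783.487
```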
 

MrChips

Joined Oct 2, 2009
21,126
As an example, I had to create an MCU product to display dewpoint in °C or °F calculated from temperature and relative humidity.
One decimal place was desired. In order to maintain accuracy I scaled all values by x40. Final results were divided by 4 (i.e. rounded and shifted right 2 bits). Decimal values were generated and the decimal point put in place.
For example 422 became 10.6°C.
This was all done on a simple 8-bit MCU.
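
The final conversion step might look like the sketch below. The exact rounding used in the product is not stated, so adding half the divisor before the shift is an assumption, and both function names are made up.

```python
def x40_to_tenths(raw_x40):
    """Divide an x40-scaled value by 4 with rounding: add 2, then shift right 2 bits."""
    return (raw_x40 + 2) >> 2     # result is scaled by x10

def format_tenths(raw_x10):
    """Put the decimal point in place for display."""
    sign = "-" if raw_x10 < 0 else ""
    whole, tenths = divmod(abs(raw_x10), 10)
    return f"{sign}{whole}.{tenths}"

print(x40_to_tenths(422))                  # 106
print(format_tenths(x40_to_tenths(422)))   # 10.6
```

Python's `>>` is an arithmetic shift, so negative temperatures are handled too, with exact halves rounded toward positive infinity.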
 

Papabravo

Joined Feb 24, 2006
13,728
Another way to look at the problem is to consider the use of rational approximations to real numbers. Each real number is represented as a pair of integers, call them N and D. For example, pi can be represented by (N, D) = (22, 7) = 22/7. A better approximation would be (N, D) = (355, 113) = 355/113.
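
As a quick way to find such (N, D) pairs, Python's standard `fractions` module can search for the closest fraction under a denominator limit. The wrapper name `best_rational` is just for illustration.

```python
from fractions import Fraction
import math

def best_rational(value, max_denominator):
    """Closest fraction N/D to value with D <= max_denominator."""
    return Fraction(value).limit_denominator(max_denominator)

for limit in (10, 1000):
    approx = best_rational(math.pi, limit)
    error = float(approx) - math.pi
    print(f"D <= {limit}: {approx}, error = {error:+.7f}")
```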
 

Ian Rogers

Joined Dec 12, 2012
774
One thing to remember: a 32-bit float and a 32-bit long have the same number of bit patterns, so they can hold the same amount of distinct numbers. The float, however, represents a larger range by trading off precision. It can resolve something as small as 0.000000001, but when you represent larger numbers, e.g. 4.345876 x 10^15, it gets a bit loose: 4,345,876,000,000,000. "Where's the precision??"
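
To put numbers on that trade-off, here is a rough sketch assuming IEEE 754 single precision (24 significant bits). `to_float32` and `float32_spacing` are hypothetical helpers.

```python
import math
import struct

def to_float32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def float32_spacing(x):
    """Gap between neighbouring float32 values near x (x nonzero, normal range)."""
    exponent = math.frexp(abs(to_float32(x)))[1] - 1   # floor(log2(|x|))
    return 2.0 ** (exponent - 23)                      # 23 stored fraction bits

# A 32-bit integer holds 2**24 + 1 exactly; a 32-bit float cannot.
print(to_float32(16777217.0))          # 16777216.0

# Near 4.345876e15, adjacent float32 values are hundreds of millions apart.
print(float32_spacing(4.345876e15))    # 268435456.0
print(float32_spacing(1.0))            # about 1.19e-07
```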

Horses for courses...
 

MrChips

Joined Oct 2, 2009
21,126
There are many tricks you can use to avoid floats and to speed up computation.
Let us use the value of pi as an example. As Papabravo points out, 22/7 is crude.
π = 3.1415926535897
22 / 7 = 3.14286, error = +0.00126
355 / 113 = 3.1415929, error = +0.0000003

To multiply by π using integer arithmetic we need one multiply and one divide. Now recognize that shifting bits is much more efficient than multiplication and division.

Hence we convert
355 / 113 = (355 x 256) / (113 x 256) = (355 x 256 / 113) / 256 = 804 / 256
804 / 256 = 3.140625, error = -0.001

In this example we notice that 804 is divisible by 4.
201 / 64 gives the same result = 3.140625

If we use 10-bit shifts
(355 x 1024 / 113) / 1024 = 3217 / 1024
3217 / 1024 = 3.14160, error = +0.00001

You can use this same technique for multiplying or dividing by any number.
(Watch out for overflows when working with large integers!)
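
A small sketch of the shift-based multiply described above. Python integers never overflow, so the overflow warning applies to a real MCU rather than to this demo; `pi_multiplier` and `times_pi` are made-up names.

```python
def pi_multiplier(shift):
    """Integer M such that M / 2**shift approximates 355/113 (rounded to nearest)."""
    return ((355 << shift) + 113 // 2) // 113

def times_pi(x, shift=10):
    """Multiply a non-negative integer by pi using one multiply and one right shift."""
    return (x * pi_multiplier(shift)) >> shift

print(pi_multiplier(8), pi_multiplier(10))   # 804 3217
print(times_pi(1000, shift=8))               # 3140
print(times_pi(1000, shift=10))              # 3141
print((1000 * 355) // 113)                   # 3141, using a true division
```

On a fixed-width machine the intermediate product `x * multiplier` needs `shift` more bits than the final result, which is where the overflow risk comes from.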

Edit: We will move all of these integer math tips to a new thread in Math & Science or maybe a blog.
 

BobaMosfet

Joined Jul 1, 2009
1,108
Binary works fine for real numbers so long as you use fixed-point notation. You can have as much or as little precision as you have bits, and during calculations you can move the binary point one way or the other to gain precision where you need it. You still use the high bit for the sign.

In a 16-bit register, 3.5 = 896 or 0x380, which is 11 1000 0000 in binary. Negated, it is 1111 1100 1000 0000 (0xFC80), or -896. This puts the binary point between the 8th and 9th bits (the very middle), so there are 8 bits on either side of it.
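
These numbers can be checked with a short sketch of the 8.8 format. The multiply helper is an addition for illustration: it shows one case where the binary point has to be moved back after a calculation. All names are hypothetical.

```python
FRAC_BITS = 8                       # binary point in the middle of a 16-bit register

def to_q8_8(x):
    """Scale by 256 to get the signed integer form."""
    return round(x * (1 << FRAC_BITS))

def bits16(raw):
    """16-bit two's complement bit pattern of a signed integer."""
    return raw & 0xFFFF

def mul_q8_8(a, b):
    """The raw product has 16 fractional bits; shift right 8 to return to 8.8."""
    return (a * b) >> FRAC_BITS

print(to_q8_8(3.5), hex(to_q8_8(3.5)))      # 896 0x380
print(hex(bits16(to_q8_8(-3.5))))           # 0xfc80
product = mul_q8_8(to_q8_8(3.5), to_q8_8(-3.5))
print(product, product / 256)               # -3136 -12.25
```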
 

andrewmm

Joined Feb 25, 2011
323
Your original statement is incorrect.
A real number does not have to be a floating-point number.

A real number just has digits before and after the point.

Fixed point is real by this definition.
 