# blog Floating Point Numbers

# What every embedded programmer should know about … Floating Point Numbers

###
*"Obvious is the most dangerous word in mathematics." *

E. T. Bell

E. T. Bell

“Floating Point Numbers” is one of those areas that seem obvious at first, but do not be fooled!

Perhaps when we learn how to do algebra and arithmetic we are never bound in our precision in quite the same way a computer is, and when we encounter computers we are so in awe by how “smart” they are that it becomes hard to fathom how limited floating point numbers on computers truly are.

The first time I encountered something like this it had me completely baffled:

void example1(void) { volatile double dval1; volatile bool result; dval1 = 0.6; dval1 += 0.1; if (dval1 == (0.6+0.1)) result = true; // Even though these are “obviously” the same else result = false; // This line is in fact what happens! }

In this example we can see first hand one of the effects of floating point rounding errors. We are adding 0.6 to 0.1 in two slightly different ways and the computer is telling us that the result is not the same!

But how is this possible? Surely, something must be broken here?

Rest assured, there is nothing broken. This is simply how floating point numbers work, and as a programmer it is very important to know and understand why! In this particular case the compiler is adding up 0.6 and 0.1 at compile time in a different way than the microcontroller does at runtime. 0.7 represented as a binary fraction does not round to the same value as 0.6 + 0.1.

This of course is merely scratching the surface, there is an impressive number of pitfalls when using floating point numbers in computer programs which result in incorrect logic, inaccurate calculations, broken control systems and unexpected division by 0 errors.

Let’s take a more detailed look at these problems, starting at the beginning.

# Binary Fractions

For decades now computers represent floating point numbers predominantly in a consistent way (IEEE 754-1985). [See this fascinating paper on it’s history by one of it’s architects, Prof W. Kahan]

In order to understand the IEEE 754 representation better, we first have to understand how fractional numbers are represented when using binary.

We are all familiar with the way binary numbers work. It should come as no surprise that when you go right of the decimal point it works just the same as decimal numbers, you keep on dividing by 2, so in binary 0.1 = 1/2 and 0.01 = 1/4 and so on.

Let’s look at an example. The representation of 10.5625 in binary is 1010.1001. ( online calculator if you want to play )

To get the answer we have to add up all of the digits, so we have

1 x8 + 1 x2 + 1 x 1/2 + 1 x 1/16

= 8 + 2 + 0.5 + 0.0625

= 10.5625

That seems easy enough but there are some peculiarities which are not obvious, one of these is that just like 1/3 is represented by a repeating decimal for decimal numbers, in binary entirely different numbers end up being repeating fractions, like 1/10, which is not a repeating decimal is in fact a repeating binaries fraction. The representation of the decimal number 0.1 in binary turns out to be 0.00011 0011 0011 0011 0011 0011 …

When you do math with a repeating fraction like 1/3 or 1/6 or even 1/11 you know to take care as the effects of rounding will impact your accuracy, but it feels counter intuitive to apply the same caution to numbers like 0.1 or 0.6 or 0.7 which we have been trusting for accuracy all of our lives.

# IEEE 754 Standard Representation

Next we need to take a closer look at the IEEE 754 standard so that we can understand what level of accuracy is reasonable to expect.

The basic representation is fairly easy to understand. (for more detail follow the links from the WikiPedia page )

- Standard (or Single Precision) floating point numbers are represented using 32 bits.
- The number is represented as a 24 bit binary number raised to an 8-bit exponent
- The 24 bit number is called the significand.
- The significand is represented as a binary fraction with an implied leading 1,meaning we really only use 23 bits to represent it as the MSB is always 1.
- 1 bit, the MSB, is always used as a sign bit
- Some bit patterns have special meanings, e.g. since there is always a leading 1 for the significand representing exactly 0 requires a special rule. There are a lot of special bit patterns with special meanings in the standard e.g. Infinity, NaN and subnormal numbers.

**NOTE:** As you can imagine all these special bit patterns adds quite a lot of overhead to the code dealing with floating point numbers. The Solaris manual has a very good description of the special cases

The decimal number 0.25 is for example represented using IEEE 754 floating point standard presentation as follows:

Firstly the sign bit is 0 as the number is positive.

Next up is the significand. We want to represent ¼ - in binary that would be 0.01, but the rule for is that we have to have a leading 1 in front of the decimal point, so we have to shift the decimal point by 2 positions to give us 1.0 * 2^-2 = 0.25. (This rule is meant to ensure that we will not have 2 different representations of the same number).

The exponent is constructed by taking the value and adding a bias of 127 to it, this has some nice side effects, we can represent anything from -127 to +128 without sacrificing another sign bit for the exponent, and we can also sort numbers by simply treating the entire number as a signed integer. In the example we need 2 to the power of (-2) so we take 127 + (-2) which gives 125.

In IEEE 754 that would look as follows:

If you would like to see this all done by hand you can look at this nice youtube video which explains how this works in more detail with an example - Video Link

**NOTE: **You can check these numbers yourself using the attached project and the MPLAB-X debugger to inspect them. If you right-click in the MPLAB-X watch window you can change the representation of the numbers to binary to see the individual bits)

Next we will look at what happens with a number like 0.1, which we now know has a repeating binary fraction representation. Remember that 0.1 in binary is 0.00011001100110011..., which is encoded as 1.100110011001100… x 2^(-4) (because the rule demands a leading 1 in the fraction).

Doing the same as above this leads to:

The only tricky bit there is the last digit which is a 1, as we are required to round off to the nearest 23 binary digits and since the next digit would be a 1 that requires us to round up so the last digit becomes a 1.

For more about binary fractional rounding see this post on the angular blog

This rounding process is one of the primary causes of inaccuracy when we work with floating point numbers! In this case 0.1 becomes exactly 0.100000001490116119384765625, which is very close to 0.1 but not quite exact. To make things worse the IDE will try to be helpful by rounding the displayed number to 8 decimal digits for you, so looking at the number in the watch window will show you 0.1 exactly!

# Precision

Before we look at how things can break, let’s get the concept of precision under the knee first. If we have 24 bits to represent a binary significand that equates to 7.22472 decimal digits of precision (since log(2^24) = 7.22472). This means that the number is by design only accurate to the 7th digit and sometimes it can be accurate to the 8th digit, but further digits should be expected to be only approximations.

The smallest absolute value normal 32-bit floating point number would be 1.0 x 2 ^ (-126) = 1.17549453E-38 represented as follows:

**NOTE:** Numbers with exponent 0 represent “Subnormal” numbers which can be even smaller but are treated slightly differently, this is out of scope for this discussion. (see https://en.wikipedia.org/wiki/Denormal_number for more info)

The absolute precision of floating point numbers vary wildly over their range as a consequence of the point floating with the 7 most significant digits. This means that the largest numbers, in the range of 10^38 have only 7 accurate digits, which means the smallest distance between 2 subsequent numbers will be 10^31, or 10 Quintillion for those who know big numbers. This means that a rounding error can mean we are off by 1 Quintillion in error!

To put that in context, if we used 32-bit floating point numbers to store bank balances they would have only 7 digits of accuracy, which means any balance of $100,000.00 or more will not be 100% correct and the bigger the balance gets the larger the error would be.

As the numbers get smaller the error floats down with the decimal point so for the smallest numbers the error shrinks to a miniscule 10^-45. Also these increments do not match the increments we would have for decimal digits, which means converting between the two is imprecise and involves choosing the closest representation instead of the exact number (known as rounding).

The following figure shows a decimal and binary representation of numbers on the same scale. Note that the binary intervals are divisible only by powers of 2 while the decimals are divided by 10’s, which means the lines to not align (follow the blue line down e.g.)

These number lines also show that there are lots of values in-between which cannot be represented and we have to round them to the nearest representable number.

# Implications and Errors

We know now that floating point numbers are imprecise and we have already seen one example of how rounding of these numbers can cause exact comparisons to fail unexpectedly. This will make up our first problem.

### Example PROBLEM 1 - Comparison of floating point numbers

The first typical problem we will look at is the example that we started with.

*PROBLEM 1* : Floating point numbers are imprecise so we cannot reliably check them for equality.*SOLUTION 1* : Never do exact comparisons for equality on floating point numbers, use inequalities like greater than or smaller than instead, and when doing this be sensitive to the fact that the imprecision and rounding can cause unexpected behavior at the boundaries. If you are really forced to look for a specific value consider looking for a small range e.g. check if Temp > 99.0 AND Temp < 101.0 if you are trying to find 100.

void example1(void) { // If you REALLY HAVE to compare these then do something like this. // The range you pick needs to be appropriate to cover the maximum // possible rounding error for the calculation. // If we added up only 1 pair the most this error can be is 7 decimal // digits of accuracy, so these values are appropriate if ((dval1 > 0.6999999) && (dval1 < 0.7000001)) { result = true; // This time we get the truth ! } else { result = false; // And this does NOT happen ... } }

The attached source code project contains more examples of this.

### Example PROBLEM 2 - Addition of numbers of different scale

The next problem happens often when we implement something like a PID controller where we have to make thousands of additions over a time period intended to move the control variable by a small increment every cycle.

Because of the limitation of 7.22 digits of precision we can run into trouble when our sampling rate is high and the value of Pi (integral term) requires more than 7.22 digits of precision.

Example 2 shows this case nicely.

void example2(void) { volatile double PI = 1E-5; volatile double controlValue = 300; controlValue += 1 * PI; controlValue += 1 * PI; controlValue += 1 * PI; }

This failure case happens quite easily when we have e.g. a PWM duty cycle we are controlling, and the current value is 300, but the sampling rate is very high, in this case 100kHz. We want to add a small increment to the * controlValue* at every sample, but we want these to add up to 1, so that the

*becomes 301, after 1 second of control, but since 1E-5 is 8 decimal places down from the value 300 the entire error we are trying to add gets rounded away. If you run this code in the simulator you will see that*

**controlValue***remains the same no matter how many times you add this small value to it.*

**controlValue**When you change the control value to 30 it works though, which means that you can test this as well as you like, and it will look like it is working but it can fail in production! When you change the controlValue to 30 you will see that the first 2 additions work, and then it starts losing precision and it gets worse as you attempt to approach 300

An example of where this happened for real and **people actually died** is the case of the Patriot Missile software bug - yes this really happened! (The full story here and more about it here )

*PROBLEM 2* : When doing addition or subtraction, rounding can cause systems to fail.*SOLUTION 2* : Always take care that 7 digits of precision will be enough to do your math, if not make sure you store the integral or similar part in a separate variable to compensate for the precision mismatch.

Imprecision and rounding of interim values can also cause all kinds of problems like giving different answers depending on the order the numbers are added together or underflows to 0 for interim values causing division by zero incorrectly when doing multiplication or division. There is a nice example of this when calculating quadratic roots in the attached source code project

### Example PROBLEM 3 - Getting consistent absolute precision across the range

Our third problem is very typical in financial systems, where no matter how large the balance gets, for the books to balance we must always be accurate to 1 cent.

The floating point number system ensures that we always get consistent precision as a ratio or percentage of the number we deal with, for single precision numbers that is 7.22 digits or roughly 0.000001 % of accuracy, but as the numbers get really big the absolute precision gets worse as numbers get bigger. If we want to be accurate to 1mm or 1cm or 1 cent this system will not work, but there are ways to get consistent absolute precision, and even avoid the speed penalty of dealing with floating point numbers entirely!

*PROBLEM 3* : Precision of the smallest increment has to remain constant for my application but floating point numbers break things due to lack of precision when my numbers get large.*SOLUTION 3*: In cases like this floating point numbers are just not appropriate to store the number. Use a scaled integer to represent your number. For money this is a simple as storing cents as integer instead of dollars. For distances this may mean storing millimeters as an integer instead of storing meters as a floating point, or for time storing milliseconds as an integer instead of storing seconds as a floating point.

# Conclusion

It is quite natural to think that how Floating Point numbers work is obvious, but as we have seen it is everything but. This blog covered only the basics of how floating point numbers are handled on computers, why they are imprecise and the 3 most common types of problems programmers have to deal with, including typical solutions to these problems.

There are a lot more things that can go wrong when you start getting numbers close to zero (subnormal), floating point underflow to 0, dealing with infinity etc. during calculations. A lot of these are nicely described by David Goldberg in this paper , but they are probably much less likely to affect you unless you are doing serious number crunching.

Please do read that paper if you ever have to build something using any floating point calculations where accuracy is critical like a Patriot Missile System!

# References

Why we needed the IEEE 754 standard by Prof W. Kahan

An online Converter between Binary and Decimal

The WikiPedia page on the standard

Great description in the Solaris Manual

Reproduction of the David Goldberg paper

Youtube video of how to convert a decimal floating point number to IEEE754 format

How rounding of binary fractions work

About subnormal or denormal numbers

Details on the Patriot Missile disaster - Link 1

Details on the Patriot Missile disaster - Link 2

Online IEEE754 calculator so you can check your compilers and hardware

# Source Code

- 1

## 0 Comments

## Recommended Comments

There are no comments to display.