Demystifying Floating Point Precision

gilbetron · on Nov 23, 2017

Years ago I created a solar system simulation for education called the Astronomicon. It was a fascinating education for me, not only about Astronomy and our solar system, but about floating point precision and general numerical stability issues. The problem really showed up with Pluto (back then a planet ;) ). Using 32 bits, a floating point seemed fine, except when you started up the simulation and saw pluto jumping in discrete steps in it's orbit - you could actually see the bits, basically!

Graphics cards (they weren't called GPUs yet!) were just transitioning to 64 bit, so our high end cards could handle 64 bit precision, and everything was fine. But we were stuck with having to support some 32 bit cards. Then I figured if I scaled everything by ... 10,000 ... I think ... it would work out ok. The exact details escape me - maybe the mantissa wasn't enough for the needed precision, so multiplying by 10,000 before sending it to the GPU was enough. Or maybe the opposite. Regardless, it meant there was an annoying constant we had to deal with :)

Nothing like diving in and dealing with an issue like that to do some learnin'!

madez · on Nov 23, 2017

Fixed precision floating point arithmetic is useful when you work over a large range of numbers and care about relative accuracy. That makes them also useful when the range is not known. However, the range is not irrelevant. Violating the allowed range can lead to bad accuracy, and shifting the numbers via a factor into the limitations can improve accuracy. From what you say, this happened in your case; the numbers got too small and you shifted them into the supported range, which is 2^(−126) to (1 − 2^(−24)) × 2^128 for IEEE 754-2008 32 bit numbers.

taejo · on Nov 23, 2017

Half of all floating-point values are between -1 and 1. Half of the remaining values have magnitude between 1 and 2. And so on... If we represent positions in kilometers, and we're talking about a solar system with radius on the order of a billion kilometers in diameter, and it becomes clear that we're wasting almost all of those floating point bits!

movablesed · on Nov 22, 2017

Related: A fun video that illustrates floating-point precision by way of Super Mario 64:

https://youtu.be/9hdFG2GcNuA

MichaelRenor · on Nov 22, 2017

A video game demonstration really helps understand how the coarseness of floats can introduce all kinds of strange bugs.

laythea · on Nov 22, 2017

Great link thanks. Best floating point essential concept video I have seen.

nerdponx · on Nov 22, 2017

I hate these text-on-picture-with-music videos. Even the robot-voice videos are better. I know it's using the Super Mario music, but still. It's like a blog post where you can't copy or paste anything, can't scroll back and forth, can't be cached or indexed, etc.

katastic · on Nov 22, 2017

Guy did it for free man. Give him a little slack. He's not even a programmer. He does Mario speed runs. He didn't post his video here, someone else did.

And I absolutely disagree with the "Robot voice would be better." That's horrific.

chillingeffect · on Nov 23, 2017

i don't think he's picking on the authors themselves but on the broader problem:

How to make decent auto-play presentations on the web.

most of us hate it when scrolling gets broken. many of us like to simply watch a lecture go by with interaction, e.g. for deeper contemplation. many of us would like to hear the lecture.

stuffing it into a video is tradeoff.

this is just a typical problem of a highly-mediated era: we have the core information that can be rendered or published across numerous formats: book, blog post, tweets, video, google slide presentation, etc., etc. How to deliver the information to all formats, to be everything to everyone?

movablesed · on Nov 22, 2017

He has some videos that are commentated by voice, but for some reason he finds doing the recordings very stressful, so most of his content is unfortunately on his 'uncommentated' channel. I agree it would be nice to have the voice-overs

nerdponx · on Nov 22, 2017

He could be writing blog posts and making videos to supplement. His writing isn't bad.

digi_owl · on Nov 22, 2017

I guess he finds it much easier to just dump it all onto Youtube for hosting and ads, rather than deal with setting up his own blog hosting for it.

zmonx · on Nov 22, 2017

Speaking of floats, I highly recommend John Gustafson's book, The End of Error: Unum Computing:

https://www.crcpress.com/The-End-of-Error-Unum-Computing/Gus...

It is an amazing compendium of the many ways in which floats fail to give any guarantees in practice. In some cases, due to the format itself, and also because many subtle issues are indicated via internal processor values and flags that are simply not accessible in high-level languages. In addition, the book presents Unums, an alternative number format with several extremely interesting properties that make computations much less error-prone.

Prof. Gustafson has since also presented several additional formats with much stronger guarantees than IEEE floats:

http://www.johngustafson.net/unums.html

wruza · on Nov 22, 2017

In short, how does u-bit solve multiplication problem when two u-intervals produce >1 interval? Like (1.01+ * 1.01+) = 1.02+ .. 1.04+, in 3-decimal precision

zmonx · on Nov 22, 2017

In such cases, you get a ubound that contains the exact value. When the inputs are exact, the result is exact too.

Here is a sample implementation in Julia:

https://github.com/JuliaComputing/Unums.jl

dragontamer · on Nov 22, 2017

I've stated some notes on Floats before, and I think this blogpost is pretty good overall.

The one thing that trips up a lot of people (that wasn't mentioned in this post) is that Floats are non-associative. Try the following in Python (or whatever language that uses Doubles):

>>> 1 + 1 + (2. * * 53)

> 9007199254740994.0

>>> 1 + (1 + (2. * * 53))

> 9007199254740992.0

claytonjy noted a point of confusion, the above statement has been edited for clarity.

The number 53 is chosen, because there's 53 bits in the double mantissa. Which is when "+1" will become too small and drop-off of the Double.

In the first case, 1+1 becomes 2 (which is large enough to be added to 2. * * 53). So you end up with the "correct" answer. But (2. * * 53 + 1) rounds down (because the 1 drops off), and the additional +1 afterwards also drops off due to rounding error.

Therefore, Floating Point math is commutative but NOT Associative. And that's really what makes things confusing for most people, in my experience.

x86 with the x87 coprocessor actually performs 80-bit floats to deal with this issue. So you can afford to lose a fair number of bits under older x86 platforms. But modern C++ / C code typically compiles to the faster SSE2 instructions, which only keep 64-bits per operation. This only demonstrates that rounding behavior is not only architecture-specific, but also compiler-option specific !!

kraghen · on Nov 22, 2017

It gets even worse: (1 + 2 ^ 53) - 2 ^ 53 evaluates to 0 while 1 + (2 ^ 53 - 2 ^ 53) evaluates to 1 (the correct result), which means that even the operation of adding three floating point numbers has unbounded relative error in the general case.

This is a great source of frustration in computational geometry, where this turns from a quantitative issue to a qualitative one, since if we get the sign of an expression wrong our data structures might no longer reflect the actual topology of the problem.

dragontamer · on Nov 22, 2017

> which means that even the operation of adding three floating point numbers has unbounded relative error in the general case.

Its not quite "in the general case".

In "any case with subtraction", there is unbounded relative error. And remember that addition with negative numbers can become subtraction.

Error-management is very important. But if you can GUARANTEE that your operations have the same sign, and that you're only using addition, multiplication, and division... then error is easier to track.

Its that subtraction (aka: cancellation error) that gets ya.

zmonx · on Nov 22, 2017

It is true that cancellation errors are often the source of wrong results.

However, computations involving IEEE floating point arithmetic can go horribly wrong even if there is no addition and no subtraction whatsoever, no divisions, and only a very small number of rounding errors. Here is one such example:

http://people.eecs.berkeley.edu/~wkahan/WrongR.pdf

Even professional users working in this domain will have a hard time to locate the cause of such problems involving IEEE floating point arithmetic. The format is extremely error-prone to use for casual users and experts alike.

dragontamer · on Nov 22, 2017

Ehhh... I think that paper kinda "tricks" you with point #3.

> Only 257 algebraic operations, so no hordes of rounding errors

Rounding errors in Floating-point math grows exponentially in the common case. So 257 operations is more than enough to wipe out 53-bits of precision. Its not "hordes" of rounding errors that cause issues. Its the exponential nature of rounding errors, exponentially increasing every step of the way.

zmonx · on Nov 22, 2017

Yes, that is true.

kraghen · on Nov 22, 2017

Well, the general case is that there might be a subtraction. In IEEE arithmetic the sum (or difference) of two numbers is guaranteed to be within 1 ulp, which is a very good bound on the relative error. The problem begins when you feed this slightly inaccurate result into some other computation that magnifies the error.

Now, one might have hoped to derive some bound on the relative error of sums with a small number of addends. My point is that already with three terms this is a lost cause.

Of course with extra assumptions on the nature of the input it's possible to state something stronger. In my example of computational geometry this is never the case since unstable arithmetic corresponds to geometric singularities and the real world is never in general position.

dragontamer · on Nov 23, 2017

> Now, one might have hoped to derive some bound on the relative error of sums with a small number of addends. My point is that already with three terms this is a lost cause.

Difficult? Yes. Lost cause? Nope. Although it really helps to define error in terms of your largest addend.

The proper solution to addition / subtraction problems in IEEE754 Floating point is to sort the numbers by magnitude. That ensures that the smallest numbers have the best chance of being "detected" in the floating point operation.

For example... 1 + 2^53 + 1 needs to be sorted into 1 + 1 + 2^53.

Or 1 + 2^53 + 1 - 2^53 needs to be sorted into 1 + 1 + 2^53 - 2^53 (which equals 2. Hurrah! We're exactly correct!).

If all the numbers are sorted from smallest to largest, then the worst case error is relative to 2^53, the largest number in this addend sequence.

I admit that I'm cheating and moving the goalposts a little bit... But that's what you gotta do when you work with Doubles.

----------

> Well, the general case is that there might be a subtraction. In IEEE arithmetic the sum (or difference) of two numbers is guaranteed to be within 1 ulp, which is a very good bound on the relative error. The problem begins when you feed this slightly inaccurate result into some other computation that magnifies the error.

Overall, this statement rings true. In practice, all that sorting and stuff only minimizes the cancellation error. Because the "relative error" as you define it is most important in multiplication and division problems, and as you noted... relative error is unbounded.

But if you keep positives and negatives separated, you can ensure that only one cancellation occurs in any sequence. IE:

2 * (25 * sum(array1) - sum(array2))

Normally, this has a cancellation error, which is then multiplied with the 2 (which makes the cancellation error grow). You distribute the multiply and split the arrays as follows:

50 * (sum(array1_positives) - sum(array1_negatives)) - 2 * (sum(array2_positives) - sum(array2_negatives))

Then you rearrange the sequence to minimize your cancellations.

50 * sum(array1_positives)+ 2 * sum(array2_negatives) - ( 2 * sum(array2_positives) + 50 * sum(array1_negatives))

This sequence guarantees one cancellation error at the very end, and it also ensures that there are no multiplications that occur after the cancellation error.

I wouldn't call this a "lost cause", but this sort of finagling is tedious, error prone, and certainly difficult. But necessary if you want to maximize accuracy in a long-running simulation.

kraghen · on Nov 23, 2017

The lost cause I am referring to is deriving a useful relative error bound on evaluating a+b+c. The implied conclusion is not that floating point arithmetic itself is a lost cause, merely that it's a minefield from step one.

Of course, if you move the goalposts and measure the error compared to the magnitude of the largest number it gets significantly more manageable! Such a situation is benign, relatively speaking.

In my primary domain of computational geometry we don't have this luxury as the sign must be correct. Even ordering by magnitude is insufficient then; for example evaluating 2^54 - 2^54 + 2 - 1 using this method gives 0 when it should have been 1 (and incidentally, in this case a simple left associative ordering would have gotten it right so it's not even guaranteed to improve matters -- not that anyone said otherwise, but it's yet another example that extreme caution is warranted).

Eventually you end up with ideas like http://www.cs.cmu.edu/~quake/robust.html. The effort that goes into getting even the sign of a sum of products correct in every case is absurd.

Unfortunately, most of the computational geometry code I see usually involves starting with the naive implementation and when things go awry various tricks like Kahan sums or ordering by magnitude are applied hoping that the problem goes away or becomes sufficiently obscure, but never is there an attempt to clarify the exact numerical requirements of the result. Even expensive industrial software is prone to this and will sometimes crash if you happen to present it with a slightly unusual piece of geometry.

zmonx · on Nov 23, 2017

Sorting and then summing is a commonly used technique. Still, there are also better methods such as Kahan summation:

https://en.wikipedia.org/wiki/Kahan_summation_algorithm

Using pairwise addition is a very good alternative with the added benefit that it can be easily parallelized.

Personally, I would gladly use a number format that does not require such special considerations for summing a set of numbers, and still yields good results. I am really looking forward to alternative representations.

claytonjy · on Nov 22, 2017

The second equation confused me a bit as a math-first, cs-second person; looks like a commutative transform, not an associative one. The trick is one must already trust these ops are commutative.

The second equation might be more clearly written as: >>> 1 + (1 + (2. * * 53)) to avoid hiding the associativity behind the assumption of commutativity.

dragontamer · on Nov 22, 2017

You're right. Sorry for the confusion. I'll edit it to make it more clear.

gct · on Nov 22, 2017

I believe the x87 uses 80 bits internally, but it only runs operations until it has sufficient precision to do the rounding (which is hard to do in general, see: http://perso.ens-lyon.fr/jean-michel.muller/Intro-to-TMD.htm)

wruza · on Nov 22, 2017

2. 53 should read as 2^53 here (HN markup eaten star characters)

dragontamer · on Nov 22, 2017

Thanks, I just edited it.

colejohnson66 · on Nov 22, 2017

Is there a way to force 80 bit floats if you need them?

dragontamer · on Nov 22, 2017

Well, it seems odd to me that anybody would want "exactly 80 bit floats".

1. In the case of "doubles", it seems like its more important that your code works the same on all platforms. So you can enable precise IEEE754 behavior, to ensure that your rounding errors and whatnot are the same from platform to platform.

You'll still need to somehow ensure that all of your operations occur in the same order however, IEEE754 only specifies when and how 64-bit numbers get rounded. So sorting your numbers, adding them up from smallest magnitude to largest magnitude is important still.

2. In the case that 64-bits is insufficient precision, you should be using something more precise but also consistent across different platforms.

In case #1: its usually more important that all code "makes the same rounding errors", as opposed to "some code rounds more poorly than other code".

In case #2: you'll want ALL your code to have better rounding behavior. And "Long Double" is still implementation specific. Its probably best to go to a pure software solution (ie: BigNums of some kind) if precision is important to you.

The case #3: some people don't care about the precision, and are willing to have lower precision as long as the results are "nearly correct". Video Game programmers are far more interested in speed for example, so the "fast inverse square root" barely even has 10-bits of precision for example, but the far greater speed wins out in most video game situations.

wruza · on Nov 22, 2017

There is a way to politely ask for it and be careful to what you get.

https://en.wikipedia.org/wiki/Long_double

gct · on Nov 22, 2017

If you want a guarantee, use something like gmp that'll do them in software.

Atrix256 · on Nov 22, 2017

good point. That's interesting!

slededit · on Nov 22, 2017

I've always thought that floats should be accompanied with a precision value specifying the number of significant digits. Ideally updated by hardware in the same operation.

All floating point code has subtle bugs if you don't track error accumulation.

kraghen · on Nov 22, 2017

This sounds like unums, a proposed alternative to IEEE floats, where roughly speaking the significand can have variable size thus only boasting as much precision as the accuracy warrants.

slededit · on Nov 22, 2017

Does it solve the equality issue? An epsilon tracker would allow you to say it's equal within the margin of error for the computation.

If so that makes them much more interesting to me. Floating point bugs manifest themselves at non linear parts such as equality and comparison operators.

kraghen · on Nov 22, 2017

I haven't really seen the issue of equality dealt with explicitly, but it appears that you can extract the bounds of the interval implied by a given unum. Essentially, it seems like an alternative to interval arithmetic with potentially tighter bounds and faster computation.

zmonx · on Nov 22, 2017

Yes, I think these are fair statements.

John Gustafson's presentation also includes a few important points about interval arithmetic:

http://www.johngustafson.net/presentations/Right-SizingPreci...

zokier · on Nov 22, 2017

Wouldn't that be closer to interval arithmetic?

pklausler · on Nov 22, 2017

It's in a real scientific programming language, viz. Fortran, as SELECTED_REAL_KIND().

slededit · on Nov 22, 2017

That is the precision of the type, but it does not track accumulated error which quickly gets into the representable range of the type. You need to know both the type, the precision of the data stored in the type and track the operations on it.

madez · on Nov 23, 2017

Floating point numbers are suboptimal to represent timers. They use more bits than needed, aren't accurate on the bits they use, and require more effort for operations.

Unsigned integers are better to represent timers. If the accuracy is 1/30s as in the example, then with a 32 bit unsigned integer you could cover over 55 days with perfect numerical accuracy and minimal computational effort. With 32 bit floating point numbers you get only 6 days, while needing to worry about accuracy and higher computational demand. At 64 bits it is 649936019 years for unsigned integers vs 8925512 years for floating point. If your game doesn't depend on state that is longer ago than the time covered, then using unsigned integers allows unlimited runtime.

Don't blindly use floats.

pjc50 · on Nov 23, 2017

Floating point numbers are terrible at representing times and money quantities. Which is a shame given that Excel does it that way..

ska · on Nov 22, 2017

Worth a read if you haven't seen it: “What every computer scientist should know about floating point arithmetic”. This includes a better discussion of all the points in the blog post.

https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.h...

xaduha · on Nov 23, 2017

There's no reason not to use BigNum type equivalent in your program where computation intensive tasks are not expected, when language supports them well enough.

saagarjha · on Nov 23, 2017

Psst…your LaTeX formulas could be improved if you touched them up a bit. You can put text in \text{}, and use the actual commands for math functions such as floor and log by prefixing then with a \.

Atrix256 · on Nov 23, 2017

Thanks. Probably ought to do that (:

yodacola · on Nov 22, 2017

I try to avoid using floats when doing business calculations. Base 2 doesn't work well with our base 10 currency.

Real world example: 1.09 - 1.09375

Double: 0.00374999999999992

Decimal: 0.00375

wruza · on Nov 22, 2017

Base 2 doesn’t work well with military too:

http://www-users.math.umn.edu/~arnold/disasters/patriot.html

evincarofautumn · on Nov 23, 2017

Doing calculations with an exact number type such as [big] rationals, and then explicitly converting to floating point or rounding/truncating digits of precision as needed, is a pretty good middle ground for many applications. 109/100 − 109375/100000 = 109/100 − 35/32 = −3/800 = −3.75ᴇ−3.

lucio · on Nov 22, 2017

An easier example: type this in your browser console:

0.1 + 0.2 == 0.3

false!(?)

0xbear · on Nov 22, 2017

Key point that takes some time to sink in: there are only 2^32 32 bit floats in total. Meaning that you can easily enumerate them all.