Summing floats, more than one way to do it

So you summed some floats and have some unexpected results? Don’t be afraid, it’s normal.

When summing floats you have errors, not only because some numbers can’t be represented exactly as a float, but you get errors from differences in magnitude too.

Say you have a large float and add a small one to it, the small one will loose a lot of it’s precision in the process. You can see what can happen if you use fixed-width decimal numbers:

12345.6 +
    0.12345
------------
12345.72345

If you only have one addition, the error could be acceptable, but if you have to sum a lot of numbers the small ones could start to be a significant part of the sum, and that part can get lost in the process.

If you had:

12345.6 +
    0.05123
------------
12345.65123 +
    0.05123
------------
12345.65123

You just lost .1, even if you could have avoided that by reordering the operations:

    0.05123 +
    0.05123 
    ---------
    0.10246 +
12345.6
-------------
12345.70246

So what can you do?

Actually, you have multiple choices, depending on what you want to achieve:

  • You can sort the numbers (small to big) and sum them after sorting. This will group together numbers of similar magnitude and it will yield smaller errors. But this is not the best method, there are more accurate methods that will not need a sort.
  • You can use Kahan summation. This is more accurate than sorting and it should be faster too.
  • You can use use doubles. This is the best method, but what if you’re summing a vector of doubles in the first place?
  • If you’re feeling adventurous you can run the sum on the FPU, but not many places let you insert assembly in the codebase.

If you want to see some code, you can download this file: main.zip. That code works on GCC 4.5 with the -std=c++0x flag.

Here is the output of a run, summing 10M random floats:

Starting: /home/daniel/projects/floats/build/floats
2916.660400390625 <- Reverse sorted
3057.517578125 <- Normal sum
3192.88330078125 <- Sorted
3194.340576171875 <- Kahan summation
3194.340576171875 <- FPU
3194.340576171875 <- Double truncated
3194.34058601572951374691911041736602783203125 <- Double precision