4.8 — Floating point numbers

Integers are great for counting whole numbers, but sometimes we need to store very large (positive or negative) numbers, or numbers with a fractional component. A floating point type variable is a variable that can hold a number with a fractional component, such as 4320.0, -3.33, or 0.01226. The floating part of the name floating point refers to the fact that the decimal point can “float” -- that is, it can support a variable number of digits before and after the decimal point. Floating point data types are always signed (can hold positive and negative values).

Tip

When writing floating point numbers in your code, the decimal separator must be a decimal point. If you’re from a country that uses a decimal comma, you’ll need to get used to using a decimal point instead.

C++ floating point types

C++ has three fundamental floating point data types: a single-precision float, a double-precision double, and an extended-precision long double. As with integers, C++ does not define the actual size of these types.

Category C++ Type Typical Size
floating point float

4 bytes
double 8 bytes
long double 8, 12, or 16 bytes

On modern architectures, floating-point types are conventionally implemented using one of the floating-point formats defined in the IEEE 754 standard (see https://en.wikipedia.org/wiki/IEEE_754). As a result, float is almost always 4 bytes, and double is almost always 8 bytes.

On the other hand, long double is a strange type. On different platforms, its size can vary between 8 and 16 bytes, and it may or may not use an IEEE 754 compliant format. We recommend avoiding long double.

Tip

This tutorial series assumes your compiler is using an IEEE 754 compatible format for float and double.

You can see if your floating point types are IEEE 754 compatible with the following code:

#include <iostream>
#include <limits>

int main()
{
    std::cout << std::boolalpha; // print bool as true or false rather than 1 or 0
    std::cout << "float: " << std::numeric_limits<float>::is_iec559 << '\n';
    std::cout << "double: " << std::numeric_limits<double>::is_iec559 << '\n';
    std::cout << "long double: " << std::numeric_limits<long double>::is_iec559 << '\n';
}

For advanced readers

float is almost always implemented using the 4-byte IEEE 754 single-precision format.
double is almost always implemented using the 8-byte IEEE 754 double-precision format.

However, the format used to implement long double varies by platform. Common choices include:

  • 8-byte IEEE 754 double-precision format (same as double).
  • 80-bit (often padded to 12 or 16 bytes) x87 extended-precision format (compatible with IEEE 754).
  • 16-byte IEEE 754 quadruple-precision format.
  • 16-byte double-double format (not compatible with IEEE 754).

Floating point variables and literals

Here are some definitions of floating point variables:

float f;
double d;
long double ld;

When using floating point literals, always include at least one decimal place (even if the decimal is 0). This helps the compiler understand that the number is a floating point number and not an integer.

int a { 5 };      // 5 means integer
double b { 5.0 }; // 5.0 is a floating point literal (no suffix means double type by default)
float c { 5.0f }; // 5.0 is a floating point literal, f suffix means float type

int d { 0 };      // 0 is an integer
double e { 0.0 }; // 0.0 is a double

Note that by default, floating point literals default to type double. An f suffix is used to denote a literal of type float.

Best practice

Always make sure the type of your literals match the type of the variables they’re being assigned to or used to initialize. Otherwise an unnecessary conversion will result, possibly with a loss of precision.

Printing floating point numbers

Now consider this simple program:

#include <iostream>

int main()
{
	std::cout << 5.0 << '\n';
	std::cout << 6.7f << '\n';
	std::cout << 9876543.21 << '\n';

	return 0;
}

The results of this seemingly simple program may surprise you:

5
6.7
9.87654e+06

In the first case, std::cout printed 5, even though we typed in 5.0. By default, std::cout will not print the fractional part of a number if the fractional part is 0.

In the second case, the number prints as we expect.

In the third case, it printed the number in scientific notation (if you need a refresher on scientific notation, see lesson 4.7 -- Introduction to scientific notation).

Floating point range

Format Range Precision
IEEE 754 single-precision (4 bytes) ±1.18 x 10-38 to ±3.4 x 1038 and 0.0 6-9 significant digits, typically 7
IEEE 754 double-precision (8 bytes) ±2.23 x 10-308 to ±1.80 x 10308 and 0.0 15-18 significant digits, typically 16
x87 extended-precision (80 bits) ±3.36 x 10-4932 to ±1.18 x 104932 and 0.0 18-21 significant digits
IEEE 754 quadruple-precision (16 bytes) ±3.36 x 10-4932 to ±1.18 x 104932 and 0.0 33-36 significant digits

For advanced readers

The 80-bit x87 extended-precision floating point type is a bit of a historical anomaly. On modern processors, objects of this type are typically padded to 12 or 16 bytes (which is a more natural size for processors to handle). This means those objects have 80-bits of floating point data, and the remaining memory is filler.

It may seem a little odd that the 80-bit floating point type has the same range as the 16-byte floating point type. This is because they have the same number of bits dedicated to the exponent -- however, the 16-byte number can store more significant digits.

Floating point precision

Consider the fraction 1/3. The decimal representation of this number is 0.33333333333333… with 3’s going out to infinity. If you were writing this number on a piece of paper, your arm would get tired at some point, and you’d eventually stop writing. And the number you were left with would be close to 0.3333333333…. (with 3’s going out to infinity) but not exactly.

On a computer, an infinite precision number would require infinite memory to store, and we typically only have 4 or 8 bytes per value. This limited memory means floating point numbers can only store a certain number of significant digits -- any additional significant digits are either lost or represented imprecisely. The number that is actually stored may be close to the desired number, but not exact. We’ll show an example of this in the next section.

The precision of a floating point type defines how many significant digits it can represent without information loss.

The number of digits of precision a floating point type has depends on both the size (floats have less precision than doubles) and the particular value being stored (some values can be represented more precisely than others).

For example, a float has 6 to 9 digits of precision. This means that a float can exactly represent any number with up to 6 significant digits. A number with 7 to 9 significant digits may or may not be represented exactly depending on the specific value. And a number with more than 9 digits of precision will definitely not be represented exactly.

Double values have between 15 and 18 digits of precision, with most double values having at least 16 significant digits. Long double has a minimum precision of 15, 18, or 33 significant digits depending on how many bytes it occupies.

Key insight

A floating point type can only precisely represent a certain number of significant digits. Using a value with more significant digits than the minimum may result in the value being stored inexactly.

Outputting floating point values

When outputting floating point numbers, std::cout has a default precision of 6 -- that is, it assumes all floating point variables are only significant to 6 digits (the minimum precision of a float), and hence it will truncate anything after that.

The following program shows std::cout truncating to 6 digits:

#include <iostream>

int main()
{
    std::cout << 9.87654321f << '\n';
    std::cout << 987.654321f << '\n';
    std::cout << 987654.321f << '\n';
    std::cout << 9876543.21f << '\n';
    std::cout << 0.0000987654321f << '\n';

    return 0;
}

This program outputs:

9.87654
987.654
987654
9.87654e+006
9.87654e-005

Note that each of these only have 6 significant digits.

Also note that std::cout will switch to outputting numbers in scientific notation in some cases. Depending on the compiler, the exponent will typically be padded to a minimum number of digits. Fear not, 9.87654e+006 is the same as 9.87654e6, just with some padding 0’s. The minimum number of exponent digits displayed is compiler-specific (Visual Studio uses 3, some others use 2 as per the C99 standard).

We can override the default precision that std::cout shows by using an output manipulator function named std::setprecision(). Output manipulators alter how data is output, and are defined in the iomanip header.

#include <iomanip> // for output manipulator std::setprecision()
#include <iostream>

int main()
{
    std::cout << std::setprecision(17); // show 17 digits of precision
    std::cout << 3.33333333333333333333333333333333333333f <<'\n'; // f suffix means float
    std::cout << 3.33333333333333333333333333333333333333 << '\n'; // no suffix means double

    return 0;
}

Outputs:

3.3333332538604736
3.3333333333333335

Because we set the precision to 17 digits using std::setprecision(), each of the above numbers is printed with 17 digits. But, as you can see, the numbers certainly aren’t precise to 17 digits! And because floats are less precise than doubles, the float has more error.

Tip

Output manipulators (and input manipulators) are sticky -- meaning if you set them, they will remain set.

The one exception is std::setw. Some IO operations reset std::setw, so std::setw should be used every time it is needed.

Precision issues don’t just impact fractional numbers, they impact any number with too many significant digits. Let’s consider a big number:

#include <iomanip> // for std::setprecision()
#include <iostream>

int main()
{
    float f { 123456789.0f }; // f has 10 significant digits
    std::cout << std::setprecision(9); // to show 9 digits in f
    std::cout << f << '\n';

    return 0;
}

Output:

123456792

123456792 is greater than 123456789. The value 123456789.0 has 10 significant digits, but float values typically have 7 digits of precision (and the result of 123456792 is precise only to 7 significant digits). We lost some precision! When precision is lost because a number can’t be stored precisely, this is called a rounding error.

Consequently, one has to be careful when using floating point numbers that require more precision than the variables can hold.

Best practice

Favor double over float unless space is at a premium, as the lack of precision in a float will often lead to inaccuracies.

Rounding errors make floating point comparisons tricky

Floating point numbers are tricky to work with due to non-obvious differences between binary (how data is stored) and decimal (how we think) numbers. Consider the fraction 1/10. In decimal, this is easily represented as 0.1, and we are used to thinking of 0.1 as an easily representable number with 1 significant digit. However, in binary, decimal value 0.1 is represented by the infinite sequence: 0.00011001100110011… Because of this, when we assign 0.1 to a floating point number, we’ll run into precision problems.

You can see the effects of this in the following program:

#include <iomanip> // for std::setprecision()
#include <iostream>

int main()
{
    double d{0.1};
    std::cout << d << '\n'; // use default cout precision of 6
    std::cout << std::setprecision(17);
    std::cout << d << '\n';

    return 0;
}

This outputs:

0.1
0.10000000000000001

On the top line, std::cout prints 0.1, as we expect.

On the bottom line, where we have std::cout show us 17 digits of precision, we see that d is actually not quite 0.1! This is because the double had to truncate the approximation due to its limited memory. The result is a number that is precise to 16 significant digits (which type double guarantees), but the number is not exactly 0.1. Rounding errors may make a number either slightly smaller or slightly larger, depending on where the truncation happens.

Rounding errors can have unexpected consequences:

#include <iomanip> // for std::setprecision()
#include <iostream>

int main()
{
    std::cout << std::setprecision(17);

    double d1{ 1.0 };
    std::cout << d1 << '\n';
	
    double d2{ 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 }; // should equal 1.0
    std::cout << d2 << '\n';

    return 0;
}
1
0.99999999999999989

Although we might expect that d1 and d2 should be equal, we see that they are not. If we were to compare d1 and d2 in a program, the program would probably not perform as expected. Because floating point numbers tend to be inexact, comparing floating point numbers is generally problematic -- we discuss the subject more (and solutions) in lesson 6.7 -- Relational operators and floating point comparisons.

One last note on rounding errors: mathematical operations (such as addition and multiplication) tend to make rounding errors grow. So even though 0.1 has a rounding error in the 17th significant digit, when we add 0.1 ten times, the rounding error has crept into the 16th significant digit. Continued operations would cause this error to become increasingly significant.

Key insight

Rounding errors occur when a number can’t be stored precisely. This can happen even with simple numbers, like 0.1. Therefore, rounding errors can, and do, happen all the time. Rounding errors aren’t the exception -- they’re the norm. Never assume your floating point numbers are exact.

A corollary of this rule is: be wary of using floating point numbers for financial or currency data.

Related content

For more insight into how floating point numbers are stored in binary, check out the float.exposed tool.
To learn more about floating point numbers and rounding errors, floating-point-gui.de and fabiensanglard.net have approachable guides on the topic.

NaN and Inf

IEEE 754 compatible formats additionally support some special values:

  • Inf, which represents infinity. Inf is signed, and can be positive (+Inf) or negative (-Inf).
  • NaN, which stands for “Not a Number”. There are several different kinds of NaN (which we won’t discuss here).
  • Signed zero, meaning there are separate representations for “positive zero” (+0.0) and “negative zero” (-0.0).

Formats that are not compatible with IEEE 754 may not support some (or any) of these values. In such cases, code that uses or generates these special values will produce implementation-defined behavior.

Here’s a program showing all three:

#include <iostream>

int main()
{
    double zero { 0.0 };
    
    double posinf { 5.0 / zero }; // positive infinity
    std::cout << posinf << '\n';

    double neginf { -5.0 / zero }; // negative infinity
    std::cout << neginf << '\n';

    double z1 { 0.0 / posinf }; // positive zero
    std::cout << z1 << '\n';

    double z2 { -0.0 / posinf }; // negative zero
    std::cout << z2 << '\n';

    double nan { zero / zero }; // not a number (mathematically invalid)
    std::cout << nan << '\n';

    return 0;
}

And the results using Clang:

inf
-inf
0
-0
nan

Note that the results of printing Inf and NaN are platform specific, so your results may vary (e.g. Visual Studio prints the last result as -nan(ind)).

Best practice

Avoid division by 0.0, even if your compiler supports it.

Conclusion

To summarize, the two things you should remember about floating point numbers:

  1. Floating point numbers are useful for storing very large or very small numbers, including those with fractional components.
  2. Floating point numbers often have small rounding errors, even when the number has fewer significant digits than the precision. Many times these go unnoticed because they are so small, and because the numbers are truncated for output. However, comparisons of floating point numbers may not give the expected results. Performing mathematical operations on these values will cause the rounding errors to grow larger.
guest
Your email address will not be displayed
Find a mistake? Leave a comment above!
Correction-related comments will be deleted after processing to help reduce clutter. Thanks for helping to make the site better for everyone!
Avatars from https://gravatar.com/ are connected to your provided email address.
Notify me about replies:  
650 Comments
Newest
Oldest Most Voted
Inline Feedbacks
View all comments