
Floating-point arithmetic

Computer approximation for real numbers

In computing, floating-point arithmetic (FP) is arithmetic that represents subsets of real numbers using an integer with a fixed precision, called the significand, scaled by an integer exponent of a fixed base. Numbers of this form are called floating-point numbers.^[1]^: 3^[2]^: 10 For example, 12.345 is a floating-point number in base ten with five digits of precision:

12.345=\!\underbrace {12345} _{\text{significand}}\!\times \!\underbrace {10} _{\text{base}}\!\!\!\!\!\!\!\overbrace {{}^{-3}} ^{\text{exponent}}

An early electromechanical programmable computer, the Z3, included floating-point arithmetic (replica on display at Deutsches Museum in Munich).

However, unlike 12.345, 12.3456 is not a floating-point number in base ten with five digits of precision—it needs six digits of precision; the nearest floating-point number with only five digits is 12.346. In practice, most floating-point systems use base two, though base ten (decimal floating point) is also common.

Floating-point arithmetic operations, such as addition and division, approximate the corresponding real number arithmetic operations by rounding any result that is not a floating-point number itself to a nearby floating-point number.^[1]^: 22^[2]^: 10 For example, in a floating-point arithmetic with five base-ten digits of precision, the sum 12.345 + 1.0001 = 13.3451 might be rounded to 13.345.

The term floating point refers to the fact that the number's radix point can "float" anywhere to the left, right, or between the significant digits of the number. This position is indicated by the exponent, so floating point can be considered a form of scientific notation.

A floating-point system can be used to represent, with a fixed number of digits, numbers of very different orders of magnitude, such as the number of meters between galaxies or between protons in an atom. For this reason, floating-point arithmetic is often used where very small and very large real numbers must be handled with fast processing times. A consequence of this dynamic range is that the representable numbers are not uniformly spaced; the difference between two consecutive representable numbers varies with their exponent.^[3]

Single-precision floating-point numbers on a number line: the green lines mark representable values.

Augmented version above showing both signs of representable values

Over the years, a variety of floating-point representations have been used in computers. In 1985, the IEEE 754 Standard for Floating-Point Arithmetic was established, and since the 1990s, the most commonly encountered representations are those defined by the IEEE.

The speed of floating-point operations, commonly measured in terms of FLOPS, is an important characteristic of a computer system, especially for applications that involve intensive mathematical calculations.

A floating-point unit (FPU, colloquially a math coprocessor) is a part of a computer system specially designed to carry out operations on floating-point numbers.

Overview

Floating-point numbers

A number representation specifies some way of encoding a number, usually as a string of digits.

There are several mechanisms by which strings of digits can represent numbers. In standard mathematical notation, the digit string can be of any length, and the location of the radix point is indicated by placing an explicit "point" character (dot or comma) there. If the radix point is not specified, then the string implicitly represents an integer and the unstated radix point would be off the right-hand end of the string, next to the least significant digit. In fixed-point systems, a position in the string is specified for the radix point. So a fixed-point scheme might use a string of 8 decimal digits with the decimal point in the middle, whereby "00012345" would represent 0001.2345.

In scientific notation, the given number is scaled by a power of 10, so that it lies within a specific range—typically between 1 and 10, with the radix point appearing immediately after the first digit. As a power of ten, the scaling factor is then indicated separately at the end of the number. For example, the orbital period of Jupiter's moon Io is 152,853.5047 seconds, a value that would be represented in standard-form scientific notation as 1.528535047×10⁵ seconds.

Floating-point representation is similar in concept to scientific notation. Logically, a floating-point number consists of:

A signed (meaning positive or negative) digit string of a given length in a given base (or radix). This digit string is referred to as the significand, mantissa, or coefficient.^{[nb 1]} The length of the significand determines the precision to which numbers can be represented. The radix point position is assumed always to be somewhere within the significand—often just after or just before the most significant digit, or to the right of the rightmost (least significant) digit. This article generally follows the convention that the radix point is set just after the most significant (leftmost) digit.
A signed integer exponent (also referred to as the characteristic, or scale),^{[nb 2]} which modifies the magnitude of the number.

To derive the value of the floating-point number, the significand is multiplied by the base raised to the power of the exponent, equivalent to shifting the radix point from its implied position by a number of places equal to the value of the exponent—to the right if the exponent is positive or to the left if the exponent is negative.

Using base-10 (the familiar decimal notation) as an example, the number 152,853.5047, which has ten decimal digits of precision, is represented as the significand 1,528,535,047 together with 5 as the exponent. To determine the actual value, a decimal point is placed after the first digit of the significand and the result is multiplied by 10⁵ to give 1.528535047×10⁵, or 152,853.5047. In storing such a number, the base (10) need not be stored, since it will be the same for the entire range of supported numbers, and can thus be inferred.

Symbolically, this final value is:

{\frac {s}{b^{\,p-1}}}\times b^{e},

where s is the significand (ignoring any implied decimal point), p is the precision (the number of digits in the significand), b is the base (in our example, this is the number ten), and e is the exponent.

Historically, several number bases have been used for representing floating-point numbers, with base two (binary) being the most common, followed by base ten (decimal floating point), and other less common varieties, such as base sixteen (hexadecimal floating point^[4]^[5]^{[nb 3]}), base eight (octal floating point^[1]^[5]^[6]^[4]^{[nb 4]}), base four (quaternary floating point^[7]^[5]^{[nb 5]}), base three (balanced ternary floating point^[1]) and even base 256^[5]^{[nb 6]} and base 65,536.^[8]^{[nb 7]}

A floating-point number is a rational number, because it can be represented as one integer divided by another; for example 1.45×10³ is (145/100)×1000 or 145,000/100. The base determines the fractions that can be represented; for instance, 1/5 cannot be represented exactly as a floating-point number using a binary base, but 1/5 can be represented exactly using a decimal base (0.2, or 2×10⁻¹). However, 1/3 cannot be represented exactly by either binary (0.010101...) or decimal (0.333...), but in base 3, it is trivial (0.1 or 1×3⁻¹) . The occasions on which infinite expansions occur depend on the base and its prime factors.
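The dependence on the prime factors of the base can be checked mechanically: a fraction in lowest terms has a finite expansion in base b exactly when every prime factor of its denominator also divides b. The following is a minimal sketch of that test; the function names are hypothetical.

```c
#include <assert.h>

/* Greatest common divisor by Euclid's algorithm. */
static unsigned long gcd_ul(unsigned long a, unsigned long b)
{
    while (b != 0) {
        unsigned long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Returns 1 if num/den has a finite expansion in the given base, else 0. */
int terminates_in_base(unsigned long num, unsigned long den, unsigned long base)
{
    assert(den > 0 && base >= 2);
    den /= gcd_ul(num, den);   /* reduce the fraction to lowest terms */
    /* Strip from den every prime factor it shares with the base. */
    for (unsigned long g = gcd_ul(den, base); g > 1; g = gcd_ul(den, base))
        den /= g;
    return den == 1;           /* nothing left: the expansion is finite */
}
```

With this sketch, 1/5 terminates in base 10 but not in base 2, and 1/3 terminates only in bases divisible by 3.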

The way in which the significand (including its sign) and exponent are stored in a computer is implementation-dependent. The common IEEE formats are described in detail later and elsewhere, but as an example, in the binary single-precision (32-bit) floating-point representation, $p=24$ , and so the significand is a string of 24 bits. For instance, the number π's first 33 bits are:

11001001\ 00001111\ 1101101{\underline {0}}\ 10100010\ 0.

In this binary expansion, let us denote the positions from 0 (leftmost bit, or most significant bit) to 32 (rightmost bit). The 24-bit significand will stop at position 23, shown as the underlined bit 0 above. The next bit, at position 24, is called the round bit or rounding bit. It is used to round the 33-bit approximation to the nearest 24-bit number (there are specific rules for halfway values, which is not the case here). This bit, which is 1 in this example, is added to the integer formed by the leftmost 24 bits, yielding:

11001001\ 00001111\ 1101101{\underline {1}}.

When this is stored in memory using the IEEE 754 encoding, this becomes the significand s. The significand is assumed to have a binary point to the right of the leftmost bit. So, the binary representation of π is calculated from left-to-right as follows:

{\begin{aligned}&\left(\sum _{n=0}^{p-1}{\text{bit}}_{n}\times 2^{-n}\right)\times 2^{e}\\={}&\left(1\times 2^{-0}+1\times 2^{-1}+0\times 2^{-2}+0\times 2^{-3}+1\times 2^{-4}+\cdots +1\times 2^{-23}\right)\times 2^{1}\\\approx {}&1.5707964\times 2\\\approx {}&3.1415928\end{aligned}}

where p is the precision (24 in this example), n is the position of the bit of the significand from the left (starting at 0 and finishing at 23 here) and e is the exponent (1 in this example).

It can be required that the most significant digit of the significand of a non-zero number be non-zero (except when the corresponding exponent would be smaller than the minimum one). This process is called normalization. For binary formats (which use only the digits 0 and 1), this non-zero digit is necessarily 1. Therefore, it does not need to be represented in memory, allowing the format to have one more bit of precision. This rule is variously called the leading bit convention, the implicit bit convention, the hidden bit convention,^[1] or the assumed bit convention.
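The π example and the hidden bit can be inspected directly on a machine whose `float` type is IEEE 754 binary32. The sketch below assumes the standard binary32 layout (1 sign bit, 8 exponent bits stored with a bias of 127, 23 stored significand bits), which is background knowledge rather than something stated in the text above; the struct and function names are illustrative.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Logical parts of a normal binary32 number. */
struct binary32_parts {
    unsigned sign;         /* 0 for positive, 1 for negative */
    int exponent;          /* unbiased exponent e */
    uint32_t significand;  /* 24 bits, with the hidden leading 1 restored */
};

/* Valid for normal numbers only (exponent field neither 0 nor 255). */
struct binary32_parts decode_binary32(float x)
{
    uint32_t bits;
    assert(sizeof bits == sizeof x);
    memcpy(&bits, &x, sizeof bits);   /* well-defined way to read the bit pattern */

    struct binary32_parts p;
    p.sign = bits >> 31;
    p.exponent = (int)((bits >> 23) & 0xFF) - 127;   /* remove the bias */
    p.significand = (bits & 0x7FFFFF) | 0x800000;    /* add the implicit bit */
    return p;
}
```

Decoding the `float` nearest to π should give exponent 1 and significand 0xC90FDB, which is the 24-bit string 11001001 00001111 11011011 derived above.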

Alternatives to floating-point numbers

The floating-point representation is by far the most common way of representing in computers an approximation to real numbers. However, there are alternatives:

Fixed-point representation uses integer hardware operations controlled by a software implementation of a specific convention about the location of the binary or decimal point, for example, 6 bits or digits from the right. The hardware to manipulate these representations is less costly than floating point, and it can be used to perform normal integer operations, too. Binary fixed point is usually used in special-purpose applications on embedded processors that can only do integer arithmetic, but decimal fixed point is common in commercial applications.
Logarithmic number systems (LNSs) represent a real number by the logarithm of its absolute value and a sign bit. The value distribution is similar to floating point, but the value-to-representation curve (i.e., the graph of the logarithm function) is smooth (except at 0). Conversely to floating-point arithmetic, in a logarithmic number system multiplication, division and exponentiation are simple to implement, but addition and subtraction are complex. The (symmetric) level-index arithmetic (LI and SLI) of Charles Clenshaw, Frank Olver and Peter Turner is a scheme based on a generalized logarithm representation.
Tapered floating-point representation, which does not appear to be used in practice.
Some simple rational numbers (e.g., 1/3 and 1/10) cannot be represented exactly in binary floating point, no matter what the precision is. Using a different radix allows one to represent some of them (e.g., 1/10 in decimal floating point), but the possibilities remain limited. Software packages that perform rational arithmetic represent numbers as fractions with integral numerator and denominator, and can therefore represent any rational number exactly. Such packages generally need to use "bignum" arithmetic for the individual integers.
Interval arithmetic allows one to represent numbers as intervals and obtain guaranteed bounds on results. It is generally based on other arithmetics, in particular floating point.
Computer algebra systems such as Mathematica, Maxima, and Maple can often handle irrational numbers like $\pi$ or ${\sqrt {3}}$ in a completely "formal" way (symbolic computation), without dealing with a specific encoding of the significand. Such a program can evaluate expressions like " $\sin(3\pi )$ " exactly, because it is programmed to process the underlying mathematics directly, instead of using approximate values for each intermediate calculation.

History

Leonardo Torres Quevedo, who in 1914 published an analysis of floating point based on the Analytical Engine

In 1914, the Spanish engineer Leonardo Torres Quevedo published Essays on Automatics,^[9] where he designed a special-purpose electromechanical calculator based on Charles Babbage's Analytical Engine and described a way to store floating-point numbers in a consistent manner. He stated that numbers will be stored in exponential format as $n\times 10^{m}$, and offered three rules by which consistent manipulation of floating-point numbers by machines could be implemented. For Torres, "n will always be the same number of digits (e.g. six), the first digit of n will be of order of tenths, the second of hundredths, etc, and one will write each quantity in the form: n; m." The format he proposed shows the need for a fixed-sized significand as is presently used for floating-point data, fixing the location of the decimal point in the significand so that each representation was unique, and how to format such numbers by specifying a syntax to be used that could be entered through a typewriter, as was the case of his Electromechanical Arithmometer in 1920.^[10]^[11]^[12]

Konrad Zuse, architect of the Z3 computer, which uses a 22-bit binary floating-point representation

In 1938, Konrad Zuse of Berlin completed the Z1, the first binary, programmable mechanical computer;^[13] it uses a 24-bit binary floating-point number representation with a 7-bit signed exponent, a 17-bit significand (including one implicit bit), and a sign bit.^[14] The more reliable relay-based Z3, completed in 1941, has representations for both positive and negative infinities; in particular, it implements defined operations with infinity, such as $^{1}/_{\infty }=0$ , and it stops on undefined operations, such as $0\times \infty$ .

Zuse also proposed, but did not complete, carefully rounded floating-point arithmetic that includes $\pm \infty$ and NaN representations, anticipating features of the IEEE Standard by four decades.^[15] In contrast, von Neumann recommended against floating-point numbers for the 1951 IAS machine, arguing that fixed-point arithmetic is preferable.^[15]

The first commercial computer with floating-point hardware was Zuse's Z4 computer, designed in 1942–1945. In 1946, Bell Laboratories introduced the Model V, which implemented decimal floating-point numbers.^[16]

The Pilot ACE has binary floating-point arithmetic, and it became operational in 1950 at the National Physical Laboratory, UK. Thirty-three were later sold commercially as the English Electric DEUCE. The arithmetic is actually implemented in software, but with a one megahertz clock rate, the speed of floating-point and fixed-point operations in this machine was initially faster than that of many competing computers.

The mass-produced IBM 704 followed in 1954; it introduced the use of a biased exponent. For many decades after that, floating-point hardware was typically an optional feature, and computers that had it were said to be "scientific computers", or to have "scientific computation" (SC) capability (see also Extensions for Scientific Computation (XSC)). It was not until the launch of the Intel i486 in 1989 that general-purpose personal computers had floating-point capability in hardware as a standard feature.

The UNIVAC 1100/2200 series, introduced in 1962, supported two floating-point representations:

Single precision: 36 bits, organized as a 1-bit sign, an 8-bit exponent, and a 27-bit significand.
Double precision: 72 bits, organized as a 1-bit sign, an 11-bit exponent, and a 60-bit significand.

The IBM 7094, also introduced in 1962, supported single-precision and double-precision representations, but with no relation to the UNIVAC's representations. Indeed, in 1964, IBM introduced hexadecimal floating-point representations in its System/360 mainframes; these same representations are still available for use in modern z/Architecture systems. In 1998, IBM implemented IEEE-compatible binary floating-point arithmetic in its mainframes; in 2005, IBM also added IEEE-compatible decimal floating-point arithmetic.

Initially, computers used many different representations for floating-point numbers. The lack of standardization at the mainframe level was an ongoing problem by the early 1970s for those writing and maintaining higher-level source code; these manufacturer floating-point standards differed in the word sizes, the representations, and the rounding behavior and general accuracy of operations. Floating-point compatibility across multiple computing systems was in desperate need of standardization by the early 1980s, leading to the creation of the IEEE 754 standard once the 32-bit (or 64-bit) word had become commonplace. This standard was significantly based on a proposal from Intel, which was designing the i8087 numerical coprocessor; Motorola, which was designing the 68000 around the same time, gave significant input as well.

In 1989, mathematician and computer scientist William Kahan was honored with the Turing Award for being the primary architect behind this proposal; he was aided by his student Jerome Coonen and a visiting professor, Harold Stone.^[17]

Among the x86 innovations are these:

A precisely specified floating-point representation at the bit-string level, so that all compliant computers interpret bit patterns the same way. This makes it possible to accurately and efficiently transfer floating-point numbers from one computer to another (after accounting for endianness).
A precisely specified behavior for the arithmetic operations: A result is required to be produced as if infinitely precise arithmetic were used to yield a value that is then rounded according to specific rules. This means that a compliant computer program would always produce the same result when given a particular input, thus mitigating the almost mystical reputation that floating-point computation had developed for its hitherto seemingly non-deterministic behavior.
The ability of exceptional conditions (overflow, divide by zero, etc.) to propagate through a computation in a benign manner and then be handled by the software in a controlled fashion.

Range of floating-point numbers

A floating-point number consists of two fixed-point components, the significand and the exponent, whose ranges depend exclusively on the number of bits or digits in their representation. The range of floating-point numbers depends linearly on the significand range and exponentially on the exponent range, so the exponent gives the number a far wider range than either component has alone.

On a typical computer system, a double-precision (64-bit) binary floating-point number has a coefficient of 53 bits (including 1 implied bit), an exponent of 11 bits, and 1 sign bit. Since 2¹⁰ = 1024, the complete range of the positive normal floating-point numbers in this format is from 2⁻¹⁰²² ≈ 2 × 10⁻³⁰⁸ to approximately 2¹⁰²⁴ ≈ 2 × 10³⁰⁸.

The number of normal floating-point numbers in a system (B, P, L, U) where

B is the base of the system,
P is the precision of the significand (in base B),
L is the smallest exponent of the system,
U is the largest exponent of the system,

is $2\left(B-1\right)\left(B^{P-1}\right)\left(U-L+1\right)$ .

There is a smallest positive normal floating-point number,

Underflow level = UFL = $B^{L}$,

which has a 1 as the leading digit and 0 for the remaining digits of the significand, and the smallest possible value for the exponent.

There is a largest floating-point number,

Overflow level = OFL = $\left(1-B^{-P}\right)\left(B^{U+1}\right)$,

which has B − 1 as the value for each digit of the significand and the largest possible value for the exponent.

In addition, there are representable values strictly between −UFL and UFL, namely positive and negative zeros, as well as subnormal numbers.

Dealing with exceptional cases

Floating-point computation in a computer can run into three kinds of problems:

An operation can be mathematically undefined, such as ∞/∞, or division by zero.
An operation can be legal in principle, but not supported by the specific format, for example, calculating the square root of −1 or the inverse sine of 2 (both of which result in complex numbers).
An operation can be legal in principle, but the result can be impossible to represent in the specified format, because the exponent is too large or too small to encode in the exponent field. Such an event is called an overflow (exponent too large), underflow (exponent too small) or denormalization (precision loss).

Prior to the IEEE standard, such conditions usually caused the program to terminate, or triggered some kind of trap that the programmer might be able to catch. How this worked was system-dependent, meaning that floating-point programs were not portable. (The term "exception" as used in IEEE 754 is a general term meaning an exceptional condition, which is not necessarily an error, and is a different usage from that typically defined in programming languages such as C++ or Java, in which an "exception" is an alternative flow of control, closer to what is termed a "trap" in IEEE 754 terminology.)

Here, the required default method of handling exceptions according to IEEE 754 is discussed (the IEEE 754 optional trapping and other "alternate exception handling" modes are not discussed). Arithmetic exceptions are (by default) required to be recorded in "sticky" status flag bits. That they are "sticky" means that they are not reset by the next (arithmetic) operation, but stay set until explicitly reset. The use of "sticky" flags thus allows for testing of exceptional conditions to be delayed until after a full floating-point expression or subroutine: without them exceptional conditions that could not be otherwise ignored would require explicit testing immediately after every floating-point operation. By default, an operation always returns a result according to specification without interrupting computation. For instance, 1/0 returns +∞, while also setting the divide-by-zero flag bit (this default of ∞ is designed to often return a finite result when used in subsequent operations and so be safely ignored).

The original IEEE 754 standard, however, failed to recommend operations to handle such sets of arithmetic exception flag bits. So while these were implemented in hardware, initially programming language implementations typically did not provide a means to access them (apart from assembler). Over time some programming language standards (e.g., C99/C11 and Fortran) have been updated to specify methods to access and change status flag bits. The 2008 version of the IEEE 754 standard now specifies a few operations for accessing and handling the arithmetic flag bits. The programming model is based on a single thread of execution and use of them by multiple threads has to be handled by a means outside of the standard (e.g. C11 specifies that the flags have thread-local storage).

IEEE 754 specifies five arithmetic exceptions that are to be recorded in the status flags ("sticky bits"):

inexact, set if the rounded (and returned) value is different from the mathematically exact result of the operation.
underflow, set if the rounded value is tiny (as specified in IEEE 754) and inexact (or, as per the 1985 version of IEEE 754, possibly only if it has denormalization loss), returning a subnormal value including the zeros.
overflow, set if the absolute value of the rounded value is too large to be represented. An infinity or maximal finite value is returned, depending on which rounding is used.
divide-by-zero, set if the result is infinite given finite operands, returning an infinity, either +∞ or −∞.
invalid, set if a real-valued result cannot be returned e.g. sqrt(−1) or 0/0, returning a quiet NaN.

Fig. 1: resistances in parallel, with total resistance $R_{\text{tot}}$

The default return value for each of the exceptions is designed to give the correct result in the majority of cases, so that the exceptions can be ignored in most programs. inexact returns a correctly rounded result, and underflow returns a value less than or equal to the smallest positive normal number in magnitude and can almost always be ignored.^[46] divide-by-zero returns infinity exactly, which will typically then divide a finite number and so give zero, or else will give an invalid exception subsequently if not, and so can also typically be ignored. For example, the effective resistance of n resistors in parallel (see fig. 1) is given by $R_{\text{tot}}=1/(1/R_{1}+1/R_{2}+\cdots +1/R_{n})$ . If a short-circuit develops with $R_{1}$ set to 0, $1/R_{1}$ will return +infinity which will give a final $R_{\text{tot}}$ of 0, as expected^[47] (see the continued fraction example of IEEE 754 design rationale for another example).
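The parallel-resistance example can be written without any special case for the short circuit, provided the platform follows IEEE 754 default exception handling (an assumption of this sketch; the function name is illustrative).

```c
#include <assert.h>
#include <stddef.h>

/* Total resistance of n resistors in parallel: 1 / (1/R1 + ... + 1/Rn).
   Relies on IEEE 754 defaults: 1/0 gives +infinity and 1/infinity gives 0,
   so a shorted resistor needs no explicit test. */
double parallel_resistance(const double *r, size_t n)
{
    assert(r != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += 1.0 / r[i];   /* becomes +infinity if any r[i] is 0 */
    return 1.0 / sum;
}
```

Two 2-ohm resistors give 1 ohm, and any set containing a 0-ohm resistor gives a total of 0.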

Overflow and invalid exceptions can typically not be ignored, but do not necessarily represent errors: for example, a root-finding routine, as part of its normal operation, may evaluate a passed-in function at values outside of its domain, returning NaN and an invalid exception flag to be ignored until finding a useful start point.^[46]

Accuracy problems

The fact that floating-point numbers cannot accurately represent all real numbers, and that floating-point operations cannot accurately represent true arithmetic operations, leads to many surprising situations. This is related to the finite precision with which computers generally represent numbers.

For example, the decimal numbers 0.1 and 0.01 cannot be represented exactly as binary floating-point numbers. In the IEEE 754 binary32 format with its 24-bit significand, the result of attempting to square the approximation to 0.1 is neither 0.01 nor the representable number closest to it. The decimal number 0.1 is represented in binary as e = −4; s = 110011001100110011001101, which is

0.100000001490116119384765625 exactly.

Squaring this number gives

0.010000000298023226097399174250313080847263336181640625 exactly.

Squaring it with rounding to the 24-bit precision gives

0.010000000707805156707763671875 exactly.

But the representable number closest to 0.01 is

0.009999999776482582092285156250 exactly.

Also, the non-representability of π (and π/2) means that an attempted computation of tan(π/2) will not yield a result of infinity, nor will it even overflow in the usual floating-point formats (assuming an accurate implementation of tan). It is simply not possible for standard floating-point hardware to attempt to compute tan(π/2), because π/2 cannot be represented exactly. This computation in C:

/* Enough digits to be sure we get the correct approximation. */
double pi = 3.1415926535897932384626433832795;
double z = tan(pi/2.0);

will give a result of 16331239353195370.0. In single precision (using the tanf function), the result will be −22877332.0.

By the same token, an attempted computation of sin(π) will not yield zero. The result will be (approximately) 0.1225×10⁻¹⁵ in double precision, or −0.8742×10⁻⁷ in single precision.^{[nb 10]}

While floating-point addition and multiplication are both commutative (a + b = b + a and a × b = b × a), they are not necessarily associative. That is, (a + b) + c is not necessarily equal to a + (b + c). Using 7-digit significand decimal arithmetic:

 a = 1234.567, b = 45.67834, c = 0.0004

 (a + b) + c:
     1234.567   (a)
   +   45.67834 (b)
   ____________
     1280.24534   rounds to   1280.245

    1280.245  (a + b)
   +   0.0004 (c)
   ____________
    1280.2454   rounds to   1280.245  ← (a + b) + c

 a + (b + c):
   45.67834 (b)
 +  0.0004  (c)
 ____________
   45.67874

   1234.567   (a)
 +   45.67874   (b + c)
 ____________
   1280.24574   rounds to   1280.246 ← a + (b + c)
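The same effect appears in binary arithmetic. The following sketch assumes IEEE 754 binary64 `double` with round-to-nearest; the function names are illustrative.

```c
/* Both functions add the same three doubles; only the grouping differs.
   volatile forces each intermediate sum to be rounded to double. */
double sum_left(double a, double b, double c)
{
    volatile double ab = a + b;   /* (a + b) rounded first */
    return ab + c;
}

double sum_right(double a, double b, double c)
{
    volatile double bc = b + c;   /* (b + c) rounded first */
    return a + bc;
}
```

With a = 0.1, b = 0.2 and c = 0.3, `sum_right` returns the double nearest to 0.6, while `sum_left` returns a value one unit in the last place larger.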

They are also not necessarily distributive. That is, (a + b) × c may not be the same as a × c + b × c:

 1234.567 × 3.333333 = 4115.223
 1.234567 × 3.333333 = 4.115223
                       4115.223 + 4.115223 = 4119.338
 but
 1234.567 + 1.234567 = 1235.802
                       1235.802 × 3.333333 = 4119.340

In addition to loss of significance, inability to represent numbers such as π and 0.1 exactly, and other slight inaccuracies, the following phenomena may occur:

Cancellation: subtraction of nearly equal operands may cause extreme loss of accuracy.^[48]^[45] When we subtract two almost equal numbers we set the most significant digits to zero, leaving ourselves with just the insignificant, and most erroneous, digits.^[1]^: 124 For example, when determining a derivative of a function the following formula is used: $Q(h)={\frac {f(a+h)-f(a)}{h}}.$ Intuitively one would want an h very close to zero; however, when using floating-point operations, the smallest h will not give the best approximation of the derivative. As h grows smaller, the difference between f(a + h) and f(a) grows smaller, cancelling out the most significant and least erroneous digits and making the most erroneous digits more important. As a result, the smallest possible value of h will give a more erroneous approximation of the derivative than a somewhat larger value. This is perhaps the most common and serious accuracy problem.
Conversions to integer are not intuitive: converting (63.0/9.0) to integer yields 7, but converting (0.63/0.09) may yield 6. This is because conversions generally truncate rather than round. Floor and ceiling functions may produce answers which are off by one from the intuitively expected value.
Limited exponent range: results might overflow yielding infinity, or underflow yielding a subnormal number or zero. In these cases precision will be lost.
Testing for safe division is problematic: Checking that the divisor is not zero does not guarantee that a division will not overflow.
Testing for equality is problematic. Two computational sequences that are mathematically equal may well produce different floating-point values.^[49]

Machine precision and backward error analysis

Machine precision is a quantity that characterizes the accuracy of a floating-point system, and is used in backward error analysis of floating-point algorithms. It is also known as unit roundoff or machine epsilon. Usually denoted Ε_mach, its value depends on the particular rounding being used.

With rounding to zero,

\mathrm {E} _{\text{mach}}=B^{1-P},\,

whereas with rounding to nearest,

\mathrm {E} _{\text{mach}}={\tfrac {1}{2}}B^{1-P},

where B is the base of the system and P is the precision of the significand (in base B).

This is important since it bounds the relative error in representing any non-zero real number x within the normalized range of a floating-point system:

\left|{\frac {\operatorname {fl} (x)-x}{x}}\right|\leq \mathrm {E} _{\text{mach}}.
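For the `double` type of a typical C implementation (assumed here to be IEEE 754 binary64 with round-to-nearest conversions), both values of Ε_mach and the bound can be checked as sketched below. The function names are illustrative, and `long double` serves only as a stand-in for the exact real number x.

```c
#include <float.h>
#include <math.h>

/* E_mach under rounding to zero: B^(1-P), with B = FLT_RADIX, P = DBL_MANT_DIG. */
double eps_round_toward_zero(void)
{
    return scalbn(1.0, 1 - DBL_MANT_DIG);   /* scalbn scales by FLT_RADIX^n */
}

/* E_mach under rounding to nearest: (1/2) B^(1-P). */
double eps_round_to_nearest(void)
{
    return 0.5 * eps_round_toward_zero();
}

/* Relative error |fl(x) - x| / |x| of storing x in a double. */
long double relative_representation_error(long double x)
{
    double stored = (double)x;   /* fl(x) */
    return fabsl(((long double)stored - x) / x);
}
```

The first function returns the same value as the standard macro `DBL_EPSILON`, and the relative error of storing 0.1 stays within the round-to-nearest bound.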

Backward error analysis, the theory of which was developed and popularized by James H. Wilkinson, can be used to establish that an algorithm implementing a numerical function is numerically stable.^[51] The basic approach is to show that although the calculated result, due to roundoff errors, will not be exactly correct, it is the exact solution to a nearby problem with slightly perturbed input data. If the perturbation required is small, on the order of the uncertainty in the input data, then the results are in some sense as accurate as the data "deserves". The algorithm is then defined as backward stable. Stability is a measure of the sensitivity to rounding errors of a given numerical procedure; by contrast, the condition number of a function for a given problem indicates the inherent sensitivity of the function to small perturbations in its input and is independent of the implementation used to solve the problem.^[52]

As a trivial example, consider a simple expression giving the inner product of the length-two vectors $x$ and $y$:

{\begin{aligned}\operatorname {fl} (x\cdot y)&=\operatorname {fl} {\big (}\operatorname {fl} (x_{1}\cdot y_{1})+\operatorname {fl} (x_{2}\cdot y_{2}){\big )},&&{\text{ where }}\operatorname {fl} (){\text{ indicates correctly rounded floating-point arithmetic}}\\&=\operatorname {fl} {\big (}(x_{1}\cdot y_{1})(1+\delta _{1})+(x_{2}\cdot y_{2})(1+\delta _{2}){\big )},&&{\text{ where }}|\delta _{n}|\leq \mathrm {E} _{\text{mach}},{\text{ from above}}\\&={\big (}(x_{1}\cdot y_{1})(1+\delta _{1})+(x_{2}\cdot y_{2})(1+\delta _{2}){\big )}(1+\delta _{3})\\&=(x_{1}\cdot y_{1})(1+\delta _{1})(1+\delta _{3})+(x_{2}\cdot y_{2})(1+\delta _{2})(1+\delta _{3}),\end{aligned}}

and so

\operatorname {fl} (x\cdot y)={\hat {x}}\cdot {\hat {y}},

where

{\begin{aligned}{\hat {x}}_{1}&=x_{1}(1+\delta _{1});&{\hat {x}}_{2}&=x_{2}(1+\delta _{2});\\{\hat {y}}_{1}&=y_{1}(1+\delta _{3});&{\hat {y}}_{2}&=y_{2}(1+\delta _{3}),\\\end{aligned}}

where

|\delta _{n}|\leq \mathrm {E} _{\text{mach}}

by definition. The computed result is therefore the exact inner product of slightly perturbed input data (with relative perturbations on the order of Ε_mach), and so the computation is backward stable. For more realistic examples in numerical linear algebra, see Higham 2002^[53] and other references below.
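The effect of the three roundings can be observed numerically. The following sketch evaluates the length-two inner product in single precision and compares it with a double-precision reference; it assumes `float` and `double` are IEEE 754 binary32 and binary64, and all function names are hypothetical. For inputs whose two products have the same sign, the backward perturbations above imply a relative forward error of at most about 2 Ε_mach, where Ε_mach = 2^-24 for binary32 under rounding to nearest.

```c
#include <math.h>

/* Inner product of two length-2 vectors in single precision.
   Each volatile store forces one rounding to float, matching the
   three fl() operations (two products, one sum) in the derivation. */
float dot2_float(const float x[2], const float y[2])
{
    volatile float p1 = x[0] * y[0];
    volatile float p2 = x[1] * y[1];
    volatile float s  = p1 + p2;
    return s;
}

/* Reference value in double precision. A product of two floats is
   exact in double, so at most the final addition rounds. */
double dot2_reference(const float x[2], const float y[2])
{
    return (double)x[0] * y[0] + (double)x[1] * y[1];
}

/* Relative forward error of the float result against the reference. */
double dot2_relative_error(const float x[2], const float y[2])
{
    double ref = dot2_reference(x, y);
    return fabs((double)dot2_float(x, y) - ref) / fabs(ref);
}
```

This measures forward error only; it does not by itself prove backward stability, which is what the derivation establishes.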

Minimizing the effect of accuracy problems

Although individual arithmetic operations of IEEE 754 are guaranteed accurate to within half a ULP, more complicated formulae can suffer from larger errors for a variety of reasons. The loss of accuracy can be substantial if a problem or its data are ill-conditioned, meaning that the correct result is hypersensitive to tiny perturbations in its data. However, even functions that are well-conditioned can suffer from large loss of accuracy if an algorithm numerically unstable for that data is used: apparently equivalent formulations of expressions in a programming language can differ markedly in their numerical stability. One approach to remove the risk of such loss of accuracy is the design and analysis of numerically stable algorithms, which is an aim of the branch of mathematics known as numerical analysis. Another approach that can protect against the risk of numerical instabilities is the computation of intermediate (scratch) values in an algorithm at a higher precision than the final result requires,^[54] which can remove, or reduce by orders of magnitude,^[55] such risk: IEEE 754 quadruple precision and extended precision are designed for this purpose when computing at double precision.^[56]^{[nb 11]}

For example, the following algorithm is a direct implementation of the function A(x) = (x−1) / (exp(x−1) − 1), which is well-conditioned at 1.0;^{[nb 12]} however, it can be shown to be numerically unstable and to lose up to half the significant digits carried by the arithmetic when computed near 1.0.^[57]

#include <math.h>

double A(double X)
{
        double Y, Z;  // [1]
        Y = X - 1.0;
        Z = exp(Y);
        if (Z != 1.0)
                Z = Y / (Z - 1.0); // [2]
        return Z;
}

If, however, intermediate computations are all performed in extended precision (e.g. by setting line [1] to C99 long double), then up to full precision in the final double result can be maintained.^{[nb 13]} Alternatively, a numerical analysis of the algorithm reveals that if the following non-obvious change to line [2] is made:

Z = log(Z) / (Z - 1.0);

then the algorithm becomes numerically stable and can compute to full double precision.

To maintain the properties of such carefully constructed numerically stable programs, careful handling by the compiler is required. Certain "optimizations" that compilers might make (for example, reordering operations) can work against the goals of well-behaved software. There is some controversy about the failings of compilers and language designs in this area: C99 is an example of a language where such optimizations are carefully specified to maintain numerical precision. See the external references at the bottom of this article.

A detailed treatment of the techniques for writing high-quality floating-point software is beyond the scope of this article; the reader is referred to Higham^[53] and Kahan,^[58] and to the other references at the bottom of this article. Kahan suggests several rules of thumb that can decrease the risk of numerical anomalies by orders of magnitude,^[58] in addition to, or in lieu of, a more careful numerical analysis. These include: as noted above, computing all expressions and intermediate results in the highest precision supported in hardware (a common rule of thumb is to carry twice the precision of the desired result, i.e. compute in double precision for a final single-precision result, or in double extended or quad precision for up to double-precision results^[59]); and rounding input data and results to only the precision required and supported by the input data (carrying excess precision in the final result beyond that required and supported by the input data can be misleading, increases storage cost and decreases speed, and the excess bits can affect convergence of numerical procedures:^[60] notably, the first form of the iterative example given below converges correctly when using this rule of thumb). Brief descriptions of several additional issues and techniques follow.

As decimal fractions can often not be exactly represented in binary floating-point, such arithmetic is at its best when it is simply being used to measure real-world quantities over a wide range of scales (such as the orbital period of a moon around Saturn or the mass of a proton), and at its worst when it is expected to model the interactions of quantities expressed as decimal strings that are expected to be exact.^[55]^[58] An example of the latter case is financial calculations. For this reason, financial software tends not to use a binary floating-point number representation.^[61] The "decimal" data type of the C# and Python programming languages, and the decimal formats of the IEEE 754-2008 standard, are designed to avoid the problems of binary floating-point representations when applied to human-entered exact decimal values, and make the arithmetic always behave as expected when numbers are printed in decimal.

Expectations from mathematics may not be realized in the field of floating-point computation. For example, it is known that $(x+y)(x-y)=x^{2}-y^{2}$ and that $\sin ^{2}{\theta }+\cos ^{2}{\theta }=1$; however, these facts cannot be relied on when the quantities involved are the result of floating-point computation.
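A concrete case of the first identity failing can be sketched in C, assuming IEEE 754 binary64 `double`. The function names and the chosen inputs are illustrative only. With x = 94906267 and y = 94906266, both squares exceed 2^53; y² is a multiple of 4 and is representable, but x² is odd and must be rounded, so the expanded form cannot return the exact answer 189812533.

```c
/* Factored form (x + y)(x - y). The volatile stores force each
   intermediate to be rounded to double. */
double diff_squares_factored(double x, double y)
{
    volatile double s = x + y;
    volatile double d = x - y;
    return s * d;
}

/* Expanded form x*x - y*y. The volatile stores also prevent the
   compiler from fusing a multiply and the subtraction into one
   operation, which would hide the rounding of x*x. */
double diff_squares_expanded(double x, double y)
{
    volatile double xx = x * x;
    volatile double yy = y * y;
    return xx - yy;
}
```

For these inputs the factored form is exact, because x + y and x − y are small integers, while the expanded form differs from the exact value by 1.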

The use of the equality test (if (x==y) ...) requires care when dealing with floating-point numbers. Even simple expressions like 0.6/0.2-3==0 will, on most computers, fail to be true^[62] (in IEEE 754 double precision, for example, 0.6/0.2 - 3 is approximately equal to -4.44089209850063e-16). Consequently, such tests are sometimes replaced with "fuzzy" comparisons (if (abs(x-y) < epsilon) ..., where epsilon is sufficiently small and tailored to the application, such as 1.0E−13). The wisdom of doing this varies greatly, and can require numerical analysis to bound epsilon.^[53] Values derived from the primary data representation and their comparisons should be performed in a wider, extended, precision to minimize the risk of such inconsistencies due to round-off errors.^[58] It is often better to organize the code in such a way that such tests are unnecessary. For example, in computational geometry, exact tests of whether a point lies off or on a line or plane defined by other points can be performed using adaptive precision or exact arithmetic methods.^[63]
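One possible shape for such a "fuzzy" comparison combines a relative and an absolute tolerance. The sketch below is an assumption-laden example and not a universally correct test: the name `nearly_equal` and its parameters are hypothetical, and suitable tolerances depend on the application, as noted above.

```c
#include <math.h>
#include <stdbool.h>

/* Returns true when a and b differ by at most abs_tol, or by at most
   rel_tol times the larger of their magnitudes. NaN never compares
   equal; infinities compare equal only to themselves. */
bool nearly_equal(double a, double b, double rel_tol, double abs_tol)
{
    if (a == b)                       /* exact match, incl. equal infinities */
        return true;
    if (isnan(a) || isnan(b) || isinf(a) || isinf(b))
        return false;

    double diff  = fabs(a - b);
    double scale = fabs(a) > fabs(b) ? fabs(a) : fabs(b);
    return diff <= abs_tol || diff <= rel_tol * scale;
}
```

With this helper, `nearly_equal(0.6 / 0.2, 3.0, 1e-13, 0.0)` accepts the pair from the example above. The absolute tolerance matters when comparing against zero, where a purely relative test can never succeed for a non-zero value.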

Small errors in floating-point arithmetic can grow when mathematical algorithms perform operations an enormous number of times. A few examples are matrix inversion, eigenvector computation, and differential equation solving. These algorithms must be very carefully designed, using numerical approaches such as iterative refinement, if they are to work well.^[64]

Summation of a vector of floating-point values is a basic algorithm in scientific computing, and so an awareness of when loss of significance can occur is essential. For example, if one is adding a very large number of numbers, the individual addends are very small compared with the sum. This can lead to loss of significance. A typical addition would then be something like

  3253.671
+    3.141276
-------------
  3256.812

The low 3 digits of the addends are effectively lost. Suppose, for example, that one needs to add many numbers, all approximately equal to 3. After 1000 of them have been added, the running sum is about 3000; the lost digits are not regained. The Kahan summation algorithm may be used to reduce the errors.^[53]
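A minimal sketch of compensated (Kahan) summation next to plain summation is given below. The function names are hypothetical. The sketch assumes the compiler preserves floating-point semantics: optimization modes that reassociate arithmetic, such as "fast math" options, can remove the compensation term.

```c
#include <stddef.h>

/* Plain left-to-right summation. */
double naive_sum(const double *x, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum;
}

/* Kahan summation: c tracks the low-order digits lost in each
   addition and feeds them back into the next addend. */
double kahan_sum(const double *x, size_t n)
{
    double sum = 0.0;
    double c = 0.0;                /* running compensation */
    for (size_t i = 0; i < n; i++) {
        double y = x[i] - c;       /* addend corrected by earlier loss */
        double t = sum + y;        /* low-order digits of y may be lost */
        c = (t - sum) - y;         /* recovers the lost part, negated */
        sum = t;
    }
    return sum;
}
```

Summing a million copies of 0.1 is a simple way to compare the two: the compensated result should stay much closer to 100000 than the plain one.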

Round-off error can affect the convergence and accuracy of iterative numerical procedures. As an example, Archimedes approximated π by calculating the perimeters of polygons inscribing and circumscribing a circle, starting with hexagons, and successively doubling the number of sides. As noted above, computations may be rearranged in a way that is mathematically equivalent but less prone to error (numerical analysis). Two forms of the recurrence formula for the circumscribed polygon are:^{[citation needed]}

${\textstyle t_{0}={\frac {1}{\sqrt {3}}}}$
First form: ${\textstyle t_{i+1}={\frac {{\sqrt {t_{i}^{2}+1}}-1}{t_{i}}}}$
Second form: ${\textstyle t_{i+1}={\frac {t_{i}}{{\sqrt {t_{i}^{2}+1}}+1}}}$
$\pi \sim 6\times 2^{i}\times t_{i}$ , converging as $i\rightarrow \infty$

Here is a computation using IEEE "double" (a significand with 53 bits of precision) arithmetic:

 i   6 × 2ⁱ × t_i, first form    6 × 2ⁱ × t_i, second form
---------------------------------------------------------
 0   3.4641016151377543863      3.4641016151377543863
 1   3.2153903091734710173      3.2153903091734723496
 2   3.1596599420974940120      3.1596599420975006733
 3   3.1460862151314012979      3.1460862151314352708
 4   3.1427145996453136334      3.1427145996453689225
 5   3.1418730499801259536      3.1418730499798241950
 6   3.1416627470548084133      3.1416627470568494473
 7   3.1416101765997805905      3.1416101766046906629
 8   3.1415970343230776862      3.1415970343215275928
 9   3.1415937488171150615      3.1415937487713536668
10   3.1415929278733740748      3.1415929273850979885
11   3.1415927256228504127      3.1415927220386148377
12   3.1415926717412858693      3.1415926707019992125
13   3.1415926189011456060      3.1415926578678454728
14   3.1415926717412858693      3.1415926546593073709
15   3.1415919358822321783      3.1415926538571730119
16   3.1415926717412858693      3.1415926536566394222
17   3.1415810075796233302      3.1415926536065061913
18   3.1415926717412858693      3.1415926535939728836
19   3.1414061547378810956      3.1415926535908393901
20   3.1405434924008406305      3.1415926535900560168
21   3.1400068646912273617      3.1415926535898608396
22   3.1349453756585929919      3.1415926535898122118
23   3.1400068646912273617      3.1415926535897995552
24   3.2245152435345525443      3.1415926535897968907
25                              3.1415926535897962246
26                              3.1415926535897962246
27                              3.1415926535897962246
28                              3.1415926535897962246
              The true value is 3.14159265358979323846264338327...

While the two forms of the recurrence formula are clearly mathematically equivalent,^{[nb 14]} the first subtracts 1 from a number extremely close to 1, leading to an increasingly problematic loss of significant digits. As the recurrence is applied repeatedly, the accuracy improves at first, but then it deteriorates. It never gets better than about 8 digits, even though 53-bit arithmetic should be capable of about 16 digits of precision. When the second form of the recurrence is used, the value converges to 15 digits of precision.


This article uses material from the Wikipedia article Floating-point_number, and is written by contributors. Text is available under a CC BY-SA 4.0 International License; additional terms may apply. Images, videos and audio are available under their respective licenses.

[NB_Significand-4] [N 1]
The significand of a floating-point number is also called mantissa by some authors—not to be confused with the mantissa of a logarithm. Somewhat vague, terms such as coefficient or argument are also used by some. The usage of the term fraction by some authors is potentially misleading as well. The term characteristic (as used e.g. by CDC) is ambiguous, as it was historically also used to specify some form of exponent of floating-point numbers.

[NB_Exponent-5] [N 2]
The exponent of a floating-point number is sometimes also referred to as scale. The term characteristic (for biased exponent, exponent bias, or excess n representation) is ambiguous, as it was historically also used to specify the significand of floating-point numbers.

[NB_9-8] [N 3]
Hexadecimal (base-16) floating-point arithmetic is used in the IBM System 360 (1964) and 370 (1970) as well as various newer IBM machines, in the RCA Spectra 70 (1964), the Siemens 4004 (1965), 7.700 (1974), 7.800, 7.500 (1977) series mainframes and successors, the Unidata 7.000 series mainframes, the Manchester MU5 (1972), the HEP (1982) computers, and in 360/370-compatible mainframe families made by Fujitsu, Amdahl and Hitachi. It is also used in the Illinois ILLIAC III (1966), Data General Eclipse S/200 (ca. 1974), Gould Powernode 9080 (1980s), Interdata 8/32 (1970s), the SEL Systems 85 and 86 as well as the SDS Sigma 5 (1967), 7 (1966) and Xerox Sigma 9 (1970).

[NB_8-10] [N 4]
Octal (base-8) floating-point arithmetic is used in the Ferranti Atlas (1962), Burroughs B5500 (1964), Burroughs B5700 (1971), Burroughs B6700 (1971) and Burroughs B7700 (1972) computers.

[NB_11-12] [N 5]
Quaternary (base-4) floating-point arithmetic is used in the Illinois ILLIAC II (1962) computer. It is also used in the Digital Field System DFS IV and V high-resolution site survey systems.

[NB_12-13] [N 6]
Base-256 floating-point arithmetic is used in the Rice Institute R1 computer (since 1958).

[NB_10-15] [N 7]
Base-65536 floating-point arithmetic is used in the MANIAC II (1956) computer.

[NB_1-41] [N 8]
Computer hardware doesn't necessarily compute the exact value; it simply has to produce the equivalent rounded result as though it had computed the infinitely precise result.

[NB_2-54] [N 9]
The enormous complexity of modern division algorithms once led to a famous error. An early version of the Intel Pentium chip was shipped with a division instruction that, on rare occasions, gave slightly incorrect results. Many computers had been shipped before the error was discovered. Until the defective computers were replaced, patched versions of compilers were developed that could avoid the failing cases. See Pentium FDIV bug.

[NB_3-57] [N 10]
But an attempted computation of cos(π) yields −1 exactly. Since the derivative is nearly zero near π, the effect of the inaccuracy in the argument is far smaller than the spacing of the floating-point numbers around −1, and the rounded result is exact.

[NB_4-67] [N 11]
William Kahan notes: "Except in extremely uncommon situations, extra-precise arithmetic generally attenuates risks due to roundoff at far less cost than the price of a competent error-analyst."

[NB_5-68] [N 12]
The Taylor expansion of this function demonstrates that it is well-conditioned near 1: A(x) = 1 − (x−1)/2 + (x−1)^2/12 − (x−1)^4/720 + (x−1)^6/30240 − (x−1)^8/1209600 + ... for |x−1| < π.

[NB_6-70] [N 13]
If long double is IEEE quad precision then full double precision is retained; if long double is IEEE double extended precision then additional, but not full precision is retained.

[NB_7-78] [N 14]
The equivalence of the two forms can be verified algebraically by noting that the denominator of the fraction in the second form is the conjugate of the numerator of the first. By multiplying the top and bottom of the first expression by this conjugate, one obtains the second expression.

[Muller_2010-1] [1]
Muller, Jean-Michel; Brisebarre, Nicolas; de Dinechin, Florent; Jeannerod, Claude-Pierre; Lefèvre, Vincent; Melquiond, Guillaume; Revol, Nathalie; Stehlé, Damien; Torres, Serge (2010). Handbook of Floating-Point Arithmetic (1st ed.). Birkhäuser. doi:10.1007/978-0-8176-4705-6. ISBN 978-0-8176-4704-9. LCCN 2009939668.

[sterbenz1974fpcomp-2] [2]
Sterbenz, Pat H. (1974). Floating-Point Computation. Englewood Cliffs, NJ, United States: Prentice-Hall. ISBN 0-13-322495-3.

[Smith_1997-3] [3]
Smith, Steven W. (1997). "Chapter 28, Fixed versus Floating Point". The Scientist and Engineer's Guide to Digital Signal Processing. California Technical Pub. p. 514. ISBN 978-0-9660176-3-2. Retrieved 2012-12-31.

[Zehendner_2008-6] [4]
Zehendner, Eberhard (Summer 2008). "Rechnerarithmetik: Fest- und Gleitkommasysteme" (PDF) (Lecture script) (in German). Friedrich-Schiller-Universität Jena. p. 2. Archived (PDF) from the original on 2018-08-07. Retrieved 2018-08-07. (NB. This reference incorrectly gives the MANIAC II's floating point base as 256, whereas it actually is 65536.)

[Beebe_2017-7] [5]
Beebe, Nelson H. F. (2017-08-22). "Chapter H. Historical floating-point architectures". The Mathematical-Function Computation Handbook - Programming Using the MathCW Portable Software Library (1st ed.). Salt Lake City, UT, USA: Springer International Publishing AG. p. 948. doi:10.1007/978-3-319-64110-2. ISBN 978-3-319-64109-6. LCCN 2017947446. S2CID 30244721.

[Savard_2018-9] [6]
Savard, John J. G. (2018) [2007], "The Decimal Floating-Point Standard", quadibloc, archived from the original on 2018-07-03, retrieved 2018-07-16

[Parkinson_2000-11] [7]
Parkinson, Roger (2000-12-07). "Chapter 2 - High resolution digital site survey systems - Chapter 2.1 - Digital field recording systems". High Resolution Site Surveys (1st ed.). CRC Press. p. 24. ISBN 978-0-20318604-6. Retrieved 2019-08-18. […] Systems such as the [Digital Field System] DFS IV and DFS V were quaternary floating-point systems and used gain steps of 12 dB. […] (256 pages)

[Lazarus_1956-14] [8]
Lazarus, Roger B. (1957-01-30) [1956-10-01]. "MANIAC II" (PDF). Los Alamos, NM, USA: Los Alamos Scientific Laboratory of the University of California. p. 14. LA-2083. Archived (PDF) from the original on 2018-08-07. Retrieved 2018-08-07. […] the Maniac's floating base, which is 2¹⁶ = 65,536. […] The Maniac's large base permits a considerable increase in the speed of floating point arithmetic. Although such a large base implies the possibility of as many as 15 lead zeros, the large word size of 48 bits guarantees adequate significance. […]

[16] [9]
Torres Quevedo, Leonardo. Automática: Complemento de la Teoría de las Máquinas, (pdf), pp. 575–583, Revista de Obras Públicas, 19 November 1914.

[17] [10]
Ronald T. Kneusel. Numbers and Computers, Springer, pp. 84–85, 2017. ISBN 978-3319505084

[FOOTNOTERandell19826,_11–13-18] [11]
Randell 1982, pp. 6, 11–13.

[19] [12]
Randell, Brian. Digital Computers, History of Origins, (pdf), p. 545, Digital Computers: Origins, Encyclopedia of Computer Science, January 2003.

[Rojas_1997-20] [13]
Rojas, Raúl (April–June 1997). "Konrad Zuse's Legacy: The Architecture of the Z1 and Z3" (PDF). IEEE Annals of the History of Computing. 19 (2): 5–16. doi:10.1109/85.586067. Archived (PDF) from the original on 2022-07-03. Retrieved 2022-07-03. (12 pages)

[Rojas_2014-21] [14]
Rojas, Raúl (2014-06-07). "The Z1: Architecture and Algorithms of Konrad Zuse's First Computer". arXiv:1406.1886 [cs.AR].

[Kahan_1997_JVNL-22] [15]
Kahan, William Morton (1997-07-15). "The Baleful Effect of Computer Languages and Benchmarks upon Applied Mathematics, Physics and Chemistry. John von Neumann Lecture" (PDF). p. 3. Archived (PDF) from the original on 2008-09-05.

[Randell_1982_2-23] [16]
Randell, Brian, ed. (1982) [1973]. The Origins of Digital Computers: Selected Papers (3rd ed.). Berlin; New York: Springer-Verlag. p. 244. ISBN 978-3-540-11319-5.

[Severance_1998-24] [17]
Severance, Charles (1998-02-20). "An Interview with the Old Man of Floating-Point".

[C99-25] [18]
ISO/IEC 9899:1999 - Programming languages - C. Iso.org. §F.2, note 307. "Extended" is IEC 60559's double-extended data format. Extended refers to both the common 80-bit and quadruple 128-bit IEC 60559 formats.

[MSVC-26] [19]
"IEEE Floating-Point Representation". 2021-08-03.

[GCC-27] [20]
Using the GNU Compiler Collection, i386 and x86-64 Options Archived 2015-01-16 at the Wayback Machine.

[float_128-28] [21]
"long double (GCC specific) and __float128". StackOverflow.

[ARM_2013_AArch64-29] [22]
"Procedure Call Standard for the ARM 64-bit Architecture (AArch64)" (PDF). 2013-05-22. Archived (PDF) from the original on 2013-07-31. Retrieved 2019-09-22.

[ARM_2013_Compiler-30] [23]
"ARM Compiler toolchain Compiler Reference, Version 5.03" (PDF). 2013. Section 6.3 Basic data types. Archived (PDF) from the original on 2015-06-27. Retrieved 2019-11-08.

[Kahan_2004-31] [24]
Kahan, William Morton (2004-11-20). "On the Cost of Floating-Point Computation Without Extra-Precise Arithmetic" (PDF). Archived (PDF) from the original on 2006-05-25. Retrieved 2012-02-19.

[OpenEXR-32] [25]
"openEXR". openEXR. Archived from the original on 2013-05-08. Retrieved 2012-04-25. Since the IEEE-754 floating-point specification does not define a 16-bit format, ILM created the "half" format. Half values have 1 sign bit, 5 exponent bits, and 10 mantissa bits.

[OpenEXR-half-33] [26]
"Technical Introduction to OpenEXR – The half Data Type". openEXR. Retrieved 2024-04-16.

[Babbage-34] [27]
"IEEE-754 Analysis".

[Borland_1994_MBF-35] [28]
Borland staff (1998-07-02) [1994-03-10]. "Converting between Microsoft Binary and IEEE formats". Technical Information Database (TI1431C.txt). Embarcadero USA / Inprise (originally: Borland). ID 1400. Archived from the original on 2019-02-20. Retrieved 2016-05-30. […] _fmsbintoieee(float *src4, float *dest4) […] MS Binary Format […] byte order => m3 | m2 | m1 | exponent […] m1 is most significant byte => sbbb|bbbb […] m3 is the least significant byte […] m = mantissa byte […] s = sign bit […] b = bit […] MBF is bias 128 and IEEE is bias 127. […] MBF places the decimal point before the assumed bit, while IEEE places the decimal point after the assumed bit. […] ieee_exp = msbin[3] - 2; /* actually, msbin[3]-1-128+127 */ […] _dmsbintoieee(double *src8, double *dest8) […] MS Binary Format […] byte order => m7 | m6 | m5 | m4 | m3 | m2 | m1 | exponent […] m1 is most significant byte => smmm|mmmm […] m7 is the least significant byte […] MBF is bias 128 and IEEE is bias 1023. […] MBF places the decimal point before the assumed bit, while IEEE places the decimal point after the assumed bit. […] ieee_exp = msbin[7] - 128 - 1 + 1023; […]

[Steil_2008_6502-36] [29]
Steil, Michael (2008-10-20). "Create your own Version of Microsoft BASIC for 6502". pagetable.com. Archived from the original on 2016-05-30. Retrieved 2016-05-30.

[Microsoft_2006_KB35826-37] [30]
"IEEE vs. Microsoft Binary Format; Rounding Issues (Complete)". Microsoft Support. Microsoft. 2006-11-21. Article ID KB35826, Q35826. Archived from the original on 2020-08-28. Retrieved 2010-02-24.

[Kharya_2020-38] [31]
Kharya, Paresh (2020-05-14). "TensorFloat-32 in the A100 GPU Accelerates AI Training, HPC up to 20x". Retrieved 2020-05-16.

[NVIDIA_Hopper-39] [32]
"NVIDIA Hopper Architecture In-Depth". 2022-03-22.

[Micikevicius_2022-40] [33]
Micikevicius, Paulius; Stosic, Dusan; Burgess, Neil; Cornea, Marius; Dubey, Pradeep; Grisenthwaite, Richard; Ha, Sangwon; Heinecke, Alexander; Judd, Patrick; Kamalu, John; Mellempudi, Naveen; Oberman, Stuart; Shoeybi, Mohammad; Siu, Michael; Wu, Hao (2022-09-12). "FP8 Formats for Deep Learning". arXiv:2209.05433 [cs.LG].

[Kahan_2006_Mindless-42] [34]
Kahan, William Morton (2006-01-11). "How Futile are Mindless Assessments of Roundoff in Floating-Point Computation?" (PDF). Archived (PDF) from the original on 2004-12-21.

[Gay_1990-43] [35]
Gay, David M. (1990). Correctly Rounded Binary-Decimal and Decimal-Binary Conversions (Technical report). NUMERICAL ANALYSIS MANUSCRIPT 90-10, AT&T BELL LABORATORIES. CiteSeerX 10.1.1.31.4049. (dtoa.c in netlab)

[Loitsch_2010-44] [36]
Loitsch, Florian (2010). "Printing floating-point numbers quickly and accurately with integers" (PDF). Proceedings of the 31st ACM SIGPLAN Conference on Programming Language Design and Implementation. PLDI '10: ACM SIGPLAN Conference on Programming Language Design and Implementation. pp. 233–243. doi:10.1145/1806596.1806623. ISBN 978-1-45030019-3. S2CID 910409. Archived (PDF) from the original on 2014-07-29.

[mazong-45] [37]
"Added Grisu3 algorithm support for double.ToString(). by mazong1123 · Pull Request #14646 · dotnet/coreclr". GitHub.

[Adams_2018-46] [38]
Adams, Ulf (2018-12-02). "Ryū: fast float-to-string conversion". ACM SIGPLAN Notices. 53 (4): 270–282. doi:10.1145/3296979.3192369. S2CID 218472153.

[Giulietti-47] [39]
Giulietti, Rafaello. "The Schubfach way to render doubles".

[abolz-48] [40]
"abolz/Drachennest". GitHub. 2022-11-10.

[double_conversion_2020-49] [41]
"google/double-conversion". GitHub. 2020-09-21.

[Lemire_2021-50] [42]
Lemire, Daniel (2021-03-22). "Number parsing at a gigabyte per second". Software: Practice and Experience. 51 (8): 1700–1727. arXiv:2101.11408. doi:10.1002/spe.2984. S2CID 231718830.

[Goldberg_1991-51] [43]
Goldberg, David (March 1991). "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (PDF). ACM Computing Surveys. 23 (1): 5–48. doi:10.1145/103162.103163. S2CID 222008826. Archived (PDF) from the original on 2006-07-20. Retrieved 2016-01-20.

[Patterson-Hennessy_2014-52] [44]
Patterson, David A.; Hennessy, John L. (2014). Computer Organization and Design, The Hardware/Software Interface. The Morgan Kaufmann series in computer architecture and design (5th ed.). Waltham, Massachusetts, USA: Elsevier. p. 793. ISBN 978-9-86605267-5.

[Sierra_1962-53] [45]
US patent 3037701A, Huberto M Sierra, "Floating decimal point arithmetic control means for calculator", issued 1962-06-05

[Kahan_1997_Status-55] [46]
Kahan, William Morton (1997-10-01). "Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic" (PDF). p. 9. Archived (PDF) from the original on 2002-06-22.

[Intel-56] [47]
"D.3.2.1". Intel 64 and IA-32 Architectures Software Developers' Manuals. Vol. 1.

[Harris-58] [48]
Harris, Richard (October 2010). "You're Going To Have To Think!". Overload (99): 5–10. ISSN 1354-3172. Retrieved 2011-09-24. Far more worrying is cancellation error which can yield catastrophic loss of precision.

[Barker-59] [49]
Christopher Barker: PEP 485 -- A Function for testing approximate equality

[GAO_report_IMTEC_92-26-60] [50]
"Patriot missile defense, Software problem led to system failure at Dharhan, Saudi Arabia". US Government Accounting Office. GAO report IMTEC 92-26.

[RalstonReilly2003-61] [51]
Wilkinson, James Hardy (2003-09-08). "Error Analysis". In Ralston, Anthony; Reilly, Edwin D.; Hemmendinger, David (eds.). Encyclopedia of Computer Science. Wiley. pp. 669–674. ISBN 978-0-470-86412-8. Retrieved 2013-05-14.

[Einarsson_2005-62] [52]
Einarsson, Bo (2005). Accuracy and reliability in scientific computing. Society for Industrial and Applied Mathematics (SIAM). pp. 50–. ISBN 978-0-89871-815-7. Retrieved 2013-05-14.

[Higham_2002-63] [53]
Higham, Nicholas John (2002). Accuracy and Stability of Numerical Algorithms (2nd ed.). Society for Industrial and Applied Mathematics (SIAM). pp. 27–28, 110–123, 493. ISBN 978-0-89871-521-7. 0-89871-355-2.

[OliveiraStewart_2006-64] [54]
Oliveira, Suely; Stewart, David E. (2006-09-07). Writing Scientific Software: A Guide to Good Style. Cambridge University Press. pp. 10–. ISBN 978-1-139-45862-7.

[Kahan_2005_ARITH17-65] [55]
Kahan, William Morton (2005-07-15). Floating-Point Arithmetic Besieged by "Business Decisions" (PDF). IEEE-sponsored ARITH 17, Symposium on Computer Arithmetic (Keynote Address). pp. 6, 18. Archived (PDF) from the original on 2006-03-17. Retrieved 2013-05-23. (NB. Kahan estimates that the incidence of excessively inaccurate results near singularities is reduced by a factor of approx. 1/2000 using the 11 extra bits of precision of double extended.)

[Kahan_2011_Debug-66] [56]
Kahan, William Morton (2011-08-03). Desperately Needed Remedies for the Undebuggability of Large Floating-Point Computations in Science and Engineering (PDF). IFIP/SIAM/NIST Working Conference on Uncertainty Quantification in Scientific Computing, Boulder, CO. p. 33. Archived (PDF) from the original on 2013-06-20.

[Kahan_2001_JavaHurt-69] [57]
Kahan, William Morton; Darcy, Joseph (2001) [1998-03-01]. "How Java's floating-point hurts everyone everywhere" (PDF). Archived (PDF) from the original on 2000-08-16. Retrieved 2003-09-05.

[Kahan_2000_Marketing-71] [58]
Kahan, William Morton (2000-08-27). "Marketing versus Mathematics" (PDF). pp. 15, 35, 47. Archived (PDF) from the original on 2003-08-15.

[Kahan_1981_WhyIEEE-72] [59]
Kahan, William Morton (1981-02-12). "Why do we need a floating-point arithmetic standard?" (PDF). p. 26. Archived (PDF) from the original on 2004-12-04.

[Kahan_2001_LN-73] [60]
Kahan, William Morton (2001-06-04). Bindel, David (ed.). "Lecture notes of System Support for Scientific Computation" (PDF). Archived (PDF) from the original on 2013-05-17.

[Speleotrove_2012-74] [61]
"General Decimal Arithmetic". Speleotrove.com. Retrieved 2012-04-25.

[Christiansen_Perl-75] [62]
Christiansen, Tom; Torkington, Nathan; et al. (2006). "perlfaq4 / Why is int() broken?". perldoc.perl.org. Retrieved 2011-01-11.

[Shewchuk-76] [63]
Shewchuk, Jonathan Richard (1997). "Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates". Discrete & Computational Geometry. 18 (3): 305–363. doi:10.1007/PL00009321.

[Kahan_1997_Cantilever-77] [64]
Kahan, William Morton; Ivory, Melody Y. (1997-07-03). "Roundoff Degrades an Idealized Cantilever" (PDF). Archived (PDF) from the original on 2003-12-05.

[Vectorizers-79] [65]
"Auto-Vectorization in LLVM". LLVM 13 documentation. We support floating point reduction operations when -ffast-math is used.

[FPM-80] [66]
"FloatingPointMath". GCC Wiki.

[harmful-81] [67]
"55522 – -funsafe-math-optimizations is unexpectedly harmful, especially w/ -shared". gcc.gnu.org.

[Gen-82] [68]
"Code Gen Options (The GNU Fortran Compiler)". gcc.gnu.org.

[zheevd-83] [69]
"Bug in zheevd · Issue #43 · Reference-LAPACK/lapack". GitHub.

[Becker-Darulova-Myreen-Tatlock_2019-84] [70]
Becker, Heiko; Darulova, Eva; Myreen, Magnus O.; Tatlock, Zachary (2019). Icing: Supporting Fast-Math Style Optimizations in a Verified Compiler. CAV 2019: Computer Aided Verification. Vol. 11562. pp. 155–173. doi:10.1007/978-3-030-25543-5_10.

Type	Sign bits	Exponent bits	Significand bits	Total bits	Exponent bias	Bits precision	Number of decimal digits
Half (IEEE 754-2008)	1	5	10	16	15	11	~3.3
Single	1	8	23	32	127	24	~7.2
Double	1	11	52	64	1023	53	~15.9
x86 extended precision	1	15	64	80	16383	64	~19.2
Quad	1	15	112	128	16383	113	~34.0

Type	Sign	Exponent	Trailing significand field	Total bits
FP8 (E4M3)	1	4	3	8
FP8 (E5M2)	1	5	2	8
Half-precision	1	5	10	16
Bfloat16	1	8	7	16
TensorFloat-32	1	8	10	19
Single-precision	1	8	23	32