5.1 Overview: formats and conformance
This clause defines several kinds of standard floating-point formats, in two radices, 2 and 10. All the formats specified by this standard are fixed-width. The precision and range of a fixed-width format are determinable from the program text, and the corresponding encoding is usually defined so that all members have the same size in storage.
Formats defined by this standard are interchange or non-interchange:
- interchange formats are formats with encodings defined in this standard. They are widely available for storage and for data interchange among platforms. The format names used in this standard are not usually those used in programming environments. Interchange formats defined by this standard are basic or storage:
  - basic formats are interchange formats, available for arithmetic. This standard defines three basic binary floating-point formats in lengths of 32, 64, and 128 bits, and two basic decimal floating-point formats in lengths of 64 and 128 bits. A programming environment conforms to this standard, in a particular radix, by implementing one or more of the basic formats of that radix. The choice of standard formats is language-defined or, if the relevant language standard is silent or defers to the implementation, implementation-defined. A conforming implementation of a basic format shall:
    - provide means to initialize and store that format,
    - provide all the operations of this standard for that format,
    - provide conversions between that basic format and all other implemented standard formats.
  - storage formats are narrow interchange formats. This standard defines one binary storage floating-point format of 16 bits length, and one decimal storage floating-point format of 32 bits length. To support a storage format, this standard only requires that conversions be provided between that storage format and all other supported formats of the same radix. Languages permitting computation upon storage formats should perform such computations in wider formats.
- non-interchange formats are formats whose encodings are not defined in this standard. None are required by this standard. If implemented they are available for arithmetic, but they might not be suitable for interchanging data among platforms.
5.2 Specification levels
Floating-point arithmetic is a systematic approximation of real arithmetic, as illustrated in Table 1. Floating-point arithmetic can only represent a finite subset of the continuum of real numbers. Consequently, certain properties of real arithmetic, such as associativity of addition, do not always hold for floating-point arithmetic.
Level 1 | {−∞ … 0 … +∞} | Extended real numbers. |
many-to-one ↓ | rounding | ↑ one-to-many |
Level 2 | {−∞ … −0} ∪ {+0 … +∞} ∪ NaN | Floating-point data – an algebraically closed system. |
one-to-many ↓ | representation specification | ↑ many-to-one |
Level 3 | (sign, exponent, significand) ∪ {−∞, +∞} ∪ qNaN ∪ sNaN | Representations of floating-point data. |
one-to-many ↓ | encoding for representations of floating-point data | ↑ many-to-one |
Level 4 | 0111000… | Bit strings. |
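The loss of associativity and the many-to-one character of rounding between Level 1 and Level 2 can be observed in a minimal sketch. It assumes that Python's `float` is the 64-bit binary format on the platform in use; the variable names are illustrative only.

```python
from fractions import Fraction

# Assumption: Python's float is the 64-bit binary format on this platform.
a, b, c = 0.1, 0.2, 0.3

# Each floating-point addition rounds its result to a floating-point datum.
left = (a + b) + c
right = a + (b + c)

# Fraction(x) is the exact real value of the datum x, so these sums are
# carried out in real (Level 1) arithmetic without rounding.
exact_left = (Fraction(a) + Fraction(b)) + Fraction(c)
exact_right = Fraction(a) + (Fraction(b) + Fraction(c))

associative_in_reals = exact_left == exact_right
associative_in_floats = left == right

# Rounding is many-to-one: two distinct real numbers round to one datum.
r1 = Fraction(1, 10)
r2 = Fraction(1, 10) + Fraction(1, 10**30)
same_datum = (r1 != r2) and (float(r1) == float(r2))
```

Here `associative_in_reals` holds while `associative_in_floats` does not, because the two groupings round at different intermediate sums.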
The mathematical structure underpinning the arithmetic in this standard is the extended reals, that is, the set of real numbers together with positive and negative infinity. For a given format, the process of rounding (see Clause 6) maps an extended real number to a floating-point datum included in that format. A floating-point datum, which can be a signed zero, finite non-zero number, signed infinity, or not-a-number (NaN), can be mapped to one or more representations of floating-point data in a format.
The representations of floating-point data in a format consist of:
- triples (sign, exponent, significand); in radix b, the floating-point number represented by a triple is $(-1)^{\mathit{sign}} \times b^{\mathit{exponent}} \times \mathit{significand}$
- +∞, −∞
- qNaN (quiet), sNaN (signaling)
An encoding maps a representation of a floating-point datum to a bit string. An encoding might map some representations of floating-point data to more than one bit string. Multiple NaN bit strings may be used to store retrospective diagnostic information (see 8.2).
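The relationship between bit strings (Level 4) and representations (Level 3) can be sketched for the 32-bit binary format. The field layout used here (1 sign bit, 8 exponent bits with bias 127, 23 trailing significand bits, and the leading trailing-significand bit distinguishing quiet from signaling NaNs) is an assumption taken from the commonly implemented encoding and is not defined in this subclause. The function names are hypothetical.

```python
import struct
from fractions import Fraction

def decode_binary32(bits):
    """Map a 32-bit string, given as an int, to a Level 3 representation.

    Assumed layout (not defined in this subclause): 1 sign bit, 8 biased
    exponent bits (bias 127), 23 trailing significand bits.
    """
    sign = bits >> 31
    biased = (bits >> 23) & 0xFF
    trailing = bits & 0x7FFFFF
    if biased == 0xFF:
        if trailing == 0:
            return "-inf" if sign else "+inf"
        # Assumption: a set leading trailing-significand bit marks a quiet NaN.
        return "qNaN" if trailing >> 22 else "sNaN"
    if biased == 0:
        # Zero or subnormal: no implicit leading 1, exponent fixed at emin.
        return (sign, -126, Fraction(trailing, 2**23))
    return (sign, biased - 127, 1 + Fraction(trailing, 2**23))

def value(rep):
    """Evaluate (-1)^sign * 2^exponent * significand for a triple."""
    sign, exponent, significand = rep
    return (-1) ** sign * Fraction(2) ** exponent * significand

def bits_of(x):
    """Return the 32-bit string that stores x, as an unsigned int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

triple = decode_binary32(bits_of(0.15625))
# Two different bit strings that decode to the same representation, qNaN.
nan_a = decode_binary32(0x7FC00000)
nan_b = decode_binary32(0x7FC00001)
```

Under these assumptions the two NaN bit strings differ only in their low-order bits, which is where diagnostic information could be carried.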
5.3 Sets of floating-point data
This subclause specifies the sets of floating-point data representable within floating-point formats; the encodings for representations of floating-point data in interchange formats are discussed in 5.4 and 5.5. The set of finite floating-point numbers representable within a particular format is determined by the following integer parameters:
- b = the radix, 2 or 10
- p = the number of significant digits (precision)
- emax = the maximum exponent e
- emin = the minimum exponent e
  - Shall be either 1 − emax or −emax.
  - Should be 1 − emax.
The values of these parameters for each interchange format are given in Table 2; constraints on these parameters for extended formats are given in Table 7. Table 2 refers to interchange formats by the number of bits in their encoding. Within each format, the following floating-point data shall be provided:
- Signed zero and non-zero floating-point numbers of the form $(-1)^{s} \times b^{e} \times m$, where:
  - s is 0 or 1
  - e is any integer emin ≤ e ≤ emax
  - m is a number represented by a digit string of the form $d_0 \bullet d_1 d_2 \ldots d_{p-1}$, where $d_i$ is an integer digit $0 \le d_i < b$ (therefore $0 \le m < b$)
- NaN
These are the only floating-point data provided.
In the foregoing description, the significand m is viewed in a scientific form, with the radix point immediately following the first digit. It is also convenient for some purposes to view the significand as an integer; then the finite floating-point numbers are described thus:
- Signed zero and non-zero floating-point numbers of the form $(-1)^{s} \times b^{q} \times c$, where:
  - s is 0 or 1
  - q is any integer emin ≤ q + p − 1 ≤ emax
  - c is a number represented by a digit string of the form $d_0 d_1 d_2 \ldots d_{p-1}$, where $d_i$ is an integer digit $0 \le d_i < b$ (c is therefore an integer with $0 \le c < b^{p}$)
This view of the significand as an integer c, with its corresponding exponent q, describes exactly the same set of zero and non-zero floating-point numbers as the view in scientific form. (For finite floating-point numbers, $e = q + p - 1$ and $m = c \times b^{1-p}$.)
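The equivalence of the two views can be checked by exhaustive enumeration on a small format. The sketch below uses a hypothetical toy format (b = 2, p = 3, emax = 2, emin = 1 − emax) chosen only to keep the enumeration short; it is not a format defined by this standard, and the function names are illustrative.

```python
from fractions import Fraction

def scientific_view(b, p, emin, emax):
    """Finite values (-1)^s * b^e * m, with m = d0.d1...d(p-1) in radix b."""
    values = set()
    for s in (0, 1):
        for e in range(emin, emax + 1):
            for digits in range(b ** p):            # every p-digit string
                m = Fraction(digits, b ** (p - 1))  # radix point after d0
                values.add((-1) ** s * Fraction(b) ** e * m)
    return values

def integer_view(b, p, emin, emax):
    """Finite values (-1)^s * b^q * c, with emin <= q + p - 1 <= emax."""
    values = set()
    for s in (0, 1):
        for q in range(emin - p + 1, emax - p + 2):
            for c in range(b ** p):                 # integer 0 <= c < b^p
                values.add((-1) ** s * Fraction(b) ** q * c)
    return values

# Hypothetical toy format, small enough to enumerate completely.
B, P, EMAX = 2, 3, 2
EMIN = 1 - EMAX

sci = scientific_view(B, P, EMIN, EMAX)
integ = integer_view(B, P, EMIN, EMAX)
views_agree = sci == integ
```

Because the sets hold exact rational values, +0 and −0 collapse to a single element; the comparison therefore covers numerical values only, not the sign of zero.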
The smallest positive normal floating-point number is $b^{\mathit{emin}}$ and the largest is $b^{\mathit{emax}} \times (b - b^{1-p})$. The non-zero floating-point numbers for a format with magnitude less than $b^{\mathit{emin}}$ are called subnormal because their magnitudes lie between zero and the smallest normal magnitude. Subnormal numbers are distinguished from normal numbers because of reduced precision and, in binary, because of different encoding methods. Every finite floating-point number is an integral multiple of the smallest subnormal magnitude $b^{\mathit{emin}} \times b^{1-p}$.
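These bounds can be evaluated for the 64-bit binary format with the parameters that Table 2 lists for it (p = 53, emax = 1023, emin = −1022). The sketch assumes that Python's `float` is that format on the platform in use, and compares the computed values with those reported by `sys.float_info`.

```python
import math
import sys
from fractions import Fraction

# Parameters of the 64-bit binary format, as listed in Table 2.
b, p, emax, emin = 2, 53, 1023, -1022

# b^emin: smallest positive normal number.
smallest_normal = math.ldexp(1.0, emin)
# b^emax * (b - b^(1-p)): largest finite number.
largest_finite = math.ldexp(b - math.ldexp(1.0, 1 - p), emax)
# b^emin * b^(1-p): smallest positive subnormal number.
smallest_subnormal = math.ldexp(1.0, emin + 1 - p)

def is_multiple_of_smallest_subnormal(x):
    """Check exactly, in rational arithmetic, that x = k * smallest_subnormal."""
    return (Fraction(x) / Fraction(smallest_subnormal)).denominator == 1

# Assumption check: the platform float has 53 significant binary digits.
platform_is_binary64 = sys.float_info.mant_dig == p
```

On such a platform `smallest_normal` and `largest_finite` coincide with `sys.float_info.min` and `sys.float_info.max`.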
For a floating-point number that has the value zero, the sign bit s provides an extra bit of information. Although all formats have distinct representations for +0 and −0, the sign of a zero is significant in some circumstances, such as division by zero, but not in others (see 8.3). Binary interchange formats have just one representation each for +0 and −0, but decimal formats have many. In this standard, 0 and ∞ are written without a sign when the sign is not important.
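The behavior of signed zeros can be observed in a short sketch. It uses `math.atan2` rather than division by zero as the sign-sensitive operation, because Python raises an exception on floating-point division by zero. Python's `decimal` module serves only as a stand-in to show several decimal representations of zero; it is not an implementation of the decimal interchange encodings.

```python
import math
from decimal import Decimal

pos, neg = 0.0, -0.0

# Comparison ignores the sign of zero...
zeros_compare_equal = pos == neg
# ...but the sign is still part of the datum.
sign_is_stored = math.copysign(1.0, neg) == -1.0

# An operation for which the sign of a zero operand changes the result.
angle_pos = math.atan2(0.0, pos)
angle_neg = math.atan2(0.0, neg)

# Decimal: triples with different exponents can all denote zero.
# Stand-in only: decimal.Decimal is not an interchange encoding.
z0 = Decimal("0")
z3 = Decimal("0E+3")
decimal_zeros_equal = z0 == z3
decimal_zeros_distinct = z0.as_tuple() != z3.as_tuple()
```

The two decimal zeros compare equal yet carry different exponents, which is the sense in which a decimal format has many representations of zero.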
parameter | binary16 storage | binary32 basic | binary64 basic | binary128 basic | decimal32 storage | decimal64 basic | decimal128 basic |
---|---|---|---|---|---|---|---|
radix b | 2 | 2 | 2 | 2 | 10 | 10 | 10 |
p digits | 11 | 24 | 53 | 113 | 7 | 16 | 34 |
emax | +15 | +127 | +1023 | +16383 | +96 | +384 | +6144 |
emin | −14 | −126 | −1022 | −16382 | −95 | −383 | −6143 |
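As a consistency check, the tabulated parameters can be tested against the recommended relation emin = 1 − emax, and the smallest integer-view exponent q = emin − p + 1 can be derived for each format. The dictionary below is transcribed from the table; the names `TABLE_2` and `min_q` are illustrative.

```python
# Values transcribed from the table: format -> (radix b, p, emax, emin).
TABLE_2 = {
    "binary16":   (2, 11, 15, -14),
    "binary32":   (2, 24, 127, -126),
    "binary64":   (2, 53, 1023, -1022),
    "binary128":  (2, 113, 16383, -16382),
    "decimal32":  (10, 7, 96, -95),
    "decimal64":  (10, 16, 384, -383),
    "decimal128": (10, 34, 6144, -6143),
}

# Every tabulated format follows the recommended choice emin = 1 - emax.
follows_recommendation = all(
    emin == 1 - emax for (_, _, emax, emin) in TABLE_2.values()
)

# Smallest exponent q in the integer view, from emin <= q + p - 1.
min_q = {name: emin - p + 1 for name, (_, p, _, emin) in TABLE_2.items()}
```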