Chapter 5. Formats // DRAFT Standard for Floating-Point Arithmetic P754, p. 17-19

5. Formats

Copyright © 2006 by the IEEE

Source: DRAFT Standard for Floating-Point Arithmetic P754, p. 17-19

5.1 Overview: formats and conformance

This clause defines several kinds of standard floating-point formats, in two radices, 2 and 10. All the formats specified by this standard are fixed-width. The precision and range of a fixed-width format are determinable from the program text, and the corresponding encoding is usually defined so that all members have the same size in storage.

Formats defined by this standard are interchange or non-interchange:

interchange formats are formats with encodings defined in this standard. They are widely available for storage and for data interchange among platforms. The format names used in this standard are not usually those used in programming environments. Interchange formats defined by this standard are basic or storage:

basic formats are interchange formats, available for arithmetic. This standard defines three basic binary floating-point formats in lengths of 32, 64, and 128 bits, and two basic decimal floating-point formats in lengths of 64 and 128 bits. A programming environment conforms to this standard, in a particular radix, by implementing one or more of the basic formats of that radix. The choice of standard formats is language-defined or, if the relevant language standard is silent or defers to the implementation, implementation-defined. A conforming implementation of a basic format shall:

storage formats are narrow interchange formats. This standard defines one binary storage floating-point format of 16 bits length, and one decimal storage floating-point format of 32 bits length. To support a storage format, this standard only requires that conversions be provided between that storage format and all other supported formats of the same radix. Languages permitting computation upon storage formats should perform such computations in wider formats.

non-interchange formats are formats whose encodings are not defined in this standard. None are required by this standard. If implemented they are available for arithmetic, but they might not be suitable for interchanging data among platforms.

5.2 Specification levels

Floating-point arithmetic is a systematic approximation of real arithmetic, as illustrated in Table 1. Floating-point arithmetic can only represent a finite subset of the continuum of real numbers. Consequently certain properties of real arithmetic, such as associativity of addition, do not always hold for floating-point arithmetic.

Table 1 – Relationships between different specification levels for a particular format
Level 1	{−∞ … 0 … +∞}	Extended real numbers.
many-to-one ↓	rounding	↑ one-to-many
Level 2	{−∞ … −0} ∪ {+0 … +∞} ∪ NaN	Floating-point data – an algebraically closed system.
one-to-many ↓	representation specification	↑ many-to-one
Level 3	(sign, exponent, significand) ∪ {−∞, +∞} ∪ qNaN ∪ sNaN	Representations of floating-point data.
one-to-many ↓	encoding for representations of floating-point data	↑ many-to-one
Level 4	0111000…	Bit strings.

The mathematical structure underpinning the arithmetic in this standard is the extended reals, that is, the set of real numbers together with positive and negative infinity. For a given format, the process of rounding (see Clause 6) maps an extended real number to a floating-point datum included in that format. A floating-point datum, which can be a signed zero, finite non-zero number, signed infinity, or not-a-number (NaN), can be mapped to one or more representations of floating-point data in a format.
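The many-to-one character of rounding, and the loss of associativity noted at the start of this subclause, can be observed directly. The following is a minimal sketch, assuming a Python implementation whose float type is the binary64 format (true of common implementations, but an assumption here). fractions.Fraction stands in for exact real arithmetic, and the variable names are illustrative only.

```python
from fractions import Fraction

# Level 1 -> Level 2: rounding maps an exact real to a floating-point datum.
tenth = Fraction(1, 10)             # the real number 1/10, held exactly
datum = float(tenth)                # rounded to a binary64 datum (assumed format)

# The datum is close to, but not equal to, the real number it came from.
rounding_error = Fraction(datum) - tenth

# Rounding is many-to-one: a nearby but distinct real rounds to the same datum.
nearby = tenth + Fraction(1, 10**20)
same_datum = float(nearby) == datum

# Addition of floating-point data is not always associative.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(rounding_error != 0, same_datum, left == right)
```

Converting the datum back with Fraction(datum) is exact, which is why the rounding error can be inspected without introducing a second rounding.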

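The representations of floating-point data in a format consist of:

triples (sign, exponent, significand); in radix b, the floating-point number represented by a triple is (−1)^sign × b^exponent × significand

+∞ and −∞

qNaN (quiet) and sNaN (signaling)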

An encoding maps a representation of a floating-point datum to a bit string. An encoding might map some representations of floating-point data to more than one bit string. Multiple NaN bit strings may be used to store retrospective diagnostic information (see 8.2).
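The step from a bit string back to a representation can be sketched for one format. The example below assumes that the host's float is binary64 and that its encoding has 1 sign bit, 11 biased-exponent bits with bias 1023, and 52 trailing significand bits. That layout is the commonly used one, but it is not specified in this subclause, so treat it as an assumption. The function names are hypothetical.

```python
import math
import struct
from fractions import Fraction

EMIN, BIAS = -1022, 1023    # binary64; the bias and field widths are assumptions

def decode_binary64(x):
    """Return (sign, exponent, significand) for a finite float, else a label."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]    # Level 4 bit string
    sign = bits >> 63
    biased = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    if biased == 0x7FF:                  # all-ones exponent field: infinity or NaN
        if fraction:
            return "NaN"
        return "-inf" if sign else "+inf"
    if biased == 0:                      # zero or subnormal: no implicit leading 1
        return sign, EMIN, Fraction(fraction, 1 << 52)
    return sign, biased - BIAS, 1 + Fraction(fraction, 1 << 52)

def value(sign, exponent, significand):
    """Exact value (-1)^sign * 2^exponent * significand of a triple."""
    return (-1) ** sign * Fraction(2) ** exponent * significand

def from_bits(bits):
    """Reinterpret a 64-bit integer as a float."""
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

# Two different bit strings that both encode a NaN.
nan_a = from_bits(0x7FF8000000000000)
nan_b = from_bits(0x7FF8000000000001)
print(decode_binary64(-0.375), math.isnan(nan_a), math.isnan(nan_b))
```

The two NaN bit strings differ only in their trailing bits, which is the room an implementation has for carrying diagnostic information.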

5.3 Sets of floating-point data

This subclause specifies the sets of floating-point data representable within floating-point formats; the encodings for representations of floating-point data in interchange formats are discussed in 5.4 and 5.5. The set of finite floating-point numbers representable within a particular format is determined by the following integer parameters:

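b = the radix, 2 or 10

p = the number of digits in the significand (precision)

emax = the maximum exponent e

emin = the minimum exponent e; emin shall be 1 − emax for all formats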

The values of these parameters for each interchange format are given in Table 2; constraints on these parameters for extended formats are given in Table 7. Table 2 refers to interchange formats by the number of bits in their encoding. Within each format, the following floating-point data shall be provided:

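Signed zero and non-zero floating-point numbers of the form (−1)^s × b^e × m, where s is 0 or 1; e is any integer with emin ≤ e ≤ emax; and m is a number represented by a digit string of the form d₀ • d₁ d₂ … d_(p−1), where d_i is an integer digit with 0 ≤ d_i < b (therefore 0 ≤ m < b).

Two infinities, +∞ and −∞.

Two NaNs, qNaN (quiet) and sNaN (signaling).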

These are the only floating-point data provided.

In the foregoing description, the significand m is viewed in a scientific form, with the radix point immediately following the first digit. It is also convenient for some purposes to view the significand as an integer; then the finite floating-point numbers are described thus:

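Signed zero and non-zero floating-point numbers of the form (−1)^s × b^q × c, where s is 0 or 1; q is any integer with emin ≤ q + p − 1 ≤ emax; and c is a number represented by a digit string of the form d₀ d₁ d₂ … d_(p−1), where d_i is an integer digit with 0 ≤ d_i < b (c is therefore an integer with 0 ≤ c < b^p).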

This view of the significand as an integer c, with its corresponding exponent q, describes exactly the same set of zero and non-zero floating-point numbers as the view in scientific form. (For finite floating-point numbers, e = q + p − 1 and m = c × b^(1−p).)
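The equivalence of the two views can be checked numerically. The sketch below assumes Python's decimal.Decimal as a convenient source of (sign, coefficient, exponent) triples, which correspond to (s, c, q) in the integer view. It uses the decimal32 precision p = 7 from Table 2. The helper names are hypothetical.

```python
from decimal import Decimal
from fractions import Fraction

def integer_view(d):
    """Return (s, q, c) for a finite Decimal, so that value = (-1)^s * 10^q * c."""
    sign, digits, exponent = d.as_tuple()
    c = int("".join(map(str, digits)))
    return sign, exponent, c

def scientific_view(s, q, c, b, p):
    """Convert (s, q, c) to (s, e, m) using e = q + p - 1 and m = c * b^(1 - p)."""
    return s, q + p - 1, c * Fraction(b) ** (1 - p)

b, p = 10, 7                                 # decimal32 parameters from Table 2
s, q, c = integer_view(Decimal("1.50"))      # s = 0, q = -2, c = 150
_, e, m = scientific_view(s, q, c, b, p)     # e = 4, m = 150 * 10^-6

as_integer = (-1) ** s * Fraction(b) ** q * c
as_scientific = (-1) ** s * Fraction(b) ** e * m
print(as_integer == as_scientific)
```

Note that m is not normalized here: with c = 150 and p = 7 the leading digits of the digit string are zeros, which the definition 0 ≤ m < b permits.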

The smallest positive normal floating-point number is b^emin and the largest is b^emax × (b − b^(1−p)). The non-zero floating-point numbers for a format with magnitude less than b^emin are called subnormal because their magnitudes lie between zero and the smallest normal magnitude. Subnormal numbers are distinguished from normal numbers because of reduced precision and, in binary, because of different encoding methods. Every finite floating-point number is an integral multiple of the smallest subnormal magnitude b^emin × b^(1−p).
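These three magnitudes follow mechanically from b, p, emin and emax. As a rough check, the sketch below evaluates them exactly for the binary64 column of Table 2 and compares them with the values reported by sys.float_info, assuming the host's float is binary64. The function name extremes is hypothetical.

```python
import sys
from fractions import Fraction

def extremes(b, p, emin, emax):
    """Exact values of the three magnitudes named in the text."""
    step = Fraction(b) ** (1 - p)                      # b^(1-p)
    smallest_normal = Fraction(b) ** emin              # b^emin
    largest_finite = Fraction(b) ** emax * (b - step)  # b^emax * (b - b^(1-p))
    smallest_subnormal = smallest_normal * step        # b^emin * b^(1-p)
    return smallest_normal, largest_finite, smallest_subnormal

# binary64 column of Table 2: p = 53, emax = +1023, emin = -1022.
smallest_normal, largest_finite, smallest_subnormal = extremes(2, 53, -1022, 1023)

matches_host = (
    Fraction(sys.float_info.min) == smallest_normal
    and Fraction(sys.float_info.max) == largest_finite
)

# Every finite number is an integral multiple of the smallest subnormal.
quotient = Fraction(0.1) / smallest_subnormal
print(matches_host, quotient.denominator == 1)
```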

For a floating-point number that has the value zero, the sign bit s provides an extra bit of information. Although all formats have distinct representations for +0 and −0, the sign of a zero is significant in some circumstances, such as division by zero, but not in others (see 8.3). Binary interchange formats have just one representation each for +0 and −0, but decimal formats have many. In this standard, 0 and ∞ are written without a sign when the sign is not important.
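The behavior of signed zeros can be illustrated briefly. This is a sketch under two assumptions: that the host's float is a binary format with signed zeros, and that Python's decimal.Decimal is an acceptable stand-in for a decimal format. Python raises ZeroDivisionError on float division by zero, so the example uses math.atan2 rather than division to show a case where the sign of zero matters.

```python
import math
from decimal import Decimal

pos_zero, neg_zero = 0.0, -0.0

# The two zeros compare equal, yet the sign is still stored.
equal = pos_zero == neg_zero
signs = (math.copysign(1.0, pos_zero), math.copysign(1.0, neg_zero))

# A circumstance where the sign of zero is significant: atan2 changes branch.
angles = (math.atan2(0.0, pos_zero), math.atan2(0.0, neg_zero))

# Decimal zeros: one value, several (sign, coefficient, exponent) representations.
zeros = [Decimal("0"), Decimal("0.00"), Decimal("0E+3")]
all_equal = all(z == zeros[0] for z in zeros)
distinct_exponents = {z.as_tuple().exponent for z in zeros}

print(equal, signs, angles, all_equal, sorted(distinct_exponents))
```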

Table 2 – Interchange format parameters defining floating-point numbers
	Binary format (b=2)				Decimal format (b=10)
parameter	binary16 storage	binary32 basic	binary64 basic	binary128 basic	decimal32 storage	decimal64 basic	decimal128 basic
p digits	11	24	53	113	7	16	34
emax	+15	+127	+1023	+16383	+96	+384	+6144
emin	−14	−126	−1022	−16382	−95	−383	−6143
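Table 2 can be transcribed as data and sanity-checked. The sketch below records each column, confirms that emin equals 1 − emax in every column, and models decimal64 with Python's decimal.Context. Using that module is an assumption made for illustration, not something this standard prescribes, and the helper decimal_context is hypothetical.

```python
from decimal import Context, Decimal

# Table 2, transcribed as format name -> (b, p, emax, emin).
TABLE_2 = {
    "binary16":   (2, 11, 15, -14),
    "binary32":   (2, 24, 127, -126),
    "binary64":   (2, 53, 1023, -1022),
    "binary128":  (2, 113, 16383, -16382),
    "decimal32":  (10, 7, 96, -95),
    "decimal64":  (10, 16, 384, -383),
    "decimal128": (10, 34, 6144, -6143),
}

# In every column of the table, emin equals 1 - emax.
relation_holds = all(emin == 1 - emax for _, _, emax, emin in TABLE_2.values())

def decimal_context(name):
    """Build a decimal.Context with the precision and exponent range of a decimal format."""
    b, p, emax, emin = TABLE_2[name]
    if b != 10:
        raise ValueError("only decimal formats can be modelled with decimal.Context")
    return Context(prec=p, Emax=emax, Emin=emin, clamp=1)

ctx = decimal_context("decimal64")

# Etiny() is Emin - prec + 1, the exponent q of the smallest subnormal 10^q.
smallest_subnormal_exponent = ctx.Etiny()
print(relation_holds, smallest_subnormal_exponent)
```

For decimal64 the smallest subnormal exponent evaluates to −383 − 16 + 1 = −398, which agrees with the magnitude b^emin × b^(1−p) computed from the same column.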