
5. Formats

Copyright © 2006 by the IEEE

Source: DRAFT Standard for Floating-Point Arithmetic P754, pp. 17-19

5.1 Overview: formats and conformance

This clause defines several kinds of standard floating-point formats, in two radices, 2 and 10. All the formats specified by this standard are fixed-width. The precision and range of a fixed-width format are determinable from the program text, and the corresponding encoding is usually defined so that all members have the same size in storage.

Formats defined by this standard are interchange or non-interchange:

5.2 Specification levels

Floating-point arithmetic is a systematic approximation of real arithmetic, as illustrated in Table 1. Floating-point arithmetic can only represent a finite subset of the continuum of real numbers. Consequently, certain properties of real arithmetic, such as associativity of addition, do not always hold for floating-point arithmetic.
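As an illustration of the lost associativity, the following Python sketch adds the same three values in two groupings. It assumes that Python's float is the binary64 format, which holds on mainstream platforms but is not stated in this clause; the variable names are illustrative only.

```python
# Assumption: Python's float is the binary64 format of Table 2.
left = (0.1 + 0.2) + 0.3    # round after 0.1 + 0.2, then again after adding 0.3
right = 0.1 + (0.2 + 0.3)   # round after 0.2 + 0.3, then again after adding 0.1

# Real addition is associative, so the two sums are equal over the reals.
# In floating-point, each intermediate sum is rounded, and the results differ.
groupings_agree = (left == right)
print(groupings_agree)
print(left, right)
```

The two results differ only in the last place, which is the typical size of such discrepancies for well-scaled operands.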

Table 1 – Relationships between different specification levels for a particular format

Level 1   {−∞ … 0 … +∞}
          Extended real numbers.
              many-to-one ↓ rounding ↑ one-to-many
Level 2   {−∞ … −0} ∪ {+0 … +∞} ∪ NaN
          Floating-point data – an algebraically closed system.
              one-to-many ↓ representation specification ↑ many-to-one
Level 3   (sign, exponent, significand) ∪ {−∞, +∞} ∪ qNaN ∪ sNaN
          Representations of floating-point data.
              one-to-many ↓ encoding for representations of floating-point data ↑ many-to-one
Level 4   0111000…
          Bit strings.

The mathematical structure underpinning the arithmetic in this standard is the extended reals, that is, the set of real numbers together with positive and negative infinity. For a given format, the process of rounding (see Clause 6) maps an extended real number to a floating-point datum included in that format. A floating-point datum, which can be a signed zero, finite non-zero number, signed infinity, or not-a-number (NaN), can be mapped to one or more representations of floating-point data in a format.
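The many-to-one character of rounding can be sketched in Python. The example assumes that Python's float is binary64 and that string-to-float conversion rounds to nearest. Both assumptions come from the Python platform rather than from this clause.

```python
from fractions import Fraction

# Two different real numbers, written as decimal strings.
x = float("0.1")
y = float("0.10000000000000000001")

# Rounding maps both reals to the same floating-point datum (many-to-one).
same_datum = (x == y)

# The datum is itself an exact rational number, and it is not 1/10.
exact = Fraction(x)
rounding_error = exact - Fraction(1, 10)

print(same_datum)
print(exact, rounding_error)
```

Because the radix is 2, the denominator of the exact value is a power of two, which is why 1/10 cannot be represented without rounding.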

The representations of floating-point data in a format consist of:

An encoding maps a representation of a floating-point datum to a bit string. An encoding might map some representations of floating-point data to more than one bit string. Multiple NaN bit strings may be used to store retrospective diagnostic information (see 8.2).
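The Level 3 to Level 4 mapping can be inspected from Python with the standard struct module. This is a minimal sketch that assumes Python's float is binary64; the helper names bits_of and from_bits are hypothetical and exist only for this illustration.

```python
import math
import struct

def bits_of(x: float) -> str:
    """Return the 64-bit big-endian bit string of x, written in hexadecimal."""
    return struct.pack(">d", x).hex()

def from_bits(hex_string: str) -> float:
    """Decode a 16-digit hexadecimal bit string as a binary64 datum."""
    return struct.unpack(">d", bytes.fromhex(hex_string))[0]

# The two zeros have distinct bit strings that differ only in the sign bit.
pos_zero_bits = bits_of(0.0)    # '0000000000000000'
neg_zero_bits = bits_of(-0.0)   # '8000000000000000'

# Two distinct bit strings that both decode to a NaN.
nan_a = from_bits("7ff8000000000000")
nan_b = from_bits("7ff8000000000001")
print(math.isnan(nan_a), math.isnan(nan_b))
```

Whether the differing trailing bits of a NaN survive later arithmetic depends on the platform, so the sketch checks only that both bit strings decode to a NaN.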

5.3 Sets of floating-point data

This subclause specifies the sets of floating-point data representable within floating-point formats; the encodings for representations of floating-point data in interchange formats are discussed in 5.4 and 5.5. The set of finite floating-point numbers representable within a particular format is determined by the following integer parameters:

The values of these parameters for each interchange format are given in Table 2; constraints on these parameters for extended formats are given in Table 7. Table 2 refers to interchange formats by the number of bits in their encoding. Within each format, the following floating-point data shall be provided:

These are the only floating-point data provided.

In the foregoing description, the significand m is viewed in a scientific form, with the radix point immediately following the first digit. It is also convenient for some purposes to view the significand as an integer; then the finite floating-point numbers are described thus:

This view of the significand as an integer c, with its corresponding exponent q, describes exactly the same set of zero and non-zero floating-point numbers as the view in scientific form. (For finite floating-point numbers, $e = q + p - 1$ and $m = c \times b^{1-p}$.)
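The relationship between the two views can be checked numerically. The sketch below assumes binary64 (b = 2, p = 53) as the format behind Python's float and handles normal numbers only; zeros, subnormals, infinities, and NaNs are deliberately left out. The function names are hypothetical.

```python
import math
from fractions import Fraction

P = 53  # precision p of binary64, from Table 2

def scientific_view(x: float):
    """Return (e, m) with x = m * 2**e and 1 <= |m| < 2, for a normal x."""
    f, exp = math.frexp(x)          # x = f * 2**exp with 0.5 <= |f| < 1
    return exp - 1, Fraction(f) * 2

def integer_view(x: float):
    """Return (q, c) with x = c * 2**q and c an integer, for a normal x."""
    e, m = scientific_view(x)
    c = m * Fraction(2) ** (P - 1)  # c = m * b**(p - 1)
    assert c.denominator == 1       # c is an integer for any normal x
    return e - P + 1, int(c)        # q = e - p + 1

e, m = scientific_view(6.5)
q, c = integer_view(6.5)
print(e, m, q, c)
```

Exact rational arithmetic is used so that the identities e = q + p − 1 and m = c × b^(1−p) can be compared without introducing further rounding.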

The smallest positive normal floating-point number is $b^{emin}$ and the largest is $b^{emax} \times (b - b^{1-p})$. The non-zero floating-point numbers for a format with magnitude less than $b^{emin}$ are called subnormal because their magnitudes lie between zero and the smallest normal magnitude. Subnormal numbers are distinguished from normal numbers because of reduced precision and, in binary, because of different encoding methods. Every finite floating-point number is an integral multiple of the smallest subnormal magnitude $b^{emin} \times b^{1-p}$.
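These three magnitudes can be evaluated exactly for one format. The sketch below uses the binary64 parameters from Table 2 and compares the results with the constants Python reports in sys.float_info, on the assumption that Python's float is binary64. The helper is_multiple_of_smallest_subnormal is hypothetical.

```python
import sys
from fractions import Fraction

b, p, emax, emin = 2, 53, 1023, -1022   # binary64 column of Table 2

smallest_normal = Fraction(b) ** emin
largest_finite = Fraction(b) ** emax * (b - Fraction(b) ** (1 - p))
smallest_subnormal = Fraction(b) ** emin * Fraction(b) ** (1 - p)

def is_multiple_of_smallest_subnormal(x: float) -> bool:
    """Check that a finite x is an integral multiple of the smallest subnormal."""
    return (Fraction(x) / smallest_subnormal).denominator == 1

print(float(smallest_normal) == sys.float_info.min)
print(float(largest_finite) == sys.float_info.max)
print(is_multiple_of_smallest_subnormal(0.1))
```

Fraction keeps every quantity exact, so the comparisons against sys.float_info involve no rounding beyond the final conversion of an exactly representable value.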

For a floating-point number that has the value zero, the sign bit s provides an extra bit of information. Although all formats have distinct representations for +0 and −0, the sign of a zero is significant in some circumstances, such as division by zero, but not in others (see 8.3). Binary interchange formats have just one representation each for +0 and −0, but decimal formats have many. In this standard, 0 and ∞ are written without a sign when the sign is not important.
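The behaviour of signed zeros can be observed from Python, with two caveats that are assumptions of this sketch. First, Python raises an exception on float division by zero, so atan2 is used in its place as an operation whose result depends on the sign of a zero. Second, Python's decimal module is used only as an analogy for the many decimal representations of zero; it is not an implementation of the decimal interchange encodings.

```python
import math
from decimal import Decimal

pos, neg = 0.0, -0.0

# The two zeros compare equal, yet the sign is still recoverable.
zeros_compare_equal = (pos == neg)            # True
sign_of_neg_zero = math.copysign(1.0, neg)    # -1.0

# An operation whose result depends on the sign of a zero argument.
angle_pos = math.atan2(0.0, pos)              # 0.0
angle_neg = math.atan2(0.0, neg)              # pi

# Decimal zeros: equal in value, different exponents in their representation.
z1, z2 = Decimal("0E+3"), Decimal("0.00")
print(z1 == z2, z1.as_tuple().exponent, z2.as_tuple().exponent)
```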

Table 2 – Interchange format parameters defining floating-point numbers

             Binary format (b = 2)                        Decimal format (b = 10)
parameter    binary16  binary32  binary64  binary128     decimal32  decimal64  decimal128
             storage   basic     basic     basic         storage    basic      basic
p, digits    11        24        53        113           7          16         34
emax         +15       +127      +1023     +16383        +96        +384       +6144
emin         −14       −126      −1022     −16382        −95        −383       −6143
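The parameters of Table 2 can be carried as data and used to derive the largest finite number of each format. In the sketch below, the dictionary FORMATS and the function largest_finite are hypothetical names. The closed form (b^p − 1) × b^(emax − p + 1) is an algebraic rearrangement of b^emax × (b − b^(1−p)) that keeps the result an exact integer. The check emin = 1 − emax is a regularity read off the table rows, not a rule quoted from this excerpt.

```python
import sys

# (radix b, precision p, emax, emin), copied from Table 2
FORMATS = {
    "binary16":   (2, 11, 15, -14),
    "binary32":   (2, 24, 127, -126),
    "binary64":   (2, 53, 1023, -1022),
    "binary128":  (2, 113, 16383, -16382),
    "decimal32":  (10, 7, 96, -95),
    "decimal64":  (10, 16, 384, -383),
    "decimal128": (10, 34, 6144, -6143),
}

def largest_finite(b: int, p: int, emax: int) -> int:
    """Largest finite number of a format, as an exact integer."""
    return (b**p - 1) * b ** (emax - p + 1)

# Every row of the table satisfies emin = 1 - emax.
emin_rule_holds = all(emin == 1 - emax for _, _, emax, emin in FORMATS.values())

for name, (b, p, emax, emin) in FORMATS.items():
    digits = len(str(largest_finite(b, p, emax)))
    print(f"{name}: largest finite number has {digits} decimal digits")
```

For example, largest_finite(2, 11, 15) gives 65504 for binary16, and the binary64 value converts to the float that Python reports as sys.float_info.max, assuming Python's float is binary64.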