# Numerical Precision

**Topics:** Floating point, Decimal, Fixed-point arithmetic


**Published:** May 8, 2013

The IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754) is the most widely used standard for floating-point computation, and is followed by many CPU and FPU implementations. The standard defines formats for representing floating-point numbers and special values, together with a set of floating-point operations on these values. It also specifies four rounding modes and five exceptions (Michael L. Overton).

## 2. How floating point numbers are stored in memory

An IEEE-754 float (4 bytes) or double (8 bytes) has three components (there is also an 80-bit extended-precision format, commonly padded to 96 bits in memory): a sign bit telling whether the number is positive or negative, an exponent giving its order of magnitude, and a mantissa specifying the actual digits of the number. Using single-precision floats as an example, the internal representation uses 1 bit for the sign (S), 8 bits for the exponent (E), and 23 bits for the mantissa (M):

```
bit #   31 30......23 22.....................0
         S EEEEEEEE   MMMMMMMMMMMMMMMMMMMMMMM
```

On a little-endian machine the number is stored with high memory to the right, so the low-order mantissa byte comes first and the sign and high exponent bits land in the last byte:

```
Byte 0      Byte 1      Byte 2           Byte 3
MMMMMMMM    MMMMMMMM    EMMMMMMM         SEEEEEEE
(M 7..0)    (M 15..8)   (E 0, M 22..16)  (S, E 7..1)
```

## 3. The difficulty of manipulating and using floating-point numbers in C calculations

There are two reasons why a real number might not be exactly representable as a floating-point number. The most common situation is illustrated by the decimal number 0.1. Although it has a finite decimal representation, in binary it has an infinite repeating representation. Thus when β = 2, the number 0.1 lies strictly between two floating-point numbers and is exactly represented by neither of them (Cleve Moler). Floating-point representations...
