Analog functions using Allen-Bradley’s RSLogix software
Published in Raymond F. Gardner, Introduction to Plant Automation and Controls, 2020
The RSLogix scaling instructions are used to convert the analog value into either an integer that is stored at an N7 Integer Data File address, or into a floating-point (real) number that is stored at an F8 Float Data File address. Where possible, integer values are preferred for conserving memory and increasing PLC scan speed. These benefits occur because an integer consumes only one 16-bit word of memory, while a floating-point number uses two or more 16-bit words. When multiplication and division instructions are used, F8 Float Data File outputs are required to retain the significant digits beyond the decimal point, avoiding the large rounding errors that arise when numbers are truncated into integers. RSLogix uses 32 bits for floating-point numbers, where bits 0–22 form the mantissa of the number, bits 23–30 the biased exponent, and bit 31 stores whether the number is positive or negative. Essentially, the floating-point number is scientific notation having about seven significant decimal digits, and the decimal point floats relative to a power of ten ranging from 10⁻³⁸ to 10⁺³⁸. The ranges of numerical data types are summarized in Table 15.4.
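As a minimal illustration of this 32-bit layout (a Python sketch of ours, not taken from the RSLogix documentation; the function name and test value are illustrative), the sign, exponent, and mantissa fields can be pulled apart with shifts and bit masks:

```python
import struct

def decompose_binary32(x):
    """Split a 32-bit float into its sign, exponent, and mantissa fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign     = bits >> 31            # bit 31: 0 = positive, 1 = negative
    exponent = (bits >> 23) & 0xFF   # bits 23-30: biased exponent (bias = 127)
    mantissa = bits & 0x7FFFFF       # bits 0-22: fraction (implicit leading 1)
    return sign, exponent, mantissa

sign, exp, frac = decompose_binary32(-118.625)
print(sign, exp - 127, hex(frac))    # 1 6 0x6d4000
```

Here −118.625 decomposes into sign 1, unbiased exponent 6, and mantissa bits 0x6D4000, i.e. −1.110110101₂ × 2⁶.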
Multiplier Design Based on DBNS
Published in Vassil Dimitrov, Graham Jullien, Roberto Muscedere, Multiple-Base Number System, 2017
Vassil Dimitrov, Graham Jullien, Roberto Muscedere
One particularly interesting application is the possibility of using our multiplier in floating point operations. The floating point number systems used in practice typically represent numbers as described in the IEEE Standard for Floating-Point Arithmetic (IEEE 754) [41], which includes 32-bit (single precision), 64-bit (double precision), and 128-bit (quadruple precision) versions. In all of them, one bit signifies the sign. The exponent is represented with 8, 11, or 15 bits, and the fraction is given by 23, 52, or 112 bits for single, double, and quadruple precision, respectively [41]. A floating point multiplication requires a multiplication of the fractions, e.g., a 52 × 52-bit multiplication for double precision, and consequently a floating point processor must have support for multiplications with large operands. Clearly, the widths used, at least in the double and quadruple precision formats, exceed the threshold beyond which our multipliers become superior, and could therefore benefit from the results presented in this chapter.
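To make the role of this wide fraction multiplication concrete, the following sketch (our own Python illustration, not the DBNS multiplier of this chapter; the function name is ours) multiplies two binary64 values by multiplying their integer significands and adding exponents. With the implicit leading bit, the significands are 53 bits wide, so this is essentially the 53 × 53-bit integer product a hardware unit must compute:

```python
import math

def fp_multiply(a, b):
    """Sketch: multiply two binary64 values via their integer significands,
    mirroring the wide integer multiplication inside a hardware FPU."""
    ma, ea = math.frexp(a)              # a = ma * 2**ea with 0.5 <= |ma| < 1
    mb, eb = math.frexp(b)
    ia = int(ma * 2 ** 53)              # exact: 53-bit integer significand
    ib = int(mb * 2 ** 53)
    prod = ia * ib                      # the 53 x 53-bit integer product
    return math.ldexp(prod, ea + eb - 106)  # renormalize, round to binary64

print(fp_multiply(1.5, 2.5), 1.5 * 2.5)     # 3.75 3.75
```

The final ldexp call performs the renormalization and rounding step that follows the wide integer multiply in hardware.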
Introduction
Published in Randall L. Eubank, Ana Kupresanin, Statistical Computing in C++ and R, 2011
Randall L. Eubank, Ana Kupresanin
At this point it must be realized that a general real number cannot be stored in its entirety and, as a result, in most cases the stored value will represent only an approximation to the truth. Errors are created in computer arithmetic with real numbers due both to the rounding of numbers for storage and to their further manipulation. These issues will be discussed in the next section. For the present it suffices to recognize that there is a limit to the precision that can be achieved from any computer representation that might be employed for irrational numbers. We will express the precision by the number of significant digits of agreement between the true value of a number and its floating-point representation. A good storage system is one that attempts to minimize losses in precision subject to the constraints that have been imposed on the allowed amount of storage.
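A quick way to see that the stored value is "only an approximation to the truth" is to print the exact binary value behind a simple decimal constant (a Python illustration of ours, assuming IEEE 754 double precision):

```python
from decimal import Decimal

# Decimal(float) expands the stored binary64 value exactly, with no rounding.
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625
```

The stored number agrees with the true value 0.1 to about 17 significant digits, which is the precision limit of the binary64 representation.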
Accuracy Improvements for Single Precision Implementations of the SPH Method
Published in International Journal of Computational Fluid Dynamics, 2020
Elie Saikali, Giuseppe Bilotta, Alexis Hérault, Vito Zago
The width of the gap between 1 and the next representable floating-point number is called the machine epsilon, and we denote it by ε, which in the single-precision binary floating-point standard is ε = 2⁻²³ ≈ 1.19 × 10⁻⁷ (other authors take machine epsilon to be the upper bound of the relative error that occurs when rounding the exact result of an operation to the nearest representable value, which is exactly half of our definition). Equivalently, this is the relative value of the least significant bit of the representation of a number to the number itself, and its significance can be illustrated by remarking that if a, b are two non-zero representable numbers such that |b| < (ε/2)|a|, then a ⊕ b = a (where ⊕ is the result of the addition of the two numbers, rounded to the nearest representable value).
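This absorption effect is easy to reproduce in single precision (a small check of ours using NumPy's float32 type; the variable names are illustrative):

```python
import numpy as np

a   = np.float32(1.0)
eps = np.float32(2.0 ** -23)             # gap between 1 and the next float32

print(a + eps > a)                       # True: adding the full gap is visible
print(a + np.float32(2.0 ** -25) == a)   # True: b < (eps/2)*a is absorbed
```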