Multiplier Design Based on DBNS
Published in Multiple-Base Number System, 2017
Vassil Dimitrov, Graham Jullien, Roberto Muscedere
One particularly interesting application is the possibility of using our multiplier in floating point operations. The floating point number systems used in practice typically represent numbers as described in the IEEE Standard for Floating-Point Arithmetic (IEEE 754) [41], which includes 32-bit (single precision), 64-bit (double precision), and 128-bit (quadruple precision) versions. In all of them, one bit signifies the sign. The exponent is represented with 8, 11, or 15 bits and the fraction is given by 23, 52, or 112 bits for single, double, and quadruple precision, respectively [41]. A floating point multiplication requires a multiplication of the fractions, e.g., a 52 × 52-bit multiplication for double precision, and consequently a floating point processor must support multiplications with large operands. Clearly, the widths used, at least in the double and quadruple precision formats, exceed the threshold at which our multipliers become superior, and could therefore benefit from the results presented in this chapter.
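As a quick illustration of the field widths mentioned above (not from the chapter), the following minimal Python sketch unpacks a 64-bit double into its 1 sign bit, 11 exponent bits, and 52 fraction bits:

```python
import struct

def decode_double(x):
    """Split an IEEE 754 double into its 1 sign, 11 exponent, 52 fraction bits."""
    bits = struct.unpack('>Q', struct.pack('>d', x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF      # 11-bit biased exponent (bias 1023)
    fraction = bits & ((1 << 52) - 1)    # 52-bit fraction (hidden leading 1 not stored)
    return sign, exponent, fraction

sign, exp, frac = decode_double(-1.5)   # -1.5 = -1.1b * 2**0
print(sign, exp - 1023, hex(frac))      # 1 0 0x8000000000000
```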
Vector and Matrix Norms, Error Analysis, Efficiency, and Stability
Published in Handbook of Linear Algebra, 2006
Leslie Hogben, Richard Brualdi, Anne Greenbaum, Roy Mathias
Most scientific and engineering computations rely on floating point arithmetic. At this writing, the IEEE 754 standard for binary floating point arithmetic [IEEE754] and the IEEE 854 standard for radix-independent floating point arithmetic [IEEE854] are the most widely accepted standards for floating point arithmetic. The still-incomplete revised floating point arithmetic standard [IEEE754r] is planned to incorporate both [IEEE754] and [IEEE854], along with extensions, revisions, and clarifications. See [Ove01] for a textbook introduction to IEEE standard floating point arithmetic.
Advanced Signal Processing Resources in FPGAs
Published in FPGAs, 2017
Juan José Rodríguez Andina, Eduardo de la Torre Arnanz, María Dolores Valdés Peña
The IEEE Standard for Floating-Point Arithmetic (IEEE 754) is the most widely used standard in floating-point computation circuits (IEEE 2008). It defines data formats, operations, and exceptions (such as division by zero, asymptotic functions, overflow, or inputs/outputs producing undefined or unrepresentable numbers, called NaN, Not a Number). The two basic data formats in IEEE 754 are single (32-bit) and double (64-bit) precision. Any IEEE 754-compliant computing system must at least support single-precision operations.
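A small illustration (not from the chapter) of the exceptional values the standard defines; note that Python raises ZeroDivisionError for float division by zero rather than returning infinity, so the special values are constructed directly:

```python
import math

inf = float('inf')
nan = float('nan')

print(inf + 1.0)    # inf: arithmetic with infinity propagates infinity
print(inf - inf)    # nan: an undefined operation produces Not-a-Number
print(nan == nan)   # False: NaN compares unequal even to itself
print(math.isnan(inf - inf), math.isinf(2.0 * inf))  # True True
```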
Optimised Floating Point FFT Core for Improved OMP CS System
Published in International Journal of Electronics, 2022
Alahari Radhika, K. Satya Prasad, K. Kishan Rao
The systematic CSD (canonical signed digit)-driven arithmetic method is implemented with the traditional FFT radix index for mantissa multiplication in a single-precision fusion model. As shown in Table 1, the multiplier measurements and hardware-specific device fusions are analysed quantitatively. The timing metrics also illustrate the critical path delay reduction obtained from the LSB-truncated, shift-based aggregation of the twiddle factor multiplication. In general, performance degrades as the mantissa in the floating point arithmetic grows wider. Since only the MSB regions require a bit change, the number of sequentially dependent FPU operations decreases, and this decrease in FPU activity plays a crucial role in the overall reduction of the critical path, as shown in Table 2.
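For readers unfamiliar with CSD recoding, the following minimal Python sketch (an illustration, not the paper's implementation) recodes an integer constant into canonical signed digits. CSD guarantees no two adjacent nonzero digits, which reduces the number of shift-and-add/subtract terms, and hence the critical path, in a constant twiddle-factor multiplier:

```python
def csd(n):
    """Recode a positive integer into canonical signed digits {-1, 0, +1}.

    Digits are returned least-significant first, with no two adjacent
    nonzero digits, minimising the adder count of a constant multiplier.
    """
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)   # +1 if n % 4 == 1, -1 if n % 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

print(csd(7))      # [-1, 0, 0, 1]: 7 = 2**3 - 2**0, two terms instead of three
print(csd(23170))  # CSD form of a 16-bit twiddle constant, round(2**15 / sqrt(2))
```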
An efficient workflow for meshing large scale discrete fracture networks for solving subsurface flow problems
Published in Petroleum Science and Technology, 2022
Floating point arithmetic is the most difficult part to handle, especially in geometry operations. For example, two points whose coordinates differ only beyond the fifth decimal digit are considered the same point unless the tolerance is tightened to take more than five digits after the decimal point into account. In computational geometry we need to be tolerant of such cases, and the Tolerance class provides all the necessary features for this. The Tolerance class defines operations for floating point comparisons. An object of the class is constructed from two parameters, a tolerance value and a zero value. The user can set the flag useTol to use the tol variable for floating point comparisons; otherwise the zero variable is used. For comparing two variables, functions such as isGreater, isLess, and isEqual are implemented, and an isZero function is provided to check whether a variable is zero.
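The paper does not give the exact signatures, but a minimal Python sketch of such a Tolerance class, with assumed default values for tol and zero, might look as follows:

```python
class Tolerance:
    """Floating point comparisons with a configurable threshold.

    tol    : tolerance used when useTol is True   (default is an assumption)
    zero   : fallback threshold when useTol is False (default is an assumption)
    useTol : user-settable flag selecting which threshold applies
    """
    def __init__(self, tol=1e-5, zero=1e-12, useTol=True):
        self.tol = tol
        self.zero = zero
        self.useTol = useTol

    def _eps(self):
        return self.tol if self.useTol else self.zero

    def isEqual(self, a, b):
        return abs(a - b) <= self._eps()

    def isGreater(self, a, b):
        return a - b > self._eps()

    def isLess(self, a, b):
        return b - a > self._eps()

    def isZero(self, a):
        return abs(a) <= self._eps()

t = Tolerance(tol=1e-5)
print(t.isEqual(1.000001, 1.0000019))  # True: differs only past the 5th digit
```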
Iterative-method performance evaluation for multiple vectors associated with a large-scale sparse matrix
Published in International Journal of Computational Fluid Dynamics, 2016
Seigo Imamura, Kenji Ono, Mitsuo Yokokawa
Here, two types of implementations are presented for the SOR method in Lists 1 and 2, written in Fortran. One implementation reduces the load of the coefficient matrix in ensemble computing by reusing the loaded matrix in the innermost l loop, and the other is a naïve one that solves the linear systems sequentially. The former implementation is termed inner loop and the latter outer loop hereafter. In these lists, p, b, and bp denote the pressure (the solution vector x), the RHS vector b of the derived linear system Ax = b, and the coefficient matrix A, respectively. An active variable in the lists represents a mask function that activates or deactivates a cell. Furthermore, i, j, and k denote the space coordinates, and l denotes the index of the RHS vector. In order to compare the performance of the two implementations, we investigated their operational intensities on the SPARC64 VIIIfx. Outer loop has 30 floating-point arithmetic operations (flops), whereas inner loop has 8 + 23 × l flops, where l represents the trip count of the innermost loop.
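The paper's Lists 1 and 2 are Fortran codes over a 3-D stencil; the simplified Python sketch below (1-D, tridiagonal, hypothetical names) only illustrates the loop-ordering difference: the outer-loop version reloads the coefficients for every right-hand side, while the inner-loop version loads them once per grid point and reuses them across the whole ensemble.

```python
import numpy as np

def sor_outer(p, b, a, omega=1.5, sweeps=10):
    """Naive 'outer loop' ordering: solve each of the L systems sequentially.

    p : (L, N) ensemble of solution vectors
    b : (L, N) ensemble of right-hand sides
    a : (N, 3) tridiagonal coefficients (lower, diag, upper) shared by all l
    """
    L, N = p.shape
    for l in range(L):                  # ensemble index outermost
        for _ in range(sweeps):
            for i in range(1, N - 1):
                lo, di, up = a[i]       # coefficients reloaded for every l
                r = b[l, i] - lo * p[l, i - 1] - up * p[l, i + 1]
                p[l, i] += omega * (r / di - p[l, i])
    return p

def sor_inner(p, b, a, omega=1.5, sweeps=10):
    """'Inner loop' ordering: reuse the loaded coefficients across the ensemble."""
    L, N = p.shape
    for _ in range(sweeps):
        for i in range(1, N - 1):
            lo, di, up = a[i]           # loaded once per grid point...
            for l in range(L):          # ...and reused for all L systems
                r = b[l, i] - lo * p[l, i - 1] - up * p[l, i + 1]
                p[l, i] += omega * (r / di - p[l, i])
    return p
```

Moving the ensemble loop innermost raises the operational intensity because each loaded coefficient now serves L updates instead of one, which is the effect the flop counts above quantify.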