Preliminaries
Published in Subrata Ray, Fortran 2018 with Parallel Programming, 2019
Double precision quantities are real numbers that are more precise than their single precision counterparts. A computer works internally with binary numbers, and not every decimal number has an exact binary representation. For example, when 0.1 is converted to binary, the stored value is not exactly 0.1 but something like 0.09999…. A computer is a finite-bit (binary digit) machine, and the number of bits used determines how close the binary number comes to its decimal counterpart; with infinite-precision arithmetic the two would be identical. Increasing the number of bits used to store a real number brings the binary value closer to the decimal value. Double precision numbers occupy more memory than single precision numbers and consume more central processing unit (CPU) time for any arithmetic operation.
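As a concrete illustration of the point above, the short C program below (a minimal sketch, separate from the Fortran text being quoted) stores 0.1 in single and in double precision and prints both with extra digits; the exact digits depend on the platform, but on IEEE-754 hardware the double-precision value is much closer to 0.1.

    /* 0.1 has no exact binary representation; the error shrinks with more bits. */
    #include <stdio.h>

    int main(void)
    {
        float  s = 0.1f;   /* single precision: 24-bit significand */
        double d = 0.1;    /* double precision: 53-bit significand */

        printf("single: %.20f\n", s);   /* e.g. 0.10000000149011611938 */
        printf("double: %.20f\n", d);   /* e.g. 0.10000000000000000555 */
        return 0;
    }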
Floating-Point Computations with Very-High-Speed Integrated Circuit Hardware Description Language and Xilinx System Generator (SysGen) Tools
Published in A. Arockia Bazil Raj, FPGA-Based Embedded System Developer's Guide, 2018
The double-precision floating-point number system provides more digits to the right of the binary point than the single-precision system. The term double-precision is something of a misnomer, because the precision is not literally doubled; rather, the double-precision number system uses twice as many bits as the single-precision floating-point number system. The single-precision format requires 32 bits, and double that, 64 bits, is required for the double-precision format. The additional 32 bits increase not only the precision but also the range of magnitudes that can be represented. Double-precision floating-point format is a computer number format that occupies 8 bytes (64 bits) in computer memory and represents a wide dynamic range of values by using a floating radix point [127–131].
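The 64-bit layout referred to here (1 sign bit, 11 exponent bits and 52 fraction bits in the IEEE-754 binary64 format) can be inspected directly. The C sketch below copies a double into a 64-bit integer and masks out the three fields; it assumes the platform stores doubles in binary64, which holds on essentially all current hardware although it is not guaranteed by the C language itself.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    int main(void)
    {
        double x = -6.25;                        /* -6.25 = -1.1001b x 2^2 */
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);          /* reinterpret the 64 bits */

        uint64_t sign     = bits >> 63;                  /*  1 bit           */
        uint64_t exponent = (bits >> 52) & 0x7FF;        /* 11 bits, bias 1023: prints 1025 */
        uint64_t fraction = bits & 0xFFFFFFFFFFFFFULL;   /* 52 bits: prints 0x9000000000000 */

        printf("sign=%llu exponent=%llu fraction=0x%llx\n",
               (unsigned long long)sign,
               (unsigned long long)exponent,
               (unsigned long long)fraction);
        return 0;
    }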
Quest for Energy Efficiency in Digital Signal Processing
Published in Tomasz Wojcicki, Krzysztof Iniewski, VLSI: Circuits for Emerging Applications, 2017
Ramakrishnan Venkatasubramanian
On the other hand, floating-point representations offer a wider dynamic range through a scientific-notation-like encoding that uses a mantissa and an exponent. Floating-point representation was standardized by the IEEE Standard for Floating-Point Arithmetic, IEEE 754, a technical standard for floating-point computation established in 1985. The latest version, IEEE 754-2008, published in August 2008, extends the original IEEE 754-1985 standard and the IEEE Standard for Radix-Independent Floating-Point Arithmetic, IEEE 854-1987. The single-precision floating-point format occupies 4 bytes (32 bits) and represents a wide dynamic range of floating-point values; the double-precision format occupies 8 bytes (64 bits) and represents an even wider dynamic range.
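To make the difference in dynamic range concrete, the following C snippet (a sketch relying only on the standard <float.h> limits) prints the decimal precision and the smallest and largest positive normalized values of the 32-bit and 64-bit formats on the compiling platform.

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        /* 32-bit single precision (IEEE 754 binary32) */
        printf("float : %2d significant decimal digits, range ~%e .. %e\n",
               FLT_DIG, FLT_MIN, FLT_MAX);

        /* 64-bit double precision (IEEE 754 binary64) */
        printf("double: %2d significant decimal digits, range ~%e .. %e\n",
               DBL_DIG, DBL_MIN, DBL_MAX);
        return 0;
    }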
A unified reconfigurable CORDIC processor for floating-point arithmetic
Published in International Journal of Electronics, 2020
Linlin Fang, Bingyi Li, Yizhuang Xie, He Chen, Long Pang
This module mainly performs the conversion from floating-point to fixed-point numbers and expands the ROC. In this paper, we adopt the IEEE-754 standard single-precision floating-point data format (IEEE Std 754 2008). The input data can be represented as x = (−1)^S × 1.M × 2^(E−127), where S is the 1-bit sign, E the 8-bit biased exponent and M the 23-bit mantissa.
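The paper's own conversion module is not reproduced here; purely as an illustration of the general idea, the C sketch below converts an IEEE-754 single-precision value into a hypothetical signed Q16.16 fixed-point format. The format, the function name and the rounding choice are assumptions for illustration, not the authors' design.

    #include <stdio.h>
    #include <stdint.h>
    #include <math.h>

    /* Hypothetical Q16.16 fixed-point format: 16 integer bits, 16 fraction bits. */
    static int32_t float_to_q16_16(float x)
    {
        /* scale by 2^16 and round to nearest; a hardware module would instead
           shift the mantissa according to the biased exponent field */
        return (int32_t)lrintf(x * 65536.0f);
    }

    int main(void)
    {
        float x = -3.14159f;
        int32_t q = float_to_q16_16(x);
        printf("%f -> Q16.16 value 0x%08X (%f)\n", x, (unsigned)q, q / 65536.0);
        return 0;
    }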
New intelligent optimization framework
Published in Automatika, 2018
Secondly, consider the transcoding, calculation and accuracy of floating-point numbers in algorithm design and testing. At present, most CPUs/GPUs comply with IEEE Standard 754 and support the SSE2 instruction set, and they form the hardware foundation for algorithm design and testing. However, under the floating-point storage format specified by IEEE Standard 754, some decimal fractions cannot be represented exactly with the 0/1 binary code used on the CPU/GPU. If this situation is not handled properly, accuracy errors accumulate during transcoding and calculation. In short, the precision of floating-point numbers, and of their transcoding and calculation, is largely determined by the exponent range. Taking the 32-bit single-precision floating-point number as an example, the exponent cannot fall below −(2^7 − 1) or rise above 2^8 − 1 (excluding the special cases in which the exponent field is all 0s or all 1s). Outside this exponent range the trailing digits of the value must be discarded, producing overflow or underflow. Sometimes a very small value can only be treated by the CPU/GPU as 0, because it cannot be represented as a normalized value. If low-precision floating-point numbers are discarded many times, the accuracy error grows worse; a divisor may even become 0 and raise an exception. Handling such values as denormalized (subnormal) numbers can remedy the loss of accuracy, but the efficiency of transcoding and computation declines sharply, because greater demands are placed on the CPU/GPU and the compiler. Therefore, the design and testing of IOAs must consider not only the characteristics of the problem domain but also the CPU/GPU hardware and the compiler used in the calculation. Finally, when the calculation results are returned, the decoding problem within the algorithm itself, that is, the transcoding between binary machine code and floating-point numbers, must also be considered; this in turn involves optimization of the algorithm's code.
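The underflow behaviour described above can be observed directly. The short C program below (a sketch assuming the usual IEEE-754 single-precision format with gradual underflow enabled) repeatedly halves the smallest normalized float: the value first becomes subnormal, losing precision bit by bit, and finally flushes to exactly 0, the case in which a divisor can silently turn into zero.

    #include <stdio.h>
    #include <float.h>

    int main(void)
    {
        float x = FLT_MIN;            /* smallest normalized single-precision value */
        for (int i = 0; i < 26 && x != 0.0f; ++i) {
            printf("x = %e%s\n", x, (x < FLT_MIN) ? "  (subnormal)" : "");
            x /= 2.0f;                /* below FLT_MIN the value becomes subnormal,
                                         then eventually rounds to 0 */
        }
        printf("x = %e  (underflowed to zero)\n", x);
        return 0;
    }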