Portable Software Technology
Published in David R. Martinez, Robert A. Bond, M. Michael Vai, High Performance Embedded Computing Handbook, 2018
The tuning technique in LAPACK requires the user to pick parameters appropriately. A more automatic method is provided by the Automatically Tuned Linear Algebra Software, or ATLAS. This is a set of routines that generate and measure the performance of different implementations of the BLAS on a given machine. The results are used to select the optimal version of the BLAS and of some LAPACK routines for that machine (Demmel et al. 2005). A similar technique is used to optimize the FFT by the software package known as the "fastest Fourier transform in the West," or FFTW, and to optimize general digital signal processing transforms by a package known as SPIRAL (Frigo and Johnson 2005; Püschel et al. 2005). In all of these approaches, a special version of the library, tuned for performance on the platform of interest, is generated and used.
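The empirical-tuning idea behind ATLAS, FFTW, and SPIRAL can be illustrated with a minimal sketch (this is not the actual ATLAS generator): produce several candidate implementations of a kernel, time each one on the target machine, and keep the fastest. Here the candidates are blocked matrix multiplies that differ only in block size; the `matmul_blocked` and `autotune` names are illustrative, not part of any library.

```python
# Minimal sketch of empirical autotuning: time candidate kernels on
# this machine and select the best-performing variant.
import timeit

import numpy as np

def matmul_blocked(A, B, bs):
    """Illustrative candidate kernel: blocked matrix multiply with block size bs."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, bs):
        for j in range(0, n, bs):
            for k in range(0, n, bs):
                # Multiply-accumulate one block of C from blocks of A and B.
                C[i:i+bs, j:j+bs] += A[i:i+bs, k:k+bs] @ B[k:k+bs, j:j+bs]
    return C

def autotune(n=256, block_sizes=(16, 32, 64, 128)):
    """Time each candidate block size on random data; return the fastest."""
    rng = np.random.default_rng(0)
    A, B = rng.random((n, n)), rng.random((n, n))
    timings = {bs: timeit.timeit(lambda: matmul_blocked(A, B, bs), number=3)
               for bs in block_sizes}
    best = min(timings, key=timings.get)
    return best, timings

if __name__ == "__main__":
    best, timings = autotune()
    print("selected block size:", best)
```

In a real autotuner the candidates differ in far more than one parameter (loop order, unrolling, prefetching, register blocking), and the selected variant is compiled into the installed library rather than chosen at run time.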
Open Source Libraries
Published in Federico Milano, Ioannis Dassios, Muyang Liu, Georgios Tzounas, Eigenvalue Problems in Power Systems, 2020
Dependencies: A large part of the computations required by the routines of LAPACK are performed by calling the BLAS (Basic Linear Algebra Subprograms) [90]. In general, BLAS functionality is classified into three levels: Level 1 defines routines that carry out simple vector operations; Level 2 defines routines that carry out matrix-vector operations; and Level 3 defines routines that carry out general matrix-matrix operations. Modern optimized BLAS libraries, such as ATLAS (Automatically Tuned Linear Algebra Software) [33] and Intel MKL (Math Kernel Library), typically support all three levels for both real and complex data types.
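The three levels can be sketched with SciPy's low-level BLAS wrappers (`scipy.linalg.blas`), which expose the classical double-precision routine names; this assumes SciPy is installed and is only meant to show one representative routine per level.

```python
# One representative routine per BLAS level, via SciPy's BLAS wrappers.
import numpy as np
from scipy.linalg import blas

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
A = np.arange(9.0).reshape(3, 3)
B = np.eye(3)

# Level 1 (vector-vector): DAXPY computes a*x + y.
z1 = blas.daxpy(x, y, a=2.0)

# Level 2 (matrix-vector): DGEMV computes alpha*A@x (+ beta*y).
z2 = blas.dgemv(1.0, A, x)

# Level 3 (matrix-matrix): DGEMM computes alpha*A@B (+ beta*C).
z3 = blas.dgemm(1.0, A, B)

print(z1, z2, z3, sep="\n")
```

The level matters for performance: Level 3 routines perform O(n³) work on O(n²) data, so they can exploit cache blocking far better than Levels 1 and 2, which is why optimized libraries (and LAPACK itself) push as much work as possible into Level 3 calls.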
Fast and Energy-Efficient Block Ciphers Implementations in ARM Processors and Mali GPU
Published in IETE Journal of Research, 2022
W. K. Lee, Raphael C.-W. Phan, B. M. Goi
Iakymchuk and Trahay [13] developed EZTrace, a performance analysis framework for parallel applications. It was tested on clusters of ARM processors with High-Performance LINPACK (HPL) and the Basic Linear Algebra Subprograms (BLAS) from the Automatically Tuned Linear Algebra Software (ATLAS) library. However, they did not focus on optimizing computing performance on ARM processors and embedded GPUs. Bernstein and Schwabe [14] presented their work on optimizing cryptographic algorithms on ARM processors using NEON SIMD instructions. A few research works have also discussed the feasibility of replacing x86 processors in HPC data centers with more power-efficient ARM processors [15–17]. Current research on accelerating algorithms with embedded SoCs focuses only on the multi-core processors and the supported SIMD instructions. The potential of the embedded SoC as a high-performance generic computing device can be fully unleashed only if we look into ways to harvest the processing power of embedded GPUs. Davis et al. [18] utilized RenderScript to accelerate the Blowfish block cipher on various GPU-equipped mobile devices, which is closely related to our work. However, they did not evaluate the performance and energy efficiency of lightweight block ciphers on these platforms.