Portable Software Technology
Published in David R. Martinez, Robert A. Bond, M. Michael Vai, High Performance Embedded Computing Handbook, 2018
Portable math library interfaces have existed for some time. The basic linear algebra subroutines (BLAS), which are used by the LAPACK package, are perhaps the oldest example in current use (Anderson et al. 1992). Originally, the BLAS were a set of simple function calls for performing vector operations such as vector element-wise add and multiply operations, plane rotations, and vector dot-products. These vector routines are referred to as level-1 BLAS routines. Level-2 BLAS were later defined that extended the interface to matrix-vector operations, including matrix-vector multiply, rank-one updates to a matrix, and solution of triangular systems with a vector right-hand side. Level-3 BLAS include matrix-matrix operations, both element-wise and product operations, and extensions to the level-2 BLAS, including generalized updates and solution of triangular systems with multiple right-hand sides. The major motivation for the level-2 and level-3 BLAS is to aggregate operations so that the number of function calls required to implement a given function is greatly reduced.
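As a rough illustration of the three levels, the sketch below (not taken from the handbook) calls one representative routine from each level through the CBLAS C interface. The build command and the choice of OpenBLAS are assumptions; any conforming BLAS implementation exposing cblas.h would serve.

/* Representative Level-1, Level-2, and Level-3 BLAS calls via CBLAS.
   Assumed build: cc blas_levels.c -lopenblas */
#include <stdio.h>
#include <cblas.h>

int main(void) {
    double x[3] = {1.0, 2.0, 3.0};
    double y[3] = {4.0, 5.0, 6.0};
    double A[9] = {1, 0, 0,  0, 1, 0,  0, 0, 1};  /* 3x3 identity, column-major */
    double B[9] = {2, 0, 0,  0, 2, 0,  0, 0, 2};  /* 3x3 scaled identity         */
    double C[9] = {0};

    /* Level 1: vector operations, y := 2*x + y, then a dot product. */
    cblas_daxpy(3, 2.0, x, 1, y, 1);
    double d = cblas_ddot(3, x, 1, y, 1);

    /* Level 2: matrix-vector multiply, y := A*x. */
    cblas_dgemv(CblasColMajor, CblasNoTrans, 3, 3, 1.0, A, 3, x, 1, 0.0, y, 1);

    /* Level 3: matrix-matrix multiply, C := A*B. */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                3, 3, 3, 1.0, A, 3, B, 3, 0.0, C, 3);

    printf("dot = %f, C[0][0] = %f\n", d, C[0]);
    return 0;
}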
Open Source Libraries
Published in Federico Milano, Ioannis Dassios, Muyang Liu, Georgios Tzounas, Eigenvalue Problems in Power Systems, 2020
Dependencies: A large part of the computations required by the routines of LAPACK is performed by calling the BLAS (Basic Linear Algebra Subprograms) [90]. In general, BLAS functionality is classified in three levels. Level 1 defines routines that carry out simple vector operations; Level 2 defines routines that carry out matrix-vector operations; and Level 3 defines routines that carry out general matrix-matrix operations. Modern optimized BLAS libraries, such as ATLAS (Automatically Tuned Linear Algebra Software) [33] and Intel MKL (Math Kernel Library), typically support all three levels for both real and complex data types.
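To make this dependency concrete, the following hedged sketch (not from the book) calls the LAPACK driver routine dgesv through the LAPACKE C interface; the blocked LU factorization it performs inside delegates most of the arithmetic to Level-3 BLAS routines of whichever optimized library is linked. The build line and the choice of libraries are assumptions.

/* Solving A x = b with LAPACKE_dgesv; the heavy lifting happens in the
   linked BLAS.  Assumed build: cc solve.c -llapacke -lopenblas */
#include <stdio.h>
#include <lapacke.h>

int main(void) {
    /* Small 3x3 system, column-major storage. */
    double A[9] = {4, 1, 0,   1, 4, 1,   0, 1, 4};
    double b[3] = {1, 2, 3};
    lapack_int ipiv[3];

    lapack_int info = LAPACKE_dgesv(LAPACK_COL_MAJOR, 3, 1, A, 3, ipiv, b, 3);
    if (info != 0) {
        fprintf(stderr, "dgesv failed, info = %d\n", (int)info);
        return 1;
    }
    printf("x = %f %f %f\n", b[0], b[1], b[2]);
    return 0;
}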
BLAS
Published in Leslie Hogben, Richard Brualdi, Anne Greenbaum, Roy Mathias, Handbook of Linear Algebra, 2006
Jack Dongarra, Victor Eijkhout, Julien Langou
To a great extent, the user community has embraced the BLAS, not only for performance reasons, but also because developing software around a core of common routines like the BLAS is good software engineering practice. Highly efficient, machine-specific implementations of the BLAS are available for most modern high-performance computers. To obtain an up-to-date list of available optimized BLAS, see the BLAS FAQ at http://www.netlib.org/blas.
Enhancing parallelism of distributed algorithms with the actor model and a smart data movement technique
Published in International Journal of Parallel, Emergent and Distributed Systems, 2021
Anatoliy Doroshenko, Eugene Tulika, Olena Yatsenko
The BLAS library is used effectively to work with separate blocks of the matrix within a single node with shared memory. BLAS is a set of matrix operations implemented in Fortran and C and optimised for specific hardware; optimised BLAS implementations are among the fastest means of working with dense matrices, and the interface is accessible from various programming languages. Most of the LAPACK and ScaLAPACK algorithms consist of a few fundamental operations, which are implemented using Level-2 or Level-3 BLAS routines. BLAS performance depends on the size of the matrix, and the best performance is reached with large block sizes. Using BLAS/LAPACK, the calculation steps are implemented with the following operations: DPOTF2 calculates the Cholesky factorisation of the diagonal element; DTRSM calculates the column according to formula (1); DSYRK updates the rest of the matrix by formula (2).
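These three steps map onto the classic blocked right-looking Cholesky loop. The C sketch below is an illustration under stated assumptions (column-major storage, lower-triangular factor, CBLAS/LAPACKE interfaces), not the authors' code; it uses DPOTRF for the diagonal block, the blocked counterpart of the DPOTF2 routine named in the excerpt.

/* Blocked Cholesky factorisation of the lower part of an n x n SPD matrix A,
   in place, with block size nb.  Assumed build: cc chol.c -llapacke -lopenblas */
#include <stddef.h>
#include <cblas.h>
#include <lapacke.h>

int blocked_cholesky(double *A, int n, int lda, int nb) {
    for (int k = 0; k < n; k += nb) {
        int kb = (n - k < nb) ? n - k : nb;   /* size of the current block      */
        int m  = n - k - kb;                  /* rows below the diagonal block  */

        /* 1. Cholesky factorisation of the diagonal block A[k:k+kb, k:k+kb]. */
        int info = LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', kb,
                                  &A[k + (size_t)k * lda], lda);
        if (info != 0) return info;

        if (m > 0) {
            /* 2. Panel update, formula (1): L21 := A21 * L11^{-T},
                  a triangular solve applied from the right (DTRSM). */
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, m, kb, 1.0,
                        &A[k + (size_t)k * lda], lda,
                        &A[(k + kb) + (size_t)k * lda], lda);

            /* 3. Trailing update, formula (2): A22 := A22 - L21 * L21^T (DSYRK). */
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans, m, kb,
                        -1.0, &A[(k + kb) + (size_t)k * lda], lda,
                        1.0,  &A[(k + kb) + (size_t)(k + kb) * lda], lda);
        }
    }
    return 0;
}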
GPU parameter tuning for tall and skinny dense linear least squares problems
Published in Optimization Methods and Software, 2020
Benjamin Sauk, Nikolaos Ploskas, Nikolaos Sahinidis
We conducted experiments on a workstation running CentOS 7 with two Intel Xeon E5-2660 v3 processors at 2.6 GHz and 128 GB of RAM. The workstation is equipped with an NVIDIA Tesla K40 GPU, which has 15 streaming multiprocessors with 192 CUDA cores each, 12 GB of RAM, and a peak memory bandwidth of 288 GB/s. The algorithms were compiled with GCC version 5.2 using the -O3 optimization flag, and with the NVCC CUDA 7.5 compiler where applicable. The matrices used in all of the experiments were randomly generated with elements drawn from a uniform distribution between 0 and 1. These matrices are sufficient for our purposes since the performance of the algorithms compared depends entirely on the number of floating-point operations performed, which is determined by the size of the matrix and not by its conditioning. Each matrix size was evaluated with ten different randomly generated matrices in double precision, and the average performances are reported. Performance was measured in terms of billions of floating-point operations per second (GFLOPs) for the QR factorization; the number of operations needed to solve the LLSPs from the computed factorization is negligible and is ignored. We conducted two comparative studies, one on square matrices and the other on TS matrices. Table 1 summarizes the solvers used in the computational experiments and their defining properties. BLAS refers to the basic linear algebra subprograms utilized, i.e., the routines that perform basic vector and matrix operations [14]. LAPACK was linked against a multicore BLAS to allow comparison between LAPACK and the other parallel solvers. MAGMA was compiled with Intel MKL v16.0.3 as the BLAS used for CPU computations.
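As an indication of how such measurements are typically taken, the sketch below (not the authors' benchmark code) fills a tall-and-skinny matrix with uniform random entries, times LAPACKE_dgeqrf, and converts the standard Householder QR flop count 2n^2(m - n/3) into GFLOPs. The matrix sizes, the timer, and the build line are assumptions.

/* Rough GFLOPs measurement for dense QR factorization (dgeqrf).
   Assumed build: cc qr_bench.c -llapacke -lopenblas -lm */
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <lapacke.h>

int main(void) {
    const int m = 4096, n = 512;               /* example tall-and-skinny sizes */
    double *A   = malloc((size_t)m * n * sizeof *A);
    double *tau = malloc((size_t)n * sizeof *tau);

    srand(0);
    for (size_t i = 0; i < (size_t)m * n; ++i)
        A[i] = (double)rand() / RAND_MAX;      /* uniform entries in [0, 1] */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    LAPACKE_dgeqrf(LAPACK_COL_MAJOR, m, n, A, m, tau);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double sec   = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    double flops = 2.0 * (double)n * n * ((double)m - n / 3.0);
    printf("dgeqrf: %.3f s, %.2f GFLOPs\n", sec, flops / sec / 1e9);

    free(A); free(tau);
    return 0;
}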