Introduction to Operations Research
Published in Operations Research, 2018
Michael W. Carter, Camille C. Price, Ghaith Rabadi
Thus, a matrix multiplication algorithm is O(n³) because the process may take n³ steps, although the algorithm could be programmed to look for special input forms that, in certain cases, permit completion of the task in fewer than n³ steps. Some algorithms operate in such a way that their worst-case performance is also their best case; the performance of such algorithms does not vary with the nature of the data, but it does, of course, vary with problem size.
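The O(n³) step count can be made concrete with a minimal Python sketch of the classical triple-loop algorithm, instrumented with a step counter (the counter and function name are illustrative, not from the text):

```python
def matmul_counted(A, B):
    """Classical matrix multiplication for square n x n matrices.

    Returns the product C and the number of scalar multiply-add
    steps performed, which is exactly n * n * n for this algorithm.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    steps = 0
    for i in range(n):          # each row of A
        for j in range(n):      # each column of B
            for k in range(n):  # inner product of length n
                C[i][j] += A[i][k] * B[k][j]
                steps += 1
    return C, steps
```

For n = 3 the counter reports 27 = 3³ steps regardless of the matrix contents, illustrating an algorithm whose worst case and best case coincide unless special input structure is detected and exploited.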
On extending and optimising the direct product decomposition
Published in Molecular Physics, 2019
The absolute per-core performance with full D2h symmetry is summarised for the larger cc-pVQZ basis in Figure 3. From this figure we can see that for both algorithms the per-core performance drops as more and more cores are utilised. The ratio of the parallel performance to the single-core performance is denoted the parallel efficiency, and for the ‘old’ rearrangement-based algorithm the parallel efficiency at 18 cores ranges from 15% (D2h) to 25% (Cs). The ‘new’ native algorithm achieves 40–50% parallel efficiency, so that despite a slight performance disadvantage on one core, its parallel performance is greatly improved. This supports the contention that the new algorithm reduces data-movement overhead, since memory bandwidth becomes increasingly constrained as more cores are used. The ca. 10% drop in performance for the native algorithm on one core can be attributed in part to the lower degree of fine-tuning in the inner TBLIS kernels compared to MKL. Additionally, both the native tensor contraction algorithm and the underlying matrix multiplication algorithm in TBLIS admit further micro-optimisations that could be implemented to boost performance.
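The parallel-efficiency metric used above is simple to state in code. A minimal sketch, with illustrative numbers chosen only to mirror the quoted 40% figure (they are not measurements from the paper):

```python
def parallel_efficiency(single_core_perf, per_core_perf_at_p):
    """Parallel efficiency as defined in the text: the ratio of
    per-core performance on p cores to single-core performance.

    Both arguments are in the same units (e.g. GFLOP/s per core).
    """
    return per_core_perf_at_p / single_core_perf

# Hypothetical values: 10.0 GFLOP/s on one core, 4.0 GFLOP/s per
# core when running on 18 cores, giving 40% parallel efficiency.
eff = parallel_efficiency(10.0, 4.0)
```

An efficiency below 100% reflects contention for shared resources, here chiefly memory bandwidth, which is why an algorithm that moves less data scales better even if it is slightly slower on a single core.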
Power-Aware Characteristics of Matrix Operations on Multicores
Published in Applied Artificial Intelligence, 2021
Guruprasad Konnurmath, Satyadhyan Chickerur
This dense matrix multiplication (MatMul) application is systematically optimized to use the maximum computational power of the GPU. To take advantage of the GPU's coalesced global memory accesses and faster local memory, a blocked version of the matrix multiplication algorithm is adopted. The rate at which instructions are issued (Mike and Huang 2017) by the GPU kernel is the major bottleneck to be handled. MatMul exhibits regular memory access and heavy parallel computation, and its O(n) data reuse makes it one of the best candidates for fast GPU implementation.
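The blocking (tiling) idea can be sketched in plain Python: the three loops are split into tile-sized chunks so that each tile of C is updated from small tiles of A and B that fit in fast local memory. This is a serial sketch of the access pattern only (the tile size and function name are illustrative); on a GPU each tile update would be a thread block staging its tiles in shared memory:

```python
def blocked_matmul(A, B, tile=2):
    """Tiled matrix multiplication for square n x n matrices.

    The loop order brings one tile of A and one tile of B into the
    working set at a time, each reused across a whole tile of C --
    the data-reuse pattern that blocked GPU kernels exploit.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                # Update the (ii, jj) tile of C from the (ii, kk)
                # tile of A and the (kk, jj) tile of B.
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        for k in range(kk, min(kk + tile, n)):
                            C[i][j] += A[i][k] * B[k][j]
    return C
```

The result is identical to the unblocked algorithm; only the traversal order changes, trading no extra arithmetic for much better locality.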
High-Performance 3D Mesh-Based NOC Architecture Using Node-Layer Clustering
Published in IETE Journal of Research, 2023
Navid Habibi, M. Reza Salehnamadi, Ahmad Khademzadeh
In this paper, a new 3D NOC architecture is proposed to improve the main NOC metrics, latency, power/energy consumption, and throughput of the network, based on a node-layer clustering algorithm (NLCA) and the De Bruijn graph (DBG). A new deadlock-free routing scheme is also proposed for this 3D topology and is shown to outperform its counterparts. The latest version of the Scalable Universal Matrix Multiplication Algorithm (SUMMA) is further deployed on the proposed architecture, and its cost is verified against its counterparts.
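SUMMA computes C = A·B as a sum of panel outer products: at each step, a column panel of A and the matching row panel of B are broadcast across the process grid and every process updates its local block. A minimal serial sketch of this structure (no actual communication; the panel width and function name are illustrative, and the comments mark where broadcasts would occur on real interconnect such as the proposed NOC):

```python
def summa_serial(A, B, panel=1):
    """Serial sketch of the SUMMA panel-outer-product schedule
    for square n x n matrices.

    In the distributed algorithm, each iteration broadcasts a
    column panel of A along process-grid rows and a row panel of
    B along process-grid columns; here the 'broadcast' is just a
    loop over the shared arrays.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for kk in range(0, n, panel):
        # "Broadcast" columns kk..kk+panel of A and rows
        # kk..kk+panel of B, then accumulate the outer product
        # into every block of C.
        for i in range(n):
            for j in range(n):
                for k in range(kk, min(kk + panel, n)):
                    C[i][j] += A[i][k] * B[k][j]
    return C
```

Because each step touches every block of C, the broadcasts dominate SUMMA's communication cost, which is why the topology and routing of the underlying network directly determine its performance.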