Numerical Methods Based on Time-Domain Approaches
Published in Keith Attenborough, Timothy Van Renterghem, Predicting Outdoor Sound, 2021
Keith Attenborough, Timothy Van Renterghem
While Section 4.2 does not describe all of the possible numerical options, it aims to provide an outline of a set of basic and easily implementable (discrete) equations capturing the major influences on outdoor sound propagation. Care is taken to include refraction effects along the sound propagation path. Although numerical efficiency is kept in mind during all steps of the development of this reference model, the volume-discretization technique required is not well suited to calculating sound propagation over large distances. While compiler optimization, programming on graphical processing units (GPUs), or the use of computing grids or clusters might help improve efficiency, the hybrid approaches discussed in Section 4.3 might also be needed to keep computation times within reason.
Cache and Memory
Published in Heqing Zhu, Data Plane Development Kit (DPDK), 2020
The LLC (L3) has a larger capacity and is shared by all cores. In certain cases the same memory is accessed by multiple cores, and conflicts occur when several cores read or write data in the same cache line. x86 provides a sophisticated cache-coherency mechanism, so software programmers can rely on the CPU without worrying about data contention and corruption in a multicore environment. Contention on a cache line is not free, however: if multiple cores access different data that happen to reside in the same cache line, the CPU invalidates the line and forces an update, which hurts performance. Because the cores are not actually accessing the same data, this sharing is unnecessary and is known as false sharing. The compiler can detect false sharing and will try to eliminate it during the optimization phase; if compiler optimization is disabled, no such attempt is made.
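To make the false-sharing remedy concrete, the following is a minimal C++ sketch (not DPDK code; it assumes a 64-byte cache line and a C++17 compiler) that keeps each per-thread counter on its own cache line via alignas, the kind of data-layout fix a programmer applies, or an optimizing compiler attempts, to avoid the invalidation traffic described above.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

// Assumption: a 64-byte cache line, which is typical for x86.
constexpr std::size_t kCacheLine = 64;

// Each per-thread counter is aligned (and therefore padded) to its own
// cache line, so writes from different threads never invalidate each
// other's cache line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
};

int main() {
    constexpr int kThreads = 4;
    std::vector<PaddedCounter> counters(kThreads);

    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t) {
        workers.emplace_back([&counters, t] {
            // Each thread touches only its own counter; with the
            // alignment above there is no false sharing between threads.
            for (long i = 0; i < 10'000'000; ++i)
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();

    long total = 0;
    for (auto& c : counters) total += c.value.load();
    std::printf("total = %ld\n", total);
    return 0;
}
```

Removing the alignas specifier packs the counters into adjacent memory, so neighbouring threads repeatedly invalidate each other's cache line even though they never touch the same data.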
Using Performance Metrics to Select Microprocessor Cores for IC Designs
Published in Luciano Lavagno, Igor L. Markov, Grant Martin, Louis K. Scheffer, Electronic Design Automation for IC System Design, Verification, and Testing, 2017
It is not even necessary to alter a compiler to produce wildly varying Dhrystone results on the same microprocessor. Using different compiler optimization settings can drastically alter the outcome of a benchmark test, even if the compiler has not been Dhrystone-optimized. For example, Bryan Fletcher of Memec taught a class titled “FPGA Embedded Processors: Revealing True System Performance” at the 2005 Embedded Systems Conference in San Francisco, in which he showed that the Xilinx MicroBlaze soft-core processor produced Dhrystone benchmark results differing by almost 9:1 in terms of DMIPS/MHz depending on configuration and compiler settings, as shown in Table 10.2 [11].
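As a small, hypothetical illustration (a toy loop, not Dhrystone code) of how strongly optimization settings alone can skew such numbers, the C++ snippet below times a trivial kernel: at -O0 every iteration is executed, while at -O2/-O3 the compiler may vectorize the loop or reduce it to a closed-form expression, and without the volatile sink it may delete the loop outright, so the reported "benchmark result" changes dramatically with flags alone.

```cpp
#include <chrono>
#include <cstdio>

// Toy benchmark kernel. Suggested comparison (command lines are examples):
//   g++ -O0 bench_demo.cpp -o bench_O0
//   g++ -O3 bench_demo.cpp -o bench_O3
static void benchmark_kernel(long iterations) {
    volatile long sink = 0;  // without 'volatile', -O2/-O3 may remove the loop entirely
    long acc = 0;
    for (long i = 0; i < iterations; ++i)
        acc += i * 3 + 1;    // simple arithmetic the optimizer can transform heavily
    sink = acc;
    (void)sink;
}

int main() {
    using clock = std::chrono::steady_clock;
    const long iterations = 100'000'000;

    auto t0 = clock::now();
    benchmark_kernel(iterations);
    auto t1 = clock::now();

    std::printf("elapsed: %.3f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());
    return 0;
}
```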
Efficient ray-tracing procedure for radio wave propagation modeling using homogeneous geometric algebra
Published in Electromagnetics, 2020
Ahmad H. Eid, Heba Y. Soliman, Sherif M. Abuelenin
In this work, we utilized an optimizing compiler called GMac (Eid 2016; GMac source code 2015) to generate efficient code from HGA expressions such as those in Table 3. The design of GMac combines concepts from geometric algebra (Dorst, Fontijne, and Mann 2007a), symbolic computation (Hazrat 2016), and compiler optimization (A V et al.) to exploit the sparsity of GA multivector representations of geometric primitives. The optimized target code that GMac generates can be in any modern programming language of choice, such as C++, Java, C#, or Python. GMac accepts geometric algebra formulations of geometric procedures written in a GA-based Domain-Specific Language (DSL). GMac is mature in terms of both stability and concepts (Charrier et al. 2014), and it was used earlier to develop a computer graphics ray-tracer (Eid 2016). Internally, GMac uses the Wolfram Mathematica Computer Algebra System (CAS) (Hazrat 2016) to derive a series of symbolic relations between the input and output scalar coefficients of the GM, taking the variable bindings into consideration. GMac then optimizes these symbolic relations by removing redundant intermediate computations, propagating constant values, reordering the relations to reduce the amount of computation required, and reusing intermediate temporary variables (A V et al.). Finally, GMac formats the optimized symbolic relations in the syntax of the target language.
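As a rough, hypothetical sketch (not actual GMac output) of what such sparsity-exploiting code generation looks like, consider the geometric product of two 3D vectors: a general 3D multivector has 8 coefficients, so a dense product would evaluate 8 × 8 term combinations, but when both operands are known to be grade-1 vectors only the scalar and bivector parts can be non-zero, and the emitted code collapses to a handful of flattened scalar expressions.

```cpp
#include <cstdio>

struct Vector3 { double e1, e2, e3; };               // grade-1 input
struct ScalarBivector { double s, e12, e13, e23; };  // sparse output (scalar + bivector)

// "Generated"-style sparse geometric product a b of two 3D vectors:
// the dense multivector loop has been unrolled, terms that are
// identically zero have been removed, and each output coefficient is
// written as one flattened scalar expression.
ScalarBivector geometric_product(const Vector3& a, const Vector3& b) {
    ScalarBivector r;
    r.s   = a.e1 * b.e1 + a.e2 * b.e2 + a.e3 * b.e3;  // inner-product part
    r.e12 = a.e1 * b.e2 - a.e2 * b.e1;                // outer-product part
    r.e13 = a.e1 * b.e3 - a.e3 * b.e1;
    r.e23 = a.e2 * b.e3 - a.e3 * b.e2;
    return r;
}

int main() {
    Vector3 a{1.0, 2.0, 3.0}, b{4.0, 5.0, 6.0};
    ScalarBivector ab = geometric_product(a, b);
    std::printf("s=%g e12=%g e13=%g e23=%g\n", ab.s, ab.e12, ab.e13, ab.e23);
    return 0;
}
```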
Elementary operations: a novel concept for source-level timing estimation
Published in Automatika, 2019
Nikolina Frid, Danko Ivošević, Vlado Sruk
To summarize, for all three test applications and all three target platform configurations the estimation accuracy remains at approximately the same level, with an average error of around 5% and a maximum error below 17%. Estimation accuracy shows no significant degradation at any level of compiler optimization. Even for the ARM2 configuration there is no deviation in the error rate compared with the results on the other two configurations, although this configuration is more sensitive to cache effects because the processor communicates with very slow memory. If the method fails to capture a cache hit it overestimates the execution time, and if it fails to capture a cache miss it underestimates it. However, it must be noted that in all three test cases a cache hit was much more likely than a cache miss, because the memory footprint of each application remains within the range of 50 KB to 250 KB. The applications therefore fit well within cache sizes typical of embedded processors such as ARM, which makes cache hits far more likely. At the same time, all three test applications are representative, in both size and structure, of the tasks embedded systems are commonly used for: signal processing, vector and matrix operations, numeric calculations, searching and sorting [22].
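For clarity, the error figures quoted above can be read as relative errors of the source-level estimate against the measured execution time; a minimal sketch of that bookkeeping follows (the numbers are made-up placeholders, not the paper's data).

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// One benchmark run: the estimated and the measured execution time.
struct Run { double estimated_us, measured_us; };

int main() {
    // Placeholder data for illustration only.
    std::vector<Run> runs = {
        {1020.0, 1000.0}, {498.0, 520.0}, {2450.0, 2400.0}
    };

    double sum = 0.0, worst = 0.0;
    for (const Run& r : runs) {
        // Relative error of the estimate against the measurement.
        double rel = std::fabs(r.estimated_us - r.measured_us) / r.measured_us;
        sum += rel;
        worst = std::max(worst, rel);
    }
    std::printf("average error: %.1f%%  maximum error: %.1f%%\n",
                100.0 * sum / runs.size(), 100.0 * worst);
    return 0;
}
```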
Algorithmic differentiation of the Open CASCADE Technology CAD kernel and its coupling with an adjoint CFD solver
Published in Optimization Methods and Software, 2018
Mladen Banović, Orest Mykhaskiv, Salvatore Auriemma, Andrea Walther, Herve Legrand, Jens-Dominik Müller
Comparing this with the results in Tables 5 and 6, one can see that the differentiated OCCT sources yield run-time ratios even below the theoretical lower bounds. One reason is that the derivation of the theoretical bounds assumes a rather pessimistic run-time ratio for nonlinear univariate operations, so the much better run-time ratios achieved with the traceless forward mode may be connected to OCCT's limited use of these costly operations. Alternatively, compiler optimization could be a reason for the good run-time ratio. However, a similar effect, i.e. a better run-time ratio than predicted by theory, is also observable for the trace-based forward mode, where compiler optimization cannot be applied comprehensively because of the operator-overloading approach used. Finally, the reverse mode of AD achieves a 63% efficiency improvement over the traceless forward mode of AD.
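To make the contrast concrete, here is a minimal, hypothetical sketch of a traceless forward-mode AD type in C++ (an illustration of the idea, not the AD tool used by the authors): because the derivative is carried in a small inline struct alongside the value, the compiler can optimize the differentiated code together with the primal code, whereas a trace-based mode must record operations on a tape at run time, which limits such optimization.

```cpp
#include <cmath>
#include <cstdio>

// Traceless forward-mode AD: each arithmetic operation propagates the
// derivative next to the value, with no run-time tape.
struct Dual {
    double v;  // value
    double d;  // derivative with respect to the chosen input
};

inline Dual operator+(Dual a, Dual b) { return {a.v + b.v, a.d + b.d}; }
inline Dual operator*(Dual a, Dual b) { return {a.v * b.v, a.d * b.v + a.v * b.d}; }
inline Dual sin(Dual a) { return {std::sin(a.v), std::cos(a.v) * a.d}; }

// Example primal function f(x) = x * sin(x) + x, written once and
// instantiated either with double (primal) or Dual (differentiated).
template <typename T>
T f(T x) { return x * sin(x) + x; }

int main() {
    Dual x{1.5, 1.0};  // seed dx/dx = 1
    Dual y = f(x);
    std::printf("f(1.5) = %.6f, f'(1.5) = %.6f\n", y.v, y.d);
    // Analytically: f'(x) = sin(x) + x*cos(x) + 1.
    return 0;
}
```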