Interconnection Network Energy-Aware Scheduling Algorithm
Published in Kenli Li, Xiaoyong Tang, Jing Mei, Longxin Zhang, Wangdong Yang, Keqin Li, Workflow Scheduling on Computing Systems, 2023
Kenli Li, Xiaoyong Tang, Jing Mei, Longxin Zhang, Wangdong Yang, Keqin Li
With the advent of multi-core (CPU) and many-core (GPU) processors, large-scale heterogeneous systems (such as the supercomputers Sierra, Perlmutter, Summit, Tianhe-2A, Selene, Marconi-100, Titan, Piz Daint, and so on) have emerged as a primary and highly effective computing infrastructure for high-performance applications [1, 24, 113, 141, 163]. For example, the work in [113] successfully ran a nonlinear AWP-ODC scientific application on 4,200 Kepler K20X GPUs of Oak Ridge National Laboratory's supercomputer Titan, simulating a magnitude 7.7 earthquake on the southern San Andreas fault. The papers [141, 163] implemented large-scale high-order CFD (Computational Fluid Dynamics) simulations on the Tianhe-1A supercomputer, adopting a hybrid MPI+OpenMP+CUDA programming model to exploit the parallelism of 1,024 computing nodes.
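The hybrid model can be illustrated with a small sketch. The following is our own minimal example (not code from [141, 163]), assuming the common pattern of one MPI rank per compute node with OpenMP threads across that node's cores; the CUDA offload stage is only indicated by a comment.

```cpp
// Minimal hybrid MPI+OpenMP sketch: inter-node parallelism via MPI ranks,
// node-level parallelism via OpenMP threads. In a GPU-accelerated code the
// threaded loop body would instead launch a CUDA kernel on the node's GPUs.
#include <mpi.h>
#include <omp.h>
#include <vector>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int n = 1 << 20;                       // illustrative local problem size per node
    std::vector<double> u(n, 1.0), f(n, 0.0);

    // Node-level parallelism: OpenMP threads update the local sub-domain.
    #pragma omp parallel for
    for (int i = 1; i < n - 1; ++i)
        f[i] = 0.5 * (u[i - 1] + u[i + 1]);

    // Inter-node parallelism: combine results across MPI ranks.
    double local = f[n / 2], global = 0.0;
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) std::printf("ranks=%d sample=%f\n", nranks, global);
    MPI_Finalize();
    return 0;
}
```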
Extreme Heterogeneity in Deep Learning Architectures
Published in Kuan-Ching Li, Beniamino DiMartino, Laurence T. Yang, Qingchen Zhang, Smart Data, 2019
Jeff Anderson, Armin Mehrabian, Jiaxin Peng, Tarek El-Ghazawi
Manycore processor architectures, such as Intel Xeon Phi Knights Landing (KNL), combine the flexibility of a CPU-based platform with the performance advantages of an on-chip, high-speed network (shown in Figure 1.2), also called a Network-on-Chip (NoC). Neurons are mapped over several CPU cores, with each core executing a subset of the calculations required by the neuron, and intermediate results are passed between cores within a neuron’s cluster until they are reduced to a single value output from one core. For CPU-intensive operations, the manycore architecture enables efficient passing of intermediate values from one processor to another without saturating the memory bus [30]. High-bandwidth communication between processors and memory in the KNL accelerates the training phase, where large amounts of data are frequently moved between memory and execution units [31].
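As an illustration of this reduction pattern (our own sketch, not the KNL implementation referenced in [30, 31]), the hypothetical neuron_forward routine below splits a neuron's multiply-accumulate work across cores with OpenMP and reduces the per-core partial sums to a single output value.

```cpp
// Sketch: one neuron's dot product computed cooperatively by several cores,
// with the intermediate per-core sums reduced to one output value.
#include <omp.h>
#include <vector>
#include <cmath>
#include <cstdio>

// Hypothetical helper: evaluates one neuron over an input vector x with weights w.
double neuron_forward(const std::vector<double>& x,
                      const std::vector<double>& w, double bias) {
    double acc = 0.0;
    // Each core handles a subset of the multiply-accumulate work; OpenMP's
    // reduction clause combines the partial results into a single value.
    #pragma omp parallel for reduction(+ : acc)
    for (std::size_t i = 0; i < x.size(); ++i)
        acc += w[i] * x[i];
    return std::tanh(acc + bias);   // activation applied once, on the reduced value
}

int main() {
    std::vector<double> x(4096, 0.01), w(4096, 0.5);
    std::printf("neuron output = %f\n", neuron_forward(x, w, 0.1));
    return 0;
}
```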
Trauma Outcome Prediction in the Era of Big Data: From Data Collection to Analytics
Published in Ervin Sejdić, Tiago H. Falk, Signal Processing and Machine Learning for Biomedical Big Data, 2018
Shiming Yang, Peter F. Hu, Colin F. Mackenzie
Second, exploiting the independence among tasks allows for parallel data processing, making full use of multicore or many-core machines. There are two main levels of parallelism in typical medical prediction model training: one between subjects and one within a subject. Feature extraction from one patient’s data is usually independent of the others’, so at this level we can distribute study cases evenly across all computing units. Within each subject, many tasks can also be performed simultaneously; for example, features derived from single variables can be calculated on separate cores, and features from moving windows are also highly parallelizable. In the model learning steps, repeated cross-validation is commonly adopted to test and validate the models’ performance on new data and to prevent potential overfitting. Balanced training and testing of model predictions is used to check whether a model generalizes to new, previously unseen data. For example, with the combinations of five outcomes, six feature groups, and 10-fold cross-validation repeated 10 times, about 1500–3000 model calculations and 100–300 model comparisons and statistical tests are required. Parallel training and testing can be used to speed up the learning process.
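As a rough sketch of this parallelism (hypothetical code, not the authors' pipeline), the example below flattens the (outcome, feature group, repeat, fold) combinations into one index space and distributes the independent model fits across cores with OpenMP; train_and_score is a stand-in for the real learner.

```cpp
// Sketch: independent cross-validation fits distributed across cores.
#include <omp.h>
#include <vector>
#include <cstdio>

// Hypothetical stand-in for fitting one model on a training fold and scoring
// it on the held-out fold; a real pipeline would call the actual learner here.
double train_and_score(int outcome, int feature_group, int repeat, int fold) {
    return 0.5 + 0.001 * (outcome + feature_group + repeat + fold);  // dummy score
}

int main() {
    const int outcomes = 5, feature_groups = 6, repeats = 10, folds = 10;
    const int total = outcomes * feature_groups * repeats * folds;   // 3000 fits
    std::vector<double> scores(total);

    // Each model fit is independent, so the flattened index space can be
    // spread over cores (and, at a coarser grain, over machines).
    #pragma omp parallel for
    for (int idx = 0; idx < total; ++idx) {
        int o = idx / (feature_groups * repeats * folds);
        int g = (idx / (repeats * folds)) % feature_groups;
        int r = (idx / folds) % repeats;
        int f = idx % folds;
        scores[idx] = train_and_score(o, g, r, f);
    }
    std::printf("fitted %d models; first score = %f\n", total, scores[0]);
    return 0;
}
```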
Parallelizable adjoint stencil computations using transposed forward-mode algorithmic differentiation
Published in Optimization Methods and Software, 2018
J.C. Hückelheim, P.D. Hovland, M.M. Strout, J.-D. Müller
This new method is particularly relevant given the current trend toward massively parallel computation at low clock speeds [4]. Driven by the stall in achievable processor clock rates [27], as well as a growing concern for energy-efficient computing [12], simulation programs increasingly use shared-memory parallelism to run efficiently on new, more power-efficient multicore and many-core architectures [23,30] such as the Xeon Phi or Nvidia Tesla. A widely used programming model for implementing software on these machines is OpenMP [10]. The reverse differentiation of distributed-memory parallel code using MPI has been a research topic for a long time [20,25,28,34] and is mature enough to be used in practice [32]. However, AD tools have largely ignored shared-memory parallelization: they have focused mostly on forward- or parallel vector-mode differentiation [5–7], or resorted to conservative parallelization approaches with critical sections or atomic operations in the adjoint code, thus reducing its scalability [15,16].
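To make the scalability issue concrete, here is a minimal sketch (our illustration, not the transposed forward-mode method proposed in the paper): the primal one-dimensional stencil is an embarrassingly parallel gather, but its adjoint scatters each output adjoint back to two inputs, so a conservative parallelization must protect the overlapping updates with atomic operations.

```cpp
// Sketch: primal stencil vs. its adjoint under conservative OpenMP parallelization.
#include <omp.h>
#include <vector>
#include <cstdio>

// Primal stencil: y[i] = 0.5*(x[i-1] + x[i+1]); a race-free parallel gather.
void stencil(const std::vector<double>& x, std::vector<double>& y) {
    #pragma omp parallel for
    for (std::size_t i = 1; i < x.size() - 1; ++i)
        y[i] = 0.5 * (x[i - 1] + x[i + 1]);
}

// Adjoint of the stencil: x_b += (dy/dx)^T * y_b. Neighbouring iterations write
// to overlapping x_b entries, hence the atomic updates that limit scalability.
void stencil_adjoint(const std::vector<double>& y_b, std::vector<double>& x_b) {
    #pragma omp parallel for
    for (std::size_t i = 1; i < x_b.size() - 1; ++i) {
        #pragma omp atomic
        x_b[i - 1] += 0.5 * y_b[i];
        #pragma omp atomic
        x_b[i + 1] += 0.5 * y_b[i];
    }
}

int main() {
    const std::size_t n = 1 << 16;
    std::vector<double> x(n, 1.0), y(n, 0.0), y_b(n, 1.0), x_b(n, 0.0);
    stencil(x, y);
    stencil_adjoint(y_b, x_b);
    std::printf("x_b[10] = %f\n", x_b[10]);  // interior entries receive 0.5 + 0.5 = 1.0
    return 0;
}
```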
Network on chip for enterprise information management and integration in intelligent physical systems
Published in Enterprise Information Systems, 2021
Increased computational efficiency depends significantly on improvements in many-core chip designs. Consequently, traditional design approaches cannot scale efficiently with this growing parallelism (Liang et al. 2018). Alternative development methods were suggested using early machine-learning techniques, such as simple regression and neural networks. The latest developments in machine learning bring significant enhancements to design-space exploration (Zeng and Guo 2017). This capability is especially promising in large design spaces such as network-on-chip (NoC) design.