A Distributed Artificial Intelligence
Published in Distributed Artificial Intelligence (Satya Prakash Yadav, Dharmendra Prasad Mahato, Nguyen Thi Dieu Linh, eds.), 2020
Pushpa Singh, Rajnesh Singh, Narendra Singh, Murari Kumar Singh
The primary hardware resource for AI is the GPU, which acts as an accelerator and is complemented by storage and networking solutions. The GPUs reside on IBM Power Systems server nodes. The NVIDIA Tesla V100 GPU is designed for high-performance computing (HPC) and graphics, delivering the performance of up to 100 CPUs in a single GPU. GPUs are well suited to the matrix and vector math involved in machine learning/deep learning and can speed up deep-learning systems by more than 100 times, reducing running times from weeks to days. If the workload is further distributed across different computing nodes, or across separate agents running in parallel, running times can be cut from hours to seconds.
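The kind of dense linear algebra referred to above maps naturally onto GPU offloading. Below is a minimal sketch, assuming a C/OpenACC toolchain; the matrix-vector product, its dimensions, and the data are illustrative placeholders and are not taken from the chapter.

    #include <stdlib.h>

    /* Dense matrix-vector product y = A * x with the outer loop offloaded to
     * the GPU via OpenACC. A is n x n, stored row-major. Sizes and values are
     * placeholders chosen for illustration only. */
    void matvec(int n, const float *A, const float *x, float *y)
    {
        #pragma acc parallel loop copyin(A[0:n*n], x[0:n]) copyout(y[0:n])
        for (int i = 0; i < n; ++i) {
            float sum = 0.0f;
            #pragma acc loop reduction(+:sum)
            for (int j = 0; j < n; ++j)
                sum += A[i * n + j] * x[j];
            y[i] = sum;
        }
    }

    int main(void)
    {
        int n = 2048;                               /* placeholder problem size */
        float *A = malloc((size_t)n * n * sizeof *A);
        float *x = malloc(n * sizeof *x);
        float *y = malloc(n * sizeof *y);
        for (size_t k = 0; k < (size_t)n * n; ++k) A[k] = 1.0f;
        for (int j = 0; j < n; ++j) x[j] = 1.0f;

        matvec(n, A, x, y);   /* runs on the GPU when built with an OpenACC compiler */

        free(A); free(x); free(y);
        return 0;
    }

Built without OpenACC support, the pragmas are ignored and the same code runs serially on the CPU, which makes this style of offloading convenient for comparing CPU and GPU running times.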
Proposal and evaluation of adjusting resource amount for automatically offloaded applications
Published in Cogent Engineering, 2022
Regarding GPUs, I use two boards: an NVIDIA Tesla T4 (2560 CUDA cores, 16 GB GDDR6 memory) and an NVIDIA Quadro P4000 (1792 CUDA cores, 8 GB GDDR5 memory). I use CUDA Toolkit 10.1 and the PGI compiler 19.10 for GPU control. GPU resources are virtualized with NVIDIA vGPU Virtual Compute Server; using vGPU, the Tesla T4 resources are divided, and the resources of one board can be split into 1, 2, or 4 parts. Kernel-based Virtual Machine (KVM) on RHEL 7.9 is used for CPU virtualization. The VM of standard size has 2 cores and 16 GB of RAM; half size (1 core), standard size (2 cores), and double size (4 cores) can be selected. For example, when the CPU and GPU resources are both set to the standard size, our implementation virtualizes the CPU and GPU resources and links a 2-core CPU with one Tesla T4 board. The minimum unit sizes are a 1-core CPU and 1/4 of a GPU board. Figure 3 shows the experimental environment and specifications. The application code used by the user is specified from a client notebook PC, tuned using the bare-metal verification machine, and then deployed to the virtual running environment for actual use.
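The size options described above can be summarized in a small sketch. The following C snippet is illustrative only and is not the implementation from the paper; it simply encodes the selectable CPU sizes (1, 2, or 4 cores) and the vGPU divisions of one Tesla T4 board (1/4, 1/2, or a whole board).

    #include <stdio.h>

    /* Illustrative sketch only (not the paper's implementation). It encodes the
     * resource sizes described above: CPU instances of 1, 2, or 4 cores (the
     * standard 2-core VM also has 16 GB RAM), and a Tesla T4 board divided by
     * vGPU into 1, 2, or 4 parts, so the minimum GPU unit is 1/4 of a board. */

    typedef enum { CPU_HALF = 1, CPU_STANDARD = 2, CPU_DOUBLE = 4 } cpu_size;  /* cores */
    typedef enum { GPU_QUARTER = 1, GPU_HALF = 2, GPU_FULL = 4 } gpu_size;     /* quarters of a T4 board */

    static void describe_instance(cpu_size c, gpu_size g)
    {
        printf("instance: %d CPU core(s), %d/4 of a Tesla T4 board\n", (int)c, (int)g);
    }

    int main(void)
    {
        /* The standard-size example from the text: a 2-core CPU linked with one
         * whole Tesla T4 board. */
        describe_instance(CPU_STANDARD, GPU_FULL);

        /* The smallest selectable combination: 1 core and 1/4 of a board. */
        describe_instance(CPU_HALF, GPU_QUARTER);
        return 0;
    }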
Study of in-plane wave propagation in 2-D polycrystalline microstructure
Published in Mechanics of Advanced Materials and Structures, 2022
Manas Kumar Padhan, Mira Mitra
Simulation of wave propagation, as mentioned in the earlier section, is performed using a dense mesh, which increases the computational cost. Additionally, a high excitation frequency and a geometry with an anisotropic microstructure require a further decrease in the element size, and the computation becomes very expensive. The problem then grows beyond what can be solved in a reasonable time on a CPU-based computer. Hence, wave propagation in polycrystalline materials requires high-performance computing (HPC), using either the multi-core processors found in supercomputers or a workstation with a built-in GPU. Here, the authors found a drastic increase in computational speed, on the order of 50 times, by using a multi-core supercomputer with one dedicated GPU. The supercomputer used is based on a heterogeneous, hybrid configuration of Intel Xeon Skylake processors and NVIDIA Tesla V100 GPUs.
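To make the link between excitation frequency and mesh density concrete, a standard rule of thumb for wave-propagation finite-element models (a general guideline, not a figure taken from this paper) is to resolve each wavelength with roughly 10 to 20 elements:

    h \lesssim \frac{\lambda}{10} = \frac{c}{10 f}, \qquad
    N_{\text{elem}} \sim \left(\frac{L}{h}\right)^{2} \propto f^{2} \quad \text{(2-D domain of size } L\text{)},

where c is the relevant wave speed and f the excitation frequency. Doubling the frequency therefore roughly quadruples the element count in 2-D, and for explicit time integration the stable time step shrinks with the element size as well, which is why GPU acceleration becomes attractive.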
Parallelizable adjoint stencil computations using transposed forward-mode algorithmic differentiation
Published in Optimization Methods and Software, 2018
J.C. Hückelheim, P.D. Hovland, M.M. Strout, J.-D. Müller
This new method becomes particularly relevant given the current trend toward massively parallel computations at low clock speeds [4]. Driven by a stall in achievable processor clock rates [27], as well as a growing concern for energy-efficient computing [12], simulation programs increasingly use shared-memory parallelism to run efficiently on new, more power-efficient multicore and many-core architectures [23,30] such as Xeon Phi or Nvidia Tesla. A widely used programming model for implementing software on these new machines is OpenMP [10]. The reverse differentiation of distributed-memory parallel code using MPI has been a subject of research for a long time [20,25,28,34] and is mature enough to be used in practice [32]. However, AD tools have largely ignored shared-memory parallelization: they have focused mostly on forward- or parallel vector-mode differentiation [5–7], or have resorted to conservative parallelization approaches with critical sections or atomic operations in the adjoint code, thus reducing its scalability [15,16].
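The scalability problem with the conservative approach mentioned in the last sentence can be seen in a generic C/OpenMP sketch (not code from this paper or from any particular AD tool): a forward loop whose iterations write to disjoint outputs parallelizes freely, but its reverse-mode adjoint accumulates into shared locations and is therefore guarded with atomic updates.

    /* Generic sketch of a conservative shared-memory adjoint (illustrative only).
     * Forward loop: each iteration writes its own y[i], so it is safely parallel.
     * Adjoint loop: several iterations may increment the same xb[idx[i]], so the
     * update is protected with an atomic operation, which serializes conflicting
     * increments and limits scalability. Compile with -fopenmp (or equivalent). */

    void forward(int n, const double *x, const int *idx, double *y)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] = 2.0 * x[idx[i]];        /* independent writes to y[i] */
    }

    void adjoint(int n, const double *yb, const int *idx, double *xb)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            #pragma omp atomic
            xb[idx[i]] += 2.0 * yb[i];     /* possible write conflict on xb */
        }
    }

Avoiding this serialization of the adjoint updates is precisely what motivates the transposed forward-mode approach proposed here for stencil computations.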