Intelligent Cloud
Published in Haishi Bai, Zen of Cloud, 2019
Training machine learning models is quite expensive. Techniques such as distributed training and GPU-based training can significantly improve training speed. Because a GPU excels at applying the same calculation to a large amount of data, it is especially well suited to machine learning training. Nvidia has been leading the field of GPU-based training by releasing increasingly powerful GPUs. For example, in 2018 Nvidia revealed its DGX-2 with a new NVSwitch technology that provides high-speed interconnections among 16 GPUs through NVLink. DGX-2 can deliver 2 petaflops at half precision. That's indeed some tremendous compute power by today's (2019) standard!
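As a rough illustration of that data-parallel pattern (a minimal CUDA sketch, not taken from the book; the array size and launch parameters are arbitrary assumptions), the kernel below applies one and the same SAXPY-style update, a building block of neural-network training, to millions of array elements, one element per GPU thread.

#include <cstdio>
#include <cuda_runtime.h>

// Apply the same affine update to every element of a large array.
// Each GPU thread handles exactly one element.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 24;                     // ~16M elements (illustrative size)
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));  // unified memory for brevity
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);               // expect 4.0
    cudaFree(x); cudaFree(y);
    return 0;
}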
Artificial Intelligence Software and Hardware Platforms
Published in Mazin Gilbert, Artificial Intelligence for Autonomous Networks, 2018
Rajesh Gadiyar, Tong Zhang, Ananth Sankaranarayanan
One challenge in using GPUs is their larger distance from the main memory of the server, which introduces delays in data movement (usually carried out over the PCI Express, or PCIe, bus). To speed this up, companies like Nvidia have developed a faster interconnect called NVLink [13]. Other challenges with GPUs include a lack of scalability across cards and across servers, usually higher prices, and higher power consumption. For example, Nvidia rates its Titan X graphics card at 250W and recommends a system power supply of 600W [14].
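To make that data-movement cost concrete, here is a minimal sketch (not from the chapter) that times a single host-to-device copy with CUDA events; the 256 MB payload is an arbitrary choice, and the measured bandwidth depends entirely on the machine and its PCIe or NVLink topology.

#include <cstdio>
#include <cuda_runtime.h>

// Time one host-to-device transfer to make the interconnect cost visible.
int main() {
    const size_t bytes = 256u * 1024 * 1024;   // 256 MB payload (illustrative)
    float *h, *d;
    cudaMallocHost(&h, bytes);                 // pinned host memory for a fair transfer
    cudaMalloc(&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H2D: %.1f MB in %.2f ms (%.1f GB/s)\n",
           bytes / 1e6, ms, bytes / ms / 1e6);

    cudaFree(d); cudaFreeHost(h);
    return 0;
}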
Graphics Programming
Published in Aditi Majumder, M. Gopi, Introduction to Visual Computing, 2018
There is a memory hierarchy used by CUDA and supported by the streaming multiprocessor and GPU architectures. Within each processor inside the chip, we noted that there are registers that are accessible per thread, and this space is valid as long as that thread is alive. If a thread uses more registers than are available, the system automatically uses “local memory”, which is actually the off‐chip memory on the GPU card (device). So, although the data can be transparently fetched from the local memory as if it were in a register, the latency of this data fetch is as high as that of data fetched from the global memory, for the simple reason that “local” memory is just a part of the allocated global memory. The “shared” memory is an on‐chip memory like registers, but it is allocated per block, and the data in the shared memory is valid while the block is being executed by the processor. Global memory, as mentioned earlier, is off‐chip, but on the GPU card. This memory is accessible by all threads of all kernels, as well as by the host (CPU). Data sharing between threads in different blocks of the same kernel, or even in different kernels, can be done using the global memory. The host (CPU) memory, which is the slowest from the GPU perspective, is not directly accessible by CUDA threads; the data has to be explicitly transferred from the host memory to the device memory (global memory). However, CUDA 6 introduces unified memory, using which the data in the host memory can be directly indexed from the GPU side without explicitly transferring data between the host and the device. Finally, communication between different GPUs has to go through the PCI Express bus and through the host memory. This is clearly the most expensive communication. However, the latest NVLink, a power‐efficient high‐speed bus between the CPU and GPU, and between multiple GPUs, allows much higher transfer speeds than those achievable by using PCI Express.
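A short CUDA sketch (not from the chapter; sizes and names are illustrative) shows the spaces described above side by side: a per-thread register variable, a per-block shared-memory array, global memory holding the kernel's input and output, and unified memory allocated with cudaMallocManaged so the host can read the result without an explicit copy.

#include <cstdio>
#include <cuda_runtime.h>

// Sum each block's 256 inputs: registers (v), shared memory (tile),
// and global memory (in/out) all appear in one kernel.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[256];                  // on-chip, per-block, lives while the block runs
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;            // v sits in a per-thread register
    tile[threadIdx.x] = v;
    __syncthreads();

    // Tree reduction inside the block using shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = tile[0];   // partial sum back to global memory
}

int main() {
    const int n = 1 << 20, threads = 256, blocks = n / threads;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));        // unified memory: visible to CPU and GPU
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    blockSum<<<blocks, threads>>>(in, out, n);
    cudaDeviceSynchronize();                           // host reads managed memory after sync

    float total = 0.0f;
    for (int b = 0; b < blocks; ++b) total += out[b];
    printf("sum = %.0f (expected %d)\n", total, n);

    cudaFree(in); cudaFree(out);
    return 0;
}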
Towards overcoming the LES crisis
Published in International Journal of Computational Fluid Dynamics, 2019
Most if not all of the LES runs are currently performed at large HPC centers. The intrinsic assumption is that with Intel Xeon cores (or equivalent) industrial LES will be possible. The more fundamental assumption is that network latency and speeds will scale appropriately, but this assumption is not supported by empirical evidence. CPUs are advancing faster than access to RAM, and access to RAM in turn faster than access to other nodes. If, therefore, the network is the problem, why not develop specialised motherboards with GPUs that are able to fit the gridpoints in shared memory? This would drastically simplify code development and ease of use, and might reach industrial LES before traditional hardware does. As an example, the recent NVIDIA DGX-2 delivers improved NVLink connectivity that allows the use of up to 16 powerful GPUs with huge interconnect bandwidth. Such a system, in combination with a solver capable of harnessing the full power of the GPUs, would be a candidate for industrial LES.