High-availability cluster

Data Recovery

Published in Michael F. Hordeski, Emergency and Backup Power Sources, 2020

The Linux-HA Project provides a failover system in which nodes in a high-availability cluster can take over the IP addresses of failed nodes. When a node fails, another node replaces it and acts in its place. A critical part of any failover system is preserving application state. Memory caches and other client-specific data make client failover difficult.
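The takeover described above can be illustrated with a toy model. This is a minimal sketch, not Linux-HA code: the `Node` class, the `fail_over` function, the node names, and the documentation-range IP addresses are all hypothetical, and the rule of handing addresses to the least-loaded survivor is an assumption made for the example. It also shows why state is the hard part: the addresses move, but the failed node's in-memory cache does not.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster member used only for this illustration."""
    name: str
    addresses: set = field(default_factory=set)
    alive: bool = True
    cache: dict = field(default_factory=dict)  # in-memory state, lost on failure


def fail_over(nodes, failed_name):
    """Mark a node as failed and move its IP addresses to a survivor.

    Assumption: the survivor holding the fewest addresses takes over
    (ties broken by name). Real cluster managers use their own policies.
    """
    failed = next(n for n in nodes if n.name == failed_name)
    failed.alive = False
    survivors = [n for n in nodes if n.alive]
    if not survivors:
        raise RuntimeError("no surviving node can take over")
    target = min(survivors, key=lambda n: (len(n.addresses), n.name))
    target.addresses |= failed.addresses  # the survivor now answers on these IPs
    failed.addresses = set()
    failed.cache.clear()  # memory contents are not transferred
    return target


nodes = [
    Node("node-a", {"192.0.2.10"}, cache={"session-42": "cart contents"}),
    Node("node-b", {"192.0.2.11"}),
]
new_owner = fail_over(nodes, "node-a")
print(new_owner.name, sorted(new_owner.addresses))
print("session-42" in new_owner.cache)
```

In this model, clients that reconnect to 192.0.2.10 reach node-b, but the session data that lived only in node-a's memory is gone, which is the difficulty the passage points to.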

High-Performance Computing for Nuclear Reactor Design and Safety Applications

Published in Nuclear Technology, 2020

Afaque Shams, Dante De Santis, Adam Padee, Piotr Wasiuk, Tobiasz Jarosiewicz, Tomasz Kwiatkowski, Sławomir Potempski

The production system uses Torque 5.1.2 as the queuing system and MAUI 3.3.1 as the job scheduler. These tools were selected based on the configuration of the computing infrastructure for high-energy physics experiments described in Ref. 2 and on collaboration within the Polish HPC community outlined in Ref. 3. All users access the cluster through several login nodes (Fig. 1). Both the queuing system server and the login nodes are virtualized and run on the high-availability cluster to protect the system against hardware failures. The queue configuration takes into account two main variables: maximum execution time (interactive, 1 day, 3 days, 1 week, 2 weeks, dedicated) and node type (the node types available in the cluster are detailed in Table I). Mixing nodes of different types in a single job is not allowed, for two reasons. First, a typical synchronous MPI application loses performance when executed on a set of CPUs with different parameters. Second, the fat-tree InfiniBand islands consist only of nodes of the same type, and the connection between islands has much less bandwidth, which may lead to further performance loss. The production system processes approximately 100 000 jobs per month, and job sizes vary from 1 to 10 000 cores.
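The two queue variables and the no-mixing rule can be modelled in a few lines. This is a sketch under stated assumptions, not the site's Torque or MAUI configuration: the queue naming scheme, the `pick_queue` function, and the node type names `type-a` and `type-b` are hypothetical (the real node types are those of Table I), and the interactive and dedicated classes are left out because the text does not give their limits.

```python
from datetime import timedelta

# Timed classes named in the text, with limits read directly from their names.
# The interactive and dedicated classes are omitted: their limits are not stated.
TIME_CLASSES = [
    ("1day", timedelta(days=1)),
    ("3days", timedelta(days=3)),
    ("1week", timedelta(weeks=1)),
    ("2weeks", timedelta(weeks=2)),
]

# Hypothetical placeholders; the actual node types are listed in Table I.
NODE_TYPES = {"type-a", "type-b"}


def pick_queue(walltime, node_types):
    """Return a queue name for a job, or raise ValueError if it is not allowed.

    The "<node type>-<time class>" naming scheme is an assumption for
    illustration only.
    """
    requested = set(node_types)
    if len(requested) != 1:
        # Mixing node types in a single job is not allowed.
        raise ValueError("a job must use exactly one node type")
    (node_type,) = requested
    if node_type not in NODE_TYPES:
        raise ValueError(f"unknown node type: {node_type}")
    # Choose the shortest time class that still fits the requested walltime.
    for label, limit in TIME_CLASSES:
        if walltime <= limit:
            return f"{node_type}-{label}"
    raise ValueError("walltime exceeds the longest timed queue")


print(pick_queue(timedelta(hours=30), ["type-a"]))
```

A 30-hour job on a single node type lands in the 3-day class, while a request such as `pick_queue(timedelta(hours=2), ["type-a", "type-b"])` is rejected before any time class is considered.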
