Evaluating a Job Scheduler in the Open Systems Arena
Published in Steven F. Blanding, Enterprise Operations Management, 2020
The features and functions of a job scheduler that deliver recurring cost savings can be grouped into three categories: automated schedule administration, ease-of-use considerations, and automated operations. Because these areas promise the highest degree of cost savings, they are highlighted here.
Automated surface water detection from space: a Canada-wide, open-source, automated, near-real time solution
Published in Canadian Water Resources Journal / Revue canadienne des ressources hydriques, 2020
Koreen Millard, Nicholas Brown, Douglas Stiff, Alain Pietroniro
In order to support operational, national-scale mapping of water body extents in near real-time, the classification framework (Figure 3) was implemented on a High Performance Computer (HPC) located at the Canadian Meteorological Centre (CMC). A scheduled task is set up to monitor an input directory. The system is designed so that as new images are acquired by RS2 (or RCM as of July 12, 2020), these images are copied to the HPC’s data storage repository. All processing steps are run through a job scheduler which can allocate up to 200 GB of RAM and up to 44 processors. Scheduled tasks run periodically to check the input directory; if newly added files with the appropriate file extension are found, the pre-processing routine is launched. This means that as images are acquired and added to this directory, the process runs in near real-time. Pre-processed files are then saved to another directory, which is also monitored by a scheduled task so that the next processing step is launched in the same way. Many of the parameters of the classifier, including the locations of the input and output directories and the file extension to watch for, are specified in a text control file. A schematic of the operational procedure is shown in Figure 4.
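As a rough illustration of this directory-watching pattern, the following is a minimal sketch of what one scheduled invocation might do, assuming a simple key=value text control file; the control-file keys (input_dir, output_dir, file_ext), the marker-file convention, and the preprocess_image command are hypothetical and are not taken from the authors' implementation.

```python
"""Sketch of a scheduled directory-watch task (assumptions noted above)."""
import subprocess
from pathlib import Path


def read_control_file(path):
    """Parse a simple key=value text control file."""
    params = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            key, _, value = line.partition("=")
            params[key.strip()] = value.strip()
    return params


def process_new_files(control_file="classifier_control.txt"):
    """Run once per scheduled invocation: launch pre-processing for any
    unprocessed file with the configured extension in the input directory."""
    params = read_control_file(control_file)
    input_dir = Path(params["input_dir"])
    output_dir = Path(params["output_dir"])
    file_ext = params["file_ext"]          # e.g. the extension of RS2/RCM deliveries

    for image in sorted(input_dir.glob(f"*{file_ext}")):
        marker = output_dir / (image.stem + ".done")
        if marker.exists():                # already handled on a previous run
            continue
        # Hypothetical pre-processing command; the real routine is not shown
        # in the excerpt.
        subprocess.run(["preprocess_image", str(image), str(output_dir)],
                       check=True)
        marker.touch()


if __name__ == "__main__":
    process_new_files()
```

Because the task is re-run on a schedule rather than kept resident, each invocation only needs to decide which files are new; the marker files stand in for whatever bookkeeping the operational system actually uses.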
High-Performance Computing for Nuclear Reactor Design and Safety Applications
Published in Nuclear Technology, 2020
Afaque Shams, Dante De Santis, Adam Padee, Piotr Wasiuk, Tobiasz Jarosiewicz, Tomasz Kwiatkowski, Sławomir Potempski
The production system is built using Torque 5.1.2 as the queuing system and Maui 3.3.1 as the job scheduler. The selection of these tools is based on the configuration of the computing infrastructure for high-energy physics experiments as described in Ref. 2 and on collaboration within the Polish HPC community as outlined in Ref. 3. All users access the cluster through several login nodes (Fig. 1). Both the queuing system server and the login nodes are virtualized and run on a high-availability cluster to protect the system against hardware failures. The configuration of queues takes into account two main variables: maximum execution time (interactive, 1 day, 3 days, 1 week, 2 weeks, dedicated) and node type (details on the types of nodes available in the cluster are presented in Table I). There are two reasons why mixing nodes of different types in a single job is not allowed. The first is the performance loss incurred when executing a typical, synchronous MPI application on a set of CPUs with different parameters. The second is that the fat-tree InfiniBand islands consist only of nodes of the same type; the connection between the islands has much less bandwidth, which may lead to even further performance loss. The production system processes approximately 100 000 jobs per month, and the job size varies from 1 to 10 000 cores.
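To make the queue-and-node-type constraint concrete, here is a small sketch of how a submission to such a Torque/Maui setup might look from a wrapper script. Only the generic qsub options (-q, -l) are standard Torque; the queue names, node property label, and core counts below are assumptions for illustration, not the cluster's actual configuration.

```python
"""Sketch of submitting a job to a Torque/Maui system with walltime-based
queues and a single node type per job (assumptions noted above)."""
import subprocess

# Hypothetical mapping of maximum execution time to queue name.
QUEUES = {"1d": "one_day", "3d": "three_days", "1w": "one_week"}


def submit(script, queue_key="3d", nodes=4, ppn=28, node_type="typeA",
           walltime="72:00:00"):
    """Submit `script` to one queue, requesting nodes of a single type only
    (mixing node types within one job is not allowed on this system)."""
    resource = f"nodes={nodes}:ppn={ppn}:{node_type},walltime={walltime}"
    cmd = ["qsub", "-q", QUEUES[queue_key], "-l", resource, script]
    # qsub prints the job identifier on success.
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout.strip()


if __name__ == "__main__":
    print(submit("run_cfd_case.sh"))
```

Selecting the queue by walltime and pinning the node type in the resource request mirrors the two variables the queue configuration is built around.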
Automatic Between-Pulse Analysis of DIII-D Experimental Data Performed Remotely on a Supercomputer at Argonne Leadership Computing Facility
Published in Fusion Science and Technology, 2018
M. Kostuk, T. D. Uram, T. Evans, D. M. Orlov, M. E. Papka, D. Schissel
To facilitate the near-real-time integration of ALCF computing resources with the experiment cycle at DIII-D, we established a persistent connection between the MDSplus database at DIII-D and a job submission service running at ALCF, named Balsam. When a discharge occurs, the coil data are saved to MDSplus, and EFIT is launched automatically (on local DIII-D resources). Once EFIT has completed and uploaded its results to MDSplus, an MDSplus event is issued in the form of a user datagram protocol (UDP) packet sent to the port on which the Balsam service is listening. The packet transfers the discharge number, which is the only additional piece of information that SURFMN needs to run, uniquely identifying the specific experiment. The Balsam service recognizes the form of the expected number, constructs the SURFMN submission script, and passes it to the COBALT job scheduler (COBALT manages a queue of requested analysis jobs on the ALCF’s resources; https://www.alcf.anl.gov/cobalt-scheduler), where it is placed in the queue to be run. The SURFMN submission script makes a request to a metadata server at DIII-D for a unique runID and also sends the message that SURFMN has started at ALCF. The root process of SURFMN connects directly to the MDSplus server (at DIII-D) and requests the 75 MB of raw coil currents; it then connects a second time to retrieve the 15 MB of EFIT equilibrium profiles.
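The event-driven handoff described above can be sketched in a few lines: a service listens for a UDP packet carrying a discharge number, writes a submission script for that discharge, and passes it to a batch scheduler. This is only an illustration of the pattern, not the actual Balsam service; the listening port, the payload format, the run_surfmn command, and the use of qsub in place of the ALCF-specific submission path are all assumptions.

```python
"""Sketch of an event listener that turns a discharge-number packet into a
scheduler submission (assumptions noted above)."""
import socket
import subprocess

LISTEN_PORT = 9500          # hypothetical port for the MDSplus-style event packets


def build_submission_script(shot_number):
    """Write a hypothetical SURFMN-style submission script for one discharge."""
    path = f"surfmn_{shot_number}.sh"
    with open(path, "w") as f:
        f.write("#!/bin/bash\n")
        f.write(f"run_surfmn --shot {shot_number}\n")   # placeholder command
    return path


def listen():
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", LISTEN_PORT))
    while True:
        payload, _addr = sock.recvfrom(1024)
        text = payload.decode(errors="ignore").strip()
        if not text.isdigit():           # accept only a well-formed shot number
            continue
        script = build_submission_script(int(text))
        # Hand the script to the batch scheduler; qsub stands in here for the
        # scheduler-specific submission command.
        subprocess.run(["qsub", script], check=True)


if __name__ == "__main__":
    listen()
```

The key design point carried over from the excerpt is that the discharge number is the only state transferred in the event; everything else the analysis needs is fetched on demand from the servers at DIII-D.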