<style>
.reveal {
font-size: 30px;
}
.reveal h3{
color: #eb3474;
}
.reveal p {
text-align: left;
}
.reveal ul {
display: block;
}
.reveal ol {
display: block;
}
</style>
# <font color="#4542ad">Introduction to DiCOS Slurm Job Submission</font>
Michael Ting-Chang Yang
楊庭彰
mike.yang@twgrid.org
Academia Sinica Grid-computing Centre (ASGC)
Date: 2022-12-15
---
# Introduction to DiCOS Slurm
## Introduction to Slurm - Computing
* Computing Machine Specifications
  - Computing nodes:
    - FDR5: 92 nodes
    - HDR1: 6 nodes
    - GPU: 3 nodes
      - 1 node with NVIDIA Tesla V100 (8 GPU cards)
      - 2 nodes with NVIDIA Tesla A100 (8 GPU cards each)
  - Local Slurm submission only
  - Workflow management system (WFMS): Slurm
----
## Introduction to Slurm - Storage
* Software repositories (read-only)
  - /cvmfs/cvmfs.grid.sinica.edu.tw
* Assuming your user name is **jack**, you have the following storage spaces:
  - Home directory (NFS, data safety not guaranteed): __/dicos_ui_home/jack__
  - Working directory (CEPH): __/ceph/sharedfs/users/j/jack/__
  - Group directory (CEPH, shared with your research group): __/ceph/sharedfs/groups/<your_group>__
* CEPH partitions are charged per group, and the accounting bill is sent to the group PIs monthly
* The relevant information is shown when you log in to slurm-ui
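* A quick way to check these paths from the UI (a minimal sketch, still assuming user **jack**):
```bash
ls -ld /dicos_ui_home/jack          # home directory on NFS
ls -ld /ceph/sharedfs/users/j/jack  # working directory on CEPH
df -h /ceph/sharedfs                # overall usage of the CEPH file system
```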
---
# User Interfaces (Login Nodes)
## Logging in to the Slurm User Interface
* The user interface node for Slurm is:
  - slurm-ui.twgrid.org
* Log in to the user interface:
```bash
ssh jack@slurm-ui.twgrid.org
```
* The relevant information about your account is shown when you log in to the Slurm user interface
---
# Basic Usage of Slurm System
* Query cluster information
```bash
sinfo
```
* Query the jobs submitted by you
```bash
sacct
```
or
```bash
sacct -u jack
```
----
* Submit your job with a bash script (recommended)
  - Add __#!/bin/bash__ (the shebang) as the first line of your script so that sbatch recognizes it as a shell script
```bash
sbatch your_script.sh
```
* Submit your job (binary executable) with srun
  - It is harder to control resource usage when submitting directly with **srun**. We recommend wrapping your **srun** command in a batch script and submitting it with **sbatch** (see the minimal wrapper sketch below)
```bash
srun your_program arg1 arg2
```
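* A minimal wrapper sketch (the job name is hypothetical; `your_program arg1 arg2` is the placeholder from above):
```bash
#!/bin/bash
#SBATCH --job-name=srun_wrapper   # hypothetical job name
#SBATCH --ntasks=1                # one task; adjust to your program's needs
srun your_program arg1 arg2
```
  Save it as e.g. wrapper.sh and submit it with `sbatch wrapper.sh`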
----
* Show queue information
```bash
squeue
```
* Show your job in the queue
```bash
squeue -u jack
```
* Show the detailed job information
```bash
scontrol show job your_jobid
```
* Cancel your job
```bash
scancel your_jobid
```
---
# Partitions/Queues of Slurm
* Slurm Partitions (Queues)
- Please see: https://dicos.grid.sinica.edu.tw/static/docs/slurm_job_submission.html#slurm-partitions-queues
* The default queue is "short". Users can submit to a different partition with the partition parameter, e.g.
```bash
sbatch -p large myscript.sh
```
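* Alternatively, the partition can be set in the script preamble (a minimal sketch; "large" is just the partition from the example above):
```bash
#!/bin/bash
#SBATCH --partition=large
echo "Running on partition: $SLURM_JOB_PARTITION"
```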
---
# Environment Modules
## Introduction
* On the DiCOS Slurm system, Environment Modules is installed on the user interfaces and the worker nodes
* For detailed information, please refer to the original documentation:
  - https://modules.readthedocs.io/en/latest/
* Environment Modules helps users set up the environment and environment variables properly for specific software stacks
  - Users do not need to worry about the complex settings of each environment
----
## Environment Modules - Usage Scope
- The environment modules are initialized automatically when you log in to the UI
- You can load the necessary modules on the UI and then submit your job; the environment settings are carried over to the worker nodes automatically (see the sketch below)
## Environment Modules - Initialization
* On slurm-ui, the environment modules are initialized automatically when users log in
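* For example (a minimal sketch; `gcc/11.1.0` is one of the modules shown on the next slide, and `my_job.sh` is a hypothetical batch script):
```bash
# on slurm-ui: load the module, then submit; the loaded environment is propagated to the job
module load gcc/11.1.0
sbatch my_job.sh
```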
----
## Basic Usage of Environment Modules
* Show available modules in slurm-ui
```less
$ module avail
------ /cvmfs/cvmfs.grid.sinica.edu.tw/hpc/modules/modulefiles/Core ---------
app/anaconda3/4.9.2 app/cmake/3.20.3 app/root/6.24 gcc/9.3.0 gcc/11.1.0
intel/2018 nvhpc_sdk/20.11 python/3.9.5 app/binutils/2.35.2 app/make/4.3
gcc/4.8.5 gcc/10.3.0 intel/2017 intel/2020 pgi/20.11
```
* Load module
```bash
module load intel/2020
```
* Unload module
```bash
module unload intel/2020
```
* Show currently loaded modules
```bash
module list
```
----
* Unload all loaded modules
```bash
module purge
```
---
# Python, Compilation and MPI Environment
* Check the available pre-set environments first:
```
module avail
```
----
## Python
* The default system Python on CentOS 7 is Python 2.7.5
* If you are going to use Python 3, please consider using Anaconda with Python 3 first
```
module load app/anaconda3/4.12.0
```
* If you would like to install your own Python packages, use:
```
pip install --user <your_package>
```
to install packages from PyPI directly into your home directory (they must match the Python version you have loaded)
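* For example (a minimal sketch; `numpy` is just an example package):
```bash
module load app/anaconda3/4.12.0
pip install --user numpy
python -c "import numpy; print(numpy.__version__)"
```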
----
## Compilation
* Intel compiler
```bash
module load intel/2022
```
* GCC
```bash
module load gcc/12.1.0
```
* NVIDIA HPC SDK (nvcc, for GPU program development)
```bash
module load nvhpc_sdk/20.11
```
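* For example, compiling a small C program with the loaded GCC (a minimal sketch; `hello.c` is a hypothetical source file):
```bash
module load gcc/12.1.0
gcc -O2 -o hello hello.c
./hello
```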
----
## MPI
* Load a compiler first, e.g. the Intel compiler
```bash
module load intel/2020
```
* Then load an MPI implementation
- mpich
```bash
module load mpich
```
- openmpi
```bash
module load openmpi/4.1.0
```
- mvapich2
```bash
module load mvapich2
```
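* A minimal compile-and-run sketch (assuming `mpi_hello.c` is your own MPI source file and the Intel compiler plus Open MPI modules above are loaded):
```bash
module load intel/2020
module load openmpi/4.1.0
mpicc -O2 -o mpi_hello mpi_hello.c   # mpicc wraps the underlying compiler
srun -n 4 ./mpi_hello                # or run it inside an sbatch script, as in the examples below
```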
---
# Slurm Job Submission Examples - Hands on
---
# Example 1 - Simple Job Submission (Hello World)
----
* Prepare a user defined shell script hello_world.sh
```bash
#!/bin/bash
date
echo "Hello World DiCOS Users!"
hostname
```
* Submit the job with sbatch
```bash
sbatch hello_world.sh
```
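* After submitting, you can check the job and read its output (with no --output option, Slurm writes stdout to slurm-<jobid>.out in the submission directory):
```bash
squeue -u jack            # is the job still queued or running?
cat slurm-<jobid>.out     # stdout of hello_world.sh; <jobid> is printed by sbatch
```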
---
# Example 2 - Submit an MCORE Job
----
* Assume you have a multi-process program called mcore.exe
* You need to request resources in the preamble of the script, e.g. mcore.sh:
```bash
#!/bin/bash
#SBATCH --job-name=My_MCORE_Job # shows up in the output of 'squeue'
#SBATCH --time=1-00:00:00 # specify the requested wall-time
#SBATCH --nodes=1 # -N number of nodes allocated for this job
#SBATCH --ntasks-per-node=1 # number of MPI ranks per node
#SBATCH --cpus-per-task=10 # -c number of OpenMP threads per MPI rank
#SBATCH --error=job.%J.err # job error; without --error/--output, both stdout and stderr go to slurm-%j.out
#SBATCH --output=job.%J.out # job output; without --error/--output, both stdout and stderr go to slurm-%j.out
srun mcore.exe -c 10 -t 100
```
* Submit mcore.sh via sbatch to a proper partition (the partition option goes before the script name):
```bash
sbatch -p large mcore.sh
```
* This example submits a job requesting 10 CPU cores on a single node
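* Inside the batch script, the allocated core count is also exposed as `$SLURM_CPUS_PER_TASK`, so the hard-coded value could be derived from the allocation (a sketch; `mcore.exe` and its options are the placeholders from above):
```bash
srun mcore.exe -c "$SLURM_CPUS_PER_TASK" -t 100
```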
---
# Example 3 - Run an MPI LAMMPS Job
----
* Prepare an MPI-capable run script: run_lammps_mpi.sh
```bash
#!/usr/bin/bash
#SBATCH --job-name=JCT_TEST # shows up in the output of 'squeue'
#SBATCH --time=1-00:00:00 # specify the requested wall-time
#SBATCH --nodes=3 # -N number of nodes allocated for this job
#SBATCH --ntasks-per-node=1 # number of MPI ranks per node
#SBATCH --cpus-per-task=3 # -c number of OpenMP threads per MPI rank (matches OMP_NUM_THREADS below)
#SBATCH --error=job.%J.err # job error; without --error/--output, both stdout and stderr go to slurm-%j.out
#SBATCH --output=job.%J.out # job output; without --error/--output, both stdout and stderr go to slurm-%j.out
module purge
module load intel/2020
module load mpich
module load lammps/jct/3Mar2020
export OMP_NUM_THREADS=3
srun lmp -sf omp -pk omp 3 -in SSMD_input_run.txt
```
* Submit job with sbatch:
```bash
sbatch run_lammps_mpi.sh
```
---
# Example 4 - Submit a Python Job Using Anaconda3 Python 3
----
* Prepare a Python script that calculates pi: calculate_pi.py
```python
# Initialize denominator
k = 1
# Initialize sum
s = 0
for i in range(1000000000):
    # even index elements are positive
    if i % 2 == 0:
        s += 4/k
    else:
        # odd index elements are negative
        s -= 4/k
    # denominator is odd
    k += 2
print(f"{s}")
```
----
* Prepare a shell script that loads the environment modules and runs the Python script: calculate_pi.sh
```bash
#!/bin/bash
source /etc/profile.d/dicos-environment-modules.sh
module load app/anaconda3/4.9.2
python calculate_pi.py
```
* Submit job using sbatch
```bash
sbatch calculate_pi.sh
```
---
# Problem Report and FAQ
* Online documents: https://dicos.grid.sinica.edu.tw/wiki/
* Email channel to ASGC admins: DiCOS-Support@twgrid.org
* Regular face-to-face (live) video conference: ASGC DiCOS User Meeting (held at 13:20 (UTC+8) every Wednesday); ask our staff for the conference call information
---
* This Slide: https://docs.twgrid.org/p/WVt0nmE8i#/
![](/uploads/upload_ed6d9f7a2711c4ab7b56dd5920d467de.png)
---
* Training Git Repo https://github.com/ASGCOPS/slurm_training/
![](/uploads/upload_3f8421343aacd59e5179904666df33ca.png)
{"title":"Introduction DiCOS Slurm Job Submission","tags":"presentation,DiCOS_Document","slideOptions":{"transition":"fade","theme":"white","parallaxBackgroundImage":"/uploads/upload_d60ca84ca111101563c574837e417042.jpg"}}