<style> .reveal { font-size: 30px; } .reveal h3{ color: #eb3474; } .reveal p { text-align: left; } .reveal ul { display: block; } .reveal ol { display: block; } </style>

# <font color="#4542ad">Introduction to DiCOS Slurm Job Submission</font>

Michael Ting-Chang Yang 楊庭彰
mike.yang@twgrid.org
Academia Sinica Grid-computing Centre (ASGC)
Date: 2022-12-15

---

# Introduction of DiCOS Slurm

## Introduction of Slurm - Computing

* Computing Machine Specifications
  - Computing Nodes:
    - FDR5: 92 nodes
    - HDR1: 6 nodes
    - GPU: 3 nodes
      - 1x NVIDIA Tesla V100 (8 GPU cards each node)
      - 2x NVIDIA Tesla A100 (8 GPU cards each node)
  - Local Slurm Submission Only
  - Workflow management system (WFMS): Slurm

----

## Introduction of Slurm - Storage

* Software repositories (read-only)
  - /cvmfs/cvmfs.grid.sinica.edu.tw
* Assuming your user name is **jack**, you have the following storage spaces to use:
  - Home Directory (NFS, not guaranteed for your data security): __/dicos_ui_home/jack__
  - Working Directory (CEPH): __/ceph/sharedfs/users/j/jack/__
  - Group Directory (CEPH, specifically for your research group): __/ceph/sharedfs/groups/<your_group>__
* Ceph partitions are charged per group, and the accounting bill is sent to group PIs on a monthly basis
* You will be shown the relevant information when you log in to slurm-ui

---

# User Interfaces (Login Nodes)

## Login into Slurm User Interface

* The user interface node for Slurm is:
  - slurm-ui.twgrid.org
* Log in to the user interface:

```bash
ssh jack@slurm-ui.twgrid.org
```

* You will be shown the relevant information about your account when you log in to the Slurm user interface

---

# Basic Usage of Slurm System

* Query cluster information

```bash
sinfo
```

* Query the jobs submitted by you

```bash
sacct
```

or

```bash
sacct -u jack
```

----

* Submit your job with a bash script (recommended)
  - You need to add __#!/bin/bash__ (shebang/hashbang) as the first line of your script so that sbatch recognizes it as a shell script

```bash
sbatch your_script.sh
```

* Submit your job (binary executable) with srun
  - It is not easy to control resource usage when submitting directly with **srun**. We recommend wrapping your **srun** command in a batch shell script and running it with **sbatch**

```bash
srun your_program arg1 arg2
```

----

* Show queue information

```bash
squeue
```

* Show your jobs in the queue

```bash
squeue -u jack
```

* Show detailed job information

```bash
scontrol show job your_jobid
```

* Cancel your job

```bash
scancel your_jobid
```

---

# Partitions/Queues of Slurm

* Slurm Partitions (Queues)
  - Please see: https://dicos.grid.sinica.edu.tw/static/docs/slurm_job_submission.html#slurm-partitions-queues
* The default queue is "short". Users can submit to a different partition by passing the partition parameter, e.g.

```bash
sbatch -p large myscript.sh
```

---

# Environment Modules

## Introduction

* In the DiCOS Slurm system, environment modules are installed on the user interfaces and worker nodes
* For detailed information, please refer to the original documentation:
  - https://modules.readthedocs.io/en/latest/
* Environment modules help users set up environments and environment variables properly for specific software environments
  - Users do not need to worry about the complex settings of the environments

----

## Environment Modules - Use Scopes

- The environment modules are initialized automatically when you log in to the UI
- You can load the necessary modules on the UI and then submit your job; the environment settings will be brought to the worker nodes automatically

## Environment Modules - Initialization

* On slurm-ui, the environment modules are initialized automatically when the user logs in

----

## Basic Usage of Environment Modules

* Show available modules on slurm-ui

```less
$ module avail
------ /cvmfs/cvmfs.grid.sinica.edu.tw/hpc/modules/modulefiles/Core ---------
app/anaconda3/4.9.2  app/cmake/3.20.3  app/root/6.24  gcc/9.3.0  gcc/11.1.0  intel/2018  nvhpc_sdk/20.11  python/3.9.5
app/binutils/2.35.2  app/make/4.3  gcc/4.8.5  gcc/10.3.0  intel/2017  intel/2020  pgi/20.11
```

* Load a module

```bash
module load intel/2020
```

* Unload a module

```bash
module unload intel/2020
```

* Show currently loaded modules

```bash
module list
```

----

* Unload all loaded modules

```bash
module purge
```

---

# Python, Compilation and MPI Environment

* Use

```
module avail
```

to check the available pre-set environments first

----

## Python

* The default system python on CentOS 7 is python 2.7
* If you are going to use python 3, please consider using anaconda with python3 first

```
module load app/anaconda3/4.12.0
```

* If you would like to install your own python package, use:

```
pip install --user <your_package>
```

  to install the python package from PyPI directly into your home directory (it needs to match your python version)

----

## Compilation

* Intel compiler

```bash
module load intel/2022
```

* GCC

```bash
module load gcc/12.1.0
```

* NVIDIA development kit (nvcc, for GPU program development)

```bash
module load nvhpc_sdk/20.11
```

----

## MPI

* Load a compiler first, e.g. the Intel compiler

```bash
module load intel/2020
```

* Load an MPI implementation
  - mpich

```bash
module load mpich
```

  - openmpi

```bash
module load openmpi/4.1.0
```

  - mvapich2

```bash
module load mvapich2
```

---

# Slurm Job Submission Examples - Hands on

---

# Example 1 - Simple Job Submission (Hello World)

----

* Prepare a user-defined shell script hello_world.sh

```bash
#!/bin/bash
date
echo "Hello World DiCOS Users!"
hostname
```

* Submit the job with sbatch

```bash
sbatch hello_world.sh
```

---

# Example 2 - Submit a MCORE Job

----

* Assume we have a multi-process program called mcore.exe
* You need to request resources in the preamble of your script, e.g.
mcore.sh

```bash
#!/bin/bash
#SBATCH --job-name=My_MCORE_Job  # shows up in the output of 'squeue'
#SBATCH --time=1-00:00:00        # requested wall-time
#SBATCH --nodes=1                # number of nodes allocated for this job
#SBATCH --ntasks-per-node=1      # number of MPI ranks per node
#SBATCH --cpus-per-task=10       # number of OpenMP threads per MPI rank
#SBATCH --error=job.%J.err       # job error; by default directed to slurm-%j.err
#SBATCH --output=job.%J.out      # job output; by default directed to slurm-%j.out

srun mcore.exe -c 10 -t 100
```

* Submit mcore.sh via sbatch to a proper partition (note that the `-p` option must come before the script name, otherwise it is passed to the script as an argument):

```bash
sbatch -p large mcore.sh
```

* This example submits a job requesting 10 CPU cores

---

# Example 3 - Run an MPI LAMMPS Job

----

* Prepare an MPI-capable run script: run_lammps_mpi.sh

```bash
#!/usr/bin/bash
#SBATCH --job-name=JCT_TEST   # shows up in the output of 'squeue'
#SBATCH --time=1-00:00:00     # requested wall-time
#SBATCH --nodes=3             # number of nodes allocated for this job
#SBATCH --ntasks-per-node=1   # number of MPI ranks per node
#SBATCH --cpus-per-task=1     # number of OpenMP threads per MPI rank
#SBATCH --error=job.%J.err    # job error; by default directed to slurm-%j.err
#SBATCH --output=job.%J.out   # job output; by default directed to slurm-%j.out

module purge
module load intel/2020
module load mpich
module load lammps/jct/3Mar2020

export OMP_NUM_THREADS=3
srun lmp -sf omp -pk omp 3 -in SSMD_input_run.txt
```

* Submit the job with sbatch:

```bash
sbatch run_lammps_mpi.sh
```

---

# Example 4 - Submit a python job using anaconda3 python3

----

* Prepare a python script that calculates π: calculate_pi.py

```python
# Initialize denominator
k = 1
# Initialize sum
s = 0

for i in range(1000000000):
    # even-index terms are positive
    if i % 2 == 0:
        s += 4/k
    else:
        # odd-index terms are negative
        s -= 4/k
    # denominator runs over the odd numbers
    k += 2

print(f"{s}")
```

----

* Prepare a shell script that wraps the environment modules and runs the python script: calculate_pi.sh

```bash
#!/bin/bash
source /etc/profile.d/dicos-environment-modules.sh
module load app/anaconda3/4.9.2
python calculate_pi.py
```

* Submit the job using sbatch

```bash
sbatch calculate_pi.sh
```

---

# Problem Report and FAQ

* Online documents: https://dicos.grid.sinica.edu.tw/wiki/
* Email channel to ASGC admins: DiCOS-Support@twgrid.org
* Regular face-to-face (live) video conference: ASGC DiCOS User Meeting (held at 13:20 (UTC+8) every Wednesday); ask our staff for the concall information

---

* This Slide: https://docs.twgrid.org/p/WVt0nmE8i#/

![](/uploads/upload_ed6d9f7a2711c4ab7b56dd5920d467de.png)

---

* Training Git Repo https://github.com/ASGCOPS/slurm_training/

![](/uploads/upload_3f8421343aacd59e5179904666df33ca.png)
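---

# Extra - Requesting a GPU Node (Sketch)

----

* The cluster has GPU nodes with V100 and A100 cards. The standard Slurm way to request GPUs in a batch script is the `--gres` option. The script below is a minimal sketch only: the partition name **gpu** and the executable **my_gpu_program** are assumptions, not values from this document; check the actual partition names with `sinfo` or the partition documentation page before submitting

```bash
#!/bin/bash
#SBATCH --job-name=My_GPU_Job  # shows up in the output of 'squeue'
#SBATCH --partition=gpu        # assumed partition name; verify with 'sinfo'
#SBATCH --gres=gpu:1           # request 1 GPU card on the allocated node
#SBATCH --time=01:00:00        # requested wall-time
#SBATCH --output=job.%J.out    # job output

nvidia-smi                     # print the GPU(s) allocated to this job
srun ./my_gpu_program          # hypothetical GPU executable
```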
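---

# Extra - Sanity-Checking calculate_pi.py

----

* The loop in Example 4 implements the Leibniz series 4/1 - 4/3 + 4/5 - ... Since the error after n terms is roughly 1/n, a much shorter run already lands close to π, so you can verify the logic interactively on the UI node before submitting the full billion-iteration job. A minimal sketch (the function name `leibniz_pi` is just a label chosen here):

```python
import math

def leibniz_pi(n_terms):
    """Approximate pi with the Leibniz series: 4/1 - 4/3 + 4/5 - ..."""
    s = 0.0
    k = 1                     # denominator runs over the odd numbers
    for i in range(n_terms):
        if i % 2 == 0:        # even-index terms are positive
            s += 4.0 / k
        else:                 # odd-index terms are negative
            s -= 4.0 / k
        k += 2
    return s

approx = leibniz_pi(1_000_000)
print(approx, "error:", abs(approx - math.pi))
```

* With 10^6 terms the error is already below 10^-5; the full job in Example 4 only tightens this further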
{"title":"Introduction DiCOS Slurm Job Submission","tags":"presentation,DiCOS_Document","slideOptions":{"transition":"fade","theme":"white","parallaxBackgroundImage":"/uploads/upload_d60ca84ca111101563c574837e417042.jpg"}}