Kay User Guide
Hardware and Architecture
When you connect to Kay using SSH, your connection will automatically be routed to one of three login nodes (login1, login2, or login3). These nodes are intended for interactive tasks such as compiling code and editing and managing files. They are shared with other users and should therefore not be used for compute-intensive workloads.
Apart from the login nodes, the vast majority of Kay is made up of compute nodes. These nodes are used for running compute-intensive jobs that are managed by a batch system. Users submit batch scripts to request node(s) for each job; jobs are then placed into a scheduling queue before compute nodes are allocated to execute them.
Further hardware and architectural details of Kay are described in our Infrastructure section.
- In order to connect to Kay via SSH, you will need to log in using your central ICHEC account username. You will be asked to authenticate with both your central ICHEC account password and SSH key-based authentication.
- Users must first configure an SSH key on each device they use to connect to Kay. Please note that this requires you to generate the SSH public-private key pair on your local workstation, not on Kay, for security reasons (e.g. ssh-keygen for Linux/Mac, MobaXterm for Windows; DO NOT store your private key on Kay). See our tutorial, Setting up SSH Keys, for more details.
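As a minimal sketch on Linux/Mac, key generation and the subsequent connection typically look like the lines below; the key filename and the hostname kay.ichec.ie are shown as examples, so follow the Setting up SSH Keys tutorial for the authoritative steps.
# On your LOCAL workstation: generate a key pair (never on Kay itself)
ssh-keygen -t rsa -b 4096 -f ~/.ssh/id_rsa_kay
# Connect with your central ICHEC username (hostname shown as an example)
ssh -i ~/.ssh/id_rsa_kay username@kay.ichec.ie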
Data Storage
When using ICHEC systems your files will be stored in two locations:
Home: /ichec/home/users/username
Work: /ichec/work/projectname
Home has a relatively small storage quota (25GB) and should be used for personal files and source code related to your use of the system. It is not suited to storing large volumes of data, for example simulation results.
Work is an area of common storage shared by all members of a project, with a much larger quota. In practice this is where the majority of files should be stored. Note that only home directories are backed up to tape; project directories under /ichec/work/projectname are NOT backed up. The backup of home directories is intended only for disaster recovery; we do not accommodate bespoke data recovery requests from users, e.g. for accidental file deletions.
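To keep an eye on how much of these quotas you are using, generic POSIX tools are sufficient; the sketch below assumes only the paths given above and no ICHEC-specific quota tool.
# Summarise the size of your home directory (25GB quota)
du -sh /ichec/home/users/$USER
# Summarise the size of a project's work directory
du -sh /ichec/work/projectname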
While a job is running on the compute nodes, it also has access to two scratch directories (/scratch/global/ and /scratch/local/) for storing temporary files. /scratch/global/ is simply a temporary directory similar to those in your home or work areas, whereas /scratch/local/ points to local SSD drives on the individual compute nodes, which provide fast read/write access. All compute nodes have 400GB SSDs for local scratch storage, apart from the High Memory nodes, which have 1TB SSDs. Please keep in mind that files in scratch storage only last for the duration of the job; once the job ends, everything in the scratch directories is deleted.
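As an illustrative sketch only (the staging pattern, application name and file names are assumptions, not an ICHEC-prescribed recipe), a job script can stage data through the node-local SSD and copy results back before the job ends:
# Work in a job-specific directory on the node-local SSD
SCRATCH_DIR=/scratch/local/$SLURM_JOB_ID
mkdir -p $SCRATCH_DIR
# Stage input in, run from fast local storage, then copy results back
cp /ichec/work/projectname/input.dat $SCRATCH_DIR/
cd $SCRATCH_DIR
./my_app input.dat > results.out
cp results.out /ichec/work/projectname/
# Anything left here is deleted automatically when the job ends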
Environment Modules
We support a range of software packages on Kay - a detailed list is in our Software section. In order to make use of any specific software package, you must load its appropriate module(s).
Loading a module typically sets or modifies some environment variables, e.g. the PATH variable (so that the shell knows where to look for the relevant executable binaries, libraries, etc. for a particular software package).
You can load the appropriate modules (software, compilers, etc.) as follows:
module load modulename
# Load the software module
module load intel/2019
Note: The software-specific module load commands must be present in your job submission scripts (before the software-specific run command).
Some other useful module commands are:
# List the loaded modules
module list
# Unload the loaded modules
module unload intel/2019
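A few further commands from the standard environment-modules toolset are also available (the module name below is just an example):
# List all modules available on the system
module avail
# Show what a module sets in your environment (PATH, etc.)
module show intel/2019
# Unload all currently loaded modules
module purge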
Software Packages
The details of software applications available on Kay via modules can be found in our Software section.
Workload Manager (SLURM)
We use SLURM to allocate compute resources on Kay. In order to submit jobs to the compute nodes you should:
- Write a job script which describes the required resources (e.g. how many CPUs and for how long), instructions such as where to redirect standard output and error, and the commands to be executed once the job starts.
- Submit the job to the workload manager which will then start the job once the requested resources are available.
- Once the job completes, you will find the results generated by the job on the filesystem (e.g. expected application output files, special files that contain the standard output/error generated by the job).
The jobs can be:
- Interactive : With interactive jobs, you can request a set of nodes on which to run an interactive bash shell. This is useful for quick tests and development work. For example, the following command will submit an interactive job requesting 1 node for 1 hour, charged to myproj_id:
salloc -p DevQ -N 1 -A myproj_id -t 1:00:00
Note: Interactive jobs should only be submitted to the DevQ queue.
- Batch : For each batch job, a job script is submitted and then executed whenever the requested resources become available. A sample script is shown below which requests 2 nodes (each with 40 cores, i.e. 80 cores in total) for 20 minutes to run an MPI application:
#!/bin/sh
#SBATCH --time=00:20:00
#SBATCH --nodes=2
#SBATCH -A myproj_id
#SBATCH -p DevQ

module load intel/2019
mpirun -n 80 mpi-benchmarks/src/IMB-MPI1
Note: The file must be a shell script (e.g. with the first line being #!/bin/sh, as above) with the Slurm directives preceded by #SBATCH.
To submit the batch job, use the sbatch command:
sbatch mybatchjob.sh
- Multiple Serial Jobs (Task Farming) : For running multiple serial jobs using slurm, refer to our tutorial for Task Farming.
Job Reason Codes
These codes identify the reason that a job is waiting for execution. A job may be waiting for more than one reason, in which case only one of those reasons is displayed.
These codes can be found on the Slurm documentation website.
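To see the state of your own jobs, including the reason code for any that are still pending, the standard SLURM commands below can be used (shown as a sketch; the job ID 12345 is a placeholder):
# List your queued and running jobs; pending jobs show their reason code
squeue -u $USER
# Show accounting details for a finished job
sacct -j 12345
# Cancel a job you no longer need
scancel 12345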
Backup Policy
As stated in our Acceptable Usage Policy, backups are made only of users' home directories. Project directories under /ichec/work/projectname are NOT backed up. Furthermore, backups are carried out only as part of our system failure recovery plan; the restoration of user files deleted accidentally is not provided as a service.
Support
The Helpdesk is the main entry point for getting help from ICHEC's Support team. Here you can get help with using our facilities, find out more about ICHEC, or send us your comments. If our documentation or our FAQ section does not resolve your query, do not hesitate to use the Helpdesk to contact ICHEC.