Chair for System Simulation (Department of Computer Science 10)

HPC-Cluster - Access and Use


Since November 2004, the LSS has maintained a 52-CPU high-performance cluster. This page gives general information and hints on using the machine. For problem reports, questions, or suggestions, please mail clusteradm@immd10.informatik.uni-erlangen.de.

Hardware

The cluster consists of 9 dual-nodes, 8 quad-nodes and a fileserver. The machines share the following specifications:
  • CPU: AMD Opteron, 2.2 GHz, 1 MB L2 cache
  • RAM: DDR 333, 4 GB on the dual-nodes, 16 GB on the quad-nodes
All nodes are connected through a GBit interface; the quad-nodes are additionally connected through Infiniband, which provides a bandwidth of up to 10 GBit/s. The fileserver has roughly 500 GB of disk space. Each node also provides about 70 GB of local disk space.

Access

In order to obtain a login for the cluster, please mail clusteradm@immd10.informatik.uni-erlangen.de with a short description of your project and the required resources. If you receive a login, your public ssh-key will be necessary for authentication.
The node names are as follows:
  • Front-End: fauia50
  • Quad-Nodes: fauia<51-58>
  • Dual-Nodes: fauia<59-67>
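The numbering scheme above can be checked mechanically. The following is only a sketch and not part of the cluster setup: node_type is a hypothetical helper that maps a short host name to its role, using nothing but the ranges listed above.

```shell
# Hypothetical helper: classify a cluster host by its number, using the
# ranges from the node list (50 front-end, 51-58 quad, 59-67 dual).
node_type() {
    num=${1#fauia}                       # strip the common prefix
    case "$num" in
        ''|*[!0-9]*) echo "unknown"; return 1 ;;
    esac
    if [ "$num" -eq 50 ]; then
        echo "front-end"
    elif [ "$num" -ge 51 ] && [ "$num" -le 58 ]; then
        echo "quad"
    elif [ "$num" -ge 59 ] && [ "$num" -le 67 ]; then
        echo "dual"
    else
        echo "unknown"; return 1
    fi
}

node_type fauia59    # prints: dual
```

The helper expects the short name (fauia59), not the fully qualified host name.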
Access to the cluster is granted via ssh-key authentication. All users have to log in to the front-end, fauia50.informatik.uni-erlangen.de, first. When logging in, the ssh option -A (agent forwarding) is essential. Moreover, the ssh-agent on your local machine has to be running.

ssh-agent $SHELL
ssh-add
ssh -A <login>@fauia50.informatik.uni-erlangen.de
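If the login fails, it can help to check the agent before connecting. The sketch below relies on the documented exit status of ssh-add -l (0: keys loaded, 1: no keys, 2: agent not reachable); agent_hint is a hypothetical helper name, not something installed on the cluster.

```shell
# Translate the exit status of `ssh-add -l` into a hint.
# 0: keys are loaded, 1: agent runs but holds no keys, 2: no agent reachable.
agent_hint() {
    case "$1" in
        0) echo "ok: agent holds at least one key" ;;
        1) echo "agent is running but empty: run ssh-add" ;;
        2) echo "no agent reachable: start one with ssh-agent \$SHELL" ;;
        *) echo "unexpected status: $1" ;;
    esac
}

# Typical use on the local machine (left commented so the sketch runs anywhere):
#   ssh-add -l >/dev/null 2>&1; agent_hint $?
agent_hint 1
```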


After logging in to fauia50, you will see a menu listing all machines available for interactive work. Select a machine with the cursor keys and press Enter to be connected to it.
[Screenshots: login menu on fauia50]
For profiling purposes, please use fauia59. Its kernel includes the perfctr patch (PAPI). Attempts to log in directly to the machines shown in the menu will be blocked. To log out, type exit and choose Exit in the menu.

Environment

Because several compilers and libraries are available for Linux on the Opteron, the Environment Modules package is provided to switch easily between them. A full description of the modules project can be found on the SourceForge modules homepage.

In order to make use of the modules utilities on the cluster, you will need to add a new line to your shell .rc file:

.bashrc
# bash resource file
. /central/modules/default/init/bash
...

.cshrc
# cshrc resource file
source /central/modules/default/init/csh
...

There are separate initialization scripts available for bash, csh, ksh, perl, python, sh, tcsh, and zsh.
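As a sketch of how the right init script could be selected, the hypothetical helper modules_init below maps a shell name to the corresponding file. The directory and the list of shells are taken from the text above; the helper itself is an illustration and not part of the modules package.

```shell
# Hypothetical helper: print the modules init script for a given shell.
modules_init() {
    case "$1" in
        bash|csh|ksh|perl|python|sh|tcsh|zsh)
            echo "/central/modules/default/init/$1" ;;
        *)
            echo "no init script for '$1'" >&2
            return 1 ;;
    esac
}

# In ~/.bashrc the line could be guarded so that the same file also works
# on machines without the modules package:
#   init=$(modules_init bash) && [ -r "$init" ] && . "$init"
modules_init zsh    # prints: /central/modules/default/init/zsh
```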

To load a module, simply issue the command:
% module load name_of_module_to_load

Note: You can also add module load commands to your shell .rc file so that a default set of modules is loaded every time you log in.

To unload the module:
% module unload name_of_module_to_unload

For a listing of available modules type module avail.
[Screenshot: output of module avail]
To see what a module does when it is loaded, use the show command. As an example, consider the output for the mpich-1.2.6 package:
[Screenshot: output of module show for mpich-1.2.6]

Hint: You can add your own custom module files by creating ~/.modulefiles and putting the custom module definitions there. For help on writing module file definitions look at the modules homepage, or load the modules module:
% module load modules

and read the manpages:
% man module
% man modulefile

Interactive work

There are three dual-nodes for interactive work, one of them (fauia59) reserved especially for profiling purposes. These nodes are intended for compilation and short benchmark runs. For long-running jobs, please use the queueing system. In case of abuse, jobs will be terminated without warning.
Disk space is limited by the quota-tools. Type
quota -vs
to see your current status. All users who have an account at LSS can access the home directory of their local machine. For those users, access from the cluster to the local machines is possible, too (/lsshome). All other users will have to use scp to copy data to the cluster. Since only fauia50 is visible from outside, please use this machine as the target (/home is NFS-mounted on all nodes). Note the trailing colon, which tells scp that the target is a remote host:
scp myStuff <login>@fauia50.informatik.uni-erlangen.de:
In addition to common software, an installation of the module utils provides easy access to different compilers and MPI-implementations. Use the command
module avail
to receive a list of all available modules. In order to load a certain module, type
module load <module-name>
After that, you will be able to use the specified software (see section Environment).
To make use of MPI in connection with Infiniband, the module scali has to be loaded. After compiling, the command mpirun starts the program like any other MPI-job.
mpirun -np 2 -machinefile mfile ./a.out
Your machine file mfile should look like this:
fauia60
fauia60
fauia61
fauia61
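A machine file of this shape can also be generated rather than typed. The sketch below uses a hypothetical helper, make_machinefile, that repeats each host name once per process; it assumes, as in the example above, that one line corresponds to one MPI process on that host.

```shell
# Hypothetical helper: print a machine file with one line per MPI process.
# $1 = processes per host, remaining arguments = host names.
make_machinefile() {
    per_host=$1
    shift
    for host in "$@"; do
        i=0
        while [ "$i" -lt "$per_host" ]; do
            echo "$host"
            i=$((i + 1))
        done
    done
}

# Reproduces the example above; redirect to a file for use with -machinefile:
#   make_machinefile 2 fauia60 fauia61 > mfile
make_machinefile 2 fauia60 fauia61
```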
In order to use more CPUs, please refer to the batch system. Again, long running jobs with too many CPUs will be killed on the interactive nodes without warning.

Batch-System

All jobs with moderate or high computation times have to be submitted to the batch system. An ssh-key has to be generated before the batch system can be used properly. The batch system is able to handle serial and parallel jobs.

Generation of ssh-key

Before using the batch system, you should create an ssh-key with an empty passphrase on the cluster to avoid problems with MPI.
  • ssh-keygen -t dsa -b 1024
  • Answer the prompts, enter no passphrase, and give the passphrase-less key a name different from your other keys (for example id_dsa_cluster)
  • cd ~/.ssh
  • cat id_dsa_cluster.pub >> authorized_keys
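Repeating the last step appends the same key a second time. A possible guard is sketched below; add_key_once is a hypothetical helper, and the demonstration works on throwaway files only, with placeholder text instead of a real key.

```shell
# Append a public key to an authorized_keys file unless it is already there.
# $1 = public key file, $2 = authorized_keys file.
add_key_once() {
    touch "$2"
    if ! grep -qxF -f "$1" "$2"; then
        cat "$1" >> "$2"
    fi
    chmod 600 "$2"
}

# Demonstration on temporary files (the key text is a placeholder):
tmp=$(mktemp -d)
echo "ssh-dss AAAAexample user@cluster" > "$tmp/id_dsa_cluster.pub"
add_key_once "$tmp/id_dsa_cluster.pub" "$tmp/authorized_keys"
add_key_once "$tmp/id_dsa_cluster.pub" "$tmp/authorized_keys"   # second call changes nothing
wc -l < "$tmp/authorized_keys"                                  # the file holds a single line
rm -rf "$tmp"
```

On the cluster the call would be add_key_once ~/.ssh/id_dsa_cluster.pub ~/.ssh/authorized_keys.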
In order to ensure the right key is used, please place a file .ssh/config with the following content in your home directory:

Host fauia5*
IdentityFile path_to_your_home/.ssh/id_dsa_cluster
Host fauia6*
IdentityFile path_to_your_home/.ssh/id_dsa_cluster
Host *
IdentityFile path_to_your_home/.ssh/id_dsa
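The placeholder path_to_your_home has to be replaced by hand. As a sketch, the hypothetical helper below prints the same configuration with $HOME filled in. It also merges the two fauia patterns into one Host line, which ssh accepts; this is a shortening of the configuration above, not a different setup.

```shell
# Print the configuration shown above with path_to_your_home replaced by $HOME.
print_cluster_ssh_config() {
    cat <<EOF
Host fauia5* fauia6*
    IdentityFile $HOME/.ssh/id_dsa_cluster
Host *
    IdentityFile $HOME/.ssh/id_dsa
EOF
}

# Possible use (commented out because it would overwrite an existing file):
#   print_cluster_ssh_config > ~/.ssh/config && chmod 600 ~/.ssh/config
print_cluster_ssh_config
```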

Usage

Jobs can be submitted from fauia60 and fauia61. First, load the pbs module:
module load pbs
Then write a job script in which all requirements of the job are specified (see Serial Jobs and Parallel Jobs for examples). The script is submitted to the queueing system with the command qsub:
qsub <NameOfScript>
The queueing system returns the identifier of the job. To see all queued jobs, use the command
qstat -a
Queued jobs may also be deleted by the user who submitted them. In this case the leading number of the job identifier has to be used:
qdel <NumberOfjobIdentifier>
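The leading number can be cut from the identifier with a shell parameter expansion. The sketch below assumes identifiers of the usual PBS form <number>.<server>; the identifier used here is made up, and job_number is a hypothetical helper.

```shell
# Extract the leading number from a job identifier of the form <number>.<server>.
job_number() {
    echo "${1%%.*}"    # remove everything from the first dot onwards
}

job_number 1234.fauia50    # prints: 1234

# Possible use on the cluster (commented out, needs the pbs module):
#   jobid=$(qsub <NameOfScript>)
#   qdel "$(job_number "$jobid")"
```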
The following screenshot shows an example.
[Screenshot: example session with the queueing system]
There are 3 queues configured at the moment. If no queue is specified, the job will be sent to the serial queue.

                       serial queue   parallel queue   lss queue
access                 all            all              restricted
priority               normal         normal           high
default run time       2 h            2 h              2 h
no. of CPUs per job    1              16               32
no. of running jobs    1              4                unlimited
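The CPU limits per job can be turned into a small lookup. The sketch below is hypothetical and not part of the batch system. It assumes that the queue names accepted by -q are serial, parallel and lss; only the first two appear in the sample scripts, and the lss queue has restricted access.

```shell
# Hypothetical helper: suggest the smallest queue whose CPU limit fits a job.
# Limits per job as listed above: serial 1, parallel 16, lss 32.
suggest_queue() {
    case "$1" in
        ''|*[!0-9]*) echo "invalid CPU count: $1" >&2; return 1 ;;
    esac
    if [ "$1" -eq 0 ]; then
        echo "invalid CPU count: $1" >&2
        return 1
    elif [ "$1" -eq 1 ]; then
        echo serial
    elif [ "$1" -le 16 ]; then
        echo parallel
    elif [ "$1" -le 32 ]; then
        echo lss            # restricted access
    else
        echo "no queue allows $1 CPUs per job" >&2
        return 1
    fi
}

suggest_queue 8    # prints: parallel
```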
The cluster can be used for serial and parallel jobs. However, due to the configuration of the cluster, jobs with a moderate number of CPUs and high communication effort are most suitable. Users with different requirements should contact the local computing center (www.rrze.uni-erlangen.de/dienste/arbeiten-rechnen/hpc/).

Serial Jobs

A sample job script might look like this:

#!/bin/sh
#PBS -l nodes=1:ppn=1
#PBS -l walltime=01:00:00
#PBS -q serial
#PBS -M x@y -m abe

. /central/modules/default/init/bash
export FOO=bar
module load XXX
cd /homes/<staff|stud|guests>/<login>/...
./a.out

The number of nodes and the number of processors per node are both set to 1. After 1 hour of wall clock time, the job will be terminated by the queueing system; the default is 2 hours. Mail notification at the start, end or abort of the job will be sent to the address x@y. All essential environment variables have to be set before the executable is started. In particular, sourcing the modules init script is required in order to use the module command in batch scripts.
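The walltime value is written as hh:mm:ss. As a small sketch, the hypothetical helper below converts a run time in whole minutes into that form; it is not part of PBS.

```shell
# Convert a run time in minutes into the hh:mm:ss form used by
# "#PBS -l walltime=...".
walltime() {
    printf '%02d:%02d:00\n' $(( $1 / 60 )) $(( $1 % 60 ))
}

walltime 60    # prints: 01:00:00
walltime 90    # prints: 01:30:00
```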

Parallel Jobs

OpenMP

OpenMP can be used on a single shared-memory node only. For such jobs, the number of processors per node should be equal to the number of OpenMP threads used. For example:

#!/bin/sh
#PBS -l nodes=1:ppn=4
#PBS -l walltime=01:00:00
#PBS -q parallel
#PBS -M x@y -m abe

. /central/modules/default/init/bash
export FOO=bar
export OMP_NUM_THREADS=4
module load XXX
cd /homes/<staff|stud|guests>/<login>/...
./a.out

Apart from setting the required OpenMP variables, the script does not differ from that of a serial job.

MPI

For MPI jobs, the product of the number of nodes and the number of processors per node (max. 4) gives the number of MPI processes to pass to the -np option. For example:

#!/bin/sh
#
#PBS -l nodes=2:ppn=4
#PBS -q parallel

export FOO=bar

. /central/modules/default/init/bash
module load scali
module load XXX

cd /homes/<staff|stud|guests>/<login>/...
mpirun -np 8 -machinefile $PBS_NODEFILE ./a.out

The variable $PBS_NODEFILE points to a machine file suitable for the number of CPUs chosen.
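Instead of hard-coding the value for -np, it could be derived from the machine file. The sketch below assumes that the file contains one line per allocated CPU, so that the line count equals nodes times ppn; count_slots is a hypothetical helper, and the demonstration uses a temporary stand-in file with host names chosen to mimic nodes=2:ppn=4.

```shell
# Count the non-empty lines of a machine file.
# Assumption: one line per allocated CPU.
count_slots() {
    awk 'NF { n++ } END { print n + 0 }' "$1"
}

# Self-contained demonstration with a stand-in for $PBS_NODEFILE:
demo=$(mktemp)
printf '%s\n' fauia51 fauia51 fauia51 fauia51 fauia52 fauia52 fauia52 fauia52 > "$demo"
count_slots "$demo"    # prints: 8
rm -f "$demo"

# Inside a job script this would become:
#   mpirun -np "$(count_slots "$PBS_NODEFILE")" -machinefile "$PBS_NODEFILE" ./a.out
```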

MPI2

When using MPI2, a daemon necessary to run the job has to be started first. The job script should look like this:

#!/bin/sh
#
#PBS -l nodes=2:ppn=4
#PBS -q parallel

export FOO=bar

. /central/modules/default/init/bash
module load mpich2
module load XXX

mpdboot --verbose --totalnum=8 --file=$PBS_NODEFILE
sleep 5

cd /homes/<staff|stud|guests>/<login>/...
mpirun -machinefile $PBS_NODEFILE -np 8 ./a.out

mpdallexit

Please note the order of the options of the mpirun command: here, -machinefile precedes -np.
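If the program aborts, the script above still reaches mpdallexit, but a script that stops early would not. One way to make the shutdown unconditional is an EXIT trap. The sketch below shows the pattern with stand-in commands; with_cleanup is a hypothetical helper, not part of MPICH2.

```shell
# Run a command and always execute a cleanup command afterwards,
# returning the exit status of the main command.
# $1 = cleanup command line, remaining arguments = command to run.
with_cleanup() {
    (
        trap "$1" EXIT    # runs when the subshell exits, whatever the outcome
        shift
        "$@"
        exit $?
    )
}

# Demonstration with stand-ins; in the job script above the call might read
#   with_cleanup mpdallexit mpirun -machinefile "$PBS_NODEFILE" -np 8 ./a.out
with_cleanup 'echo "ring shut down"' true
```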

Documentation

Further information can be found here:
  Last modified: 2007-08-09 12:00   fd