Basic notes on rescomp. The rescomp webpages should be consulted if you are unsure about anything. Last updated: November 2019

V1, 20191112

Cores, Threads, Memory, MPI proc, Node



Submitting a job on rescomp.


Note that the information presented below is not sufficient on its own. Please also familiarise yourself with the wiki available at https://help.bmrc.ox.ac.uk/tutorials/submit-jobs.html (note that content on this site is only accessible with rescomp login credentials).


To assess core availability, log in to rescomp1/2 and type the following into the terminal window:

qload -nall

For CPU jobs, you should be interested in compG, compC, compD, compE and belmont. The dots indicate free cores. Note that belmont details only 'appear' if you narrow your terminal window. Relion jobs are submitted to relionx.qc or relionx.qe (x = 1, 4, 8, 16). For other programs, jobs that should take no longer than 24 hours should be submitted to short.qc (or short.qb etc.), while jobs that are anticipated to take longer (rare for strubi programs) should be submitted to long.qc. This does not apply to relion jobs.
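If you want a plain Grid Engine view of the same information, the standard qstat options also work from rescomp1/2 (these are generic SGE commands, not rescomp-specific):

qstat -g c
qstat -f -q short.qc

The first prints a one-line summary per cluster queue (used/available slots and load); the second lists every host instance of a single queue, here short.qc.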

Use the standard submission script in the 'standard submission script' tab on the left for relion jobs.


For submitting other jobs to the cluster, such as cppxfel, use the following template and save it (here I will call the file 'index.sh'). This is based on Sun Grid Engine syntax:
#!/bin/bash

#$ -N index
#$ -cwd -V
#$ -pe shmem 16
#$ -P strubi.prjc
#$ -q belmont.q
#$ -o run1_$JOB_ID.out
#$ -e run1_$JOB_ID.err

cppxfel.run_dials shot*.pickle
-N is the name of the job.
-pe shmem states the number of cores that you wish to request.
-P indicates the project.
-q is the queue name, such as short.qc.
-o and -e, corresponding to the output log and error files, are optional flags. Here I use the job ID (the number of your job that flashes up when it is submitted). If these flags are not set, index.o<JOB_ID> and index.e<JOB_ID> files will be generated instead.
The command here is the cppxfel.run_dials line.
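As an illustration of how the template changes for a different queue (the job name and core count here are arbitrary examples, not a recommendation), a 4-core job destined for short.qc might look like:

#!/bin/bash

#$ -N index_short
#$ -cwd -V
#$ -pe shmem 4
#$ -P strubi.prjc
#$ -q short.qc

cppxfel.run_dials shot*.pickle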


To submit a job, simply execute the following:
qsub index.sh
Check that your job is running by typing:

qstat
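A couple of other standard Grid Engine commands are useful at this point (again, generic SGE rather than anything rescomp-specific):

qstat -u $USER
qdel <JOB_ID>

The first restricts the listing to your own jobs; the second removes a queued or running job, using the job ID that qsub reported.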

To decide on the number of threads and cores, it is often best to refer back to tutorials associated with the particular piece of software that you are using. If the job is taking a long time, or you are hitting the memory limit, it is worth increasing the number of requested CPUs and potentially switching to a higher memory queue.
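As a rough sketch of the memory arithmetic (an approximation based on the 16 GB per slot limit quoted later on this page), the number of slots to request for a job expected to need about 60 GB would be:

echo $(( (60 + 15) / 16 ))    # ceiling of 60/16 = 4, i.e. '#$ -pe shmem 4'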

Relion job submission

This is typically achieved using the relion gui and a standard submission script, though it is also possible to submit your own submission scripts via the command line.

Input from rescomp:

"The relionX.q[ce] queues are essentially a cheat to facilitate the above for mpi jobs, the qsub syntax for which would otherwise be horrific. Each mpi rank gets one slot, so for relion1.q[ce] that is a limit of 16GB/rank, for relion2.q[ce] that is 32GB/rank, etc. If your mpi ranks need less than 16GB each, use relion1.q[ce], 16-32GB use relion2.q[ce], etc. Relion can use the extra cores, but it is necessary to give appropriate arguments to both mpirun and relion to match the queue you are targeting - I don’t think it can figure that out for itself, though someone could probably write a clever wrapper or helper. The normal short/long queue are only appropriate for mpi jobs with memory requirement less than 16GB/rank. The relion queues are appropriate for a wider range of mpi jobs as you can pick the specific queue to match your per-rank memory requirement. How you know your per-rank memory requirement is where your domain knowledge comes in, but for relion there’s probably some rule of thumb based on the box size."


Best practices on rescomp.

It is important to take note of the number of cores available on each of the rescomp queues. If you submit a job requesting 11 cores to relion1.qc, it may use up the entire memory allocation of a node, so that no other user can run even a 5-core job on the remaining cores, and it will often take longer than simply requesting all 16 cores.
Note that, with the latest version of relion (3.1), refinement can only be run on an odd number of MPI procs.

The appropriate number of slots to request should be driven by memory requirements; details from rescomp staff are quoted below:

"In theory, fewer slots per job is always better, and very few multi-threaded applications scale well, certainly beyond four threads. If the nature of the application is that the work can be split into independent chunks, N separate jobs given one core each usually completes faster overall than one job given N cores even leaving aside the scheduler considerations.

The primary driver for using more than one slot per job is thus memory rather than cpu, there being a 16GB/slot memory limit. If you need more memory than that, you need more slots. If you can use the extra cores available then it is obviously beneficial to do so, but it is rarely the driver of the choice. I would say four slots is the number above which you should think about doing it differently, especially if you have a lot of jobs, and at eight slots you should talk to us [rescomp] to ensure the jobs get scheduled in a timely manner."
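One practical way to follow the 'N separate jobs' advice above is a Grid Engine array job (standard SGE syntax; the file naming below is made up for illustration):

#!/bin/bash

#$ -N index_array
#$ -cwd -V
#$ -P strubi.prjc
#$ -q short.qc
#$ -t 1-100

cppxfel.run_dials shot_${SGE_TASK_ID}.pickle

Each of the 100 tasks gets a single core by default and runs independently, with Grid Engine filling free cores as they appear.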


Who to contact when stuck.

 


Contact rescomp@well.ox.ac.uk for all enquiries, but try to troubleshoot your problem first by speaking to other regular EM users such as Pranav, Helen, Jeremy and Loic.


A very common problem with relion, especially amongst those processing virus data (big box sizes!) is memory. Rather unhelpfully, in such cases, a relion job tends to just fail with no information in the error log. When this happens, it is advisable to request access to one of the himem queues, or consider upping the number of threads.
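To check whether a crashed job really did hit a memory limit, the Grid Engine accounting record is often more informative than the empty error log (a standard SGE command; use the job ID that qsub reported):

qacct -j <JOB_ID> | grep -E 'maxvmem|failed|exit_status'

maxvmem is the peak memory the job reached; compare it against 16 GB times the number of slots you requested.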

 


Configuration and memory for each queue available to strubi.


Refer to the research computing webpages (https://help.bmrc.ox.ac.uk/tutorials/submit-jobs.html). For relion jobs, C nodes (16 cores per node) and E nodes (24 cores per node) are available, and queues are named according to an even number of threads (excepting '1'), e.g. relion1.qc, relion8.qc (8 threads), etc. Note that for E nodes, only relion1.qe to relion4.qe are available.

 


Data archiving.


Data storage is significantly more expensive than you might think, and we have finite space available on the gpfs system. If you have any data that are no longer 'active', please do consider archiving them. Decide on what you wish to archive, and then contact team rescomp at rescomp@well.ox.ac.uk to find out how to do this.
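Before contacting rescomp, it is usually worth bundling anything you plan to archive into a single compressed file (a general suggestion rather than a rescomp requirement; the paths are illustrative):

tar -czf myproject_2019.tar.gz myproject/
tar -tzf myproject_2019.tar.gz | head

The first command creates the archive; the second lists its contents so you can check it before removing the original directory.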