The GROMACS build system and the gmx mdrun tool have a lot of built-in and configurable intelligence to detect your hardware and make pretty effective use of it. For a lot of casual and serious use of gmx mdrun, the automatic machinery works well enough. But to get the most from your hardware to maximize your scientific quality, read on!
Modern computer hardware is complex and heterogeneous, so we need to discuss a little bit of background information and set up some definitions. Experienced HPC users can skip this section.
The algorithms in gmx mdrun and their implementations are most relevant when choosing how to make good use of the hardware. For details, see the Reference Manual. The most important of these are
gmx mdrun can be configured and compiled in several different ways that are efficient to use within a single node. The default configuration using a suitable compiler will deploy a multi-level hybrid parallelism that uses CUDA, OpenMP and the threading platform native to the hardware. For programming convenience, in GROMACS, those native threads are used to implement on a single node the same MPI scheme as would be used between nodes, but much more efficiently; this is called thread-MPI. From a user’s perspective, real MPI and thread-MPI look almost the same, and GROMACS refers to MPI ranks to mean either kind, except where noted. A real external MPI can be used for gmx mdrun within a single node, but runs more slowly than the thread-MPI version.
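For reference, the choice between the two flavours is made at build time; a minimal sketch of the relevant cmake invocations follows (see the install guide for the full set of options and the exact flag spellings in your version):
cmake ..
Configures the default thread-MPI build, suitable for single-node use.
cmake .. -DGMX_MPI=on
Configures a build against an external MPI library, needed for multi-node runs.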
By default, gmx mdrun will inspect the hardware available at run time and do its best to make fairly efficient use of the whole node. The log file, stdout and stderr are used to print diagnostics that inform the user about the choices made and possible consequences.
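For example, after a run you can skim the log for these reports; md.log is the default log file name, and the exact wording of the messages varies between GROMACS versions:
grep -A 10 "Hardware detected" md.log
Prints the hardware detection report, which lists the CPUs and any compatible GPUs that mdrun found.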
A number of command-line parameters are available to vary the default behavior.
gmx mdrun
Starts mdrun using all the available resources. mdrun will automatically choose a fairly efficient division into thread-MPI ranks, OpenMP threads and assign work to compatible GPUs. Details will vary with hardware and the kind of simulation being run.
gmx mdrun -nt 8
Starts mdrun using 8 threads, which might be thread-MPI or OpenMP threads depending on hardware and the kind of simulation being run.
gmx mdrun -ntmpi 2 -ntomp 4
Starts mdrun using eight total threads, with two thread-MPI ranks and four OpenMP threads per rank. You should only use these options when seeking optimal performance, and must take care that the ranks you create can have all of their OpenMP threads run on the same socket. The number of ranks must be a multiple of the number of sockets, and the number of cores per node must be a multiple of the number of threads per rank.
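As a concrete illustration (the node layout here is an assumption, not a recommendation): on a two-socket node with 8 cores per socket, both of the following satisfy these constraints, because the ranks divide evenly across the two sockets and the 16 cores are a multiple of the threads per rank:
gmx mdrun -ntmpi 2 -ntomp 8
gmx mdrun -ntmpi 4 -ntomp 4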
gmx mdrun -gpu_id 12
Starts mdrun using GPUs with IDs 1 and 2 (e.g. because GPU 0 is dedicated to running a display). This requires two thread-MPI ranks, and will split the available CPU cores between them using OpenMP threads.
gmx mdrun -ntmpi 4 -gpu_id "1122"
Starts mdrun using four thread-MPI ranks, and maps them to GPUs with IDs 1 and 2. The CPU cores available will be split evenly between the ranks using OpenMP threads.
gmx mdrun -nt 6 -pin on -pinoffset 0
gmx mdrun -nt 6 -pin on -pinoffset 3
Starts two mdrun processes, each with six total threads. Threads will have their affinities set to particular logical cores, beginning from logical core 0 or 3, respectively. The above would work well on an Intel CPU with six physical cores and hyper-threading enabled. Use this kind of setup only when restricting mdrun to a subset of cores in order to share a node with other processes.
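To actually run the two simulations side by side, you would typically launch them in the background from the same shell, along these lines (the -deffnm names are placeholders for two independent sets of input files):
gmx mdrun -nt 6 -pin on -pinoffset 0 -deffnm sim1 &
gmx mdrun -nt 6 -pin on -pinoffset 3 -deffnm sim2 &
wait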
mpirun -np 2 gmx mdrun_mpi
When using a gmx mdrun compiled with external MPI, this will start two ranks and as many OpenMP threads as the hardware and MPI setup will permit. If the MPI setup is restricted to one node, then the resulting gmx mdrun will be local to that node.
This requires configuring GROMACS to build with an external MPI library. By default, this mdrun executable gets an _mpi suffix; it is written as gmx mdrun_mpi in the examples below. All of the considerations for running single-node mdrun still apply, except that -ntmpi and -nt cause a fatal error, and instead the number of ranks is controlled by the MPI environment. Settings such as -npme are much more important when using multiple nodes. Configuring the MPI environment to produce one rank per core is generally good until one approaches the strong-scaling limit. At that point, using OpenMP to spread the work of an MPI rank over more than one core is needed to continue to improve absolute performance. The location of the scaling limit depends on the processor, presence of GPUs, network, and simulation algorithm, but it is worth measuring at around 200 particles/core if you need maximum throughput.
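As a rough worked example (the system and node sizes are assumptions chosen only to illustrate the arithmetic): a 192,000-particle system reaches 200 particles/core at about 960 cores, i.e. 40 nodes if each node has two 12-core processors. Near that limit, using two OpenMP threads per rank may scale better than one rank per core:
mpirun -np 480 gmx mdrun_mpi -ntomp 2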
There are further command-line parameters that are relevant in these cases.
Note that -tunepme has more effect when there is more than one node, because the cost of communication for the PP and PME ranks differs. It still shifts load between PP and PME ranks, but does not change the number of separate PME ranks in use.
Note also that -dlb and -tunepme can interfere with each other, so if you experience performance variation that could result from this, you may wish to tune PME separately, and run the result with mdrun -notunepme -dlb yes.
The gmx tune_pme utility is available to search a wider range of parameter space, including making safe modifications to the tpr file, and varying -npme. It is only aware of the number of ranks created by the MPI environment, and does not explicitly manage any aspect of OpenMP during the optimization.
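A hypothetical invocation might look like the following; the option names and the use of the MPIRUN environment variable should be checked against gmx tune_pme -h for your version, and the file names are placeholders:
export MPIRUN=mpirun
gmx tune_pme -np 64 -s topol.tpr -mdrun "gmx mdrun_mpi"
This benchmarks a range of PME rank counts (and load-shifting settings) using 64 ranks and reports the fastest combination.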
The examples and explanations for single-node mdrun are still relevant, but -nt is no longer the way to choose the number of MPI ranks.
mpirun -np 16 gmx mdrun_mpi
Starts gmx mdrun with 16 ranks, which are mapped to the hardware by the MPI library, e.g. as specified in an MPI hostfile. The available cores will be automatically split among ranks using OpenMP threads, depending on the hardware and any environment settings such as OMP_NUM_THREADS.
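If you prefer to set the OpenMP thread count explicitly rather than rely on the defaults, the standard environment variable can be exported before the launch; the counts below are placeholders and should be chosen so that ranks times threads matches the cores actually available:
export OMP_NUM_THREADS=4
mpirun -np 16 gmx mdrun_mpi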
mpirun -np 16 gmx mdrun_mpi -npme 5
Starts gmx mdrun with 16 ranks, as above, and requires that five of them are dedicated to the PME component.
mpirun -np 11 gmx mdrun_mpi -ntomp 2 -npme 6 -ntomp_pme 1
Starts gmx mdrun with 11 ranks, as above, and requires that six of them are dedicated to the PME component with one OpenMP thread each. The remaining five do the PP component, with two OpenMP threads each.
mpirun -np 4 gmx mdrun_mpi -ntomp 6 -gpu_id 00
Starts gmx mdrun on a machine with two nodes, using four total ranks, each rank with six OpenMP threads, and both ranks on each node sharing the GPU with ID 0.
mpirun -np 8 gmx mdrun_mpi -ntomp 3 -gpu_id 0000
Starts gmx mdrun on a machine with two nodes, using eight total ranks, each rank with three OpenMP threads, and all four ranks on each node sharing the GPU with ID 0. This may or may not be faster than the previous setup on the same hardware.
mpirun -np 20 gmx mdrun_mpi -ntomp 4 -gpu_id 00
Starts gmx mdrun with 20 ranks, each rank with four OpenMP threads, and maps both ranks on each node to the GPU with ID 0. This setup is likely to be suitable when there are ten nodes, each with one GPU, and each node has two sockets.
mpirun -np 20 gmx mdrun_mpi -gpu_id 00
Starts gmx mdrun with 20 ranks, and splits the available CPU cores evenly across the ranks, with one OpenMP thread per core. This setup is likely to be suitable when there are ten nodes, each with one GPU, and each node has two sockets.
mpirun -np 20 gmx mdrun_mpi -gpu_id 01
Starts gmx mdrun with 20 ranks. This setup is likely to be suitable when there are ten nodes, each with two GPUs.
mpirun -np 40 gmx mdrun_mpi -gpu_id 0011
Starts gmx mdrun with 40 ranks. This setup is likely to be suitable when there are ten nodes, each with two GPUs, and OpenMP performs poorly on the hardware.
This section lists all the options that affect how the domain decomposition algorithm decomposes the workload across the available parallel hardware.
TODO In future patch: red flags in log files, how to interpret wallcycle output
TODO In future patch: import wiki page stuff on performance checklist; maybe here, maybe elsewhere
TODO In future patch: any tips not covered above
The current version works with GCN-based AMD GPUs and NVIDIA CUDA GPUs. Make sure that you have the latest drivers installed. The minimum OpenCL version required is 1.1. See also the known limitations.
The same -gpu_id option (or GMX_GPU_ID environment variable) that is used to select CUDA devices, or to define a mapping of GPUs to PP ranks, also works for OpenCL devices.
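For example, either of the following restricts a single-rank run to the device with ID 1; the ID is a placeholder, so use the one reported in your log file:
gmx mdrun -ntmpi 1 -gpu_id 1
GMX_GPU_ID=1 gmx mdrun -ntmpi 1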
Building the OpenCL program can take a few seconds when gmx mdrun starts up, because the kernels that run on the GPU can only be compiled at run time. This is not normally a problem for long production MD, but you might prefer to do some kinds of work on just the CPU (e.g. see -nb above).
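For instance, a short run that skips the GPU (and hence the kernel compilation) entirely can be forced onto the CPU:
gmx mdrun -nb cpu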
Some other OpenCL management environment variables may be of interest to developers.
Limitations in the current OpenCL support of interest to GROMACS users:
Limitations of interest to GROMACS developers: