Getting good performance from mdrun
===================================

The |Gromacs| build system and the :ref:`gmx mdrun` tool have a lot of built-in
and configurable intelligence to detect your hardware and make pretty effective
use of that hardware. For a lot of casual and serious use of :ref:`gmx mdrun`,
the automatic machinery works well enough. But to get the most from your
hardware to maximize your scientific quality, read on!

Hardware background information
-------------------------------

Modern computer hardware is complex and heterogeneous, so we need to discuss a
little bit of background information and set up some definitions. Experienced
HPC users can skip this section.

.. glossary::

    core
        A hardware compute unit that actually executes instructions. There is
        normally more than one core in a processor, often many more.

    cache
        A special kind of memory local to core(s) that is much faster to
        access than main memory, kind of like the top of a human's desk,
        compared to their filing cabinet. There are often several layers of
        caches associated with a core.

    socket
        A group of cores that share some kind of locality, such as a shared
        cache. This makes it more efficient to spread computational work over
        cores within a socket than over cores in different sockets. Modern
        compute nodes often have more than one socket.

    node
        A group of sockets that share coarser-level locality, such as shared
        access to the same memory without requiring any network hardware. A
        normal laptop or desktop computer is a node. A node is often the
        smallest amount of a large compute cluster that a user can request to
        use.

    thread
        A stream of instructions for a core to execute. There are many
        different programming abstractions that create and manage spreading
        computation over multiple threads, such as OpenMP, pthreads,
        winthreads, CUDA, OpenCL, and OpenACC. Some kinds of hardware can map
        more than one software thread to a core; on Intel x86 processors this
        is called "hyper-threading", while the more general concept is often
        called SMT for "simultaneous multi-threading". For instance, IBM
        POWER8 can use up to 8 hardware threads per core. This feature can
        usually be enabled or disabled either in the hardware BIOS or through
        a setting in the Linux operating system. |Gromacs| can typically make
        use of this, for a moderate free performance boost. In most cases it
        will be enabled by default, e.g. on new x86 processors, but in some
        cases the system administrators might have disabled it. If that is the
        case, ask if they can re-enable it for you. If you are not sure
        whether it is enabled, check the output of the CPU information in the
        log file and compare with CPU specifications you find online; see also
        the short example at the end of this section.

    thread affinity (pinning)
        By default, most operating systems allow software threads to migrate
        between cores (or hardware threads) to help automatically balance the
        workload. However, the performance of :ref:`gmx mdrun` can deteriorate
        if this is permitted, and will degrade dramatically especially when
        relying on multi-threading within a rank. To avoid this,
        :ref:`gmx mdrun` will by default set the affinity of its threads to
        individual cores/hardware threads, unless the user or software
        environment has already done so (or unless the run does not use the
        entire node, i.e. there is potential for node sharing). Setting thread
        affinity is sometimes called thread "pinning".

    MPI
        The dominant multi-node parallelization scheme, which provides a
        standardized language in which programs can be written that work
        across more than one node.

    rank
        In MPI, a rank is the smallest grouping of hardware used in the
        multi-node parallelization scheme. That grouping can be controlled by
        the user, and might correspond to a core, a socket, a node, or a group
        of nodes. The best choice varies with the hardware, software and
        compute task. Sometimes an MPI rank is called an MPI process.

    GPU
        A graphics processing unit, which is often faster and more efficient
        than conventional processors for particular kinds of compute
        workloads. A GPU is always associated with a particular node, and
        often a particular socket within that node.

    OpenMP
        A standardized technique supported by many compilers to share a
        compute workload over multiple cores. Often combined with MPI to
        achieve hybrid MPI/OpenMP parallelism.

    CUDA
        A proprietary parallel computing framework and API developed by NVIDIA
        that allows targeting their accelerator hardware. |Gromacs| uses CUDA
        for GPU acceleration support with NVIDIA hardware.

    OpenCL
        An open standard-based parallel computing framework that consists of a
        C99-based compiler and a programming API for targeting heterogeneous
        and accelerator hardware. |Gromacs| uses OpenCL for GPU acceleration
        on AMD devices (both GPUs and APUs); NVIDIA hardware is also
        supported.

    SIMD
        Modern CPU cores have instructions that can execute many
        floating-point operations in a single cycle.
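As a concrete illustration of SMT and thread pinning, something like the
following can be used on a typical Linux node. This is only a sketch: the
``lscpu`` output format depends on your system, and the ``topol`` input name
is a placeholder::

    # More than one "Thread(s) per core" means SMT/hyper-threading is enabled.
    lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'

    # Ask mdrun to pin its threads explicitly (it already does this by
    # default when it detects that it is using the whole node).
    gmx mdrun -pin on -deffnm topol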
|Gromacs| background information
--------------------------------

The algorithms in :ref:`gmx mdrun` and their implementations are most relevant
when choosing how to make good use of the hardware. For details, see the
Reference Manual. The most important of these are

.. glossary::

    Domain Decomposition
        The domain decomposition (DD) algorithm decomposes the (short-ranged)
        component of the non-bonded interactions into domains that share
        spatial locality, which permits the use of efficient algorithms. Each
        domain handles all of the particle-particle (PP) interactions for its
        members, and is mapped to a single MPI rank. Within a PP rank, OpenMP
        threads can share the workload, and some work can be off-loaded to a
        GPU. The PP rank also handles any bonded interactions for the members
        of its domain. A GPU may perform work for more than one PP rank, but
        it is normally most efficient to use a single PP rank per GPU and for
        that rank to have thousands of particles. When the work of a PP rank
        is done on the CPU, mdrun will make extensive use of the SIMD
        capabilities of the core.

Running the OpenCL version of mdrun
-----------------------------------

Devices from the AMD GCN architectures (all series) and NVIDIA Fermi and later
(compute capability 2.0) are known to work, but before doing production runs
always make sure that the |Gromacs| tests pass successfully on the hardware.

The OpenCL GPU kernels are compiled at run time. Hence, building the OpenCL
program can take a few seconds, introducing a slight delay in the
:ref:`gmx mdrun` startup. This is not normally a problem for long production
MD, but you might prefer to do some kinds of work, e.g. jobs that run very few
steps, on just the CPU (e.g. see ``-nb`` above).

The same ``-gpu_id`` option (or ``GMX_GPU_ID`` environment variable) used to
select CUDA devices, or to define a mapping of GPUs to PP ranks, is used for
OpenCL devices. Some other :ref:`OpenCL management` environment variables may
be of interest to developers.
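For example, on a node with two GPUs, the mapping of GPUs to PP ranks might be
set as in the sketch below. The rank and thread counts and the ``topol`` file
name are placeholders for illustration, and the exact options accepted depend
on your |Gromacs| version and build (``-ntmpi`` requires a thread-MPI build)::

    # Illustrative only: map two PP (thread-MPI) ranks to GPUs 0 and 1,
    # with four OpenMP threads per rank.
    gmx mdrun -ntmpi 2 -ntomp 4 -gpu_id 01 -deffnm topol

    # Equivalent device selection through the environment:
    export GMX_GPU_ID=01
    gmx mdrun -ntmpi 2 -ntomp 4 -deffnm topol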
.. _opencl-known-limitations:

Known limitations of the OpenCL support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Limitations in the current OpenCL support of interest to |Gromacs| users:

- No Intel devices (CPUs, GPUs or Xeon Phi) are supported
- Due to blocking behavior of some asynchronous task enqueuing functions in
  the NVIDIA OpenCL runtime, with the affected driver versions there is almost
  no performance gain when using NVIDIA GPUs. The issue affects NVIDIA driver
  versions up to the 349 series, but it is known to be fixed in the 352 series
  and later driver releases.
- On NVIDIA GPUs the OpenCL kernels achieve much lower performance than the
  equivalent CUDA kernels due to limitations of the NVIDIA OpenCL compiler.
- The AMD APPSDK version 3.0 ships with OpenCL compiler/runtime components,
  libamdocl12cl64.so and libamdocl64.so (only in earlier releases), that
  conflict with newer fglrx GPU drivers which provide the same libraries. This
  conflict manifests in kernel launch failures because, due to the library
  path setup, the OpenCL runtime loads the APPSDK version of the
  aforementioned libraries instead of the ones provided by the driver
  installer. The recommended workaround is to remove or rename the APPSDK
  versions of the offending libraries.

Limitations of interest to |Gromacs| developers:

- The current implementation is not compatible with OpenCL devices that do not
  use warps/wavefronts, or for which the warp/wavefront size is not a multiple
  of 32
- Some Ewald tabulated kernels are known to produce incorrect results, so
  (correct) analytical kernels are used instead.

Performance checklist
---------------------

There are many different aspects that affect the performance of simulations in
|Gromacs|. Most simulations require a lot of computational resources, so it
can be worthwhile to optimize the use of those resources. Several issues
mentioned in the list below could lead to a performance difference of a factor
of 2, so it can be useful to go through the checklist.

|Gromacs| configuration
^^^^^^^^^^^^^^^^^^^^^^^

* Don't use double precision unless you're absolutely sure you need it.
* Compile the FFTW library (yourself) with the correct flags on x86 (in most
  cases, the correct flags are automatically configured).
* On x86, use gcc or icc as the compiler (not pgi or the Cray compiler).
* On POWER, use gcc instead of IBM's xlc.
* Use a new compiler version, especially for gcc (e.g. from version 5 to 6 the
  performance of the compiled code improved a lot).
* MPI library: OpenMPI usually has good performance and causes little trouble.
* Make sure your compiler supports OpenMP (some versions of Clang don't).
* If you have GPUs that support either CUDA or OpenCL, use them.
* Configure with ``-DGMX_GPU=ON`` (add ``-DGMX_USE_OPENCL=ON`` for OpenCL); an
  example configure command is sketched after this list.
* For CUDA, use the newest CUDA available for your GPU to take advantage of
  the latest performance enhancements.
* Use a recent GPU driver.
* If compiling on a cluster head node, make sure that ``GMX_SIMD`` is
  appropriate for the compute nodes.
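A typical configure command touching on the points above might look like the
following. This is only a sketch, not a complete recipe: the compiler names,
source directory and install prefix are placeholders, and only
``-DGMX_GPU=ON``/``-DGMX_USE_OPENCL=ON`` are taken from the list above::

    # From a build directory next to the unpacked source (paths are placeholders).
    # Add -DGMX_USE_OPENCL=ON to build the OpenCL version instead of CUDA.
    cmake ../gromacs-source \
          -DCMAKE_C_COMPILER=gcc \
          -DCMAKE_CXX_COMPILER=g++ \
          -DGMX_GPU=ON \
          -DCMAKE_INSTALL_PREFIX=$HOME/gromacs
    make -j 8
    make check
    make install

Running ``make check`` before installing also covers the earlier advice to
make sure the |Gromacs| tests pass on your hardware.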
Run setup
^^^^^^^^^

* For an approximately spherical solute, use a rhombic dodecahedron unit cell.
* When using a time-step of 2 fs, use :mdp:`constraints` = :mdp:`h-bonds` (and
  not :mdp:`all-bonds`), since this is faster, especially with GPUs, and most
  force fields have been parametrized with only bonds involving hydrogens
  constrained.
* You can increase the time-step to 4 or 5 fs when using virtual interaction
  sites (``gmx pdb2gmx -vsite h``).
* For massively parallel runs with PME, you might need to try different
  numbers of PME ranks (``gmx mdrun -npme ???``) to achieve best performance;
  ``gmx tune_pme`` can help automate this search.
* For massively parallel runs (also ``gmx mdrun -multidir``), or with a slow
  network, global communication can become a bottleneck and you can reduce it
  with ``gmx mdrun -gcom`` (note that this does affect the frequency of
  temperature and pressure coupling).

Checking and improving performance
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* Look at the end of the ``md.log`` file to see the performance and the cycle
  counters and wall-clock time for different parts of the MD calculation. The
  PP/PME load ratio is also printed, with a warning when a lot of performance
  is lost due to imbalance.
* Adjust the number of PME ranks and/or the cut-off and PME grid-spacing when
  there is a large PP/PME imbalance. Note that even with a small reported
  imbalance, the automated PME-tuning might have reduced the initial
  imbalance. You could still gain performance by changing the mdp parameters
  or increasing the number of PME ranks.
* If the neighbor searching takes a lot of time, increase nstlist (with the
  Verlet cut-off scheme, this automatically adjusts the size of the neighbor
  list to do more non-bonded computation to keep energy drift constant).
* If ``Comm. energies`` takes a lot of time (a note will be printed in the log
  file), increase nstcalcenergy or use ``mdrun -gcom``.
* If all communication takes a lot of time, you might be running on too many
  cores, or you could try running combined MPI/OpenMP parallelization with 2
  or 4 OpenMP threads per MPI process, as sketched in the example after this
  list.
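As an illustration of the last point, a hybrid MPI/OpenMP run on a cluster
might be launched as sketched below. The launcher syntax, the MPI-enabled
binary name (here ``gmx_mpi``) and the rank/thread counts are placeholders
that depend on your build and your machine::

    # Illustrative only: 16 MPI ranks with 4 OpenMP threads each (64 cores in
    # total), of which 4 ranks are dedicated to PME.
    export OMP_NUM_THREADS=4
    mpirun -np 16 gmx_mpi mdrun -ntomp 4 -npme 4 -deffnm topol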