Getting good performance from mdrun
===================================

The GROMACS build system and the :ref:`gmx mdrun` tool have a lot of built-in
and configurable intelligence to detect your hardware and make pretty
effective use of it. For a lot of casual and serious use of :ref:`gmx mdrun`,
the automatic machinery works well enough. But to get the most from your
hardware and maximize your scientific quality, read on!

Hardware background information
-------------------------------

Modern computer hardware is complex and heterogeneous, so we need to discuss
a little bit of background information and set up some definitions.
Experienced HPC users can skip this section.

.. glossary::

    core
        A hardware compute unit that actually executes instructions. There is
        normally more than one core in a processor, often many more.

    cache
        A special kind of memory local to core(s) that is much faster to
        access than main memory, kind of like the top of a human's desk,
        compared to their filing cabinet. There are often several layers of
        caches associated with a core.

    socket
        A group of cores that share some kind of locality, such as a shared
        cache. This makes it more efficient to spread computational work over
        cores within a socket than over cores in different sockets. A node
        often has more than one socket.

    node
        A group of sockets that share coarser-level locality, such as shared
        access to the same memory without requiring any network hardware. A
        normal laptop or desktop computer is a node. A node is often the
        smallest amount of a large compute cluster that a user can request
        to use.

    thread
        A stream of instructions for a core to execute. There are many
        different programming abstractions that create and manage spreading
        computation over multiple threads, such as OpenMP, pthreads,
        winthreads, CUDA, OpenCL, and OpenACC. Some kinds of hardware can map
        more than one software thread to a core; on Intel x86 processors this
        is called "hyper-threading." Normally, :ref:`gmx mdrun` will not
        benefit from such mapping.

    affinity
        On some kinds of hardware, software threads can migrate between cores
        to help automatically balance workload. Normally, the performance of
        :ref:`gmx mdrun` will degrade dramatically if this is permitted, so
        :ref:`gmx mdrun` will by default set the affinity of its threads to
        their cores, unless the user or software environment has already done
        so. Setting thread affinity is sometimes called "pinning" threads to
        cores (see the example following this glossary).

    MPI
        The dominant multi-node parallelization scheme, which provides a
        standardized language in which programs can be written that work
        across more than one node.

    rank
        In MPI, a rank is the smallest grouping of hardware used in the
        multi-node parallelization scheme. That grouping can be controlled by
        the user, and might correspond to a core, a socket, a node, or a
        group of nodes. The best choice varies with the hardware, software
        and compute task. Sometimes an MPI rank is called an MPI process.

    GPU
        A graphics processing unit, which is often faster and more efficient
        than conventional processors for particular kinds of compute
        workloads. A GPU is always associated with a particular node, and
        often a particular socket within that node.

    OpenMP
        A standardized technique supported by many compilers to share a
        compute workload over multiple cores. Often combined with MPI to
        achieve hybrid MPI/OpenMP parallelism.

    CUDA
        A programming-language extension developed by NVIDIA for use in
        writing code for their GPUs.

    SIMD
        Modern CPU cores have instructions that can execute many
        floating-point operations in a single cycle.

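As a minimal illustration of how these terms map onto :ref:`gmx mdrun`
command lines, the options below request explicit thread counts and thread
pinning on a single node. The numbers are arbitrary examples that assume a
node with eight cores and no hyper-threading; they are not recommendations,
and suitable values depend on your hardware::

    # Use 2 thread-MPI ranks with 4 OpenMP threads each, and pin
    # (set the affinity of) the resulting 8 threads to their cores.
    gmx mdrun -ntmpi 2 -ntomp 4 -pin on

    # Run a second, independent simulation on the same node, offsetting
    # its pinned threads so the two runs do not share cores.
    gmx mdrun -ntmpi 2 -ntomp 4 -pin on -pinoffset 8
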
GROMACS background information
------------------------------

The algorithms in :ref:`gmx mdrun` and their implementations are most
relevant when choosing how to make good use of the hardware. For details, see
the Reference Manual. The most important of these are:

.. glossary::

    Domain Decomposition
        The domain decomposition (DD) algorithm decomposes the (short-ranged)
        component of the non-bonded interactions into domains that share
        spatial locality, which permits efficient code to be written. Each
        domain handles all of the particle-particle (PP) interactions for its
        members, and is mapped to a single rank. Within a PP rank, OpenMP
        threads can share the workload, or the work can be off-loaded to a
        GPU. The PP rank also handles any bonded interactions for the members
        of its domain. A GPU may perform work for more than one PP rank, but
        it is normally most efficient to use a single PP rank per GPU and for
        that rank to have thousands of particles. When the work of a PP rank
        is done on the CPU, mdrun will make extensive use of the SIMD
        capabilities of the core. There are various command-line options to
        control the behaviour of the DD algorithm.

Running the OpenCL version of mdrun
-----------------------------------

The same ``-gpu_id`` option (or ``GMX_GPU_ID`` environment variable) used to
select CUDA devices, or to define a mapping of GPUs to PP ranks, is used for
OpenCL devices (see the example at the end of this page). The following
devices are known to work correctly:

- AMD: FirePro W5100, HD 7950, FirePro W9100, Radeon R7 240, Radeon R7 M260,
  Radeon R9 290
- NVIDIA: GeForce GTX 660M, GeForce GTX 660Ti, GeForce GTX 750Ti, GeForce
  GTX 780, GTX Titan

Building the OpenCL program can take a few seconds when :ref:`gmx mdrun`
starts up, because the kernels that run on the GPU can only be compiled at
run time. This is not normally a problem for long production MD, but you
might prefer to do some kinds of work on just the CPU (e.g. see ``-nb``
above). Some other OpenCL management environment variables may be of
interest to developers.

.. _opencl-known-limitations:

Known limitations of the OpenCL support
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Limitations in the current OpenCL support of interest to |Gromacs| users:

- Using more than one GPU on a node is supported only with thread MPI
- Sharing a GPU between multiple PP ranks is not supported
- No Intel devices (CPUs, GPUs or Xeon Phi) are supported
- Due to blocking behavior of some asynchronous task enqueuing functions in
  the NVIDIA OpenCL runtime, with the affected driver versions there is
  almost no performance gain when using NVIDIA GPUs. The issue affects NVIDIA
  driver versions up to the 349 series, but it is known to be fixed in 352
  and later driver releases.

Limitations of interest to |Gromacs| developers:

- The current implementation is not compatible with OpenCL devices that do
  not use warps/wavefronts, or for which the warp/wavefront size is not a
  multiple of 32
- Some Ewald tabulated kernels are known to produce incorrect results, so
  (correct) analytical kernels are used instead.

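To make the ``-gpu_id`` usage described earlier on this page concrete, here
is a minimal sketch of typical single-node command lines. It assumes a
machine with two compatible GPUs; the device ids and rank counts are
arbitrary examples, not recommendations::

    # Let mdrun detect compatible GPUs and assign them to PP ranks
    # automatically.
    gmx mdrun

    # Start two thread-MPI (PP) ranks and map them to GPU ids 0 and 1.
    gmx mdrun -ntmpi 2 -gpu_id 01

    # The same mapping, expressed through the environment variable.
    GMX_GPU_ID=01 gmx mdrun -ntmpi 2

    # Keep the short-ranged non-bonded work on the CPU, e.g. to avoid
    # the run-time compilation of the OpenCL kernels.
    gmx mdrun -nb cpu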