Performance improvements¶
GPU direct communication is the default with CUDA and thread-MPI¶
GPU-enabled multi-rank runs using CUDA and thread-MPI now default to direct GPU communication for halo exchange and/or PP-PME communication. This can be disabled by setting the GMX_DISABLE_DIRECT_GPU_COMM environment variable, which returns to the previous default of staged communication. Regular (library) MPI runs still use staged communication by default.
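For example, a thread-MPI run could be switched back to staged communication as in the following sketch (the file name and the rank/offload options are placeholders, not recommendations):

    GMX_DISABLE_DIRECT_GPU_COMM=1 gmx mdrun -deffnm md -ntmpi 4 -nb gpu -pme gpu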
Dynamic pairlist generation for energy minimization¶
With energy minimization, pairlist construction, and domain decomposition when running in parallel, are now only performed when at least one atom has moved more than half the pairlist buffer size. Previously, the pairlist was constructed every step.
Nonbonded free-energy kernels use SIMD¶
Free energy calculation performance is improved by making the nonbonded free-energy kernels SIMD accelerated. On AVX2-256 these kernels are 4 to 8 times as fast. This should give a noticeable speed-up for most systems, especially if the perturbed interaction calculations were a bottleneck. This is particularly the case when using GPUs, where the performance improvement of free-energy runs is up to a factor of 3.
PME-PP GPU Direct Communication Pipelining¶
For multi-GPU runs with direct PME-PP GPU communication enabled, the PME rank can now pipeline the coordinate transfers with computation in the PME Spread and Spline kernel (where the coordinates are consumed). The data from each transfer is handled separately, allowing computation and communication to be overlapped. This is expected to be of most benefit on systems where hardware communication interfaces are shared between multiple GPUs, e.g. PCIe within multi-GPU servers or InfiniBand across multiple nodes.
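For reference, the pipelining applies to runs with a separate PME rank, such as the following sketch of a thread-MPI launch where direct GPU communication is on by default (option values and file name are illustrative only):

    gmx mdrun -deffnm md -ntmpi 4 -npme 1 -nb gpu -pme gpu -bonded gpu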
Domain decomposition with single MPI rank¶
When running with a single MPI rank with PME and without a GPU, mdrun will now use the domain decomposition machinery to reorder particles. This can improve performance, especially for large systems. This behavior can be controlled with the environment variable GMX_DD_SINGLE_RANK.
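For example, assuming the variable takes a 0/1 value (an assumption here, not stated in this note), the reordering could be switched off for comparison:

    GMX_DD_SINGLE_RANK=0 gmx mdrun -deffnm md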
gmx grompp now runs 20-50% faster¶
After a series of improvements, the loops in the parameter- and atom-lookup code in gmx grompp have been transformed to run faster while using simpler, standard code idioms.
PME decomposition support in mixed mode with CUDA and process-MPI¶
PME decomposition is now supported in mixed mode with the CUDA backend. This is supported only when GROMACS is compiled with external process-MPI and the underlying MPI implementation is CUDA-aware. This feature lacks substantial testing and is therefore disabled by default, but it can be enabled by setting the GMX_GPU_PME_DECOMPOSITION=1 environment variable.
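A sketch of how this could be enabled for a CUDA-aware MPI build (the launcher, binary name, rank counts, and file name are illustrative; the environment variable must be exported in a way that reaches all ranks, which depends on the MPI launcher):

    export GMX_GPU_PME_DECOMPOSITION=1
    mpirun -np 8 gmx_mpi mdrun -deffnm md -npme 2 -pme gpu -nb gpu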