Improvements to mdrun performance
=================================

Improved default threading configuration
---------------------------------------------
``mdrun`` now chooses better default settings for thread-MPI and/or OpenMP
on (particularly) x86 hardware, with and without GPUs.

Made it harder to use slow parallelism setups
------------------------------------------------
``mdrun`` now issues fatal errors when the number of OpenMP threads is very
likely to be too high for good performance. Using more MPI ranks or fewer
cores is the correct approach in such cases. Such setups can still be
requested explicitly with ``gmx mdrun -ntomp``, in which case only a warning
is issued.

Improved performance auto-tuning with GPUs
---------------------------------------------
With GPUs and domain decomposition, dynamic load balancing (DLB) can
sometimes leave too little room for PME load balancing. In such cases (and
only with ``gmx mdrun -dlb auto``), we now first do PME load balancing
without DLB and then, if DLB gets turned on, a second round of PME load
balancing. Also fixed an issue where the fastest choice was reset when DLB
limited the tuning, which would often lead to even stronger limitations.

Added kernel and compiler support for latest CUDA CC 3.7 devices
------------------------------------------------------------------
Added support for the NVIDIA Tesla K80 and Maxwell-architecture GeForce GPUs
such as the GTX 9x0 series (compute capability 3.7 and 5.x). On such GPUs we
can make use of the larger register file by running 128 threads per block
while keeping the minimum number of blocks per multiprocessor at 16.

Added optional NVIDIA Management Library (NVML) integration
--------------------------------------------------------------
NVML integration allows control of GPU Boost from GROMACS directly on
supported GPUs. With this, GROMACS either changes clock rates automatically
to the best settings, or informs the user how to do it (if permissions do
not allow the executable to change clock speeds).

Allowed increasing CUDA thread block size
---------------------------------------------
This change parametrizes the CUDA kernels to allow increasing the number of
threads per block by processing multiple j-clusters concurrently on
additional pairs of warps. The change supports 1-, 2-, and 4-way concurrent
j-cluster processing, resulting in 64, 128, and 256 threads per block,
respectively. Due to register limitations, on current CUDA architectures the
version with 64 threads per block (equivalent to the original kernels) is
fastest; the configurations using 128 and 256 threads are 3-4% and 10-13%
slower on CC 3.5/5.2, respectively.

Added CUDA compiler support for CC 5.0
------------------------------------------
With CUDA 6.5 and later, compute capability 5.0 devices are supported, so we
now generate cubin and PTX for these too and drop the PTX 3.5 target. This
change also removes the explicit optimization for CC 2.1, where sm_20 binary
code runs equally fast as sm_21.

Added support for flushing the WDDM queue on Windows
-------------------------------------------------------
On Windows, the WDDM driver (the default for non-Tesla cards) can delay
submission of CUDA tasks to the GPU in an attempt to amortize driver
overheads. However, because we need tasks to start immediately for optimal
concurrent execution, this "feature" results in large overheads. We have
therefore implemented the well-documented workaround for it.
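As an illustration (not the exact GROMACS code), the workaround amounts to
querying the stream right after submitting GPU work, which forces the WDDM
driver to flush its software queue. A minimal host-side sketch, assuming the
CUDA runtime API; the helper name is only illustrative::

    #include <cuda_runtime.h>
    #include <cstdio>

    // Submit work to a stream and immediately force its submission to the GPU.
    // On WDDM, cudaStreamQuery() flushes the driver's batching queue so that
    // asynchronous tasks start right away instead of waiting to be batched.
    static void submitAndFlush(cudaStream_t stream,
                               void *d_dst, const void *h_src, size_t bytes)
    {
        // Stand-in for submitting GPU tasks (kernel launches, async copies).
        cudaMemcpyAsync(d_dst, h_src, bytes, cudaMemcpyHostToDevice, stream);

        // The call is made only for its side effect of flushing the queue;
        // cudaErrorNotReady simply means the work is still in flight.
        cudaError_t stat = cudaStreamQuery(stream);
        if (stat != cudaSuccess && stat != cudaErrorNotReady)
        {
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(stat));
        }
    }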
Optimized atomic accumulation in CUDA short-ranged kernels
------------------------------------------------------------
The final atomic accumulation of the three force components can now happen
on three threads concurrently. As this can be serviced by the hardware in a
single instruction, the optimization improves overall performance by a few
percent. It also results in fewer shuffle operations.

Improved pair search thread load balance
------------------------------------------
With very small systems and many OpenMP threads, especially when using GPUs,
some threads can end up without pair search work. Better load balancing
reduces the pair search time. The CPU non-bonded kernel time is also
slightly reduced in the extreme parallelization limit.

Improved the intra-GPU load balancing
----------------------------------------
The splitting of the pair list to improve load balancing on the GPU was
based on the number of generated lists, which was sub-optimal. Now the
splitting is based on the number of pairs in the list, which produces much
more stable results.

Added support for arbitrary number of OpenMP threads
-------------------------------------------------------
This improves performance, particularly for bonded interactions.
:issue:`1386`

Improved performance of bonded interactions from use of SIMD
---------------------------------------------------------------
All supported SIMD-capable CPU architectures now use that functionality when
evaluating Fourier dihedral functions, Ryckaert-Bellemans dihedral
functions, and normal angle functions. The SIMD versions only run on MD
steps where the energy and/or virial is *not* required, so do choose your
.mdp settings according to what you actually need. (Technically, this
functionality was added to 5.0.x during the release phase.)

Added checks for inefficient resource usage
----------------------------------------------
Checks have been added for using too many OpenMP threads, and for using only
a single OpenMP thread when running on GPUs. A fatal error is generated in
cases where we are quite sure performance is very sub-optimal. Thread-MPI
rank counts that do not fit with the total number of threads requested are
now also avoided.

Made it possible to use 1 PP and 1 PME rank
----------------------------------------------
Such a setup can run faster on a single node, particularly where the CPU is
relatively more powerful than the GPU.

Reduced the cost of communication in the pull code
----------------------------------------------------
With more than 32 ranks, a sub-communicator is used for the pull
communication. This significantly reduces the cost of pull communication
with small pull groups. With large pull groups the total simulation
performance might not improve much, because ranks that are not in the
sub-communicator will later wait for the pull ranks during the communication
for the constraints.

Added SIMD acceleration for LINCS
-----------------------------------
Added SIMD acceleration for the LINCS PBC distance calculation and the
right-hand side of the LINCS matrix equation. The sparse matrix
multiplication and atom updates are not suited for SIMD acceleration.
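To illustrate the kind of operation that vectorizes well here, the sketch
below applies the minimum-image correction along one box dimension for eight
single-precision pair distances at once. It uses raw AVX intrinsics and a
rectangular box purely for brevity; the actual GROMACS code goes through its
own SIMD abstraction layer and also handles triclinic boxes::

    #include <immintrin.h>  // requires compilation with AVX enabled (-mavx)

    // Minimum-image correction along one dimension of a rectangular box
    // for 8 pair distances at once:  dx -= box * nearbyint(dx / box)
    static inline __m256 pbc_dx_rect(__m256 xi, __m256 xj, float boxLength)
    {
        const __m256 box    = _mm256_set1_ps(boxLength);
        const __m256 invBox = _mm256_set1_ps(1.0f / boxLength);

        __m256 dx    = _mm256_sub_ps(xi, xj);
        __m256 shift = _mm256_round_ps(_mm256_mul_ps(dx, invBox),
                                       _MM_FROUND_TO_NEAREST_INT |
                                       _MM_FROUND_NO_EXC);
        return _mm256_sub_ps(dx, _mm256_mul_ps(shift, box));
    }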
Improved OpenMP scaling of LINCS
-----------------------------------
With very locally coupled constraints, such as when constraining H-bonds
only, the LINCS OpenMP tasks are now independent. This means that no OpenMP
barriers are required, which can significantly speed up LINCS. Additionally,
triangle constraints (which are present in e.g. OH groups when using
``gmx pdb2gmx -vsite``) are now divided over thread tasks instead of being
done by the master thread only. This slightly improves load balancing and
removes two thread barriers.

Improved SIMD support in LINCS and bonded interactions
---------------------------------------------------------
We have added proper gather/scatter operations that work on 3D vectors for
all SIMD architectures, intended for future GROMACS releases. However, since
we already had code that works on AVX and AVX2, we added that to get the
performance benefits in GROMACS 5.1. This is definitely a hack, and the code
will be replaced once the extended SIMD module is in place.

Cleaned up unused code paths that made some kinds of calculations slow
------------------------------------------------------------------------
Some operations were useful only in certain code paths. Several such
operations have been made conditional, which improves performance at high
parallelism when the operations are not required.

Fixed handling of processors being offline on Arm
----------------------------------------------------
The number of configured, rather than online, CPUs is now used. We will
still get a warning about failures when trying to pin to offline CPUs, which
hurts performance slightly. To address this, we also check whether there is
a mismatch between configured and online processors, and warn the user that
they should force all their processors online for better performance.
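For reference, a minimal sketch of the configured-versus-online distinction,
assuming Linux/glibc ``sysconf``; this is not the GROMACS detection code
itself::

    #include <unistd.h>
    #include <cstdio>

    int main()
    {
        // CPUs physically configured in the system (including offline ones)
        long configured = sysconf(_SC_NPROCESSORS_CONF);
        // CPUs currently online and available for scheduling
        long online     = sysconf(_SC_NPROCESSORS_ONLN);

        if (configured > 0 && online > 0 && online < configured)
        {
            // Some cores are offline (common on Arm with power management);
            // pinning to them will fail, so warn the user.
            fprintf(stderr,
                    "Only %ld of %ld configured CPUs are online; "
                    "bringing all CPUs online may improve performance.\n",
                    online, configured);
        }
        return 0;
    }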