Performance improvements
^^^^^^^^^^^^^^^^^^^^^^^^

GPU improvements
^^^^^^^^^^^^^^^^

In addition to the improvements noted below, minor optimizations contribute
up to a 5% increase in CUDA performance, so depending on parameters and compiler
a 5-20% GPU kernel performance increase is expected.
These benefits are seen with CUDA 7.5 (which is now the recommended version);
certain older versions (e.g. 7.0) see even larger improvements.

Even larger improvements in OpenCL performance on AMD devices are
expected, e.g. >50% with RF or plain cut-off and PME with potential shift
when using recent AMD OpenCL compilers.

Note that, due to limitations of the NVIDIA OpenCL compiler, CUDA still gives
superior performance on NVIDIA GPUs. Hence, CUDA-based GPU acceleration is
recommended on NVIDIA hardware.


Improved support for OpenCL devices
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
The OpenCL support is now fully compatible with all intra- and
inter-node parallelization modes, including MPI, thread-MPI, and GPU
sharing by PP ranks. (The previous limitations were caused by bugs in high-level
GROMACS code.)

Additionally, some prefetching in the short-ranged kernels (similar to
that in the CUDA code) that had previously been disabled was found to be
useful after all.

Added Lennard-Jones combination-rule kernels for GPUs
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Implemented LJ combination-rule parameter lookup in the CUDA and
OpenCL kernels for both geometric and Lorentz-Berthelot combination
rules, and enabled it for plain LJ cut-off. This optimization was
already present in the CPU kernels. This improves performance with
e.g. OPLS, GROMOS and AMBER force fields by about 10-15% (but does not
help with CHARMM force fields because they use force-switched kernels).
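
As a rough illustration of the idea (not the actual GROMACS kernel code, and
with hypothetical names), the pair parameters can be built on the fly from
per-atom parameters instead of being loaded from a full pair table:

.. code-block:: cpp

    #include <cmath>

    // Illustrative sketch only: build C6/C12 pair parameters on the fly from
    // per-atom data instead of looking up a full NxN parameter table.
    struct LJParams
    {
        float c6, c12;
    };

    // Geometric rule: c6_ij = sqrt(c6_i*c6_j), c12_ij = sqrt(c12_i*c12_j).
    static LJParams combineGeometric(float c6i, float c12i, float c6j, float c12j)
    {
        return { std::sqrt(c6i * c6j), std::sqrt(c12i * c12j) };
    }

    // Lorentz-Berthelot rule: sigma_ij = (sigma_i + sigma_j)/2,
    // epsilon_ij = sqrt(epsilon_i*epsilon_j), then C6 = 4*eps*sigma^6 and
    // C12 = 4*eps*sigma^12.
    static LJParams combineLorentzBerthelot(float sigI, float epsI, float sigJ, float epsJ)
    {
        const float sigma  = 0.5f * (sigI + sigJ);
        const float eps    = std::sqrt(epsI * epsJ);
        const float sigma6 = sigma * sigma * sigma * sigma * sigma * sigma;
        return { 4.0f * eps * sigma6, 4.0f * eps * sigma6 * sigma6 };
    }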

Added support for CUDA CC 6.0/6.1
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Added build-system and kernel-generator support for the Pascal
architectures announced so far (GP100: 6.0, GP104: 6.1) and supported
by the CUDA 8.0 compiler.

By default we now generate binary code for both sm_60 and sm_61 and, given
the considerable differences between the two, also embed PTX for both
virtual architectures. For now we do not add CC 6.2 (GP102) compilation
support, as nothing is known about it yet.

On the kernel-generation side, given the increased register file, the
"wider" 128 threads/block kernels are enabled for CC 6.0, while on 6.1 and
later 64 threads/block is kept.
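
A minimal sketch of this kind of launch-width selection (the function name is
hypothetical, not the actual GROMACS code):

.. code-block:: cpp

    // Pick the non-bonded kernel block width from the CUDA compute capability.
    // CC 6.0 (GP100) has a register file large enough to make the "wider"
    // 128-thread blocks pay off; CC 6.1 and later keep 64 threads/block.
    static int chooseThreadsPerBlock(int ccMajor, int ccMinor)
    {
        if (ccMajor == 6 && ccMinor == 0)
        {
            return 128;
        }
        return 64;
    }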

Improved GPU pair-list splitting to improve performance
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Instead of splitting the GPU lists (to generate more work units) based
on a maximum cut-off, we now generate lists as close to the target
list size as possible. The heuristic estimate for the number of
cluster pairs is now too high by 0-1% instead of 10%. This results in
a few percent fewer pair lists, but still slightly more than
requested.
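
A minimal sketch of such a splitting heuristic, assuming a simple rounding of
the estimated pair count to the target list size (names are illustrative):

.. code-block:: cpp

    #include <algorithm>
    #include <cmath>

    // Choose how many sub-lists to split a GPU pair list into so that each
    // sub-list ends up as close as possible to the target list size.
    static int chooseNumSplits(double estimatedClusterPairs, double targetListSize)
    {
        const int numSplits = static_cast<int>(std::round(estimatedClusterPairs / targetListSize));
        return std::max(1, numSplits);
    }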

Improved CUDA GPU memory configuration
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
The CUDA kernels now make use of the larger amount of L1 cache
available for global load caching on hardware that supports it (K40,
K80, Tegra K1, and CC 5.2) by passing the appropriate command-line
option ("-dlcm=ca") to the compiler.

:issue:`1804`

Automatic nstlist changes were tuned for Intel Knights Landing
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

CPU improvements
^^^^^^^^^^^^^^^^

These improvements to individual kernels provide incremental
improvements to CPU performance for simulations where they are active,
but their value is much higher for simulations using GPU offload,
because, via the auto-tuning, they allow overall resource
utilization and throughput to increase.

Optimized the bonded thread force reduction
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
The multi-threaded bonded-interaction code has to combine the per-thread
forces afterwards. This reduction now uses fixed-size blocks of 32
atoms, and instead of dividing the reduction of the whole range of blocks
uniformly over the threads, only the blocks that are actually used are
divided (uniformly) over the threads. This speeds up the reduction by a
factor equal to the number of threads (!) for typical protein+water systems
when not using domain decomposition. With domain decomposition, the
speed-up is up to a factor of 3.
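
The scheme is roughly the following (an illustrative sketch, not the actual
GROMACS implementation; it assumes the force buffers are padded to a multiple
of the block size):

.. code-block:: cpp

    #include <vector>

    struct Vec3
    {
        float x, y, z;
    };

    constexpr int c_blockSize = 32; // atoms per reduction block

    // Reduce per-thread bonded force buffers into the main force array,
    // dividing only the blocks that were actually written to over the threads.
    static void reduceBondedForces(const std::vector<std::vector<Vec3>>& threadForces,
                                   const std::vector<int>&               usedBlocks,
                                   std::vector<Vec3>*                    forces)
    {
    #pragma omp parallel for schedule(static)
        for (int b = 0; b < static_cast<int>(usedBlocks.size()); b++)
        {
            const int atomStart = usedBlocks[b] * c_blockSize;
            for (int a = atomStart; a < atomStart + c_blockSize; a++)
            {
                for (const auto& threadForce : threadForces)
                {
                    (*forces)[a].x += threadForce[a].x;
                    (*forces)[a].y += threadForce[a].y;
                    (*forces)[a].z += threadForce[a].z;
                }
            }
        }
    }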

Used SIMD transpose-scatter in bonded force reduction
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
The angle and dihedral SIMD functions now use the SIMD transpose-scatter
functions for force reduction. This change gives a substantial performance
improvement for bonded interactions, mainly because the dihedral
force update previously performed many vector operations without SIMD,
which are now fully replaced by SIMD operations.

Added SIMD implementation of Lennard-Jones 1-4 interactions
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""-
The gives a few factors speed improvement. The main improvement comes
from simplified analytical LJ instead of tables; SIMD helps a bit.

Added SIMD implementation of SETTLE
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
On Haswell CPUs, this makes SETTLE a factor of 5 faster.

Added SIMD support for routines that do periodic boundary coordinate transformations
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Threading improvements
^^^^^^^^^^^^^^^^^^^^^^

These improvements enhance the performance of code that runs over
multiple CPU threads.

Improved Verlet-scheme pair-list workload balancing
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Implemented near-perfect load balancing for Verlet-scheme CPU
pair lists. This increases the search cost by 3%, but that is
outweighed by the more balanced non-bonded kernel times, particularly
for small systems.

Improved the threading of virtual-site code
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
With many threads, a significant fraction of the virtual sites would
previously end up in the separate serial task, thereby limiting scaling.
Now two weakly dependent tasks are generated for each thread and one of
them uses a thread-local force buffer, parts of which are reduced by the
different threads responsible for those parts.

The setup now also runs multi-threaded.

Add OpenMP support to more loops
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
Loops over all atoms caused a significant amount of serial time when
running with a large number of threads, which limited scaling; more of
these loops now have OpenMP parallel versions.

Add OpenMP parallelization for the pull code
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
The pull code could take up to a third of the compute time for OpenMP-parallel
simulations with large pull groups.
Now all pull-code loops over atoms have an OpenMP parallel version.
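
The pattern is the usual OpenMP loop over the atoms of a pull group; a minimal
sketch (illustrative only, not the actual pull-code source) computing a
mass-weighted center of mass:

.. code-block:: cpp

    #include <vector>

    struct Vec3
    {
        double x, y, z;
    };

    // Mass-weighted center of mass of one pull group, with the loop over the
    // group's atoms parallelized with OpenMP.
    static Vec3 pullGroupCenterOfMass(const std::vector<Vec3>&   x,
                                      const std::vector<double>& mass,
                                      const std::vector<int>&    groupAtoms)
    {
        double cx = 0, cy = 0, cz = 0, mTot = 0;
    #pragma omp parallel for reduction(+ : cx, cy, cz, mTot) schedule(static)
        for (int i = 0; i < static_cast<int>(groupAtoms.size()); i++)
        {
            const int a = groupAtoms[i];
            cx += mass[a] * x[a].x;
            cy += mass[a] * x[a].y;
            cz += mass[a] * x[a].z;
            mTot += mass[a];
        }
        return { cx / mTot, cy / mTot, cz / mTot };
    }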

Other improvements
^^^^^^^^^^^^^^^^^^

Multi-simulations are coupled less frequently
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
For example, replica-exchange simulations now communicate between simulations
only at exchange attempts, and plain multi-simulations do not communicate
between simulations at all. Overall performance will tend to improve whenever
the progress of one simulation might be faster than that of the others
(e.g. it is at a different pressure, or using a quieter part of the network).