New features
=====================

Added support for OpenCL acceleration
--------------------------------------

StreamComputing (http://www.streamcomputing.eu) has implemented OpenCL
support for accelerating the short-ranged non-bonded interactions,
which GROMACS previously accelerated only with CUDA. Supported OpenCL
1.1 devices include GCN-based AMD GPUs, where performance is good, and
NVIDIA GPUs, where performance with PME is currently poor. Currently
only a one-to-one mapping of PP ranks to GPUs is supported (i.e. with
thread-MPI). With real MPI, only one PP rank per node (and thus one
GPU) is supported. We expect to be able to improve support and
performance in future GROMACS versions.
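
As a rough sketch, an OpenCL-enabled build might be configured as
follows. This assumes the ``GMX_GPU`` and ``GMX_USE_OPENCL`` CMake
options described in the installation guide, which remains the
authoritative reference; the build directory layout is illustrative::

    # run from an empty build directory inside the source tree (assumed layout)
    cmake .. -DGMX_GPU=ON -DGMX_USE_OPENCL=ON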


Added support for inter-molecular bonded interactions
-----------------------------------------------------

The .top file can now contain an ``[ intermolecular_interactions ]``
directive, after which bonded interactions can be entered using
global atom indices. See the topologies chapter in the reference
manual for details.
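
A minimal sketch of what such a section might look like is shown
below. The molecule names, atom indices and parameters are
hypothetical, and placing the directive at the end of the topology,
after ``[ molecules ]``, is an assumption to be checked against the
reference manual. Bond function type 6 is a harmonic potential that
does not generate exclusions::

    [ molecules ]
    ; name        count
    Protein       1
    Ligand        1

    [ intermolecular_interactions ]
    [ bonds ]
    ;  ai     aj   funct   b0 (nm)   kb (kJ mol^-1 nm^-2)
       12   1530       6      0.35   1000.0

Here ``ai`` and ``aj`` are global atom indices that count over the
whole system rather than within a single molecule type.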


Added flat-bottomed potential to pull-coordinate types
------------------------------------------------------

Allowed pull groups of 1 atom with mass 0
------------------------------------------------------------


Unless constraint pulling is used, a pull group consisting of 1 atom
can have mass 0, since the mass of the COM is irrelevant. This is
useful for pulling on a virtual site.


Changed the way pull directions are selected
--------------------------------------------

Pull types and geometries are now selected on a per-coordinate basis,
so that different pull coordinates can have different pull types and
geometries. The ``pull`` .mdp option now takes simply ``yes`` or
``no``, and ``pull-coord1-type`` and ``pull-coord1-geometry`` play
the role of the old ``pull`` and ``pull-geometry`` options.
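
The per-coordinate selection might look like the following .mdp
fragment. This is an illustrative sketch only: the group names and all
numerical values are hypothetical, and ``flat-bottom`` is assumed to be
the value that selects the flat-bottomed potential type::

    pull                  = yes
    pull-ngroups          = 3
    pull-ncoords          = 2
    pull-group1-name      = Protein
    pull-group2-name      = LigandA
    pull-group3-name      = LigandB

    ; coordinate 1: umbrella potential on the distance between groups 1 and 2
    pull-coord1-type      = umbrella
    pull-coord1-geometry  = distance
    pull-coord1-groups    = 1 2
    pull-coord1-init      = 1.0
    pull-coord1-k         = 1000

    ; coordinate 2: flat-bottomed potential on the distance between groups 1 and 3
    pull-coord2-type      = flat-bottom
    pull-coord2-geometry  = distance
    pull-coord2-groups    = 1 3
    pull-coord2-init      = 2.0
    pull-coord2-k         = 500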


Improved pull geometry cylinder
-------------------------------

Cylinder geometry now uses a smooth radial weight function and adds
radial forces, so it now conserves energy (:issue:`1590`).


Expanded pulling output options
-------------------------------

There is now also an option for printing the COM of the second group,
as well as the reference value for all pull coordinates. Writing the
distance components to the pull coordinate output is now optional (off
by default).


Allowed configure-time specification of target GPU architectures
----------------------------------------------------------------

With CUDA support enabled, by default the GROMACS build system will
compile CUDA kernels for all supported hardware. However, this takes a
long time and many of the flavours will not be used. For convenience
in cases where the GPU model is known in advance, the CMake variables
``GMX_CUDA_TARGET_SM`` and ``GMX_CUDA_TARGET_COMPUTE`` allow setting
the GPU architectures and virtual architectures, respectively. For
details of what to put here, see the `background information
<http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-compilation>`_.
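
For instance, a build restricted to specific architectures might be
configured roughly as follows. The architecture numbers are
illustrative only, and the semicolon-separated, two-digit value format
is an assumption to be verified against the installation guide::

    # compile device code for sm_35 and sm_52 only, plus PTX for compute_52
    cmake .. -DGMX_GPU=ON \
             -DGMX_CUDA_TARGET_SM="35;52" \
             -DGMX_CUDA_TARGET_COMPUTE="52"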

Added single-accuracy SIMD double math functions
------------------------------------------------------------

Double-precision SIMD variables are typically half the width of
single-precision ones, and the double-precision math functions are
also considerably more expensive because they use higher-order
polynomials, which can drop the throughput to 25% of single
precision. In some cases we do not need full double precision in SIMD
operations, so these new math functions use double-precision SIMD
variables but only target single-precision accuracy, which can improve
performance twofold. The patch also makes the target precisions in
single and double SIMD advanced CMake variables, and the unit test
tolerance is set based on these variables. Users can choose to adjust
them on the few platforms where the rsqrt/inv table lookups provide
one bit too little to get by with a single N-R iteration at our
default target accuracy of 22 bits.
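
The sensitivity to a single bit can be illustrated with the textbook
Newton-Raphson (N-R) update for the inverse square root. This is a
generic derivation, not necessarily the exact form used in the GROMACS
SIMD code. Writing the current estimate of :math:`1/\sqrt{x}` with
relative error :math:`\epsilon_n`, one iteration gives

.. math::

   y_{n+1} = \tfrac{1}{2}\, y_n \left(3 - x\, y_n^2\right), \qquad
   y_n = \frac{1 + \epsilon_n}{\sqrt{x}}
   \quad\Rightarrow\quad
   \epsilon_{n+1} = -\tfrac{3}{2}\,\epsilon_n^2 - \tfrac{1}{2}\,\epsilon_n^3 .

With :math:`b_n = -\log_2 |\epsilon_n|` accurate bits, this means
:math:`b_{n+1} \approx 2 b_n - \log_2 \tfrac{3}{2} \approx 2 b_n - 0.58`.
A hypothetical lookup accurate to 11 bits would therefore reach only
about 21.4 bits after one iteration, just short of 22, whereas a
12-bit lookup would reach about 23.4 bits. The actual lookup accuracy
is platform-dependent.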

Added new tests
------------------------------------------------------------


Several new test cases have been added, including checks that plain
multi-simulation, replica exchange, and empty domains work.


Improved parallel distribution and thread count reporting
------------------------------------------------------------


The domain decomposition no longer prints the atom count for all ranks
(which gets far too long at high parallelization), but instead prints
the average, standard deviation, minimum and maximum.

Implemented zsh shell completions for gmx command
-------------------------------------------------------

New fatal error to stop benchmarking runs if tuning is still active
--------------------------------------------------------------------

The mdrun counters can be reset in various ways, and such a reset
could be triggered while tuning was still active, which makes the
resulting timings useless and potentially wrong. This now emits a
fatal error, so the user can judge how to run the benchmark more
stably or for a longer time.

:issue:`1781`