gmx nonbonded-benchmark

Synopsis

gmx nonbonded-benchmark [-size <int>] [-nt <int>] [-simd <enum>]
             [-coulomb <enum>] [-[no]table] [-combrule <enum>]
             [-[no]halflj] [-[no]energy] [-[no]all] [-cutoff <real>]
             [-iter <int>] [-warmup <int>] [-[no]cycles]

Description

gmx nonbonded-benchmark runs benchmarks for one or more so-called Nbnxm non-bonded pair kernels. The non-bonded pair kernels are the most compute-intensive part of MD simulations and usually comprise 60 to 90 percent of the runtime. For this reason they are highly optimized and several different setups are available to compute the same physical interactions. In addition, there are different physical treatments of Coulomb interactions and optimizations for atoms without Lennard-Jones interactions. There are also different physical treatments of Lennard-Jones interactions, but only a plain cut-off is supported in this tool, as that is by far the most common treatment. Finally, while force output is always necessary, energy output is only required at certain steps. In total there are 12 relevant combinations of options. The combinations double to 24 when two different SIMD setups are supported. These combinations can be run with a single invocation using the -all option.

The performance of each kernel is affected by caching behavior, which is determined by the hardware used together with the system size and the cut-off radius. The larger the number of atoms per thread, the more L1 cache is needed to avoid L1 cache misses. The cut-off radius mainly affects the data reuse: a larger cut-off results in more data reuse and makes the kernel less sensitive to cache misses.
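The count of 12 combinations can be reproduced from the option values listed under Options (two Coulomb forms, three combination rules, half-LJ on or off). The following is a minimal sketch of that arithmetic only; the names kernel_combinations, COULOMB, COMBRULE and HALFLJ are hypothetical, and the order in which the tool itself enumerates kernels with -all is not specified here.

```python
from itertools import product

# Option values as listed in the Options section of this page.
COULOMB = ["ewald", "reaction-field"]
COMBRULE = ["geometric", "lb", "none"]
HALFLJ = [False, True]


def kernel_combinations(simd_setups=("4xm",)):
    """Return (simd, coulomb, combrule, halflj) tuples for every combination."""
    return list(product(simd_setups, COULOMB, COMBRULE, HALFLJ))


if __name__ == "__main__":
    print(len(kernel_combinations()))                 # one SIMD setup: 12
    print(len(kernel_combinations(("4xm", "2xmm"))))  # two SIMD setups: 24
```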

OpenMP parallelization is used to utilize multiple hardware threads within a compute node. In these benchmarks there is no explicit interaction between threads, apart from starting and closing a single OpenMP parallel region per iteration. Threads do interact indirectly, through sharing and evicting data from shared caches. The number of threads to use is set with the -nt option. Thread affinity is important, especially with SMT and shared caches. Affinities can be set through the OpenMP library, for example using the GOMP_CPU_AFFINITY environment variable with the GNU OpenMP runtime.
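As an illustration of combining -nt with an affinity setting, the sketch below assembles a command line and environment for a pinned run without executing anything. It assumes the GNU OpenMP runtime, which reads GOMP_CPU_AFFINITY; the helper name benchmark_invocation and the CPU list "0-3" are hypothetical and should be adapted to the core numbering of the machine at hand.

```python
import os


def benchmark_invocation(num_threads, cpu_list):
    """Build argv and environment for a pinned run; nothing is executed here."""
    env = dict(os.environ)
    # GOMP_CPU_AFFINITY is read by the GNU OpenMP runtime; other runtimes may
    # need a different variable (assumption: libgomp is in use).
    env["GOMP_CPU_AFFINITY"] = cpu_list
    argv = ["gmx", "nonbonded-benchmark", "-nt", str(num_threads)]
    return argv, env


argv, env = benchmark_invocation(4, "0-3")
print(" ".join(argv))
# To actually run it: subprocess.run(argv, env=env, check=True)
```

Keeping the construction separate from the execution makes it easy to log the exact pinning used alongside the benchmark output.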

The benchmark tool times one or more kernels by running them repeatedly for a number of iterations set by the -iter option. An initial kernel call is done to avoid additional initial cache misses. Times are recorded in cycles read from efficient, high-accuracy counters in the CPU. Note that these often do not correspond to actual clock cycles. For each kernel, the tool reports the total number of cycles, cycles per iteration, and (total and useful) pair interactions per cycle. Because a cluster pair list is used instead of an atom pair list, interactions are also computed for some atom pairs that are beyond the cut-off distance. These pairs are not useful (except for additional buffering, but that is not of interest here), only a side effect of the cluster-pair setup. The SIMD 2xMM kernel has a higher useful pair ratio than the 4xM kernel due to a smaller cluster size, but a lower total pair throughput. It is best to run this, or for that matter any, benchmark with locked CPU clocks, as thermal throttling can significantly affect performance. If that is not an option, the -warmup option can be used to run initial, untimed iterations to warm up the processor.
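The relation between the reported quantities can be sketched as follows. This is an assumption about how the figures relate arithmetically, not a description of the tool's internals: the function kernel_metrics is hypothetical, it assumes that cycle and pair counts are totals over all timed iterations, and the input numbers are made up for illustration rather than measured.

```python
def kernel_metrics(total_cycles, iterations, total_pairs, useful_pairs):
    """Derive per-iteration and per-cycle figures from totals.

    total_pairs counts every atom pair computed via the cluster pair list;
    useful_pairs counts only the pairs within the cut-off distance.
    """
    return {
        "cycles_per_iteration": total_cycles / iterations,
        "total_pairs_per_cycle": total_pairs / total_cycles,
        "useful_pairs_per_cycle": useful_pairs / total_cycles,
        "useful_fraction": useful_pairs / total_pairs,
    }


# Made-up numbers purely for illustration, not measured output.
metrics = kernel_metrics(total_cycles=2_000_000, iterations=100,
                         total_pairs=1_000_000, useful_pairs=600_000)
for name, value in metrics.items():
    print(f"{name}: {value:g}")
```

With the -cycles option the tool reports cycles/pair instead, which is simply the reciprocal of the pairs/cycle figures.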

The most relevant regime is between 0.1 and 1 millisecond per iteration. Thus it is useful to run with system sizes that cover both ends of this regime.
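One way to cover a range of system sizes is to generate a series of invocations with increasing -size. The sketch below only builds the command strings and the resulting atom counts (3000 atoms times the -size value, as stated under Options); the helper sweep_commands and the chosen size values are hypothetical, and which sizes land in the 0.1 to 1 millisecond regime depends on the hardware and must be found by running the benchmark.

```python
ATOMS_PER_SIZE_UNIT = 3000  # from the -size option description


def sweep_commands(sizes, num_threads=1, iterations=100):
    """Return (atoms, atoms_per_thread, command) for each -size value."""
    rows = []
    for size in sizes:
        atoms = ATOMS_PER_SIZE_UNIT * size
        cmd = (f"gmx nonbonded-benchmark -size {size} "
               f"-nt {num_threads} -iter {iterations}")
        rows.append((atoms, atoms / num_threads, cmd))
    return rows


for atoms, per_thread, cmd in sweep_commands([1, 4, 16, 64], num_threads=4):
    print(f"{atoms:>7d} atoms, {per_thread:>8.0f} per thread: {cmd}")
```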

The -simd and -table options select different implementations to compute the same physics. The choice of these options should ideally be optimized for the target hardware. Historically, we only found tabulated Ewald correction to be useful on 2-wide SIMD or 4-wide SIMD without FMA support. As all modern architectures are wider and support FMA, we do not use tables by default. The only exceptions are kernels without SIMD, which only support tables. Options -coulomb, -combrule and -halflj depend on the force field and composition of the simulated system. The optimization of computing Lennard-Jones interactions for only half of the atoms in a cluster is useful for water, which does not use Lennard-Jones on hydrogen atoms in most water models. In the MD engine, any clusters where at most half of the atoms have LJ interactions will automatically use this kernel. And finally, the -energy option selects the computation of energies, which are usually only needed infrequently.
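The half-LJ criterion stated above (at most half of the atoms in a cluster have LJ interactions) can be written as a small predicate. This is a sketch of the rule as worded in this text only; the function name qualifies_for_half_lj is hypothetical, the four-atom cluster layout is an assumed example, and the actual check in the MD engine may involve further conditions that are not described here.

```python
def qualifies_for_half_lj(atom_has_lj):
    """True when at most half of the atoms in the cluster have LJ interactions."""
    return 2 * sum(atom_has_lj) <= len(atom_has_lj)


# Assumed example clusters of four atoms; True marks an atom with LJ.
# A water-like cluster O, H, H, O where only the oxygens carry LJ:
print(qualifies_for_half_lj([True, False, False, True]))   # True
# A cluster where three of four atoms carry LJ:
print(qualifies_for_half_lj([True, True, True, False]))    # False
```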

Options

Other options:

-size <int> (1)
The system size is 3000 atoms times this value
-nt <int> (1)
The number of OpenMP threads to use
-simd <enum> (auto)
SIMD type, auto runs all supported SIMD setups or no SIMD when SIMD is not supported: auto, no, 4xm, 2xmm
-coulomb <enum> (ewald)
The functional form for the Coulomb interactions: ewald, reaction-field
-[no]table (no)
Use lookup table for Ewald correction instead of analytical
-combrule <enum> (geometric)
The LJ combination rule: geometric, lb, none
-[no]halflj (no)
Use optimization for LJ on half of the atoms
-[no]energy (no)
Compute energies in addition to forces
-[no]all (no)
Run all 12 combinations of options for coulomb, halflj, combrule
-cutoff <real> (1)
Pair-list and interaction cut-off distance
-iter <int> (100)
The number of iterations for each kernel
-warmup <int> (0)
The number of iterations for initial warmup
-[no]cycles (no)
Report cycles/pair instead of pairs/cycle