Coding with SIMD instructions

One important way for GROMACS to achieve high performance is to use modern hardware capabilities where a single assembly instruction operates on multiple data units, essentially short fixed-length vectors (usually 2,4,8, or 16 elements). This provides a very efficient way for the CPU to increase floating-point performance, but it is much less versatile than general purpose registers. For this reason it is difficult for the compiler to generate efficient SIMD code, so the user has to organize the data in a way where it is possible to access as vectors, and these vectors often need to be aligned on cache boundaries.

We have supported a number of different SIMD instruction sets in the group kernels for ages, and it is now also present in the verlet kernels and a few other places. However, with the increased usage and several architectures with different capabilities we now use a vendor-agnostic GROMACS SIMD module, as documented in SIMD intrinsics interface (simd).

Design of the GROMACS SIMD module

The macros in src/gromacs/simd are intended to be used for writing architecture-independent SIMD intrinsics code. Rather than making assumptions based on architecture, we have introduced a limited number of predefined preprocessor macros that describe the capabilities of the current implementation - these are the ones you need to check when writing SIMD code. As you will see, the functionality exposed by this module as typically a small subset of general SIMD implementations, and in particular we do not even try to expose advanced shuffling or permute operations, simply because we haven't been able to describe those in a generic way that can be implemented efficiently regardless of the hardware. However, the advantage of this approach is that it is straightforward to extend with support for new simd instruction sets in the future, and that will instantly speed up old code too.

Unfortunately there is no standard for SIMD architectures. The available features vary a lot, but we still need to use quite a few of them to get the best performance possible. This means some features will only be available on certain platforms, and it is critical that we do NOT make to many assumptions about the storage formats, their size or SIMD width. Just to give a few examples:

On x86, double precision (64-bit) floating-point values always convert to 32-bit integers, while many other platforms use 64-bit, and some cannot use 32-bit integers at all. This means we cannot use a mask (boolean) derived from integer operations to select double-precision floating-point values, and it could get very complex for higher-level code if all these decisions were exposed. Instead, we want to keep integers 32-bit since all algorithms anyway need to work in single precision (w. 32-bit ints).
IBM QPX uses 4-wide SIMD both for single and double precision. Integer support is highly limited, and the storage format means QPX does not use x86-style all-ones masks (which have different widths in single/double) but it uses the sign bit to denote the false value. In particular, this means we cannot use the bit contents for any fancy mask operations.
AVX1 only supports 4-wide 128-bit integer SIMD arithmetics, but the integer conversions can still be done 8-wide which corresponds to the single precision floating-point width. Similarly, with AVX1 conversions between double-precision and integers use the 32-bit 4-wide 128bit registers where we can also do integer arithmetics. AVX2 adds proper arithmetics for 8-wide integers. We would severely limit performance if we had to say that integer support was not present, so instead we stick to 32-bit ints but limit the operations we expose (and do shuffling internally).
For SSE2 through SSE4.1, double precision is 2-wide, but when we convert to integers they will be put in the first two elements of a 4-wide integer type. This means we cannot assume that floating-point SIMD registers and corresponding integer registers (after conversion) have the same width.
The 2-wide SIMD instructions on BlueGene/L and BlueGene/P cannot do any floating-point logical operations (and/andnot/or/xor) whatsoever, which can be a pain when implementing approximations for math functions.
Since boolean values can have different width for float/double and the integers corresponding to float/double, we need to use separate boolean types for all these values and convert between them if we e.g. want to use result of an integer compare to select floating-point values.

While this might sound complicated, it is actually far easier than writing separate SIMD code for 10 architectures in both single & double. The point is not that you need to remember the limitations above, but it is critical that you never assume anything about the SIMD implementation. We typically implement SIMD support for a new architecture in days with this new module, and the extensions required for verlet kernels are also very straightforward (group kernels can be more complex, but those are gradually on their way out). For the higher-level code, the only important thing is to never assume anything about the SIMD architecture. Our general strategy in GROMACS is to split the SIMD coding in three levels:

Base level generic SIMD: The base level SIMD module (which we get by including gromacs/simd/simd.h provides the API to define and manipulate SIMD datatypes. This will be enough for lots of cases, and it is a huge advantage that there is roughly parity between different architectures.
Larger architecture-specific SIMD functions: For some parts of the code this is not enough. In particular, both the group and Verlet kernels do insane amounts of floating-point operations, and since we spend 85-90% of the time in these kernels it is critical that we can optimize them as much as possible. Here, our strategy is first to define larger high-level functions that e.g. take a number of distances and loads the table interactions for this interaction. This way we can move this architecture-specific implementation to the SIMD module, and both achieve a reasonably clean kernel but still optimize a lot.
Architecture-specific kernels (directories/files): When it is absolutely impossible to use a shared implementation we might have to code SIMD (just as GPU code). When this happens, we should create subdirectory or otherwise clearly names a file with a suffix for the SIMD architecture, to clarify to the user that the SIMD file has a direct non-SIMD correspondence. Since this code can be very hard to read, it is important to be explicit and use lots of comments - this is not the type of code where you should use smart optimization with hundreds of preprocessor directives. Keep it simple so other developers can help you support it. The question is not whether you can get a function 20% faster, but whether it justifies the added complexity of the code.

File organization

The SIMD module uses a couple of different files:

gromacs/simd/simd.h: This is the top-level wrapper that you should always include first. It will check the settings made at configuration time and include a suitable low-level implementation (that can be either single, double, or both). It also contains the routines for memory alignment, and based on the current GROMACS precision it will set aliases to 'real' SIMD datatypes (see further down) so the implementations do not have to care about GROMACS-specific details. However, note that you might not get all SIMD support you hoped for: If you compiled GROMACS in double precision but the hardware only supports single-precision SIMD there will not be any SIMD routines for default GROMACS 'real' precision. There are #defines you can use to check this, as described further down.
gromacs/simd/impl_reference.h: This is an example of a low-level implementation. You should never, ever, work directly with these in higher-level code. The reference implementation contains the documentation for all SIMD wrappers, though.
gromacs/simd/simd_math.h: SIMD math functions. All functions in this file have to be designed so they work no matter whether the hardware supports integer SIMD, logical operations on integer or floating-point SIMD, or arithmetic operations on integers. However, a few routines check for defines and use faster algorithms if these features are present.
gromacs/simd/vector_operations.h: This file contains a few rvec-related SIMD functions, e.g. to calculate scalar products, norms, or cross products. They obviously cannot operate on scalar GROMACS rvec types, but use separate SIMD variables for X,Y, and Z vector components.

SIMD datatypes

The SIMD module handles the challenges mentioned in the introduction by introducing a number of datatypes; many of these might map to the same underlying SIMD types, but we need separate types because some architectures use different registers e.g. for boolean types.

Floating-point data

gmx_simd_real_t: This is the SIMD-version of GROMACS' real type, which is set based on the CMake configuration and internally aliased to one of the next two types. Operations on these variables have the suffix _r, e.g. gmx_simd_add_r().
gmx_simd_float_t: This is always single-precision data, but it might not be supported on all architectures. Suffix _f is used for explicit single-precision routines, e.g. gmx_simd_mul_f().
gmx_simd_double_t: This is always double precision when available, and in rare cases you might want to use a specific precision. Suffix _d is used for explicit double-precision routines, e.g. gmx_simd_mul_d()

Integers corresponding to floating-point values

For these types, 'correspond' means that it is the integer type we get when we convert data e.g. from single (or double) precision floating-point SIMD variables. Those need to be different, since many common implementations only use half as many elements for double as for single SIMD variables, and then we only get half the number of integers too.

gmx_simd_int32_t: This is used for integers when converting to/from GROMACS default "real" type. The corresponding routines have suffix _i, e.g. gmx_simd_add_i().
gmx_simd_fint32_t: Integers obtained when converting from single precision, or intended to be converted to single precision floating-point. These are normal integers (not a special conversion type), but since some SIMD architectures such as SSE or AVX use different registers for integer SIMD variables having the same width as float and double, respectively, we need to separate these two types of integers. The actual operations you perform on the are normal ones such as addition or multiplication. The routines operating on these variables have suffix _fi, like gmx_simd_add_fi(). This will also be the widest integer data type if you want to do pure integer SIMD operations, but that will not be supported on all platforms.
gmx_simd_dint32_t: Integers used when converting to/from double. See the preceding item for a detailed explanation. On many architectures, including all x86 ones, this will be a narrower type than gmx_simd_fint32_t. The correspoding routines have suffix _di, like gmx_simd_add_di().

Note that all integer load/stores operations defined here load/store 32-bit integers, even when the internal register storage might be 64-bit, and we set the "width" of the SIMD implementation based on how many float/double/ integers we load/store - even if the internal width could be larger.

Boolean values

We need a separate boolean datatype for masks and comparison results, since we cannot assume they are identical either to integers, floats or double - some implementations use specific predicate registers for booleans.

gmx_simd_bool_t: Results from boolean operations involving reals, and the booleans we use to select between real values. The corresponding routines have suffix _b, like gmx_simd_or_b().
gmx_simd_fbool_t: Booleans specifically for single precision. Corresponding function suffix is _fb, like gmx_simd_or_fb().
gmx_simd_dbool_t: Operations specifically on double. Operations have suffix _db: gmx_simd_or_db()
gmx_simd_ibool_t: Boolean operations on integers corresponding to real (see floating-point descriptions above). Operations on these booleans use suffix _ib, like gmx_simd_or_ib().
gmx_simd_fibool_t: Booleans for integers corresponding to float. Operation suffix is _fib, like gmx_simd_or_fib().
gmx_simd_dibool_t: Booleans for integers corresponding to double. Operation suffix is _dib, like gmx_simd_or_dib().

The subset you should use in practice

If this seems daunting, in practice you should only need to use these types when you start coding:

gmx_simd_real_t: Floating-point data.
gmx_simd_bool_t: Booleans.
gmx_simd_int32_t: Integer data. Might not be supported, so you must check the preprocessor macros described below.

Operations on these types will be defined to either float/double (or corresponding integers) based on the current GROMACS precision, so the documentation is occasionally more detailed for the lower-level actual implementation functions.

SIMD4 Macros

The above should be sufficient for code that works with the full SIMD width. Unfortunately reality is not that simple. Some algorithms like lattice summation need quartets of elements, so even when the SIMD width is >4 we need width-4 SIMD if it is supported. These datatypes and operations use the prefix gmx_simd4_, and availability is indicated by GMX_SIMD4_HAVE_FLOAT and GMX_SIMD4_HAVE_DOUBLE. For now we only support a small subset of SIMD operations for SIMD4, but that is trivial to extend if we need to.

Predefined SIMD preprocessor macros

Functionality-wise, we have a small set of core set of features that we require to be present on all platforms, while more avanced features can be used in the code when defines like e.g. GMX_SIMD_HAVE_LOADU are set.

This is a summary of the currently available preprocessor defines that you should use to check for support when using the corresponding features. We first list the float/double/int defines set by the implementation; in most cases you do not want to check directly for float/double defines, but you should instead use the derived "real" defines set in this file - we list those at the end below.

Preprocessor predefined macro defines set by the low-level implementation. These are only set if they work for all datatypes; GMX_SIMD_HAVE_LOADU thus means we can load both float, double, and integers from unaligned memory, and that the unaligned loads are available for SIMD4 too.

GMX_SIMD_HAVE_FLOAT: Single-precision instructions available.
GMX_SIMD_HAVE_DOUBLE: Double-precision instructions available.
GMX_SIMD_HAVE_HARDWARE: Set when we are NOT emulating SIMD.
GMX_SIMD_HAVE_LOADU: Load from unaligned memory available.
GMX_SIMD_HAVE_STOREU: Store to unaligned memory available.
GMX_SIMD_HAVE_LOGICAL: Support for and/andnot/or/xor on floating-point variables.
GMX_SIMD_HAVE_FMA: Floating-point fused multiply-add. Note: We provide emulated FMA instructions if you do not have FMA support, but in that case you might be able to code it more efficient w/o FMA.
GMX_SIMD_HAVE_FRACTION: Instruction to get decimal fraction. Same as FMA: This denotes hardware support, otherwise instruction will be emulated.
GMX_SIMD_HAVE_FINT32: Integer conversions to/from float available.
GMX_SIMD_HAVE_FINT32_EXTRACT: Support for extracting integer SIMD elements from gmx_simd_fint32_t.
GMX_SIMD_HAVE_FINT32_LOGICAL: Bitwise shifts on gmx_simd_fint32_t.
GMX_SIMD_HAVE_FINT32_ARITHMETICS: Arithmetic ops for gmx_simd_fint32_t.
GMX_SIMD_HAVE_DINT32: Integer conversions to/from double available.
GMX_SIMD_HAVE_DINT32_EXTRACT: Support for extracting integer SIMD elements from gmx_simd_dint32_t.
GMX_SIMD_HAVE_DINT32_LOGICAL: Bitwise shifts on gmx_simd_dint32_t.
GMX_SIMD_HAVE_DINT32_ARITHMETICS: Arithmetic ops for gmx_simd_dint32_t.

There are also two macros specific to SIMD4: GMX_SIMD4_HAVE_FLOAT is set if we can use SIMD4 in single precision, and GMX_SIMD4_HAVE_DOUBLE similarly denotes support for a double-precision SIMD4 implementation. For generic properties (e.g. whether SIMD4 FMA is supported), you should check the normal SIMD macros above.

Implementation properties

Higher-level code can use these macros to find information about the implementation, for instance what the SIMD width is:

GMX_SIMD_FLOAT_WIDTH: Number of elements in gmx_simd_float_t, and practical width of gmx_simd_fint32_t.
GMX_SIMD_DOUBLE_WIDTH: Number of elements in gmx_simd_double_t, and practical width of gmx_simd_dint32_t
GMX_SIMD_RSQRT_BITS: Accuracy (bits) of 1/sqrt(x) lookup step.
GMX_SIMD_RCP_BITS: Accuracy (bits) of 1/x lookup step.

After including the low-level architecture-specific implementation, this header sets the following derived defines based on the current precision; these are the ones you should check for unless you absolutely want to dig deep into the explicit single/double precision implementations:

GMX_SIMD_HAVE_REAL: Set either to GMX_SIMD_HAVE_FLOAT or GMX_SIMD_HAVE_DOUBLE
GMX_SIMD4_HAVE_REAL: Set either to GMX_SIMD4_HAVE_FLOAT or GMX_SIMD4_HAVE_DOUBLE
GMX_SIMD_REAL_WIDTH: Set either to GMX_SIMD_FLOAT_WIDTH or GMX_SIMD_DOUBLE_WIDTH
GMX_SIMD_HAVE_INT32: Set either to GMX_SIMD_HAVE_FINT32 or GMX_SIMD_HAVE_DINT32
GMX_SIMD_INT32_WIDTH: Set either to GMX_SIMD_FINT32_WIDTH or GMX_SIMD_DINT32_WIDTH
GMX_SIMD_HAVE_INT32_EXTRACT: Set either to GMX_SIMD_HAVE_FINT32_EXTRACT or GMX_SIMD_HAVE_DINT32_EXTRACT
GMX_SIMD_HAVE_INT32_LOGICAL: Set either to GMX_SIMD_HAVE_FINT32_LOGICAL or GMX_SIMD_HAVE_DINT32_LOGICAL
GMX_SIMD_HAVE_INT32_ARITHMETICS: Set either to GMX_SIMD_HAVE_FINT32_ARITHMETICS or GMX_SIMD_HAVE_DINT32_ARITHMETICS

For convenience we also define GMX_SIMD4_WIDTH to 4. This will never vary, but using it helps you make it clear that a loop or array refers to the SIMD4 width rather than some other '4'.

While all these defines are available to specify the features of the hardware, we would strongly recommend that you do NOT sprinkle your code with defines - if nothing else it will be a debug nightmare. Instead you can write a slower generic SIMD function that works everywhere, and then override this with faster architecture-specific versions for some implementations. The recommended way to do that is to add a define around the generic function that skips it if the name is already defined. The actual implementations in the lowest-level files are typically defined to an architecture-specific name (such as gmx_simd_sincos_d_sse2) so we can override it (e.g. in SSE4) by simply undefining and setting a new definition. Still, this is an implementation detail you won't have to worry about until you start writing support for a new SIMD architecture.

Automated checking

Having fallback implementations when SIMD is not supported can be a performance problem if the code does not correctly include gromacs/simd/simd.h, particularly after refactoring. make check-source checks the whole code for the use of symbols defined in gromacs/simd/simd.h and requires that files using those symbols do the correct include. Similar checking is done for higher-level SIMD-management headers, e.g. gromacs/ewald/pme-simd.h.