Gromacs
2025.0-dev-20241011-013a99c
|
One important way for GROMACS to achieve high performance is to use modern hardware capabilities where a single assembly instruction operates on multiple data units, essentially short fixed-length vectors (usually 2, 4, 8, or 16 elements). This provides a very efficient way for the CPU to increase floating-point performance, but it is much less versatile than general purpose registers. For this reason it is difficult for the compiler to generate efficient SIMD code, so the user has to organize the data in a way where it is possible to access as vectors, and these vectors often need to be aligned on cache boundaries.
We have supported a number of different SIMD instruction sets in the group kernels for ages, and it is now also present in the Verlet kernels and a few other places. However, with the increased usage and several architectures with different capabilities we now use a vendor-agnostic GROMACS SIMD module, as documented in SIMD intrinsics interface (simd).
The functions in src/gromacs/simd
are intended to be used for writing architecture-independent SIMD intrinsics code. Rather than making assumptions based on architecture, we have introduced a limited number of predefined preprocessor macros that describe the capabilities of the current implementation - these are the ones you need to check when writing SIMD code. As you will see, the functionality exposed by this module as typically a small subset of general SIMD implementations, and in particular we do not even try to expose advanced shuffling or permute operations, simply because we haven't been able to describe those in a generic way that can be implemented efficiently regardless of the hardware. However, the advantage of this approach is that it is straightforward to extend with support for new simd instruction sets in the future, and that will instantly speed up old code too.
To support the more complex stuff in the GROMACS nonbonded kernels and to make it possible to use SIMD intrinsics even for some parts of the code where the data is not in SIMD-friendly layout, we have also added about 10 higher-level utility routines. These perform gather/scatter operations on coordinate triplets, they load table data and aligned pairs (Lennard-Jones parameters), and sum up the forces needed in the outer loop of the nonbonded kernels. They are very straightforward to implement, but since they are performance-critical we want to exploit all features of each architecture, and for this reason they are part of the SIMD implementation.
Finally, for some architectures with large or very large SIMD width (e.g. AVX with 8 elements in single precision, or AVX-512 with 16), the nonbonded kernels can become inefficient. Since all such architectures presently known (AVX, AVX2, AVX512) also provide extensive support for accessing parts of the register, we optionally define a handful of routines to perform load, store, and reduce operations based on half-SIMD-width data, which can improve performance. It is only useful for wide implementations, and it can safely be ignored first when porting to new platforms - they are only needed for the so-called 2xnn SIMD kernels.
Unfortunately there is no standard for SIMD architectures. The available features vary a lot, but we still need to use quite a few of them to get the best performance possible. This means some features will only be available on certain platforms, and it is critical that we do NOT make too many assumptions about the storage formats, their size or SIMD width. Just to give a few examples:
While this might sound complicated, it is actually far easier than writing separate SIMD code for 10 architectures in both single & double. The point is not that you need to remember the limitations above, but it is critical that you never assume anything about the SIMD implementation. We typically implement SIMD support for a new architecture in days with this new module, and the extensions required for Verlet kernels are also very straightforward (group kernels can be more complex, but those are gradually on their way out). For the higher-level code, the only important thing is to never assume anything about the SIMD architecture. Our general strategy in GROMACS is to split the SIMD coding in three levels:
gromacs/simd/simd.h
provides the API to define and manipulate SIMD datatypes. This will be enough for lots of cases, and it is a huge advantage that there is roughly parity between different architectures. The SIMD module uses a couple of different files:
gromacs/simd/simd.h
gromacs/simd/impl_reference/impl_reference.h
gromacs/simd/simd_math.h
gromacs/simd/vector_operations.h
The SIMD module handles the challenges mentioned in the introduction by introducing a number of datatypes; many of these might map to the same underlying SIMD types, but we need separate types because some architectures use different registers e.g. for boolean types.
gmx::SimdReal
gmx::SimdFloat
gmx::SimdDouble
For these types, 'correspond' means that it is the integer type we get when we convert data e.g. from single (or double) precision floating-point SIMD variables. Those need to be different, since many common implementations only use half as many elements for double as for single SIMD variables, and then we only get half the number of integers too.
gmx::SimdInt32
gmx::SimdFInt32
gmx::SimdDInt32
gmx::SimdFInt32
. Note that all integer load/stores operations defined here load/store 32-bit integers, even when the internal register storage might be 64-bit, and we set the "width" of the SIMD implementation based on how many float/double/ integers we load/store - even if the internal width could be larger.
We need a separate boolean datatype for masks and comparison results, since we cannot assume they are identical either to integers, floats or double - some implementations use specific predicate registers for booleans.
gmx::SimdBool
B
, like gmx::simdOrB()
. gmx::SimdFBool
gmx::SimdDBool
gmx::SimdIBool
gmx::SimdFIBool
gmx::SimdDIBool
Note: You should NOT try to store and load boolean SIMD types to memory - that is the whole reason why there are no store or load operations provided for them. While it will be technically possible to achieve by defining objects inside a structure and then doing a placement new with aligned memory, this can be a very expensive operation on platforms where special single-bit predicate registers are used to represent booleans. You will need to find a more portable algorithm for your code instead.
If this seems daunting, in practice you should only need to use these types when you start coding:
gmx::SimdReal
gmx::SimdBool
gmx::SimdInt32
Operations on these types will be defined to either float/double (or corresponding integers) based on the current GROMACS precision, so the documentation is occasionally more detailed for the lower-level actual implementation functions.
Note that it is critical for these types to be aligned in memory. This should always be the case when you declare variables on the stack, but unfortunately some compilers (at least clang-3.7 on OS X) appear to be buggy when our SIMD datatypes are placed inside a structure. Somewhere in the processes where this structure includes our class, which in turn includes the actual SIMD datatype, the alignment appears to be lost. Thus, even though the compiler will not warn you, until further notice we need to avoid putting the SIMD datatypes into other structures. This is particular severe when allocating memory on the heap, but it occurs for stack structures/classes too.
The above should be sufficient for code that works with the full SIMD width. Unfortunately reality is not that simple. Some algorithms like lattice summation need quartets of elements, so even when the SIMD width is >4 we need width-4 SIMD if it is supported. The availability of SIMD4 is indicated by GMX_SIMD4_HAVE_FLOAT and GMX_SIMD4_HAVE_DOUBLE. For now we only support a small subset of SIMD operations for SIMD4. Because SIMD4 doesn't scale with increasingly large SIMD width it should be avoided for all new code and SIMD4N should be used instead.
Some code, like lattice summation, has inner loops which are smaller than the full SIMD width. In GROMACS algorithms 3 and 4 iterations are common because of PME order and three dimensions. This makes 4 an important special case. Vectorizing such loops efficiently requires to collapse the two most inner loops and using e.g. one 8-wide SIMD vector for 2 outer and 4 inner iterations or one 16-wide SIMD vector for 4 outer and 4 inner iterations. For this SIMD4N functions are provided. The availability of these function is indicated by GMX_SIMD_HAVE_4NSIMD_UTIL_FLOAT and GMX_SIMD_HAVE_4NSIMD_UTIL_DOUBLE. These functions return the type alias Simd4NFloat / Simd4NDouble which is either the normal SIMD type or the SIMD4 type and thus only supports the operations the SIMD4 type supports.
Functionality-wise, we have a small set of core features that we require to be present on all platforms, while more avanced features can be used in the code when defines like e.g. GMX_SIMD_HAVE_LOADU have the value 1.
This is a summary of the currently available preprocessor defines that you should use to check for support when using the corresponding features. We first list the float/double/int defines set by the implementation; in most cases you do not want to check directly for float/double defines, but you should instead use the derived "real" defines set in this file - we list those at the end below.
Preprocessor predefined macro defines set by the low-level implementation. These only have the value 1 if they work for all datatypes; GMX_SIMD_HAVE_LOADU thus means we can load both float, double, and integers from unaligned memory, and that the unaligned loads are available for SIMD4 too.
gmx::SimdFInt32
. gmx::SimdFInt32
. gmx::SimdFInt32
. gmx::SimdDInt32
. gmx::SimdDInt32
. gmx::SimdDInt32
. There are also two macros specific to SIMD4: GMX_SIMD4_HAVE_FLOAT is set if we can use SIMD4 in single precision, and GMX_SIMD4_HAVE_DOUBLE similarly denotes support for a double-precision SIMD4 implementation. For generic properties (e.g. whether SIMD4 FMA is supported), you should check the normal SIMD macros above.
Higher-level code can use these macros to find information about the implementation, for instance what the SIMD width is:
gmx::SimdFloat
, and practical width of gmx::SimdFInt32
. gmx::SimdDouble
, and practical width of gmx::SimdDInt32
After including the low-level architecture-specific implementation, this header sets the following derived defines based on the current precision; these are the ones you should check for unless you absolutely want to dig deep into the explicit single/double precision implementations:
For convenience we also define GMX_SIMD4_WIDTH to 4. This will never vary, but using it helps you make it clear that a loop or array refers to the SIMD4 width rather than some other '4'.
While all these defines are available to specify the features of the hardware, we would strongly recommend that you do NOT sprinkle your code with defines - if nothing else it will be a debug nightmare. Instead you can write a slower generic SIMD function that works everywhere, and then override this with faster architecture-specific versions for some implementations. The recommended way to do that is to add a define around the generic function that skips it if the name is already defined. The actual implementations in the lowest-level files are typically defined to an architecture-specific name (such as simdSinCosD_Sse2
) so we can override it (e.g. in SSE4) by simply undefining and setting a new definition. Still, this is an implementation detail you won't have to worry about until you start writing support for a new SIMD architecture.
We rely on C++ overloading, so the name of a function is usually identical regardless of what datatype it operates on. There are a few exceptions to this for functions that do not take arguments but only return a value, e.g. setZero(), since overloading only works if the formal parameters are different. To solve this, we use different low-level function names in these cases, but then create proxy objects in the high-level gromacs/simd/simd.h
so that you can still get the functionality by simply writing setZero() in the code.
Having fallback implementations when SIMD is not supported can be a performance problem if the code does not correctly include gromacs/simd/simd.h
, particularly after refactoring. make check-source
checks the whole code for the use of symbols defined in gromacs/simd/simd.h
and requires that files using those symbols do the correct include. Similar checking is done for higher-level SIMD-management headers, e.g. gromacs/ewald/pme_simd.h
.
In addition to the low-level SIMD instructions, GROMACS comes with a fairly extensive SIMD math library in gromacs/simd/simd_math.h
to support various mathematical functions. The functions are available both in single and double precision (overloaded on the usual math function names), and we also provide a special version of functions that use double precision arguments, but that only evaluate the result to single precision accuracy. This is useful when you don’t need highly accurate results, but you want to avoid the overhead of doing multiple single/double conversions, or if the hardware architecture only provides a double precision SIMD implementation.
For a few functions such as the square root and exponential that are performance-critical, we provide additional tempate parameters where the default choice is to execute the normal function version, but it is also possible to choose an unsafe execution path that completely bypass all argument checking. Make absolutely sure your arguments always fulfil the restrictions listed in the documentation of such a function before using it, and it might even be a good idea to add a note before each call to an unsafe function justifying why that flavor is fine to use here.