Gromacs
5.1.1
One important way for GROMACS to achieve high performance is to use modern hardware capabilities where a single assembly instruction operates on multiple data units, essentially short fixed-length vectors (usually 2, 4, 8, or 16 elements). This provides a very efficient way for the CPU to increase floating-point performance, but it is much less versatile than general-purpose registers. For this reason it is difficult for the compiler to generate efficient SIMD code, so the user has to organize the data so that it can be accessed as vectors, and these vectors often need to be aligned on cache boundaries.
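To make the data-layout point concrete, here is a minimal sketch in plain C. It is not GROMACS code: SKETCH_WIDTH and sketch_add are hypothetical names, and the inner loop only imitates what one SIMD instruction would do on a fixed-width chunk.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vector width for this sketch; real hardware commonly
 * uses 2, 4, 8 or 16 elements. */
#define SKETCH_WIDTH 4

/* Adds two arrays one fixed-width chunk at a time.
 * n must be a multiple of SKETCH_WIDTH, i.e. the caller pads the data. */
void sketch_add(const float *a, const float *b, float *out, size_t n)
{
    size_t i, j;

    assert(n % SKETCH_WIDTH == 0);
    for (i = 0; i < n; i += SKETCH_WIDTH)
    {
        /* One pass of this inner loop stands in for one SIMD instruction. */
        for (j = 0; j < SKETCH_WIDTH; j++)
        {
            out[i + j] = a[i + j] + b[i + j];
        }
    }
}
```

The caller is responsible for padding the arrays to a multiple of the width. Where the hardware needs aligned memory, a C11 declaration such as `_Alignas(16) float x[8];` is one way to request it for a local buffer.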
We have long supported a number of different SIMD instruction sets in the group kernels, and they are now also supported in the Verlet kernels and a few other places. However, with the increased usage and several architectures with different capabilities we now use a vendor-agnostic GROMACS SIMD module, as documented in SIMD intrinsics interface (simd).
The macros in src/gromacs/simd are intended to be used for writing architecture-independent SIMD intrinsics code. Rather than making assumptions based on architecture, we have introduced a limited number of predefined preprocessor macros that describe the capabilities of the current implementation; these are the ones you need to check when writing SIMD code. As you will see, the functionality exposed by this module is typically a small subset of general SIMD implementations. In particular, we do not even try to expose advanced shuffling or permute operations, simply because we haven't been able to describe those in a generic way that can be implemented efficiently regardless of the hardware. However, the advantage of this approach is that it is straightforward to extend with support for new SIMD instruction sets in the future, and that will instantly speed up old code too.
Unfortunately there is no standard for SIMD architectures. The available features vary a lot, but we still need to use quite a few of them to get the best performance possible. This means some features will only be available on certain platforms, and it is critical that we do NOT make too many assumptions about the storage formats, their size or SIMD width. Just to give a few examples:
While this might sound complicated, it is actually far easier than writing separate SIMD code for 10 architectures in both single and double precision. The point is not that you need to remember the limitations above, but it is critical that you never assume anything about the SIMD implementation. We typically implement SIMD support for a new architecture in days with this new module, and the extensions required for Verlet kernels are also very straightforward (group kernels can be more complex, but those are gradually on their way out). For the higher-level code, the only important thing is to never assume anything about the SIMD architecture. Our general strategy in GROMACS is to split the SIMD coding into three levels:
gromacs/simd/simd.h provides the API to define and manipulate SIMD datatypes. This will be enough for lots of cases, and it is a huge advantage that there is roughly parity between different architectures. The SIMD module uses a couple of different files:
gromacs/simd/simd.h
gromacs/simd/impl_reference.h
gromacs/simd/simd_math.h
gromacs/simd/vector_operations.h
The SIMD module handles the challenges mentioned in the introduction by introducing a number of datatypes; many of these might map to the same underlying SIMD types, but we need separate types because some architectures use different registers e.g. for boolean types.
gmx_simd_real_t: suffix _r, e.g. gmx_simd_add_r().
gmx_simd_float_t: suffix _f is used for explicit single-precision routines, e.g. gmx_simd_mul_f().
gmx_simd_double_t: suffix _d is used for explicit double-precision routines, e.g. gmx_simd_mul_d().
For these types, 'correspond' means that it is the integer type we get when we convert data e.g. from single (or double) precision floating-point SIMD variables. Those need to be different, since many common implementations only use half as many elements for double as for single SIMD variables, and then we only get half the number of integers too.
gmx_simd_int32_t: suffix _i, e.g. gmx_simd_add_i().
gmx_simd_fint32_t: suffix _fi, like gmx_simd_add_fi(). This will also be the widest integer data type if you want to do pure integer SIMD operations, but that will not be supported on all platforms.
gmx_simd_dint32_t: the double-precision counterpart of gmx_simd_fint32_t. The corresponding routines have suffix _di, like gmx_simd_add_di().
Note that all integer load/store operations defined here load/store 32-bit integers, even when the internal register storage might be 64-bit, and we set the "width" of the SIMD implementation based on how many float/double/integers we load/store, even if the internal width could be larger.
We need a separate boolean datatype for masks and comparison results, since we cannot assume they are identical to integers, floats, or doubles; some implementations use specific predicate registers for booleans.
gmx_simd_bool_t: suffix _b, like gmx_simd_or_b().
gmx_simd_fbool_t: suffix _fb, like gmx_simd_or_fb().
gmx_simd_dbool_t: suffix _db, like gmx_simd_or_db().
gmx_simd_ibool_t: suffix _ib, like gmx_simd_or_ib().
gmx_simd_fibool_t: suffix _fib, like gmx_simd_or_fib().
gmx_simd_dibool_t: suffix _dib, like gmx_simd_or_dib().
If this seems daunting, in practice you should only need to use these types when you start coding:
gmx_simd_real_t
gmx_simd_bool_t
gmx_simd_int32_t
Operations on these types will be defined to either float/double (or corresponding integers) based on the current GROMACS precision, so the documentation is occasionally more detailed for the lower-level actual implementation functions.
The above should be sufficient for code that works with the full SIMD width. Unfortunately reality is not that simple. Some algorithms like lattice summation need quartets of elements, so even when the SIMD width is >4 we need width-4 SIMD if it is supported. These datatypes and operations use the prefix gmx_simd4_, and availability is indicated by GMX_SIMD4_HAVE_FLOAT and GMX_SIMD4_HAVE_DOUBLE. For now we only support a small subset of SIMD operations for SIMD4, but that is trivial to extend if we need to.
Functionality-wise, we have a small core set of features that we require to be present on all platforms, while more advanced features can be used in the code when defines such as GMX_SIMD_HAVE_LOADU are set.
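As a rough illustration of guarding an advanced feature with such a define, the sketch below falls back to an aligned load through a temporary buffer when unaligned loads are unavailable. Everything prefixed sketch_ or SKETCH_ is hypothetical and merely stands in for real SIMD types and loads. The sketch also assumes that the feature macro is either defined or left undefined, and it defines GMX_SIMD_HAVE_LOADU itself only so that it compiles on its own.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_WIDTH 4      /* hypothetical SIMD width, in floats */
#define SKETCH_ALIGN 16     /* hypothetical alignment requirement, in bytes */

/* Defined here only so the sketch compiles stand-alone; in GROMACS the
 * SIMD implementation sets (or omits) this macro, never the calling code. */
#define GMX_SIMD_HAVE_LOADU

/* Stand-in for a SIMD register. */
typedef struct
{
    float v[SKETCH_WIDTH];
} sketch_vec_t;

/* Stand-in for an aligned load: the pointer must be SKETCH_ALIGN-aligned. */
sketch_vec_t sketch_load(const float *p)
{
    sketch_vec_t r;

    assert((uintptr_t)p % SKETCH_ALIGN == 0);
    memcpy(r.v, p, sizeof(r.v));
    return r;
}

#ifdef GMX_SIMD_HAVE_LOADU
/* Stand-in for an unaligned load: any pointer is accepted. */
sketch_vec_t sketch_loadu(const float *p)
{
    sketch_vec_t r;

    memcpy(r.v, p, sizeof(r.v));
    return r;
}
#endif

/* Loads SKETCH_WIDTH floats from a pointer of unknown alignment. */
sketch_vec_t sketch_load_any(const float *p)
{
#ifdef GMX_SIMD_HAVE_LOADU
    return sketch_loadu(p);
#else
    /* Core-feature fallback: copy into an aligned buffer, then load. */
    _Alignas(SKETCH_ALIGN) float buf[SKETCH_WIDTH];

    memcpy(buf, p, sizeof(buf));
    return sketch_load(buf);
#endif
}
```

Keeping the check inside one small helper such as sketch_load_any means the calling code contains no preprocessor conditionals of its own.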
This is a summary of the currently available preprocessor defines that you should use to check for support when using the corresponding features. We first list the float/double/int defines set by the implementation; in most cases you do not want to check directly for float/double defines, but you should instead use the derived "real" defines set in this file - we list those at the end below.
Predefined preprocessor macros set by the low-level implementation. These are only set if they work for all datatypes; GMX_SIMD_HAVE_LOADU thus means we can load float, double, and integers from unaligned memory, and that the unaligned loads are available for SIMD4 too.
GMX_SIMD_HAVE_FLOAT
GMX_SIMD_HAVE_DOUBLE
GMX_SIMD_HAVE_HARDWARE
GMX_SIMD_HAVE_LOADU
GMX_SIMD_HAVE_STOREU
GMX_SIMD_HAVE_LOGICAL
GMX_SIMD_HAVE_FMA
GMX_SIMD_HAVE_FRACTION
GMX_SIMD_HAVE_FINT32
GMX_SIMD_HAVE_FINT32_EXTRACT: applies to gmx_simd_fint32_t.
GMX_SIMD_HAVE_FINT32_LOGICAL: applies to gmx_simd_fint32_t.
GMX_SIMD_HAVE_FINT32_ARITHMETICS: applies to gmx_simd_fint32_t.
GMX_SIMD_HAVE_DINT32
GMX_SIMD_HAVE_DINT32_EXTRACT: applies to gmx_simd_dint32_t.
GMX_SIMD_HAVE_DINT32_LOGICAL: applies to gmx_simd_dint32_t.
GMX_SIMD_HAVE_DINT32_ARITHMETICS: applies to gmx_simd_dint32_t.
There are also two macros specific to SIMD4: GMX_SIMD4_HAVE_FLOAT is set if we can use SIMD4 in single precision, and GMX_SIMD4_HAVE_DOUBLE similarly denotes support for a double-precision SIMD4 implementation. For generic properties (e.g. whether SIMD4 FMA is supported), you should check the normal SIMD macros above.
Higher-level code can use these macros to find information about the implementation, for instance what the SIMD width is:
GMX_SIMD_FLOAT_WIDTH: width of gmx_simd_float_t, and practical width of gmx_simd_fint32_t.
GMX_SIMD_DOUBLE_WIDTH: width of gmx_simd_double_t, and practical width of gmx_simd_dint32_t.
GMX_SIMD_RSQRT_BITS
GMX_SIMD_RCP_BITS
After including the low-level architecture-specific implementation, this header sets the following derived defines based on the current precision; these are the ones you should check for unless you absolutely want to dig deep into the explicit single/double precision implementations:
GMX_SIMD_HAVE_REAL: GMX_SIMD_HAVE_FLOAT or GMX_SIMD_HAVE_DOUBLE
GMX_SIMD4_HAVE_REAL: GMX_SIMD4_HAVE_FLOAT or GMX_SIMD4_HAVE_DOUBLE
GMX_SIMD_REAL_WIDTH: GMX_SIMD_FLOAT_WIDTH or GMX_SIMD_DOUBLE_WIDTH
GMX_SIMD_HAVE_INT32: GMX_SIMD_HAVE_FINT32 or GMX_SIMD_HAVE_DINT32
GMX_SIMD_INT32_WIDTH: GMX_SIMD_FINT32_WIDTH or GMX_SIMD_DINT32_WIDTH
GMX_SIMD_HAVE_INT32_EXTRACT: GMX_SIMD_HAVE_FINT32_EXTRACT or GMX_SIMD_HAVE_DINT32_EXTRACT
GMX_SIMD_HAVE_INT32_LOGICAL: GMX_SIMD_HAVE_FINT32_LOGICAL or GMX_SIMD_HAVE_DINT32_LOGICAL
GMX_SIMD_HAVE_INT32_ARITHMETICS: GMX_SIMD_HAVE_FINT32_ARITHMETICS or GMX_SIMD_HAVE_DINT32_ARITHMETICS
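The way such derived defines can follow the build precision might be sketched as below. This is not the actual header: the width values are invented for a hypothetical implementation, GMX_DOUBLE is assumed here to be the build-time precision switch (it is not named in this text), and sketch_padded_size is a hypothetical helper showing how higher-level code can depend on GMX_SIMD_REAL_WIDTH alone.

```c
#include <assert.h>

/* Values a hypothetical low-level implementation might set. */
#define GMX_SIMD_HAVE_FLOAT
#define GMX_SIMD_HAVE_DOUBLE
#define GMX_SIMD_FLOAT_WIDTH  8
#define GMX_SIMD_DOUBLE_WIDTH 4

/* Derived "real" defines. GMX_DOUBLE is assumed to mark a
 * double-precision build; it is left undefined in this sketch. */
#ifdef GMX_DOUBLE
#    ifdef GMX_SIMD_HAVE_DOUBLE
#        define GMX_SIMD_HAVE_REAL
#    endif
#    define GMX_SIMD_REAL_WIDTH GMX_SIMD_DOUBLE_WIDTH
#else
#    ifdef GMX_SIMD_HAVE_FLOAT
#        define GMX_SIMD_HAVE_REAL
#    endif
#    define GMX_SIMD_REAL_WIDTH GMX_SIMD_FLOAT_WIDTH
#endif

/* Hypothetical helper: rounds n up to a multiple of the real SIMD width,
 * without knowing which precision or architecture is in use. */
int sketch_padded_size(int n)
{
    assert(n >= 0);
    return ((n + GMX_SIMD_REAL_WIDTH - 1) / GMX_SIMD_REAL_WIDTH) * GMX_SIMD_REAL_WIDTH;
}
```

Code written against the derived names needs no changes when the precision or the SIMD width changes, which is the reason to prefer them over the explicit float/double defines.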
For convenience we also define GMX_SIMD4_WIDTH to 4. This will never vary, but using it helps you make it clear that a loop or array refers to the SIMD4 width rather than some other '4'.
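A small sketch of that readability benefit follows. The function sketch_quartet_sums is hypothetical, plain scalar C is used in place of real SIMD4 operations, and the macro is defined locally only to keep the example stand-alone.

```c
#include <assert.h>

/* Defined here only so the sketch compiles on its own. */
#define GMX_SIMD4_WIDTH 4

/* Sums each quartet of `in` into one element of `out`.
 * `in` holds nquartets * GMX_SIMD4_WIDTH floats. */
void sketch_quartet_sums(const float *in, int nquartets, float *out)
{
    int q, k;

    assert(nquartets >= 0);
    for (q = 0; q < nquartets; q++)
    {
        float sum = 0.0f;

        /* The named constant makes it clear this 4 is the SIMD4 width. */
        for (k = 0; k < GMX_SIMD4_WIDTH; k++)
        {
            sum += in[q * GMX_SIMD4_WIDTH + k];
        }
        out[q] = sum;
    }
}
```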
While all these defines are available to specify the features of the hardware, we would strongly recommend that you do NOT sprinkle your code with defines; if nothing else it will be a debugging nightmare. Instead you can write a slower generic SIMD function that works everywhere, and then override this with faster architecture-specific versions for some implementations. The recommended way to do that is to add a define around the generic function that skips it if the name is already defined. The actual implementations in the lowest-level files are typically defined to an architecture-specific name (such as gmx_simd_sincos_d_sse2) so we can override it (e.g. in SSE4) by simply undefining and setting a new definition. Still, this is an implementation detail you won't have to worry about until you start writing support for a new SIMD architecture.
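The guard-and-override pattern described here can be sketched roughly as follows. All sketch_ names are hypothetical, and the "fast" version is ordinary scalar code standing in for an architecture-specific routine.

```c
#include <assert.h>
#include <stddef.h>

/* Pretend an architecture-specific header has already provided a faster
 * version and mapped the generic name onto it. */
float sketch_sum4_fast(const float *x)
{
    assert(x != NULL);
    return (x[0] + x[1]) + (x[2] + x[3]);
}
#define sketch_sum4 sketch_sum4_fast

/* Generic fallback: compiled only if nothing has claimed the name yet. */
#ifndef sketch_sum4
float sketch_sum4_generic(const float *x)
{
    float sum = 0.0f;
    int   i;

    assert(x != NULL);
    for (i = 0; i < 4; i++)
    {
        sum += x[i];
    }
    return sum;
}
#define sketch_sum4 sketch_sum4_generic
#endif
```

Callers always write sketch_sum4(x). Which implementation they get is decided once, where the name is defined, instead of at every call site.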
Having fallback implementations when SIMD is not supported can be a performance problem if the code does not correctly include gromacs/simd/simd.h, particularly after refactoring. make check-source checks the whole code for the use of symbols defined in gromacs/simd/simd.h and requires that files using those symbols do the correct include. Similar checking is done for higher-level SIMD-management headers, e.g. gromacs/ewald/pme-simd.h.