GROMACS 2020.4
pme_gpu_constants.h File Reference
#include "config.h"
#include "gromacs/gpu_utils/cuda_arch_utils.cuh"
(Include dependency graph for pme_gpu_constants.h, and the graph of files that include it, omitted.)

Description

This file defines the PME GPU compile-time constants/macros, used both in device and host code.

As OpenCL C is not aware of constexpr, most of this file is forwarded to the OpenCL kernel compilation as defines with the same names, for the sake of code similarity.
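For illustration only, a minimal host-side sketch of such forwarding; the helper name and the exact set of -D options are assumptions, not the actual GROMACS build machinery:

    #include <string>

    // Hypothetical helper: forward the host-side constants to the OpenCL
    // kernel build options as -D defines, so device code can use the same
    // names even though OpenCL C has no constexpr.
    static std::string pmeGpuKernelDefinesSketch()
    {
        std::string defines;
        defines += " -Dc_pmeGpuOrder=4";
        defines += " -Dc_virialAndEnergyCount=7";
        defines += " -Dc_skipNeutralAtoms=0";
        return defines; // would be passed to clBuildProgram() as build options
    }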

Todo:
The values are currently common to both CUDA and OpenCL implementations, but should be reconsidered when we tune the OpenCL implementation. See Redmine #2528.
Author
Aleksei Iupinov <a.yupinov@gmail.com>

Variables

constexpr bool c_usePadding = true
 false: The atom data GPU buffers are sized precisely to the number of atoms (except the GPU spline data layout, which is interleaved for 2 atoms per warp regardless). The atom index checks in the spread/gather code can hinder performance. true: The atom data GPU buffers are padded with zeroes so that the number of atoms that fits in them is divisible by c_pmeAtomDataAlignment, and the atom index checks are skipped. There should be a performance win, but how big it is remains to be seen. Additional cudaMemsetAsync calls are done occasionally (only for charges/coordinates; spline data is always recalculated now).
 
constexpr bool c_skipNeutralAtoms = false
 false: Atoms with zero charge are processed by PME, which could introduce some overhead. true: Atoms with zero charge are not processed by PME; this adds branching to the spread/gather, but could be good for performance in systems with many neutral atoms.
 
constexpr int c_virialAndEnergyCount = 7
 Number of PME solve output floating-point numbers: 6 for the symmetric virial matrix + 1 for the reciprocal energy.
 
constexpr int c_pmeGpuOrder = 4
 PME order parameter.
 
constexpr int c_pmeSpreadGatherThreadsPerAtom = c_pmeGpuOrder * c_pmeGpuOrder
 The number of GPU threads used for computing spread/gather contributions of a single atom, as a function of the PME order. The current assumption is that any thread processes only a single atom's contributions. TODO: this assumption leads to a minimum execution width of 16. See Redmine #2516.
 
constexpr int c_pmeSpreadGatherThreadsPerAtom4ThPerAtom = c_pmeGpuOrder
 Number of threads per atom when order threads are used.
 
constexpr int c_pmeSpreadGatherMinWarpSize = c_pmeSpreadGatherThreadsPerAtom
 Minimum execution width of the PME spread and gather kernels.
 
constexpr int c_pmeSpreadGatherMinWarpSize4ThPerAtom = c_pmeSpreadGatherThreadsPerAtom4ThPerAtom
 Minimum warp size if order threads per atom are used instead of order^2.
 
constexpr int c_pmeAtomDataAlignment = 64
 Atom data alignment (in terms of number of atoms). This is the least common multiple of the number of atoms processed by a single block/workgroup of the spread and gather kernels. If the GPU atom data buffers are padded (c_usePadding == true), then the number of atoms that fits in the padded GPU buffers has to be divisible by this. There are debug asserts for this divisibility in pme_gpu_spread() and pme_gpu_gather().
 
constexpr int c_spreadMaxWarpsPerBlock = 8
 Spreading max block width in warps, picked among powers of 2 (2, 4, 8, 16) for max. occupancy and min. runtime in most cases.
 
constexpr int c_solveMaxWarpsPerBlock = 8
 Solving kernel max block width in warps, picked among powers of 2 (2, 4, 8, 16) for max. occupancy and min. runtime.
 
constexpr int c_gatherMaxWarpsPerBlock = 4
 Gathering max block width in warps, picked empirically among 2, 4, 8, 16 for max. occupancy and min. runtime.
 
constexpr int c_pmeSpreadGatherAtomsPerWarp = (warp_size / c_pmeSpreadGatherThreadsPerAtom)
 The number of atoms processed by a single warp in spread/gather. This value depends on the templated order parameter (2 atoms per warp for order 4 and a warp_size of 32). It is mostly used for the spline data layout, which is tweaked for coalesced access.
 
constexpr int c_pmeSpreadGatherAtomsPerWarp4ThPerAtom
 Number of atoms per warp when order threads are used per atom.
 
constexpr int c_spreadMaxThreadsPerBlock = c_spreadMaxWarpsPerBlock * warp_size
 Spreading max block size in threads.
 
constexpr int c_solveMaxThreadsPerBlock = (c_solveMaxWarpsPerBlock * warp_size)
 Solving kernel max block size in threads.
 
constexpr int c_gatherMaxThreadsPerBlock = c_gatherMaxWarpsPerBlock * warp_size
 Gathering max block size in threads.
 
constexpr int c_gatherMinBlocksPerMP = GMX_CUDA_MAX_THREADS_PER_MP / c_gatherMaxThreadsPerBlock
 Gathering min blocks per CUDA multiprocessor.
 
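Taken together, the warps-per-block constants above fix the kernel block sizes. A minimal consistency sketch, assuming a warp_size of 32 (GROMACS takes the actual value from cuda_arch_utils.cuh on CUDA):

    constexpr int warp_size = 32; // assumption: CUDA warp width

    constexpr int c_spreadMaxThreadsPerBlock = 8 * warp_size; // 256 threads
    constexpr int c_solveMaxThreadsPerBlock  = 8 * warp_size; // 256 threads
    constexpr int c_gatherMaxThreadsPerBlock = 4 * warp_size; // 128 threads

    static_assert(c_spreadMaxThreadsPerBlock == 256 && c_solveMaxThreadsPerBlock == 256
                          && c_gatherMaxThreadsPerBlock == 128,
                  "block sizes implied by the warps-per-block constants above");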

Variable Documentation

constexpr int c_pmeGpuOrder = 4

PME order parameter.

Note that the GPU code, unlike the CPU, only supports order 4.

constexpr int c_pmeSpreadGatherAtomsPerWarp4ThPerAtom

Initial value:
= (warp_size / c_pmeSpreadGatherThreadsPerAtom4ThPerAtom)

Number of atoms per warp when order threads are used per atom.
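A worked example of the two layouts, assuming a warp_size of 32:

    constexpr int warp_size = 32; // assumption: CUDA warp width

    constexpr int c_pmeGpuOrder = 4;

    // order^2 threads per atom: 16 threads, hence 2 atoms per 32-wide warp
    constexpr int threadsPerAtom  = c_pmeGpuOrder * c_pmeGpuOrder; // 16
    constexpr int atomsPerWarp    = warp_size / threadsPerAtom;    // 2

    // order threads per atom: 4 threads, hence 8 atoms per 32-wide warp
    constexpr int threadsPerAtom4 = c_pmeGpuOrder;                 // 4
    constexpr int atomsPerWarp4   = warp_size / threadsPerAtom4;   // 8

    static_assert(atomsPerWarp == 2 && atomsPerWarp4 == 8, "order-4 layouts");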

constexpr int c_pmeSpreadGatherMinWarpSize = c_pmeSpreadGatherThreadsPerAtom

Minimum execution width of the PME spread and gather kernels.

Due to the one-thread-per-atom and order=4 implementation constraints, order^2 threads should execute without needing synchronization. See c_pmeSpreadGatherThreadsPerAtom.

constexpr bool c_skipNeutralAtoms = false

false: Atoms with zero charge are processed by PME, which could introduce some overhead. true: Atoms with zero charge are not processed by PME; this adds branching to the spread/gather, but could be good for performance in systems with many neutral atoms.

Todo:
Estimate performance differences.
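For illustration, a minimal device-code sketch of the branching this toggle would add; the function and its body are hypothetical, not the actual spread kernel:

    constexpr bool c_skipNeutralAtoms = false;

    // Hypothetical spread-kernel fragment: with c_skipNeutralAtoms == true
    // every thread gains a data-dependent branch on its atom's charge; with
    // false the branch is compiled out.
    __device__ void spreadAtomChargeSketch(float charge /*, grid pointers, ... */)
    {
        if (c_skipNeutralAtoms && charge == 0.0F)
        {
            return; // a neutral atom contributes nothing to the grid
        }
        // ... accumulate charge * spline weights into the charge grid ...
    }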
constexpr int c_solveMaxWarpsPerBlock = 8

Solving kernel max block width in warps, picked among powers of 2 (2, 4, 8, 16) for max. occupancy and min. runtime on the cards tested (560Ti (CC2.1), 660Ti (CC3.0) and 750 (CC5.0)).
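These warp counts translate into CUDA launch bounds; a minimal sketch, assuming warp_size == 32 and a hypothetical kernel name:

    constexpr int warp_size                 = 32; // assumption: CUDA warp width
    constexpr int c_solveMaxWarpsPerBlock   = 8;
    constexpr int c_solveMaxThreadsPerBlock = c_solveMaxWarpsPerBlock * warp_size; // 256

    // Hypothetical kernel: __launch_bounds__ tells the compiler to cap register
    // use so a block of c_solveMaxThreadsPerBlock threads is guaranteed to launch.
    __global__ void __launch_bounds__(c_solveMaxThreadsPerBlock) pmeSolveKernelSketch()
    {
        // ... solve in reciprocal space, optionally accumulating energy and virial ...
    }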

constexpr bool c_usePadding = true

false: The atom data GPU buffers are sized precisely to the number of atoms (except the GPU spline data layout, which is interleaved for 2 atoms per warp regardless). The atom index checks in the spread/gather code can hinder performance. true: The atom data GPU buffers are padded with zeroes so that the number of atoms that fits in them is divisible by c_pmeAtomDataAlignment, and the atom index checks are skipped. There should be a performance win, but how big it is remains to be seen. Additional cudaMemsetAsync calls are done occasionally (only for charges/coordinates; spline data is always recalculated now).

Todo:
Estimate performance differences.
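For illustration, the round-up that the padding implies; the helper name is hypothetical, but the divisibility matches what pme_gpu_spread() and pme_gpu_gather() assert:

    constexpr int c_pmeAtomDataAlignment = 64;

    // Round an atom count up to the next multiple of the alignment, so that
    // padded GPU buffers satisfy the asserted divisibility.
    constexpr int pmeGpuPaddedAtomCountSketch(int nAtoms)
    {
        return ((nAtoms + c_pmeAtomDataAlignment - 1) / c_pmeAtomDataAlignment)
               * c_pmeAtomDataAlignment;
    }

    static_assert(pmeGpuPaddedAtomCountSketch(1) == 64, "1 atom pads to 64");
    static_assert(pmeGpuPaddedAtomCountSketch(64) == 64, "exact multiple unchanged");
    static_assert(pmeGpuPaddedAtomCountSketch(65) == 128, "65 atoms pad to 128");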