Gromacs  2025-dev-20240913-b871546
PmeGpuProgramImpl Struct Reference

#include <gromacs/ewald/pme_gpu_program_impl.h>

Description

PME GPU persistent host program/kernel data, which should be initialized once for the whole execution.

The primary purpose of this is to avoid recompiling GPU kernels for each OpenCL unit test while the relevant GPU context instance (e.g. cl_context) persists. In CUDA, this just assigns the kernel function pointers. This also implicitly relies on a reasonable share of the kernels always being used; with more template parameters, an even smaller share of all possible kernels would be used.
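The reuse described above can be sketched as a cache keyed by device context: kernels are built once per context and handed out again on later requests. This is a minimal sketch under stated assumptions. FakeDeviceContext, FakeProgram and getOrBuildProgram are hypothetical stand-ins, not GROMACS types or functions, and "compiling" is reduced to incrementing a counter.

```cpp
#include <cassert>
#include <map>
#include <memory>

// Hypothetical stand-in for a GPU context such as a cl_context; only its id matters here.
struct FakeDeviceContext
{
    int id;
};

// Hypothetical stand-in for the persistent program data; construction models kernel compilation.
struct FakeProgram
{
    explicit FakeProgram(const FakeDeviceContext& context) : contextId(context.id) { ++buildCount; }

    int        contextId;
    static int buildCount; // how many times kernels were "compiled" in total
};
int FakeProgram::buildCount = 0;

// Returns the program for this context, building it only on the first request.
std::shared_ptr<FakeProgram> getOrBuildProgram(const FakeDeviceContext& context)
{
    static std::map<int, std::shared_ptr<FakeProgram>> cache;
    std::shared_ptr<FakeProgram>&                       slot = cache[context.id];
    if (!slot)
    {
        slot = std::make_shared<FakeProgram>(context);
    }
    assert(slot->contextId == context.id);
    return slot;
}
```

In the struct documented on this page the context is instead passed to the constructor and managed externally; the static map here only serves to make the build-once behaviour observable in a few lines.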

Todo:
In the future, if we need to compile different kernels in response to user input or auto-tuning, we might wish to revisit the number of kernels we pre-compile and/or the management of their lifetime.

This also doesn't manage cuFFT/clFFT kernels, which depend on the PME grid dimensions.

Public Types

using PmeKernelHandle = ISyclKernelFunctor *
 Conveniently, all the PME kernels use the same single argument type.
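Why a single handle type is enough can be illustrated with a small sketch: if every kernel accepts the same argument type, kernels of different kinds can sit behind one pointer type and be launched uniformly. KernelParams, IKernelFunctor, KernelHandle and runKernel are hypothetical names chosen for this sketch; they are not the actual ISyclKernelFunctor interface, whose methods are not documented on this page.

```cpp
#include <cassert>
#include <string>

// Hypothetical single argument shared by every kernel (stands in for the PME kernel parameters).
struct KernelParams
{
    int numAtoms = 0;
};

// Hypothetical kernel interface; a pointer to it plays the role of PmeKernelHandle.
struct IKernelFunctor
{
    virtual ~IKernelFunctor()                         = default;
    virtual std::string launch(const KernelParams& p) = 0;
};

struct SpreadKernel : IKernelFunctor
{
    std::string launch(const KernelParams& p) override { return "spread " + std::to_string(p.numAtoms); }
};

struct GatherKernel : IKernelFunctor
{
    std::string launch(const KernelParams& p) override { return "gather " + std::to_string(p.numAtoms); }
};

// Because every kernel takes the same argument type, one handle type suffices.
using KernelHandle = IKernelFunctor*;

std::string runKernel(KernelHandle kernel, const KernelParams& params)
{
    assert(kernel != nullptr);
    return kernel->launch(params);
}
```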
 

Public Member Functions

 PmeGpuProgramImpl (const DeviceContext &deviceContext)
 Constructor for the given device.
 
 GMX_DISALLOW_COPY_AND_ASSIGN (PmeGpuProgramImpl)
 
int warpSize () const
 Return the warp size for which the kernels were compiled.
 

Public Attributes

const DeviceContext & deviceContext_
 A handle to the device context (managed externally).
 
size_t warpSize_
 Maximum synchronous GPU thread group execution width. "Warp" is a CUDA term that we also reuse in OpenCL kernels. For CUDA, this is a static value that comes from gromacs/gpu_utils/cuda_arch_utils.cuh; for OpenCL, it has to be queried dynamically.
 
PmeKernelHandle nvshmemSignalKern
 
size_t spreadWorkGroupSize
 Spread/spline kernels are compiled only for order of 4.
 
PmeKernelHandle splineKernelSingle
 
PmeKernelHandle splineKernelThPerAtom4Single
 
PmeKernelHandle spreadKernelSingle
 
PmeKernelHandle spreadKernelThPerAtom4Single
 
PmeKernelHandle splineAndSpreadKernelSingle
 
PmeKernelHandle splineAndSpreadKernelThPerAtom4Single
 
PmeKernelHandle splineAndSpreadKernelWriteSplinesSingle
 
PmeKernelHandle splineAndSpreadKernelWriteSplinesThPerAtom4Single
 
PmeKernelHandle splineKernelDual
 
PmeKernelHandle splineKernelThPerAtom4Dual
 
PmeKernelHandle spreadKernelDual
 
PmeKernelHandle spreadKernelThPerAtom4Dual
 
PmeKernelHandle splineAndSpreadKernelDual
 
PmeKernelHandle splineAndSpreadKernelThPerAtom4Dual
 
PmeKernelHandle splineAndSpreadKernelWriteSplinesDual
 
PmeKernelHandle splineAndSpreadKernelWriteSplinesThPerAtom4Dual
 
size_t gatherWorkGroupSize
 Same for gather: hardcoded X/Y unwrap parameters, order of 4, plus it can either reduce with previous forces in the host buffer, or ignore them.
 
PmeKernelHandle gatherKernelSingle
 
PmeKernelHandle gatherKernelThPerAtom4Single
 
PmeKernelHandle gatherKernelReadSplinesSingle
 
PmeKernelHandle gatherKernelReadSplinesThPerAtom4Single
 
PmeKernelHandle gatherKernelDual
 
PmeKernelHandle gatherKernelThPerAtom4Dual
 
PmeKernelHandle gatherKernelReadSplinesDual
 
PmeKernelHandle gatherKernelReadSplinesThPerAtom4Dual
 
size_t solveMaxWorkGroupSize
 Solve kernel doesn't care about the interpolation order, but can optionally compute energy and virial, and supports XYZ and YZX grid orderings.
 
PmeKernelHandle solveYZXKernelA
 
PmeKernelHandle solveXYZKernelA
 
PmeKernelHandle solveYZXEnergyKernelA
 
PmeKernelHandle solveXYZEnergyKernelA
 
PmeKernelHandle solveYZXKernelB
 
PmeKernelHandle solveXYZKernelB
 
PmeKernelHandle solveYZXEnergyKernelB
 
PmeKernelHandle solveXYZEnergyKernelB
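The attribute list above amounts to 32 kernel handles (excluding nvshmemSignalKern), because each kernel family multiplies a few kernel kinds by independent two-valued choices. The short sketch below, built around the hypothetical helper countVariants, reproduces these counts and shows that one more two-valued template parameter would double a family, in line with the remark in the description that more template parameters mean a smaller share of the kernels is actually used.

```cpp
#include <cassert>
#include <cstddef>

// Number of kernel variants when `kinds` distinct kernels are each templated
// on `binaryChoices` independent two-valued parameters.
std::size_t countVariants(std::size_t kinds, std::size_t binaryChoices)
{
    assert(binaryChoices < 16); // keep the shift well within range
    return kinds << binaryChoices;
}

// Spread family: spline, spread, splineAndSpread, splineAndSpreadWriteSplines,
// each for 4 or 16 threads per atom and for Single or Dual grids.
std::size_t spreadFamilyCount() { return countVariants(4, 2); }

// Gather family: plain or ReadSplines, times threads per atom, times Single/Dual.
std::size_t gatherFamilyCount() { return countVariants(2, 2); }

// Solve family: YZX or XYZ ordering, times with/without energy, times grid A/B.
std::size_t solveFamilyCount() { return countVariants(2, 2); }
```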
 

Member Data Documentation

const DeviceContext& PmeGpuProgramImpl::deviceContext_

A handle to the device context (managed externally).

Only used in OpenCL for JIT compilation.

size_t PmeGpuProgramImpl::gatherWorkGroupSize

Same for gather: hardcoded X/Y unwrap parameters, order of 4, plus it can either reduce with previous forces in the host buffer, or ignore them.

Also, as with the spread kernels, we can use either order (4) or order*order (16) threads per atom, and either recalculate the splines or read the ones written by the spread kernel. The kernels are templated separately for using one or two grids (required for calculating energies and virial).
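The three gather choices map onto the eight gather handle names listed under Public Attributes. A minimal sketch of that mapping follows; gatherKernelName is a hypothetical helper that returns the attribute name rather than a handle, and it assumes that names without the ThPerAtom4 suffix are the order*order (16) variants. This page does not document how GROMACS itself performs the selection.

```cpp
#include <cassert>
#include <string>

// Builds the attribute name of the gather kernel handle for one configuration.
// threadsPerAtom: 4 (order) or 16 (order*order) for interpolation order 4.
// readSplines:    true to reuse the splines written by the spread kernel.
// numGrids:       1 ("Single") or 2 ("Dual").
std::string gatherKernelName(int threadsPerAtom, bool readSplines, int numGrids)
{
    assert(threadsPerAtom == 4 || threadsPerAtom == 16);
    assert(numGrids == 1 || numGrids == 2);
    std::string name = "gatherKernel";
    if (readSplines)
    {
        name += "ReadSplines";
    }
    if (threadsPerAtom == 4)
    {
        name += "ThPerAtom4"; // assumption: names without this suffix use 16 threads per atom
    }
    name += (numGrids == 2) ? "Dual" : "Single";
    return name;
}
```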

size_t PmeGpuProgramImpl::solveMaxWorkGroupSize

Solve kernel doesn't care about the interpolation order, but can optionally compute energy and virial, and supports XYZ and YZX grid orderings.

The kernels are templated separately for grids in state A and B.
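The solve variants are indexed by grid ordering, by whether energy and virial are computed, and by grid state A or B. The sketch below maps one configuration to the matching attribute name; GridOrdering and solveKernelName are hypothetical, and returning a name instead of a PmeKernelHandle is a simplification for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical enumeration of the two supported grid orderings.
enum class GridOrdering
{
    YZX,
    XYZ
};

// Builds the attribute name of the solve kernel handle for one configuration.
// gridIndex: 0 for the state A grid, 1 for the state B grid.
std::string solveKernelName(GridOrdering ordering, bool computeEnergyAndVirial, int gridIndex)
{
    assert(gridIndex == 0 || gridIndex == 1);
    std::string name = "solve";
    name += (ordering == GridOrdering::YZX) ? "YZX" : "XYZ";
    if (computeEnergyAndVirial)
    {
        name += "Energy";
    }
    name += "Kernel";
    name += (gridIndex == 0) ? "A" : "B";
    return name;
}
```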

size_t PmeGpuProgramImpl::spreadWorkGroupSize

Spread/spline kernels are compiled only for order of 4.

There are multiple versions of each kernel, parameterized according to the number of threads per atom (using either order (4) or order*order (16) threads per atom is supported) and according to whether the spline data is written in the spline/spread kernel and loaded in the gather, or recalculated in the gather. Spreading kernels also have hardcoded X/Y index wrapping parameters, as a placeholder for implementing 1D/2D decomposition. The kernels are templated separately for spreading on one grid (one or two sets of coefficients) or on two grids (required for energy and virial calculations).
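How the spline/spread parameters combine into the sixteen handle names listed under Public Attributes can be sketched as follows. spreadKernelName is a hypothetical helper that returns the attribute name rather than a handle; it assumes that names without ThPerAtom4 are the 16-thread variants and that writeSplines only matters when splines are computed and charges spread in the same kernel, since WriteSplines appears only on the splineAndSpread handles.

```cpp
#include <cassert>
#include <string>

// Builds the attribute name of the spline/spread kernel handle for one configuration.
std::string spreadKernelName(bool computeSplines, bool spreadCharges, bool writeSplines, int threadsPerAtom, int numGrids)
{
    assert(computeSplines || spreadCharges); // at least one stage must run
    assert(threadsPerAtom == 4 || threadsPerAtom == 16);
    assert(numGrids == 1 || numGrids == 2);
    std::string name;
    if (computeSplines && spreadCharges)
    {
        name = "splineAndSpreadKernel";
        if (writeSplines)
        {
            name += "WriteSplines"; // splines kept for the gather kernel to read
        }
    }
    else
    {
        // Spline-only or spread-only: writeSplines is ignored in this sketch.
        name = computeSplines ? "splineKernel" : "spreadKernel";
    }
    if (threadsPerAtom == 4)
    {
        name += "ThPerAtom4";
    }
    name += (numGrids == 2) ? "Dual" : "Single";
    return name;
}
```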


The documentation for this struct was generated from the following files: