Gromacs  2025-dev-20240812-545ca5b
 All Classes Namespaces Files Functions Variables Typedefs Enumerations Enumerator Friends Macros Groups Pages
Namespaces | Classes | Typedefs | Enumerations | Functions | Variables
Nbnxm Namespace Reference

Description

Namespace for non-bonded kernels.

Namespaces

 anonymous_namespace{grid.cpp}
 

Classes

struct  KernelBenchOptions
 The options for the kernel benchmarks. More...
 
struct  BoundingBox
 Bounding box for a nbnxm atom cluster. More...
 
struct  GpuTimers
 GPU region timers used for timing GPU kernels and H2D/D2H transfers. More...
 
class  GpuPairlistSorting
 Sorted pair list on GPU and data required for performing the sorting. More...
 
class  GpuPairlist
 GPU pair list structure. More...
 
struct  BoundingBox1D
 Bounding box for one dimension of a grid cell. More...
 
class  Grid
 A pair-search grid object for one domain decomposition zone. More...
 
class  GridSet
 Holds a set of search grids for the local + non-local DD zones. More...
 
struct  GridSetData
 Struct that holds grid data that is shared over all grids. More...
 
struct  GridWork
 Working arrays for constructing a grid. More...
 
struct  EnergyFunctionProperties
 Set of boolean constants mimicking preprocessor macros. More...
 

Typedefs

using GpuPairlistByLocality = gmx::EnumerationArray< InteractionLocality, std::unique_ptr< GpuPairlist >>
 
using mode = sycl::access_mode
 

Enumerations

enum  BenchMarkKernels : int {
  SimdAuto, SimdNo, Simd4XM, Simd2XMM,
  Count
}
 Enum for selecting the SIMD kernel type for benchmarks.
 
enum  BenchMarkCombRule : int { RuleGeom, RuleLB, RuleNone, Count }
 Enum for selecting the combination rule for kernel benchmarks.
 
enum  BenchMarkCoulomb : int { Pme, ReactionField, Count }
 Enum for selecting coulomb type for kernel benchmarks.
 
enum  ElecType : int {
  ElecType::Cut, ElecType::RF, ElecType::EwaldTab, ElecType::EwaldTabTwin,
  ElecType::EwaldAna, ElecType::EwaldAnaTwin, ElecType::Count
}
 Nbnxm electrostatic GPU kernel flavors. More...
 
enum  VdwType : int {
  VdwType::Cut, VdwType::CutCombGeom, VdwType::CutCombLB, VdwType::FSwitch,
  VdwType::PSwitch, VdwType::EwaldGeom, VdwType::EwaldLB, VdwType::Count
}
 Nbnxm VdW GPU kernel flavors. More...
 
enum  KernelType : int {
  NotSet = 0, Cpu4x4_PlainC, Cpu4xN_Simd_4xN, Cpu4xN_Simd_2xNN,
  Gpu8x8x8, Cpu8x8x8_PlainC, Count
}
 Nonbonded NxN kernel types: plain C, CPU SIMD, GPU, GPU emulation.
 
enum  EwaldExclusionType : int { NotSet = 0, Table, Analytical, DecidedByGpuModule }
 Ewald exclusion types.
 
enum  NonbondedResource : int { Cpu, Gpu, EmulateGpu }
 Resources that can be used to execute non-bonded kernels on.
 

Functions

static std::optional< std::string > checkKernelSetup (const KernelBenchOptions &options)
 Checks the kernel setup. More...
 
static KernelType translateBenchmarkEnum (const BenchMarkKernels &kernel)
 Helper to translate between the different enumeration values.
 
static KernelSetup getKernelSetup (const KernelBenchOptions &options)
 Returns the kernel setup.
 
static interaction_const_t setupInteractionConst (const KernelBenchOptions &options)
 Return an interaction constants struct with members used in the benchmark set appropriately.
 
static LJCombinationRule convertLJCombinationRule (const BenchMarkCombRule combRule)
 Converts the benchmark LJ comb.rule. enum to the corresponding NBNxM enum.
 
static std::unique_ptr
< nonbonded_verlet_t
setupNbnxmForBenchInstance (const KernelBenchOptions &options, const gmx::BenchmarkSystem &system)
 Sets up and returns a Nbnxm object for the given benchmark options and system.
 
static void expandSimdOptionAndPushBack (const KernelBenchOptions &options, std::vector< KernelBenchOptions > *optionsList)
 Add the options instance to the list for all requested kernel SIMD types.
 
static void setupAndRunInstance (const gmx::BenchmarkSystem &system, const KernelBenchOptions &options, const bool doWarmup)
 Sets up and runs the requested benchmark instance and prints the results.
 
void bench (int sizeFactor, const KernelBenchOptions &options)
 Sets up and runs one or more Nbnxm kernel benchmarks. More...
 
template<typename T >
static std::enable_if_t
< std::is_same_v< T,
gmx::BasicVector< float >
>, gmx::BasicVector< float > > 
loadBoundingBoxCorner (const BoundingBox::Corner &corner)
 Loads a corner of a bounding box into a float vector.
 
static gmx::BasicVector< float > max (const gmx::BasicVector< float > &v1, const gmx::BasicVector< float > &v2)
 Return the element-wise max of two 3-float vectors, needed to share code with SIMD.
 
static float dotProduct (const gmx::BasicVector< float > &v1, const gmx::BasicVector< float > &v2)
 Return the dot product of two 3-float vectors, needed to share code with SIMD.
 
static float clusterBoundingBoxDistance2 (const BoundingBox &bb_i, const BoundingBox &bb_j)
 Returns the distance^2 between two bounding boxes. More...
 
void cuda_set_cacheconfig ()
 Set up the cache configuration for the non-bonded kernels.
 
static void countPruneKernelTime (Nbnxm::GpuTimers *timers, gmx_wallclock_gpu_nbnxn_t *timings, const InteractionLocality iloc)
 Count pruning kernel time if either kernel has been triggered. More...
 
static void gpu_reduce_staged_outputs (const NBStagingData &nbst, const InteractionLocality iLocality, const bool reduceEnergies, const bool reduceFshift, real *e_lj, real *e_el, rvec *fshift)
 Reduce data staged internally in the nbnxn module. More...
 
template<typename GpuPairlist >
static void gpu_accumulate_timings (gmx_wallclock_gpu_nbnxn_t *timings, Nbnxm::GpuTimers *timers, const GpuPairlist *plist, AtomLocality atomLocality, const gmx::StepWorkload &stepWork, bool doTiming)
 Do the per-step timing accounting of the nonbonded tasks. More...
 
bool gpu_try_finish_task (NbnxmGpu *nb, const gmx::StepWorkload &stepWork, const AtomLocality aloc, real *e_lj, real *e_el, gmx::ArrayRef< gmx::RVec > shiftForces, GpuTaskCompletion completionKind)
 Attempts to complete nonbonded GPU task. More...
 
float gpu_wait_finish_task (NbnxmGpu *nb, const gmx::StepWorkload &stepWork, AtomLocality aloc, real *e_lj, real *e_el, gmx::ArrayRef< gmx::RVec > shiftForces, gmx_wallcycle *wcycle)
 Wait for the asynchronously launched nonbonded tasks and data transfers to finish. More...
 
static bool canSkipNonbondedWork (const NbnxmGpu &nb, InteractionLocality iloc)
 An early return condition for empty NB GPU workloads. More...
 
static gmx::Range< int > getGpuAtomRange (const NBAtomDataGpu *atomData, const AtomLocality atomLocality)
 Calculate atom range and return start index and length. More...
 
NbnxmGpugpu_init (const gmx::DeviceStreamManager gmx_unused &deviceStreamManager, const interaction_const_t gmx_unused *ic, const PairlistParams gmx_unused &listParams, const nbnxn_atomdata_t gmx_unused *nbat, bool gmx_unused bLocalAndNonlocal)
 Initializes the data structures related to GPU nonbonded calculations. More...
 
void gpu_init_pairlist (NbnxmGpu gmx_unused *nb, const struct NbnxnPairlistGpu gmx_unused *h_nblist, gmx::InteractionLocality gmx_unused iloc)
 Initializes pair-list data for GPU, called at every pair search step. More...
 
void gpu_init_atomdata (NbnxmGpu gmx_unused *nb, const nbnxn_atomdata_t gmx_unused *nbat)
 Initializes atom-data on the GPU, called at every pair search step. More...
 
void gpu_pme_loadbal_update_param (struct nonbonded_verlet_t gmx_unused *nbv, const interaction_const_t gmx_unused &ic)
 Re-generate the GPU Ewald force table, resets rlist, and update the electrostatic type switching to twin cut-off (or back) if needed.
 
void gpu_upload_shiftvec (NbnxmGpu gmx_unused *nb, const nbnxn_atomdata_t gmx_unused *nbatom)
 Uploads shift vector to the GPU if the box is dynamic (otherwise just returns). More...
 
void gpu_clear_outputs (NbnxmGpu gmx_unused *nb, bool gmx_unused computeVirial)
 Clears GPU outputs: nonbonded force, shift force and energy. More...
 
void gpu_free (NbnxmGpu gmx_unused *nb)
 Frees all GPU resources used for the nonbonded calculations. More...
 
struct gmx_wallclock_gpu_nbnxn_tgpu_get_timings (NbnxmGpu gmx_unused *nb)
 Returns the GPU timings structure or NULL if GPU is not used or timing is off. More...
 
void gpu_reset_timings (struct nonbonded_verlet_t gmx_unused *nbv)
 Resets nonbonded GPU timings. More...
 
int gpu_min_ci_balanced (NbnxmGpu gmx_unused *nb)
 Calculates the minimum size of proximity lists to improve SM load balance with GPU non-bonded kernels. More...
 
bool gpu_is_kernel_ewald_analytical (const NbnxmGpu gmx_unused *nb)
 Returns if analytical Ewald GPU kernels are used. More...
 
NBAtomDataGpugpuGetNBAtomData (NbnxmGpu gmx_unused *nb)
 Returns an opaque pointer to the GPU NBNXM atom data.
 
DeviceBuffer< gmx::RVecgpu_get_f (NbnxmGpu gmx_unused *nb)
 Returns forces device buffer.
 
static real gridAtomDensity (int numAtoms, const gmx::RVec &gridBoundingBoxSize)
 Returns the atom density (> 0) of a rectangular grid.
 
static std::array< real, DIM-1 > getTargetCellLength (const Grid::Geometry &geometry, const real atomDensity)
 Get approximate dimensions of each cell. Returns the length along X and Y.
 
static int getMaxNumCells (const Grid::Geometry &geometry, const int numAtoms, const int numColumns)
 
static void sort_atoms (int dim, gmx_bool Backwards, int gmx_unused dd_zone, bool gmx_unused relevantAtomsAreWithinGridBounds, int *a, int n, gmx::ArrayRef< const gmx::RVec > x, real h0, real invh, int n_per_h, gmx::ArrayRef< int > sort)
 Sorts particle index a on coordinates x along dim. More...
 
static float R2F_D (const float x)
 Returns x.
 
static float R2F_U (const float x)
 Returns x.
 
static void calc_bounding_box (const int numAtoms, const int stride, const real *x, BoundingBox *bb)
 Computes the bounding box for na coordinates in order x,y,z, bb order xyz0.
 
template<int packSize>
static void calcBoundingBoxXPacked (const int numAtoms, const real *x, BoundingBox *bb)
 Computes the bounding box for packed coordinates. More...
 
template<int packSize>
static gmx_unused void calcBoundingBoxHalves (const int numAtoms, const real *x, BoundingBox *bb, BoundingBox *bbj)
 Computes the whole plus half bounding boxes for packed coordinates. More...
 
static void combine_bounding_box_pairs (const Grid &grid, gmx::ArrayRef< const BoundingBox > bb, gmx::ArrayRef< BoundingBox > bbj)
 Combines pairs of consecutive bounding boxes.
 
static void print_bbsizes_simple (FILE *fp, const Grid &grid)
 Prints the average bb size, used for debug output.
 
static void print_bbsizes_supersub (FILE *fp, const Grid &grid)
 Prints the average bb size, used for debug output.
 
static void sort_cluster_on_flag (int numAtomsInCluster, int atomStart, int atomEnd, gmx::ArrayRef< const int32_t > atomInfo, gmx::ArrayRef< int > order, int *flags)
 Set non-bonded interaction flags for the current cluster. More...
 
static void setCellAndAtomCount (gmx::ArrayRef< int > cell, int cellIndex, gmx::ArrayRef< int > cxy_na, int atomIndex)
 Sets the cell index in the cell array for atom atomIndex and increments the atom count for the grid column.
 
static void resizeForNumberOfCells (const int numNbnxnAtoms, const int numAtomsMoved, const int ddZone, GridSetData *gridSetData, nbnxn_atomdata_t *nbat)
 Resizes grid and atom data which depend on the number of cells.
 
real generateAndFill2DGrid (Grid *grid, gmx::ArrayRef< GridWork > gridWork, gmx::HostVector< int > *cells, const rvec lowerCorner, const rvec upperCorner, const gmx::UpdateGroupsCog *updateGroupsCog, gmx::Range< int > atomRange, real *atomDensity, real maxAtomGroupRadius, gmx::ArrayRef< const gmx::RVec > x, int ddZone, const int *move, int numAtomsMoved, bool computeGridDensityRatio)
 Sets the 2D search grid dimensions puts the atoms on the 2D grid. More...
 
static int numGrids (const GridSet::DomainSetup &domainSetup)
 Returns the number of search grids.
 
static int getGridOffset (gmx::ArrayRef< const Grid > grids, int gridIndex)
 
const char * lookup_kernel_name (Nbnxm::KernelType kernelType)
 Return a string identifying the kernel type. More...
 
std::unique_ptr
< nonbonded_verlet_t
init_nb_verlet (const gmx::MDLogger &mdlog, const t_inputrec &inputrec, const t_forcerec &forcerec, const t_commrec *commrec, const gmx_hw_info_t &hardwareInfo, bool useGpuForNonbonded, const gmx::DeviceStreamManager *deviceStreamManager, const gmx_mtop_t &mtop, gmx::ObservablesReducerBuilder *observablesReducerBuilder, gmx::ArrayRef< const gmx::RVec > coordinates, matrix box, gmx_wallcycle *wcycle)
 Creates an Nbnxm object.
 
static constexpr int sc_iClusterSize (const KernelType kernelType)
 The nbnxn i-cluster size in atoms for the given NBNxM kernel type.
 
static constexpr int sc_jClusterSize (const KernelType kernelType)
 The nbnxn j-cluster size in atoms for the given NBNxM kernel type. More...
 
static constexpr bool kernelTypeUsesSimplePairlist (const KernelType kernelType)
 Returns whether the pair-list corresponding to nb_kernel_type is simple.
 
static constexpr bool kernelTypeIsSimd (const KernelType kernelType)
 Returns whether a SIMD kernel is in use.
 
static bool useLjCombRule (const enum VdwType vdwType)
 Returns true if LJ combination rules are used in the non-bonded kernels. More...
 
void gpu_copy_xq_to_gpu (NbnxmGpu gmx_unused *nb, const struct nbnxn_atomdata_t gmx_unused *nbdata, gmx::AtomLocality gmx_unused aloc)
 Launch asynchronously the xq buffer host to device copy. More...
 
void gpu_launch_kernel (NbnxmGpu gmx_unused *nb, const gmx::StepWorkload gmx_unused &stepWork, gmx::InteractionLocality gmx_unused iloc)
 Launch asynchronously the nonbonded force calculations. More...
 
void gpu_launch_kernel_pruneonly (NbnxmGpu gmx_unused *nb, gmx::InteractionLocality gmx_unused iloc, int gmx_unused numParts)
 Launch asynchronously the nonbonded prune-only kernel. More...
 
void gpu_launch_cpyback (NbnxmGpu gmx_unused *nb, nbnxn_atomdata_t gmx_unused *nbatom, const gmx::StepWorkload gmx_unused &stepWork, gmx::AtomLocality gmx_unused aloc)
 Launch asynchronously the download of short-range forces from the GPU (and energies/shift forces if required).
 
bool gpu_try_finish_task (NbnxmGpu gmx_unused *nb, const gmx::StepWorkload gmx_unused &stepWork, gmx::AtomLocality gmx_unused aloc, real gmx_unused *e_lj, real gmx_unused *e_el, gmx::ArrayRef< gmx::RVec > gmx_unused shiftForces, GpuTaskCompletion gmx_unused completionKind)
 Attempts to complete nonbonded GPU task. More...
 
float gpu_wait_finish_task (NbnxmGpu gmx_unused *nb, const gmx::StepWorkload gmx_unused &stepWork, gmx::AtomLocality gmx_unused aloc, real gmx_unused *e_lj, real gmx_unused *e_el, gmx::ArrayRef< gmx::RVec > gmx_unused shiftForces, gmx_wallcycle gmx_unused *wcycle)
 Completes the nonbonded GPU task blocking until GPU tasks and data transfers to finish. More...
 
void nbnxn_gpu_init_x_to_nbat_x (const Nbnxm::GridSet gmx_unused &gridSet, NbnxmGpu gmx_unused *gpu_nbv)
 Initialization for X buffer operations on GPU. Called on the NS step and performs (re-)allocations and memory copies. !
 
void nbnxn_gpu_x_to_nbat_x (const Nbnxm::Grid gmx_unused &grid, NbnxmGpu gmx_unused *gpu_nbv, DeviceBuffer< gmx::RVec > gmx_unused d_x, GpuEventSynchronizer gmx_unused *xReadyOnDevice, gmx::AtomLocality gmx_unused locality, int gmx_unused gridId, int gmx_unused numColumnsMax, bool gmx_unused mustInsertNonLocalDependency)
 X buffer operations on GPU: performs conversion from rvec to nb format. More...
 
void nbnxnInsertNonlocalGpuDependency (NbnxmGpu gmx_unused *nb, gmx::InteractionLocality gmx_unused interactionLocality)
 Sync the nonlocal stream with dependent tasks in the local queue. More...
 
void setupGpuShortRangeWork (NbnxmGpu gmx_unused *nb, const gmx::ListedForcesGpu gmx_unused *listedForcesGpu, gmx::InteractionLocality gmx_unused iLocality)
 Set up internal flags that indicate what type of short-range work there is. More...
 
bool haveGpuShortRangeWork (const NbnxmGpu gmx_unused *nb, gmx::InteractionLocality gmx_unused interactionLocality)
 Returns true if there is GPU short-range work for the given interaction locality. More...
 
void nbnxn_gpu_x_to_nbat_x (const Nbnxm::Grid &grid, NbnxmGpu *nb, DeviceBuffer< gmx::RVec > d_x, GpuEventSynchronizer *xReadyOnDevice, const gmx::AtomLocality locality, int gridId, int numColumnsMax, bool mustInsertNonLocalDependency)
 
void launchNbnxmKernelTransformXToXq (const Grid &grid, NbnxmGpu *nb, DeviceBuffer< Float3 > d_x, const DeviceStream &deviceStream, unsigned int numColumnsMax, int gridId)
 Launch coordinate layout conversion kernel. More...
 
static void init_ewald_coulomb_force_table (const EwaldCorrectionTables &tables, NBParamGpu *nbp, const DeviceContext &deviceContext)
 
static bool useTabulatedEwaldByDefault (const DeviceInformation &deviceInfo)
 
static ElecType nbnxn_gpu_pick_ewald_kernel_type (const interaction_const_t &ic, const DeviceInformation &deviceInfo)
 
static void set_cutoff_parameters (NBParamGpu *nbp, const interaction_const_t &ic, const PairlistParams &listParams)
 
static void init_timings (gmx_wallclock_gpu_nbnxn_t *t)
 
static void initAtomdataFirst (NBAtomDataGpu *atomdata, int numTypes, const DeviceContext &deviceContext, const DeviceStream &localStream)
 Initialize atomdata first time; it only gets filled at pair-search.
 
static VdwType nbnxmGpuPickVdwKernelType (const interaction_const_t &ic, LJCombinationRule ljCombinationRule)
 
static ElecType nbnxmGpuPickElectrostaticsKernelType (const interaction_const_t &ic, const DeviceInformation &deviceInfo)
 
static void initNbparam (NBParamGpu *nbp, const interaction_const_t &ic, const PairlistParams &listParams, const nbnxn_atomdata_t::Params &nbatParams, const DeviceContext &deviceContext)
 Initialize the nonbonded parameter data structure.
 
static GpuPairlistByLocality initializeGpuLists (bool localAndNonLocal)
 
NbnxmGpugpu_init (const gmx::DeviceStreamManager &deviceStreamManager, const interaction_const_t *ic, const PairlistParams &listParams, const nbnxn_atomdata_t *nbat, const bool bLocalAndNonlocal)
 
void gpu_pme_loadbal_update_param (nonbonded_verlet_t *nbv, const interaction_const_t &ic)
 
void gpu_upload_shiftvec (NbnxmGpu *nb, const nbnxn_atomdata_t *nbatom)
 
void gpu_init_pairlist (NbnxmGpu *nb, const NbnxnPairlistGpu *h_plist, const InteractionLocality iloc)
 This function is documented in the header file.
 
void gpu_init_atomdata (NbnxmGpu *nb, const nbnxn_atomdata_t *nbat)
 
void gpu_clear_outputs (NbnxmGpu *nb, bool computeVirial)
 
gmx_wallclock_gpu_nbnxn_tgpu_get_timings (NbnxmGpu *nb)
 This function is documented in the header file.
 
void gpu_reset_timings (nonbonded_verlet_t *nbv)
 This function is documented in the header file.
 
bool gpu_is_kernel_ewald_analytical (const NbnxmGpu *nb)
 
void setupGpuShortRangeWork (NbnxmGpu *nb, const gmx::ListedForcesGpu *listedForcesGpu, const gmx::InteractionLocality iLocality)
 
bool haveGpuShortRangeWork (const NbnxmGpu *nb, const gmx::InteractionLocality interactionLocality)
 
void gpu_launch_cpyback (NbnxmGpu *nb, struct nbnxn_atomdata_t *nbatom, const gmx::StepWorkload &stepWork, const AtomLocality atomLocality)
 Launch asynchronously the download of nonbonded forces from the GPU (and energies/shift forces if required).
 
void nbnxnInsertNonlocalGpuDependency (NbnxmGpu *nb, const InteractionLocality interactionLocality)
 
void gpu_copy_xq_to_gpu (NbnxmGpu *nb, const nbnxn_atomdata_t *nbatom, const AtomLocality atomLocality)
 Launch asynchronously the xq buffer host to device copy.
 
void nbnxn_gpu_init_x_to_nbat_x (const Nbnxm::GridSet &gridSet, NbnxmGpu *gpu_nbv)
 
void gpu_free (NbnxmGpu *nb)
 This function is documented in the header file.
 
NBAtomDataGpugpuGetNBAtomData (NbnxmGpu *nb)
 
DeviceBuffer< gmx::RVecgpu_get_f (NbnxmGpu *nb)
 
void gpu_init_platform_specific (NbnxmGpu *nb)
 Initializes the NBNXM GPU data structures.
 
void gpu_free_platform_specific (NbnxmGpu *nb)
 Releases the NBNXM GPU data structures. More...
 
void getExclusiveScanWorkingArraySize (size_t &scan_size, GpuPairlist *d_plist, const DeviceStream &deviceStream)
 Calculates working memory required for exclusive sum, used in neighbour list sorting on GPU.
 
static bool nbnxn_simd_supported (const gmx::MDLogger &mdlog, const t_inputrec &inputrec)
 Returns whether CPU SIMD support exists for the given inputrec. More...
 
static KernelSetup pick_nbnxn_kernel_cpu (const t_inputrec gmx_unused &inputrec, const gmx_hw_info_t gmx_unused &hardwareInfo)
 Returns the most suitable CPU kernel type and Ewald handling.
 
static KernelSetup pick_nbnxn_kernel (const gmx::MDLogger &mdlog, gmx_bool use_simd_kernels, const gmx_hw_info_t &hardwareInfo, const NonbondedResource &nonbondedResource, const t_inputrec &inputrec)
 Returns the most suitable kernel type and Ewald handling.
 
static int getMinimumIlistCountForGpuBalancing (NbnxmGpu *nbnxmGpu)
 Gets and returns the minimum i-list count for balacing based on the GPU used or env.var. when set.
 
static std::optional
< LJCombinationRule
chooseLJCombinationRule (const t_forcerec &forcerec)
 Returns the LJ combination rule choices for the LJ pair parameters.
 
static LJCombinationRule chooseLJPmeCombinationRule (const t_forcerec &forcerec)
 Returns the LJ combination rule choices for the LJ PME-grid parameters.
 
static void validate_global_work_size (const KernelLaunchConfig &config, int work_dim, const DeviceInformation *dinfo)
 Validates the input global work size parameter.
 
static cl_kernel selectPruneKernel (cl_kernel kernel_pruneonly[], bool firstPrunePass)
 Return a pointer to the prune kernel version to be executed at the current invocation. More...
 
static cl_kernel select_nbnxn_kernel (NbnxmGpu *nb, enum ElecType elecType, enum VdwType vdwType, bool bDoEne, bool bDoPrune)
 Return a pointer to the kernel version to be executed at the current step. OpenCL kernel objects are cached in nb. If the requested kernel is not found in the cache, it will be created and the cache will be updated.
 
static int calc_shmem_required_nonbonded (enum VdwType vdwType, bool bPrefetchLjParam)
 Calculates the amount of shared memory required by the nonbonded kernel in use.
 
static void fillin_ocl_structures (NBParamGpu *nbp, cl_nbparam_params_t *nbparams_params)
 Initializes data structures that are going to be sent to the OpenCL device. More...
 
void gpu_launch_kernel (NbnxmGpu *nb, const gmx::StepWorkload &stepWork, const Nbnxm::InteractionLocality iloc)
 Launch GPU kernel. More...
 
static int calc_shmem_required_prune (const int num_threads_z)
 Calculates the amount of shared memory required by the prune kernel. More...
 
void gpu_launch_kernel_pruneonly (NbnxmGpu *nb, const InteractionLocality iloc, const int numParts)
 Launch the pairlist prune only kernel for the given locality. numParts tells in how many parts, i.e. calls the list will be pruned.
 
static cl_kernel nbnxn_gpu_create_kernel (NbnxmGpu *nb, const char *kernel_name)
 Initializes the OpenCL kernel pointers of the nbnxn_ocl_ptr_t input data structure.
 
static void nbnxn_gpu_init_kernels (NbnxmGpu *nb)
 Initializes the OpenCL kernel pointers of the nbnxn_ocl_ptr_t input data structure.
 
static void free_kernel (cl_kernel *kernel_ptr)
 Releases an OpenCL kernel pointer.
 
static void free_kernels (cl_kernel *kernels, int count)
 Releases a list of OpenCL kernel pointers.
 
static void freeGpuProgram (cl_program program)
 Free the OpenCL program. More...
 
int gpu_min_ci_balanced (NbnxmGpu *nb)
 This function is documented in the header file.
 
template<typename T , int iClusterSize, int jClusterSize>
constexpr std::array< T,
iClusterSize/jClusterSize > 
diagonalMaskJSmallerI ()
 Returns a diagonal interaction mask with atoms j<i masked out. More...
 
template<typename T , int iClusterSize, int jClusterSize>
constexpr std::array< T,
jClusterSize/iClusterSize > 
diagonalMaskJLargerI ()
 Returns a diagonal interaction mask with atoms j>i masked out. More...
 
template<int iClusterSize, int jClusterSize>
static gmx_unused uint32_t getImask (const bool maskOutSubDiagonal, const int ci, const int cj)
 Returns a diagonal or off-diagonal interaction mask. More...
 
template<ClusterDistanceKernelType kernelType>
static gmx_unused constexpr int sc_iClusterSizeSimd ()
 
template<ClusterDistanceKernelType kernelType>
static gmx_unused constexpr int sc_jClusterSizeSimd ()
 
template<ClusterDistanceKernelType kernelType>
static gmx_unused constexpr int sc_xStride ()
 Stride of the packed x coordinate array.
 
template<ClusterDistanceKernelType kernelType>
static gmx_unused int xIndexFromCi (int ci)
 Returns the nbnxn coordinate data index given the i-cluster index.
 
template<ClusterDistanceKernelType kernelType>
static gmx_unused int xIndexFromCj (int cj)
 Returns the nbnxn coordinate data index given the j-cluster index.
 
template<ClusterDistanceKernelType kernelType, int jSubClusterIndex>
static gmx_unused int cjFromCi (int ci)
 Returns the j-cluster index given the i-cluster index. More...
 
void setICellCoordinatesSimd4xM (int gmx_unused ci, const gmx::RVec gmx_unused &shift, int gmx_unused stride, const real gmx_unused *x, NbnxmPairlistCpuWork gmx_unused *work)
 
void setICellCoordinatesSimd2xMM (int gmx_unused ci, const gmx::RVec gmx_unused &shift, int gmx_unused stride, const real gmx_unused *x, NbnxmPairlistCpuWork gmx_unused *work)
 
void makeClusterListSimd4xM (const Grid gmx_unused &jGrid, NbnxnPairlistCpu gmx_unused *nbl, int gmx_unused icluster, int gmx_unused firstCell, int gmx_unused lastCell, bool gmx_unused excludeSubDiagonal, const real gmx_unused *gmx_restrict x_j, real gmx_unused rlist2, float gmx_unused rbb2, int gmx_unused *gmx_restrict numDistanceChecks)
 
void makeClusterListSimd2xMM (const Grid gmx_unused &jGrid, NbnxnPairlistCpu gmx_unused *nbl, int gmx_unused icluster, int gmx_unused firstCell, int gmx_unused lastCell, bool gmx_unused excludeSubDiagonal, const real gmx_unused *gmx_restrict x_j, real gmx_unused rlist2, float gmx_unused rbb2, int gmx_unused *gmx_restrict numDistanceChecks)
 
void setICellCoordinatesSimd4xM (int ci, const gmx::RVec &shift, int gmx_unused stride, const real *x, NbnxmPairlistCpuWork *work)
 Copies PBC shifted i-cell packed atom coordinates to working array for the 4xM layout.
 
void setICellCoordinatesSimd2xMM (int ci, const gmx::RVec &shift, int gmx_unused stride, const real *x, NbnxmPairlistCpuWork *work)
 Copies PBC shifted i-cell packed atom coordinates to working array for the 2xMM layout.
 
void makeClusterListSimd4xM (const Grid &jGrid, NbnxnPairlistCpu *nbl, int icluster, int firstCell, int lastCell, bool excludeSubDiagonal, const real *gmx_restrict x_j, real rlist2, float rbb2, int *gmx_restrict numDistanceChecks)
 SIMD code for checking and adding cluster-pairs to the list using the 4xM layout. More...
 
void makeClusterListSimd2xMM (const Grid &jGrid, NbnxnPairlistCpu *nbl, int icluster, int firstCell, int lastCell, bool excludeSubDiagonal, const real *gmx_restrict x_j, real rlist2, float rbb2, int *gmx_restrict numDistanceChecks)
 SIMD code for checking and adding cluster-pairs to the list using the 2xMM layout. More...
 
static auto nbnxmKernelTransformXToXq (Float4 *__restrict__ gm_xq, const Float3 *__restrict__ gm_x, const int *__restrict__ gm_atomIndex, const int *__restrict__ gm_numAtoms, const int *__restrict__ gm_cellIndex, int cellOffset, int numAtomsPerCell, int columnsOffset)
 SYCL kernel for transforming position coordinates from rvec to nbnxm layout. More...
 
static void launchSciSortOnGpu (GpuPairlist *plist, const DeviceStream &deviceStream)
 
template<int workGroupSize, int nElements>
static auto nbnxnKernelExclusivePrefixSum (const int *__restrict__ gm_input, int *__restrict__ gm_output)
 SYCL exclusive prefix sum kernel for list sorting. More...
 
static auto nbnxnKernelBucketSciSort (const nbnxn_sci_t *__restrict__ gm_sci, const int *__restrict__ gm_sciCount, int *__restrict__ gm_sciOffset, nbnxn_sci_t *__restrict__ gm_sciSorted)
 SYCL bucket sci sort kernel. More...
 
template<int workGroupSize>
static void launchPrefixSumKernel (sycl::queue &q, GpuPairlistSorting *sorting)
 
static void launchBucketSortKernel (sycl::queue &q, GpuPairlist *plist)
 
static int getNbnxmSubGroupSize (const DeviceInformation &deviceInfo)
 
template<int subGroupSize, bool doPruneNBL, bool doCalcEnergies>
void launchNbnxmKernelHelper (NbnxmGpu *nb, const gmx::StepWorkload &stepWork, const InteractionLocality iloc)
 
template<int subGroupSize>
void launchNbnxmKernel (NbnxmGpu *nb, const gmx::StepWorkload &stepWork, InteractionLocality iloc, bool doPrune)
 Launch SYCL NBNXM kernel. More...
 
static Float2 convertSigmaEpsilonToC6C12 (const float sigma, const float epsilon)
 Convert sigma and epsilon VdW parameters to c6,c12 pair.
 
template<bool doCalcEnergies>
static void ljForceSwitch (const shift_consts_t dispersionShift, const shift_consts_t repulsionShift, const float rVdwSwitch, const float c6, const float c12, const float rInv, const float r2, sycl::private_ptr< float > fInvR, sycl::private_ptr< float > eLJ)
 Calculate force and energy for a pair of atoms, VdW force-switch flavor.
 
template<enum VdwType vdwType>
static float calculateLJEwaldC6Grid (const sycl::global_ptr< const Float2 > a_nbfpComb, const int typeI, const int typeJ)
 Fetch C6 grid contribution coefficients and return the product of these.
 
template<bool doCalcEnergies, enum VdwType vdwType>
static void ljEwaldComb (const sycl::global_ptr< const Float2 > a_nbfpComb, const float sh_lj_ewald, const int typeI, const int typeJ, const float r2, const float r2Inv, const float lje_coeff2, const float lje_coeff6_6, const float int_bit, sycl::private_ptr< float > fInvR, sycl::private_ptr< float > eLJ)
 Calculate LJ-PME grid force contribution with geometric or LB combination rule.
 
template<bool doCalcEnergies>
static void ljPotentialSwitch (const switch_consts_t vdwSwitch, const float rVdwSwitch, const float rInv, const float r2, sycl::private_ptr< float > fInvR, sycl::private_ptr< float > eLJ)
 Apply potential switch.
 
static float pmeCorrF (const float z2)
 Calculate analytical Ewald correction term.
 
template<typename T >
static T lerp (T d0, T d1, T t)
 Linear interpolation using exactly two FMA operations. More...
 
static float interpolateCoulombForceR (const sycl::global_ptr< const float > a_coulombTab, const float coulombTabScale, const float r)
 Interpolate Ewald coulomb force correction using the F*r table.
 
static void reduceForceJShuffle (Float3 f, const sycl::nd_item< 3 > &itemIdx, const int tidxi, const int aidx, sycl::global_ptr< Float3 > a_f)
 Reduce c_clSize j-force components using shifts and atomically accumulate into a_f. More...
 
template<int subGroupSize, int groupSize>
static float groupReduce (const sycl::nd_item< 3 > itemIdx, const unsigned int tidxi, sycl::local_ptr< float > sm_buf, float valueToReduce)
 Do workgroup-level reduction of a single float. More...
 
static void reduceForceJGeneric (sycl::local_ptr< float > sm_buf, Float3 f, const sycl::nd_item< 3 > &itemIdx, const int tidxi, const int tidxj, const int aidx, sycl::global_ptr< Float3 > a_f)
 Reduce c_clSize j-force components using local memory and atomically accumulate into a_f. More...
 
template<bool useShuffleReduction>
static void reduceForceJ (sycl::local_ptr< float > sm_buf, Float3 f, const sycl::nd_item< 3 > itemIdx, const int tidxi, const int tidxj, const int aidx, sycl::global_ptr< Float3 > a_f)
 Reduce c_clSize j-force components using either shifts or local memory and atomically accumulate into a_f.
 
template<typename FCiBufferWrapperX , typename FCiBufferWrapperY , typename FCiBufferWrapperZ >
static void reduceForceIAndFShiftGeneric (sycl::local_ptr< float > sm_buf, const FCiBufferWrapperX &fCiBufX, const FCiBufferWrapperY &fCiBufY, const FCiBufferWrapperZ &fCiBufZ, const bool calcFShift, const sycl::nd_item< 3 > itemIdx, const int tidxi, const int tidxj, const int sci, const int shift, sycl::global_ptr< Float3 > a_f, sycl::global_ptr< Float3 > a_fShift)
 Local memory-based i-force reduction. More...
 
template<int numShuffleReductionSteps, typename FCiBufferWrapperX , typename FCiBufferWrapperY , typename FCiBufferWrapperZ >
static std::enable_if_t
< numShuffleReductionSteps!=1,
void > 
reduceForceIAndFShiftShuffles (const FCiBufferWrapperX &fCiBufX, const FCiBufferWrapperY &fCiBufY, const FCiBufferWrapperZ &fCiBufZ, const bool calcFShift, const sycl::nd_item< 3 > itemIdx, const int tidxi, const int tidxj, const int sci, const int shift, sycl::global_ptr< Float3 > a_f, sycl::global_ptr< Float3 > a_fShift)
 Shuffle-based i-force reduction. More...
 
template<int numShuffleReductionSteps, typename FCiBufferWrapperX , typename FCiBufferWrapperY , typename FCiBufferWrapperZ >
static std::enable_if_t
< numShuffleReductionSteps==1,
void > 
reduceForceIAndFShiftShuffles (const FCiBufferWrapperX &fCiBufX, const FCiBufferWrapperY &fCiBufY, const FCiBufferWrapperZ &fCiBufZ, const bool calcFShift, const sycl::nd_item< 3 > itemIdx, const int tidxi, const int tidxj, const int sci, const int shift, sycl::global_ptr< Float3 > a_f, sycl::global_ptr< Float3 > a_fShift)
 reduceForceIAndFShiftShuffles specialization for single-step reduction (e.g., Intel iGPUs). More...
 
template<bool useShuffleReduction, int subGroupSize, typename FCiBufferWrapperX , typename FCiBufferWrapperY , typename FCiBufferWrapperZ >
static void reduceForceIAndFShift (sycl::local_ptr< float > sm_buf, const FCiBufferWrapperX &fCiBufX, const FCiBufferWrapperY &fCiBufY, const FCiBufferWrapperZ &fCiBufZ, const bool calcFShift, const sycl::nd_item< 3 > itemIdx, const int tidxi, const int tidxj, const int sci, const int shift, sycl::global_ptr< Float3 > a_f, sycl::global_ptr< Float3 > a_fShift)
 Final i-force reduction. More...
 
template<int subGroupSize, bool doPruneNBL, bool doCalcEnergies, enum ElecType elecType, enum VdwType vdwType>
static auto nbnxmKernel (sycl::handler &cgh, const Float4 *__restrict__ gm_xq, Float3 *__restrict__ gm_f, const Float3 *__restrict__ gm_shiftVec, Float3 *__restrict__ gm_fShift, float *__restrict__ gm_energyElec, float *__restrict__ gm_energyVdw, nbnxn_cj_packed_t *__restrict__ gm_plistCJPacked, const nbnxn_sci_t *__restrict__ gm_plistSci, const nbnxn_excl_t *__restrict__ gm_plistExcl, const Float2 *__restrict__ gm_ljComb, const int *__restrict__ gm_atomTypes, const Float2 *__restrict__ gm_nbfp, const Float2 *__restrict__ gm_nbfpComb, const float *__restrict__ gm_coulombTab, int *__restrict__ gm_sciHistogram, int *__restrict__ gm_sciCount, const int numTypes, const float rCoulombSq, const float rVdwSq, const float twoKRf, const float ewaldBeta, const float rlistOuterSq, const float ewaldShift, const float epsFac, const float ewaldCoeffLJ_2, const float cRF, const shift_consts_t dispersionShift, const shift_consts_t repulsionShift, const switch_consts_t vdwSwitch, const float rVdwSwitch, const float ljEwaldShift, const float coulombTabScale, const bool calcShift)
 Main kernel for NBNXM. More...
 
template<int subGroupSize, bool doPruneNBL, bool doCalcEnergies, enum ElecType elecType, enum VdwType vdwType, class... Args>
static void launchNbnxmKernel (const DeviceStream &deviceStream, const int numSci, Args &&...args)
 NBNXM kernel launch code.
 
template<int subGroupSize, bool doPruneNBL, bool doCalcEnergies, class... Args>
void chooseAndLaunchNbnxmKernel (enum ElecType elecType, enum VdwType vdwType, Args &&...args)
 Select templated kernel and launch it.
 
template<bool haveFreshList>
auto nbnxmKernelPruneOnly (sycl::handler &cgh, const int numSci, const int numParts, const Float4 *__restrict__ gm_xq, const Float3 *__restrict__ gm_shiftVec, nbnxn_cj_packed_t *__restrict__ gm_plistCJPacked, const nbnxn_sci_t *__restrict__ gm_plistSci, unsigned int *__restrict__ gm_plistIMask, int *__restrict__ gm_rollingPruningPart, int *__restrict__ gm_sciHistogram, int *__restrict__ gm_sciCount, const float rlistOuterSq, const float rlistInnerSq)
 Prune-only kernel for NBNXM. More...
 
template<bool haveFreshList, class... Args>
void launchNbnxmKernelPruneOnly (const DeviceStream &deviceStream, const int numSciInPartMax, Args &&...args)
 Leap Frog SYCL prune-only kernel launch code.
 
template<class... Args>
void chooseAndLaunchNbnxmKernelPruneOnly (bool haveFreshList, Args &&...args)
 Select templated kernel and launch it.
 
void launchNbnxmKernelPruneOnly (NbnxmGpu *nb, const InteractionLocality iloc, const int numParts, const int numSciInPartMax)
 Launch SYCL NBNXM prune-only kernel. More...
 

Variables

static constexpr int c_numBoundingBoxBounds1D = 2
 The number of bounds along one dimension of a bounding box.
 
static constexpr int c_sortGridRatio = 4
 Ratio of grid cells to atoms.
 
static constexpr int c_sortGridMaxSizeFactor = c_sortGridRatio + 1
 Maximum ratio of holes used, in the worst case all particles end up in the last hole and we need num. atoms extra holes at the end.
 
constexpr int c_numElecTypes = static_cast<int>(ElecType::Count)
 Number of possible ElecType values.
 
constexpr int c_numVdwTypes = static_cast<int>(VdwType::Count)
 Number of possible VdwType values.
 
static constexpr int c_nbnxnGpuExclSize
 The fixed size of the exclusion mask array for a half GPU cluster pair. More...
 
static const char * nb_kfunc_noener_noprune_ptr [c_numElecTypes][c_numVdwTypes]
 Force-only kernel function names.
 
static const char * nb_kfunc_ener_noprune_ptr [c_numElecTypes][c_numVdwTypes]
 Force + energy kernel function pointers.
 
static const char * nb_kfunc_noener_prune_ptr [c_numElecTypes][c_numVdwTypes]
 Force + pruning kernel function pointers.
 
static const char * nb_kfunc_ener_prune_ptr [c_numElecTypes][c_numVdwTypes]
 Force + energy + pruning kernel function pointers.
 
static unsigned int gpu_min_ci_balanced_factor = 50
 This parameter should be determined heuristically from the kernel execution times. More...
 
constexpr bool c_avoidFloatingPointAtomics = (c_clSize == 4)
 Should we avoid FP atomics to the same location from the same work-group? More...
 
static constexpr int c_syclPruneKernelJPackedConcurrency = 4
 Prune kernel's jPacked processing concurrency. More...
 
static constexpr int c_clSize = c_nbnxnGpuClusterSize
 Convenience constants.
 
template<enum VdwType vdwType>
constexpr bool ljComb = EnergyFunctionProperties<ElecType::Count, vdwType>().vdwComb
 Templated constants to shorten kernel function declaration.
 
template<enum ElecType elecType>
constexpr bool elecEwald = EnergyFunctionProperties<elecType, VdwType::Count>().elecEwald
 
template<enum ElecType elecType>
constexpr bool elecEwaldTab = EnergyFunctionProperties<elecType, VdwType::Count>().elecEwaldTab
 
template<enum VdwType vdwType>
constexpr bool ljEwald = EnergyFunctionProperties<ElecType::Count, vdwType>().vdwEwald
 

Enumeration Type Documentation

enum Nbnxm::ElecType : int
strong

Nbnxm electrostatic GPU kernel flavors.

Types of electrostatics implementations available in the GPU non-bonded force kernels. These represent both the electrostatics types implemented by the kernels (cut-off, RF, and Ewald - a subset of what's defined in enums.h) as well as encode implementation details analytical/tabulated and single or twin cut-off (for Ewald kernels). Note that the cut-off and RF kernels have only analytical flavor and unlike in the CPU kernels, the tabulated kernels are ATM Ewald-only.

The row-order of pointers to different electrostatic kernels defined in nbnxn_cuda.cu by the nb_*_kfunc_ptr function pointer table should match the order of enumerated types below.

Enumerator
Cut 

Plain cut-off.

RF 

Reaction field.

EwaldTab 

Tabulated Ewald with single cut-off.

EwaldTabTwin 

Tabulated Ewald with twin cut-off.

EwaldAna 

Analytical Ewald with single cut-off.

EwaldAnaTwin 

Analytical Ewald with twin cut-off.

Count 

Number of valid values.

enum Nbnxm::VdwType : int
strong

Nbnxm VdW GPU kernel flavors.

The enumerates values correspond to the LJ implementations in the GPU non-bonded kernels.

The column-order of pointers to different electrostatic kernels defined in nbnxn_cuda_ocl.cpp/.cu by the nb_*_kfunc_ptr function pointer table should match the order of enumerated types below.

Enumerator
Cut 

Plain cut-off.

CutCombGeom 

Cut-off with geometric combination rules.

CutCombLB 

Cut-off with Lorentz-Berthelot combination rules.

FSwitch 

Smooth force switch.

PSwitch 

Smooth potential switch.

EwaldGeom 

Ewald with geometric combination rules.

EwaldLB 

Ewald with Lorentz-Berthelot combination rules.

Count 

Number of valid values.

Function Documentation

void Nbnxm::bench ( int  sizeFactor,
const KernelBenchOptions &  options 
)

Sets up and runs one or more Nbnxm kernel benchmarks.

The simulated system is a box of 1000 SPC/E water molecules scaled by the factor sizeFactor, which has to be a power of 2. One or more benchmarks are run, as specified by options. Benchmark settings and timings are printed to stdout.

Parameters
[in]sizeFactorHow much should the system size be increased.
[in]optionsHow the benchmark will be run.
static int Nbnxm::calc_shmem_required_prune ( const int  num_threads_z)
inlinestatic

Calculates the amount of shared memory required by the prune kernel.

Note that for the sake of simplicity we use the CUDA terminology "shared memory" for OpenCL local memory.

Parameters
[in]num_threads_zcjPacked concurrency equal to the number of threads/work items in the 3-rd dimension.
Returns
the amount of local memory in bytes required by the pruning kernel
template<int packSize>
static gmx_unused void Nbnxm::calcBoundingBoxHalves ( const int  numAtoms,
const real x,
BoundingBox bb,
BoundingBox bbj 
)
static

Computes the whole plus half bounding boxes for packed coordinates.

Template Parameters
packSizeThe pack size for the coordinates, also the number of atoms per cell
Parameters
[in]numAtomsThe actual number of atoms in this cell
[in]xPacked coodinates
[out]bbPointer to the bounding box for the whole cell
[out]bbjPointer to the bounding boxes for the two halves of the cell
template<int packSize>
static void Nbnxm::calcBoundingBoxXPacked ( const int  numAtoms,
const real x,
BoundingBox bb 
)
static

Computes the bounding box for packed coordinates.

Template Parameters
packSizeThe pack size for the coordinates, also the number of atoms per cell
Parameters
[in]numAtomsThe actual number of atoms in this cell
[in]xPacked coodinates
[out]bbPointer to the bounding box
static bool Nbnxm::canSkipNonbondedWork ( const NbnxmGpu nb,
InteractionLocality  iloc 
)
inlinestatic

An early return condition for empty NB GPU workloads.

This is currently used for non-local kernels/transfers only. Skipping the local kernel is more complicated, since the local part of the force array also depends on the non-local kernel. The skip of the local kernel is taken care of separately.

static std::optional<std::string> Nbnxm::checkKernelSetup ( const KernelBenchOptions &  options)
static

Checks the kernel setup.

Returns an error string when the kernel is not available.

template<ClusterDistanceKernelType kernelType, int jSubClusterIndex>
static gmx_unused int Nbnxm::cjFromCi ( int  ci)
inlinestatic

Returns the j-cluster index given the i-cluster index.

Template Parameters
kernelTypeThe kernel type
jSubClusterIndexThe j-sub-cluster index (0/1), used when size(j-cluster) < size(i-cluster)
Parameters
[in]ciThe i-cluster index
static float Nbnxm::clusterBoundingBoxDistance2 ( const BoundingBox bb_i,
const BoundingBox bb_j 
)
inlinestatic

Returns the distance^2 between two bounding boxes.

Uses 4-wide SIMD operations when available.

Parameters
[in]bb_iFirst bounding box, has to be aligned for 4-wide SIMD
[in]bb_jSecond bounding box, has to be aligned for 4-wide SIMD
static void Nbnxm::countPruneKernelTime ( Nbnxm::GpuTimers timers,
gmx_wallclock_gpu_nbnxn_t timings,
const InteractionLocality  iloc 
)
static

Count pruning kernel time if either kernel has been triggered.

We do the accounting for either of the two pruning kernel flavors:

  • 1st pass prune: ran during the current step (prior to the force kernel);
  • rolling prune: ran at the end of the previous step (prior to the current step H2D xq);

Note that the resetting of Nbnxm::GpuTimers::didPrune and Nbnxm::GpuTimers::didRollingPrune should happen after calling this function.

Parameters
[in]timersstructs with GPU timer objects
[in,out]timingsGPU task timing data
[in]ilocinteraction locality
template<typename T , int iClusterSize, int jClusterSize>
constexpr std::array<T, jClusterSize / iClusterSize> Nbnxm::diagonalMaskJLargerI ( )

Returns a diagonal interaction mask with atoms j>i masked out.

Template Parameters
TInteger type, should have at least iClusterSize*jClusterSize bits
iClusterSizeThe i-cluster size
jClusterSizeThe j-cluster size

Condition: jClusterSize >= iClusterSize

template<typename T , int iClusterSize, int jClusterSize>
constexpr std::array<T, iClusterSize / jClusterSize> Nbnxm::diagonalMaskJSmallerI ( )

Returns a diagonal interaction mask with atoms j<i masked out.

Template Parameters
TInteger type, should have at least iClusterSize*jClusterSize bits
iClusterSizeThe i-cluster size
jClusterSizeThe j-cluster size

Condition: jClusterSize <= iClusterSize

static void Nbnxm::fillin_ocl_structures ( NBParamGpu nbp,
cl_nbparam_params_t nbparams_params 
)
static

Initializes data structures that are going to be sent to the OpenCL device.

The device can't use the same data structures as the host for two main reasons:

  • OpenCL restrictions (pointers are not accepted inside data structures)
  • some host side fields are not needed for the OpenCL kernels.

This function is called before the launch of both nbnxn and prune kernels.

static void Nbnxm::freeGpuProgram ( cl_program  program)
static

Free the OpenCL program.

The function releases the OpenCL program assuciated with the device that the calling PP rank is running on.

Parameters
program[in] OpenCL program to release.
real Nbnxm::generateAndFill2DGrid ( Grid grid,
gmx::ArrayRef< GridWork >  gridWork,
gmx::HostVector< int > *  cells,
const rvec  lowerCorner,
const rvec  upperCorner,
const gmx::UpdateGroupsCog updateGroupsCog,
gmx::Range< int >  atomRange,
real atomDensity,
real  maxAtomGroupRadius,
gmx::ArrayRef< const gmx::RVec x,
int  ddZone,
const int *  move,
int  numAtomsMoved,
bool  computeGridDensityRatio 
)

Sets the 2D search grid dimensions puts the atoms on the 2D grid.

Parameters
[in,out]gridThe pair search grid for one DD zone
[in,out]gridWorkWorking data for each thread
[in,out]cellsThe grid cell list
[in]lowerCornerThe minimum Cartesian coordinates of the grid
[in]upperCornerThe maximum Cartesian coordinates of the grid
[in]updateGroupsCogThe center of geometry of update groups, can be nullptr
[in]atomRangeThe range of atoms to put on this grid
[in,out]atomDensityThe atom density, will be computed when <= 0
[in]maxAtomGroupRadiusThe maximum radius of atom groups
[in]xThe coordinates of the atoms
[in]ddZoneThe domain decomposition zone
[in]moveTells whether atoms have moved to another DD domain
[in]numAtomsMovedThe number of atoms that moved to another DD domain
[in]computeGridDensityRatioWhen true, return the grid density ratio
Returns
When computeGridDensityRatio==true, the ratio of the effective 2D grid density and the uniform grid density
static gmx::Range<int> Nbnxm::getGpuAtomRange ( const NBAtomDataGpu atomData,
const AtomLocality  atomLocality 
)
inlinestatic

Calculate atom range and return start index and length.

Parameters
[in]atomDataAtom descriptor data structure
[in]atomLocalityAtom locality specifier
Returns
Range of indexes for selected locality.
template<int iClusterSize, int jClusterSize>
static gmx_unused uint32_t Nbnxm::getImask ( const bool  maskOutSubDiagonal,
const int  ci,
const int  cj 
)
static

Returns a diagonal or off-diagonal interaction mask.

Template Parameters
iClusterSizeThe i-cluster size
jClusterSizeThe j-cluster size
Parameters
[in]maskOutSubDiagonalWhether to mask out the sub-diagonal interactions
[in]ciThe i-cluster index
[in]cjThe j-cluster index
template<typename GpuPairlist >
static void Nbnxm::gpu_accumulate_timings ( gmx_wallclock_gpu_nbnxn_t timings,
Nbnxm::GpuTimers timers,
const GpuPairlist *  plist,
AtomLocality  atomLocality,
const gmx::StepWorkload stepWork,
bool  doTiming 
)
inlinestatic

Do the per-step timing accounting of the nonbonded tasks.

Does timing accumulation and call-count increments for the nonbonded kernels. Note that this function should be called after the current step's nonbonded nonbonded tasks have completed with the exception of the rolling pruning kernels that are accounted for during the following step.

NOTE: if timing with multiple GPUs (streams) becomes possible, the counters could end up being inconsistent due to not being incremented on some of the node when this is skipped on empty local domains!

Template Parameters
GpuPairlistPair list type
Parameters
[out]timingsPointer to the NB GPU timings data
[in]timersPointer to GPU timers data
[in]plistPointer to the pair list data
[in]atomLocalityAtom locality specifier
[in]stepWorkForce schedule flags
[in]doTimingTrue if timing is enabled.
void Nbnxm::gpu_clear_outputs ( NbnxmGpu gmx_unused nb,
bool gmx_unused  computeVirial 
)

Clears GPU outputs: nonbonded force, shift force and energy.

void Nbnxm::gpu_copy_xq_to_gpu ( NbnxmGpu gmx_unused nb,
const struct nbnxn_atomdata_t gmx_unused nbdata,
gmx::AtomLocality gmx_unused  aloc 
)

Launch asynchronously the xq buffer host to device copy.

The nonlocal copy is skipped if there is no dependent work to do, neither non-local nonbonded interactions nor bonded GPU work.

Parameters
[in]nbGPU nonbonded data.
[in]nbdataHost-side atom data structure.
[in]alocAtom locality flag.
void Nbnxm::gpu_free ( NbnxmGpu gmx_unused nb)

Frees all GPU resources used for the nonbonded calculations.

void Nbnxm::gpu_free_platform_specific ( NbnxmGpu nb)

Releases the NBNXM GPU data structures.

This function is documented in the header file.

struct gmx_wallclock_gpu_nbnxn_t* Nbnxm::gpu_get_timings ( NbnxmGpu gmx_unused nb)

Returns the GPU timings structure or NULL if GPU is not used or timing is off.

NbnxmGpu* Nbnxm::gpu_init ( const gmx::DeviceStreamManager gmx_unused deviceStreamManager,
const interaction_const_t gmx_unused ic,
const PairlistParams gmx_unused listParams,
const nbnxn_atomdata_t gmx_unused nbat,
bool gmx_unused  bLocalAndNonlocal 
)

Initializes the data structures related to GPU nonbonded calculations.

void Nbnxm::gpu_init_atomdata ( NbnxmGpu gmx_unused nb,
const nbnxn_atomdata_t gmx_unused nbat 
)

Initializes atom-data on the GPU, called at every pair search step.

void Nbnxm::gpu_init_pairlist ( NbnxmGpu gmx_unused nb,
const struct NbnxnPairlistGpu gmx_unused h_nblist,
gmx::InteractionLocality gmx_unused  iloc 
)

Initializes pair-list data for GPU, called at every pair search step.

bool Nbnxm::gpu_is_kernel_ewald_analytical ( const NbnxmGpu gmx_unused nb)

Returns if analytical Ewald GPU kernels are used.

void Nbnxm::gpu_launch_kernel ( NbnxmGpu gmx_unused nb,
const gmx::StepWorkload gmx_unused stepWork,
gmx::InteractionLocality gmx_unused  iloc 
)

Launch asynchronously the nonbonded force calculations.

Also launches the initial pruning of a fresh list after search.

The local and non-local interaction calculations are launched in two separate streams. If there is no work (i.e. empty pair list), the force kernel launch is omitted.

void Nbnxm::gpu_launch_kernel ( NbnxmGpu nb,
const gmx::StepWorkload stepWork,
const Nbnxm::InteractionLocality  iloc 
)

Launch GPU kernel.

As we execute nonbonded workload in separate queues, before launching the kernel we need to make sure that he following operations have completed:

  • atomdata allocation and related H2D transfers (every nstlist step);
  • pair list H2D transfer (every nstlist step);
  • shift vector H2D transfer (every nstlist step);
  • force (+shift force and energy) output clearing (every step).

These operations are issued in the local queue at the beginning of the step and therefore always complete before the local kernel launch. The non-local kernel is launched after the local on the same device/context, so this is inherently scheduled after the operations in the local stream (including the above "misc_ops"). However, for the sake of having a future-proof implementation, we use the misc_ops_done event to record the point in time when the above operations are finished and synchronize with this event in the non-local stream.

void Nbnxm::gpu_launch_kernel_pruneonly ( NbnxmGpu gmx_unused nb,
gmx::InteractionLocality gmx_unused  iloc,
int gmx_unused  numParts 
)

Launch asynchronously the nonbonded prune-only kernel.

The local and non-local list pruning are launched in their separate streams.

Notes for future scheduling tuning: Currently we schedule the dynamic pruning between two MD steps after both local and nonlocal force D2H transfers completed. We could launch already after the cpyback is launched, but we want to avoid prune kernels (especially in the non-local high prio-stream) competing with nonbonded work.

However, this is not ideal as this schedule does not expose the available concurrency. The dynamic pruning kernel:

  • should be allowed to overlap with any task other than force compute, including transfers (F D2H and the next step's x H2D as well as force clearing).
  • we'd prefer to avoid competition with non-bonded force kernels belonging to the same rank and ideally other ranks too.

In the most general case, the former would require scheduling pruning in a separate stream and adding additional event sync points to ensure that force kernels read consistent pair list data. This would lead to some overhead (due to extra cudaStreamWaitEvent calls, 3-5 us/call) which we might be able to live with. The gains from additional overlap might not be significant as long as update+constraints anyway takes longer than pruning, but there will still be use-cases where more overlap may help (e.g. multiple ranks per GPU, no/hbonds only constraints). The above second point is harder to address given that multiple ranks will often share a GPU. Ranks that complete their nonbondeds sooner can schedule pruning earlier and without a third priority level it is difficult to avoid some interference of prune kernels with force tasks (in particular preemption of low-prio local force task).

Parameters
[in,out]nbGPU nonbonded data.
[in]ilocInteraction locality flag.
[in]numPartsNumber of parts the pair list is split into in the rolling kernel.
int Nbnxm::gpu_min_ci_balanced ( NbnxmGpu gmx_unused nb)

Calculates the minimum size of proximity lists to improve SM load balance with GPU non-bonded kernels.

static void Nbnxm::gpu_reduce_staged_outputs ( const NBStagingData nbst,
const InteractionLocality  iLocality,
const bool  reduceEnergies,
const bool  reduceFshift,
real e_lj,
real e_el,
rvec fshift 
)
inlinestatic

Reduce data staged internally in the nbnxn module.

Shift forces and electrostatic/LJ energies copied from the GPU into a module-internal staging area are immediately reduced (CPU-side buffers passed) after having waited for the transfers' completion.

Note that this function should always be called after the transfers into the staging buffers has completed.

Parameters
[in]nbstNonbonded staging data
[in]iLocalityInteraction locality specifier
[in]reduceEnergiesTrue if energy reduction should be done
[in]reduceFshiftTrue if shift force reduction should be done
[out]e_ljVariable to accumulate LJ energy into
[out]e_elVariable to accumulate electrostatic energy into
[out]fshiftPointer to the array of shift forces to accumulate into
void Nbnxm::gpu_reset_timings ( struct nonbonded_verlet_t gmx_unused nbv)

Resets nonbonded GPU timings.

bool Nbnxm::gpu_try_finish_task ( NbnxmGpu gmx_unused nb,
const gmx::StepWorkload gmx_unused stepWork,
gmx::AtomLocality gmx_unused  aloc,
real gmx_unused e_lj,
real gmx_unused e_el,
gmx::ArrayRef< gmx::RVec > gmx_unused  shiftForces,
GpuTaskCompletion gmx_unused  completionKind 
)

Attempts to complete nonbonded GPU task.

This function attempts to complete the nonbonded task (both GPU and CPU auxiliary work). Success, i.e. that the tasks completed and results are ready to be consumed, is signaled by the return value (always true if blocking wait mode requested).

The completionKind parameter controls whether the behavior is non-blocking (achieved by passing GpuTaskCompletion::Check) or blocking wait until the results are ready (when GpuTaskCompletion::Wait is passed). As the "Check" mode the function will return immediately if the GPU stream still contain tasks that have not completed, it allows more flexible overlapping of work on the CPU with GPU execution.

Note that it is only safe to use the results, and to continue to the next MD step when this function has returned true which indicates successful completion of

  • All nonbonded GPU tasks: both compute and device transfer(s)
  • auxiliary tasks: updating the internal module state (timing accumulation, list pruning states) and
  • internal staging reduction of (fshift, e_el, e_lj).

In GpuTaskCompletion::Check mode this function does the timing and keeps correct count for the nonbonded task (incrementing only once per task), in the GpuTaskCompletion::Wait mode timing is expected to be done in the caller.

TODO: improve the handling of outputs e.g. by ensuring that this function explcitly returns the force buffer (instead of that being passed only to nbnxn_gpu_launch_cpyback()) and by returning the energy and Fshift contributions for some external/centralized reduction.

Parameters
[in]nbThe nonbonded data GPU structure
[in]stepWorkStep schedule flags
[in]alocAtom locality identifier
[out]e_ljPointer to the LJ energy output to accumulate into
[out]e_elPointer to the electrostatics energy output to accumulate into
[out]shiftForcesShift forces buffer to accumulate into
[in]completionKindIndicates whether nnbonded task completion should only be checked rather than waited for
Returns
True if the nonbonded tasks associated with aloc locality have completed
bool Nbnxm::gpu_try_finish_task ( NbnxmGpu nb,
const gmx::StepWorkload stepWork,
const AtomLocality  aloc,
real e_lj,
real e_el,
gmx::ArrayRef< gmx::RVec shiftForces,
GpuTaskCompletion  completionKind 
)

Attempts to complete nonbonded GPU task.

See documentation in nbnxm_gpu.h for details.

Todo:
Move into shared source file, perhaps including cuda_runtime.h if needed for any remaining CUDA-specific objects.
void Nbnxm::gpu_upload_shiftvec ( NbnxmGpu gmx_unused nb,
const nbnxn_atomdata_t gmx_unused nbatom 
)

Uploads shift vector to the GPU if the box is dynamic (otherwise just returns).

float Nbnxm::gpu_wait_finish_task ( NbnxmGpu gmx_unused nb,
const gmx::StepWorkload gmx_unused stepWork,
gmx::AtomLocality gmx_unused  aloc,
real gmx_unused e_lj,
real gmx_unused e_el,
gmx::ArrayRef< gmx::RVec > gmx_unused  shiftForces,
gmx_wallcycle gmx_unused wcycle 
)

Completes the nonbonded GPU task blocking until GPU tasks and data transfers to finish.

Also does timing accounting and reduction of the internal staging buffers. As this is called at the end of the step, it also resets the pair list and pruning flags.

Parameters
[in]nbThe nonbonded data GPU structure
[in]stepWorkStep schedule flags
[in]alocAtom locality identifier
[out]e_ljPointer to the LJ energy output to accumulate into
[out]e_elPointer to the electrostatics energy output to accumulate into
[out]shiftForcesShift forces buffer to accumulate into
[out]wcyclePointer to wallcycle data structure
float Nbnxm::gpu_wait_finish_task ( NbnxmGpu nb,
const gmx::StepWorkload stepWork,
AtomLocality  aloc,
real e_lj,
real e_el,
gmx::ArrayRef< gmx::RVec shiftForces,
gmx_wallcycle *  wcycle 
)

Wait for the asynchronously launched nonbonded tasks and data transfers to finish.

Also does timing accounting and reduction of the internal staging buffers. As this is called at the end of the step, it also resets the pair list and pruning flags.

Parameters
[in]nbThe nonbonded data GPU structure
[in]stepWorkForce schedule flags
[in]alocAtom locality identifier
[out]e_ljPointer to the LJ energy output to accumulate into
[out]e_elPointer to the electrostatics energy output to accumulate into
[out]shiftForcesShift forces buffer to accumulate into
[out]wcyclePointer to wallcycle data structure
Returns
The number of cycles the gpu wait took
template<int subGroupSize, int groupSize>
static float Nbnxm::groupReduce ( const sycl::nd_item< 3 >  itemIdx,
const unsigned int  tidxi,
sycl::local_ptr< float >  sm_buf,
float  valueToReduce 
)
inlinestatic

Do workgroup-level reduction of a single float.

While SYCL has sycl::reduce_over_group, it currently (oneAPI 2021.3.0) uses a very large shared memory buffer, which leads to a reduced occupancy.

Note
The caller must make sure there are no races when reusing the sm_buf.
Template Parameters
subGroupSizeSize of a sub-group.
groupSizeSize of a work-group.
Parameters
itemIdxCurrent thread's sycl::nd_item.
tidxiCurrent thread's linearized local index.
sm_bufAccessor for local reduction buffer.
valueToReduceCurrent thread's value. Must have length of at least 1.
Returns
For thread with tidxi 0: sum of all valueToReduce. Other threads: unspecified.
bool Nbnxm::haveGpuShortRangeWork ( const NbnxmGpu gmx_unused nb,
gmx::InteractionLocality gmx_unused  interactionLocality 
)

Returns true if there is GPU short-range work for the given interaction locality.

Note that as, unlike nonbonded tasks, bonded tasks are not split into local/nonlocal, and therefore if there are GPU offloaded bonded interactions, this function will return true for both local and nonlocal atom range.

Parameters
[in,out]nbPointer to the nonbonded GPU data structure
[in]interactionLocalityInteraction locality identifier
Returns
Whether there is short range work for a given locality.
template<int subGroupSize>
void Nbnxm::launchNbnxmKernel ( NbnxmGpu nb,
const gmx::StepWorkload stepWork,
InteractionLocality  iloc,
bool  doPrune 
)

Launch SYCL NBNXM kernel.

Parameters
nbNon-bonded parameters.
stepWorkWorkload flags for the current step.
ilocInteraction locality.
doPruneWhether to do neighborlist pruning.
void Nbnxm::launchNbnxmKernelPruneOnly ( NbnxmGpu nb,
const InteractionLocality  iloc,
const int  numParts,
const int  numSciInPartMax 
)

Launch SYCL NBNXM prune-only kernel.

Parameters
nbNon-bonded parameters.
ilocInteraction locality.
numPartsTotal number of rolling-prune parts.
numSciInPartMaxMaximum number of superclusters in a part.
void Nbnxm::launchNbnxmKernelTransformXToXq ( const Grid grid,
NbnxmGpu nb,
DeviceBuffer< Float3 d_x,
const DeviceStream deviceStream,
unsigned int  numColumnsMax,
int  gridId 
)

Launch coordinate layout conversion kernel.

Parameters
[in]gridPair-search grid.
[in,out]nbNbnxm main structure.
[in]d_xSource atom coordinates.
[in]deviceStreamDevice stream for kernel submission.
[in]numColumnsMaxMax. number of columns per grid for offset calculation in nb.
[in]gridIdGrid index for offset calculation in nb.
template<typename T >
static T Nbnxm::lerp ( d0,
d1,
t 
)
inlinestatic

Linear interpolation using exactly two FMA operations.

Implements numeric equivalent of: (1-t)*d0 + t*d1.

const char * Nbnxm::lookup_kernel_name ( Nbnxm::KernelType  kernelType)

Return a string identifying the kernel type.

Parameters
[in]kernelTypenonbonded kernel type, takes values from the nbnxn_kernel_type enum
Returns
a string identifying the kernel corresponding to the type passed as argument
void Nbnxm::makeClusterListSimd2xMM ( const Grid jGrid,
NbnxnPairlistCpu nbl,
int  icluster,
int  firstCell,
int  lastCell,
bool  excludeSubDiagonal,
const real *gmx_restrict  x_j,
real  rlist2,
float  rbb2,
int *gmx_restrict  numDistanceChecks 
)

SIMD code for checking and adding cluster-pairs to the list using the 2xMM layout.

Checks bounding box distances and possibly atom pair distances. This is an accelerated version of make_cluster_list_simple.

Parameters
[in]jGridThe j-grid
[in,out]nblThe pair-list to store the cluster pairs in
[in]iclusterThe index of the i-cluster
[in]firstCellThe first cluster in the j-range, using i-cluster size indexing
[in]lastCellThe last cluster in the j-range, using i-cluster size indexing
[in]excludeSubDiagonalExclude atom pairs with i-index > j-index
[in]x_jCoordinates for the j-atom, in SIMD packed format
[in]rlist2The squared list cut-off
[in]rbb2The squared cut-off for putting cluster-pairs in the list based on bounding box distance only
[in,out]numDistanceChecksThe number of distance checks performed
void Nbnxm::makeClusterListSimd4xM ( const Grid jGrid,
NbnxnPairlistCpu nbl,
int  icluster,
int  firstCell,
int  lastCell,
bool  excludeSubDiagonal,
const real *gmx_restrict  x_j,
real  rlist2,
float  rbb2,
int *gmx_restrict  numDistanceChecks 
)

SIMD code for checking and adding cluster-pairs to the list using the 4xM layout.

Checks bounding box distances and possibly atom pair distances. This is an accelerated version of make_cluster_list_simple.

Parameters
[in]jGridThe j-grid
[in,out]nblThe pair-list to store the cluster pairs in
[in]iclusterThe index of the i-cluster
[in]firstCellThe first cluster in the j-range, using i-cluster size indexing
[in]lastCellThe last cluster in the j-range, using i-cluster size indexing
[in]excludeSubDiagonalExclude atom pairs with i-index > j-index
[in]x_jCoordinates for the j-atom, in SIMD packed format
[in]rlist2The squared list cut-off
[in]rbb2The squared cut-off for putting cluster-pairs in the list based on bounding box distance only
[in,out]numDistanceChecksThe number of distance checks performed
template<int subGroupSize, bool doPruneNBL, bool doCalcEnergies, enum ElecType elecType, enum VdwType vdwType>
static auto Nbnxm::nbnxmKernel ( sycl::handler &  cgh,
const Float4 *__restrict__  gm_xq,
Float3 *__restrict__  gm_f,
const Float3 *__restrict__  gm_shiftVec,
Float3 *__restrict__  gm_fShift,
float *__restrict__  gm_energyElec,
float *__restrict__  gm_energyVdw,
nbnxn_cj_packed_t *__restrict__  gm_plistCJPacked,
const nbnxn_sci_t *__restrict__  gm_plistSci,
const nbnxn_excl_t *__restrict__  gm_plistExcl,
const Float2 *__restrict__  gm_ljComb,
const int *__restrict__  gm_atomTypes,
const Float2 *__restrict__  gm_nbfp,
const Float2 *__restrict__  gm_nbfpComb,
const float *__restrict__  gm_coulombTab,
int *__restrict__  gm_sciHistogram,
int *__restrict__  gm_sciCount,
const int  numTypes,
const float  rCoulombSq,
const float  rVdwSq,
const float  twoKRf,
const float  ewaldBeta,
const float  rlistOuterSq,
const float  ewaldShift,
const float  epsFac,
const float  ewaldCoeffLJ_2,
const float  cRF,
const shift_consts_t  dispersionShift,
const shift_consts_t  repulsionShift,
const switch_consts_t  vdwSwitch,
const float  rVdwSwitch,
const float  ljEwaldShift,
const float  coulombTabScale,
const bool  calcShift 
)
static

Main kernel for NBNXM.

template<bool haveFreshList>
auto Nbnxm::nbnxmKernelPruneOnly ( sycl::handler &  cgh,
const int  numSci,
const int  numParts,
const Float4 *__restrict__  gm_xq,
const Float3 *__restrict__  gm_shiftVec,
nbnxn_cj_packed_t *__restrict__  gm_plistCJPacked,
const nbnxn_sci_t *__restrict__  gm_plistSci,
unsigned int *__restrict__  gm_plistIMask,
int *__restrict__  gm_rollingPruningPart,
int *__restrict__  gm_sciHistogram,
int *__restrict__  gm_sciCount,
const float  rlistOuterSq,
const float  rlistInnerSq 
)

Prune-only kernel for NBNXM.

static auto Nbnxm::nbnxmKernelTransformXToXq ( Float4 *__restrict__  gm_xq,
const Float3 *__restrict__  gm_x,
const int *__restrict__  gm_atomIndex,
const int *__restrict__  gm_numAtoms,
const int *__restrict__  gm_cellIndex,
int  cellOffset,
int  numAtomsPerCell,
int  columnsOffset 
)
static

SYCL kernel for transforming position coordinates from rvec to nbnxm layout.

Parameters
[out]gm_xqCoordinates buffer in nbnxm layout.
[in]gm_xCoordinates buffer.
[in]gm_atomIndexAtom index mapping.
[in]gm_numAtomsArray of number of atoms.
[in]gm_cellIndexArray of cell indices.
[in]cellOffsetFirst cell.
[in]numAtomsPerCellNumber of atoms per cell.
[in]columnsOffsetIndex if the first column in the cell.
void Nbnxm::nbnxn_gpu_x_to_nbat_x ( const Nbnxm::Grid gmx_unused grid,
NbnxmGpu gmx_unused gpu_nbv,
DeviceBuffer< gmx::RVec > gmx_unused  d_x,
GpuEventSynchronizer gmx_unused xReadyOnDevice,
gmx::AtomLocality gmx_unused  locality,
int gmx_unused  gridId,
int gmx_unused  numColumnsMax,
bool gmx_unused  mustInsertNonLocalDependency 
)

X buffer operations on GPU: performs conversion from rvec to nb format.

Parameters
[in]gridGrid to be converted.
[in,out]gpu_nbvThe nonbonded data GPU structure.
[in]d_xDevice-side coordinates in plain rvec format.
[in]xReadyOnDeviceEvent synchronizer indicating that the coordinates are ready in the device memory.
[in]localityCopy coordinates for local or non-local atoms.
[in]gridIdIndex of the grid being converted.
[in]numColumnsMaxMaximum number of columns in the grid.
[in]mustInsertNonLocalDependencyWhether synchronization between local and non-local streams should be added. Typically, true if and only if that is the last grid in gridset.
static bool Nbnxm::nbnxn_simd_supported ( const gmx::MDLogger mdlog,
const t_inputrec &  inputrec 
)
static

Returns whether CPU SIMD support exists for the given inputrec.

If the return value is FALSE and fplog/cr != NULL, prints a fallback message to fplog/stderr.

void Nbnxm::nbnxnInsertNonlocalGpuDependency ( NbnxmGpu gmx_unused nb,
gmx::InteractionLocality gmx_unused  interactionLocality 
)

Sync the nonlocal stream with dependent tasks in the local queue.

As the point where the local stream tasks can be considered complete happens at the same call point where the nonlocal stream should be synced with the the local, this function records the event if called with the local stream as argument and inserts in the GPU stream a wait on the event on the nonlocal.

Parameters
[in]nbThe nonbonded data GPU structure
[in]interactionLocalityLocal or NonLocal sync point
static auto Nbnxm::nbnxnKernelBucketSciSort ( const nbnxn_sci_t *__restrict__  gm_sci,
const int *__restrict__  gm_sciCount,
int *__restrict__  gm_sciOffset,
nbnxn_sci_t *__restrict__  gm_sciSorted 
)
static

SYCL bucket sci sort kernel.

Sorts sci in order from most to least neighbours, using the count sort algorithm

Unlike the cpu version of sci sort, this kernel uses counts which only contain pairs which have not been masked out, giving an ordering which more accurately represents the work which will be done in the non bonded force kernel. The counts themselves are generated in the prune kernel.

Parameters
gm_sciUnsorted pair list.
gm_sciCountTotal number of sci with exactly i neighbours
gm_sciOffsetExclusive prefix sum of gm_sciCount. gm_sciOffset[i] is the offset that the first sci with i neighbours will have in the sorted sci list. All other sci with i neighbours will be placed randomly in positions gm_sciOffset[i] to gm_sciOffset[i+1] exclusive.
gm_sciSortedSorted pair list.
template<int workGroupSize, int nElements>
static auto Nbnxm::nbnxnKernelExclusivePrefixSum ( const int *__restrict__  gm_input,
int *__restrict__  gm_output 
)
static

SYCL exclusive prefix sum kernel for list sorting.

As of oneAPI 2024.1, oneapi::dpl::experimental::exclusive_scan_async for inputs <= 16384 elements simply launches a single work-group and uses sycl::joint_exclusive_scan. We have, somewhat arbitrary, input size of 8192, so we're fine replicating the same approach.

NVIDIA's CUB uses fancier approach ("Single-pass Parallel Prefix Scan with Decoupled Look-back"), but we are unlikely to need it here since this kernel is very small anyway.

Template Parameters
workGroupSizeSize of the (only) work-group.
nElementsInput array size; should be a multiple of workGroupSize.
Parameters
gm_inputInput data buffer, should contain nElements elements of type int.
gm_outputOutput data buffer, should have enough space for nElements elements of type int.
Warning
This kernel should be launched with only a single work-group of size workGroupSize.
This kernel is inefficient for large inputs (oneDPL uses it with <= 16384 elements).
template<bool useShuffleReduction, int subGroupSize, typename FCiBufferWrapperX , typename FCiBufferWrapperY , typename FCiBufferWrapperZ >
static void Nbnxm::reduceForceIAndFShift ( sycl::local_ptr< float >  sm_buf,
const FCiBufferWrapperX &  fCiBufX,
const FCiBufferWrapperY &  fCiBufY,
const FCiBufferWrapperZ &  fCiBufZ,
const bool  calcFShift,
const sycl::nd_item< 3 >  itemIdx,
const int  tidxi,
const int  tidxj,
const int  sci,
const int  shift,
sycl::global_ptr< Float3 a_f,
sycl::global_ptr< Float3 a_fShift 
)
inlinestatic

Final i-force reduction.

Reduce c_nbnxnGpuNumClusterPerSupercluster i-force components stored in fCiBuf[] accumulating atomically into a_f. If calcFShift is true, further reduce shift forces and atomically accumulate into a_fShift.

This implementation works only with power of two array sizes.

template<typename FCiBufferWrapperX , typename FCiBufferWrapperY , typename FCiBufferWrapperZ >
static void Nbnxm::reduceForceIAndFShiftGeneric ( sycl::local_ptr< float >  sm_buf,
const FCiBufferWrapperX &  fCiBufX,
const FCiBufferWrapperY &  fCiBufY,
const FCiBufferWrapperZ &  fCiBufZ,
const bool  calcFShift,
const sycl::nd_item< 3 >  itemIdx,
const int  tidxi,
const int  tidxj,
const int  sci,
const int  shift,
sycl::global_ptr< Float3 a_f,
sycl::global_ptr< Float3 a_fShift 
)
inlinestatic

Local memory-based i-force reduction.

Note that this reduction is unoptimized and some of the barrier synchronization used could be avoided on >=8-wide architectures.

template<int numShuffleReductionSteps, typename FCiBufferWrapperX , typename FCiBufferWrapperY , typename FCiBufferWrapperZ >
static std::enable_if_t<numShuffleReductionSteps != 1, void> Nbnxm::reduceForceIAndFShiftShuffles ( const FCiBufferWrapperX &  fCiBufX,
const FCiBufferWrapperY &  fCiBufY,
const FCiBufferWrapperZ &  fCiBufZ,
const bool  calcFShift,
const sycl::nd_item< 3 >  itemIdx,
const int  tidxi,
const int  tidxj,
const int  sci,
const int  shift,
sycl::global_ptr< Float3 a_f,
sycl::global_ptr< Float3 a_fShift 
)
inlinestatic

Shuffle-based i-force reduction.

We need to reduce c_clSize values spaced c_clSize threads apart (hardware threads are consecutive for tidxi, have stride c_clSize for tidxj).

We can have up to three reduction steps done with shuffles:

One step (e.g, Intel iGPU, c_clSize == 4, subGroupSize == 8): handled in a separate specialization. Two steps (e.g., NVIDIA, c_clSize == 8, subGroupSize == 32): after two shuffle reduction steps, we do atomicFetchAdd from each sub-group. Three steps (e.g., AMD CDNA, c_clSize == 8, subGroupSize == 64): similar to the two-step approach, but we have two times less atomicFetchAdd's.

template<int numShuffleReductionSteps, typename FCiBufferWrapperX , typename FCiBufferWrapperY , typename FCiBufferWrapperZ >
static std::enable_if_t<numShuffleReductionSteps == 1, void> Nbnxm::reduceForceIAndFShiftShuffles ( const FCiBufferWrapperX &  fCiBufX,
const FCiBufferWrapperY &  fCiBufY,
const FCiBufferWrapperZ &  fCiBufZ,
const bool  calcFShift,
const sycl::nd_item< 3 >  itemIdx,
const int  tidxi,
const int  tidxj,
const int  sci,
const int  shift,
sycl::global_ptr< Float3 a_f,
sycl::global_ptr< Float3 a_fShift 
)
inlinestatic

reduceForceIAndFShiftShuffles specialization for single-step reduction (e.g., Intel iGPUs).

We have three components to reduce, but only one reduction step, so it is not possible to gather the components in fx of different threads, like we do with 2 and more reduction steps.

Therefore, first even threads handle X and odd threads handle Y; then, even threads additionally handle Z. This also requires the second fShift buffer register.

After one step of reduction using shuffles is complete, we use atomicAdd to accumulate the results in the global memory. That causes a lot of atomic operations on a single memory location, which is poorly handled by some hardware (e.g., Intel Gen9-11 and Xe LP). This can be remediated by using local memory reduction after shuffles, but that's a TODO.

static void Nbnxm::reduceForceJGeneric ( sycl::local_ptr< float >  sm_buf,
Float3  f,
const sycl::nd_item< 3 > &  itemIdx,
const int  tidxi,
const int  tidxj,
const int  aidx,
sycl::global_ptr< Float3 a_f 
)
inlinestatic

Reduce c_clSize j-force components using local memory and atomically accumulate into a_f.

c_clSize consecutive threads hold the force components of a j-atom which we reduced in cl_Size steps using shift and atomically accumulate them into a_f.

TODO: implement binary reduction flavor for the case where cl_Size is power of two.

static void Nbnxm::reduceForceJShuffle ( Float3  f,
const sycl::nd_item< 3 > &  itemIdx,
const int  tidxi,
const int  aidx,
sycl::global_ptr< Float3 a_f 
)
inlinestatic

Reduce c_clSize j-force components using shifts and atomically accumulate into a_f.

c_clSize consecutive threads hold the force components of a j-atom which we reduced in log2(cl_Size) steps using shift and atomically accumulate them into a_f.

static constexpr int Nbnxm::sc_jClusterSize ( const KernelType  kernelType)
static

The nbnxn j-cluster size in atoms for the given NBNxM kernel type.

Note
When including this file in files compiled for SYCL devices only, this function can not be called for SIMD kernel types. This is asserted.
static cl_kernel Nbnxm::selectPruneKernel ( cl_kernel  kernel_pruneonly[],
bool  firstPrunePass 
)
inlinestatic

Return a pointer to the prune kernel version to be executed at the current invocation.

Parameters
[in]kernel_pruneonlyarray of prune kernel objects
[in]firstPrunePasstrue if the first pruning pass is being executed
void Nbnxm::setupGpuShortRangeWork ( NbnxmGpu gmx_unused nb,
const gmx::ListedForcesGpu gmx_unused listedForcesGpu,
gmx::InteractionLocality gmx_unused  iLocality 
)

Set up internal flags that indicate what type of short-range work there is.

As nonbondeds and bondeds share input/output buffers and GPU queues, both are considered when checking for work in the current domain.

This function is expected to be called every time the work-distribution can change (i.e. at search/domain decomposition steps).

Parameters
[in,out]nbPointer to the nonbonded GPU data structure
[in]listedForcesGpuPointer to the GPU bonded data structure
[in]iLocalityInteraction locality identifier
static void Nbnxm::sort_atoms ( int  dim,
gmx_bool  Backwards,
int gmx_unused  dd_zone,
bool gmx_unused  relevantAtomsAreWithinGridBounds,
int *  a,
int  n,
gmx::ArrayRef< const gmx::RVec x,
real  h0,
real  invh,
int  n_per_h,
gmx::ArrayRef< int >  sort 
)
static

Sorts particle index a on coordinates x along dim.

Backwards tells if we want decreasing iso increasing coordinates. h0 is the minimum of the coordinate range. invh is the 1/length of the sorting range. n_per_h (>=n) is the expected average number of particles per 1/invh sort is the sorting work array. sort should have a size of at least n_per_h*c_sortGridRatio + n, or easier, allocate at least n*c_sortGridMaxSizeFactor elements.

static void Nbnxm::sort_cluster_on_flag ( int  numAtomsInCluster,
int  atomStart,
int  atomEnd,
gmx::ArrayRef< const int32_t >  atomInfo,
gmx::ArrayRef< int >  order,
int *  flags 
)
static

Set non-bonded interaction flags for the current cluster.

Sorts atoms on LJ coefficients: !=0 first, ==0 at the end.

static bool Nbnxm::useLjCombRule ( const enum VdwType  vdwType)
inlinestatic

Returns true if LJ combination rules are used in the non-bonded kernels.

Parameters
[in]vdwTypeThe VdW interaction/implementation type as defined by VdwType enumeration.
Returns
Whether combination rules are used by the run.

Variable Documentation

constexpr bool Nbnxm::c_avoidFloatingPointAtomics = (c_clSize == 4)

Should we avoid FP atomics to the same location from the same work-group?

Intel GPUs without native floating-point operations emulate them via CAS-loop, which is very, very slow when two threads from the same group write to the same global location. We don't specialize the kernels by vendor, so we use c_clSize == 4 as a proxy to detect such devices.

constexpr int Nbnxm::c_nbnxnGpuExclSize
static
Initial value:
=
c_nbnxnGpuClusterSize * c_nbnxnGpuClusterSize / c_nbnxnGpuClusterpairSplit

The fixed size of the exclusion mask array for a half GPU cluster pair.

constexpr int Nbnxm::c_syclPruneKernelJPackedConcurrency = 4
static

Prune kernel's jPacked processing concurrency.

The GMX_NBNXN_PRUNE_KERNEL_JPACKED_CONCURRENCY macro allows compile-time override.

unsigned int Nbnxm::gpu_min_ci_balanced_factor = 50
static

This parameter should be determined heuristically from the kernel execution times.

This value is best for small systems on a single AMD Radeon R9 290X (and about 5% faster than 40, which is the default for CUDA devices). Larger simulation systems were quite insensitive to the value of this parameter.