Gromacs
2024.3
|
Namespace for non-bonded kernels.
Classes | |
struct | EnergyFunctionProperties |
Set of boolean constants mimicking preprocessor macros. More... | |
Typedefs | |
using | mode = sycl::access_mode |
Enumerations | |
enum | BenchMarkKernels : int { SimdAuto, SimdNo, Simd4XM, Simd2XMM, Count } |
Enum for selecting the SIMD kernel type for benchmarks. | |
enum | BenchMarkCombRule : int { RuleGeom, RuleLB, RuleNone, Count } |
Enum for selecting the combination rule for kernel benchmarks. | |
enum | BenchMarkCoulomb : int { Pme, ReactionField, Count } |
Enum for selecting coulomb type for kernel benchmarks. | |
enum | ElecType : int { ElecType::Cut, ElecType::RF, ElecType::EwaldTab, ElecType::EwaldTabTwin, ElecType::EwaldAna, ElecType::EwaldAnaTwin, ElecType::Count } |
Nbnxm electrostatic GPU kernel flavors. More... | |
enum | VdwType : int { VdwType::Cut, VdwType::CutCombGeom, VdwType::CutCombLB, VdwType::FSwitch, VdwType::PSwitch, VdwType::EwaldGeom, VdwType::EwaldLB, VdwType::Count } |
Nbnxm VdW GPU kernel flavors. More... | |
enum | KernelType : int { NotSet = 0, Cpu4x4_PlainC, Cpu4xN_Simd_4xN, Cpu4xN_Simd_2xNN, Gpu8x8x8, Cpu8x8x8_PlainC, Count } |
Nonbonded NxN kernel types: plain C, CPU SIMD, GPU, GPU emulation. | |
enum | EwaldExclusionType : int { NotSet = 0, Table, Analytical, DecidedByGpuModule } |
Ewald exclusion types. | |
enum | NonbondedResource : int { Cpu, Gpu, EmulateGpu } |
Resources that can be used to execute non-bonded kernels on. | |
Functions | |
static std::optional< std::string > | checkKernelSetup (const KernelBenchOptions &options) |
Checks the kernel setup. More... | |
static KernelType | translateBenchmarkEnum (const BenchMarkKernels &kernel) |
Helper to translate between the different enumeration values. | |
static KernelSetup | getKernelSetup (const KernelBenchOptions &options) |
Returns the kernel setup. | |
static interaction_const_t | setupInteractionConst (const KernelBenchOptions &options) |
Return an interaction constants struct with members used in the benchmark set appropriately. | |
static std::unique_ptr < nonbonded_verlet_t > | setupNbnxmForBenchInstance (const KernelBenchOptions &options, const gmx::BenchmarkSystem &system) |
Sets up and returns a Nbnxm object for the given benchmark options and system. | |
static void | expandSimdOptionAndPushBack (const KernelBenchOptions &options, std::vector< KernelBenchOptions > *optionsList) |
Add the options instance to the list for all requested kernel SIMD types. | |
static void | setupAndRunInstance (const gmx::BenchmarkSystem &system, const KernelBenchOptions &options, const bool doWarmup) |
Sets up and runs the requested benchmark instance and prints the results. | |
void | bench (int sizeFactor, const KernelBenchOptions &options) |
Sets up and runs one or more Nbnxm kernel benchmarks. More... | |
void | cuda_set_cacheconfig () |
Set up the cache configuration for the non-bonded kernels. | |
static void | countPruneKernelTime (Nbnxm::GpuTimers *timers, gmx_wallclock_gpu_nbnxn_t *timings, const InteractionLocality iloc) |
Count pruning kernel time if either kernel has been triggered. More... | |
static void | gpu_reduce_staged_outputs (const NBStagingData &nbst, const InteractionLocality iLocality, const bool reduceEnergies, const bool reduceFshift, real *e_lj, real *e_el, rvec *fshift) |
Reduce data staged internally in the nbnxn module. More... | |
template<typename GpuPairlist > | |
static void | gpu_accumulate_timings (gmx_wallclock_gpu_nbnxn_t *timings, Nbnxm::GpuTimers *timers, const GpuPairlist *plist, AtomLocality atomLocality, const gmx::StepWorkload &stepWork, bool doTiming) |
Do the per-step timing accounting of the nonbonded tasks. More... | |
bool | gpu_try_finish_task (NbnxmGpu *nb, const gmx::StepWorkload &stepWork, const AtomLocality aloc, real *e_lj, real *e_el, gmx::ArrayRef< gmx::RVec > shiftForces, GpuTaskCompletion completionKind) |
Attempts to complete nonbonded GPU task. More... | |
float | gpu_wait_finish_task (NbnxmGpu *nb, const gmx::StepWorkload &stepWork, AtomLocality aloc, real *e_lj, real *e_el, gmx::ArrayRef< gmx::RVec > shiftForces, gmx_wallcycle *wcycle) |
Wait for the asynchronously launched nonbonded tasks and data transfers to finish. More... | |
static bool | canSkipNonbondedWork (const NbnxmGpu &nb, InteractionLocality iloc) |
An early return condition for empty NB GPU workloads. More... | |
static gmx::Range< int > | getGpuAtomRange (const NBAtomDataGpu *atomData, const AtomLocality atomLocality) |
Calculate atom range and return start index and length. More... | |
NbnxmGpu * | gpu_init (const gmx::DeviceStreamManager gmx_unused &deviceStreamManager, const interaction_const_t gmx_unused *ic, const PairlistParams gmx_unused &listParams, const nbnxn_atomdata_t gmx_unused *nbat, bool gmx_unused bLocalAndNonlocal) |
Initializes the data structures related to GPU nonbonded calculations. More... | |
void | gpu_init_pairlist (NbnxmGpu gmx_unused *nb, const struct NbnxnPairlistGpu gmx_unused *h_nblist, gmx::InteractionLocality gmx_unused iloc) |
Initializes pair-list data for GPU, called at every pair search step. More... | |
void | gpu_init_atomdata (NbnxmGpu gmx_unused *nb, const nbnxn_atomdata_t gmx_unused *nbat) |
Initializes atom-data on the GPU, called at every pair search step. More... | |
void | gpu_pme_loadbal_update_param (struct nonbonded_verlet_t gmx_unused *nbv, const interaction_const_t gmx_unused &ic) |
Re-generate the GPU Ewald force table, resets rlist, and update the electrostatic type switching to twin cut-off (or back) if needed. | |
void | gpu_upload_shiftvec (NbnxmGpu gmx_unused *nb, const nbnxn_atomdata_t gmx_unused *nbatom) |
Uploads shift vector to the GPU if the box is dynamic (otherwise just returns). More... | |
void | gpu_clear_outputs (NbnxmGpu gmx_unused *nb, bool gmx_unused computeVirial) |
Clears GPU outputs: nonbonded force, shift force and energy. More... | |
void | gpu_free (NbnxmGpu gmx_unused *nb) |
Frees all GPU resources used for the nonbonded calculations. More... | |
struct gmx_wallclock_gpu_nbnxn_t * | gpu_get_timings (NbnxmGpu gmx_unused *nb) |
Returns the GPU timings structure or NULL if GPU is not used or timing is off. More... | |
void | gpu_reset_timings (struct nonbonded_verlet_t gmx_unused *nbv) |
Resets nonbonded GPU timings. More... | |
int | gpu_min_ci_balanced (NbnxmGpu gmx_unused *nb) |
Calculates the minimum size of proximity lists to improve SM load balance with GPU non-bonded kernels. More... | |
bool | gpu_is_kernel_ewald_analytical (const NbnxmGpu gmx_unused *nb) |
Returns if analytical Ewald GPU kernels are used. More... | |
NBAtomDataGpu * | gpuGetNBAtomData (NbnxmGpu gmx_unused *nb) |
Returns an opaque pointer to the GPU NBNXM atom data. | |
DeviceBuffer< gmx::RVec > | gpu_get_f (NbnxmGpu gmx_unused *nb) |
Returns forces device buffer. | |
static real | gridAtomDensity (int numAtoms, const rvec lowerCorner, const rvec upperCorner) |
Returns the atom density (> 0) of a rectangular grid. | |
static std::array< real, DIM-1 > | getTargetCellLength (const Grid::Geometry &geometry, const real atomDensity) |
Get approximate dimensions of each cell. Returns the length along X and Y. | |
static int | getMaxNumCells (const Grid::Geometry &geometry, const int numAtoms, const int numColumns) |
static void | sort_atoms (int dim, gmx_bool Backwards, int gmx_unused dd_zone, bool gmx_unused relevantAtomsAreWithinGridBounds, int *a, int n, gmx::ArrayRef< const gmx::RVec > x, real h0, real invh, int n_per_h, gmx::ArrayRef< int > sort) |
Sorts particle index a on coordinates x along dim. More... | |
static float | R2F_D (const float x) |
Returns x. | |
static float | R2F_U (const float x) |
Returns x. | |
static void | calc_bounding_box (int na, int stride, const real *x, BoundingBox *bb) |
Computes the bounding box for na coordinates in order x,y,z, bb order xyz0. | |
static void | calc_bounding_box_x_x4 (int na, const real *x, BoundingBox *bb) |
Computes the bounding box for na packed coordinates, bb order xyz0. | |
static void | calc_bounding_box_x_x8 (int na, const real *x, BoundingBox *bb) |
Computes the bounding box for na coordinates, bb order xyz0. | |
static gmx_unused void | calc_bounding_box_x_x4_halves (int na, const real *x, BoundingBox *bb, BoundingBox *bbj) |
Computes the bounding box for na packed coordinates, bb order xyz0. | |
static void | combine_bounding_box_pairs (const Grid &grid, gmx::ArrayRef< const BoundingBox > bb, gmx::ArrayRef< BoundingBox > bbj) |
Combines pairs of consecutive bounding boxes. | |
static void | print_bbsizes_simple (FILE *fp, const Grid &grid) |
Prints the average bb size, used for debug output. | |
static void | print_bbsizes_supersub (FILE *fp, const Grid &grid) |
Prints the average bb size, used for debug output. | |
static void | sort_cluster_on_flag (int numAtomsInCluster, int atomStart, int atomEnd, gmx::ArrayRef< const int64_t > atomInfo, gmx::ArrayRef< int > order, int *flags) |
Set non-bonded interaction flags for the current cluster. More... | |
static void | setCellAndAtomCount (gmx::ArrayRef< int > cell, int cellIndex, gmx::ArrayRef< int > cxy_na, int atomIndex) |
Sets the cell index in the cell array for atom atomIndex and increments the atom count for the grid column. | |
static void | resizeForNumberOfCells (const int numNbnxnAtoms, const int numAtomsMoved, GridSetData *gridSetData, nbnxn_atomdata_t *nbat) |
Resizes grid and atom data which depend on the number of cells. | |
real | generateAndFill2DGrid (Grid *grid, gmx::ArrayRef< GridWork > gridWork, gmx::HostVector< int > *cells, const rvec lowerCorner, const rvec upperCorner, const gmx::UpdateGroupsCog *updateGroupsCog, gmx::Range< int > atomRange, real *atomDensity, real maxAtomGroupRadius, bool haveFep, gmx::ArrayRef< const gmx::RVec > x, int ddZone, const int *move, int numAtomsMoved, bool computeGridDensityRatio) |
Sets the 2D search grid dimensions puts the atoms on the 2D grid. More... | |
static int | numGrids (const GridSet::DomainSetup &domainSetup) |
Returns the number of search grids. | |
static int | getGridOffset (gmx::ArrayRef< const Grid > grids, int gridIndex) |
const char * | lookup_kernel_name (Nbnxm::KernelType kernelType) |
Return a string identifying the kernel type. More... | |
std::unique_ptr < nonbonded_verlet_t > | init_nb_verlet (const gmx::MDLogger &mdlog, const t_inputrec &inputrec, const t_forcerec &forcerec, const t_commrec *commrec, const gmx_hw_info_t &hardwareInfo, bool useGpuForNonbonded, const gmx::DeviceStreamManager *deviceStreamManager, const gmx_mtop_t &mtop, gmx::ObservablesReducerBuilder *observablesReducerBuilder, gmx::ArrayRef< const gmx::RVec > coordinates, matrix box, gmx_wallcycle *wcycle) |
Creates an Nbnxm object. | |
static bool | kernelTypeUsesSimplePairlist (const KernelType kernelType) |
Returns whether the pair-list corresponding to nb_kernel_type is simple. | |
static bool | kernelTypeIsSimd (const KernelType kernelType) |
Returns whether a SIMD kernel is in use. | |
static bool | useLjCombRule (const enum VdwType vdwType) |
Returns true if LJ combination rules are used in the non-bonded kernels. More... | |
void | gpu_copy_xq_to_gpu (NbnxmGpu gmx_unused *nb, const struct nbnxn_atomdata_t gmx_unused *nbdata, gmx::AtomLocality gmx_unused aloc) |
Launch asynchronously the xq buffer host to device copy. More... | |
void | gpu_launch_kernel (NbnxmGpu gmx_unused *nb, const gmx::StepWorkload gmx_unused &stepWork, gmx::InteractionLocality gmx_unused iloc) |
Launch asynchronously the nonbonded force calculations. More... | |
void | gpu_launch_kernel_pruneonly (NbnxmGpu gmx_unused *nb, gmx::InteractionLocality gmx_unused iloc, int gmx_unused numParts) |
Launch asynchronously the nonbonded prune-only kernel. More... | |
void | gpu_launch_cpyback (NbnxmGpu gmx_unused *nb, nbnxn_atomdata_t gmx_unused *nbatom, const gmx::StepWorkload gmx_unused &stepWork, gmx::AtomLocality gmx_unused aloc) |
Launch asynchronously the download of short-range forces from the GPU (and energies/shift forces if required). | |
bool | gpu_try_finish_task (NbnxmGpu gmx_unused *nb, const gmx::StepWorkload gmx_unused &stepWork, gmx::AtomLocality gmx_unused aloc, real gmx_unused *e_lj, real gmx_unused *e_el, gmx::ArrayRef< gmx::RVec > gmx_unused shiftForces, GpuTaskCompletion gmx_unused completionKind) |
Attempts to complete nonbonded GPU task. More... | |
float | gpu_wait_finish_task (NbnxmGpu gmx_unused *nb, const gmx::StepWorkload gmx_unused &stepWork, gmx::AtomLocality gmx_unused aloc, real gmx_unused *e_lj, real gmx_unused *e_el, gmx::ArrayRef< gmx::RVec > gmx_unused shiftForces, gmx_wallcycle gmx_unused *wcycle) |
Completes the nonbonded GPU task blocking until GPU tasks and data transfers to finish. More... | |
void | nbnxn_gpu_init_x_to_nbat_x (const Nbnxm::GridSet gmx_unused &gridSet, NbnxmGpu gmx_unused *gpu_nbv) |
Initialization for X buffer operations on GPU. Called on the NS step and performs (re-)allocations and memory copies. ! | |
void | nbnxn_gpu_x_to_nbat_x (const Nbnxm::Grid gmx_unused &grid, NbnxmGpu gmx_unused *gpu_nbv, DeviceBuffer< gmx::RVec > gmx_unused d_x, GpuEventSynchronizer gmx_unused *xReadyOnDevice, gmx::AtomLocality gmx_unused locality, int gmx_unused gridId, int gmx_unused numColumnsMax, bool gmx_unused mustInsertNonLocalDependency) |
X buffer operations on GPU: performs conversion from rvec to nb format. More... | |
void | nbnxnInsertNonlocalGpuDependency (NbnxmGpu gmx_unused *nb, gmx::InteractionLocality gmx_unused interactionLocality) |
Sync the nonlocal stream with dependent tasks in the local queue. More... | |
void | setupGpuShortRangeWork (NbnxmGpu gmx_unused *nb, const gmx::ListedForcesGpu gmx_unused *listedForcesGpu, gmx::InteractionLocality gmx_unused iLocality) |
Set up internal flags that indicate what type of short-range work there is. More... | |
bool | haveGpuShortRangeWork (const NbnxmGpu gmx_unused *nb, gmx::InteractionLocality gmx_unused interactionLocality) |
Returns true if there is GPU short-range work for the given interaction locality. More... | |
void | nbnxn_gpu_x_to_nbat_x (const Nbnxm::Grid &grid, NbnxmGpu *nb, DeviceBuffer< gmx::RVec > d_x, GpuEventSynchronizer *xReadyOnDevice, const gmx::AtomLocality locality, int gridId, int numColumnsMax, bool mustInsertNonLocalDependency) |
void | launchNbnxmKernelTransformXToXq (const Grid &grid, NbnxmGpu *nb, DeviceBuffer< Float3 > d_x, const DeviceStream &deviceStream, unsigned int numColumnsMax, int gridId) |
Launch coordinate layout conversion kernel. More... | |
static void | init_ewald_coulomb_force_table (const EwaldCorrectionTables &tables, NBParamGpu *nbp, const DeviceContext &deviceContext) |
static bool | useTabulatedEwaldByDefault (const DeviceInformation &deviceInfo) |
static ElecType | nbnxn_gpu_pick_ewald_kernel_type (const interaction_const_t &ic, const DeviceInformation &deviceInfo) |
static void | set_cutoff_parameters (NBParamGpu *nbp, const interaction_const_t &ic, const PairlistParams &listParams) |
static void | initPlistSorting (gpuPlistSorting *sorting) |
static void | init_plist (gpu_plist *pl) |
static void | init_timings (gmx_wallclock_gpu_nbnxn_t *t) |
static void | initAtomdataFirst (NBAtomDataGpu *atomdata, int numTypes, const DeviceContext &deviceContext, const DeviceStream &localStream) |
Initialize atomdata first time; it only gets filled at pair-search. | |
static VdwType | nbnxmGpuPickVdwKernelType (const interaction_const_t &ic, LJCombinationRule ljCombinationRule) |
static ElecType | nbnxmGpuPickElectrostaticsKernelType (const interaction_const_t &ic, const DeviceInformation &deviceInfo) |
static void | initNbparam (NBParamGpu *nbp, const interaction_const_t &ic, const PairlistParams &listParams, const nbnxn_atomdata_t::Params &nbatParams, const DeviceContext &deviceContext) |
Initialize the nonbonded parameter data structure. | |
NbnxmGpu * | gpu_init (const gmx::DeviceStreamManager &deviceStreamManager, const interaction_const_t *ic, const PairlistParams &listParams, const nbnxn_atomdata_t *nbat, const bool bLocalAndNonlocal) |
void | gpu_pme_loadbal_update_param (nonbonded_verlet_t *nbv, const interaction_const_t &ic) |
void | gpu_upload_shiftvec (NbnxmGpu *nb, const nbnxn_atomdata_t *nbatom) |
void | gpu_init_pairlist (NbnxmGpu *nb, const NbnxnPairlistGpu *h_plist, const InteractionLocality iloc) |
This function is documented in the header file. | |
void | gpu_init_atomdata (NbnxmGpu *nb, const nbnxn_atomdata_t *nbat) |
void | gpu_clear_outputs (NbnxmGpu *nb, bool computeVirial) |
gmx_wallclock_gpu_nbnxn_t * | gpu_get_timings (NbnxmGpu *nb) |
This function is documented in the header file. | |
void | gpu_reset_timings (nonbonded_verlet_t *nbv) |
This function is documented in the header file. | |
bool | gpu_is_kernel_ewald_analytical (const NbnxmGpu *nb) |
void | setupGpuShortRangeWork (NbnxmGpu *nb, const gmx::ListedForcesGpu *listedForcesGpu, const gmx::InteractionLocality iLocality) |
bool | haveGpuShortRangeWork (const NbnxmGpu *nb, const gmx::InteractionLocality interactionLocality) |
void | gpu_launch_cpyback (NbnxmGpu *nb, struct nbnxn_atomdata_t *nbatom, const gmx::StepWorkload &stepWork, const AtomLocality atomLocality) |
Launch asynchronously the download of nonbonded forces from the GPU (and energies/shift forces if required). | |
void | nbnxnInsertNonlocalGpuDependency (NbnxmGpu *nb, const InteractionLocality interactionLocality) |
void | gpu_copy_xq_to_gpu (NbnxmGpu *nb, const nbnxn_atomdata_t *nbatom, const AtomLocality atomLocality) |
Launch asynchronously the xq buffer host to device copy. | |
void | nbnxn_gpu_init_x_to_nbat_x (const Nbnxm::GridSet &gridSet, NbnxmGpu *gpu_nbv) |
void | gpu_free (NbnxmGpu *nb) |
This function is documented in the header file. | |
NBAtomDataGpu * | gpuGetNBAtomData (NbnxmGpu *nb) |
DeviceBuffer< gmx::RVec > | gpu_get_f (NbnxmGpu *nb) |
void | gpu_init_platform_specific (NbnxmGpu *nb) |
Initializes the NBNXM GPU data structures. | |
void | gpu_free_platform_specific (NbnxmGpu *nb) |
Releases the NBNXM GPU data structures. More... | |
static bool | nbnxn_simd_supported (const gmx::MDLogger &mdlog, const t_inputrec &inputrec) |
Returns whether CPU SIMD support exists for the given inputrec. More... | |
static KernelSetup | pick_nbnxn_kernel_cpu (const t_inputrec gmx_unused &inputrec, const gmx_hw_info_t gmx_unused &hardwareInfo) |
Returns the most suitable CPU kernel type and Ewald handling. | |
static KernelSetup | pick_nbnxn_kernel (const gmx::MDLogger &mdlog, gmx_bool use_simd_kernels, const gmx_hw_info_t &hardwareInfo, const NonbondedResource &nonbondedResource, const t_inputrec &inputrec) |
Returns the most suitable kernel type and Ewald handling. | |
static int | getMinimumIlistCountForGpuBalancing (NbnxmGpu *nbnxmGpu) |
Gets and returns the minimum i-list count for balacing based on the GPU used or env.var. when set. | |
static int | getENbnxnInitCombRule (const t_forcerec &forcerec) |
static void | validate_global_work_size (const KernelLaunchConfig &config, int work_dim, const DeviceInformation *dinfo) |
Validates the input global work size parameter. | |
static cl_kernel | selectPruneKernel (cl_kernel kernel_pruneonly[], bool firstPrunePass) |
Return a pointer to the prune kernel version to be executed at the current invocation. More... | |
static cl_kernel | select_nbnxn_kernel (NbnxmGpu *nb, enum ElecType elecType, enum VdwType vdwType, bool bDoEne, bool bDoPrune) |
Return a pointer to the kernel version to be executed at the current step. OpenCL kernel objects are cached in nb. If the requested kernel is not found in the cache, it will be created and the cache will be updated. | |
static int | calc_shmem_required_nonbonded (enum VdwType vdwType, bool bPrefetchLjParam) |
Calculates the amount of shared memory required by the nonbonded kernel in use. | |
static void | fillin_ocl_structures (NBParamGpu *nbp, cl_nbparam_params_t *nbparams_params) |
Initializes data structures that are going to be sent to the OpenCL device. More... | |
void | gpu_launch_kernel (NbnxmGpu *nb, const gmx::StepWorkload &stepWork, const Nbnxm::InteractionLocality iloc) |
Launch GPU kernel. More... | |
static int | calc_shmem_required_prune (const int num_threads_z) |
Calculates the amount of shared memory required by the prune kernel. More... | |
void | gpu_launch_kernel_pruneonly (NbnxmGpu *nb, const InteractionLocality iloc, const int numParts) |
Launch the pairlist prune only kernel for the given locality. numParts tells in how many parts, i.e. calls the list will be pruned. | |
static cl_kernel | nbnxn_gpu_create_kernel (NbnxmGpu *nb, const char *kernel_name) |
Initializes the OpenCL kernel pointers of the nbnxn_ocl_ptr_t input data structure. | |
static void | nbnxn_gpu_init_kernels (NbnxmGpu *nb) |
Initializes the OpenCL kernel pointers of the nbnxn_ocl_ptr_t input data structure. | |
static void | free_kernel (cl_kernel *kernel_ptr) |
Releases an OpenCL kernel pointer. | |
static void | free_kernels (cl_kernel *kernels, int count) |
Releases a list of OpenCL kernel pointers. | |
static void | freeGpuProgram (cl_program program) |
Free the OpenCL program. More... | |
int | gpu_min_ci_balanced (NbnxmGpu *nb) |
This function is documented in the header file. | |
static auto | nbnxmKernelTransformXToXq (Float4 *__restrict__ gm_xq, const Float3 *__restrict__ gm_x, const int *__restrict__ gm_atomIndex, const int *__restrict__ gm_numAtoms, const int *__restrict__ gm_cellIndex, int cellOffset, int numAtomsPerCell, int columnsOffset) |
SYCL kernel for transforming position coordinates from rvec to nbnxm layout. More... | |
static int | getNbnxmSubGroupSize (const DeviceInformation &deviceInfo) |
template<int subGroupSize, bool doPruneNBL, bool doCalcEnergies> | |
void | launchNbnxmKernelHelper (NbnxmGpu *nb, const gmx::StepWorkload &stepWork, const InteractionLocality iloc) |
template<int subGroupSize> | |
void | launchNbnxmKernel (NbnxmGpu *nb, const gmx::StepWorkload &stepWork, const InteractionLocality iloc) |
Launch SYCL NBNXM kernel. More... | |
static Float2 | convertSigmaEpsilonToC6C12 (const float sigma, const float epsilon) |
Convert sigma and epsilon VdW parameters to c6 ,c12 pair. | |
template<bool doCalcEnergies> | |
static void | ljForceSwitch (const shift_consts_t dispersionShift, const shift_consts_t repulsionShift, const float rVdwSwitch, const float c6, const float c12, const float rInv, const float r2, sycl::private_ptr< float > fInvR, sycl::private_ptr< float > eLJ) |
Calculate force and energy for a pair of atoms, VdW force-switch flavor. | |
template<enum VdwType vdwType> | |
static float | calculateLJEwaldC6Grid (const sycl::global_ptr< const Float2 > a_nbfpComb, const int typeI, const int typeJ) |
Fetch C6 grid contribution coefficients and return the product of these. | |
template<bool doCalcEnergies, enum VdwType vdwType> | |
static void | ljEwaldComb (const sycl::global_ptr< const Float2 > a_nbfpComb, const float sh_lj_ewald, const int typeI, const int typeJ, const float r2, const float r2Inv, const float lje_coeff2, const float lje_coeff6_6, const float int_bit, sycl::private_ptr< float > fInvR, sycl::private_ptr< float > eLJ) |
Calculate LJ-PME grid force contribution with geometric or LB combination rule. | |
template<bool doCalcEnergies> | |
static void | ljPotentialSwitch (const switch_consts_t vdwSwitch, const float rVdwSwitch, const float rInv, const float r2, sycl::private_ptr< float > fInvR, sycl::private_ptr< float > eLJ) |
Apply potential switch. | |
static float | pmeCorrF (const float z2) |
Calculate analytical Ewald correction term. | |
template<typename T > | |
static T | lerp (T d0, T d1, T t) |
Linear interpolation using exactly two FMA operations. More... | |
static float | interpolateCoulombForceR (const sycl::global_ptr< const float > a_coulombTab, const float coulombTabScale, const float r) |
Interpolate Ewald coulomb force correction using the F*r table. | |
static void | reduceForceJShuffle (Float3 f, const sycl::nd_item< 3 > itemIdx, const int tidxi, const int aidx, sycl::global_ptr< Float3 > a_f) |
Reduce c_clSize j-force components using shifts and atomically accumulate into a_f. More... | |
template<int subGroupSize, int groupSize> | |
static float | groupReduce (const sycl::nd_item< 3 > itemIdx, const unsigned int tidxi, sycl::local_ptr< float > sm_buf, float valueToReduce) |
Do workgroup-level reduction of a single float . More... | |
static void | reduceForceJGeneric (sycl::local_ptr< float > sm_buf, Float3 f, const sycl::nd_item< 3 > itemIdx, const int tidxi, const int tidxj, const int aidx, sycl::global_ptr< Float3 > a_f) |
Reduce c_clSize j-force components using local memory and atomically accumulate into a_f. More... | |
template<bool useShuffleReduction> | |
static void | reduceForceJ (sycl::local_ptr< float > sm_buf, Float3 f, const sycl::nd_item< 3 > itemIdx, const int tidxi, const int tidxj, const int aidx, sycl::global_ptr< Float3 > a_f) |
Reduce c_clSize j-force components using either shifts or local memory and atomically accumulate into a_f. | |
template<typename FCiBufferWrapperX , typename FCiBufferWrapperY , typename FCiBufferWrapperZ > | |
static void | reduceForceIAndFShiftGeneric (sycl::local_ptr< float > sm_buf, const FCiBufferWrapperX &fCiBufX, const FCiBufferWrapperY &fCiBufY, const FCiBufferWrapperZ &fCiBufZ, const bool calcFShift, const sycl::nd_item< 3 > itemIdx, const int tidxi, const int tidxj, const int sci, const int shift, sycl::global_ptr< Float3 > a_f, sycl::global_ptr< Float3 > a_fShift) |
Local memory-based i-force reduction. More... | |
template<int numShuffleReductionSteps, typename FCiBufferWrapperX , typename FCiBufferWrapperY , typename FCiBufferWrapperZ > | |
static std::enable_if_t < numShuffleReductionSteps!=1, void > | reduceForceIAndFShiftShuffles (const FCiBufferWrapperX &fCiBufX, const FCiBufferWrapperY &fCiBufY, const FCiBufferWrapperZ &fCiBufZ, const bool calcFShift, const sycl::nd_item< 3 > itemIdx, const int tidxi, const int tidxj, const int sci, const int shift, sycl::global_ptr< Float3 > a_f, sycl::global_ptr< Float3 > a_fShift) |
Shuffle-based i-force reduction. More... | |
template<int numShuffleReductionSteps, typename FCiBufferWrapperX , typename FCiBufferWrapperY , typename FCiBufferWrapperZ > | |
static std::enable_if_t < numShuffleReductionSteps==1, void > | reduceForceIAndFShiftShuffles (const FCiBufferWrapperX &fCiBufX, const FCiBufferWrapperY &fCiBufY, const FCiBufferWrapperZ &fCiBufZ, const bool calcFShift, const sycl::nd_item< 3 > itemIdx, const int tidxi, const int tidxj, const int sci, const int shift, sycl::global_ptr< Float3 > a_f, sycl::global_ptr< Float3 > a_fShift) |
reduceForceIAndFShiftShuffles specialization for single-step reduction (e.g., Intel iGPUs). More... | |
template<bool useShuffleReduction, int subGroupSize, typename FCiBufferWrapperX , typename FCiBufferWrapperY , typename FCiBufferWrapperZ > | |
static void | reduceForceIAndFShift (sycl::local_ptr< float > sm_buf, const FCiBufferWrapperX &fCiBufX, const FCiBufferWrapperY &fCiBufY, const FCiBufferWrapperZ &fCiBufZ, const bool calcFShift, const sycl::nd_item< 3 > itemIdx, const int tidxi, const int tidxj, const int sci, const int shift, sycl::global_ptr< Float3 > a_f, sycl::global_ptr< Float3 > a_fShift) |
Final i-force reduction. More... | |
template<int subGroupSize, bool doPruneNBL, bool doCalcEnergies, enum ElecType elecType, enum VdwType vdwType> | |
static auto | nbnxmKernel (sycl::handler &cgh, const Float4 *__restrict__ gm_xq, Float3 *__restrict__ gm_f, const Float3 *__restrict__ gm_shiftVec, Float3 *__restrict__ gm_fShift, float *__restrict__ gm_energyElec, float *__restrict__ gm_energyVdw, nbnxn_cj_packed_t *__restrict__ gm_plistCJPacked, const nbnxn_sci_t *__restrict__ gm_plistSci, const nbnxn_excl_t *__restrict__ gm_plistExcl, const Float2 *__restrict__ gm_ljComb, const int *__restrict__ gm_atomTypes, const Float2 *__restrict__ gm_nbfp, const Float2 *__restrict__ gm_nbfpComb, const float *__restrict__ gm_coulombTab, const int numTypes, const float rCoulombSq, const float rVdwSq, const float twoKRf, const float ewaldBeta, const float rlistOuterSq, const float ewaldShift, const float epsFac, const float ewaldCoeffLJ_2, const float cRF, const shift_consts_t dispersionShift, const shift_consts_t repulsionShift, const switch_consts_t vdwSwitch, const float rVdwSwitch, const float ljEwaldShift, const float coulombTabScale, const bool calcShift) |
Main kernel for NBNXM. More... | |
template<int subGroupSize, bool doPruneNBL, bool doCalcEnergies, enum ElecType elecType, enum VdwType vdwType, class... Args> | |
static void | launchNbnxmKernel (const DeviceStream &deviceStream, const int numSci, Args &&...args) |
NBNXM kernel launch code. | |
template<int subGroupSize, bool doPruneNBL, bool doCalcEnergies, class... Args> | |
void | chooseAndLaunchNbnxmKernel (enum ElecType elecType, enum VdwType vdwType, Args &&...args) |
Select templated kernel and launch it. | |
template<bool haveFreshList> | |
auto | nbnxmKernelPruneOnly (sycl::handler &cgh, const Float4 *__restrict__ gm_xq, const Float3 *__restrict__ gm_shiftVec, nbnxn_cj_packed_t *__restrict__ gm_plistCJPacked, const nbnxn_sci_t *__restrict__ gm_plistSci, unsigned int *__restrict__ gm_plistIMask, const float rlistOuterSq, const float rlistInnerSq, const int numParts, const int part) |
Prune-only kernel for NBNXM. More... | |
template<bool haveFreshList, class... Args> | |
void | launchNbnxmKernelPruneOnly (const DeviceStream &deviceStream, const int numSciInPart, Args &&...args) |
Leap Frog SYCL prune-only kernel launch code. | |
template<class... Args> | |
void | chooseAndLaunchNbnxmKernelPruneOnly (bool haveFreshList, Args &&...args) |
Select templated kernel and launch it. | |
void | launchNbnxmKernelPruneOnly (NbnxmGpu *nb, const InteractionLocality iloc, const int numParts, const int part, const int numSciInPart) |
Launch SYCL NBNXM prune-only kernel. More... | |
Variables | |
static constexpr int | c_numBoundingBoxBounds1D = 2 |
The number of bounds along one dimension of a bounding box. | |
static constexpr int | c_sortGridRatio = 4 |
Ratio of grid cells to atoms. | |
static constexpr int | c_sortGridMaxSizeFactor = c_sortGridRatio + 1 |
Maximum ratio of holes used, in the worst case all particles end up in the last hole and we need num. atoms extra holes at the end. | |
constexpr int | c_numElecTypes = static_cast<int>(ElecType::Count) |
Number of possible ElecType values. | |
constexpr int | c_numVdwTypes = static_cast<int>(VdwType::Count) |
Number of possible VdwType values. | |
static constexpr gmx::EnumerationArray < KernelType, int > | IClusterSizePerKernelType |
The nbnxn i-cluster size in atoms for each nbnxn kernel type. More... | |
static constexpr gmx::EnumerationArray < KernelType, int > | JClusterSizePerKernelType |
The nbnxn j-cluster size in atoms for each nbnxn kernel type. More... | |
static const char * | nb_kfunc_noener_noprune_ptr [c_numElecTypes][c_numVdwTypes] |
Force-only kernel function names. | |
static const char * | nb_kfunc_ener_noprune_ptr [c_numElecTypes][c_numVdwTypes] |
Force + energy kernel function pointers. | |
static const char * | nb_kfunc_noener_prune_ptr [c_numElecTypes][c_numVdwTypes] |
Force + pruning kernel function pointers. | |
static const char * | nb_kfunc_ener_prune_ptr [c_numElecTypes][c_numVdwTypes] |
Force + energy + pruning kernel function pointers. | |
static unsigned int | gpu_min_ci_balanced_factor = 50 |
This parameter should be determined heuristically from the kernel execution times. More... | |
constexpr bool | c_avoidFloatingPointAtomics = (c_clSize == 4) |
Should we avoid FP atomics to the same location from the same work-group? More... | |
static constexpr int | c_syclPruneKernelJPackedConcurrency = 4 |
Prune kernel's jPacked processing concurrency. More... | |
static constexpr int | c_clSize = c_nbnxnGpuClusterSize |
Convenience constants. | |
template<enum VdwType vdwType> | |
constexpr bool | ljComb = EnergyFunctionProperties<ElecType::Count, vdwType>().vdwComb |
Templated constants to shorten kernel function declaration. | |
template<enum ElecType elecType> | |
constexpr bool | elecEwald = EnergyFunctionProperties<elecType, VdwType::Count>().elecEwald |
template<enum ElecType elecType> | |
constexpr bool | elecEwaldTab = EnergyFunctionProperties<elecType, VdwType::Count>().elecEwaldTab |
template<enum VdwType vdwType> | |
constexpr bool | ljEwald = EnergyFunctionProperties<ElecType::Count, vdwType>().vdwEwald |
|
strong |
Nbnxm electrostatic GPU kernel flavors.
Types of electrostatics implementations available in the GPU non-bonded force kernels. These represent both the electrostatics types implemented by the kernels (cut-off, RF, and Ewald - a subset of what's defined in enums.h) as well as encode implementation details analytical/tabulated and single or twin cut-off (for Ewald kernels). Note that the cut-off and RF kernels have only analytical flavor and unlike in the CPU kernels, the tabulated kernels are ATM Ewald-only.
The row-order of pointers to different electrostatic kernels defined in nbnxn_cuda.cu by the nb_*_kfunc_ptr function pointer table should match the order of enumerated types below.
|
strong |
Nbnxm VdW GPU kernel flavors.
The enumerates values correspond to the LJ implementations in the GPU non-bonded kernels.
The column-order of pointers to different electrostatic kernels defined in nbnxn_cuda_ocl.cpp/.cu by the nb_*_kfunc_ptr function pointer table should match the order of enumerated types below.
void Nbnxm::bench | ( | int | sizeFactor, |
const KernelBenchOptions & | options | ||
) |
Sets up and runs one or more Nbnxm kernel benchmarks.
The simulated system is a box of 1000 SPC/E water molecules scaled by the factor sizeFactor
, which has to be a power of 2. One or more benchmarks are run, as specified by options
. Benchmark settings and timings are printed to stdout.
[in] | sizeFactor | How much should the system size be increased. |
[in] | options | How the benchmark will be run. |
|
inlinestatic |
Calculates the amount of shared memory required by the prune kernel.
Note that for the sake of simplicity we use the CUDA terminology "shared memory" for OpenCL local memory.
[in] | num_threads_z | cjPacked concurrency equal to the number of threads/work items in the 3-rd dimension. |
|
inlinestatic |
An early return condition for empty NB GPU workloads.
This is currently used for non-local kernels/transfers only. Skipping the local kernel is more complicated, since the local part of the force array also depends on the non-local kernel. The skip of the local kernel is taken care of separately.
|
static |
Checks the kernel setup.
Returns an error string when the kernel is not available.
|
static |
Count pruning kernel time if either kernel has been triggered.
We do the accounting for either of the two pruning kernel flavors:
Note that the resetting of Nbnxm::GpuTimers::didPrune and Nbnxm::GpuTimers::didRollingPrune should happen after calling this function.
[in] | timers | structs with GPU timer objects |
[in,out] | timings | GPU task timing data |
[in] | iloc | interaction locality |
|
static |
Initializes data structures that are going to be sent to the OpenCL device.
The device can't use the same data structures as the host for two main reasons:
This function is called before the launch of both nbnxn and prune kernels.
|
static |
Free the OpenCL program.
The function releases the OpenCL program assuciated with the device that the calling PP rank is running on.
program | [in] OpenCL program to release. |
real Nbnxm::generateAndFill2DGrid | ( | Grid * | grid, |
gmx::ArrayRef< GridWork > | gridWork, | ||
gmx::HostVector< int > * | cells, | ||
const rvec | lowerCorner, | ||
const rvec | upperCorner, | ||
const gmx::UpdateGroupsCog * | updateGroupsCog, | ||
gmx::Range< int > | atomRange, | ||
real * | atomDensity, | ||
real | maxAtomGroupRadius, | ||
bool | haveFep, | ||
gmx::ArrayRef< const gmx::RVec > | x, | ||
int | ddZone, | ||
const int * | move, | ||
int | numAtomsMoved, | ||
bool | computeGridDensityRatio | ||
) |
Sets the 2D search grid dimensions puts the atoms on the 2D grid.
[in,out] | grid | The pair search grid for one DD zone |
[in,out] | gridWork | Working data for each thread |
[in,out] | cells | The grid cell list |
[in] | lowerCorner | The minimum Cartesian coordinates of the grid |
[in] | upperCorner | The maximum Cartesian coordinates of the grid |
[in] | updateGroupsCog | The center of geometry of update groups, can be nullptr |
[in] | atomRange | The range of atoms to put on this grid |
[in,out] | atomDensity | The atom density, will be computed when <= 0 |
[in] | maxAtomGroupRadius | The maximum radius of atom groups |
[in] | haveFep | Whether non-bonded parameters are perturbed |
[in] | x | The coordinates of the atoms |
[in] | ddZone | The domain decomposition zone |
[in] | move | Tells whether atoms have moved to another DD domain |
[in] | numAtomsMoved | The number of atoms that moved to another DD domain |
[in] | computeGridDensityRatio | When true, return the grid density ratio |
computeGridDensityRatio==true
, the ratio of the effective 2D grid density and the uniform grid density
|
inlinestatic |
Calculate atom range and return start index and length.
[in] | atomData | Atom descriptor data structure |
[in] | atomLocality | Atom locality specifier |
|
inlinestatic |
Do the per-step timing accounting of the nonbonded tasks.
Does timing accumulation and call-count increments for the nonbonded kernels. Note that this function should be called after the current step's nonbonded nonbonded tasks have completed with the exception of the rolling pruning kernels that are accounted for during the following step.
NOTE: if timing with multiple GPUs (streams) becomes possible, the counters could end up being inconsistent due to not being incremented on some of the node when this is skipped on empty local domains!
GpuPairlist | Pair list type |
[out] | timings | Pointer to the NB GPU timings data |
[in] | timers | Pointer to GPU timers data |
[in] | plist | Pointer to the pair list data |
[in] | atomLocality | Atom locality specifier |
[in] | stepWork | Force schedule flags |
[in] | doTiming | True if timing is enabled. |
void Nbnxm::gpu_clear_outputs | ( | NbnxmGpu gmx_unused * | nb, |
bool gmx_unused | computeVirial | ||
) |
Clears GPU outputs: nonbonded force, shift force and energy.
void Nbnxm::gpu_copy_xq_to_gpu | ( | NbnxmGpu gmx_unused * | nb, |
const struct nbnxn_atomdata_t gmx_unused * | nbdata, | ||
gmx::AtomLocality gmx_unused | aloc | ||
) |
Launch asynchronously the xq buffer host to device copy.
The nonlocal copy is skipped if there is no dependent work to do, neither non-local nonbonded interactions nor bonded GPU work.
[in] | nb | GPU nonbonded data. |
[in] | nbdata | Host-side atom data structure. |
[in] | aloc | Atom locality flag. |
void Nbnxm::gpu_free | ( | NbnxmGpu gmx_unused * | nb | ) |
Frees all GPU resources used for the nonbonded calculations.
void Nbnxm::gpu_free_platform_specific | ( | NbnxmGpu * | nb | ) |
Releases the NBNXM GPU data structures.
This function is documented in the header file.
struct gmx_wallclock_gpu_nbnxn_t* Nbnxm::gpu_get_timings | ( | NbnxmGpu gmx_unused * | nb | ) |
Returns the GPU timings structure or NULL if GPU is not used or timing is off.
NbnxmGpu* Nbnxm::gpu_init | ( | const gmx::DeviceStreamManager gmx_unused & | deviceStreamManager, |
const interaction_const_t gmx_unused * | ic, | ||
const PairlistParams gmx_unused & | listParams, | ||
const nbnxn_atomdata_t gmx_unused * | nbat, | ||
bool gmx_unused | bLocalAndNonlocal | ||
) |
Initializes the data structures related to GPU nonbonded calculations.
void Nbnxm::gpu_init_atomdata | ( | NbnxmGpu gmx_unused * | nb, |
const nbnxn_atomdata_t gmx_unused * | nbat | ||
) |
Initializes atom-data on the GPU, called at every pair search step.
void Nbnxm::gpu_init_pairlist | ( | NbnxmGpu gmx_unused * | nb, |
const struct NbnxnPairlistGpu gmx_unused * | h_nblist, | ||
gmx::InteractionLocality gmx_unused | iloc | ||
) |
Initializes pair-list data for GPU, called at every pair search step.
bool Nbnxm::gpu_is_kernel_ewald_analytical | ( | const NbnxmGpu gmx_unused * | nb | ) |
Returns if analytical Ewald GPU kernels are used.
void Nbnxm::gpu_launch_kernel | ( | NbnxmGpu gmx_unused * | nb, |
const gmx::StepWorkload gmx_unused & | stepWork, | ||
gmx::InteractionLocality gmx_unused | iloc | ||
) |
Launch asynchronously the nonbonded force calculations.
Also launches the initial pruning of a fresh list after search.
The local and non-local interaction calculations are launched in two separate streams. If there is no work (i.e. empty pair list), the force kernel launch is omitted.
void Nbnxm::gpu_launch_kernel | ( | NbnxmGpu * | nb, |
const gmx::StepWorkload & | stepWork, | ||
const Nbnxm::InteractionLocality | iloc | ||
) |
Launch GPU kernel.
As we execute nonbonded workload in separate queues, before launching the kernel we need to make sure that he following operations have completed:
These operations are issued in the local queue at the beginning of the step and therefore always complete before the local kernel launch. The non-local kernel is launched after the local on the same device/context, so this is inherently scheduled after the operations in the local stream (including the above "misc_ops"). However, for the sake of having a future-proof implementation, we use the misc_ops_done event to record the point in time when the above operations are finished and synchronize with this event in the non-local stream.
void Nbnxm::gpu_launch_kernel_pruneonly | ( | NbnxmGpu gmx_unused * | nb, |
gmx::InteractionLocality gmx_unused | iloc, | ||
int gmx_unused | numParts | ||
) |
Launch asynchronously the nonbonded prune-only kernel.
The local and non-local list pruning are launched in their separate streams.
Notes for future scheduling tuning: Currently we schedule the dynamic pruning between two MD steps after both local and nonlocal force D2H transfers completed. We could launch already after the cpyback is launched, but we want to avoid prune kernels (especially in the non-local high prio-stream) competing with nonbonded work.
However, this is not ideal as this schedule does not expose the available concurrency. The dynamic pruning kernel:
In the most general case, the former would require scheduling pruning in a separate stream and adding additional event sync points to ensure that force kernels read consistent pair list data. This would lead to some overhead (due to extra cudaStreamWaitEvent calls, 3-5 us/call) which we might be able to live with. The gains from additional overlap might not be significant as long as update+constraints anyway takes longer than pruning, but there will still be use-cases where more overlap may help (e.g. multiple ranks per GPU, no/hbonds only constraints). The above second point is harder to address given that multiple ranks will often share a GPU. Ranks that complete their nonbondeds sooner can schedule pruning earlier and without a third priority level it is difficult to avoid some interference of prune kernels with force tasks (in particular preemption of low-prio local force task).
[in,out] | nb | GPU nonbonded data. |
[in] | iloc | Interaction locality flag. |
[in] | numParts | Number of parts the pair list is split into in the rolling kernel. |
int Nbnxm::gpu_min_ci_balanced | ( | NbnxmGpu gmx_unused * | nb | ) |
Calculates the minimum size of proximity lists to improve SM load balance with GPU non-bonded kernels.
|
inlinestatic |
Reduce data staged internally in the nbnxn module.
Shift forces and electrostatic/LJ energies copied from the GPU into a module-internal staging area are immediately reduced (CPU-side buffers passed) after having waited for the transfers' completion.
Note that this function should always be called after the transfers into the staging buffers has completed.
[in] | nbst | Nonbonded staging data |
[in] | iLocality | Interaction locality specifier |
[in] | reduceEnergies | True if energy reduction should be done |
[in] | reduceFshift | True if shift force reduction should be done |
[out] | e_lj | Variable to accumulate LJ energy into |
[out] | e_el | Variable to accumulate electrostatic energy into |
[out] | fshift | Pointer to the array of shift forces to accumulate into |
void Nbnxm::gpu_reset_timings | ( | struct nonbonded_verlet_t gmx_unused * | nbv | ) |
Resets nonbonded GPU timings.
bool Nbnxm::gpu_try_finish_task | ( | NbnxmGpu gmx_unused * | nb, |
const gmx::StepWorkload gmx_unused & | stepWork, | ||
gmx::AtomLocality gmx_unused | aloc, | ||
real gmx_unused * | e_lj, | ||
real gmx_unused * | e_el, | ||
gmx::ArrayRef< gmx::RVec > gmx_unused | shiftForces, | ||
GpuTaskCompletion gmx_unused | completionKind | ||
) |
Attempts to complete nonbonded GPU task.
This function attempts to complete the nonbonded task (both GPU and CPU auxiliary work). Success, i.e. that the tasks completed and results are ready to be consumed, is signaled by the return value (always true if blocking wait mode requested).
The completionKind
parameter controls whether the behavior is non-blocking (achieved by passing GpuTaskCompletion::Check) or blocking wait until the results are ready (when GpuTaskCompletion::Wait is passed). As the "Check" mode the function will return immediately if the GPU stream still contain tasks that have not completed, it allows more flexible overlapping of work on the CPU with GPU execution.
Note that it is only safe to use the results, and to continue to the next MD step when this function has returned true which indicates successful completion of
fshift
, e_el
, e_lj
).In GpuTaskCompletion::Check mode this function does the timing and keeps correct count for the nonbonded task (incrementing only once per task), in the GpuTaskCompletion::Wait mode timing is expected to be done in the caller.
TODO: improve the handling of outputs e.g. by ensuring that this function explcitly returns the force buffer (instead of that being passed only to nbnxn_gpu_launch_cpyback()) and by returning the energy and Fshift contributions for some external/centralized reduction.
[in] | nb | The nonbonded data GPU structure |
[in] | stepWork | Step schedule flags |
[in] | aloc | Atom locality identifier |
[out] | e_lj | Pointer to the LJ energy output to accumulate into |
[out] | e_el | Pointer to the electrostatics energy output to accumulate into |
[out] | shiftForces | Shift forces buffer to accumulate into |
[in] | completionKind | Indicates whether nnbonded task completion should only be checked rather than waited for |
aloc
locality have completed bool Nbnxm::gpu_try_finish_task | ( | NbnxmGpu * | nb, |
const gmx::StepWorkload & | stepWork, | ||
const AtomLocality | aloc, | ||
real * | e_lj, | ||
real * | e_el, | ||
gmx::ArrayRef< gmx::RVec > | shiftForces, | ||
GpuTaskCompletion | completionKind | ||
) |
Attempts to complete nonbonded GPU task.
See documentation in nbnxm_gpu.h for details.
void Nbnxm::gpu_upload_shiftvec | ( | NbnxmGpu gmx_unused * | nb, |
const nbnxn_atomdata_t gmx_unused * | nbatom | ||
) |
Uploads shift vector to the GPU if the box is dynamic (otherwise just returns).
float Nbnxm::gpu_wait_finish_task | ( | NbnxmGpu gmx_unused * | nb, |
const gmx::StepWorkload gmx_unused & | stepWork, | ||
gmx::AtomLocality gmx_unused | aloc, | ||
real gmx_unused * | e_lj, | ||
real gmx_unused * | e_el, | ||
gmx::ArrayRef< gmx::RVec > gmx_unused | shiftForces, | ||
gmx_wallcycle gmx_unused * | wcycle | ||
) |
Completes the nonbonded GPU task blocking until GPU tasks and data transfers to finish.
Also does timing accounting and reduction of the internal staging buffers. As this is called at the end of the step, it also resets the pair list and pruning flags.
[in] | nb | The nonbonded data GPU structure |
[in] | stepWork | Step schedule flags |
[in] | aloc | Atom locality identifier |
[out] | e_lj | Pointer to the LJ energy output to accumulate into |
[out] | e_el | Pointer to the electrostatics energy output to accumulate into |
[out] | shiftForces | Shift forces buffer to accumulate into |
[out] | wcycle | Pointer to wallcycle data structure |
float Nbnxm::gpu_wait_finish_task | ( | NbnxmGpu * | nb, |
const gmx::StepWorkload & | stepWork, | ||
AtomLocality | aloc, | ||
real * | e_lj, | ||
real * | e_el, | ||
gmx::ArrayRef< gmx::RVec > | shiftForces, | ||
gmx_wallcycle * | wcycle | ||
) |
Wait for the asynchronously launched nonbonded tasks and data transfers to finish.
Also does timing accounting and reduction of the internal staging buffers. As this is called at the end of the step, it also resets the pair list and pruning flags.
[in] | nb | The nonbonded data GPU structure |
[in] | stepWork | Force schedule flags |
[in] | aloc | Atom locality identifier |
[out] | e_lj | Pointer to the LJ energy output to accumulate into |
[out] | e_el | Pointer to the electrostatics energy output to accumulate into |
[out] | shiftForces | Shift forces buffer to accumulate into |
[out] | wcycle | Pointer to wallcycle data structure |
|
inlinestatic |
Do workgroup-level reduction of a single float
.
While SYCL has sycl::reduce_over_group
, it currently (oneAPI 2021.3.0) uses a very large shared memory buffer, which leads to a reduced occupancy.
sm_buf
.subGroupSize | Size of a sub-group. |
groupSize | Size of a work-group. |
itemIdx | Current thread's sycl::nd_item . |
tidxi | Current thread's linearized local index. |
sm_buf | Accessor for local reduction buffer. |
valueToReduce | Current thread's value. Must have length of at least 1. |
tidxi
0: sum of all valueToReduce
. Other threads: unspecified. bool Nbnxm::haveGpuShortRangeWork | ( | const NbnxmGpu gmx_unused * | nb, |
gmx::InteractionLocality gmx_unused | interactionLocality | ||
) |
Returns true if there is GPU short-range work for the given interaction locality.
Note that as, unlike nonbonded tasks, bonded tasks are not split into local/nonlocal, and therefore if there are GPU offloaded bonded interactions, this function will return true for both local and nonlocal atom range.
[in,out] | nb | Pointer to the nonbonded GPU data structure |
[in] | interactionLocality | Interaction locality identifier |
void Nbnxm::launchNbnxmKernel | ( | NbnxmGpu * | nb, |
const gmx::StepWorkload & | stepWork, | ||
const InteractionLocality | iloc | ||
) |
Launch SYCL NBNXM kernel.
nb | Non-bonded parameters. |
stepWork | Workload flags for the current step. |
iloc | Interaction locality. |
void Nbnxm::launchNbnxmKernelPruneOnly | ( | NbnxmGpu * | nb, |
const InteractionLocality | iloc, | ||
const int | numParts, | ||
const int | part, | ||
const int | numSciInPart | ||
) |
Launch SYCL NBNXM prune-only kernel.
nb | Non-bonded parameters. |
iloc | Interaction locality. |
numParts | Total number of rolling-prune parts. |
part | Number of the part to prune. |
numSciInPart | Number of superclusters in part . |
void Nbnxm::launchNbnxmKernelTransformXToXq | ( | const Grid & | grid, |
NbnxmGpu * | nb, | ||
DeviceBuffer< Float3 > | d_x, | ||
const DeviceStream & | deviceStream, | ||
unsigned int | numColumnsMax, | ||
int | gridId | ||
) |
Launch coordinate layout conversion kernel.
[in] | grid | Pair-search grid. |
[in,out] | nb | Nbnxm main structure. |
[in] | d_x | Source atom coordinates. |
[in] | deviceStream | Device stream for kernel submission. |
[in] | numColumnsMax | Max. number of columns per grid for offset calculation in nb . |
[in] | gridId | Grid index for offset calculation in nb . |
|
inlinestatic |
Linear interpolation using exactly two FMA operations.
Implements numeric equivalent of: (1-t)*d0 + t*d1.
const char * Nbnxm::lookup_kernel_name | ( | Nbnxm::KernelType | kernelType | ) |
Return a string identifying the kernel type.
[in] | kernelType | nonbonded kernel type, takes values from the nbnxn_kernel_type enum |
|
static |
Main kernel for NBNXM.
auto Nbnxm::nbnxmKernelPruneOnly | ( | sycl::handler & | cgh, |
const Float4 *__restrict__ | gm_xq, | ||
const Float3 *__restrict__ | gm_shiftVec, | ||
nbnxn_cj_packed_t *__restrict__ | gm_plistCJPacked, | ||
const nbnxn_sci_t *__restrict__ | gm_plistSci, | ||
unsigned int *__restrict__ | gm_plistIMask, | ||
const float | rlistOuterSq, | ||
const float | rlistInnerSq, | ||
const int | numParts, | ||
const int | part | ||
) |
Prune-only kernel for NBNXM.
|
static |
SYCL kernel for transforming position coordinates from rvec to nbnxm layout.
[out] | gm_xq | Coordinates buffer in nbnxm layout. |
[in] | gm_x | Coordinates buffer. |
[in] | gm_atomIndex | Atom index mapping. |
[in] | gm_numAtoms | Array of number of atoms. |
[in] | gm_cellIndex | Array of cell indices. |
[in] | cellOffset | First cell. |
[in] | numAtomsPerCell | Number of atoms per cell. |
[in] | columnsOffset | Index if the first column in the cell. |
void Nbnxm::nbnxn_gpu_x_to_nbat_x | ( | const Nbnxm::Grid gmx_unused & | grid, |
NbnxmGpu gmx_unused * | gpu_nbv, | ||
DeviceBuffer< gmx::RVec > gmx_unused | d_x, | ||
GpuEventSynchronizer gmx_unused * | xReadyOnDevice, | ||
gmx::AtomLocality gmx_unused | locality, | ||
int gmx_unused | gridId, | ||
int gmx_unused | numColumnsMax, | ||
bool gmx_unused | mustInsertNonLocalDependency | ||
) |
X buffer operations on GPU: performs conversion from rvec to nb format.
[in] | grid | Grid to be converted. |
[in,out] | gpu_nbv | The nonbonded data GPU structure. |
[in] | d_x | Device-side coordinates in plain rvec format. |
[in] | xReadyOnDevice | Event synchronizer indicating that the coordinates are ready in the device memory. |
[in] | locality | Copy coordinates for local or non-local atoms. |
[in] | gridId | Index of the grid being converted. |
[in] | numColumnsMax | Maximum number of columns in the grid. |
[in] | mustInsertNonLocalDependency | Whether synchronization between local and non-local streams should be added. Typically, true if and only if that is the last grid in gridset. |
|
static |
Returns whether CPU SIMD support exists for the given inputrec.
If the return value is FALSE and fplog/cr != NULL, prints a fallback message to fplog/stderr.
void Nbnxm::nbnxnInsertNonlocalGpuDependency | ( | NbnxmGpu gmx_unused * | nb, |
gmx::InteractionLocality gmx_unused | interactionLocality | ||
) |
Sync the nonlocal stream with dependent tasks in the local queue.
As the point where the local stream tasks can be considered complete happens at the same call point where the nonlocal stream should be synced with the the local, this function records the event if called with the local stream as argument and inserts in the GPU stream a wait on the event on the nonlocal.
[in] | nb | The nonbonded data GPU structure |
[in] | interactionLocality | Local or NonLocal sync point |
|
inlinestatic |
Final i-force reduction.
Reduce c_nbnxnGpuNumClusterPerSupercluster i-force components stored in fCiBuf
[] accumulating atomically into a_f
. If calcFShift
is true, further reduce shift forces and atomically accumulate into a_fShift
.
This implementation works only with power of two array sizes.
|
inlinestatic |
Local memory-based i-force reduction.
Note that this reduction is unoptimized and some of the barrier synchronization used could be avoided on >=8-wide architectures.
|
inlinestatic |
Shuffle-based i-force reduction.
We need to reduce c_clSize values spaced c_clSize threads apart (hardware threads are consecutive for tidxi
, have stride c_clSize for tidxj
).
We can have up to three reduction steps done with shuffles:
One step (e.g, Intel iGPU, c_clSize == 4, subGroupSize == 8): handled in a separate specialization. Two steps (e.g., NVIDIA, c_clSize == 8, subGroupSize == 32): after two shuffle reduction steps, we do atomicFetchAdd from each sub-group. Three steps (e.g., AMD CDNA, c_clSize == 8, subGroupSize == 64): similar to the two-step approach, but we have two times less atomicFetchAdd's.
|
inlinestatic |
reduceForceIAndFShiftShuffles
specialization for single-step reduction (e.g., Intel iGPUs).
We have three components to reduce, but only one reduction step, so it is not possible to gather the components in fx
of different threads, like we do with 2 and more reduction steps.
Therefore, first even threads handle X and odd threads handle Y; then, even threads additionally handle Z. This also requires the second fShift buffer register.
After one step of reduction using shuffles is complete, we use atomicAdd to accumulate the results in the global memory. That causes a lot of atomic operations on a single memory location, which is poorly handled by some hardware (e.g., Intel Gen9-11 and Xe LP). This can be remediated by using local memory reduction after shuffles, but that's a TODO.
|
inlinestatic |
Reduce c_clSize j-force components using local memory and atomically accumulate into a_f.
c_clSize consecutive threads hold the force components of a j-atom which we reduced in cl_Size steps using shift and atomically accumulate them into a_f
.
TODO: implement binary reduction flavor for the case where cl_Size is power of two.
|
inlinestatic |
Reduce c_clSize j-force components using shifts and atomically accumulate into a_f.
c_clSize consecutive threads hold the force components of a j-atom which we reduced in log2(cl_Size) steps using shift and atomically accumulate them into a_f
.
|
inlinestatic |
Return a pointer to the prune kernel version to be executed at the current invocation.
[in] | kernel_pruneonly | array of prune kernel objects |
[in] | firstPrunePass | true if the first pruning pass is being executed |
void Nbnxm::setupGpuShortRangeWork | ( | NbnxmGpu gmx_unused * | nb, |
const gmx::ListedForcesGpu gmx_unused * | listedForcesGpu, | ||
gmx::InteractionLocality gmx_unused | iLocality | ||
) |
Set up internal flags that indicate what type of short-range work there is.
As nonbondeds and bondeds share input/output buffers and GPU queues, both are considered when checking for work in the current domain.
This function is expected to be called every time the work-distribution can change (i.e. at search/domain decomposition steps).
[in,out] | nb | Pointer to the nonbonded GPU data structure |
[in] | listedForcesGpu | Pointer to the GPU bonded data structure |
[in] | iLocality | Interaction locality identifier |
|
static |
Sorts particle index a on coordinates x along dim.
Backwards tells if we want decreasing iso increasing coordinates. h0 is the minimum of the coordinate range. invh is the 1/length of the sorting range. n_per_h (>=n) is the expected average number of particles per 1/invh sort is the sorting work array. sort should have a size of at least n_per_h*c_sortGridRatio + n, or easier, allocate at least n*c_sortGridMaxSizeFactor elements.
|
static |
Set non-bonded interaction flags for the current cluster.
Sorts atoms on LJ coefficients: !=0 first, ==0 at the end.
|
inlinestatic |
Returns true if LJ combination rules are used in the non-bonded kernels.
[in] | vdwType | The VdW interaction/implementation type as defined by VdwType enumeration. |
constexpr bool Nbnxm::c_avoidFloatingPointAtomics = (c_clSize == 4) |
Should we avoid FP atomics to the same location from the same work-group?
Intel GPUs without native floating-point operations emulate them via CAS-loop, which is very, very slow when two threads from the same group write to the same global location. We don't specialize the kernels by vendor, so we use c_clSize == 4 as a proxy to detect such devices.
|
static |
Prune kernel's jPacked processing concurrency.
The GMX_NBNXN_PRUNE_KERNEL_JPACKED_CONCURRENCY
macro allows compile-time override.
|
static |
This parameter should be determined heuristically from the kernel execution times.
This value is best for small systems on a single AMD Radeon R9 290X (and about 5% faster than 40, which is the default for CUDA devices). Larger simulation systems were quite insensitive to the value of this parameter.
|
static |
The nbnxn i-cluster size in atoms for each nbnxn kernel type.
|
static |
The nbnxn j-cluster size in atoms for each nbnxn kernel type.