GROMACS 2018.1
#include "gromacs/gpu_utils/gpu_macros.h"
#include "gromacs/math/vectypes.h"
#include "gromacs/mdlib/nbnxn_gpu_types.h"
#include "gromacs/utility/basedefinitions.h"
#include "gromacs/utility/real.h"
Declares the interface for GPU execution of the NBNXN module.
Functions

void nbnxn_gpu_launch_kernel (gmx_nbnxn_gpu_t *nb, const struct nbnxn_atomdata_t *nbdata, int flags, int iloc)
    Launch asynchronously the nonbonded force calculations. More...

void nbnxn_gpu_launch_kernel_pruneonly (gmx_nbnxn_gpu_t *nb, int iloc, int numParts)
    Launch asynchronously the nonbonded prune-only kernel. More...

void nbnxn_gpu_launch_cpyback (gmx_nbnxn_gpu_t *nb, const struct nbnxn_atomdata_t *nbatom, int flags, int aloc)
    Launch asynchronously the download of nonbonded forces from the GPU (and of energies/shift forces if required).

bool nbnxn_gpu_try_finish_task (gmx_nbnxn_gpu_t *nb, int flags, int aloc, real *e_lj, real *e_el, rvec *fshift, GpuTaskCompletion completionKind)
    Attempts to complete the nonbonded GPU task. More...

void nbnxn_gpu_wait_finish_task (gmx_nbnxn_gpu_t *nb, int flags, int aloc, real *e_lj, real *e_el, rvec *fshift)
    Completes the nonbonded GPU task, blocking until the GPU tasks and data transfers have finished. More...

int nbnxn_gpu_pick_ewald_kernel_type (bool bTwinCut)
    Selects the Ewald kernel type: analytical or tabulated, single or twin cut-off.
void nbnxn_gpu_launch_kernel (gmx_nbnxn_ocl_t *nb, const struct nbnxn_atomdata_t *nbatom, int flags, int iloc)
Launch asynchronously the nonbonded force calculations.

This consists of several asynchronous steps: the required input transfers are issued, followed by the kernel launch itself.

As we execute the nonbonded workload in separate queues, before launching the kernel we need to make sure that the operations issued earlier in the step have completed. These operations are issued in the local queue at the beginning of the step and therefore always complete before the local kernel launch. The non-local kernel is launched after the local one on the same device/context, so it is inherently scheduled after the operations in the local stream (the "misc_ops" mentioned above). However, for the sake of a future-proof implementation, we use the misc_ops_done event to record the point in time when these operations finish, and synchronize with this event in the non-local stream.
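The cross-stream dependency described above can be pictured with a minimal sketch using the CUDA runtime API. This is not GROMACS code: the function name and stream handles are illustrative, and the real streams and event are managed inside gmx_nbnxn_gpu_t.

// Minimal sketch of the event-based cross-stream dependency described above,
// using the CUDA runtime API. Stream handles and the elided work are
// illustrative placeholders.
#include <cuda_runtime.h>

void launchWithMiscOpsSync(cudaStream_t localStream, cudaStream_t nonLocalStream)
{
    cudaEvent_t miscOpsDone;
    cudaEventCreateWithFlags(&miscOpsDone, cudaEventDisableTiming);

    // ... issue the per-step preparatory operations into localStream ...

    // Record the point in time when the preparatory work in the local stream is done.
    cudaEventRecord(miscOpsDone, localStream);

    // ... launch the local nonbonded kernel into localStream ...

    // The non-local stream waits for the preparatory work before its kernel runs.
    cudaStreamWaitEvent(nonLocalStream, miscOpsDone, 0);

    // ... launch the non-local nonbonded kernel into nonLocalStream ...

    cudaEventDestroy(miscOpsDone);
}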
void nbnxn_gpu_launch_kernel_pruneonly (gmx_nbnxn_gpu_t *nb, int iloc, int numParts)
Launch asynchronously the nonbonded prune-only kernel.
The local and non-local list pruning are launched in their separate streams.
Notes for future scheduling tuning: currently we schedule the dynamic pruning between two MD steps, after both the local and non-local force D2H transfers have completed. We could launch it already after the cpyback is launched, but we want to avoid prune kernels (especially in the high-priority non-local stream) competing with nonbonded work (a sketch of this schedule follows the parameter list below).

However, this is not ideal as this schedule does not expose the available concurrency. The dynamic pruning kernel:

- could overlap with work on the CPU (e.g. update and constraints) if it were scheduled in a separate stream;
- could be scheduled earlier by ranks that complete their nonbonded work sooner.

In the most general case, the former would require scheduling pruning in a separate stream and adding additional event sync points to ensure that force kernels read consistent pair-list data. This would lead to some overhead (due to extra cudaStreamWaitEvent calls, 3-5 us/call) which we might be able to live with. The gains from additional overlap might not be significant as long as update+constraints takes longer than pruning anyway, but there will still be use-cases where more overlap may help (e.g. multiple ranks per GPU, no constraints or H-bond-only constraints). The second point above is harder to address given that multiple ranks will often share a GPU. Ranks that complete their nonbondeds sooner can schedule pruning earlier, and without a third priority level it is difficult to avoid some interference of prune kernels with force tasks (in particular preemption of the low-priority local force task).
Parameters
    [in,out]  nb        GPU nonbonded data.
    [in]      iloc      Interaction locality flag.
    [in]      numParts  Number of parts the pair list is split into in the rolling kernel.
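As a rough illustration of the schedule described in the notes above, the rolling prune kernels for both localities can be launched between two MD steps. This is a hedged sketch, not GROMACS code: the include path is assumed, the integer locality values are placeholders for the module's real locality constants, and the surrounding MD-loop logic (including waiting for the force D2H transfers) is omitted.

#include "gromacs/mdlib/nbnxn_gpu.h" // header documented on this page (path assumed)

// Launch the prune-only kernel for both localities; the pair list is split
// into numRollingParts parts for the rolling kernel. The locality values are
// placeholders, not the module's actual constants.
void launchRollingPruning(gmx_nbnxn_gpu_t *nb, int numRollingParts)
{
    const int localInteractions    = 0; // placeholder: local interaction locality
    const int nonLocalInteractions = 1; // placeholder: non-local interaction locality

    // Each locality's pruning is launched into its own stream (see above).
    nbnxn_gpu_launch_kernel_pruneonly(nb, localInteractions, numRollingParts);
    nbnxn_gpu_launch_kernel_pruneonly(nb, nonLocalInteractions, numRollingParts);
}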
bool nbnxn_gpu_try_finish_task (gmx_nbnxn_gpu_t *nb, int flags, int aloc, real *e_lj, real *e_el, rvec *fshift, GpuTaskCompletion completionKind)
Attempts to complete the nonbonded GPU task.

This function attempts to complete the nonbonded task (both GPU and CPU auxiliary work). Success, i.e. that the tasks have completed and the results are ready to be consumed, is signaled by the return value (which is always true if the blocking wait mode is requested).
The completionKind parameter controls whether the behavior is non-blocking (achieved by passing GpuTaskCompletion::Check) or a blocking wait until the results are ready (when GpuTaskCompletion::Wait is passed). Since in the "Check" mode the function returns immediately if the GPU stream still contains tasks that have not completed, it allows more flexible overlapping of CPU work with GPU execution, as shown in the sketch further below.

Note that it is only safe to use the results, and to continue to the next MD step, when this function has returned true, which indicates successful completion of both the GPU work and the auxiliary CPU work, including the reduction of the results into the output buffers (fshift, e_el, e_lj).

TODO: improve the handling of outputs, e.g. by ensuring that this function explicitly returns the force buffer (instead of that being passed only to nbnxn_gpu_launch_cpyback()) and by returning the energy and Fshift contributions for some external/centralized reduction.
Parameters
    [in]   nb              The nonbonded data GPU structure.
    [in]   flags           Force flags.
    [in]   aloc            Atom locality identifier.
    [out]  e_lj            Pointer to the LJ energy output to accumulate into.
    [out]  e_el            Pointer to the electrostatics energy output to accumulate into.
    [out]  fshift          Pointer to the shift force buffer to accumulate into.
    [in]   completionKind  Indicates whether the nonbonded task completion should only be checked rather than waited for.
Returns
    True if the nonbonded tasks of the aloc locality have completed.
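The non-blocking pattern enabled by GpuTaskCompletion::Check can be sketched as a simple polling loop that overlaps CPU work with the GPU task. This is not GROMACS code: the include path is assumed and doOtherCpuWork() is a hypothetical placeholder for useful CPU-side work.

#include "gromacs/mdlib/nbnxn_gpu.h" // header documented on this page (path assumed)

void doOtherCpuWork(); // hypothetical placeholder for useful CPU-side work

void overlapCpuWorkWithGpu(gmx_nbnxn_gpu_t *nb, int flags, int aloc,
                           real *e_lj, real *e_el, rvec *fshift)
{
    // Poll for completion; while the GPU task has not finished, keep the CPU busy.
    while (!nbnxn_gpu_try_finish_task(nb, flags, aloc, e_lj, e_el, fshift,
                                      GpuTaskCompletion::Check))
    {
        doOtherCpuWork();
    }
    // The task has completed: e_lj, e_el and fshift hold the accumulated results.
}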
void nbnxn_gpu_wait_finish_task (gmx_nbnxn_gpu_t *nb, int flags, int aloc, real *e_lj, real *e_el, rvec *fshift)
Completes the nonbonded GPU task, blocking until the GPU tasks and data transfers have finished.

It also does the timing accounting and the reduction of the internal staging buffers. As it is called at the end of the step, it also resets the pair list and pruning flags (see the sketch after the parameter list).
Parameters
    [in]   nb      The nonbonded data GPU structure.
    [in]   flags   Force flags.
    [in]   aloc    Atom locality identifier.
    [out]  e_lj    Pointer to the LJ energy output to accumulate into.
    [out]  e_el    Pointer to the electrostatics energy output to accumulate into.
    [out]  fshift  Pointer to the shift force buffer to accumulate into.
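For comparison with the non-blocking variant shown earlier, here is a minimal sketch (not GROMACS code) of the blocking end-of-step completion; the include path is assumed and the locality value is a placeholder.

#include "gromacs/mdlib/nbnxn_gpu.h" // header documented on this page (path assumed)

void finishNonbondedStep(gmx_nbnxn_gpu_t *nb, int flags, rvec *fshift)
{
    real e_lj = 0; // energies are accumulated into these caller-owned buffers
    real e_el = 0;
    const int localAtoms = 0; // placeholder: local atom-locality identifier

    // Blocks until the GPU tasks and transfers of this locality have finished,
    // reduces the internal staging buffers and resets pair-list/pruning flags.
    nbnxn_gpu_wait_finish_task(nb, flags, localAtoms, &e_lj, &e_el, fshift);
}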