Gromacs
2025.0-dev-20241011-013a99c
|
#include "gmxpre.h"
#include "config.h"
#include <cstdlib>
#include "gromacs/fft/parallel_3dfft.h"
#include "gromacs/gpu_utils/gmxsycl.h"
#include "gromacs/gpu_utils/syclutils.h"
#include "gromacs/math/vec.h"
#include "gromacs/timing/wallcycle.h"
#include "pme_gpu_grid.h"
#include "pme_gpu_types.h"
#include "pme_gpu_types_host.h"
#include "pme_gpu_types_host_impl.h"
Implements PME GPU halo exchange and PME GPU - Host FFT grid conversion functions. These functions are used for PME decomposition in mixed-mode.
Classes | |
class | PackHaloExternal |
Handles a kernel which packs non-contiguous overlap data in all 8 neighboring directions. More... | |
class | UnpackHaloExternal |
Handles a kernel which gathers data from halo region in all 8 neighboring directions. More... | |
class | UnpackAndAddHaloInternal |
Handles a kernel which adds grid overlap data received from neighboring ranks. More... | |
class | PackHaloInternal |
Handles a kernel which packs non-contiguous overlap data in all 8 neighboring directions. More... | |
class | GridConverter< pmeToFft > |
Builds a kernel to convert between PME and FFT grids. More... | |
Typedefs | |
using | mode = sycl::access_mode |
Functions | |
template<typename Kernel , int subGroupSize, class... Args> | |
static void | submit (const DeviceStream &deviceStream, size_t myGridX, size_t myGridY, sycl::uint3 pmeSize, Args &&...args) |
Submits a GPU grid kernel. More... | |
void | pmeGpuGridHaloExchange (const PmeGpu *pmeGpu, gmx_wallcycle *wcycle) |
Grid Halo exchange after PME spread ToDo: Current implementation transfers halo region from/to only immediate neighbours And, expects that overlapSize <= local grid width. Implement exchange with multiple neighbors to remove this limitation ToDo: Current implementation synchronizes pmeStream to make sure data is ready on GPU after spread. Consider using events for this synchnozation. More... | |
void | pmeGpuGridHaloExchangeReverse (const PmeGpu *pmeGpu, gmx_wallcycle *wcycle) |
Grid reverse Halo exchange before PME gather ToDo: Current implementation transfers halo region from/to only immediate neighbours And, expects that overlapSize <= local grid width. Implement exchange with multiple neighbors to remove this limitation ToDo: Current implementation synchronizes pmeStream to make sure data is ready on GPU after FFT to PME grid conversion. Consider using events for this synchnozation. More... | |
template<bool pmeToFft> | |
void | convertPmeGridToFftGrid (const PmeGpu *pmeGpu, float *h_fftRealGrid, gmx_parallel_3dfft *fftSetup, const int gridIndex) |
Copy PME Grid with overlap region to host FFT grid and vice-versa. Used in mixed mode PME decomposition. More... | |
template<bool pmeToFft> | |
void | convertPmeGridToFftGrid (const PmeGpu *pmeGpu, DeviceBuffer< float > *d_fftRealGrid, const int gridIndex) |
Copy PME Grid with overlap region to device FFT grid and vice-versa. Used in full GPU PME decomposition. More... | |
template void | convertPmeGridToFftGrid< true > (const PmeGpu *pmeGpu, float *h_fftRealGrid, gmx_parallel_3dfft *fftSetup, const int gridIndex) |
template void | convertPmeGridToFftGrid< false > (const PmeGpu *pmeGpu, float *h_fftRealGrid, gmx_parallel_3dfft *fftSetup, const int gridIndex) |
template void | convertPmeGridToFftGrid< true > (const PmeGpu *pmeGpu, DeviceBuffer< float > *d_fftRealGrid, const int gridIndex) |
template void | convertPmeGridToFftGrid< false > (const PmeGpu *pmeGpu, DeviceBuffer< float > *d_fftRealGrid, const int gridIndex) |
Variables | |
static constexpr int | sc_subGroupSize = 32 |
Sub-group size for conversion kernels. More... | |
void convertPmeGridToFftGrid | ( | const PmeGpu * | pmeGpu, |
float * | h_fftRealGrid, | ||
gmx_parallel_3dfft * | fftSetup, | ||
int | gridIndex | ||
) |
Copy PME Grid with overlap region to host FFT grid and vice-versa. Used in mixed mode PME decomposition.
[in] | pmeGpu | The PME GPU structure. |
[in] | h_fftRealGrid | FFT grid on host |
[in] | fftSetup | Host FFT setup structure |
[in] | gridIndex | Grid index which is to be converted |
pmeToFft | A boolean which tells if this is conversion from PME grid to FFT grid or reverse |
void convertPmeGridToFftGrid | ( | const PmeGpu * | pmeGpu, |
DeviceBuffer< float > * | d_fftRealGrid, | ||
int | gridIndex | ||
) |
Copy PME Grid with overlap region to device FFT grid and vice-versa. Used in full GPU PME decomposition.
[in] | pmeGpu | The PME GPU structure. |
[in] | d_fftRealGrid | FFT grid on device |
[in] | gridIndex | Grid index which is to be converted |
pmeToFft | A boolean which tells if this is conversion from PME grid to FFT grid or reverse |
void pmeGpuGridHaloExchange | ( | const PmeGpu * | pmeGpu, |
gmx_wallcycle * | wcycle | ||
) |
Grid Halo exchange after PME spread ToDo: Current implementation transfers halo region from/to only immediate neighbours And, expects that overlapSize <= local grid width. Implement exchange with multiple neighbors to remove this limitation ToDo: Current implementation synchronizes pmeStream to make sure data is ready on GPU after spread. Consider using events for this synchnozation.
[in] | pmeGpu | The PME GPU structure. |
[in] | wcycle | The wallclock counter. |
void pmeGpuGridHaloExchangeReverse | ( | const PmeGpu * | pmeGpu, |
gmx_wallcycle * | wcycle | ||
) |
Grid reverse Halo exchange before PME gather ToDo: Current implementation transfers halo region from/to only immediate neighbours And, expects that overlapSize <= local grid width. Implement exchange with multiple neighbors to remove this limitation ToDo: Current implementation synchronizes pmeStream to make sure data is ready on GPU after FFT to PME grid conversion. Consider using events for this synchnozation.
[in] | pmeGpu | The PME GPU structure. |
[in] | wcycle | The wallclock counter. |
|
static |
Submits a GPU grid kernel.
Kernel | The class containing a static kernel() method to return the kernel to execute |
subGroupSize | Size of the sub-group. |
[in] | deviceStream | The device stream upon which to submit |
[in] | myGridX,myGridY | Local domain size in X and Y dimension |
[in] | pmeSize | Local PME grid size |
[in] | args | Parameter pack to pass to the kernel |
|
static |
Sub-group size for conversion kernels.
Chosen to match relevant hardware widths on supported hardware. In particular, PonteVecchio is 32-wide.