GROMACS 2024.4
The Parallelizable Handling of Output Data (analysisdata) module provides support for common data analysis tasks within the Framework for trajectory analysis and Framework for energy analysis. The basic approach used in the module is visualized below:
Typically, an analysis tool provides its raw data output through one or more gmx::AnalysisData objects (the root data object in the diagram above). This object provides only storage for the data.
To perform operations on the data, one or more data modules can be attached to the data object. Examples of such operations are averaging, histogramming, and plotting the data into a file. Some data modules are provided by the Parallelizable Handling of Output Data (analysisdata) module. To implement new ones, it is necessary to create a class that implements gmx::IAnalysisDataModule.
In many cases, such data modules also provide data that can be processed further, acting as data objects themselves. This makes it possible to attach further data modules to form a processing chain. In simple cases, such a chain ends in a module that writes the data into a file, but it is also possible to access the data in a data object (whether a plain data object or a data module) programmatically to do further computation or post-processing outside the framework. To do this, the data object typically needs to be told in advance, so that it stores the data permanently even if the attached modules do not require it.
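The chaining idea can be sketched with a small self-contained model. This is not the GROMACS API: ToyModule, ToyData, ToyRowSum, ToyAverage, and all their methods are hypothetical names invented for illustration, and the real gmx::IAnalysisDataModule interface has more notification points than the single per-frame callback assumed here.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a data module interface (not the GROMACS one).
class ToyModule
{
public:
    virtual ~ToyModule() = default;
    // Called once for every finished frame with that frame's column values.
    virtual void frameFinished(const std::vector<double>& values) = 0;
};

// Hypothetical data object: forwards frames to modules, stores them only on request.
class ToyData
{
public:
    void addModule(std::shared_ptr<ToyModule> module) { modules_.push_back(std::move(module)); }

    // Storage has to be requested before the first frame arrives.
    void requestStorage()
    {
        assert(frameCount_ == 0);
        store_ = true;
    }

    void addFrame(const std::vector<double>& values)
    {
        ++frameCount_;
        if (store_)
        {
            stored_.push_back(values);
        }
        for (auto& module : modules_)
        {
            module->frameFinished(values);
        }
    }

    const std::vector<std::vector<double>>& storedFrames() const { return stored_; }

private:
    std::vector<std::shared_ptr<ToyModule>> modules_;
    std::vector<std::vector<double>>        stored_;
    std::size_t                             frameCount_ = 0;
    bool                                    store_      = false;
};

// A module that is also a data object: it reduces each frame to its row sum
// and passes the result on to the modules attached to it.
class ToyRowSum : public ToyModule, public ToyData
{
public:
    void frameFinished(const std::vector<double>& values) override
    {
        double sum = 0.0;
        for (double v : values)
        {
            sum += v;
        }
        addFrame({ sum });
    }
};

// Terminal module: keeps a running average of column 0 without storing frames.
class ToyAverage : public ToyModule
{
public:
    void frameFinished(const std::vector<double>& values) override
    {
        assert(!values.empty());
        sum_ += values[0];
        ++count_;
    }
    double average() const { return count_ > 0 ? sum_ / static_cast<double>(count_) : 0.0; }

private:
    double      sum_   = 0.0;
    std::size_t count_ = 0;
};
```

In this sketch, a root ToyData with a ToyRowSum attached, which in turn has a ToyAverage attached, forms a two-step chain. Calling requestStorage() on the intermediate ToyRowSum before any frame is added mirrors the need to tell a data object in advance that its data will be read programmatically later.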
The modules can do their processing online, i.e., as the data is produced. If all the attached modules support this, it is not necessary to store all the raw data in memory. The module design also supports processing frames in parallel: in such cases, the data may become available out of order. For writing the per-frame data into a file in particular, but also for other types of post-processing, the data must be reordered into frame sequence. This is implemented once in the framework, so analysis tools do not need to handle it themselves beyond using the provided API.
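To illustrate why online processing removes the need to keep raw data in memory, here is a minimal sketch of a histogram that is updated frame by frame. ToyOnlineHistogram and its parameters are hypothetical and do not correspond to the histogram modules shipped with GROMACS; the point is only that memory use depends on the number of bins, not on the number of frames.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical online histogram (not a GROMACS class).
// Bins cover [min, min + binWidth * binCount); values outside are ignored.
class ToyOnlineHistogram
{
public:
    ToyOnlineHistogram(double min, double binWidth, std::size_t binCount) :
        min_(min), binWidth_(binWidth), counts_(binCount, std::size_t{ 0 })
    {
        assert(binWidth > 0.0 && binCount > 0);
    }

    // Consumes one frame; the frame's values are not retained afterwards.
    void addFrame(const std::vector<double>& values)
    {
        for (double v : values)
        {
            if (!(v >= min_))
            {
                continue; // below range (or NaN)
            }
            const double position = (v - min_) / binWidth_;
            if (!(position < static_cast<double>(counts_.size())))
            {
                continue; // above range
            }
            ++counts_[static_cast<std::size_t>(position)];
        }
        ++frameCount_;
    }

    std::size_t count(std::size_t bin) const { return counts_.at(bin); }
    std::size_t frameCount() const { return frameCount_; }

private:
    double                   min_;
    double                   binWidth_;
    std::vector<std::size_t> counts_;
    std::size_t              frameCount_ = 0;
};
```

Because incrementing bin counts is order-independent, a module like this would also tolerate frames arriving out of order, which is the property the text describes for parallel processing.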
At the highest level, data can be structured into separate gmx::AbstractAnalysisData objects that operate independently. Each such object has an independent set of post-processing modules.
Within a gmx::AbstractAnalysisData object, data is structured along three "dimensions": frames, data sets, and columns.
Programmatically the data within each frame is organized into point sets. Each point set consists of a continuous range of columns from a single data set. There are two types of data:
The main purpose of multipoint data is to support cases where it is not known in advance how many values there will be for each frame, or where that number is impractically large. The need for this is mainly a tradeoff between performance and implementation complexity: with a more complex internal implementation, it would be possible to support larger data sets without the performance and memory impact they currently impose. The current implementation places the burden of deciding on the appropriate usage pattern on the user code, which allows for a much simpler internal implementation.
An individual value (identified by frame, data set, and column) consists of a single value of type real, an optional error value, and some flags. The flags identify which parts of the value are actually available. Several states are possible, and in some of them the real value has some meaning; different data modules handle these cases differently.
The base class for all data objects (including data modules that provide data) is gmx::AbstractAnalysisData. This class provides facilities for attaching data modules to the data and for querying the data. It does not provide any methods to alter the data; all logic for managing the actual data is in derived classes.
The main root (non-module) data object class for use in analysis tools is gmx::AnalysisData. This class provides methods to set properties of the data, and to add frames to it. The interface is frame-based: you construct one frame at a time, and after it is finished, you move to the next frame. The frames are not constructed directly using gmx::AnalysisData, but instead separate data handles are used. This is explained in more detail below under Parallelization.
For simple needs and small amounts of data, gmx::AnalysisArrayData is also provided. This class allows for all the data to be prepared in memory as a single big array, and allows random access to the data while setting the values. When all the values are set to their final values, it then notifies the attached data modules by looping over the array.
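The fill-then-notify pattern described for array data can be sketched as follows. ToyArrayData, value(), and valuesReady() are hypothetical names chosen for this illustration and are not the interface of gmx::AnalysisArrayData; modules are reduced to plain callbacks to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical array-backed data object (not gmx::AnalysisArrayData).
class ToyArrayData
{
public:
    // A "module" here is just a callback that receives one row (frame) at a time.
    using RowCallback = std::function<void(std::size_t row, const std::vector<double>& values)>;

    ToyArrayData(std::size_t rows, std::size_t columns) :
        values_(rows, std::vector<double>(columns, 0.0))
    {
    }

    void addModule(RowCallback module) { modules_.push_back(std::move(module)); }

    // Random access while the array is being filled; not allowed once finished.
    double& value(std::size_t row, std::size_t column)
    {
        assert(!finished_);
        return values_.at(row).at(column);
    }

    // Declares all values final and replays the array to the modules row by row.
    void valuesReady()
    {
        assert(!finished_);
        finished_ = true;
        for (std::size_t row = 0; row < values_.size(); ++row)
        {
            for (auto& module : modules_)
            {
                module(row, values_[row]);
            }
        }
    }

private:
    std::vector<std::vector<double>> values_;
    std::vector<RowCallback>         modules_;
    bool                             finished_ = false;
};
```

The values can be written in any order, and the attached callbacks still see the rows in sequence, because notification happens only once, after everything is in place. The cost is that the whole array lives in memory, which is why the text recommends this approach only for small amounts of data.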
One major driver for the design of the analysis data module has been to provide support for transparently processing multiple frames in parallel. In such cases, output data for multiple frames may be constructed simultaneously, and must be ordered correctly for some data modules, such as writing it into a file. This ordering is taken care of by the framework, allowing the analysis tool writer to concentrate on the actual analysis task.
From a user's point of view, the main player in this respect is the gmx::AnalysisData object. If there are two threads doing the processing in parallel, it allows creating a separate gmx::AnalysisDataHandle for each thread. Each of these handles can be used independently to construct frames into the output data, and the gmx::AnalysisData object internally takes care of notifying the modules correctly. If necessary, it stores finished frames in a temporary buffer until all preceding frames have also been finished.
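The buffering of finished frames can be pictured with a small reordering sketch. ToyFrameSequencer is a hypothetical class, not part of GROMACS, and it is deliberately single-threaded: a real implementation that accepts frames from several handles concurrently would also need synchronization, which is omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <utility>
#include <vector>

// Hypothetical reordering buffer (not a GROMACS class, not thread-safe).
class ToyFrameSequencer
{
public:
    using Frame = std::vector<double>;
    // Receives frames strictly in index order 0, 1, 2, ...
    using Sink = std::function<void(std::size_t index, const Frame& frame)>;

    explicit ToyFrameSequencer(Sink sink) : sink_(std::move(sink)) {}

    // May be called with frame indices in any order; each index only once.
    void finishFrame(std::size_t index, Frame frame)
    {
        assert(index >= next_ && pending_.count(index) == 0);
        pending_.emplace(index, std::move(frame));
        // Deliver every buffered frame whose predecessors have all been delivered.
        for (auto it = pending_.find(next_); it != pending_.end(); it = pending_.find(next_))
        {
            sink_(it->first, it->second);
            pending_.erase(it);
            ++next_;
        }
    }

    // Number of finished frames still waiting for an earlier frame.
    std::size_t bufferedCount() const { return pending_.size(); }

private:
    Sink                         sink_;
    std::map<std::size_t, Frame> pending_;
    std::size_t                  next_ = 0;
};
```

If frames 1 and 2 finish before frame 0, both stay in the buffer; as soon as frame 0 is finished, all three are passed to the sink in order and the buffer is empty again.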
For increased efficiency, some data modules are also parallelization-aware: they have the ability to process the data in any order, allowing gmx::AnalysisData to notify them as soon as a frame becomes available. If there are only parallel data modules attached, no frame reordering or temporary buffers are needed. If a non-parallel data module is attached to a parallel data module, then that parallel data module takes the responsibility of ordering its output frames. Ideally, such data modules produce significantly less data than what they take in, making it cheaper to do the ordering only at this point.
Currently, no parallel runner has been implemented, but applicable tools written to use the framework will likely require minimal or no changes to take advantage of frame-level parallelism once such a runner materializes.
Data modules provided by the Parallelizable Handling of Output Data (analysisdata) module are listed below with a short description. See the documentation of the individual classes for more details. Note that this list is manually maintained, so it may not always be up-to-date. A comprehensive list can be found by looking at the inheritance graph of gmx::IAnalysisDataModule, but the list here is more user-friendly.