One of the critical steps of any HPC program is reading input data from the file system and storing intermediate or final results to it. Knowing how and when to read or write a file is essential because these are among the most time-consuming operations. It is vital not only to perform IO (input/output) operations efficiently, but also to decide carefully how the data is accessed and how it is laid out. Another critical decision is how many processors will execute these IO operations and how they will communicate. These tests have been executed on gpfs/projects on Marenostrum 4, using a pure MPI communication scheme and HDF5 1.10.1.
With this in mind, we have performed several tests using the h5bench benchmark to measure the performance impact of different memory access patterns and file storage patterns on a set of data structures.
Configuration
h5bench lets us define different configurations for each execution. In our case, the only options we have to define properly are the MPI settings, the data structure's parameters, and the access patterns.
For each test, we have a base configuration in which we have to specify, among other options:
- How the data lives in memory (MEM_PATTERN)
- How we will store it in the file (FILE_PATTERN)
- The number of iterations we want to perform (TIMESTEPS)
- The number of dimensions (NUM_DIMS)
- How many particles we have in each dimension (DIM_1, DIM_2, DIM_3)
"benchmark": "write",
"file": "test.h5",
"configuration": {
"MEM_PATTERN": "CONTIG",
"FILE_PATTERN": "CONTIG",
"TIMESTEPS": "5",
"DELAYED_CLOSE_TIMESTEPS": "2",
"COLLECTIVE_DATA": "YES",
"COLLECTIVE_METADATA": "YES",
"EMULATED_COMPUTE_TIME_PER_TIMESTEP": "100 ms",
"NUM_DIMS": "1",
"DIM_1": "4194304",
"DIM_2": "1",
"DIM_3": "1",
"CSV_FILE": "output.csv",
"MODE": "SYNC"
}
In this example, we have a one-dimensional array of 4194304 particles stored contiguously in memory, and we will also write them contiguously into the file system. Between iterations, we pause for 100 ms to emulate some computation, but this can be as long or as short as we want.
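The COLLECTIVE_DATA and COLLECTIVE_METADATA options correspond to standard parallel HDF5 property-list settings. The following is a minimal sketch, not code taken from h5bench (the file name, dataset name and values are placeholders), of a contiguous-contiguous collective write in which each MPI rank writes its own block of a shared 1D dataset:

#include <hdf5.h>
#include <mpi.h>
#include <stdlib.h>

/* Rough sketch of a CONTIG/CONTIG collective write; each rank writes its
   own contiguous block of a shared 1D dataset. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const hsize_t per_rank = 4194304;             /* particles per rank, as in the config */
    hsize_t total  = per_rank * (hsize_t)nranks;
    hsize_t offset = per_rank * (hsize_t)rank;

    /* File access property list: MPI-IO driver plus collective metadata ops
       (the COLLECTIVE_METADATA option). */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    H5Pset_all_coll_metadata_ops(fapl, 1);
    H5Pset_coll_metadata_write(fapl, 1);

    hid_t file = H5Fcreate("mn4_demo.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    hid_t filespace = H5Screate_simple(1, &total, NULL);
    hid_t dset = H5Dcreate2(file, "x", H5T_NATIVE_FLOAT, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Select this rank's contiguous block in the file. */
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, &per_rank, NULL);
    hid_t memspace = H5Screate_simple(1, &per_rank, NULL);

    /* Dataset transfer property list: collective data writes (COLLECTIVE_DATA). */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    float *buf = malloc(per_rank * sizeof(float));
    for (hsize_t i = 0; i < per_rank; i++) buf[i] = (float)(offset + i);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, dxpl, buf);

    free(buf);
    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}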
Access modes
We can access or store our data in the following ways:
- Contiguous: The particles are stored one after another, just like a classic array.
- Strided: There is a fixed padding between consecutive particles. It is still an array, but we only access some of its elements, in an ordered manner.
- Interleaved: Represents an array of structs, each holding one particle plus some padding.
It is possible to combine different access patterns for the data in memory and on disk, so we can, for instance, access memory in an interleaved way and write to disk contiguously (e.g. storing only the ID field of thousands of account records, each with multiple data fields, into a file), as sketched below. We will run all the combinations of these access modes and dimensions. Each test executes with one to forty-eight tasks, one core per task, all within the same node.
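As a rough illustration of the interleaved-memory/contiguous-file combination, the following minimal serial sketch (again not code taken from h5bench; the struct, file name and dataset name are invented for the example) uses an HDF5 compound type to gather only the id field out of an array of structs and write it as a contiguous dataset:

#include <hdf5.h>
#include <stdlib.h>

/* Interleaved layout in memory: each particle carries an id plus a payload. */
typedef struct {
    long long id;
    double    payload[3];
} particle_t;

int main(void) {
    const hsize_t n = 4096;
    particle_t *buf = malloc(n * sizeof(particle_t));
    for (hsize_t i = 0; i < n; i++) buf[i].id = (long long)i;

    /* Memory type: a compound spanning the whole struct, exposing only "id". */
    hid_t mtype = H5Tcreate(H5T_COMPOUND, sizeof(particle_t));
    H5Tinsert(mtype, "id", HOFFSET(particle_t, id), H5T_NATIVE_LLONG);

    /* File type: a packed compound holding just "id" (8 bytes per element),
       so the ids end up contiguous on disk. */
    hid_t ftype = H5Tcreate(H5T_COMPOUND, 8);
    H5Tinsert(ftype, "id", 0, H5T_STD_I64LE);

    hid_t file  = H5Fcreate("ids.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, &n, NULL);
    hid_t dset  = H5Dcreate2(file, "id", ftype, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* HDF5 gathers the "id" member from the interleaved buffer on the fly. */
    H5Dwrite(dset, mtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

    H5Dclose(dset); H5Sclose(space); H5Fclose(file);
    H5Tclose(ftype); H5Tclose(mtype);
    free(buf);
    return 0;
}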
For a more in-depth explanation of h5bench, please check the official documentation.
Benchmarks
Given all of this, we will conduct batches of tests according to the dimensionality of the data structures. For each one, we will run a write-read test and a pure write test; any change to the base configuration will be specified. Theoretically, Marenostrum 4 can provide up to 100 Gbit/s, i.e. 12.5 GB/s, so we will strive to get close to that throughput. We also have to keep in mind that we share the file system with the other users of the machine, so if they are also reading or writing, the bandwidth is shared and our results worsen. We tried to execute the tests when demand on the machine was low.
1D data structures
First, we must identify a base configuration for our tests that yields good, scalable, accurate and verifiable results. For this, we ran the same test while only varying the number of particles per processor, from 1,048,576 to 67,108,864, to see how it scales.
The outputs of the benchmarks are in GB/s. The blue line is the raw bandwidth of the IO operation, without considering MPI communications or the HDF5 library; in the red line, these have been subtracted to obtain the actual bandwidth. The figures show the average of ten executions of the same test.
The scales of these plots are not consistent with one another, so take care when comparing results. Higher is better.
- 1M
- 2M
- 4M
- 8M
- 16M
- 24M
- 32M
- 64M
Based on these outputs, we can conclude that using fewer than four million particles per processor results in poor performance, likely due to communication overheads and page faults, since the data of several processors probably collides on the same page. These 4194304 particles will be our standard configuration for the 1-dimensional tests. We can also see the impact of the MPI and HDF5 overheads in the 1M and 2M tests.
We can see a big difference between the raw bandwidth and the observed/real bandwidth, and in some cases we even surpass the 12.5 GB/s limit. These outliers are explained by the fact that the blue line (raw output) does not account for the time spent in HDF5 operations. For example, in one execution using 14 cores, the output was the following:
=================== Performance Results ==================
Total number of ranks: 14
Total emulated compute time: 0.400 s
Total write size: 8.750 GB
Raw write time: 0.642 s
Metadata time: 0.002 s
H5Fcreate() time: 0.025 s
H5Fflush() time: 0.311 s
H5Fclose() time: 0.001 s
Observed completion time: 1.384 s
SYNC Raw write rate: 13.634 GB/s
SYNC Observed write rate: 8.891 GB/s
===========================================================
The Raw write rate is the Total write size divided by the Raw write time. The Observed write rate is the Total write size divided by the observed completion time minus the emulated compute time, which additionally accounts for the metadata time and the H5F* functions. As we can see, H5Fflush() alone takes nearly half as long as the raw write in this example.
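Plugging in the numbers above: 8.750 GB / 0.642 s ≈ 13.6 GB/s for the raw rate, and 8.750 GB / (1.384 s - 0.400 s of emulated compute) ≈ 8.9 GB/s for the observed rate.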
1D tests
To see this in more detail, we can check the following figure, which shows the GB per second that each processor contributes to the bandwidth. Each line is the bandwidth divided by the number of processors in each test (higher is better). Whenever a line is flat, each additional core contributes roughly the same extra bandwidth (e.g. 47 cores at 0.2 GB/s per core = 9.4 GB/s; 48 cores at 0.2 GB/s per core = 9.6 GB/s).
In the case of strided access, we obtained some unexpected results, with reported rates of up to 13 TB/s. Some additional subtests were run to investigate this behaviour, but they could not determine why this test randomly reports TB/s or KB/s when parameters change, so we decided to leave it out of this report given these random-looking results. This is not a significant loss because the strided write pattern can only be used on 1D data sets.
After this hiccup, we can finally draw some conclusions.
First of all, we need to avoid the contiguous-interleaved and interleaved-contiguous schemes. Both perform poorly, especially the former, which does not reach 1 GB/s no matter how many processors we use; both fall far behind the other schemes.
As for the other patterns, all of them plateau at some point, so increasing the number of IO cores far beyond that point will not improve performance significantly. We have to find a balance between the size of our data and the number of cores devoted to IO. It is also immediately apparent that matching both access patterns outperforms the mixed schemes, so choosing the memory pattern effectively sets the file system pattern and vice versa.
1D write-read tests
Now that we have seen the write performance of our file system, we will check the write-read case. When reading, we only access the file system, so we only have to specify the file pattern.
h5bench only allows us to perform write-read tests when the configuration of the write tests is contiguous-contiguous, so now we can reduce our executions to only three: A contiguous full-file read, a contiguous partial read and a strided read.
In the following figures, we show the write bandwidth in blue and the read bandwidth in red. We have also included a comparison of how many GB/s each core contributes when reading.
In both contiguous-full and contiguous-partial, we see a similar bandwidth distribution since, technically, it is the same test, except that we read less data in the latter. The full read outperforms the partial one every time, presumably because h5bench introduces additional logic to read only some chunks. Implemented carefully, a partial read can reach throughputs similar to the full read, as sketched below.
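As an illustration of what a contiguous partial read looks like in plain HDF5 (a minimal serial sketch, not h5bench's implementation; the file name "test.h5" and dataset name "x" are placeholders), only a sub-range of the 1D dataset is selected with a hyperslab before reading:

#include <hdf5.h>
#include <stdlib.h>

/* Contiguous partial read: select a sub-range of a 1D dataset in the file
   dataspace and read only that range into memory. */
int main(void) {
    hid_t file   = H5Fopen("test.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset   = H5Dopen2(file, "x", H5P_DEFAULT);
    hid_t fspace = H5Dget_space(dset);

    hsize_t offset = 1048576, count = 1048576;   /* read elements [1M, 2M) */
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &offset, NULL, &count, NULL);

    hid_t mspace = H5Screate_simple(1, &count, NULL);
    float *buf = malloc(count * sizeof(float));
    H5Dread(dset, H5T_NATIVE_FLOAT, mspace, fspace, H5P_DEFAULT, buf);

    free(buf);
    H5Sclose(mspace); H5Sclose(fspace); H5Dclose(dset); H5Fclose(file);
    return 0;
}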
2D data structures
Let us proceed with the 2-dimensional tests. To limit the space of possibilities, we will only benchmark equally sized (square) matrices, but we will also sample whether an irregular matrix has any effect. As before, we will test different sizes to see how size affects performance and to find the scalability threshold.
- 1024-1024
- 2048-2048
- 4096-4096
- 8192-8192
In this case, we have tested matrices from 1024x1024 to 8192x8192 particles. As before, we get acceptable results with more than four million particles (2048x2048), but we still see some artefacts in the raw bandwidth. When using non-square matrices, the results match those of the square ones, so we can conclude that this does not affect performance.
With this in mind, we have chosen 4096 particles in each dimension, the first size that gives clean results, and ran all four tests with it.
2D tests
Similar to the 1D tests, we have to avoid contiguous-interleaved and interleaved-contiguous, which, just as before, perform poorly. These results should not surprise us, given that a 2D matrix can be seen as one giant 1D vector divided into regions, each being one row of the matrix. We can check the GB/s-per-core plot to compare the four schemes.
2D write-read tests
Now that we know how our file system behaves when writing 2D data structures, we will check the read throughputs. In this case, h5bench only allows us to do write-read tests using contiguous data, storing it contiguously and reading it entirely. As before, we show the write bandwidth in blue and the read bandwidth in red.
As we can see, we stagnate at around 7 GB/s of read bandwidth, the same as with the 1D data sets, so we can start to form the idea that dimensionality does not affect read or write performance as long as the access is efficient.
3D data structures
When changing from 1D to 2D, we reduced each dimension parameter by a factor of a thousand to avoid massive data sets and to find the lower bound at which IO operations start to perform well. Now we have to do the same to accommodate 3D matrices; we expect to lower the parameters so that the matrices hold around 4M-16M particles. For 3D data sets, h5bench only allows one test, the contiguous-contiguous pattern.
- 128-128-128
- 256-256-256
- 512-512-512
As mentioned before, when we have fewer particles than the performance threshold, h5bench reports unrealistic raw speeds and low bandwidths. In this case, we obtain acceptable IO speeds with a 256³ matrix. When testing 1D and 2D structures, we saw good results with more than 4M particles; with 128³ particles (2,097,152) we anticipated a suboptimal result, and that is what we got. Moreover, we got the expected plots once we passed the 4M-particle threshold.
To test this theory, we executed the test with a 192³ matrix, halfway between the two initial sizes. The following plot shows that we are still underperforming even though we have surpassed the expected threshold of 4 million particles (192³ = 7,077,888).
This could mean that what matters is not just the number of particles but the size and balance of the rows and columns. We did an additional test with a 256x256x128 matrix, more than 4M particles but with a non-regular shape. In this case, we get decent bandwidth, but the raw bandwidth behaves just as it did with the small data sets. These last two tests suggest that the threshold correlates with both the data set size and its structure.
3D tests
h5bench only has one 3D test. For its plot, please check the previous section, figure 256³.
3D write-read tests
As with the 2D write-read and 3D write tests, h5bench only allows us to perform a contiguous-contiguous test, so we should expect bandwidths similar to those previously observed in the 1D and 2D contiguous-contiguous write-read executions. As before, we show the write bandwidth in blue and the read bandwidth in red.
There was a sudden drop in bandwidth between fifteen and twenty-five cores in all the write-read tests we performed, no matter the dimensionality. When only writing, h5bench did not report those drops. The write and read parts of the benchmark execute sequentially, so they do not interfere with each other. We tried to find a reason for the drop, but only observed that it is consistent across different nodes, hours and even days.
Projects vs Scratch
As we said earlier, we ran all of the benchmarks on gpfs/projects, but Marenostrum offers two main directories to store data: Projects and Scratch. Projects has a block size of 8 MiB versus the 16 MiB of Scratch. GPFS stores files in sub-blocks of block-size/32; any surplus commits an entire additional sub-block (files smaller than 4 KiB are stored in the inode). For example, a 4.01 KiB file stored in Projects will occupy 256 KiB on the file system, and 512 KiB if stored in Scratch.
Knowing this, we wanted to check that gpfs/scratch can deliver the same bandwidths and whether the block size correlates with performance or data set size. For this, we executed the write-read contiguous-contiguous tests there, since they let us check both write and read throughputs at once. Even though we know we may lose some bandwidth between fifteen and twenty-five cores, we can still estimate the write IO speeds, setting aside the previously mentioned artefact.
- Projects vs Scratch: 1D Write, 1D Read, 2D Write, 2D Read, 3D Write, 3D Read
- Projects vs Scratch vs 2X-Scratch: 1D Write, 1D Read, 2D Write, 2D Read, 3D Write, 3D Read
With equally sized data sets, Scratch and Projects perform alike when reading 2D or 3D structures, but when writing, and especially when reading 1D data sets, Scratch falls behind by 1 GB/s and sometimes up to 2 GB/s.
To check whether the block size correlates with bandwidth, we repeated the same tests with double-sized parameters ("2X" in the figures). Mostly, these tests fall in the middle ground between the original Scratch and Projects benchmarks. The only cases where the 2X data set outperforms the originals on Projects are when using more than thirty cores in the 2D Read test and between 1 and 26 cores in the 3D Write benchmark.
Conclusions
These benchmarks have shown us that we must think carefully about how we implement, read and store our data structures. We must avoid writing or reading in small batches, and we should plan when to access the file system.
What we have seen that truly ruins IO performance is mismatching the data access patterns: using different patterns in memory and on the file system yields unacceptable bandwidths, and all users should avoid it.
As for the other schemes and dimensionalities, we see no problem in using them as long as the implementation is efficient. We are confident that dimensionality does not affect performance if the data sets are appropriately sized. In the initial tests for each dimensionality we also tried larger data sets, which performed well, so increasing the data size does not hurt bandwidth.
As always, any questions about how to perform IO, or any performance issues, can be addressed to:
- RES/BSC: support AT bsc DOT es