
IO pattern performance on MareNostrum 4

16 min read
Albert Azcárate Sánchez
Pablo Rodenas Barquero
David Vicente Dorca

One of the critical steps of any HPC program is reading data from the file system and storing temporary or final results back to it. Knowing how and when to read or write a file is essential because it is one of the most time-consuming operations. Not only is it vital to be efficient when performing IO (input-output) operations, but it is also crucial how the data is accessed and how it is laid out. Another critical decision is how many processors will execute these IO operations and how they will communicate. These tests have been executed on gpfs/projects on MareNostrum 4, using a pure MPI communication scheme and HDF5 1.10.1.

With this in mind, we have performed several tests using the h5bench benchmark to measure the performance and impact of different memory access patterns and of different storage patterns in the temporary or final files, over a set of data structures.

Configuration

h5bench offers the option to define different configurations for the executions. In our case, the only options we have to define properly are the MPI calls, the data structure's parameters, and the access pattern.

For each test, we have a base configuration in which we have to specify, among other options:

  • How the data lives in memory (MEM_PATTERN)
  • How we will store it in the file (FILE_PATTERN)
  • The number of iterations we want to perform (TIMESTEPS)
  • The number of dimensions (NUM_DIMS)
  • How many particles we have in each dimension (DIM_1, DIM_2, DIM_3)
"benchmark": "write", 
"file": "test.h5",
"configuration": {
"MEM_PATTERN": "CONTIG",
"FILE_PATTERN": "CONTIG",
"TIMESTEPS": "5",
"DELAYED_CLOSE_TIMESTEPS": "2",
"COLLECTIVE_DATA": "YES",
"COLLECTIVE_METADATA": "YES",
"EMULATED_COMPUTE_TIME_PER_TIMESTEP": "100 ms",
"NUM_DIMS": "1",
"DIM_1": "4194304",
"DIM_2": "1",
"DIM_3": "1",
"CSV_FILE": "output.csv",
"MODE": "SYNC"
}

In this example, we have a one-dimensional array of 4194304 particles stored contiguously in memory, and we will also write them contiguously into the file system. Between iterations, we will pause for 100 ms to emulate some computation, but this delay can be as long or as short as we want.
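For reference, the COLLECTIVE_DATA and COLLECTIVE_METADATA options roughly correspond to the following parallel HDF5 property-list calls. This is only an illustrative sketch (it is not the code h5bench runs internally) and requires an MPI-enabled build of HDF5:

/* Illustrative sketch only (not h5bench internals): how COLLECTIVE_DATA and
 * COLLECTIVE_METADATA translate into parallel HDF5 property lists.
 * Requires an MPI-enabled (parallel) build of HDF5. */
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* File access property list: MPI-IO driver plus collective metadata
     * operations (COLLECTIVE_METADATA = "YES"). */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    H5Pset_all_coll_metadata_ops(fapl, 1);
    H5Pset_coll_metadata_write(fapl, 1);

    hid_t file = H5Fcreate("test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Dataset transfer property list: collective raw-data IO
     * (COLLECTIVE_DATA = "YES"); passed later to H5Dwrite(). */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    /* ... create dataspaces and datasets, then H5Dwrite(..., dxpl, buf) ... */

    H5Pclose(dxpl);
    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}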

Access modes

We can access or store our data in the following ways:

  • Contiguous: The particles are stored one after the other, just like a classic array.
  • Strided: There is a fixed padding between consecutive particles. It is still an array, but we only access some of its elements, in order.
  • Interleaved: An array of structs, each holding one particle plus some padding.

Memory and file access patterns can be combined, so we can, for example, access memory in an interleaved way and write to disk contiguously (e.g. storing only the ID field of thousands of account records into a file). We will run all the combinations of these access modes and dimensionalities. Each test executes with one to forty-eight tasks, one core per task, all inside the same node.
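To make the patterns more concrete, the following sketch (illustrative only, not taken from h5bench) writes the payload of an interleaved in-memory array of particles into a contiguous 1D dataset by selecting a strided hyperslab on the memory dataspace. The record layout, file name and dataset name are invented for the example:

/* Sketch (illustrative, not h5bench code): interleaved memory pattern,
 * contiguous file pattern. Each in-memory record is one payload double
 * followed by three padding doubles; only the payload reaches the file. */
#include <hdf5.h>
#include <stdlib.h>

#define N      1024     /* particles (hypothetical size) */
#define STRIDE 4        /* doubles per in-memory record   */

int main(void)
{
    double *buf = malloc(N * STRIDE * sizeof(double));
    for (size_t i = 0; i < N; i++)
        buf[i * STRIDE] = (double)i;           /* payload; the rest is padding */

    hid_t file = H5Fcreate("pattern.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Memory dataspace covers the whole interleaved buffer; we select every
     * STRIDE-th element (the payload) with a strided hyperslab. */
    hsize_t mem_dims[1] = { N * STRIDE };
    hid_t memspace = H5Screate_simple(1, mem_dims, NULL);
    hsize_t start[1] = {0}, stride[1] = {STRIDE}, count[1] = {N}, block[1] = {1};
    H5Sselect_hyperslab(memspace, H5S_SELECT_SET, start, stride, count, block);

    /* File dataspace is plain contiguous: N doubles, one after another. */
    hsize_t file_dims[1] = { N };
    hid_t filespace = H5Screate_simple(1, file_dims, NULL);

    hid_t dset = H5Dcreate2(file, "particles", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, H5P_DEFAULT, buf);

    H5Dclose(dset);
    H5Sclose(filespace);
    H5Sclose(memspace);
    H5Fclose(file);
    free(buf);
    return 0;
}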

info

For a more in-depth explanation of h5bench, please check the official documentation.

Benchmarks

Given all of this, we will conduct batches of tests according to the dimensionality of the data structures. For each, we will run a write-read and a pure write test; any change to the base configuration will be specified. Theoretically, MareNostrum 4 can provide up to 100 Gbit/s, or 12.5 GB/s, so we will strive to get close to that throughput. We also have to consider that we share the file system with the other users of the machine, so if they are also writing or reading, we share the bandwidth and our results worsen. We tried to execute the tests when the machine was under low demand.

1D data structures

First, we must identify a base configuration for our tests that guarantees good, scalable, accurate and verifiable results. For this, we ran the same test while only modifying the number of particles per processor, from 1,048,576 to 67,108,864, to see how it scales.

a remark about the tests

The outputs of the benchmarks are in GB/s. The blue line is the raw bandwidth of the IO operation, without considering MPI communications or the HDF5 library; in the red line we have subtracted these overheads to get the actual bandwidth. The figures show the average of ten executions of the same test.

The scale of these plots is not consistent between them. Take care when comparing results. Higher is better.

Initial 1D tests; contiguous memory, contiguous file system

1M

Based on these outputs, we can conclude that using fewer than four million particles per processor results in poor performance, likely due to communication overheads and page faults, since the data of several processors probably collides on the same page. These 4194304 particles will be our standard configuration for the 1D tests. We can also see the impact of the MPI and HDF5 overheads on the 1M and 2M tests.

We can see a big difference between the raw bandwidth and the observed/real bandwidth, and in some cases we even exceed the limit of 12.5 GB/s. These outliers appear because the blue line (raw output) does not account for the time spent in HDF5 operations. For example, in one execution using 14 cores, the output was the following:

=================== Performance Results ==================
Total number of ranks: 14
Total emulated compute time: 0.400 s
Total write size: 8.750 GB
Raw write time: 0.642 s
Metadata time: 0.002 s
H5Fcreate() time: 0.025 s
H5Fflush() time: 0.311 s
H5Fclose() time: 0.001 s
Observed completion time: 1.384 s
SYNC Raw write rate: 13.634 GB/s
SYNC Observed write rate: 8.891 GB/s
===========================================================

The Raw write rate is the Total write size divided by the Raw write time. The Observed write rate is the Total write size divided by the sum of the Raw write time, the metadata time and the H5F* function times. As we can see, H5Fflush() alone takes nearly half as long as the raw write itself in this example.
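Plugging in the numbers above: 8.750 GB / 0.642 s ≈ 13.6 GB/s for the raw rate, and 8.750 GB / (0.642 + 0.002 + 0.025 + 0.311 + 0.001) s ≈ 8.9 GB/s for the observed rate, which roughly reproduces the reported values.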

1D tests

1D figures; 4M particles

Contiguous memory, contiguous file system

4M-c-c

To see this in more detail, we can check the following figure, which shows how many GB/s each processor contributes to the bandwidth. Each line is the bandwidth divided by the number of processors in each test (higher is better). Wherever the line flattens, each additional core contributes roughly the same bandwidth (e.g. 47 cores at 0.2 GB/s per core = 9.4 GB/s; 48 cores at 0.2 GB/s per core = 9.6 GB/s).

1D write; GB/s per Core

GBs-core_4M_1D

In the case of strided access, we had some unexpected results, with reported rates of up to 13 TB/s. We ran some subtests to check the reason behind this behaviour, but they were inconclusive as to why this test randomly reports TB/s or KB/s when parameters change, so we decided not to use it in this report. This is not a significant inconvenience, because the strided write pattern can only be used on the 1D data sets.

After this hiccup, we can finally draw some conclusions.

First of all, we need to avoid the contiguous-interleaved and interleaved-contiguous schemes. Both perform poorly and fall far behind the other schemes, especially the former, which does not reach 1 GB/s no matter how many processors we use.

As for the other patterns, all plateau at some point, so increasing the number of IO cores much beyond that point will not significantly improve performance. We have to find a balance between the size of our data and the number of cores we dedicate to IO. It is also clear that matching both access patterns outperforms the other schemes, so choosing the memory pattern effectively sets our file system pattern and vice versa.

1D write-read tests

Now that we have seen the write performance of our file system, we will check the write-read performance. When reading, we only access the file system, so we only have to specify the file pattern.

h5bench only allows us to perform write-read tests when the write configuration is contiguous-contiguous, which reduces our executions to three: a contiguous full-file read, a contiguous partial read and a strided read.
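As an illustration of what the strided variant means at the HDF5 level, here is a minimal hypothetical sketch (not the h5bench implementation) that reads every second element of a 1D dataset through a strided hyperslab on the file dataspace; the file and dataset names are placeholders:

/* Sketch (illustrative only): strided read of a 1D dataset named
 * "particles" from test.h5, picking every second element. */
#include <hdf5.h>
#include <stdlib.h>

int main(void)
{
    hid_t file   = H5Fopen("test.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset   = H5Dopen2(file, "particles", H5P_DEFAULT);
    hid_t fspace = H5Dget_space(dset);

    hsize_t dims[1];
    H5Sget_simple_extent_dims(fspace, dims, NULL);

    /* Select every second element in the file: start 0, stride 2. */
    hsize_t n = dims[0] / 2;
    hsize_t start[1] = {0}, stride[1] = {2}, count[1] = {n}, block[1] = {1};
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, stride, count, block);

    /* Read the selection into a contiguous buffer of n doubles. */
    double *buf  = malloc(n * sizeof(double));
    hid_t mspace = H5Screate_simple(1, &n, NULL);
    H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, buf);

    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Dclose(dset);
    H5Fclose(file);
    free(buf);
    return 0;
}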

In the following figures, the write bandwidth is shown in blue and the read bandwidth in red. We have also included a comparison of how many GB/s each core contributes when reading.

1D write-read; 4M particles

Contiguous memory, contiguous file system, contiguous full read file system

4M-c-c

In both contiguous-full and contiguous-partial we see a similar bandwidth distribution, since technically it is the same test and the latter simply reads less data. The full read outperforms the partial one every time, presumably because h5bench introduces additional logic to read only some chunks. Implemented carefully, a partial read can reach throughputs similar to the full read.

2D data structures

Let us proceed with the 2D tests. To limit the space of possibilities, we will only benchmark square matrices, but we will also sample whether a non-square matrix has any effect. As before, we will test different sizes to see how they affect performance and try to find the scalability threshold.

Initial 2D tests; contiguous memory, contiguous file system

1K

In this case, we have tested matrices from 1024x1024 to 8192x8192 particles. As before, we get acceptable results with more than four million particles (2048x2048), but we still see some artefacts in the raw bandwidth. With non-square matrices we observe results matching the square ones, so we can conclude that the matrix shape does not affect performance.

With this in mind, we chose 4096 particles in each dimension, the smallest size that gives accurate results, and ran all four tests.

2D tests

2D tests; 4K-4K particles

Contiguous memory, contiguous file system

4K-c-c

As in the 1D tests, we have to avoid contiguous-interleaved and interleaved-contiguous, which, just as before, perform poorly. These results should not surprise us, given that a 2D matrix can be seen as one long 1D vector divided into regions, each region being one row of the matrix. We can check the GB/s per core plot to compare the four schemes.

2D write; GB/s per Core

GBs-core-4k

2D write-read tests

Now that we know how our file system behaves when writing 2D data structures, we will check the read throughputs. In this case, h5bench only allows us to do write-read tests using contiguous data, storing it contiguously and reading it entirely. As before, the write bandwidth is shown in blue and the read bandwidth in red.

Contiguous memory, contiguous file system, contiguous read file system

4M-c-c

As we can see, we stagnate at around 7 GB/s of read bandwidth, the same as on the 1D data sets, so we can start to form the idea that dimensionality does not affect read or write performance as long as the access is efficient.

3D data structures

When changing from 1D to 2D, we reduced each dimension by roughly a factor of a thousand to avoid massive data sets and to find the lower bound at which IO operations start to perform well. We now have to do the same to accommodate 3D matrices, so we expect to lower the per-dimension size to obtain matrices of around 4M-16M particles. For 3D data sets, h5bench only allows one test, the contiguous-contiguous pattern.

Initial 3D tests; contiguous memory, contiguous file system

128

As mentioned before, when we have fewer particles than the performance threshold, h5bench reports unrealistic raw speeds and low bandwidths. In this case, we get acceptable IO speeds when using a 256³ matrix. When testing 1D and 2D structures, we saw good results with more than 4M particles, so with 128³ particles (2,097,152) we anticipated a suboptimal result, which is what we got. Moreover, with 256³ particles, well past the 4M threshold, we got the expected plots.

To test this theory, we executed the test with a 192³ matrix, halfway between the two initial sizes. The following plot shows that we are still underperforming even though we have surpassed the expected threshold of 4 million particles (192³ = 7,077,888).

192³ particles

192

This could mean that what matters is not only the number of particles but also the size and balance of the dimensions. We ran an additional test with a 256-256-128 matrix: more than 4M particles, but with a non-regular shape. In this case we get decent bandwidth, yet the raw bandwidth behaves just as it did with the small data sets. These last two tests suggest that the threshold correlates with both the data-set size and its structure.

256-256-128 particles

256-128

3D tests

This chapter intentionally left blank

h5bench only has one 3D test. For the plot of the 3D test, please check the previous chapter, figure 256³.

3D write-read tests

As with the 2D write-read and 3D write tests, h5bench only allows us to perform a contiguous-contiguous test, so we should expect results similar to those previously observed in the 1D-c-c and 2D-c-c write-read executions. As before, the write bandwidth is shown in blue and the read bandwidth in red. We should get the same bandwidths as the other write-read benchmarks that access and store data contiguously.

Contiguous memory, contiguous file system, contiguous read file system

3D-c-c-wr

Unexpected Write-read behaviour

There was a sudden drop in bandwidth between fifteen and twenty-five cores in all the write-read tests we performed, no matter the dimensionality. When only writing, h5bench did not report those drops in GB/s. The write and read parts of the benchmark execute sequentially, so they do not interfere with each other. We tried to find a reason for this, but the drop is consistent across different nodes, hours and even days.

Projects vs Scratch

As we said earlier, we ran all of the benchmarks on gpfs/projects, but MareNostrum offers two main directories to store data: Projects and Scratch. Projects has a block size of 8 MiB against the 16 MiB of Scratch. GPFS stores files in sub-blocks of block-size/32; any surplus still occupies a whole sub-block, and files smaller than 4 KiB are stored in the inode. For example, a 4.01 KiB file will occupy 256 KiB on Projects and, if stored on Scratch, 512 KiB.
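As a sketch of that allocation rule, here is a small, hypothetical helper assuming the block-size/32 sub-block layout and the 4 KiB in-inode limit described above:

/* Sketch: estimate GPFS on-disk allocation for a file, assuming 32
 * sub-blocks per block and the 4 KiB in-inode limit from the text. */
#include <stdio.h>
#include <stdint.h>

static uint64_t gpfs_alloc_bytes(uint64_t file_size, uint64_t block_size)
{
    const uint64_t subblock = block_size / 32;   /* e.g. 8 MiB / 32 = 256 KiB */
    if (file_size <= 4096)                       /* tiny files live in the inode */
        return 0;
    /* round up to a whole number of sub-blocks */
    return ((file_size + subblock - 1) / subblock) * subblock;
}

int main(void)
{
    const uint64_t MiB = 1024 * 1024;
    uint64_t f = 4107;  /* a 4.01 KiB file, just over the in-inode limit */
    printf("Projects (8 MiB blocks):  %llu KiB\n",
           (unsigned long long)(gpfs_alloc_bytes(f, 8 * MiB) / 1024));   /* 256 KiB */
    printf("Scratch  (16 MiB blocks): %llu KiB\n",
           (unsigned long long)(gpfs_alloc_bytes(f, 16 * MiB) / 1024));  /* 512 KiB */
    return 0;
}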

Knowing this, we wanted to check whether we could obtain the same bandwidths on gpfs/scratch and whether the block size correlates with performance or data-set size. For this, we executed the write-read contiguous-contiguous tests there, since they let us check both write and read throughputs at once. Even though we know we may lose some bandwidth between fifteen and twenty-five cores, we can still estimate the write IO speeds outside of those previously mentioned artefacts.

Scratch and Projects performance juxtaposition

1D-w-c-c-f-4M

With equally sized data sets, Scratch and Projects behave alike when reading 2D or 3D structures, but when writing, and especially when reading, 1D data sets, Scratch falls behind by 1 GB/s and sometimes up to 2 GB/s.

To check whether the block size correlates with bandwidth, we repeated the same tests with double-sized parameters ("2X" in the figures). These tests mostly fall in the middle ground between the original Scratch and Projects benchmarks. The only cases in which the 2X data set outperforms the originals on Projects are with more than thirty cores on the 2D read test and between 1 and 26 cores on the 3D write benchmark.

Conclusions

These benchmarks have shown us that we must think carefully about how we implement, read and store our data structures. We must avoid writing or reading in small batches, and we should plan when we access the file system.

What we have seen that truly destroys IO performance is mismatching the access patterns of the data: using different patterns in memory and on the file system yields unacceptable bandwidths, and all users should try to avoid it.

As for the other schemes and dimensionalities, we do not see any problem in using them if the implementation is efficient. We are confident that dimensionality does not affect performance if the data sets are appropriately sized. In the initial tests for each dimensionality we also tried larger data sets, which performed well, so increasing the data size does not hurt bandwidth.

Still with unanswered questions?

As always, any doubts about how to perform IO or performance issues can be addressed to:

  • RES/BSC: support AT bsc DOT es