
HyperThreading performance on MareNostrum 4

· 15 min read
Cristian Morales Pérez
David Vicente Dorca
Joan Vinyals Ylla-Català

During the execution of synthetic and real application benchmarks, we detected that their performance has high variability on MareNostrum 4. We were also told by colleagues that they obtained more stable times with other applications, such as FALL3D, on clusters like JUWELS. We therefore decided to investigate the origin of this “system noise” and the different ways to limit it, since it affects the stability of application performance and harms scalability, mainly in applications with synchronizing MPI calls in their kernels.

Using a synthetic benchmark that we developed, we observed that there are periodic events during the simulation that affect the time of a small number of iterations. We concluded that these events come from the system and that they preempt the simulation processes. To avoid these preemptions, we tried running the same benchmarks leaving 1 or 2 cores per node empty. These empty cores are then available to run the periodic events, so the other cores avoid the preemptions and the applications running on them get a more stable time per iteration.

In addition, during the PRACE UEABS activity, we noticed that some applications perform better on Skylake systems with HyperThreading enabled, such as the JUWELS cluster (beyond the improvement obtained from their higher frequencies). This is because systems with HyperThreading handle preemptions better and context switching is faster.

Taking these points into account, we decided to test multiple applications with multiple configurations: with HyperThreading enabled and without limiting the frequency to 2.1 GHz, as other Skylake clusters do.

Noise Benchmark (gitlab)

We developed a synthetic benchmark to quantify the effect of noise on bursts of computation with very low granularity between synchronizations. When an application synchronizes this often, the noise propagates through all the allocated resources, and the CPU time lost is proportional to the resources allocated.

The benchmark we developed simulates this behaviour and keeps track of the noise-induced load imbalance. It consists of many iterations with short bursts of computation (around 10 µs) that are timed and stored in an array. Ideally, all iterations should take the same amount of time, but due to noise some variability appears. For every iteration we take the minimum and maximum time among all cores and accumulate them; the ratio of the accumulated minimums over the accumulated maximums is our efficiency value.
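To make this concrete, the following is a minimal sketch of such a measurement loop (hypothetical code, not the actual benchmark published on our GitLab), assuming one MPI rank per core, a barrier as the fine-grain synchronization point, and reductions to collect the per-iteration minimum and maximum:

```c
/* Sketch of the noise benchmark idea: many short compute bursts separated by
 * synchronizations; efficiency = sum(per-iteration min) / sum(per-iteration max). */
#include <mpi.h>
#include <stdio.h>

#define ITERS 100000

/* Roughly 10 us of busy work; the iteration count is machine dependent. */
static void burst(void)
{
    volatile double x = 0.0;
    for (int i = 0; i < 20000; i++)
        x += (double)i * 1e-9;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double acc_min = 0.0, acc_max = 0.0;

    for (int it = 0; it < ITERS; it++) {
        MPI_Barrier(MPI_COMM_WORLD);              /* fine-grain synchronization */

        double t0 = MPI_Wtime();
        burst();
        double dt = MPI_Wtime() - t0;             /* time of this burst on this core */

        double tmin, tmax;                        /* fastest / slowest core this iteration */
        MPI_Reduce(&dt, &tmin, 1, MPI_DOUBLE, MPI_MIN, 0, MPI_COMM_WORLD);
        MPI_Reduce(&dt, &tmax, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            acc_min += tmin;
            acc_max += tmax;
        }
    }

    if (rank == 0)                                /* 1.0 means no noise at all */
        printf("efficiency = %f\n", acc_min / acc_max);

    MPI_Finalize();
    return 0;
}
```

An efficiency of 1.0 means that no core was ever delayed; every preemption inflates the accumulated maximum and lowers the ratio.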

Configuration

For the synthetic benchmark tests with HyperThreading, we used the MN4 "hyperthreading" partition. This partition has 8 nodes available (s24r2b[65-72]) with HyperThreading enabled, and any user can select it with the following flag in their sbatch/salloc commands:

--partition=hyperthreading

We also used normal nodes from the "main" partition for comparison; by default, all user jobs run on this partition.

Performance analysis

We wanted to compare the value of this metric on the hyperthreading nodes to see whether the operating system uses the extra hardware thread for the preemptions, so that both codes run simultaneously and the effect of the noise is reduced.

We executed the benchmark from 1 full node up to 8 (the maximum available in the hyperthreading partition) and plotted the results in the image below. We can observe that the normal partition suffers a significantly larger noise effect than the hyperthreading one.

Noise benchmark plot

As explained earlier, these values represent the loss of efficiency when performing fine-grain synchronized computation. From these outcomes we can get an idea of which applications will behave better in the hyperthreading partition than in the normal one: those with this type of communication pattern.

Applications

Configurations

All the application simulations were run on an MN4 reservation on the nodes s02r2b[25-48] and s03r1b[49-72], 64 nodes in total. These nodes were set up in the following ways:

  • Def (Default): HyperThreading disabled, frequency limited to 2.1 GHz
  • HT (HyperThreading): HyperThreading enabled, frequency limited to 2.1 GHz
  • NL (No Limit): HyperThreading disabled, frequency unlimited
  • HT+NL: HyperThreading enabled, frequency unlimited

In some cases, for each test we ran the same simulation with both 24 and 23 tasks per socket (48 and 46 tasks per node).

| Test | HyperThreading | Frequency | Tasks per socket |
| --- | --- | --- | --- |
| Def-24 | Disabled | Limited to 2.1 GHz | 24 |
| Def-23 | Disabled | Limited to 2.1 GHz | 23 |
| HT-24 | Enabled | Limited to 2.1 GHz | 24 |
| HT-23 | Enabled | Limited to 2.1 GHz | 23 |
| NL-24 | Disabled | Unlimited | 24 |
| NL-23 | Disabled | Unlimited | 23 |
| HT+NL-24 | Enabled | Unlimited | 24 |
| HT+NL-23 | Enabled | Unlimited | 23 |

To run with 23 tasks per socket, we added the following to the job script:

#SBATCH --ntasks-per-node=46
#SBATCH --ntasks-per-socket=23
srun --cpu-bind=core /path/to/app

NAMD

  • Input: 1GND
  • 64 nodes (pure MPI)

| Test | AVG | MED | STD.DEV | MIN | MAX |
| --- | --- | --- | --- | --- | --- |
| Def-24 | 1001.05 | 1003.57 | 11.42 | 980.47 | 1017.46 |
| Def-23 | 873.27 | 873.13 | 6.18 | 863.56 | 881.62 |
| HT-24 | 825.5 | 825.5 | 2.05 | 822.09 | 830.14 |
| HT-23 | 836.81 | 836.89 | 3.18 | 831.54 | 846.11 |
| NL-24 | 880.72 | 880.58 | 8.06 | 865.69 | 898.11 |
| NL-23 | 828.38 | 827.92 | 2.63 | 822.54 | 833.1 |
| HT+NL-24 | 766.09 | 766.57 | 3.94 | 757.85 | 774.43 |
| HT+NL-23 | 761.2 | 760.89 | 4.25 | 751.52 | 771.2 |

NAMD Performance

We observe that the performance of NAMD is very poor with the Def-24 configuration, which is the most common configuration used by the users. Additionally, the noise with Def-24 is higher than in the other cases. The best performance is achieved with HT enabled, without the frequency limit and with 23 tasks per socket: on average, there is a 31.5% speed-up using HT+NL-23 instead of Def-24. We also observe that, once HT is enabled, using 23 tasks per socket instead of 24 does not reduce the noise any further.
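The speed-ups quoted throughout this post are consistent with taking the ratio of the average values in the tables; for NAMD, for example:

$$\text{speed-up} = \frac{\mathrm{AVG}(\text{Def-24})}{\mathrm{AVG}(\text{HT+NL-23})} = \frac{1001.05}{761.2} \approx 1.315 \;\Rightarrow\; 31.5\%$$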

GROMACS

  • Input: Lignocellulose
    • 100000 steps
  • 64 nodes (Hybrid: 2 OpenMP threads per MPI task)

| Test | AVG | MED | STD.DEV | MIN | MAX |
| --- | --- | --- | --- | --- | --- |
| Def-24 | 65.2 | 64.44 | 3.67 | 58.25 | 71.12 |
| Def-23 | 76.35 | 79.93 | 6.22 | 66.32 | 83.12 |
| HT-24 | 67.36 | 67.84 | 1.06 | 65.22 | 68.47 |
| HT-23 | 63.41 | 63.3 | 0.76 | 62.28 | 65.09 |
| NL-24 | 77.85 | 79.03 | 4.09 | 72.64 | 82.94 |
| NL-23 | 72.88 | 68.83 | 8.75 | 58.11 | 83.14 |
| HT+NL-24 | 73.41 | 73.44 | 0.46 | 72.57 | 74.11 |
| HT+NL-23 | 67.81 | 67.99 | 0.68 | 66.39 | 68.71 |

GROMACS Performance

We need to take into account that in this case we are not measuring performance as execution time but as ns/day (the higher the better). For GROMACS we can only confirm that with HT enabled the noise is much lower, while the performance is the same as with the Def-24 configuration. We do not observe a clear impact of using 23 or 24 tasks per socket.

On average, there is a 17-19% speed-up using Def-23 or NL-24 instead of Def-24. There is no performance loss when using any configuration with HT, NL or both instead of Def-24.

CP2K

  • Input: H2O-DFT-LS
  • 64 nodes (Hybrid: 2 OpenMP threads per MPI task)

| Test | AVG | MED | STD.DEV | MIN | MAX |
| --- | --- | --- | --- | --- | --- |
| Def-24 | 1667.45 | 1665.81 | 10.35 | 1651.53 | 1691.06 |
| HT-24 | 1797.77 | 1797.93 | 0.72 | 1796.23 | 1798.51 |
| NL-24 | 1662.52 | 1646.77 | 44.58 | 1644.33 | 1785.43 |
| HT+NL-24 | 1776.04 | 1775.59 | 2.41 | 1773.39 | 1779.82 |

On CP2K we observe a similar case to GROMACS: the noise is much lower with HT enabled, but the performance is lower than with HT disabled.

We also observe that the noise is higher when there is no frequency limit. With NL-24 there are two outliers, but in general it is quite stable and gives the best performance.

There is only a 6% speed-up using Def-24 instead of HT+NL.

WRF

  • Input: IP 12km
  • 64 nodes – Pure MPI

| Test | AVG | MED | STD.DEV | MIN | MAX |
| --- | --- | --- | --- | --- | --- |
| Def-24 | 232.01 | 230.55 | 6.48 | 220.45 | 244.28 |
| Def-23 | 199.96 | 199.35 | 1.4 | 198.46 | 204.96 |
| HT-24 | 219.88 | 219.86 | 1.08 | 217.28 | 221.78 |
| HT-23 | 227.35 | 227.53 | 1.47 | 224.26 | 231.7 |
| NL-24 | 213.73 | 213.87 | 5.26 | 205.76 | 222.33 |
| NL-23 | 197.99 | 197.99 | 0.58 | 196.59 | 199.72 |
| HT+NL-24 | 186.75 | 186.69 | 0.91 | 184.74 | 188.97 |
| HT+NL-23 | 187.71 | 187.59 | 0.98 | 185.86 | 189.67 |

WRF Performance

With WRF we observe a similar case to NAMD. The worst case is Def-24, the default configuration on MareNostrum 4 and the one used by all the users. It is also the configuration with the highest noise impact.

The best WRF performance is achieved with HT enabled and without the frequency limit (no matter whether 23 or 24 tasks per socket are used). On average, there is a 24% speed-up using HT+NL instead of Def-24. Additionally, in the case of WRF, when HT is disabled we observe a performance improvement using 23 tasks per socket.

Nek5000

  • Input: JUELICH
  • 64 nodes – Pure MPI

| Test | AVG | MED | STD.DEV | MIN | MAX |
| --- | --- | --- | --- | --- | --- |
| Def-24 | 260.15 | 261.56 | 5.8 | 250.76 | 268.5 |
| Def-23 | 220.53 | 220.64 | 1.77 | 217.63 | 223.52 |
| HT-24 | 204.25 | 204.09 | 1.01 | 203.13 | 205.97 |
| HT-23 | 208.01 | 207.92 | 0.29 | 207.78 | 208.43 |
| HT+NL-24 | 213.12 | 213.34 | 1.75 | 210.88 | 216.08 |
| HT+NL-23 | 221.95 | 220.18 | 9.43 | 213.59 | 242.03 |

Nek5000 Performance

As we can see in the table and the figure, the worst Nek5000 performance is obtained with the Def-24 configuration.

We also observe that the HT configurations are the ones with the best performance. The best performance is obtained with HT-24, but the most stable runs (least noise) are those with HT-23. On average, there is a 27.4% speed-up using HT-24 instead of Def-24.

In the case of HT+NL, we see that Nek5000 does not benefit from removing the frequency limit.

Alya – UEABS

Test Case A

  • Input: SPHERE_16.7M

    • 16.7M sphere mesh
    • Number of steps: 50
    • Mesh division: 1
  • 100 consecutive simulations on the same job

  • Time measurement: Nastin module time

| Test | AVG | MED | STD.DEV | MIN | MAX |
| --- | --- | --- | --- | --- | --- |
| Def-24 | 20.19 | 19.65 | 2.14 | 16.45 | 26.21 |
| Def-23 | 17.32 | 17.5 | 0.64 | 15.38 | 18.16 |
| HT-24 | 17.61 | 17.63 | 0.68 | 15.57 | 19.3 |
| HT-23 | 18.17 | 18.29 | 0.6 | 16.44 | 19.49 |
| NL-24 | 18.57 | 18.12 | 2.07 | 15.5 | 27.7 |
| NL-23 | 17.26 | 17.42 | 0.56 | 15.48 | 18.03 |
| HT+NL-24 | 17.7 | 17.88 | 0.71 | 15.64 | 19.14 |
| HT+NL-23 | 17.72 | 17.73 | 0.76 | 15.63 | 19 |

Alya Test A Performance

As observed in the previous benchmarks, Def-24 is the worst configuration in terms of average performance and the one with the most noise. The noise is reduced using 23 tasks per socket and/or HT, but it is still high. On average, there is a 14% speed-up using the HT+NL configurations instead of Def-24.

Alya – UEABS

Test Case B

  • Input: SPHERE_132M

    • 132M sphere mesh
    • Number of steps: 100
    • Mesh Division: 1
  • 100 consecutive simulations

| Test | AVG | MED | STD.DEV | MIN | MAX |
| --- | --- | --- | --- | --- | --- |
| Def-24 | 444.15 | 433.46 | 32.70 | 385.32 | 555.03 |
| Def-23 | 396.44 | 390.12 | 28.30 | 351.26 | 465.40 |
| HT-24 | 402.35 | 396.06 | 26.68 | 360.93 | 471.39 |
| HT-23 | 398.23 | 392.09 | 23.85 | 360.86 | 466.12 |
| NL-24 | 441.26 | 439.27 | 28.99 | 393.93 | 531.65 |
| NL-23 | 395.72 | 390.95 | 24.24 | 357.18 | 463.74 |
| HT+NL-24 | 401.66 | 396.79 | 26.59 | 354.83 | 473.29 |
| HT+NL-23 | 401.05 | 393.55 | 25.38 | 355.09 | 473.92 |

Alya B Performance

We see that Alya is positively affected by using 23 tasks per socket: in the configurations with 23 tasks per socket, the performance is always better and the noise is lower. The worst performance, and the highest variance, is obtained with the Def-24 configuration. In general, the best configuration is NL-23, although the single fastest simulation was obtained with Def-23.

On average, there is an 11% speed-up using the HT+NL configurations instead of Def-24.

Alya B 23 tasks Performance

Despite this, test case B of Alya seems special, because the variance remains high even when using HT and/or 23 tasks per socket.

HATE Tests

HATE is a set of multiple applications that run every day on MareNostrum to check its status:

  • Alya
  • Amber
  • CPMD
  • GROMACS
  • HPCG
  • linpack
  • namd
  • vasp
  • wrf

Only one simulation was run per test, always with 24 tasks per socket.

| Application | DEF | HT | NL | HT+NL |
| --- | --- | --- | --- | --- |
| Alya | 155.18 | 152.2 | 150.01 | 152.46 |
| Amber | 470.57 | 474.29 | 470.19 | 285.08 |
| CPMD | 77.44 | 78.67 | 78.1 | 77.12 |
| GROMACS | 247 | 370.4 | 246.4 | 304.98 |
| HPCG | 120.79 | 126.06 | 126.36 | 125.5 |
| linpack | 331.44 | 337.17 | 332.03 | 343.09 |
| namd | 785.55 | 789.98 | 786.77 | 669.19 |
| vasp | 294.5 | 293.34 | 301.91 | 291.25 |
| wrf | 96.43 | 106.37 | 98.99 | 94.2 |

With the HATE benchmarks we observe a behaviour similar to what we saw in the application-specific tests with multiple simulations. In most cases, HT+NL is the best choice to obtain the best performance. We also observe that GROMACS performance is negatively affected when HT is enabled.

Only in the synthetic benchmarks, such as HPCG and HPL (linpack), is the performance better with the DEF configuration.

Input/Output

| Test | DEF | HT | NL | HT+NL |
| --- | --- | --- | --- | --- |
| Read imb_io_home | 5608.3 | 5301 | 6822 | 5537.6 |
| Read imb_io_project | 5269.4 | 5794.4 | 5632.2 | 6803.3 |
| Read imb_io_scratch | 5594.4 | 4989.3 | 6745 | 5953.2 |
| Write imb_io_home | 3653.5 | 3547.3 | 3943.5 | 3920.8 |
| Write imb_io_project | 293.2 | 412.6 | 524.1 | 386.6 |
| Write imb_io_scratch | 228.4 | 244.8 | 242.1 | 273.9 |

From these tests we observe that the bandwidth of the input/output operations improves when the frequency limit is removed. We do not see a clear impact from enabling or disabling HT.

These tests can be highly affected by the usage of the file systems during the job execution.

Conclusions

In all benchmarks, when HT is enabled the performance variance between multiple simulations is lower. Additionally, when HT is disabled, using 23 tasks per socket also reduces the performance variance and in some cases yields a performance improvement. In the case of Alya, even when using 23 tasks per socket or HyperThreading, the noise remains high compared with the other benchmarks.

In all benchmarks, the configuration most commonly used by the users (Def-24) is one of the slowest. As Def-24 has HT disabled and uses 24 tasks per socket, it is also the configuration with the highest performance variance (noise).

Applications such as NAMD and WRF achieve their best performance with HyperThreading enabled and the frequency limit removed. AMBER, CPMD and VASP, as we have seen in the HATE tests, also perform best with this configuration. In these cases, we do not observe a clear impact of using 23 or 24 tasks per socket.

In the case of the I/O benchmarks, the bandwidth is higher if the frequency limit is removed, but it is not affected by whether HT is enabled or disabled.