Technical Information MN4 CTE-ARM

MN4 CTE-ARM (2020) Technical details

MN4 CTE-ARM is a supercomputer based on 192 A64FX ARM processors, with a Linux operating system and a Tofu interconnect network (6.8 GB/s).

See below a summary of the system:

  • Peak Performance of 648.8 Tflops
  • 6 TB of main memory
  • 192 nodes, each node composed of:
    • 1x A64FX @ 2.20GHz (4 CMGs with 12 cores each, 48 compute cores per node)
    • 32 GiB of HBM2 memory at 1 TB/s, distributed in 4 CMGs (8 GiB per CMG with 256 GB/s of bandwidth)
    • 1/16 x 512 GB SSD as local storage (each SSD is shared among 16 nodes)
  • Interconnection networks:
    • Tofu interconnect (Interconnect BW: 6.8GB/s, Node injection BW: 40GB/s)
    • I/O connection through Infiniband EDR (100Gbps)
  • Operating System:  Red Hat Enterprise Linux release 7.6 (Maipo)

 

MN4 CTE-ARM user documentation

Compute node characteristics

The A64FX chip has some particularities that differentiate it from other general-purpose CPUs in our cluster, in particular its core distribution and the use of helper cores. There are 8 NUMA nodes: 4 of them contain one helper core each, and the other 4 contain 12 compute cores each.

CORE and NUMA considerations

  • A64FX, Armv8.2-A + SVE instruction set
  • 52 cores in total, distributed in 4 CMGs: 48 compute cores and 4 helper cores.

NUMA distribution is as follows:

- NUMA [0-3] -> helper cores [0-3]

- NUMA [4-7] -> cores [12-23], [24-35], [36-47], [48-59]

 

Note: cores 4-11 are unassigned.
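To inspect this layout from a compute node, standard Linux NUMA tools can be used. A minimal sketch, assuming the numactl utility is available on the nodes (not verified here) and my_app is a placeholder binary:

    numactl --hardware                               # list each NUMA node with its CPU IDs and memory
    numactl --cpunodebind=4 --membind=4 ./my_app     # pin a run to the cores and HBM2 of the first CMG (NUMA node 4)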

NODE considerations

The cluster is composed of 3 different types of nodes. All of them offer the same compute power, but they differ in connectivity and purpose. These 3 types are:

  • Compute nodes: These are the simplest ones: the main SoC, provided with Tofu connectivity between nodes (groups of 12 nodes).
  • BIO nodes: Boot I/O nodes, which contain a 512 GB M.2 drive. 1/16 of the nodes are BIO, and they serve as the boot server for all the nodes that share the same disk.
  • GIO nodes: Nodes that contain the global I/O connection to the shared file system (FEFS), as well as the Infiniband EDR connection. 1/48 of the nodes are GIO, and they serve as the link to FEFS and IB for each shelf in the cluster.

Application compilation considerations

The access point for all users is the login nodes (2 x86 Intel CPUs). Due to the nature of these nodes, compilations must be done by cross-compiling; otherwise the resulting binaries will not be compatible with the compute nodes.

Most of the available applications are compiled for ARM, but some modules under the folder "/apps/modules/modulefiles/x86_64" are also available to be run on the login nodes.
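For example, to list and use those x86 modules from a login node, something along these lines should work (a sketch; whether that folder is already on the default MODULEPATH is an assumption we have not verified):

    module use /apps/modules/modulefiles/x86_64     # make the x86_64 modulefiles visible
    module avail                                    # list the modules that can run on the login nodes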

Available compilers

  • Fujitsu compilers (the only option providing MPI support): 

    Versions available: 1.1.18, 1.2.26b

     

    Language       Cross compilation command (from x86 login)   Native ARM command
    Fortran        frtpx                                         frt
    C              fccpx                                         fcc
    C++            FCCpx                                         FCC
    MPI Fortran    mpifrtpx                                      mpifrt
    MPI C          mpifccpx                                      mpifcc
    MPI C++        mpiFCCpx                                      mpiFCC

     

  • GNU compilers: 
    Versions available:
            - x86: 4.8.5 (system), 10.2.0
            - a64fx: 8.3.1, 10.2.0, 11.0.0

      Language    Native x86/ARM compilation command
      Fortran     gfortran
      C           gcc
      C++         g++

     

  • ARM compilers: 

    ARM compilers are only available from the compute nodes. These compilers are GNU based and follow the same nomenclature, but they also provide clang support.

      Language    Native ARM compilation command
      Fortran     gfortran / armflang
      C           gcc / armclang
      C++         g++ / armclang++
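As an illustration of the cross-compilation workflow described above, a minimal sketch using the Fujitsu MPI wrappers (hello.c is a placeholder source file):

    # On an x86 login node: cross-compile for the A64FX compute nodes
    module load fuji
    mpifccpx -Kfast hello.c -o hello_arm

    # On an A64FX compute node, the native equivalent would be
    mpifcc -Kfast hello.c -o hello_arm

The resulting binary only runs on the compute nodes, not on the login nodes.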

Fujitsu mathematical libraries and compiler options

Fujitsu provides a set of mathematical libraries that aim to accelerate calculations on the A64FX chip. These libraries are the following:

  • BLAS/LAPACK/ScaLAPACK:
    • Components

      - BLAS:  Library for vector and matrix operations.

      - LAPACK:  Library for linear algebra.

      - ScaLAPACK: MPI parallelized library for linear algebra.

    • Compilation options

       

      Library        Parallelization    Flag
      BLAS/LAPACK    Serial             -SSL2
      BLAS/LAPACK    Thread parallel    -SSL2BLAMP
      ScaLAPACK      MPI                -SCALAPACK

  • SSL II:
    • Components

      - SSL II:  Thread-safe serial library for numeric operations: linear algebra, eigenvalue/eigenvector problems, non-linear calculations, extremum problems, interpolation/approximation...

      - SSL II Multi-Thread:  Supports some important functionalities that are expected to perform efficiently on SMP machines.

      - C-SSL II: Supports part of the functionality of the serial SSL II (for Fortran) in C programs. Thread-safe.

      - C-SSL II Multi-Thread: Supports part of the functionality of the multithreaded SSL II (for Fortran) in C programs.

      - SSL II/MPI: 3D Fourier transform routine parallelized with MPI.

      - Fast Quadruple Precision Fundamental Arithmetics Library: Represents quadruple-precision values in double-double format and performs fast arithmetic on them.

    • Compilation options
      Library              Parallelization    Flag
      SSL II / C-SSL II    Serial             -SSL2
      SSL II / C-SSL II    Thread parallel    -SSL2BLAMP
      SSL II/MPI           MPI                -SSL2MPI

       

  • Recommended compiling options:
    Option        Description
    -Kfast        Equivalent to:
                  Fortran: -O3 -Keval,fp_relaxed,mfunc,ns,omitfp,sse{,fp_contract}
                  C/C++:   -O3 -Keval,fast_matmul,fp_relaxed,lib,mfunc,ns,omitfp,rdconv,sse{,fp_contract} -x-
    -Kparallel    Applies automatic parallelization.
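Putting these flags together, a minimal sketch of typical compile/link commands from a login node (the source file names are placeholders, and the exact flag combinations should be checked against the Fujitsu documentation):

    frtpx -Kfast -SSL2 solver.f90 -o solver                    # serial BLAS/LAPACK or SSL II
    frtpx -Kfast,openmp -SSL2BLAMP solver.f90 -o solver        # thread-parallel BLAS/LAPACK or SSL II
    mpifrtpx -Kfast -SCALAPACK -SSL2 pdsolver.f90 -o pdsolver  # ScaLAPACK, linked on top of the serial BLAS
    fccpx -Kfast,parallel loops.c -o loops                     # -Kfast plus automatic parallelization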

Benchmarks

The following information corresponds to some performance test runs executed "in house" by the user support team. Compilation details and run scripts are provided, except for HPL, which was run using a precompiled binary provided by Fujitsu.

HPL (linpack)

Results correspond to a precompiled binary provided by Fujitsu.

Performance score -> 548350 Gflops

HPCG

  • Compilation details:

    Compiler used: fccpx (Fujitsu C/C++ Compiler 4.2.0b, tcsds-1.2.26)

    Relevant flags defined...

    CXX       = mpiFCC
    HPCG_OPTS = -DHPCG_NO_OPENMP -DHPCG_CONTIGUOUS_ARRAYS
    CXXFLAGS  = -Kfast
    
  • Jobscript (full cluster example):
    #PJM -N hpcg
    #PJM -L node=192:noncont
    #PJM --mpi "proc=9216"
    #PJM -L rscunit=rscunit_ft02
    #PJM -L rscgrp=large
    #PJM -L elapse=0:30:00
    #PJM -j
    #PJM -S
    EXEC=./xhpcg
    module load fuji
    /usr/bin/time -p mpiexec ${EXEC} 80 80 80

     

  • Scores
    Compilation    Size                        Score (Gflops)
    Fujitsu        1 node                      100
    Fujitsu        Full cluster (192 nodes)    19206.4
    BSC            1 node                      40
    BSC            Full cluster (192 nodes)    7115.35
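For reference, a jobscript like the one above is submitted through the Fujitsu job manager (PJM); a minimal sketch, assuming the script was saved as hpcg.sh:

    pjsub hpcg.sh     # submit the jobscript to the scheduler
    pjstat            # check the state of queued and running jobs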

STREAM

  • Compilation details:

    Compiler used: fccpx (Fujitsu C/C++ Compiler 4.0.0, tcsds-1.1.18)

    Relevant flags defined...

    CC       = mpifccpx
    FCC      = mpifrtpx
    CFLAGS   = -Kfast -Kparallel -Kopenmp -Kzfill
    FFLAGS   = -Kfast -Kparallel -Kopenmp -Kzfill -Kcmodel=large
    
  • Jobscript :
    #!/bin/bash
    #PJM -L node=1
    #PJM -L rscunit=rscunit_ft02
    #PJM -L rscgrp=def_grp
    #PJM -L elapse=00:10:00
    #PJM -j
    #PJM --mpi "proc=4"
    #PJM -N "stream_bw" 
                           
    export OMP_DISPLAY_ENV=true
    export OMP_PROC_BIND=close
    export OMP_NUM_THREADS=12
    mpiexec ./stream.exe
    
  • Scores
    Mode     Score (GB/s)
    Copy     839.437
    Scale    827.712
    Add      854.164
    Triad    852.717
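Combining the compilation details above, a hedged sketch of how the benchmark binary could have been built (stream.c is the usual STREAM source file name, assumed here):

    mpifccpx -Kfast -Kparallel -Kopenmp -Kzfill stream.c -o stream.exe

Note that the jobscript launches 4 MPI ranks with 12 OpenMP threads each, i.e. one rank per CMG, matching the 48 compute cores and 4 HBM2 stacks of a node.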

GROMACS

  • Compilation details:

    Compiler used: GCC 11.0 with Fujitsu MPI (fuji) linked libraries

    Relevant flags defined:

    CC       = mpifcc
    FCC      = mpifrt
    LDFLAGS  = "-L/opt/FJSVxtclanga/tcsds-1.2.26/lib64/"
    
    GROMACS options:
    
    -DGMX_OMP=1 -DGMX_MPI=ON -DGMX_BUILD_MDRUN_ONLY=ON -DGMX_RELAXED_DOUBLE_PRECISION=ON -DGMX_BUILD_OWN_FFTW=OFF -DGMX_BLAS_USER=/opt/FJSVxtclanga/.common/SECA005/lib64/libssl2mtsve.a -DGMX_EXTERNAL_BLAS=on -DGMX_EXTERNAL_LAPACK=on -DGMX_LAPACK_USER=/opt/FJSVxtclanga/.common/SECA005/lib64/libssl2mtsve.a -DGMX_FFT_LIBRARY=fftw3 -DFFTWF_LIBRARY=/apps/FFTW/3.3.7/FUJI/FMPI/lib/libfftw3f.so -DFFTWF_INCLUDE_DIR=/apps/FFTW/3.3.7/FUJI/FMPI/include/ -DGMX_SIMD=ARM_SVE
    
    
  • Jobscript :

    #!/bin/bash
    #PJM -L rscunit=rscunit_ft02
    #PJM -L rscgrp=large
    #PJM -L node=1:noncont
    #PJM -L elapse=2:30:00
    #PJM --mpi "proc=8"
    #PJM --mpi rank-map-bynode 
    #PJM -N gromacs
    #PJM -L rscunit=rscunit_ft02,rscgrp=large,freq=2200

    export OMP_NUM_THREADS=6
    export LD_LIBRARY_PATH=/apps/FFTW/3.3.7/FUJI/FMPI/lib/:$LD_LIBRARY_PATH

    module purge

    module load fuji
    module load gcc
    module load gromacs

    mpirun mdrun_mpi -s lignocellulose-rf.BGQ.tpr -nsteps 1000 -ntomp 6

  • Scores
    Size      Score (ns/day)
    1 node    0.735