SORS: Achieving Supercomputer Performance (Without Superhuman Effort)

Date: 12/Dec/2024 Time: 10:00

Place: [HYBRID] BSC Auditorium and Online via Zoom.

Abstract

One of the truisms of supercomputing is that promised (aka peak) performance always dwarfs realized performance. This gap has generally widened over time, most notably after the reemergence of ML/AI. To some degree, this is to be expected. Vendors are motivated to put their best foot forward, and as it was once put to me, vendor-published benchmarks should be considered the asymptotic limit of tuning. No matter how hard customers try, they will never completely replicate a vendor result – Hercules never catches the hare. At best, he gets within a given epsilon.

Even though Newton refuted the Hercules argument, it is still the case that the performance gap will always exist. The size of the gap, however, is a different matter. Jack Dongarra noted in the 1980s that to achieve what he called “supercomputer performance” on the Cray-1, computations had to be organized to reuse data; without data reuse, performance on the Cray was basically scalar speed. Compilers and vector computers of the 1980s by and large delivered supercomputer performance, something that is unfortunately far less true today.
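
To make the data-reuse point concrete, the sketch below contrasts two orderings of a matrix multiply in C. The kernel is an assumed illustration (the abstract does not name a specific computation): the first version streams operands from memory with no reuse, while the reordered version holds a value in a register and sweeps unit-stride rows, which is the kind of organization a vector machine rewards.

/* Assumed illustrative kernel: matrix multiply, not taken from the talk. */
#define N 512

/* No reuse exploited: b[k][j] is walked with stride N, and every operand
 * is reloaded from memory on each trip through the inner loop. */
void matmul_naive(double a[N][N], double b[N][N], double c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Reordered (i, k, j): a[i][k] is reused from a register across the whole
 * inner loop, and b and c are swept with unit stride, so the inner loop
 * both reuses data and vectorizes cleanly. */
void matmul_reordered(double a[N][N], double b[N][N], double c[N][N]) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double aik = a[i][k];
            for (int j = 0; j < N; j++)
                c[i][j] += aik * b[k][j];
        }
}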

This talk discusses how to achieve supercomputer performance. While the compiler is essential to a successful strategy, achieving supercomputer performance requires more: an appropriate compiler, a balanced architecture, and an informed user. The talk covers three broad topics, roughly corresponding to the user, compiler, and architecture perspectives:

  1. A quick review of the compiler theory behind reordering transformations such as vectorization and parallelization, demonstrating how that theory can produce supercomputer performance on vector architectures. As a general rule, understanding how compilers work is invaluable in helping application developers produce more effective programs (a small illustrative sketch follows this list).
  2. An examination of some common architectures used for ML/AI acceleration, discussing how reordering theory is necessary for their success, why their performance gap is so large, and the challenges that have to be overcome to meet the “supercomputer” mark.
  3. Speculation, from a compiler point of view, as to the architectures that can be effective for ML/AI acceleration. As in the days of CISC vs. RISC, striking the right balance between compiler and architecture is essential to a successful approach.
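
As a concrete (and assumed, textbook-style) illustration of the reordering transformations mentioned in the first topic, the C sketch below shows loop interchange. In the original nest the inner loop carries a dependence and cannot be vectorized; interchanging the loops moves the dependence to the outer loop, leaving an inner loop of independent, unit-stride iterations that a vectorizing compiler can run at vector speed.

#define N 1024
#define M 1024

/* Original nest: the inner i loop carries a dependence, since a[i][j]
 * reads the a[i-1][j] written on the previous i iteration, so the
 * compiler cannot vectorize it as written. */
void relax_original(double a[N][M], double b[N][M]) {
    for (int j = 0; j < M; j++)
        for (int i = 1; i < N; i++)
            a[i][j] = a[i-1][j] + b[i][j];
}

/* After loop interchange: the dependence is carried by the outer i loop,
 * the inner j iterations are independent and unit stride, and the inner
 * loop vectorizes. */
void relax_interchanged(double a[N][M], double b[N][M]) {
    for (int i = 1; i < N; i++)
        for (int j = 0; j < M; j++)
            a[i][j] = a[i-1][j] + b[i][j];
}
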
Short Bio
Being a firm believer in the statement “Implementations are oft-needed checks on the egos of theoreticians”, Randy Allen has spread his career across both the theory and implementation of highly innovative, highly optimizing tools and software. Dr. Allen’s PhD dissertation focused on optimizing compilers for parallel and vector machines, culminating in the graduate-level textbook (coauthored with Ken Kennedy) “Optimizing Compilers for Modern Architectures”. As VP of Performance Engineering at Chronologic Simulation, Randy was an early developer of VCS, the world’s first compiled Verilog simulator. Catalytic, Inc. (founded by Randy) pioneered fixed-point MATLAB® compilers for DSPs. Dr. Allen’s current interests lie in developing high-performance, low-power architectures and compilers that can meet the needs of ML/AI networks. Dr. Allen earned a PhD in Mathematical Sciences from Rice University and was a founding member of the Rice Computer Science Department. He is currently a Leading Researcher at the Barcelona Supercomputing Center.

Speakers

Speaker: Randy Allen, Leading Researcher. Technical Management HW Engineering - Computer Sciences Department, BSC

Host: Teresa Cervero, Leading Researcher. Technical Management HW Engineering - Computer Sciences Department, BSC