Hybrid SORS: Silent Data Corruptions in Computing Systems – Modelling, Measuring, Mitigation Across the Layers

Date: 01/Feb/2023 Time: 15:00

Place:

Severo Ochoa room and over Zoom with required registration.

Objectives

Abstract: Hyperscalers (Meta and Google) recently revealed a major issue in the operation of their server fleets: CPU hardware faults, marginalities, and bugs (not random transients) generate wrong program outputs (silent data corruptions, SDCs) more frequently than ever imagined, and these propagate at scale without any alert from the hardware or software. The research community was invited to join this challenging endeavour (cf. Meta RFP). In this talk, we discuss the severity of the problem and its (likely still unknown) implications for large-scale computing. We focus on the problem’s cross-layer (circuit, microarchitecture, ISA, software) and end-to-end nature, and on how modelling efforts at different layers of abstraction can shed light on the accurate measurement of SDC rates. Fast and effective quantification of these rates, along with identification of “troublemaking” hardware structures and software pieces, can assist mitigation actions by silicon manufacturers and by system and software integrators.

Short bio: Dimitris Gizopoulos is a Professor at the National and Kapodistrian University of Athens, where he leads the Computer Architecture Lab. The lab’s research focuses on dependable, energy-efficient, and performance-optimized computing systems across different domains and abstraction layers. The lab’s work is supported by the EU as well as by industry (currently by Meta, AMD, and Intel on dependable and energy-efficient computing topics). Gizopoulos is a member of the Editorial Boards of several IEEE and ACM Transactions/Magazines and has recently served as General Chair of the IEEE/ACM MICRO53 and MICRO54 symposium editions. He is an IEEE Fellow and an ACM Distinguished Member.

Speakers

Speaker: Dimitris Gizopoulos, Professor, Department of Informatics & Telecommunications, University of Athens
Host: Petar Radojkovic, Memory technologies Established Researcher, Computer Sciences, BSC