SORS: Navigating heterogeneity and scalability in modern chip design

Date: 27/Jun/2023 Time: 11:00

Place: BSC Auditorium and Zoom.

 


Objectives

Abstract: Due to the growing size of AI models, computationally intensive workloads such as neural networks are increasingly adopting sparse data structures to reduce their memory footprint. Similar sparse structures are also central to graph analytics, recommendation systems, and computational chemistry. Traversing sparse data leads to irregular, data-dependent memory accesses, which frequently miss in the cache hierarchy due to limited spatial or temporal reuse. Consequently, fetching data from main memory results in processor stalls and increased memory-access contention in multi-processor systems. This talk focuses on how computers process sparse data structures, introducing innovations across the hardware-software stack.
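To make this access pattern concrete, below is a minimal C++ sketch (an illustration added for this announcement, not code from the talk) of sparse matrix-vector multiply in CSR format. The load x[col_idx[j]] is data-dependent: its address is unknown until col_idx[j] arrives, so it rarely hits in the cache.

    #include <cstdio>
    #include <vector>

    // Sparse matrix-vector multiply (y = A*x) with A stored in CSR form.
    void spmv_csr(const std::vector<int>& row_ptr, const std::vector<int>& col_idx,
                  const std::vector<double>& val, const std::vector<double>& x,
                  std::vector<double>& y) {
        for (std::size_t i = 0; i + 1 < row_ptr.size(); ++i) {
            double acc = 0.0;
            for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
                acc += val[j] * x[col_idx[j]];   // irregular, data-dependent gather
            y[i] = acc;
        }
    }

    int main() {
        // 2x2 matrix [[1, 2], [0, 3]] in CSR, multiplied by x = (1, 1).
        std::vector<int> row_ptr{0, 2, 3}, col_idx{0, 1, 1};
        std::vector<double> val{1, 2, 3}, x{1, 1}, y(2);
        spmv_csr(row_ptr, col_idx, val, x, y);
        std::printf("y = (%g, %g)\n", y[0], y[1]);   // prints y = (3, 3)
    }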

First, I present the hardware design of MAPLE, a memory-access engine to which off-the-shelf processing elements (PEs) can offload irregular memory accesses, consuming the returned data through software-managed queues. When used in conjunction with software pipelining or prefetching, MAPLE effectively hides memory latency, providing 2x speedups over software-only and hardware-only techniques.
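As a rough software analogue of this decoupled access/execute pattern (a sketch under assumed details: the queue type, sizes, and thread roles are invented here, not MAPLE's actual interface), one thread plays the role of the access engine and issues the irregular loads, while the compute thread consumes the values in order through a software-managed queue:

    #include <atomic>
    #include <cstdio>
    #include <thread>
    #include <vector>

    constexpr std::size_t QSIZE = 1024;   // power of two, single-producer/single-consumer

    struct SpscQueue {                    // stands in for MAPLE's hardware queues
        double buf[QSIZE];
        std::atomic<std::size_t> head{0}, tail{0};
        void push(double v) {             // producer spins when full (back-pressure)
            std::size_t t = tail.load(std::memory_order_relaxed);
            while (t - head.load(std::memory_order_acquire) == QSIZE) {}
            buf[t % QSIZE] = v;
            tail.store(t + 1, std::memory_order_release);
        }
        double pop() {                    // consumer spins when empty
            std::size_t h = head.load(std::memory_order_relaxed);
            while (tail.load(std::memory_order_acquire) == h) {}
            double v = buf[h % QSIZE];
            head.store(h + 1, std::memory_order_release);
            return v;
        }
    };

    int main() {
        std::vector<double> data(1 << 20, 1.0);
        std::vector<std::size_t> idx(1 << 16);
        for (std::size_t i = 0; i < idx.size(); ++i)
            idx[i] = (i * 2654435761u) % data.size();   // pseudo-random, cache-unfriendly

        SpscQueue q;
        // "Access" side: issues the irregular loads, like offloading them to MAPLE.
        std::thread engine([&] { for (std::size_t i : idx) q.push(data[i]); });
        // "Execute" side: consumes values in order and does only arithmetic.
        double sum = 0.0;
        for (std::size_t i = 0; i < idx.size(); ++i) sum += q.pop();
        engine.join();
        std::printf("sum = %f\n", sum);   // prints sum = 65536.000000
    }

In MAPLE itself the access side is a hardware engine rather than a second thread, so the PE only ever sees the consume end of the queue.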

Second, to scale sparse applications to thousands of PEs, my work introduces Dalorex, an execution model in which a program is split at each pointer indirection into tasks that access only a small range of the address space. Each task then executes on the processor with dedicated access to that memory range, which creates a pipeline of task executions for each independent exploration of the data structure. My work also proposes a novel integration of memory technology and PEs that enables building Dalorex at scale. I present results for parallelizations of six sparse workloads scaling up to a million PEs, achieving faster execution times than the top entries of the Graph500 ranking for the studied problem sizes.
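The following toy C++ model illustrates the task-splitting idea (the PE count, the ownership rule, and the task format are assumptions of this sketch, not Dalorex's actual design): reading a vertex's edge list is one task, and each pointer indirection to a neighbor becomes a new task routed to the queue of the PE that owns that neighbor's data.

    #include <cstdio>
    #include <deque>
    #include <vector>

    constexpr int NUM_PES = 4;

    struct Task { int type; int arg; };   // type 0: visit vertex, type 1: update vertex

    std::vector<int> row_ptr, col_idx;    // CSR graph, partitioned by vertex range
    std::vector<int> value;               // per-vertex data, same partitioning
    std::deque<Task> queue[NUM_PES];      // one task queue per PE

    int owner(int vertex) { return vertex * NUM_PES / (int)value.size(); }

    void run() {
        bool work = true;
        while (work) {                    // round-robin simulation: one task per PE per pass
            work = false;
            for (int pe = 0; pe < NUM_PES; ++pe) {
                if (queue[pe].empty()) continue;
                work = true;
                Task t = queue[pe].front(); queue[pe].pop_front();
                if (t.type == 0) {
                    // Local read of this vertex's edge list; each neighbor access is
                    // a pointer indirection, so it becomes a task at the owner PE.
                    for (int e = row_ptr[t.arg]; e < row_ptr[t.arg + 1]; ++e)
                        queue[owner(col_idx[e])].push_back({1, col_idx[e]});
                } else {
                    ++value[t.arg];       // executes where the data lives
                }
            }
        }
    }

    int main() {
        // Tiny 4-vertex graph: 0->1, 0->2, 1->3, 2->3
        row_ptr = {0, 2, 3, 4, 4};
        col_idx = {1, 2, 3, 3};
        value.assign(4, 0);
        queue[owner(0)].push_back({0, 0});   // visit vertex 0: updates its neighbors
        run();
        for (int v = 0; v < 4; ++v) std::printf("value[%d] = %d\n", v, value[v]);
    }

Because each task only touches data its PE owns, the traversal naturally forms the pipeline described above, and no PE stalls on a remote cache miss.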

In an era of big data, sparse data structures are crucial for a wide range of applications. My work demonstrates that by migrating execution to the PE closest to a memory region as the sparse structure is traversed, we can achieve massive improvements in performance and energy efficiency. While different memory integrations involve trade-offs in performance, power, and cost, my proposed chiplet integration enables cost-effective decision-making after silicon production. Finally, a section of the talk will be devoted to two formal verification methodologies developed to aid the design of safe and secure hardware modules in the context of the Cambrian explosion of accelerators and specialized hardware.

 

Short bio: Marcelo is a PhD candidate in the Department of Computer Science at Princeton University, advised by Margaret Martonosi and David Wentzlaff. He received his BSE from the University of Murcia. Marcelo is interested in modular hardware innovations that make SoC integration practical. His research spans computer architecture, from hardware RTL design and verification to software programming models for novel architectures.

He has previously worked in the hardware industry: at Arm, contributing to the design and verification of three GPU projects; at Cerebras Systems, creating high-performance computing kernels; and at AMD Research, working toward next-generation data centers optimized for traversing large graph data structures.

At Princeton, he has contributed to two academic chip tapeouts that aim to improve the performance, power, and programmability of several emerging workloads in the broad areas of machine learning and graph analytics.

 

Speakers

Speaker: Marcelo Vera, PhD candidate in the Department of Computer Science at Princeton University, US.
Host: Osman Unsal, Computer Architecture for Parallel Paradigms Group Manager, CS, BSC.