Reliability is becoming a first-class system design criterion alongside performance and power. Hardware solutions alone will not be sufficient to mitigate future error rates. The group investigates how to "marry" innovative hardware and software ideas for resilience.
Summary
Exascale supercomputer components are increasing in size and complexity, resulting in more failures, while scaling at the end of the Moore's Law era brings shorter transistor lifetimes and greater variability. Reliability is therefore becoming a first-class system design criterion alongside performance and power. Hardware solutions alone will not be sufficient to mitigate future error rates. In this research line, the group will investigate how to "marry" innovative hardware and software ideas for resilience. Areas of activity include operating below safe voltage levels to save energy, leveraging task-based programming models, and using fault prediction for proactive fault tolerance.
Objectives
- Order-of-magnitude improvement in total system reliability, from supercomputers to wearables
- Resilient circuit/architecture design
- Silent Data Corruption mitigation, processor datapath resilience
- Leveraging the OmpSs programming model for fault tolerance