Resilience In Distributed Systems

Our team works on multiple projects related to high-performance computing, deep learning and blockchain technology. We are mostly interested in scaling all these systems by several orders of magnitude while maintaining a high level of reliability, robustness and security. To reach these goals, we develop middleware that tolerates failures and recovers gracefully, as well as monitoring systems and performance models to measure, analyze and predict the resilience of the target systems.

Objectives

Our objective is to develop reliable and robust systems capable of scaling orders of magnitude beyond the current state of the art. We do research in several domains:

  • Resilience for extreme-scale high-performance computing
  • Performance and resilience modeling for deep learning
  • Reliability and robustness analysis for next-generation blockchain technology