HPC environments are complex, and using a framework for developing distributed HPC workflows greatly reduces development costs while increasing the confidence, robustness, and portability of the resulting systems.
Summary
HPC (High Performance Computing) software is essential in many science and engineering fields. However, developing such systems is not only expensive but also difficult to maintain and extend. BSC has a research line that supports the scientific community during the application development process on parallel and distributed systems, providing approaches that account for the common, unavoidable problems that arise naturally in HPC environments, such as compute node failures, I/O bottlenecks, heavy workloads, and checkpointing requirements.
The aim of this research line is to provide reliable strategies for dealing with the problems inherent in handling the process workflows that arise during a simulation when hundreds or even thousands of computational resources are used within an HPC environment.
Specifically, the research focuses on providing HPC applications with a set of tools, methodologies, and guidelines that allow them to reach a high level of maturity by:
- Identifying the process workflows
- Providing solutions to manage them
Objectives
- Develop fault-tolerance strategies that cope with node failures, numerical instabilities in simulations, or even full system crashes by introducing task balancers, checkpointing, on-the-fly reconfiguration, and similar mechanisms (see the checkpointing sketch after this list).
- Maximise the use of the available resources for as much of the time as possible by efficiently matching computational resources to process requirements (see the scheduling sketch after this list).
- Improve the methodology for building HPC software, with the aim of producing applications grounded in quality and reliability.
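As an illustration of the fault-tolerance objective, the sketch below shows application-level checkpointing in plain Python: the simulation periodically persists its state so that, after a node failure or crash, it can resume from the most recent checkpoint instead of restarting from scratch. The file name, interval, and run_simulation loop are hypothetical placeholders, not part of any BSC tool.

```python
# Minimal checkpointing sketch; the state layout and file names are
# illustrative assumptions, not the research line's actual tooling.
import os
import pickle

CHECKPOINT_FILE = "checkpoint.pkl"
CHECKPOINT_EVERY = 100  # steps between checkpoints

def save_checkpoint(step, state):
    """Write the current step and state atomically to disk."""
    tmp = CHECKPOINT_FILE + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, CHECKPOINT_FILE)  # atomic rename avoids torn checkpoints

def load_checkpoint():
    """Resume from the last checkpoint if one exists, else start fresh."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE, "rb") as f:
            data = pickle.load(f)
        return data["step"], data["state"]
    return 0, {"value": 0.0}

def run_simulation(total_steps=1000):
    step, state = load_checkpoint()
    while step < total_steps:
        state["value"] += 1.0  # stand-in for one simulation step
        step += 1
        if step % CHECKPOINT_EVERY == 0:
            save_checkpoint(step, state)
    return state

if __name__ == "__main__":
    print(run_simulation())
```

Writing to a temporary file and renaming it keeps the checkpoint consistent even if the job is killed mid-write, which matters when a batch scheduler can preempt or a node can fail at any moment.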
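As a sketch of the resource-matching objective, the following first-fit assignment pairs process requirements (cores, memory) with node capacities. Production HPC schedulers are far more sophisticated; the Node and Task structures and the greedy policy here are illustrative assumptions only.

```python
# Minimal sketch of matching process requirements to node capacities.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    free_cores: int
    free_mem_gb: float
    assigned: list = field(default_factory=list)

@dataclass
class Task:
    name: str
    cores: int
    mem_gb: float

def schedule(tasks, nodes):
    """Assign each task to the first node with enough free cores and memory."""
    unplaced = []
    for task in tasks:
        for node in nodes:
            if node.free_cores >= task.cores and node.free_mem_gb >= task.mem_gb:
                node.free_cores -= task.cores
                node.free_mem_gb -= task.mem_gb
                node.assigned.append(task.name)
                break
        else:
            unplaced.append(task.name)  # no node could host this task
    return unplaced

nodes = [Node("node0", 48, 192.0), Node("node1", 48, 192.0)]
tasks = [Task("mesh", 16, 64.0), Task("solver", 48, 128.0), Task("post", 8, 16.0)]
print("unplaced:", schedule(tasks, nodes))
for n in nodes:
    print(n.name, "->", n.assigned)
```

Even this simple policy makes the trade-off explicit: packing tasks onto as few nodes as possible keeps utilisation high, while leaving headroom per node reduces contention for memory and I/O.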