BigHPC

A Management Framework for Consolidated Big Data and HPC

Computer Science and Engineering

INESC TEC was a project partner

Funding

P2020, in co-promotion with UT Austin, 1.1M€

Duration

2020—2023

Description

BigHPC aimed to simplify the management of computing and storage resources in HPC infrastructures that support Big Data and parallel computing applications, through a novel framework that can be seamlessly integrated with existing HPC centers and software stacks.
The contributions of the project had a direct impact on science, industry, and society by accelerating scientific breakthroughs in different fields and increasing the competitiveness of companies through better data analysis and improved decision-support processes.

Scientific Advances

BigHPC designed and implemented a novel solution for monitoring and optimally managing the infrastructure, data, and applications of current and next-generation HPC data centers. The project also produced an innovative solution for efficiently managing parallel and Big Data workloads that:

- Combines novel monitoring, virtualization and software-defined storage components;
- Can cope with HPC’s infrastructural scale and heterogeneity;
- Efficiently supports different workload requirements while ensuring holistic performance and resource usage;
- Can be seamlessly integrated with existing HPC infrastructures and software stacks;
- Was validated with pilots running on both the MACC and TACC supercomputers.
The combination of these goals had not been explored by commercial solutions or academic work, and therefore required new research to achieve.

Impact

The research conducted in the project advanced the state of the art in resource management for HPC infrastructures. Below, we focus on the software-defined storage component, led by INESC TEC, and discuss its integration with the virtualization and monitoring components, led by the other BigHPC partners.

The project proposed a new solution, based on the Software-Defined Storage paradigm, that holistically controls all applications running on the HPC infrastructure and manages the storage resources they use, enabling better quality of service for HPC users. In more detail, this solution advanced the state of the art in two main ways:

- Through a novel data plane stage middleware that transparently intercepts storage I/O requests from applications and applies mechanisms such as rate-limiting so that applications cannot saturate the shared storage resources of the HPC infrastructure (a minimal sketch of such a stage is shown after this list). Previous state-of-the-art solutions were either not generally applicable or required manual user intervention to rate-limit applications exhibiting undesired I/O patterns.

- Through a novel distributed control plane component that coordinates all data plane stages intercepting I/O requests from applications, in order to enforce global quality-of-service policies (e.g., I/O fairness and/or priority across jobs from different users) across the whole cluster (see the second sketch after this list). This coordination is made possible because the control plane has holistic visibility of the cluster's resources and of the applications deployed on it, through its integration with BigHPC's monitoring and virtualization components (two other contributions of the project). Previous state-of-the-art solutions were unable to automatically and holistically ensure quality-of-service policies across the full HPC infrastructure, and their centralized designs limited them to small-scale clusters (i.e., tens to hundreds of nodes).
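
To make the data plane contribution concrete, the following is a minimal Python sketch, not the project's actual implementation, of a rate-limiting stage built around a token bucket. The names (TokenBucket, DataPlaneStage), the byte-based accounting, and the blocking throttle call are illustrative assumptions about how an interception layer could keep one application from saturating shared storage:

```python
# Illustrative sketch only (assumed names and structure, not BigHPC's actual code).
import threading
import time


class TokenBucket:
    """Caps I/O throughput at rate_bytes_per_s, allowing bursts up to capacity bytes."""

    def __init__(self, rate_bytes_per_s: float, capacity: float):
        self.rate = rate_bytes_per_s
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def set_rate(self, rate_bytes_per_s: float) -> None:
        # Invoked when the control plane pushes a new limit for this application.
        with self.lock:
            self.rate = rate_bytes_per_s

    def throttle(self, nbytes: int) -> None:
        # Refill tokens for the elapsed time, charge this request, and sleep off any debt.
        with self.lock:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last_refill) * self.rate)
            self.last_refill = now
            self.tokens -= nbytes
            deficit = -self.tokens
        if deficit > 0:
            time.sleep(deficit / self.rate)


class DataPlaneStage:
    """Wraps an application's storage writes and enforces the current rate limit."""

    def __init__(self, bucket: TokenBucket, backend_write):
        self.bucket = bucket
        self.backend_write = backend_write  # the original write path being intercepted

    def write(self, data: bytes) -> int:
        self.bucket.throttle(len(data))  # block until the request fits within the limit
        return self.backend_write(data)
```

A stage of this kind could be injected transparently, for instance by interposing on the application's I/O library, so that applications need no modification, which matches the transparency requirement described above.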
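
Under the same assumptions, the sketch below illustrates the control plane's role: given cluster-wide visibility (represented here by plain priority weights standing in for the monitoring component), it splits a shared storage bandwidth budget across jobs and pushes the resulting per-stage limits to the data plane stages. The StageHandle structure, the enforce_policy function, and the even split among a job's stages are hypothetical simplifications:

```python
# Illustrative sketch only; names and the weighted-fair policy are assumptions.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class StageHandle:
    job_id: str
    set_rate: Callable[[float], None]  # e.g., the set_rate of a TokenBucket on some node


def enforce_policy(stages: List[StageHandle],
                   priorities: Dict[str, float],
                   cluster_budget_bytes_per_s: float) -> Dict[str, float]:
    """Split the shared storage bandwidth across jobs in proportion to their priority."""
    # 1. Group stages by job (one job may run stages on many compute nodes).
    jobs: Dict[str, List[StageHandle]] = {}
    for stage in stages:
        jobs.setdefault(stage.job_id, []).append(stage)

    # 2. Compute each job's share of the cluster-wide budget.
    total_weight = sum(priorities.get(job, 1.0) for job in jobs)
    allocations = {
        job: cluster_budget_bytes_per_s * priorities.get(job, 1.0) / total_weight
        for job in jobs
    }

    # 3. Push each job's share, divided evenly over its stages, down to the data plane.
    for job, job_stages in jobs.items():
        per_stage = allocations[job] / len(job_stages)
        for stage in job_stages:
            stage.set_rate(per_stage)
    return allocations
```

The project's actual control plane is distributed and driven by the monitoring component, rather than the static weights and single loop shown here, which is how it avoids the small-cluster limits of centralized designs noted above.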

As highlights, the aforementioned solution resulted in scientific publications at two top scientific venues (CCGrid and USENIX FAST) and three open-source prototypes.
