This talk will present recent advances in extending OmpSs-2 to distributed-memory systems, highlighting three contributions and the associated challenges. First, OmpSs-2@Cluster employs a common address space and weak accesses to support concurrent task creation and dataflow execution across nodes. Achieving good performance and scalability on 16 to 32 nodes requires detailed performance analysis together with a set of optimizations and runtime techniques, which I will outline in the talk. Second, I will describe how task offloading, in combination with BSC's Dynamic Load Balancing (DLB), enables OmpSs-2@Cluster to mitigate load imbalance in MPI+OmpSs-2 programs with minimal application changes. Third, I will explain how the runtime can exploit the iterative structure of certain task dependency graphs to precompute communications and execute iterative regions efficiently, yielding performance and scalability comparable to state-of-the-art asynchronous MPI+X. Together, these results indicate that distributed tasking can combine productivity, adaptability, and high performance in modern HPC applications.
Paul Carpenter
Omar Shaaban
Juliette Fournis d'Albiat
Universitat Politècnica de Catalunya
Barcelona Supercomputing Center
www.synapsesocial.com/papers/69db375f4fe01fead37c55e7 — DOI: https://doi.org/10.4230/oasics.parma-ditam.2026.1