What question did this study set out to answer?

The aim is to improve the efficiency of the block incomplete LU (BILU(0)) preconditioner on GPUs by addressing data dependencies.

February 26, 2026 | Open Access

HiDAP-BILU: Hierarchical Dependency-Aware Parallelism for Block ILU Preconditioner on GPUs

Key Points

Introduced HiDAP-BILU, a dependency-aware approach to the block incomplete LU (BILU(0)) preconditioner on GPUs
Employed a block-centric hierarchical parallelism strategy to maximize concurrency
Implemented a block-to-block waiting scheme to preserve inter-block dependencies
Used warp-level data distribution operations to maintain intra-block dependencies
Designed architecture-aware optimizations to minimize warp divergence and ensure coalesced memory access
Achieved up to 4.77x speedup for factorization compared to state-of-the-art general methods
Attained up to 3.10x speedup for triangular solve compared to the same methods
Provided an average speedup of 3.11x in end-to-end BILU-preconditioned iterative solvers
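
The block-to-block waiting idea in the points above can be illustrated with a small CPU analogue. This summary does not describe the GPU mechanism itself, so the sketch below is only an assumption-laden illustration: it solves a block lower-triangular system with one Python thread per block row, and each row waits on a per-row ready flag before consuming the result of an earlier row. The function names (`block_lower_solve`, `matvec`, `solve2`), the dictionary storage, and the fixed 2x2 block size are all hypothetical choices made for brevity.

```python
import threading


def matvec(M, v):
    """Product of a small dense block (nested lists) with a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def solve2(M, v):
    """Solve a 2x2 system M y = v directly (assumes a nonzero determinant)."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [(d * v[0] - b * v[1]) / det, (-c * v[0] + a * v[1]) / det]


def block_lower_solve(L, b):
    """Solve L x = b where L = {(i, j): 2x2 block} with j <= i.

    One thread handles one block row. A row blocks on ready[j] until
    the row j it depends on has published its part of the solution.
    """
    n = len(b)
    x = [None] * n
    ready = [threading.Event() for _ in range(n)]

    def worker(i):
        rhs = list(b[i])
        for (r, j), blk in L.items():
            if r == i and j < i:
                ready[j].wait()  # wait for the producer block row j
                rhs = [p - q for p, q in zip(rhs, matvec(blk, x[j]))]
        x[i] = solve2(L[i, i], rhs)
        ready[i].set()  # publish: dependent rows may now proceed

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return x
```

Because every row waits only on rows with a smaller index, the waits cannot form a cycle. Rows with no unmet dependencies proceed immediately, which is the concurrency such a scheme is meant to expose.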

Abstract

The block incomplete LU (BILU(0)) preconditioner is widely adopted for solving large-scale block-sparse linear systems arising from coupled partial differential equations (PDEs). However, the strong inherent data dependencies and high memory bandwidth requirements of its block matrix operations make it difficult to implement efficiently on GPUs. Existing methods face a trade-off between parallelism and convergence, and efforts to exploit the block properties remain limited. In this work, we introduce HiDAP-BILU, a dependency-aware approach to the BILU(0) preconditioner on GPUs. HiDAP-BILU employs a block-centric hierarchical parallelism strategy to maximize concurrency. A block-to-block waiting scheme preserves inter-block dependencies, while warp-level data distribution operations maintain intra-block dependencies. In line with this block-centric hierarchical parallelism, several architecture-aware optimizations are designed to minimize warp divergence and ensure coalesced memory access. Experimental results demonstrate that HiDAP-BILU achieves up to 4.77x and 3.10x speedups over state-of-the-art general methods for factorization and triangular solve, respectively. Additionally, it provides an average speedup of 3.11x in end-to-end BILU-preconditioned iterative solvers.
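
To make the data dependencies mentioned in the abstract concrete, here is a minimal sequential sketch of a textbook block ILU(0) factorization. It is not the HiDAP-BILU implementation. The dictionary storage, the fixed 2x2 block size, and the names `bilu0`, `matmul`, `matsub`, and `inv2` are assumptions made for illustration.

```python
def matmul(X, Y):
    """Dense product of two small square blocks stored as nested lists."""
    n = len(X)
    return [[sum(X[r][k] * Y[k][c] for k in range(n)) for c in range(n)]
            for r in range(n)]


def matsub(X, Y):
    """Elementwise difference of two blocks."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def inv2(M):
    """Inverse of a 2x2 block (assumes a nonzero determinant)."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def bilu0(blocks, n):
    """Block ILU(0) on a dict {(i, j): 2x2 block} with n block rows.

    The dict is updated in place. Afterwards, entries with j < i hold L
    (an identity block diagonal is implied) and entries with j >= i
    hold U. No block outside the original sparsity pattern is created.
    """
    for i in range(n):
        lower = sorted(k for (r, k) in blocks if r == i and k < i)
        for k in lower:
            # L_ik = A_ik * inv(U_kk): block row k must already be final.
            blocks[i, k] = matmul(blocks[i, k], inv2(blocks[k, k]))
            for (r, j) in list(blocks):
                # Update only blocks already present in row i (zero fill-in).
                if r == k and j > k and (i, j) in blocks:
                    update = matmul(blocks[i, k], blocks[k, j])
                    blocks[i, j] = matsub(blocks[i, j], update)
    return blocks
```

Block row i cannot be finished until every block row k < i that it references is final. That ordering constraint is the inter-block dependency, while the small dense products and the inverse inside each step are the intra-block work.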
