The block incomplete LU preconditioner with zero fill-in, BILU(0), is widely adopted for solving large-scale block-sparse linear systems arising from coupled partial differential equations (PDEs). However, the strong inherent data dependencies and high memory-bandwidth requirements of its block matrix operations pose significant challenges for efficient implementation on GPUs. Existing methods face a trade-off between parallelism and convergence, and efforts to leverage the block properties remain limited. In this work, we introduce HiDAP-BILU, a dependency-aware implementation of the BILU(0) preconditioner on GPUs. HiDAP-BILU employs a block-centric hierarchical parallelism strategy to maximize concurrency. A block-to-block waiting scheme preserves inter-block dependencies, while warp-level data distribution operations maintain intra-block dependencies. In line with this block-centric hierarchical parallelism, various architecture-aware optimizations are designed to minimize warp divergence and ensure coalesced memory access. Experimental results demonstrate that HiDAP-BILU achieves speedups of up to 4.77x for factorization and 3.10x for triangular solve over state-of-the-art general methods. Additionally, it provides an average speedup of 3.11x in end-to-end BILU-preconditioned iterative solvers.
Guo et al. studied this question.
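
As a point of reference for what BILU(0) computes, the following is a minimal sequential sketch in pure Python. It is not the GPU implementation described above, and it does not model the waiting scheme or the warp-level operations. The storage layout (a dictionary keyed by block row and block column) and all names (`bilu0`, `matmul`, `matsub`, `inverse`) are assumptions made for illustration. The loop over `k` shows the inter-block dependency the abstract refers to: block row `i` can only be finalized after every earlier block row `k` for which block `(i, k)` is present.

```python
# Sequential reference sketch of BILU(0); illustrative only, all names hypothetical.
# A block-sparse matrix is a dict {(block_row, block_col): dense block},
# where each dense block is a list of rows.

def matmul(X, Y):
    """Dense product of two small blocks."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def matsub(X, Y):
    """Elementwise difference X - Y of two blocks of equal shape."""
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def inverse(X):
    """Invert a small dense block by Gauss-Jordan elimination with partial pivoting."""
    n = len(X)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(X)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        if aug[p][c] == 0.0:
            raise ZeroDivisionError("singular diagonal block")
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(n):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def bilu0(A, nb):
    """Return F with the same block pattern as A.

    Blocks below the diagonal hold L (unit block diagonal implied);
    blocks on and above the diagonal hold U.
    """
    F = {key: [row[:] for row in blk] for key, blk in A.items()}
    for i in range(nb):
        # Lower-triangular blocks of row i, in ascending column order.
        for k in sorted(c for (r, c) in F if r == i and c < i):
            # L_ik = A_ik * U_kk^{-1}; requires block row k to be finished.
            F[i, k] = matmul(F[i, k], inverse(F[k, k]))
            for j in [c for (r, c) in F if r == k and c > k]:
                if (i, j) in F:  # zero fill-in: update only blocks already present
                    F[i, j] = matsub(F[i, j], matmul(F[i, k], F[k, j]))
    return F
```

For a block-tridiagonal matrix no fill-in occurs, so in that special case the factors returned by this sketch reproduce the matrix exactly; for general block patterns the product of the factors only approximates it.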