This interim report presents the status and achievements of task T3.2 within WP3. We first discuss three different strategies for transitioning from CUDA to programming models better suited to achieving functional and performance portability across all families of accelerators.

• In QUANTUM ESPRESSO, developers have adopted OpenMP and OpenACC as alternative back ends for offloading, gradually phasing out the previous CUDA Fortran implementation. Several of the proposed interfaces have been adopted throughout the high-level code layer, so that a single source is maintained and remains transparent with respect to the two back ends.
• YAMBO developers present the use of the deviceXlib MAX library to integrate multiple offload back ends inside large Fortran codes.
• In BIGDFT, crucial kernels have been implemented using the SYCL C++ programming model, with significant results in both portability and performance.

Another part of this report is dedicated to the work done in T3.2 on the FFTXlib of QUANTUM ESPRESSO. The library has been successfully offloaded with the OpenMP back end. To improve the performance and scalability of this port, T3.2 has taken charge of implementing a batched version of the library for the Cray/HIP toolchain. To understand the intrinsic scalability limits of this kernel, we are also performing a comparative analysis of FFTXlib performance against analogous FFT libraries for accelerators (CuFFTMP, heFFTe). We present the work done to design and implement an optimised band parallelization scheme that can efficiently work alongside FFTXlib parallelism in QUANTUM ESPRESSO, overcoming its current limits in strong and weak scalability. As an appendix, we also present the work done by the BIGDFT group to streamline and optimize the benchmarking process.

We conclude with some final remarks and point out two main topics of upcoming activities in T3.2: data exchange within workflows, and the improvement of checkpointing and code resilience.
Delugas et al. (Sun,) studied this question.