Viewport prediction is a key component of tile-based 360° video streaming. Existing viewport prediction models based on Long Short-Term Memory (LSTM) networks or Transformers typically output a single deterministic future trajectory, which fails to capture the inherent randomness of viewing behavior. Moreover, when encoding trajectory features, such models often map trajectory coordinates directly into a high-dimensional space and neglect the spatial information inherent in the coordinates themselves. They are also limited in their ability to capture cross-modal relationships between visual and trajectory features. To address these issues, this paper proposes DiffVP, a diffusion model for viewport prediction in 360° videos. Conditioned on historical viewing trajectories and video saliency maps, DiffVP uses Denoising Diffusion Implicit Models (DDIMs) to model future viewing trajectories as probability distributions, generating diverse and plausible predictions. In the denoising network, DiffVP employs Explicit Coordinate-Time Encoding (ECTE) to model the temporal dependencies of trajectories and the spatial relationships among coordinates, and a Coordinate-Aware Saliency Features Fusion (CASF) module to achieve cross-modal alignment and interactive fusion of saliency and trajectory features. Experimental results on three public datasets demonstrate that DiffVP achieves the best accuracy for 2–5 s viewport prediction without sacrificing short-term (<1 s) prediction performance.
Zheng et al. (Mon,) studied this question.
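To illustrate how DDIM sampling can yield several distinct future trajectories from the same conditioning, the following is a minimal sketch of the standard deterministic DDIM update (eta = 0), not DiffVP's actual implementation. The names `ddim_step`, `sample_trajectory`, and `toy_denoiser`, the noise schedule `alpha_bars`, and the flattened (yaw, pitch) layout are all assumptions made for illustration; the real denoising network with ECTE and CASF is replaced by a placeholder that predicts zero noise.

```python
import math
import random


def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0), applied element-wise."""
    out = []
    for x, eps in zip(x_t, eps_pred):
        # Estimate the clean sample from the current noisy value.
        x0_pred = (x - math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_bar_t)
        # Move the estimate to the previous (less noisy) step.
        out.append(math.sqrt(alpha_bar_prev) * x0_pred
                   + math.sqrt(1.0 - alpha_bar_prev) * eps)
    return out


def sample_trajectory(denoiser, cond, horizon, alpha_bars, rng):
    """Draw one future trajectory of `horizon` points, flattened to
    2 * horizon floats (assumed layout: yaw, pitch, yaw, pitch, ...)."""
    x = [rng.gauss(0.0, 1.0) for _ in range(2 * horizon)]
    for t in range(len(alpha_bars) - 1, -1, -1):
        # alpha_bar is taken as 1.0 (no noise) once the last step is done.
        alpha_bar_prev = alpha_bars[t - 1] if t > 0 else 1.0
        eps_pred = denoiser(x, t, cond)
        x = ddim_step(x, eps_pred, alpha_bars[t], alpha_bar_prev)
    return x


def toy_denoiser(x, t, cond):
    # Hypothetical placeholder for the conditional denoising network:
    # it ignores the conditioning and always predicts zero noise.
    return [0.0] * len(x)


rng = random.Random(0)
alpha_bars = [0.99, 0.9, 0.5, 0.1]  # assumed cumulative noise schedule
samples = [sample_trajectory(toy_denoiser, None, 5, alpha_bars, rng)
           for _ in range(3)]
```

Because each call starts from a different Gaussian draw, repeated sampling under the same conditioning gives different trajectories, which is the property the abstract contrasts with single-output LSTM and Transformer predictors.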