What question did this study set out to answer?

This study aims to determine if vision-language models can predict momentary suicidal ideation from smartphone screenshots.

April 10, 2026 (Open Access)

Predicting Momentary Suicidal Ideation From Smartphone Screenshots Using Vision-Language Models: Prospective Machine Learning Study

Key Points

This study tests whether vision-language models can predict momentary suicidal ideation from smartphone screenshots.
Participants completed ecological momentary assessments (EMAs) while smartphone screenshots were captured every 5 seconds during active use.
Two vision-language models were fine-tuned to predict suicidal ideation from the screenshots taken before assessments.
Performance was evaluated with both temporal and subject holdouts.
Temporal holdout models achieved strong discrimination at the EMA level (AUC=0.83; AUPRC=0.77).
Image-based models outperformed text-only models (AUC=0.83 vs 0.79).
Subject holdout generalization was near chance, with AUC≈0.50, while a lexical method achieved AUC=0.62.
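
The difference between the two evaluation schemes can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the record layout (`subject`, `time`, `si`), the 80/20 fractions, and both function names are hypothetical. A temporal holdout tests on each participant's later assessments, so the model has already seen that person; a subject holdout tests on participants the model has never seen.

```python
import random
from collections import defaultdict


def temporal_holdout(records, train_frac=0.8):
    """Per participant, train on the earliest EMAs and test on the latest."""
    by_subject = defaultdict(list)
    for rec in records:
        by_subject[rec["subject"]].append(rec)
    train, test = [], []
    for recs in by_subject.values():
        recs = sorted(recs, key=lambda r: r["time"])
        cut = int(len(recs) * train_frac)  # assumed split point
        train.extend(recs[:cut])
        test.extend(recs[cut:])
    return train, test


def subject_holdout(records, test_frac=0.2, seed=0):
    """Hold out whole participants so no test subject appears in training."""
    subjects = sorted({r["subject"] for r in records})
    random.Random(seed).shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_frac))
    test_subjects = set(subjects[:n_test])
    train = [r for r in records if r["subject"] not in test_subjects]
    test = [r for r in records if r["subject"] in test_subjects]
    return train, test


# Toy data: 5 hypothetical participants with 10 EMAs each.
records = [
    {"subject": s, "time": t, "si": 0}
    for s in ("a", "b", "c", "d", "e")
    for t in range(10)
]
temp_train, temp_test = temporal_holdout(records)
subj_train, subj_test = subject_holdout(records)
```

In the temporal split every participant contributes to both sets, which is the setting the key points describe as personalized; in the subject split the two sets share no participants.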

Abstract

Background: Passive smartphone sensing shows promise for suicide prevention, but behavioral metadata (GPS, screen time, and accelerometry) often lacks the contextual information needed to detect acute psychological distress. Analyzing what people actually see, read, and type on their phones, rather than just usage patterns, may provide more proximal signals of risk.

Objective: This study aimed to test whether vision-language models (VLMs) applied to passively captured smartphone screenshots can predict momentary suicidal ideation (SI).

Methods: Seventy-nine adults with past-month suicidal thoughts or behaviors completed ecological momentary assessments (EMAs) over 28 days while screenshots were captured every 5 seconds during active phone use. We fine-tuned open-source VLMs (Qwen2.5-VL, Alibaba Cloud; LFM2-VL, Liquid AI) and text-only models (Qwen3, Alibaba Cloud) to predict SI from screenshots captured in the 2 hours preceding each EMA. We evaluated performance with temporal and subject holdouts.

Results: The analytic sample comprised 2.5 million screenshots from 70 participants. Temporal holdout models achieved strong discrimination at the EMA level (AUC=0.83; AUPRC=0.77), with image-based models outperforming text-only models (AUC=0.83 vs 0.79; 95% CI 0.003-0.07). Subject holdout generalization was near chance (AUC≈0.50), though a simple lexical screening method retained modest discrimination (AUC=0.62). Smaller models performed comparably to larger models, supporting feasible on-device deployment.

Conclusions: Screen content predicts short-term SI with clinically meaningful accuracy when models are personalized but does not generalize across individuals. These findings support a 2-stage clinical architecture: coarse lexical screening for new patients, followed by personalized VLM-based monitoring after a calibration period. On-device inference may enable privacy-preserving deployment.
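
The 2-hour lookback described in the Methods can be sketched as a window query over sorted capture times. This is an illustrative sketch only: the function name `screenshots_before`, the half-open interval (window start included, EMA time excluded), and the toy timestamps are assumptions, since the abstract does not specify boundary handling.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def screenshots_before(ema_time, shot_times, window=timedelta(hours=2)):
    """Return capture times in [ema_time - window, ema_time).

    shot_times must be sorted in ascending order.
    """
    lo = bisect_left(shot_times, ema_time - window)
    hi = bisect_left(shot_times, ema_time)
    return shot_times[lo:hi]


# Toy example: captures 150, 119, 60, and 1 minute before the EMA,
# plus one capture 5 minutes after it.
ema = datetime(2026, 1, 1, 12, 0, 0)
shots = [ema - timedelta(minutes=m) for m in (150, 119, 60, 1)]
shots.append(ema + timedelta(minutes=5))

selected = screenshots_before(ema, shots)
```

Here only the three captures inside the window are kept; the one taken 150 minutes earlier and the one taken after the assessment are excluded.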

Authors

Ross Jacobucci

Wenpei Shao

Veronika Kobrinsky

Journals

JMIR Mental Health
