Study type: Systematic review

September 16, 2025 (Open Access)

Charting the evolution of artificial intelligence mental health chatbots from rule‐based systems to large language models: a systematic review

Key Points

LLM-based chatbots surged to 45% of new studies in 2024, indicating a significant shift in AI mental health tools.
Only 16% of LLM studies underwent clinical efficacy testing, exposing a critical validation gap in mental health interventions.
The three-tier evaluation framework emphasizes technical validation, user engagement, and clinical efficacy, guiding future AI standards.
Discrepancies between marketed claims and actual AI architectures raise ethical concerns about safety and transparency in mental health applications.

Abstract

The rapid evolution of artificial intelligence (AI) chatbots in mental health care presents a fragmented landscape with variable clinical evidence and evaluation rigor. This systematic review of 160 studies (2020-2024) classifies chatbot architectures - rule-based, machine learning-based, and large language model (LLM)-based - and proposes a three-tier evaluation framework: foundational bench testing (technical validation), pilot feasibility testing (user engagement), and clinical efficacy testing (symptom reduction). While rule-based systems dominated until 2023, LLM-based chatbots surged to 45% of new studies in 2024. However, only 16% of LLM studies underwent clinical efficacy testing, with most (77%) still in early validation. Overall, only 47% of studies focused on clinical efficacy testing, exposing a critical gap in robust validation of therapeutic benefit. Discrepancies emerged between marketed claims ("AI-powered") and actual AI architectures, with many interventions relying on simple rule-based scripts. LLM-based chatbots are increasingly studied for emotional support and psychoeducation, yet they pose unique ethical concerns, including incorrect responses, privacy risks, and unverified therapeutic effects. Despite their generative capabilities, LLMs remain largely untested in high-stakes mental health contexts. This paper emphasizes the need for standardized evaluation and benchmarking aligned with medical AI certification to ensure safe, transparent and ethical deployment. The proposed framework enables clearer distinctions between technical novelty and clinical efficacy, offering clinicians, researchers and regulators ordered steps to guide future standards and benchmarks. To ensure that AI chatbots enhance mental health care, future research must prioritize rigorous clinical efficacy trials, transparent architecture reporting, and evaluations that reflect real-world impact rather than the well-known potential.

Authors

Yining Hua

Steve Siddals

Zilin Ma

Journals

World Psychiatry

Institutions

Harvard University

New York University

Beth Israel Deaconess Medical Center
