What type of study is this?

This is a Experimental Study study.

September 27, 2025Open Access

Testing Chatbot Systems using Agentic AI Approach

Key Points

The agentic evaluation methodology enhances chatbot performance assessment by uncovering edge cases and vulnerabilities.
Evaluation time is reduced significantly by 60-70% compared to conventional methods.
Autonomous AI agents simulate diverse user interactions, providing insights into chatbot capabilities.
The framework aligns theoretical performance indicators with practical deployment challenges of large language models.

Abstract

As large language models (LLMs) become increasingly integrated into real-world applications, robust and scalable evaluation methods are essential to ensure their reliability, safety, and effectiveness. This work introduces an innovative evaluation framework grounded in an agentic AI simulation approach, designed to overcome the limitations of traditional testing methodologies in newly developed chatbots. Unlike conventional methods that depend on static benchmarks or human evaluators, our approach employs autonomous AI agents capable of simulating a wide spectrum of user interactions. Within a controlled multi-agent environment, these evaluator agents interact with the target chatbot using natural language queries specifically designed to probe various functional capabilities, identify edge cases, and uncover potential failure modes. The agentic evaluation methodology systematically assesses the performance of chatbots in multiple dimensions, including task completion efficiency, contextual understanding in dynamic conversations, and adherence to safety and ethical guidelines. By incorporating recent advances in agentic metrics and automated scenario generation, our system produces detailed data-driven performance reports that capture both strengths and vulnerabilities in chatbot behavior. Preliminary results show that this approach not only reveals significantly more edge cases than conventional methods, but also reduces overall evaluation time by approximately 60-70 percent. This work contributes to a scalable, standardized testing paradigm that better aligns theoretical performance indicators with the practical challenges of deploying LLMs in real-world environments.

Connected Papers

Building similarity graph...

Analyzing shared references across papers

Discussion

Authors

Obaid Sajjad

Wajih ur Rehman

Muhammad Numan

Journals

International Journal of Innovations in Science and Technology

Actions

Institutions

Pakistan Institute of Engineering and Applied Sciences

References and Citations

Connected Papers

Building similarity graph...

Analyzing shared references across papers

Testing Chatbot Systems using Agentic AI Approach

Key Points

Abstract

Citation Network

Connected Papers

Discussion

Authors

Journals

Actions

Institutions

References and Citations

Citation Network

Connected Papers

Discussion

Cite this study