Most evaluation of conversational AI relies on short, prompt-based tests that fail to reflect how people use these systems across diverse, real-world situations. Such tests do not capture the demands of extended interaction, shifting user intent, or the cumulative effects of cognitive and emotional input over time. This paper introduces the Argo AI Testing Protocol (the Argo Protocol), a structured approach for evaluating AI systems within the User Interaction Space, defined as the full set of observable outputs and interactions available to a user. The Protocol proposes Sustained Multi-Axis Load Testing, a method for applying controlled stress along six axes simultaneously: the length of the interaction over time, increasing cognitive complexity, the user's emotional input, the model's pattern-state stability, the computational resources available to the model, and the time allowed for each response. Rather than prescribing fixed procedures, durations, or compliance requirements, the Argo Protocol provides a conceptual framework and diagnostic vocabulary that developers can adapt to their own models, environments, and constraints. The aim is not to define a standard, although the Protocol may serve as a starting point for one should the field require a formalised approach in the future. Evaluation is grounded in the observable behaviour of the User Interaction Space under sustained, multi-axis load. The Argo Protocol thus suggests a viable route to reproducible, real-world testing that better reflects how AI systems are actually used.
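The six load axes named above could be represented as a simple configuration object. The following is a minimal sketch only: the abstract does not prescribe procedures, so the class name `LoadProfile`, the field names, the 0.0 to 1.0 scale, and the linear `ramp` schedule are all hypothetical choices made for illustration, not part of the Argo Protocol.

```python
from dataclasses import dataclass, fields


# Hypothetical axis names paraphrasing the six load axes in the abstract.
# The 0.0-1.0 intensity scale is an assumption made for illustration.
@dataclass(frozen=True)
class LoadProfile:
    interaction_length: float    # how far the interaction is extended over time
    cognitive_complexity: float  # cognitive complexity of the user's input
    emotional_input: float       # intensity of the user's emotional input
    pattern_state_stress: float  # pressure on the model's pattern-state stability
    compute_restriction: float   # limits on computational resources
    time_pressure: float         # limits on time allowed per response

    def __post_init__(self):
        # Reject any axis value outside the assumed [0, 1] range.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")


def ramp(start: LoadProfile, end: LoadProfile, steps: int) -> list[LoadProfile]:
    """Linearly interpolate every axis together, so load changes on all axes at once."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    names = [f.name for f in fields(LoadProfile)]
    schedule = []
    for i in range(steps):
        t = i / (steps - 1)
        values = {}
        for name in names:
            a, b = getattr(start, name), getattr(end, name)
            # Clamp to guard against floating-point drift past the bounds.
            values[name] = min(1.0, max(0.0, a + t * (b - a)))
        schedule.append(LoadProfile(**values))
    return schedule
```

Under these assumptions, `ramp(LoadProfile(0, 0, 0, 0, 0, 0), LoadProfile(1, 1, 1, 1, 1, 1), 5)` would yield five profiles in which all six axes rise in step. A developer adapting the idea could just as well vary the axes independently or on different schedules.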
William Argo
William Argo studied this question.
www.synapsesocial.com/papers/69df2c88e4eeef8a2a6b1a8f — DOI: https://doi.org/10.5281/zenodo.19560324