What question did this study set out to answer?

This research aims to explore how large language models, specifically ChatGPT, can enhance voice user interface (VUI) testing.

May 8, 2026

Open Access

Using large language models to test voice user interfaces

Key Points

The study focuses on optimizing the prompt and the interaction strategy used with ChatGPT.
Experiments compare specialized models and ChatGPT at generating paraphrases for voice commands.
The paraphrases are evaluated by how well they reveal bugs in VUIs.
Optimized use of ChatGPT achieves new state-of-the-art performance in VUI testing, measured by the number of correct and bug-revealing paraphrases.
Adding the generated paraphrases to the skills' Voice Interaction Models fixes some bugs, but many remain and some new bugs are introduced.
The results call for specialized approaches to fixing bugs in voice user interfaces.

Abstract

Voice-based virtual assistants enable hands-free operation, allowing users to perform tasks, access information, and control smart home devices through simple voice commands. Their growing ubiquity in smartphones, smart speakers, and other devices has led to a proliferation of apps that take advantage of a Voice User Interface (VUI). VUI testing is far from trivial because of the wide variability in human speech (e.g., different accents, dialects, speech patterns) and because users can express the same command in numerous ways, using different but semantically equivalent wordings and phrases. For this reason, specialized techniques have been proposed to support VUI testing. The basic idea behind these approaches is to generate paraphrases of the voice commands that developers have implemented support for in the VUI. Preliminary results from a recent study suggest that specialized models can outperform a general-purpose LLM (ChatGPT). However, that study adopted a simple prompt and interaction strategy with ChatGPT, so it is still unknown whether optimizing the LLM usage yields better results. In this paper, we thoroughly study to what extent LLMs (ChatGPT, specifically) can be adopted to test VUIs, focusing on optimizing the prompt and the interaction with the model. Our results show that an optimized use of LLMs achieves new state-of-the-art performance for VUI testing in terms of the number of correct and bug-revealing paraphrases. While introducing the generated paraphrases into the Voice Interaction Models of the skills makes it possible to fix some bugs, we observe that many bugs remain, and some are even introduced by the generated paraphrases. Our results call for specialized approaches for fixing bugs in VUIs.
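
The paraphrase-based testing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' tooling: `find_bug_revealing`, `toy_paraphrase`, `toy_resolve`, and the `LightsOnIntent` name are all hypothetical, the paraphrase generator is a canned lookup standing in for an LLM or a specialized model, and the VUI is a toy exact-match resolver. The sketch also assumes every generated paraphrase is correct (semantically equivalent to its seed), which in practice would need to be validated separately.

```python
from typing import Callable, Dict, List, Optional


def find_bug_revealing(
    seeds: Dict[str, str],
    paraphrase: Callable[[str], List[str]],
    resolve_intent: Callable[[str], Optional[str]],
) -> List[dict]:
    """Return paraphrases that the VUI fails to map to the expected intent.

    seeds maps each supported voice command to the intent it should trigger.
    paraphrase and resolve_intent are hypothetical stand-ins for a paraphrase
    generator and the VUI under test.
    """
    findings = []
    for seed, expected in seeds.items():
        for candidate in paraphrase(seed):
            actual = resolve_intent(candidate)
            if actual != expected:
                # The paraphrase is assumed correct, so a mismatch is a bug.
                findings.append(
                    {
                        "seed": seed,
                        "paraphrase": candidate,
                        "expected": expected,
                        "actual": actual,
                    }
                )
    return findings


# Toy paraphrase generator: a canned lookup (assumption, not a real model).
CANNED = {
    "turn on the lights": ["switch on the lights", "turn the lights on"],
}


def toy_paraphrase(utterance: str) -> List[str]:
    return CANNED.get(utterance, [])


# Toy VUI: resolves only the exact sample utterances it was configured with.
SAMPLES = {
    "turn on the lights": "LightsOnIntent",
    "turn the lights on": "LightsOnIntent",
}


def toy_resolve(utterance: str) -> Optional[str]:
    return SAMPLES.get(utterance.lower().strip())


findings = find_bug_revealing(
    {"turn on the lights": "LightsOnIntent"}, toy_paraphrase, toy_resolve
)
for finding in findings:
    print(finding["paraphrase"], "->", finding["actual"])
```

In this toy setup, a paraphrase counts as bug-revealing when the resolver returns no intent or a different intent than the seed command. Adding such a paraphrase to the resolver's sample utterances would fix that particular mismatch, but, as the abstract notes for real skills, doing so does not guarantee that other bugs disappear or that no new ones appear.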

Authors

Emanuela Guglielmi

Angelica Spina

Gabriele Bavota

Journals

Empirical Software Engineering
