What question did this study set out to answer?

The study aims to develop a new test set for evaluating simultaneous speech translation models, with a focus on keeping word and phrase order consistent with the source.

April 17, 2026 | Open Access

Rethinking Evaluation in Simultaneous Speech Translation: A Case for Monotonic Test Sets

Key Points

Constructed a new test set designed specifically for simultaneous translation, keeping word and phrase order consistent with the source.
Verified the quality of translations using professional interpreters.
Examined performance across three language pairs with varying word order similarities.
Existing test data was found to underestimate model performance.
The new test set, simul-tst-COMMON, provides a better evaluation framework.
Analysis showed that adaptive policies mimic human interpreter behavior more closely than traditional methods.

Abstract

Overcoming the trade-off between quality and latency is a challenge in simultaneous speech translation. Previous work has commonly segmented the source sentence or aligned target sentences with the source’s syntax as closely as possible, enabling faster translation while maintaining quality. A major limitation of these studies, however, is their reliance on existing translation test data, which often contains reordering and is unsuitable for low-latency simultaneous settings. Some studies instead use interpretation data transcribed from interpreters, which is also problematic because of translation errors and omissions. Neither kind of data is adequate for fully evaluating simultaneous models. In this work, we present the construction, verification, and analysis of a new test set designed specifically for simultaneous settings, with a focus on keeping word and phrase order consistent with the source. The test set, built by leveraging large language models and verified for quality by professional interpreters, comprises three language pairs that represent different levels of word order similarity to the source. It provides an interpreter-grounded perspective and shows empirically the ideal level of monotonicity along with other sentence style characteristics, including syntactic simplicity and sentence length. It also reveals the capabilities and limitations of LLMs in monotonic translation. Experiments show that existing test data tends to underestimate a model’s performance, while the proposed test set, simul-tst-COMMON, offers a more appropriate evaluation of simultaneous models. Moreover, the quality gap between wait-k and Local Agreement suggests that the adaptive policy more closely resembles the monotonic translation behavior of human interpreters. Finally, the analysis highlights the limitations of current metrics, which may not be fully suitable for evaluating simultaneous tasks.
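To make the contrast between the two policies named in the abstract concrete, here is a minimal Python sketch of how wait-k and Local Agreement are commonly described. It is an illustration under simplifying assumptions, not the authors' implementation: the function names are hypothetical, and the source is modeled as a sequence of tokens, whereas a speech system would read audio chunks.

```python
from typing import List, Sequence


def wait_k_schedule(num_source: int, num_target: int, k: int) -> List[int]:
    """Fixed policy: number of source tokens read before each target token is written.

    The t-th target token (0-indexed) is written after min(k + t, num_source)
    source tokens, regardless of what the source actually says.
    """
    return [min(k + t, num_source) for t in range(num_target)]


def local_agreement(prev_hyp: Sequence[str], curr_hyp: Sequence[str]) -> List[str]:
    """Adaptive policy: commit only the prefix on which two consecutive hypotheses agree.

    prev_hyp and curr_hyp are the translations produced from two successive,
    growing portions of the source (assumed inputs for this sketch).
    """
    stable: List[str] = []
    for prev_tok, curr_tok in zip(prev_hyp, curr_hyp):
        if prev_tok != curr_tok:
            break
        stable.append(prev_tok)
    return stable


if __name__ == "__main__":
    print(wait_k_schedule(num_source=5, num_target=6, k=2))
    print(local_agreement(["I", "saw", "a"], ["I", "saw", "the", "cat"]))
```

The wait-k schedule depends only on k and the token positions, while Local Agreement emits output only where the model's successive hypotheses have stabilized, which is why it is referred to as the adaptive policy.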


Cite This Study

Makinae et al. studied this question.

synapsesocial.com/papers/69e1cf985cdc762e9d8588e3 https://doi.org/10.1162/coli.a.622
