March 3, 2026 · Open Access

Randomized controlled trials to evaluate diagnostic tests

Key Points

RCTs effectively evaluate the clinical utility and effectiveness of diagnostic tests and therapeutic interventions.
Using discordant subgroups can improve analysis accuracy in trials assessing diagnostic tests.
Design limitations in diagnostic RCTs can lead to misinterpretation of test impacts on patient outcomes.
Methodological rigor in trial design helps ensure a meaningful comparison between diagnostic test results.

Abstract

In medicine, randomized controlled trials (RCTs) are considered the gold standard for evaluating the effectiveness of interventions [1, 2]. Randomization, resulting in balanced groups, prevents selection bias and confounding, thereby allowing causal inference about the intervention being studied [3]. Directly comparing two exchangeable groups with similar baseline characteristics allows the therapeutic intervention under investigation, be it pharmacological, surgical or otherwise, to be rigorously evaluated. An RCT yields a valid conclusion if the intervention is genuinely the only variable applied differently to the groups and if the sample size is sufficiently large to detect or refute between-group differences with adequate statistical power [2, 4].

Similarly, RCTs can also evaluate the clinical utility of diagnostic or screening tests; that is, whether using a test to guide diagnosis and treatment decisions improves patient outcomes [5, 6]. However, unlike therapeutic RCTs, where randomization directly governs treatment, diagnostic RCTs are more complex. Tests themselves do not alter outcomes; they can influence outcomes only if the results of the test inform a clinical decision to change patient management [5, 6]. Recognizing the distinction between test and intervention is crucial, as they are distinct variables. This conceptual separation introduces specific methodological challenges that must be addressed explicitly when designing and interpreting an RCT that compares different diagnostic or prognostic tests [7, 8]. Trials that conflate test and treatment effects may yield conclusions that are difficult to interpret [6, 9].

Although the complexity involved in diagnostic test evaluation has been discussed extensively in the literature [5, 8, 10], some diagnostic RCTs continue to have design limitations. In particular, some trials do not clearly distinguish the information provided by the test itself from the impact of subsequent therapeutic management.
While such designs may have other advantages, particularly in terms of external validity, they can introduce methodological inefficiencies and a mismatch between what the trial intends to evaluate (the test) and what it actually measures, which may instead reflect treatment effects [8]. Consequently, even large, rigorously conducted trials may inaccurately estimate the clinical value of a test if this distinction is overlooked. This disconnect highlights the importance of designing trials that consider not only diagnostic accuracy but also the clinical consequences of acting on the test result [6, 8].

In this article, we review different diagnostic RCT designs, examine examples of recent diagnostic RCTs in obstetrics and reproductive medicine and discuss the limitations that impact their interpretability and clinical utility.

Two principal approaches can be used in diagnostic RCTs: comparing a diagnostic test with no test, or comparing two different diagnostic tests. In both scenarios, the two common designs are the two-arm design and the paired (or discordant) design (Figure 1). In the two-arm design, participants are randomly assigned to undergo diagnostic test A or B (or, in some studies, a test vs no test), and subsequent therapeutic management is dictated by the result of the test performed (Figure 1a). In contrast, in the paired design all participants undergo both tests A and B, and only those whose results differ between the tests (i.e. are discordant) are randomized to different management pathways (Figure 1b) [5, 10]. While a two-arm design offers advantages for blinding and for reflecting population-level implementation, a discordant design can offer greater efficiency by isolating the treatment contrast attributable to the diagnostic strategy and avoiding dilution from participants whose management would be the same regardless of the randomization arm [8, 9].
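The dilution described above can be made concrete with a little arithmetic. The following is a minimal sketch and is not taken from any of the trials discussed: the function names and all numerical values (a 10% discordant fraction, a 9% event risk among concordant participants, and 12% vs 8% among discordant participants) are hypothetical and chosen only for illustration. It assumes that concordant participants have the same event risk in both arms because their management is identical.

```python
def arm_risk(discordant_fraction: float,
             risk_concordant: float,
             risk_discordant: float) -> float:
    """Expected event risk in one arm of a two-arm diagnostic RCT.

    The arm is modelled as a mixture of concordant participants (managed
    identically in both arms) and discordant participants (whose management
    depends on the arm).
    """
    if not 0.0 <= discordant_fraction <= 1.0:
        raise ValueError("discordant_fraction must lie between 0 and 1")
    return ((1.0 - discordant_fraction) * risk_concordant
            + discordant_fraction * risk_discordant)


def diluted_risk_difference(discordant_fraction: float,
                            risk_discordant_a: float,
                            risk_discordant_b: float) -> float:
    """Risk difference across the full cohort.

    The concordant contribution is the same in both arms and cancels, so only
    the discordant fraction carries the treatment contrast.
    """
    return discordant_fraction * (risk_discordant_a - risk_discordant_b)


# Hypothetical illustration values (assumptions, not trial data).
f = 0.10
risk_a = arm_risk(f, risk_concordant=0.09, risk_discordant=0.12)
risk_b = arm_risk(f, risk_concordant=0.09, risk_discordant=0.08)
overall_rd = diluted_risk_difference(f, 0.12, 0.08)

print(f"arm A risk: {risk_a:.3f}, arm B risk: {risk_b:.3f}")
print(f"risk difference in discordant subgroup: {0.12 - 0.08:.3f}")
print(f"risk difference across full cohort:     {overall_rd:.3f}")
```

Under these assumed values, a difference of four percentage points among discordant participants shrinks to 0.4 percentage points when averaged over the full cohort.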
These design principles are central to interpreting the results of diagnostic RCTs and ensuring that meaningful treatment contrasts are captured. The importance of aligning randomization with test-informed management decisions is evident when evaluating recent diagnostic trials in obstetrics and reproductive medicine.

The Gestational Diabetes Mellitus Trial of Diagnostic Detection Thresholds (GEMS) randomized over 4000 women in an attempt to evaluate how different glycemic thresholds affect the diagnosis and outcomes of gestational diabetes mellitus (GDM) following a 75-g oral glucose tolerance test (OGTT) [11]. This study compared a lower-glycemic-criteria (GC) group (recommended by the International Association of Diabetes in Pregnancy Study Groups (IADPSG) [12] after the Hyperglycemia and Adverse Pregnancy Outcomes study [13]) with a higher-GC group (based on higher diagnostic thresholds developed by the Australasian Diabetes in Pregnancy Society through expert consensus [14]). GDM diagnosis was determined using OGTT results: in the lower-GC group, GDM was diagnosed if fasting plasma glucose level was ≥ 5.1 mmol/L, 1-h level was ≥ 10.0 mmol/L or 2-h level was ≥ 8.5 mmol/L; in the higher-GC group, GDM was diagnosed if fasting plasma glucose level was ≥ 5.5 mmol/L or 2-h level was ≥ 9.0 mmol/L. In both study arms, clinical management was based on GDM status: women diagnosed with GDM received standard diabetes care, including nutritional therapy, blood glucose monitoring and as-needed pharmacological treatment, while those not diagnosed with GDM continued with routine care without additional interventions [11].

Each comparison group (or randomization arm) therefore included both women who would be diagnosed with GDM (and thus treated) under both sets of thresholds and those who would not be diagnosed or treated under either. In these cases, management was identical regardless of group allocation, and any outcome differences would probably reflect chance alone.
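The two sets of thresholds quoted above can be expressed as a small classifier that shows which OGTT results lead to the same management in both arms. This is only a sketch of the published cut-offs as stated in the text, not the trial's own software; the function names and category labels are hypothetical.

```python
def gdm_lower_gc(fasting: float, one_hour: float, two_hour: float) -> bool:
    """Lower glycemic criteria: any one threshold met (values in mmol/L)."""
    return fasting >= 5.1 or one_hour >= 10.0 or two_hour >= 8.5


def gdm_higher_gc(fasting: float, two_hour: float) -> bool:
    """Higher glycemic criteria: fasting or 2-h threshold met (values in mmol/L)."""
    return fasting >= 5.5 or two_hour >= 9.0


def classify(fasting: float, one_hour: float, two_hour: float) -> str:
    """Label an OGTT result by whether the two criteria sets agree."""
    lower = gdm_lower_gc(fasting, one_hour, two_hour)
    higher = gdm_higher_gc(fasting, two_hour)
    if lower and higher:
        return "concordant_gdm"      # treated in either arm
    if not lower and not higher:
        return "concordant_no_gdm"   # routine care in either arm
    # Meeting the higher criteria always implies meeting the lower criteria,
    # so the only remaining case is lower-positive and higher-negative.
    return "discordant"              # management depends on the arm


# Made-up OGTT values for illustration only.
for values in [(5.6, 9.0, 8.0), (4.8, 9.0, 7.5), (5.2, 9.0, 8.0), (4.8, 10.2, 7.5)]:
    print(values, classify(*values))
```

Only results labelled `discordant` correspond to women whose management would differ according to the arm to which they were randomized.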
Only women in the small subgroup whose OGTT values fell in the intermediate range between the two glycemic thresholds (i.e. those diagnosed under the lower but not the higher threshold) received different management based on randomization. This subgroup comprised fewer than 10% of the overall study population [11]. The main analysis showed no significant difference in the primary outcome (incidence of large-for-gestational-age (LGA) neonates) between the groups overall (8.8% vs 8.9%; adjusted relative risk (aRR), 0.98 (95% CI, 0.80–1.19); P = 0.82), leading the authors to conclude that the use of lower GC for the diagnosis of GDM did not improve outcomes.

However, a subgroup analysis in the GEMS that was limited to women with discordant classification (i.e. those with a fasting plasma glucose level ≥ 5.1 and < 5.5 mmol/L, a 1-h level ≥ 10.0 mmol/L and/or a 2-h level ≥ 8.5 and < 9.0 mmol/L) [...]

[...] 10th percentile and a normal CPR were managed expectantly until 41 + 3 weeks [18]. For these concordant groups, CPR disclosure did not alter clinical outcome, and any differences between these groups would reflect chance alone. By randomizing all participants and analyzing outcomes across the full cohort, the trial diluted any potential treatment effect within the discordant subgroup, rendering the overall negative result difficult to interpret [19].

Ideally, a more design-sensitive approach would have involved randomizing after the ultrasound exam, limiting randomization to only the participants for whom CPR results would alter management (Figure S2b). In such a design, blinding of test results is inherently constrained because eligibility itself is determined by the CPR, so to improve concealment, participants with an EFW > 10th percentile and normal CPR would also need to be blinded to their result. While this approach is feasible if prespecified and appropriately justified during trial planning and consent, it is rarely implemented in practice.
Nonetheless, the methodological complexities are outweighed by the analytical advantage of targeting the subgroup in which the test result influences clinical decision-making. When using a two-arm design, at a minimum the primary analysis should focus on this discordant subgroup. Alternatively, the overlapping protocols between groups should be acknowledged during trial planning, with a prespecified subgroup analysis and a sample size calculation powered appropriately for this discordant cohort.

The Fetal Growth Restriction at Term Managed by Angiogenic Factors Versus Feto-Maternal Doppler (GRAFD) trial investigated the use of serum angiogenic factors to differentiate pathological fetal growth restriction (FGR) from small-for-gestational-age (SGA) fetuses among 1088 pregnancies with an EFW [...]

[Figure legends: published vs alternative protocols of the cerebroplacental ratio trial, the GRAFD trial and a further randomized controlled trial. In the published protocols, all participants were randomized regardless of test result, or randomization occurred prior to testing; in the alternative protocols, only cases with discordant classification between the tests are randomized.]
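The recommendation to power the trial for the discordant cohort can be illustrated with a standard normal-approximation sample size formula for comparing two proportions. The sketch below is an assumption-laden illustration rather than the method of any trial cited here: the function name `n_per_arm`, the 10% discordant fraction and all event risks are hypothetical, and the formula assumes equal allocation, a two-sided test and unpooled variances.

```python
import math
from statistics import NormalDist


def n_per_arm(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-arm sample size to detect p1 vs p2 (two-sided test)."""
    if p1 == p2:
        raise ValueError("p1 and p2 must differ")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


# Hypothetical illustration values (assumptions, not trial data).
f = 0.10                      # fraction of participants with discordant results
risk_concordant = 0.09        # same in both arms, because management is identical
risk_disc_a, risk_disc_b = 0.12, 0.08

# Paired (discordant) design: randomize only discordant participants.
n_disc_total = 2 * n_per_arm(risk_disc_a, risk_disc_b)
n_to_test = math.ceil(n_disc_total / f)   # participants who must undergo both tests

# Two-arm design: randomize everyone, so the contrast is diluted.
overall_a = (1 - f) * risk_concordant + f * risk_disc_a
overall_b = (1 - f) * risk_concordant + f * risk_disc_b
n_two_arm_total = 2 * n_per_arm(overall_a, overall_b)

print("discordant design, randomized:", n_disc_total)
print("discordant design, tested:    ", n_to_test)
print("two-arm design, randomized:   ", n_two_arm_total)
```

Under these assumed values, the two-arm design needs far more randomized participants than the discordant design needs to test, because the detectable difference across the full cohort is only a fraction of the difference within the discordant subgroup.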

Authors

H. J. Giles‐Clark

D. L. Rolnik

S. M. Skinner

Journals

Ultrasound in Obstetrics and Gynecology

Institutions

University of Amsterdam

Monash University

Amsterdam University Medical Centers
