May 10, 2026 · Open Access

When Context Is Not Enough: Retrieval Outperforms Full-Document Prompting on Indonesian Constitutional Court Verdicts

Key Points

This research investigates whether retrieval mechanisms enhance performance in document-based question answering compared to full-document prompting.
The benchmark comprises 50 Indonesian Constitutional Court (Mahkamah Konstitusi, MK) verdicts and 300 human-reviewed question-answer pairs spanning four cognitive types.
Three architectures are compared (Long Context, Simple RAG, and Advanced RAG) using Gemini 2.5 Flash and GPT-4o Mini.
A length-sensitivity analysis and a component-level ablation assess how document length and individual Advanced RAG components affect performance.
Simple RAG was more reliable than Long Context on both models, with Cohen's d between 0.582 and 0.803.
Hybrid BM25+dense search raised faithfulness by 9.2 percentage points, while cross-encoder reranking lowered it by 9.0 percentage points, a drop the study attributes to language and domain mismatch.
On long verdicts, Long Context faithfulness fell to 0.205, while Simple RAG held at 0.803.
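
To make the "hybrid BM25+dense search" component concrete, here is a minimal sketch of one common way to merge a lexical ranking with a dense-embedding ranking: reciprocal rank fusion (RRF). This summary does not state which fusion method the study uses, so RRF, the function name `reciprocal_rank_fusion`, the constant `k=60`, and the chunk IDs are all assumptions made purely for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into a single ranking.

    Each chunk receives 1 / (k + rank) from every list it appears in.
    Assumption: RRF is an illustrative choice, not the study's confirmed method.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical chunk IDs, as two separate retrievers might return them.
bm25_ranking = ["c3", "c1", "c7"]   # lexical (BM25) order
dense_ranking = ["c1", "c5", "c3"]  # dense-embedding order

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Chunks that both retrievers rank highly rise to the top, while chunks found by only one retriever are still kept as candidates. That is the usual motivation for hybrid search on text containing exact legal terms and article numbers that dense embeddings can miss.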

Abstract

The recent expansion of large language model (LLM) context windows raises a practical question for document-grounded question answering: if an entire source document fits into the prompt, is retrieval-augmented generation (RAG) still necessary? We evaluate this question on 50 Indonesian Constitutional Court (Mahkamah Konstitusi, MK) verdicts and 300 human-reviewed question-answer pairs spanning four cognitive types. Three architectures are compared across two model families: Long Context (LC), Simple RAG, and Advanced RAG, using Gemini 2.5 Flash and GPT-4o Mini. Simple RAG is the most reliable architecture in every complete comparison. In Phase 2, the ranking is stable across both models: Simple RAG > Advanced RAG > Long Context, with Cohen's d ranging from 0.582 to 0.803 for the Simple RAG versus Long Context comparison. A component-level ablation shows that hybrid BM25+dense search is the single most beneficial Advanced RAG component (+9.2 pp faithfulness), while cross-encoder reranking is the most harmful (−9.0 pp), attributed to language and domain mismatch. Length-sensitivity analysis reveals that Long Context faithfulness collapses to 0.205 on long verdicts while Simple RAG remains at 0.803. On this legal QA benchmark, targeted retrieval is more faithful, cheaper, and operationally more robust than full-document prompt injection.
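
For readers unfamiliar with the effect-size measure quoted above, the following is a minimal sketch of Cohen's d computed with a pooled sample standard deviation. The per-question scores are invented toy values rather than data from the study, and the study may use a different variant (for example, a paired formulation), so treat this as an illustration of the general formula only.

```python
import math
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d for two independent samples, using the pooled sample SD."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance is the sample variance (n - 1 denominator).
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


# Toy faithfulness scores (hypothetical, not taken from the paper).
simple_rag_scores = [0.9, 0.8, 1.0, 0.7]
long_context_scores = [0.6, 0.5, 0.9, 0.4]

d = cohens_d(simple_rag_scores, long_context_scores)
print(f"Cohen's d = {d:.3f}")
```

By the conventional rule of thumb, values around 0.5 are read as medium effects and values around 0.8 as large, which places the reported range of 0.582 to 0.803 between medium and large.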

Authors

Muhammad Iqbal Hilmy Izzulhaq
