The recent expansion of large language model (LLM) context windows raises a practical question for document-grounded question answering: if an entire source document fits into the prompt, is retrieval-augmented generation (RAG) still necessary? We evaluate this question on 50 Indonesian Constitutional Court (Mahkamah Konstitusi, MK) verdicts and 300 human-reviewed question-answer pairs spanning four cognitive types. We compare three architectures, Long Context (LC), Simple RAG, and Advanced RAG, on two models from different families: Gemini 2.5 Flash and GPT-4o Mini. Simple RAG is the most reliable architecture in every complete comparison. In Phase 2, the ranking is stable across both models: Simple RAG > Advanced RAG > Long Context, with Cohen's d ranging from 0.582 to 0.803 for the Simple RAG versus Long Context comparison. A component-level ablation shows that hybrid BM25+dense search is the single most beneficial Advanced RAG component (+9.2 pp faithfulness), while cross-encoder reranking is the most harmful (−9.0 pp), an effect we attribute to language and domain mismatch. Length-sensitivity analysis reveals that Long Context faithfulness collapses to 0.205 on long verdicts while Simple RAG remains at 0.803. On this legal QA benchmark, targeted retrieval is more faithful, cheaper, and operationally more robust than full-document prompt injection.
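For readers unfamiliar with the effect size quoted above, the following is a minimal sketch of how Cohen's d can be computed for two groups of per-question scores. It assumes the common independent-samples form with a pooled standard deviation; the abstract does not say which variant the study used, and the function name `cohens_d` and the score lists are hypothetical, invented purely for illustration.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation.

    Assumption: this is the textbook pooled-SD form, not necessarily the
    variant used in the study.
    """
    n_a, n_b = len(a), len(b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    # statistics.variance returns the sample variance (n - 1 denominator).
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; effect size is undefined")
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


if __name__ == "__main__":
    # Hypothetical faithfulness scores, NOT data from the study.
    simple_rag = [0.90, 0.80, 0.85, 0.70, 0.95]
    long_context = [0.60, 0.75, 0.50, 0.80, 0.65]
    print(round(cohens_d(simple_rag, long_context), 3))
```

By the usual rule of thumb, values around 0.5 are read as a medium effect and values around 0.8 as a large one, which is the range the Simple RAG versus Long Context comparison falls into.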
Muhammad Iqbal Hilmy Izzulhaq
Muhammad Iqbal Hilmy Izzulhaq (Fri,) studied this question.
www.synapsesocial.com/papers/6a0020eac8f74e3340f9bc8f — DOI: https://doi.org/10.5281/zenodo.20086805