Retrieval-Augmented Generation (RAG) systems rely on effective document segmentation to retrieve relevant context for large language models (LLMs). However, existing chunking strategies suffer from critical drawbacks: fixed-size segmentation disrupts semantic continuity; naive merging introduces noise; overly long chunks exceed token budgets; and multi-page documents often lose hierarchical context. These limitations severely degrade retrieval accuracy in dense, hierarchical long-form documents such as legal filings, financial disclosures, and research articles, especially when the documents contain complex elements such as bar charts, pie charts, tables, and other fine-grained details. We propose HASM-RAG (Hierarchy-Token-Aware Semantic Merging), a novel framework that integrates document hierarchy with token-aware constraints to adaptively merge semantically related segments. By preserving structural lineage while enforcing token-budget compliance, HASM-RAG produces retrieval units that are both contextually coherent and semantically complete, even when hierarchical markers are missing across pages. Experimental evaluation shows that HASM-RAG achieves higher recall, higher precision, and lower redundancy than baseline RAG approaches, offering a robust solution for retrieving long-context knowledge. HASM-RAG is lightweight, model-agnostic, and compatible with modern vector-database-based RAG pipelines.
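To make the idea of hierarchy- and token-aware merging concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each segment carries a heading path, that a segment with an empty path (for example, a page continuation) inherits the last seen path, and that adjacent segments are merged only when they share a path, are lexically similar, and fit a token budget. The names `Segment`, `count_tokens`, `similarity`, `merge_segments`, `max_tokens`, and `min_sim` are all hypothetical; whitespace token counts and Jaccard word overlap stand in for a real tokenizer and embedding similarity.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    path: tuple  # heading lineage, e.g. ("Risk",); empty if markers are missing (assumed field)
    text: str


def count_tokens(text: str) -> int:
    # Assumption: whitespace split stands in for a real model tokenizer.
    return len(text.split())


def similarity(a: str, b: str) -> float:
    # Assumption: Jaccard overlap of word sets stands in for embedding similarity.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)


def merge_segments(segments, max_tokens=50, min_sim=0.2):
    """Greedily merge adjacent segments under hierarchy, similarity, and budget checks."""
    merged = []
    last_path = ()
    for seg in segments:
        # Inherit the previous lineage when this segment has no hierarchy markers.
        path = seg.path or last_path
        last_path = path
        current = Segment(path, seg.text)
        if merged:
            prev = merged[-1]
            same_branch = prev.path == path
            fits = count_tokens(prev.text) + count_tokens(current.text) <= max_tokens
            related = similarity(prev.text, current.text) >= min_sim
            if same_branch and fits and related:
                merged[-1] = Segment(path, prev.text + " " + current.text)
                continue
        merged.append(current)
    return merged


segments = [
    Segment(("Risk",), "interest rate risk affects bond prices"),
    Segment((), "interest rate changes affect bond prices and yields"),
    Segment(("Liquidity",), "cash reserves cover twelve months of operations"),
]
chunks = merge_segments(segments)
```

In this sketch, the second segment has no heading markers, so it inherits the `("Risk",)` lineage and merges with the first, while the `("Liquidity",)` segment starts a new retrieval unit. Lowering `max_tokens` below the combined length keeps the first two segments separate, which is the token-budget constraint taking precedence over semantic relatedness.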
Komal Mahto
Komal Mahto (Fri,) studied this question.
www.synapsesocial.com/papers/69b5ff6e83145bc643d1bfdd — DOI: https://doi.org/10.5281/zenodo.18995886