What question did this study set out to answer?

Whether HASM-RAG can overcome the limitations of existing document segmentation strategies in Retrieval-Augmented Generation (RAG) systems.

March 15, 2026 | Open Access

HASM-RAG: Hierarchy-Token-Aware Semantic Splitting and Merging for Dense Long-Form Documents in Knowledge Retrieval

Key Points

Aimed to address the limitations of document segmentation in Retrieval-Augmented Generation (RAG) systems.
Developed HASM-RAG integrating document hierarchy with token-aware constraints.
Evaluated against baseline RAG approaches for effectiveness in retrieval tasks.
Achieved higher precision and recall compared to existing RAG strategies.
Reduced redundancy in retrieved documents.

Abstract

Retrieval-Augmented Generation (RAG) systems rely on effective document segmentation to retrieve relevant context for large language models (LLMs). However, existing chunking strategies have critical drawbacks: fixed-size segmentation disrupts semantic continuity, naive merging introduces noise, overly long chunks exceed token budgets, and multi-page documents often lose hierarchical context. These limitations severely degrade retrieval accuracy in dense, hierarchical long-form documents such as legal filings, financial disclosures, and research articles, especially when documents contain complex elements like bar charts, pie charts, tables, and other fine-grained details. We propose HASM-RAG (Hierarchy-Token-Aware Semantic Merging), a novel framework that integrates document hierarchy with token-aware constraints to adaptively merge semantically related segments. By preserving structural lineage while enforcing token-budget compliance, HASM-RAG produces retrieval units that are both contextually coherent and semantically complete, even when hierarchical markers are missing across pages. Experimental evaluation shows that HASM-RAG achieves higher recall, higher precision, and lower redundancy than baseline RAG approaches, offering a robust solution for retrieving long-context knowledge. HASM-RAG is lightweight, model-agnostic, and compatible with modern vector-database-based RAG pipelines.
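
The abstract does not give the merging procedure itself, so the following is only a minimal sketch of what hierarchy-aware, token-budgeted merging could look like, not the authors' algorithm. All names here (`Segment`, `count_tokens`, `shared_depth`, `merge_segments`, `min_shared_depth`) are hypothetical. Two simplifying assumptions are made: shared heading lineage stands in for semantic relatedness, and a whitespace split stands in for a real tokenizer. A segment with no heading markers (for example, a page continuation) inherits the lineage of the segment before it.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    # Heading lineage from the document root, e.g. ("2 Risks", "2.1 Market").
    # An empty tuple means the hierarchical markers were lost (assumption).
    path: Tuple[str, ...]
    text: str


def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer: counts whitespace-separated words.
    return len(text.split())


def shared_depth(a: Tuple[str, ...], b: Tuple[str, ...]) -> int:
    # Number of leading heading levels the two lineages have in common.
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth


def merge_segments(segments: List[Segment], max_tokens: int,
                   min_shared_depth: int = 1) -> List[Segment]:
    chunks: List[Segment] = []
    current = None
    last_path: Tuple[str, ...] = ()
    for seg in segments:
        # Inherit lineage from the previous segment when markers are missing.
        path = seg.path or last_path
        last_path = path
        if current is not None:
            depth = shared_depth(current.path, path)
            total = count_tokens(current.text) + count_tokens(seg.text)
            if depth >= min_shared_depth and total <= max_tokens:
                # Merge, keeping only the lineage both parts share.
                current = Segment(current.path[:depth],
                                  current.text + "\n" + seg.text)
                continue
            chunks.append(current)
        current = Segment(path, seg.text)
    if current is not None:
        chunks.append(current)
    return chunks


demo = [
    Segment(("1 Intro",), "a b c"),
    Segment(("1 Intro", "1.1 Scope"), "d e"),
    Segment((), "f"),                      # markers lost across a page break
    Segment(("2 Risks",), "g h"),
    Segment(("2 Risks",), "i j k l m"),
]
chunks = merge_segments(demo, max_tokens=6)
```

In this sketch a merge happens only when both the lineage check and the token budget pass, so a chunk never crosses a top-level section boundary and never exceeds `max_tokens` unless a single input segment is already over budget. Splitting such oversized segments, which the paper's title mentions, is not covered here.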

Authors

Komal Mahto
