Accurate and detailed image captioning is crucial for documenting and disseminating knowledge about Chinese cultural relics, yet this task is severely limited by its domain-specific nature and the acute scarcity of paired image-caption data. While paired visual-text data is limited, substantial volumes of domain texts about these relics often exist. We propose a novel framework for Chinese cultural relics image captioning that effectively leverages these abundant domain texts using diffusion language models (DLMs). Our approach involves pre-training a DLM on the large corpus of domain texts to instill domain-specific linguistic knowledge, followed by fine-tuning the pre-trained DLM on the limited paired image-caption data, conditioned on visual features. Experiments demonstrate that this strategy significantly boosts captioning performance compared to methods that do not exploit the domain texts or use them less effectively. This work highlights the power of DLMs in leveraging readily available domain text to overcome data scarcity for complex vision-language generation tasks, offering a valuable tool for cultural heritage documentation and broader natural language processing applications.
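The two-stage strategy in the abstract (text-only pre-training, then visually conditioned fine-tuning) can be sketched as follows. This is a hypothetical toy illustration, not the paper's implementation: it uses an absorbing-state masked-diffusion objective for the DLM, a tiny transformer, random stand-in tokens for the domain corpus, and a single projected vector standing in for the visual features; all names (`TinyDLM`, `diffusion_loss`, dimensions) are assumptions.

```python
# Hypothetical sketch (not the paper's code) of the two-stage DLM training:
# Stage 1 pre-trains on abundant domain text; Stage 2 fine-tunes on scarce
# image-caption pairs, conditioning on visual features via a prefix token.
import torch
import torch.nn as nn

VOCAB, MASK, D = 100, 0, 32  # toy vocabulary; token 0 serves as [MASK]

class TinyDLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        self.vis_proj = nn.Linear(8, D)   # maps visual features into token space
        self.block = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.head = nn.Linear(D, VOCAB)

    def forward(self, tokens, vis=None):
        h = self.embed(tokens)
        if vis is not None:               # prepend visual feature as a prefix token
            h = torch.cat([self.vis_proj(vis).unsqueeze(1), h], dim=1)
        h = self.block(h)
        if vis is not None:
            h = h[:, 1:]                  # drop the visual prefix position
        return self.head(h)

def diffusion_loss(model, tokens, vis=None):
    # Absorbing-state discrete diffusion: mask each token with probability t,
    # then train the model to recover the originals at the masked positions.
    t = torch.rand(tokens.size(0), 1)
    masked = torch.rand(tokens.shape) < t
    noisy = torch.where(masked, torch.full_like(tokens, MASK), tokens)
    logits = model(noisy, vis)
    return nn.functional.cross_entropy(logits[masked], tokens[masked])

model = TinyDLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stage 1: text-only pre-training on a synthetic stand-in for domain texts.
domain_text = torch.randint(1, VOCAB, (16, 12))
loss1 = diffusion_loss(model, domain_text)
loss1.backward(); opt.step(); opt.zero_grad()

# Stage 2: fine-tuning on scarce paired data, conditioned on visual features.
captions = torch.randint(1, VOCAB, (4, 12))
vis_feats = torch.randn(4, 8)            # e.g. pooled CNN/ViT image features
loss2 = diffusion_loss(model, captions, vis_feats)
loss2.backward(); opt.step()
```

In this sketch the same denoising objective is reused across both stages, so Stage 2 only adds the visual conditioning path; a real system would use a full noise schedule and iterative sampling at inference time.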
DOI: https://doi.org/10.1145/3793547
Chenggang Mi, Yu Li
Northwestern Polytechnical University; Xi'an International Studies University
Journal on Computing and Cultural Heritage