Accurate and detailed image captioning is crucial for documenting and disseminating knowledge about Chinese cultural relics, yet this task is severely limited by its domain-specific nature and the acute scarcity of paired image-caption data. While paired visual-text data is limited, substantial volumes of domain texts about these relics often exist. We propose a novel framework for Chinese cultural relics image captioning that effectively leverages these abundant domain texts using diffusion language models (DLMs). Our approach involves pre-training a DLM on the large corpus of domain texts to instill domain-specific linguistic knowledge, followed by fine-tuning the pre-trained DLM on the limited paired image-caption data, conditioned on visual features. Experiments demonstrate that this strategy significantly boosts captioning performance compared to methods that do not exploit the domain texts or use them less effectively. This work highlights the power of DLMs in leveraging readily available domain text to overcome data scarcity for complex vision-language generation tasks, offering a valuable tool for cultural heritage documentation and broader natural language processing applications.
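The two-stage strategy in the abstract (text-only pre-training, then visually conditioned fine-tuning) can be sketched as follows. This is a hypothetical toy illustration, not the paper's implementation: it uses an absorbing-state masked-diffusion objective for the DLM, a tiny transformer, random stand-in tokens for the domain corpus, and a single projected vector standing in for the visual features; all names (`TinyDLM`, `diffusion_loss`, dimensions) are assumptions.

```python
# Hypothetical sketch (not the paper's code) of the two-stage DLM training:
# Stage 1 pre-trains on abundant domain text; Stage 2 fine-tunes on scarce
# image-caption pairs, conditioning on visual features via a prefix token.
import torch
import torch.nn as nn

VOCAB, MASK, D = 100, 0, 32  # toy vocabulary; token 0 serves as [MASK]

class TinyDLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D)
        self.vis_proj = nn.Linear(8, D)   # maps visual features into token space
        self.block = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.head = nn.Linear(D, VOCAB)

    def forward(self, tokens, vis=None):
        h = self.embed(tokens)
        if vis is not None:               # prepend visual feature as a prefix token
            h = torch.cat([self.vis_proj(vis).unsqueeze(1), h], dim=1)
        h = self.block(h)
        if vis is not None:
            h = h[:, 1:]                  # drop the visual prefix position
        return self.head(h)

def diffusion_loss(model, tokens, vis=None):
    # Absorbing-state discrete diffusion: mask each token with probability t,
    # then train the model to recover the originals at the masked positions.
    t = torch.rand(tokens.size(0), 1)
    masked = torch.rand(tokens.shape) < t
    noisy = torch.where(masked, torch.full_like(tokens, MASK), tokens)
    logits = model(noisy, vis)
    return nn.functional.cross_entropy(logits[masked], tokens[masked])

model = TinyDLM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stage 1: text-only pre-training on a synthetic stand-in for domain texts.
domain_text = torch.randint(1, VOCAB, (16, 12))
loss1 = diffusion_loss(model, domain_text)
loss1.backward(); opt.step(); opt.zero_grad()

# Stage 2: fine-tuning on scarce paired data, conditioned on visual features.
captions = torch.randint(1, VOCAB, (4, 12))
vis_feats = torch.randn(4, 8)            # e.g. pooled CNN/ViT image features
loss2 = diffusion_loss(model, captions, vis_feats)
loss2.backward(); opt.step()
```

In this sketch the same denoising objective is reused across both stages, so Stage 2 only adds the visual conditioning path; a real system would use a full noise schedule and iterative sampling at inference time.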
DOI: https://doi.org/10.1145/3793547
Chenggang Mi, Yu Li
Northwestern Polytechnical University; Xi'an International Studies University
Journal on Computing and Cultural Heritage