Industrial catalysis, a core field of chemical engineering, is characterized by dense professional terminology and complex knowledge structures, which makes it difficult for general-purpose large language models to understand and apply the relevant knowledge accurately. This research presents a domain-specific fine-tuning technique and a retrieval-augmented generation system for the industrial catalysis field. Using a multi-model collaborative data processing pipeline, we construct high-quality training corpora, train specialized domain models with parameter-efficient fine-tuning, and design a retrieval-augmented generation workflow based on consistency verification. We first establish a training corpus of 2.3 billion tokens, comprising 1.1 billion domain-specific tokens and 1.2 billion general tokens, mixed at a roughly 1:1 ratio. We then apply rank-stabilized low-rank adaptation (rsLoRA) to fine-tune the Yi-1.5-6B model, producing the PeiYang Micro-Emergence model. It scores 76.81 in the industrial catalysis evaluation, significantly outperforming the general-purpose model Qwen2.5-72B-Instruct (65.45), which has 12 times as many parameters, while maintaining good general capabilities. We further construct a dataset of 3.37 million domain-specific retrieval pairs and optimize the embedding model using Matryoshka representation learning (MRL), achieving an average improvement of 2.87 percentage points in domain retrieval recall@3 while slightly enhancing general capabilities. Finally, we design a professional retrieval-augmented generation workflow that integrates bilingual hypothetical document generation, dual-path retrieval, and consistency verification to deliver high-quality professional knowledge services.
This system provides accurate and reliable professional knowledge services for the industrial catalysis field, demonstrates the application value of domain-specific large language models in resource-constrained environments, and offers a replicable technical pathway for artificial intelligence applications in other specialized domains.
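The recall@3 metric and the Matryoshka idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the helper names `truncate_and_normalize` and `recall_at_k`, the toy vectors, and the choice of cosine similarity are all assumptions made for illustration. It only shows how an MRL-style embedding might be shortened to a prefix of its dimensions at inference time and how recall@k is computed on the result.

```python
import math

def truncate_and_normalize(vec, dim):
    """Keep the first `dim` coordinates (the Matryoshka prefix) and L2-normalize."""
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    return [x / norm for x in prefix] if norm > 0 else prefix

def recall_at_k(query_vecs, doc_vecs, relevant_doc_ids, k=3, dim=None):
    """Fraction of queries whose relevant document is among the top-k by cosine similarity."""
    if dim is None:
        dim = len(doc_vecs[0])  # use the full embedding by default
    docs = [truncate_and_normalize(d, dim) for d in doc_vecs]
    hits = 0
    for query, relevant_id in zip(query_vecs, relevant_doc_ids):
        q = truncate_and_normalize(query, dim)
        scores = [sum(a * b for a, b in zip(q, d)) for d in docs]
        top_k = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)[:k]
        hits += relevant_id in top_k
    return hits / len(query_vecs)

# Hypothetical toy data: four documents, two queries, 4-dimensional embeddings.
toy_docs = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
toy_queries = [[0.9, 0.1, 0.0, 0.0], [0.0, 0.1, 0.0, 0.9]]
toy_relevant = [0, 3]  # index of the relevant document for each query

full_recall = recall_at_k(toy_queries, toy_docs, toy_relevant, k=1)
short_recall = recall_at_k(toy_queries, toy_docs, toy_relevant, k=1, dim=2)
```

In this toy case the full 4-dimensional vectors retrieve both relevant documents, while the 2-dimensional prefix loses the signal for the second query. MRL training aims to make such prefixes retain as much retrieval quality as possible, so that shorter embeddings can be used when storage or latency is constrained.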
Xin Chang
Shican Wu
Xiao Ma
Tianjin University
Ministry of Education
Chang et al. (Mon,) studied this question.
www.synapsesocial.com/papers/69ba428e4e9516ffd37a2e0d — DOI: https://doi.org/10.53941/sce.2026.100002