What question did this study set out to answer?

The research aims to create a dataset that enhances multi-turn reasoning capabilities in large language models for specialized domains.

March 22, 2026

MPCCD-MLF: A dataset of multi-round professional consultation conversations in the medical, legal and financial domains

Key Points

The research aims to create a dataset that enhances multi-turn reasoning capabilities in large language models for specialized domains.
Constructed the MPCCD-MLF dataset using web crawling, prompt engineering, and structural reorganization.
Collected data from professional platforms like Haodf.com and Xueqiu.com.
Implemented a double-blind evaluation strategy for quality assessment.
Final dataset contains 31,745 three-round question-answer interactions.
Achieved a dataset quality score of 4.75 out of 5.
Facilitates high-quality, interpretable reasoning for model training.

Abstract

Large language models hold vast application potential across diverse fields such as healthcare, law, and finance. However, these domains impose higher requirements on the models specialization, accuracy, explainability, and security. Most existing public datasets primarily focus on conclusive answers and lack explainable reasoning that reflects the expert decision-making process within complex consultation scenarios. Consequently, they are insufficient for effectively supporting conducting long-context, multi-turn interactive reasoning in large language models. To address this, this study constructs the MPCCD-MLF dataset (dataset of multi-round professional consultation conversations in the medical, legal and financial domains), comprising multi-round conversation corpora across medical, legal, and financial domains. Data sources include professional platforms such as Haodf.com, China Legal Service Network (12348), and Xueqiu.com, covering the period from January 2023 to December 2024. The dataset was constructed through a pipeline including web crawling, prompt engineering, and structural reorganization. Using specially designed multi-dimensional constrained prompt templates, it anchors to factual judgements and conclusive information within experts’ original responses, thereby generating structured and interpretable reasoning expressions that unfold across multiple conversation rounds. After cleaning and anonymization, the final dataset contains 31,745 three-round question-answer interactions (approximately 181 MB) stored in JSON format. Each conversation follows a multi-round interaction pattern comprising user query, expert response, user follow-up, and expert follow-up response. To ensure dataset quality, a double-blind evaluation strategy combining automated model scoring and expert manual verification was adopted, yielding an overall dataset quality score of 4.75 (out of 5). This dataset provides high-quality, and highly interpretable corpora for large language models in specialized domains, supporting research on complex logical reasoning and long-context multi-round interactions, and offering valuable data resources for the development of domain-specific intelligent consultation systems.

Connected Papers

Building similarity graph...

Analyzing shared references across papers

Discussion

Authors

Congfei Luo

Qidong YAN

Dejie Wang

Journals

China Scientific Data

Actions

References and Citations

Connected Papers

Building similarity graph...

Analyzing shared references across papers

MPCCD-MLF: A dataset of multi-round professional consultation conversations in the medical, legal and financial domains

Key Points

Abstract

Citation Network

Connected Papers

Discussion

Authors

Journals

Actions

References and Citations

Citation Network

Connected Papers

Discussion

Cite this study