July 25, 2024 · Open Access

Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning

Abstract

Effective training of language models (LMs) for mathematical reasoning tasks demands high-quality supervised fine-tuning data. Besides obtaining annotations from human experts, a common alternative is sampling from larger and more powerful LMs. However, this knowledge distillation approach can be costly and unstable, particularly when relying on closed-source, proprietary LMs like GPT-4, whose behaviors are often unpredictable. In this work, we demonstrate that the reasoning abilities of small-scale LMs can be enhanced through self-training, a process in which models learn from their own outputs. We also show that conventional self-training can be further augmented by a preference learning algorithm called Direct Preference Optimization (DPO). By integrating DPO into self-training, we leverage preference data to guide LMs towards more accurate and diverse chain-of-thought reasoning. We evaluate our method across various mathematical reasoning tasks using different base models. Our experiments show that this approach not only improves LMs' reasoning performance but also offers a more cost-effective and scalable solution than relying on large proprietary LMs.
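To make the role of preference data more concrete, the following is a minimal sketch of one way self-generated samples could be turned into DPO training signal. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how preference pairs are built, so labeling a sampled rationale as "chosen" when its final answer matches the reference answer is an assumption here, and the names `build_preference_pairs` and `dpo_loss` are hypothetical. The loss itself follows the standard per-pair DPO objective, which takes sequence log-probabilities under the trained policy and a frozen reference model.

```python
import math
from itertools import product


def build_preference_pairs(samples, gold_answer):
    """Pair self-generated rationales into (chosen, rejected) tuples.

    `samples` is a list of (rationale, final_answer) tuples drawn from the
    model itself. ASSUMPTION: a rationale counts as "chosen" when its final
    answer equals the reference answer, and as "rejected" otherwise.
    """
    chosen = [r for r, a in samples if a == gold_answer]
    rejected = [r for r, a in samples if a != gold_answer]
    return list(product(chosen, rejected))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for a single preference pair.

    Inputs are sequence log-probabilities of the chosen and rejected
    responses under the policy and under the frozen reference model.
    Returns -log(sigmoid(beta * margin)).
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # Numerically stable evaluation of log(1 + exp(-z)).
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


if __name__ == "__main__":
    # Hypothetical samples for a question whose reference answer is "12".
    samples = [
        ("3 * 4 = 12", "12"),
        ("3 + 4 = 7", "7"),
        ("4 * 3 = 12", "12"),
    ]
    pairs = build_preference_pairs(samples, gold_answer="12")
    print(len(pairs), "preference pairs")

    # Illustrative log-probabilities only; real values come from the models.
    print(dpo_loss(-5.0, -9.0, -6.0, -8.0, beta=0.1))
```

The loss shrinks as the policy raises the likelihood of the chosen rationale relative to the rejected one, measured against the reference model. In a self-training loop, the pairs would be regenerated from the updated model at each iteration.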

Cite this study

Wang, Li, and Lu (2024). Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning.

www.synapsesocial.com/papers/68e5f2d2b6db6435875874b2

DOI: https://doi.org/10.48550/arxiv.2407.18248

Authors

Tianduo Wang

Shichen Li

Wei Lu
