DPO-ST
Public
[ACL 2024] Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning
Created: 2024-06-04T23:37:20
Updated: 2025-02-28T09:50:26
https://arxiv.org/abs/2407.18248
Stars: 47
Stars Increase: 0