Guide-GRPO
PublicAims for memory-efficient training (24GB VRAM) on consumer GPUs. Optimizing language models through guidance tokens in reasoning chains, based on DeepSeekRL-Extended.
Creat:2025-02-23T18:37:40
Update:2025-03-06T11:11:08
30
Stars
0
Stars Increase