sidkshatriya a day ago

This is what I understood from the blog post (please correct me if I am wrong):

Unsloth allows you to give it a transformer model and additional training data to do LoRA/QLoRA. LoRA/QLoRA keeps the weights of the model constant but outputs some low-rank adjustments to the weights, which serve as the weight "delta".

Typically one would do SFT with the training data. But Unsloth also allows you to do RL (reinforcement learning), specifically GRPO, on the model + training data you give it! The output of GRPO here is again in the form of LoRA/QLoRA weights.

You have found a way to reduce the memory requirements for GRPO.

Question: How does one decide whether to use the training data for SFT (supervised fine-tuning) or for GRPO? When will you get better results with SFT and when with GRPO?

  • danielhanchen a day ago

    Yes you're correct!

    Very good question on SFT vs GRPO!

    Assume the dataset I have is "What is 2+2?", "The answer is 4".

    1. If you have very high quality labelled data, SFT should work fine. I.e. "What is 2+2? Let me think about it..., The answer is 4"

    2. If you only have the input "What is 2+2" and just the answer "4", but nothing in between, GRPO could be very helpful! GRPO can help produce the reasoning traces automatically - you will need to provide some scoring / reward functions though. For example, if the answer == 4, add +1 to the score (see the sketch after this list).

    3. You can combine SFT and GRPO! Do SFT first, then GRPO - this actually most likely makes GRPO converge faster!
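
    A rough sketch of what such a reward function could look like (the function name and signature here are illustrative, not Unsloth's or TRL's actual API):

      # Illustrative only: score each completion by whether its final
      # answer matches the known answer ("4" for "What is 2+2?").
      def correctness_reward(completions, answers):
          rewards = []
          for completion, answer in zip(completions, answers):
              words = completion.strip().split()
              predicted = words[-1] if words else ""
              rewards.append(1.0 if predicted.strip(".") == answer else 0.0)
          return rewards

      # The second completion ends with the right answer and gets +1.
      print(correctness_reward(["I think it is 5", "2+2 equals 4"], ["4", "4"]))
      # -> [0.0, 1.0]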

    • sidkshatriya a day ago

      Does this mean that you can only do GRPO on models that already have reasoning traces in <think>...</think>?

      • danielhanchen a day ago

        Oh no, not at all!! You can actually get a model to generate the <think>...</think> tokens itself! That's how DeepSeek trained R1 Zero, which essentially made the model have reasoning skills!

        • sidkshatriya 21 hours ago

          Won't you have to use a distilled DeepThink model then? Because the GRPO training phase requires the model to put its reasoning within <think></think> for the lowest loss.

          • danielhanchen 21 hours ago

            Oh no no!! The trick for GRPO is you essentially let the model "learn" how to do reasoning itself!!!

            The <think> tokens are optional and just a formatting choice. You could use <reasoning> or <thinking> or [reasoning], for example, in the system prompt.
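
            For example (this exact prompt wording is a made-up illustration, not the one from the blog post):

              # Hypothetical system prompt: the tag names are arbitrary; your
              # format-checking reward just has to look for whichever tags you pick.
              SYSTEM_PROMPT = (
                  "Respond in the following format:\n"
                  "<reasoning>\n...\n</reasoning>\n"
                  "<answer>\n...\n</answer>"
              )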

      • codelion 20 hours ago

        Models already have hidden latent CoT-style reasoning within them; GRPO helps induce that behavior. For instance, see https://x.com/asankhaya/status/1838375748165628053 where a sampling technique (CoT decoding) can actually improve performance of the model.

        • danielhanchen 7 hours ago

          Oh yep! The DeepSeek paper also mentioned how large enough LLMs inherently have reasoning capabilities, and the goal of GRPO is to accentuate those latent skills!

      • wrsh07 12 hours ago

        Nah, you can just request that in your prompt and then fail answers that are incorrect and/or don't include the think trace

        • danielhanchen 7 hours ago

          Yes exactly! You can in fact add that as a reward function for style and format checking!
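
          A minimal sketch of such a format-checking reward, assuming <reasoning>/<answer> tags and an arbitrary partial-credit score:

            import re

            # Illustrative only: give partial credit when the completion wraps
            # its reasoning and answer in the expected tags.
            FORMAT_PATTERN = re.compile(
                r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>", re.DOTALL
            )

            def format_reward(completions):
                return [0.5 if FORMAT_PATTERN.search(c) else 0.0 for c in completions]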

    • lyu07282 a day ago

      can you give some real-world examples for when this would be useful? Does this work for tasks requiring tool calling as well?

      • danielhanchen 21 hours ago

        Yes, tool calling is a prime example!! I.e. you have some specific task and the final output involves some tools, but sadly the steps to call the tools / the stuff in between / the thinking process is missing.

        You can employ GRPO and maybe add an actual Python environment for the model to learn to act in.
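
        As a loose sketch of that idea (assuming the model emits a plain Python snippet as its "tool call"; a real setup would need proper sandboxing and a python executable on PATH):

          import subprocess

          # Illustrative only: run the model's proposed Python snippet in a
          # separate process and reward it if it prints the expected output.
          def tool_call_reward(snippets, expected_outputs):
              rewards = []
              for code, expected in zip(snippets, expected_outputs):
                  try:
                      result = subprocess.run(
                          ["python", "-c", code],
                          capture_output=True, text=True, timeout=5,
                      )
                      ok = result.returncode == 0 and result.stdout.strip() == expected
                  except subprocess.TimeoutExpired:
                      ok = False
                  rewards.append(1.0 if ok else 0.0)
              return rewards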

  • imjonse 15 hours ago

    Is it established whether GRPO is essential for this to work as it does, or could other RLHF-class methods provide similar results? My initial (possibly mistaken) impression was that GRPO was one of the ways of mitigating the lack of enormous hardware resources.

    • danielhanchen 7 hours ago

      Yep so GRPO is much more memory efficient than PPO, but other RL type algorithms can work fine as well!

yorwba 19 hours ago

> We also found interestingly that:

  torch.exp(q - q.detach()) * advantages.unsqueeze(1)
> is used, which should be evaluated to 1 right? We actually found this is necessary - it seems that the autograd engine might not be propagating gradients correctly.

The autograd engine is propagating gradients correctly, but the question is, which gradients?

You could encapsulate this as a function

  f = lambda a, b: torch.exp(a - b) * advantages.unsqueeze(1)
then have f_a(a, b) be the derivative of that with respect to a, and substitute in q for both variables to get f_a(q, q).

But if you substitute to get f(q, q) first and then differentiate with respect to q, you don't get f_a(q, q), but instead f_a(q, q) + f_b(q, q), which in this case would be 0. The ordering of variable substitution and differentiation cannot be exchanged freely.

detach() is a way to say "we want to differentiate the expression first, treating this as a constant, and then substitute with this variable afterwards."
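
A tiny PyTorch example makes the difference concrete (shapes simplified, numbers arbitrary):

  import torch

  advantages = torch.tensor([2.0])
  q = torch.tensor([0.7], requires_grad=True)

  # Without detach: f(q, q) = exp(q - q) * A is constant in q, so the
  # gradient is f_a(q, q) + f_b(q, q) = A - A = 0.
  (torch.exp(q - q) * advantages).sum().backward()
  print(q.grad)  # tensor([0.])

  q.grad = None
  # With detach: the second argument is treated as a constant, so we get
  # f_a(q, q) = exp(q - q) * A = A.
  (torch.exp(q - q.detach()) * advantages).sum().backward()
  print(q.grad)  # tensor([2.])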

danielhanchen a day ago

Oh thanks for posting! If anyone has any questions about stuff, feel free to ask!

  • loxias 16 hours ago

    Thanks for what you're doing. Of all the various companies and orgs posting chatter about deep learning, I've come to really appreciate your efforts (and Anthropic), because you're USING MATH. :)

    I have some understanding of applied math, continuous and discrete, and while I don't keep up to date with developments in deep learning/AI in general, I always look forward to unsloth posts because they tend to center on achieving a desirable result thanks to proper application of good old fashioned "understanding the damn math and once you do, then doing the obvious". :)

    Reminds me of learning how to optimize twiddle factors when writing a performant FFT (more than divide-and-conquer, one also uses trig identities and some algebra to reduce the number of multiplies), or of learning of elliptic minimal Q-factor (EMQF) filters -- clever IIR filters that give a sharp frequency response using less than 50% (or is it 25% or more?) of the computation required traditionally, by optimizing for *coefficients with lots of zeros in the base 2 representation*. And computers, it turns out, can multiply numbers by zero really fast. ;-)

    The throughline to me is that if you pause and think deeply about "wait, what are we really doing here?" and look at the whole math stack, and think about what computers are good at, sometimes you can achieve great results.

  • lennxa 20 hours ago

    thanks for your efforts!

    how practical do you think grpo is? (for most people)

    here's my thoughts:

    - grpo starts off slow, with a super small loss (likely because the rewards on all observations are the same)
    - as you mentioned, some sft on reasoning data ought to help speed things up
    - unless you're a lab with a gazillion gpus, wouldn't you be better off taking your non-reasoning dataset and converting it into a high quality reasoning dataset using frontier models (maybe deepseek)? could grpo be cheaper or give better accuracy?
    - maybe you do tons of sft, and when you've reached the frontier models' perf on your task, then perhaps grpo could help with more exploration

    would be great to hear your thoughts

    • danielhanchen 19 hours ago

      Thanks! Yes so synthetic data generation and data augmentation are also very useful! A trick one could employ is to first generate 1000s of possible answers then select the top 10 to be used in GRPO - it's kinda like o3 with majority voting!
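
      A loose sketch of that trick (generate() and score() are placeholders for your own model calls, not a real API):

        # Hypothetical: sample many candidate answers, keep the top-k by some
        # scoring function, then use those as higher-quality data for SFT/GRPO.
        def best_of_n(prompt, generate, score, n=1000, k=10):
            candidates = [generate(prompt) for _ in range(n)]
            return sorted(candidates, key=score, reverse=True)[:k]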