Conference Session

Efficient Reinforcement Learning

11:20am - 11:39am | Efficient Reinforcement Learning

Speakers: Rhythm Garg & Linden Li (both Co-founders, Applied Compute)

Topic: RL mechanisms for building superhuman agents, and the proprietary RL stack Applied Compute uses for efficient model training

Notes

  • how do we push AI past productivity into real stuff
    • deploy with a data flywheel
    • RL is the tool they use
  • how does high-compute RL help LLMs learn to reason (see the RL-loop sketch after this list)
    • get a model, and have it try a problem 100s of times
    • grade the answers
    • when an attempt is correct, reinforce the thinking path that produced it
  • Applied Compute is different from the labs
    • need the runs to be fast
    • cheap
    • predictable (generally low variance)
    • can we build that?
  • naive sync RL
    • no
    • async RL: PipelineRL is their preferred approach
    • in-flight weight updates
    • some tokens come from previous weights, sometimes multiple generations back
    • variance increases as you increase staleness
    • you want staleness for fast runs, but staleness makes training unstable and requires algorithmic advances
  • assuming we know that, what is the high-throughput way to do RL
    • you get surprisingly far with some first-principles modeling of the problem (see the GPU-split sketch after this list)
      • n_gpus is the first variable
        • harder to account for with async, because GPUs can be split between training and sampling
      • training_batch_size
        • sample n problems in parallel
      • sampling throughput
        • KV cache memory
        • you can estimate the base KV cache size
        • forward pass latency per GPU
      • training_throughput_per_gpu
    • really focused on maximizing GPU usage for the training run
    • async
      • too many training GPUs but not enough sampling GPUs
        • no good
      • too many sampling GPUs
        • no good either
    • it's a delicate balance, and they seemed to know where it sits (sketched below)
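
A minimal sketch of the sample / grade / reinforce loop described in the notes above. Everything here (the `rl_step` function, `model.sample`, `model.train_on`, and the `grade` callback) is an illustrative assumption, not Applied Compute's actual stack.

```python
def rl_step(model, problems, grade, attempts_per_problem=128):
    """One RL step: try each problem many times, grade the attempts,
    and reinforce the thinking paths behind the correct ones."""
    reinforced = []
    for problem in problems:
        # Have the current model try the problem hundreds of times.
        attempts = [model.sample(problem) for _ in range(attempts_per_problem)]
        # Grade each attempt (e.g. exact-match answer checking or unit tests).
        rewards = [grade(problem, attempt) for attempt in attempts]
        # Keep the correct attempts; their trajectories get reinforced.
        reinforced += [(problem, a) for a, r in zip(attempts, rewards) if r > 0]
    # One update that raises the probability of the reinforced trajectories.
    model.train_on(reinforced)
```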
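
And a rough sketch of the first-principles GPU-split model hinted at above: given a fixed GPU budget, divide GPUs between sampling and training so the inference side produces tokens at roughly the rate the training side consumes them. All names and numbers are illustrative assumptions, not Applied Compute's actual model.

```python
def split_gpus(n_gpus, sampling_tok_per_gpu_s, training_tok_per_gpu_s):
    """Split a fixed GPU budget between sampling and training so that token
    production (inference) roughly matches token consumption (training)."""
    best = None
    for n_train in range(1, n_gpus):
        n_sample = n_gpus - n_train
        production = n_sample * sampling_tok_per_gpu_s    # inference side
        consumption = n_train * training_tok_per_gpu_s    # training side
        imbalance = abs(production - consumption)
        if best is None or imbalance < best[0]:
            best = (imbalance, n_train, n_sample)
    return best[1], best[2]

# Example with made-up throughputs: 64 GPUs, sampling produces ~2k tok/s per GPU,
# training consumes ~6k tok/s per GPU, so most GPUs should go to sampling.
n_train, n_sample = split_gpus(64, 2_000, 6_000)  # -> (16, 48)
```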

Slides

Slide: 2025-11-21-11-22


Key Point: Academic research introducing PipelineRL, a method for improving the efficiency of on-policy reinforcement learning when generating long sequences, positioning it as a contribution to both research and practical implementation.

Literal Content:

  • Title: “PipelineRL: Faster On-policy Reinforcement Learning for Long Sequence Generation”
  • Authors listed: Alexandre Piché, Ehsan Kamalloo, Rafael Pardinas, Xingyu Chen, Dzmitry Bahdanau
  • Affiliations: Armanadies AI Research team
  • ArXiv reference: arXiv:2309.16128v2 [cs.LG]
  • Abstract section with technical details about reinforcement learning for sequence generation

Slide: 2025-11-21-11-24


Key Point: Explaining the fundamental tradeoff in reinforcement learning - there’s “no free lunch” when it comes to optimizing policies. The mathematical progression shows how policy gradient methods become more complex when dealing with off-policy learning.

Literal Content:

  • Title: “No free lunch”
  • Three mathematical formulas showing expectation equations with policy gradients
  • Progressive complexity in the formulas, introducing importance sampling weights
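
The equations themselves aren't reproduced in the transcription above. The standard progression the slide appears to show, from the on-policy policy gradient to an off-policy version corrected with importance-sampling weights, looks roughly like this (a reconstruction, not the slide's exact notation):

```latex
% On-policy: samples x come from the current policy \pi_\theta
\nabla_\theta J(\theta)
  = \mathbb{E}_{x \sim \pi_\theta}\big[ R(x)\, \nabla_\theta \log \pi_\theta(x) \big]

% Off-policy: samples come from an older policy \pi_{\theta_{\mathrm{old}}},
% so an importance-sampling weight appears (the "no free lunch")
\nabla_\theta J(\theta)
  = \mathbb{E}_{x \sim \pi_{\theta_{\mathrm{old}}}}\Big[
      \frac{\pi_\theta(x)}{\pi_{\theta_{\mathrm{old}}}(x)}\,
      R(x)\, \nabla_\theta \log \pi_\theta(x) \Big]
```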

Slide: 2025-11-21-11-34


Key Point: Explaining key constraints when optimizing the layout of training and inference infrastructure - balancing token throughput between training and inference, and ensuring staleness doesn’t exceed acceptable limits.

Literal Content:

  • Pink background
  • Title: “Figuring out the optimal layout”
  • Section titled “Invariants:”
    1. Training token consumption rate == Inference token production rate (with mathematical formula)
    2. Max theoretical staleness does not exceed what our ML can handle (with formula for max_staleness)
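
The formulas themselves aren't captured in the transcription. Written out with assumed variable names (consistent with the notes above), the two invariants plausibly look like:

```latex
% Invariant 1: training consumes tokens exactly as fast as inference produces them
n_{\mathrm{train}} \cdot \mathrm{training\_throughput\_per\_gpu}
  \;=\; n_{\mathrm{sample}} \cdot \mathrm{sampling\_throughput\_per\_gpu}

% Invariant 2: worst-case staleness (how many weight updates behind the oldest
% in-flight tokens can be) stays within what the ML can tolerate
\mathrm{max\_staleness} \;\le\; \mathrm{staleness\_tolerable\_by\_ML}
```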