RL From Scratch
From Q-learning to GRPO — build reinforcement learning from first principles.
Pods in this Course

Basics of Reinforcement Learning
From the agent-environment loop to Bellman equations, rewards, and your first Gymnasium agent (Gymnasium is the maintained successor to OpenAI Gym) -- all from first principles.

Value Functions and Q-Learning
From Bellman's recursive insight to teaching machines to learn optimal behavior through trial and error.

Building DQN Atari Agents
How DeepMind combined Q-learning with convolutional neural networks to play Atari games at superhuman levels -- and why this 2013 paper changed everything.

Policy Gradient Methods
From REINFORCE to Actor-Critic -- how gradient ascent on policy parameters unlocks continuous and high-dimensional action spaces.

RLHF Theory and Implementation: Teaching Machines to Learn from Human Preferences
A complete guide to aligning language models with human preferences: from reward modeling to PPO, with full code implementations.

Group-Relative Policy Optimization (GRPO) -- From Scratch
How DeepSeek eliminated the critic network and made RLHF simpler, cheaper, and better. From group-relative advantages to training reasoning models.

Building a Reasoning Model from Scratch
How to teach a small language model to think step-by-step using reinforcement learning with verifiable rewards -- from SFT on chain-of-thought data to GRPO training to distillation.

OpenClaw-RL: Personalizing AI Agents from Conversation Feedback
How to make your AI assistant learn your preferences from natural conversations. Build the full OpenClaw-RL pipeline: session-aware rollouts, Binary RL with GRPO-TCR, On-Policy Distillation with hindsight hints, and the RLAnything closed loop.

Mini-SWE-RL: Teaching a Small Language Model to Fix Bugs with RL
How to build the same RL pipeline used by state-of-the-art SWE agents like DeepSWE -- miniaturized to run on your laptop in 30 minutes.

Reinforcement Learning with Language Feedback