RL From Scratch

From Q-learning to GRPO -- build reinforcement learning from first principles.

Intermediate · ~24 hours · 10 pods live

Pods in this Course

Basics of Reinforcement Learning

From the agent-environment loop to Bellman equations, rewards, and your first Gymnasium (formerly OpenAI Gym) agent -- all from first principles.

~4h · 3 notebooks · Case study

Value Functions and Q-Learning

From Bellman's recursive insight to teaching machines to learn optimal behavior through trial and error.

~4h · 4 notebooks · Case study

Building DQN Atari Agents

How DeepMind combined Q-learning with convolutional neural networks to play Atari games at superhuman levels -- and why this 2013 paper changed everything.

~4h · 4 notebooks · Case study

Policy Gradient Methods

From REINFORCE to Actor-Critic -- how gradient ascent on policy parameters unlocks continuous and high-dimensional action spaces.

~3h · 3 notebooks · Case study

RLHF Theory and Implementation: Teaching Machines to Learn from Human Preferences

A complete guide to aligning language models with human preferences: from reward modeling to PPO, with full code implementations.

~3h · 3 notebooks · Case study

Group-Relative Policy Optimization (GRPO) -- From Scratch

How DeepSeek eliminated the critic network and made RLHF simpler, cheaper, and better. From group-relative advantages to training reasoning models.

~3h · 3 notebooks · Case study

Building a Reasoning Model from Scratch

How to teach a small language model to think step-by-step using reinforcement learning with verifiable rewards -- from SFT on chain-of-thought data to GRPO training to distillation.

~3h · 4 notebooks · Case study

OpenClaw-RL: Personalizing AI Agents from Conversation Feedback

How to make your AI assistant learn your preferences from natural conversations. Build the full OpenClaw-RL pipeline: session-aware rollouts, Binary RL with GRPO-TCR, On-Policy Distillation with hindsight hints, and the RLAnything closed loop.

~4h · 4 notebooks · Case study

Mini-SWE-RL: Teaching a Small Language Model to Fix Bugs with RL

How to build the same RL pipeline used by state-of-the-art SWE agents like DeepSWE -- miniaturized to run on your laptop in 30 minutes.

~3h · 4 notebooks · Case study

Reinforcement Learning with Language Feedback

~4h · 8 notebooks · Case study