Vizuara AI Pods

RL From Scratch

From Q-learning to GRPO — build reinforcement learning from first principles.

Intermediate · ~24 hours · 10 pods live

Pods in this Course

1. Basics of Reinforcement Learning

From the agent-environment loop to Bellman equations, rewards, and your first OpenAI Gymnasium agent -- all from first principles.

~4h · 3 notebooks · Case study

2. Value Functions and Q-Learning

From Bellman's recursive insight to teaching machines to learn optimal behavior through trial and error.

~4h · 4 notebooks · Case study

3. Building DQN Atari Agents

How DeepMind combined Q-learning with convolutional neural networks to play Atari games at superhuman levels -- and why this 2013 paper changed everything.

~4h · 4 notebooks · Case study

4. Policy Gradient Methods

From REINFORCE to Actor-Critic -- how gradient ascent on policy parameters unlocks continuous and high-dimensional action spaces.

~3h · 3 notebooks · Case study

5. RLHF Theory and Implementation: Teaching Machines to Learn from Human Preferences

A complete guide to aligning language models with human preferences: from reward modeling to PPO, with full code implementations.

~3h · 3 notebooks · Case study

6. Group-Relative Policy Optimization (GRPO) -- From Scratch

How DeepSeek eliminated the critic network and made RLHF simpler, cheaper, and better. From group-relative advantages to training reasoning models.

~3h · 3 notebooks · Case study

7. Building a Reasoning Model from Scratch

How to teach a small language model to think step-by-step using reinforcement learning with verifiable rewards -- from SFT on chain-of-thought data to GRPO training to distillation.

~3h · 4 notebooks · Case study

8. OpenClaw-RL: Personalizing AI Agents from Conversation Feedback

How to make your AI assistant learn your preferences from natural conversations. Build the full OpenClaw-RL pipeline: session-aware rollouts, Binary RL with GRPO-TCR, On-Policy Distillation with hindsight hints, and the RLAnything closed loop.

~4h · 4 notebooks · Case study

9. Mini-SWE-RL: Teaching a Small Language Model to Fix Bugs with RL

How to build the exact same RL pipeline used by state-of-the-art SWE agents like DeepSWE — miniaturized to run on your laptop in 30 minutes.

~3h · 4 notebooks · Case study

10. Reinforcement Learning with Language Feedback

~4h · 8 notebooks · Case study