
Intro to Reinforcement Learning

Related learning

  • Learn reinforcement learning fundamentals and build learning agents with Gymnasium in this hands-on Python course.
    • With Certificate
    • Intermediate
      2 hours

Reinforcement Learning

Reinforcement learning (RL) enables agents to learn through trial and error by interacting with an environment. Unlike supervised learning, which relies on labeled data, RL focuses on maximizing cumulative rewards. Agents make decisions based on feedback from actions taken, improving strategies over time.

Reinforcement Loop Python

The reinforcement learning loop consists of: observe current state → choose action based on policy → take action → receive reward → observe next state → update policy (if learning) → repeat until episode ends.

import gymnasium as gym
import numpy as np

# Create environment
env = gym.make("FrozenLake-v1", is_slippery=False)

# Initialize Q-table for learning (optional)
q_table = np.zeros((env.observation_space.n, env.action_space.n))
alpha = 0.1    # Learning rate
gamma = 0.99   # Discount factor
epsilon = 0.1  # Exploration rate

for episode in range(1000):
    # STEP 1: Observe current state (reset environment)
    state, _ = env.reset()
    done = False
    while not done:
        # STEP 2: Choose action based on policy (epsilon-greedy)
        if np.random.random() < epsilon:
            action = env.action_space.sample()  # Explore
        else:
            action = np.argmax(q_table[state])  # Exploit
        # STEP 3: Take action in environment
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        # STEP 4: Receive reward (already returned by env.step())
        # STEP 5: Observe next state (already returned by env.step())
        # STEP 6: Update policy (Q-learning update rule, if learning)
        best_next_value = np.max(q_table[next_state])
        q_table[state, action] += alpha * (reward + gamma * best_next_value - q_table[state, action])
        # STEP 7: Move to next state (prepare for next iteration)
        state = next_state
    # STEP 8: Repeat until episode ends (controlled by the while loop)
env.close()

RL Components Overview

Understanding key components in Reinforcement Learning (RL) is crucial. The agent acts as the decision-maker. It interacts with the environment, which represents everything the agent deals with. The state reflects the current condition, while actions represent the moves the agent can make. The agent receives a reward as feedback for its actions, governed by a policy or decision-making strategy.

FrozenLake Example

In the FrozenLake environment, we can see each RL component clearly:

  • Agent: The character trying to navigate across the frozen lake. The agent is the learner that makes decisions about which direction to move.
  • Environment: The 4x4 grid of the frozen lake itself, containing safe frozen tiles, dangerous holes, a starting position, and a goal.
  • State: The agent’s current position on the grid, represented as numbers 0-15.
  • Actions: The four possible moves the agent can make at any position: 0=Left, 1=Down, 2=Right, 3=Up.
  • Reward: The feedback signal the agent receives. In FrozenLake, the agent gets +1 reward only when reaching the goal (state 15), and 0 reward for all other moves, including falling into holes.
  • Policy: The agent’s strategy for choosing actions. This could be random (picking any available direction), rule-based (always go right then down), or learned through experience (choosing the action with the highest expected reward).
Figure: the FrozenLake environment, a grid of tiles representing states, with the agent on the starting tile, arrows for the possible actions (up, down, left, right), safe frozen tiles, holes that end the episode, and a goal state that gives a reward, illustrating states, actions, transitions, and rewards.
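The state numbers in the list above can be read straight off FrozenLake's default 4×4 layout. A small sketch in plain Python (using the map Gymnasium ships as the `FrozenLake-v1` default):

```python
# Default 4x4 FrozenLake layout: S=start, F=frozen (safe), H=hole, G=goal.
# This matches the default map of Gymnasium's "FrozenLake-v1".
desc = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
tiles = "".join(desc)  # states are numbered row-major: state i <-> tiles[i]
holes = [i for i, t in enumerate(tiles) if t == "H"]
goal = tiles.index("G")
print(holes)  # [5, 7, 11, 12]
print(goal)   # 15
```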

Episodic Tasks

Episodes represent complete sequences from start to terminal state. Episodic tasks have clear beginnings and endings (like FrozenLake reaching the goal or falling into a hole), while continuous tasks run indefinitely without natural endpoints.

Episodic Tasks

  • Clear start and end points
  • The episode terminates when the goal is reached or failure occurs
  • Examples: FrozenLake, games, maze navigation, robot completing a specific task

Continuous Tasks

  • No natural endpoints
  • Agent interacts with the environment indefinitely
  • Examples: stock trading, temperature control, autonomous driving
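In code, the distinction shows up as two separate end-of-episode flags. A minimal sketch (no Gymnasium, just a hypothetical counter "environment") mirroring Gymnasium's `terminated`/`truncated` convention:

```python
# Hypothetical one-function "environment": walking toward a goal state.
# `terminated` marks a natural endpoint (episodic task); `truncated` is an
# imposed step limit, one way to chop an otherwise endless task into episodes.
def step(state, goal=3, max_steps=10):
    state += 1
    terminated = state == goal       # reached the goal: natural end
    truncated = state >= max_steps   # hit the step limit: forced end
    return state, terminated, truncated

state, done = 0, False
while not done:
    state, terminated, truncated = step(state)
    done = terminated or truncated
print(state, terminated, truncated)  # 3 True False  (ended naturally)
```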

Python Policy Types

Policies dictate how agents operate in various states. They can be random, rule-based, or learned over time. For instance, a rule-based policy will follow predetermined actions, while a learned one adapts using feedback.

import gymnasium as gym
import numpy as np
import random

env = gym.make("FrozenLake-v1", is_slippery=False)

# =============================================================================
# 1. RANDOM POLICY
# =============================================================================
def random_policy(state):
    """Chooses actions randomly - no strategy"""
    return env.action_space.sample()

# =============================================================================
# 2. RULE-BASED POLICY
# =============================================================================
def rule_based_policy(state, grid_size=4):
    """Down-then-right strategy based on domain knowledge"""
    row, col = divmod(state, grid_size)
    # Rule: Go down until bottom row, then go right
    if row < grid_size - 1:
        return 1  # Down
    else:
        return 2  # Right

# =============================================================================
# 3. LEARNED POLICY (from Q-table)
# =============================================================================
def learned_policy(state, q_table, epsilon=0.1):
    """Epsilon-greedy policy using learned Q-values"""
    if random.random() < epsilon:
        return env.action_space.sample()  # Explore
    else:
        return np.argmax(q_table[state])  # Exploit learned knowledge

Q-function in Python

Action-Value Functions, or Q-functions, estimate the expected reward for taking a specific action in a given state and following the policy thereafter. Q(s,a) represents the value of action a in state s.

In the lesson, we approximated the Action-Value functions using a Q-table, organized as a 2D array where rows represent states and columns represent actions. In FrozenLake’s 4×4 grid, we have 16 states (numbered 0-15) and 4 possible actions (0=Left, 1=Down, 2=Right, 3=Up). Each cell Q(s,a) contains the estimated value of taking action ‘a’ in state ‘s’. States are numbered left-to-right, top-to-bottom, so state 0 is the top-left starting position and state 15 is the bottom-right goal.
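Because the numbering is row-major, a state index converts to grid coordinates with `divmod` (a quick illustration, not from the lesson):

```python
grid_size = 4
# state -> (row, col): states are numbered left-to-right, top-to-bottom
row, col = divmod(14, grid_size)
print(row, col)  # 3 2  (bottom row, third column)
# (row, col) -> state: invert the mapping
print(row * grid_size + col)  # 14
```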

Q(14,2) stands out with a value of 1.00 because in this state, action 2 (going to the right) leads directly to the goal. Similarly, states that are holes (like states 5, 7, 11, 12) have Q-values of 0.00 for all actions since falling into a hole ends the episode with no reward. The Q-table reveals the optimal policy by showing which action has the highest value in each state. For example, the greedy policy would always choose the action with the maximum Q-value for the current state.

Q-Value Selection

The greedy policy in reinforcement learning chooses the action with the highest Q-value for each state. This represents the current best-known strategy from learned experiences. Q-values help determine the expected rewards for actions, guiding decision-making under uncertainty.

In code, the greedy policy can be obtained from the Q-table using np.argmax() to select the action with the largest Q-value.

import numpy as np

# Q-table for FrozenLake: 16 states × 4 actions
# Actions: 0=Left, 1=Down, 2=Right, 3=Up
q_table = np.array([
    [0.00, 0.45, 0.62, 0.00],  # State 0 (start)
    [0.35, 0.52, 0.71, 0.15],  # State 1
    [0.48, 0.63, 0.89, 0.22],  # State 2
    [0.28, 0.41, 0.00, 0.18],  # State 3
    [0.00, 0.58, 0.47, 0.33],  # State 4
    [0.00, 0.00, 0.00, 0.00],  # State 5 (hole)
    [0.31, 0.44, 0.66, 0.29],  # State 6
    [0.00, 0.00, 0.00, 0.00],  # State 7 (hole)
    [0.00, 0.72, 0.58, 0.41],  # State 8
    [0.43, 0.67, 0.74, 0.39],  # State 9
    [0.55, 0.81, 0.93, 0.48],  # State 10
    [0.00, 0.00, 0.00, 0.00],  # State 11 (hole)
    [0.00, 0.00, 0.00, 0.00],  # State 12 (hole)
    [0.61, 0.84, 0.95, 0.52],  # State 13
    [0.73, 0.89, 1.00, 0.67],  # State 14 (next to goal)
    [0.00, 0.00, 0.00, 0.00],  # State 15 (goal)
])

# Greedy policy: select action with highest Q-value
state = 14  # Example: agent is next to the goal
greedy_action = np.argmax(q_table[state])
print(f"Current state: {state}")
print(f"Q-values: {q_table[state]}")
print(f"Greedy action: {greedy_action} (Right)")
print(f"Best Q-value: {q_table[state, greedy_action]:.2f}")
# Output:
# Current state: 14
# Q-values: [0.73 0.89 1. 0.67]
# Greedy action: 2 (Right)
# Best Q-value: 1.00

Epsilon-Greedy Strategy

The epsilon-greedy strategy balances exploration (trying random actions) and exploitation (choosing the best-known action). It occasionally probes new possibilities while still leveraging existing knowledge.

How Epsilon-Greedy Works

  • With probability ε (epsilon): Choose a random action (exploration)
  • With probability 1-ε: Choose the best-known action (exploitation)
import numpy as np
import random
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)

def epsilon_greedy_policy(q_table, state, epsilon=0.2):
    """Choose action using epsilon-greedy policy"""
    if random.random() < epsilon:
        # Exploration: choose random action
        return env.action_space.sample()
    else:
        # Exploitation: choose best known action
        return np.argmax(q_table[state])

SARSA Update in Python

SARSA stands for State-Action-Reward-State-Action, an approach that updates Q-values by considering the actual next action. The key formula is:

Q(s,a) ← (1-α)Q(s,a) + α(reward + γ·Q(s',a'))

where

  • s = current state (where the agent is now)
  • a = current action (what the agent just did)
  • s’ = next state (where the agent ended up)
  • a’ = next action (what the agent will actually do next)
  • α = learning rate
  • γ = discount factor

This method is often more conservative than Q-learning because it bootstraps from the value of the action the agent actually chooses next, rather than the maximum value over all actions in the next state.
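The difference is easiest to see in the update targets themselves. A one-step numeric comparison (the Q-values below are illustrative, not from the lesson):

```python
gamma = 0.99
reward = 0.0
q_next = [0.2, 0.5, 0.9, 0.1]  # Q-values for the next state s'
next_action = 0  # suppose the epsilon-greedy agent explored and picked action 0

# SARSA bootstraps from the action actually taken next: Q(s', a')
sarsa_target = reward + gamma * q_next[next_action]
# Q-learning bootstraps from the best available action: max Q(s', .)
q_learning_target = reward + gamma * max(q_next)

# SARSA's target is lower here: it prices in the exploratory action,
# which is why it tends to learn more cautious policies.
print(sarsa_target < q_learning_target)  # True
```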

import numpy as np
import random
import gymnasium as gym

# SARSA Algorithm Implementation
env = gym.make("FrozenLake-v1", is_slippery=False)
q_table = np.zeros((16, 4))

# Parameters
alpha = 0.1    # Learning rate
gamma = 0.99   # Discount factor
epsilon = 0.3  # Exploration rate
episodes = 1000

def epsilon_greedy(state, q_table, epsilon):
    """Choose action using epsilon-greedy policy"""
    if random.random() < epsilon:
        return env.action_space.sample()
    else:
        return np.argmax(q_table[state])

# SARSA Training Loop
for episode in range(episodes):
    # Step 1: Get initial STATE
    state, _ = env.reset()
    done = False
    # Step 2: Choose initial ACTION using policy
    action = epsilon_greedy(state, q_table, epsilon)
    while not done:
        # Step 3: Take action and get REWARD and next STATE
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Step 4: Choose next ACTION using same policy
        next_action = epsilon_greedy(next_state, q_table, epsilon)
        # SARSA UPDATE: Use actual next action (a')
        q_table[state, action] += alpha * (
            reward + gamma * q_table[next_state, next_action] - q_table[state, action]
        )
        # Move to next state and action
        state = next_state
        action = next_action  # This is key - use the action we actually chose

Gymnasium Library Basics

Gymnasium is the standard Python library for Reinforcement Learning (RL), offering standardized environments with a uniform interface. Each call to env.step() returns the next state, reward, terminated, truncated, and info, so RL algorithms can be implemented against a single API.

import gymnasium as gym

env = gym.make('CartPole-v1')
state, info = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # Random action
    state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated
print(f'Final State: {state}, Reward: {reward}')
env.close()

Q-Value Convergence

Convergence occurs when Q-values stabilize and stop changing significantly. It can be monitored through average rewards, success rates, Q-value changes (delta), and policy stability. Convergence-based stopping criteria are often better than fixed episode counts, as they ensure the agent has learned a stable policy before training ends.

Convergence Indicators

  • Average Reward - Flattens when Q-values stabilize
  • Success Rate - Plateaus when learning stabilizes
  • Q-value Delta - Approaches zero when changes are minimal
  • Policy Changes - Decreases to zero when policy is stable
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
q_table = np.zeros((env.observation_space.n, env.action_space.n))

# Parameters
alpha = 0.3
gamma = 0.99
epsilon = 0.4
n_episodes = 5000

# Initialize convergence tracking variables
rewards = []
avg_rewards = []
success_rate = []
q_value_deltas = []
policy_changes = []
successes = 0
old_q_table = np.zeros_like(q_table)  # For comparison

for episode in range(n_episodes):
    state, _ = env.reset()
    done = False
    total_reward = 0
    while not done:
        # Epsilon-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[state])
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-learning update
        best_next_value = np.max(q_table[next_state])
        q_table[state, action] += alpha * (reward + gamma * best_next_value - q_table[state, action])
        state = next_state
        total_reward += reward
    # Track metrics after episode ends
    rewards.append(total_reward)
    if reward > 0:  # goal reached
        successes += 1
    # Calculate Q-value delta (average change in Q-table)
    q_delta = np.mean(np.abs(q_table - old_q_table))
    q_value_deltas.append(q_delta)
    # Calculate policy changes
    current_policy = np.argmax(q_table, axis=1)
    previous_policy = np.argmax(old_q_table, axis=1)
    policy_change = np.sum(current_policy != previous_policy)
    policy_changes.append(policy_change)
    # Update old Q-table for next comparison
    old_q_table = q_table.copy()
    # Print progress every 500 episodes
    if (episode + 1) % 500 == 0:
        print(f"Episode {episode + 1}: "
              f"Avg Reward = {np.mean(rewards[-500:]):.2f}, "
              f"Success Rate = {successes / 500:.2f}, "
              f"Q-Delta = {np.mean(q_value_deltas[-500:]):.5f}, "
              f"Policy Changes = {np.mean(policy_changes[-500:]):.1f}")
        success_rate.append(successes / 500)
        avg_rewards.append(np.mean(rewards[-500:]))
        successes = 0

print("Training completed!")
env.close()

Understanding Reward Hacking

Reward hacking happens when agents misinterpret the reward function to maximize rewards without completing the task correctly. To prevent this, designing thoughtful and robust reward structures is essential, ensuring agents focus on achieving desired outcomes rather than exploiting loopholes.

Agents are literal optimizers: they find the easiest way to maximize the reward signal, which may not align with human intentions. The agent doesn’t understand the true goal; it only sees the numerical reward.

Here are some concrete examples of reward hacking:

  • In the FrozenLake lesson example, when researchers added a +0.1 reward for each step taken (intending to encourage the agent to avoid holes), the agent learned to move left repeatedly at the starting position, collecting infinite small rewards without ever attempting to reach the goal.
  • A famous boat racing AI was supposed to learn to win races but instead discovered it could score more points by driving in circles and repeatedly hitting reward pickup items rather than actually finishing the race course.
  • A soccer-playing AI learned that instead of trying to score goals, it could kick the ball out of bounds repeatedly to maintain possession and avoid the risk of the opponent scoring, maximizing its “ball control” reward while making the game unwatchable.
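The first example above comes down to simple arithmetic. A toy calculation (the step counts and rewards are hypothetical, chosen only to show why the loophole pays):

```python
# Toy arithmetic for the flawed FrozenLake reward from the lesson example:
# +0.1 per step plus +1 for reaching the goal.
step_bonus = 0.1
goal_reward = 1.0

# Honest agent: shortest path on the 4x4 grid takes 6 steps, then wins.
honest_return = 6 * step_bonus + goal_reward   # about 1.6

# Hacking agent: shuffles in place for 100 steps and never finishes.
hacking_return = 100 * step_bonus              # about 10.0

print(hacking_return > honest_return)  # True - the hack out-earns the goal
```

Making the step bonus negative (a small cost per step) removes the loophole, which is why many gridworld rewards penalize rather than reward each move.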

Q-Learning in Action

Q-learning updates Q-values using the best possible future reward, combining old knowledge with new experience to learn optimal action values. The update formula is:

Q(s,a) ← (1-α)Q(s,a) + α(reward + γ·max Q(s',a'))

where

  • Q(s,a) = the current Q-value being updated (expected future reward for taking action ‘a’ in state ‘s’)
  • s = current state (where the agent is)
  • a = action taken (what the agent did)
  • s’ = next state (where the agent ended up)
  • α (alpha) = learning rate (0 to 1), controls how much new experiences update existing knowledge
  • γ (gamma) = discount factor (0 to 1), determines how much future rewards are valued compared to immediate rewards
  • reward = immediate reward received for taking action ‘a’ in state ‘s’
  • max Q(s’,a’) = the highest Q-value among all possible actions in the next state (representing the best possible future reward)
import numpy as np
import random
import gymnasium as gym

# Q-learning implementation from the lesson
env = gym.make("FrozenLake-v1", is_slippery=False, render_mode="ansi")
env.action_space.seed(42)
state, info = env.reset(seed=42)

# Parameters
alpha = 0.3    # Learning rate
gamma = 0.99   # Discount factor
epsilon = 0.4  # Exploration rate
n_episodes = 5000

# Initialize Q-table
q_table = np.zeros((env.observation_space.n, env.action_space.n))
successes = 0

for episode in range(n_episodes):
    state, _ = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection
        action = env.action_space.sample() if random.random() < epsilon else np.argmax(q_table[state])
        # Take action and observe results
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Q-LEARNING UPDATE RULE (key part!)
        best_next_value = np.max(q_table[next_state])
        q_table[state, action] += alpha * (reward + gamma * best_next_value - q_table[state, action])
        state = next_state
    if reward > 0:
        successes += 1
    if (episode + 1) % 500 == 0:
        print(f"Episode {episode+1}: Successes in last 500 episodes = {successes}")
        successes = 0

print("Q-learning training completed!")
env.close()
