Reinforcement learning (RL) enables agents to learn through trial and error by interacting with an environment. Unlike supervised learning, which relies on labeled data, RL focuses on maximizing cumulative rewards. Agents make decisions based on feedback from actions taken, improving strategies over time.
The reinforcement learning loop consists of: observe current state → choose action based on policy → take action → receive reward → observe next state → update policy (if learning) → repeat until episode ends.
```python
import gymnasium as gym
import numpy as np

# Create environment
env = gym.make("FrozenLake-v1", is_slippery=False)

# Initialize Q-table for learning (optional)
q_table = np.zeros((env.observation_space.n, env.action_space.n))
alpha = 0.1    # Learning rate
gamma = 0.99   # Discount factor
epsilon = 0.1  # Exploration rate

for episode in range(1000):
    # STEP 1: Observe current state (reset environment)
    state = env.reset()[0]
    done = False
    while not done:
        # STEP 2: Choose action based on policy (epsilon-greedy)
        if np.random.random() < epsilon:
            action = env.action_space.sample()  # Explore
        else:
            action = np.argmax(q_table[state])  # Exploit

        # STEP 3: Take action in environment
        next_state, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated
        # STEP 4: Receive reward (already got it from env.step())
        # STEP 5: Observe next state (already got it from env.step())

        # STEP 6: Update policy (Q-learning update rule, if learning)
        best_next_value = np.max(q_table[next_state])
        q_table[state, action] += alpha * (reward + gamma * best_next_value - q_table[state, action])

        # STEP 7: Move to next state (prepare for next iteration)
        state = next_state
    # STEP 8: Repeat until episode ends (controlled by while loop)

env.close()
```
Understanding the key components of Reinforcement Learning (RL) is crucial. The agent is the decision-maker. It interacts with the environment, which represents everything the agent deals with. The state reflects the current situation, while actions are the moves the agent can make. The agent chooses actions according to a policy, its decision-making strategy, and receives a reward as feedback for each action.
In the FrozenLake environment, we can see each RL component clearly: the agent is the player navigating the ice, the environment is the 4×4 grid of frozen tiles and holes, the state is the tile index (0-15), the actions are Left, Down, Right, and Up, and the reward is +1 for reaching the goal and 0 otherwise.
Episodic Tasks

Episodes represent complete sequences from start to terminal state. Episodic tasks have clear beginnings and endings (like FrozenLake reaching the goal or falling into a hole), while continuous tasks run indefinitely without natural endpoints.
Policies dictate how an agent behaves in each state. They can be random, rule-based, or learned over time. A rule-based policy follows predetermined rules, while a learned policy adapts using feedback.
```python
import gymnasium as gym
import numpy as np
import random

env = gym.make("FrozenLake-v1", is_slippery=False)

# =============================================================================
# 1. RANDOM POLICY
# =============================================================================
def random_policy(state):
    """Chooses actions randomly - no strategy"""
    return env.action_space.sample()

# =============================================================================
# 2. RULE-BASED POLICY
# =============================================================================
def rule_based_policy(state, grid_size=4):
    """Down-then-right strategy based on domain knowledge"""
    row, col = divmod(state, grid_size)
    # Rule: Go down until bottom row, then go right
    if row < grid_size - 1:
        return 1  # Down
    else:
        return 2  # Right

# =============================================================================
# 3. LEARNED POLICY (from Q-table)
# =============================================================================
def learned_policy(state, q_table, epsilon=0.1):
    """Epsilon-greedy policy using learned Q-values"""
    if random.random() < epsilon:
        return env.action_space.sample()  # Explore
    else:
        return np.argmax(q_table[state])  # Exploit learned knowledge
```
Q-function in Python

Action-Value Functions, or Q-functions, estimate the expected cumulative reward for taking a specific action in a given state and following the policy thereafter. Q(s,a) represents the value of action a in state s.
In the lesson, we approximated the Action-Value functions using a Q-table, organized as a 2D array where rows represent states and columns represent actions. In FrozenLake’s 4×4 grid, we have 16 states (numbered 0-15) and 4 possible actions (0=Left, 1=Down, 2=Right, 3=Up). Each cell Q(s,a) contains the estimated value of taking action ‘a’ in state ‘s’. States are numbered left-to-right, top-to-bottom, so state 0 is the top-left starting position and state 15 is the bottom-right goal.
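A minimal sketch of this indexing (the 1.0 entry is an illustrative value, not a trained one):

```python
import numpy as np

# Empty 16x4 Q-table: rows = states, columns = actions
q_table = np.zeros((16, 4))

# Q(s, a) is just a 2D index: state 14, action 2 (Right)
q_table[14, 2] = 1.0
print(q_table[14, 2])  # value of taking Right in state 14

# State numbering is left-to-right, top-to-bottom on the 4x4 grid,
# so divmod recovers the grid position
state = 14
row, col = divmod(state, 4)
print(row, col)  # bottom row, one tile left of the goal
```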
Q(14,2) stands out with a value of 1.00 because in this state, action 2 (going to the right) leads directly to the goal. Similarly, states that are holes (like states 5, 7, 11, 12) have Q-values of 0.00 for all actions since falling into a hole ends the episode with no reward. The Q-table reveals the optimal policy by showing which action has the highest value in each state. For example, the greedy policy would always choose the action with the maximum Q-value for the current state.

Q-Value Selection

The greedy policy in reinforcement learning chooses the action with the highest Q-value for each state. This represents the current best-known strategy from learned experiences. Q-values help determine the expected rewards for actions, guiding decision-making under uncertainty.
In code, the greedy policy can be obtained from the Q-table using np.argmax() to select the action with the largest Q-value.
```python
import numpy as np

# Q-table for FrozenLake: 16 states x 4 actions
# Actions: 0=Left, 1=Down, 2=Right, 3=Up
q_table = np.array([
    [0.00, 0.45, 0.62, 0.00],  # State 0 (start)
    [0.35, 0.52, 0.71, 0.15],  # State 1
    [0.48, 0.63, 0.89, 0.22],  # State 2
    [0.28, 0.41, 0.00, 0.18],  # State 3
    [0.00, 0.58, 0.47, 0.33],  # State 4
    [0.00, 0.00, 0.00, 0.00],  # State 5 (hole)
    [0.31, 0.44, 0.66, 0.29],  # State 6
    [0.00, 0.00, 0.00, 0.00],  # State 7 (hole)
    [0.00, 0.72, 0.58, 0.41],  # State 8
    [0.43, 0.67, 0.74, 0.39],  # State 9
    [0.55, 0.81, 0.93, 0.48],  # State 10
    [0.00, 0.00, 0.00, 0.00],  # State 11 (hole)
    [0.00, 0.00, 0.00, 0.00],  # State 12 (hole)
    [0.61, 0.84, 0.95, 0.52],  # State 13
    [0.73, 0.89, 1.00, 0.67],  # State 14 (next to goal)
    [0.00, 0.00, 0.00, 0.00],  # State 15 (goal)
])

# Greedy policy: select action with highest Q-value
state = 14  # Example: agent is next to the goal
greedy_action = np.argmax(q_table[state])

print(f"Current state: {state}")
print(f"Q-values: {q_table[state]}")
print(f"Greedy action: {greedy_action} (Right)")
print(f"Best Q-value: {q_table[state, greedy_action]:.2f}")

# Output:
# Current state: 14
# Q-values: [0.73 0.89 1.   0.67]
# Greedy action: 2 (Right)
# Best Q-value: 1.00
```
The epsilon-greedy strategy is key to balancing exploration (choosing random actions) and exploitation (selecting the best-known action). With probability ε the agent tries something new; otherwise it leverages existing knowledge.
```python
import random
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)

def epsilon_greedy_policy(q_table, state, epsilon=0.2):
    """Choose action using epsilon-greedy policy"""
    if random.random() < epsilon:
        # Exploration: choose a random action
        return env.action_space.sample()
    else:
        # Exploitation: choose the best known action
        return np.argmax(q_table[state])
```
SARSA stands for State-Action-Reward-State-Action, an on-policy approach that updates Q-values using the next action the agent actually takes. The key formula is:
Q(s,a) ← (1-α)Q(s,a) + α(reward + γ·Q(s',a'))
where α is the learning rate, γ is the discount factor, and Q(s',a') is the value of the action actually chosen in the next state.
This method is often more conservative than Q-learning because it utilizes the agent’s chosen action value rather than the maximum possible value for the next state-action pair.
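As a quick numeric sanity check with made-up values (α = 0.1, γ = 0.99; these numbers are illustrative, not taken from a training run), a single SARSA update works out to:

```python
alpha, gamma = 0.1, 0.99
q_sa = 0.5     # current estimate Q(s, a)
reward = 0.0   # no reward on this step
q_next = 0.8   # Q(s', a') for the action actually chosen next

# Q(s,a) <- (1-alpha)*Q(s,a) + alpha*(reward + gamma*Q(s',a'))
q_sa = (1 - alpha) * q_sa + alpha * (reward + gamma * q_next)
print(round(q_sa, 4))  # 0.5292
```

The new estimate moves a small step (controlled by α) from the old value toward the discounted value of what actually happened next.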
```python
import numpy as np
import random
import gymnasium as gym

# SARSA Algorithm Implementation
env = gym.make("FrozenLake-v1", is_slippery=False)
q_table = np.zeros((16, 4))

# Parameters
alpha = 0.1    # Learning rate
gamma = 0.99   # Discount factor
epsilon = 0.3  # Exploration rate
episodes = 1000

def epsilon_greedy(state, q_table, epsilon):
    """Choose action using epsilon-greedy policy"""
    if random.random() < epsilon:
        return env.action_space.sample()
    else:
        return np.argmax(q_table[state])

# SARSA Training Loop
for episode in range(episodes):
    # Step 1: Get initial STATE
    state = env.reset()[0]
    done = False

    # Step 2: Choose initial ACTION using policy
    action = epsilon_greedy(state, q_table, epsilon)

    while not done:
        # Step 3: Take action and get REWARD and next STATE
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Step 4: Choose next ACTION using same policy
        next_action = epsilon_greedy(next_state, q_table, epsilon)

        # SARSA UPDATE: Use actual next action (a')
        q_table[state, action] += alpha * (reward + gamma * q_table[next_state, next_action] - q_table[state, action])

        # Move to next state and action
        state = next_state
        action = next_action  # This is key - use the action we actually chose
```
Gymnasium is a widely used Python library for Reinforcement Learning (RL), offering standardized environments with a uniform interface. Each call to env.step() returns the next state, reward, terminated, truncated, and info, enabling efficient RL algorithm implementation.
```python
import gymnasium as gym

env = gym.make('CartPole-v1')
state, info = env.reset()
done = False

while not done:
    action = env.action_space.sample()  # Random action
    state, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

print(f'Final State: {state}, Reward: {reward}')
env.close()
```
Convergence occurs when Q-values stabilize and stop changing significantly. It can be monitored through average rewards, success rates, Q-value changes (delta), and policy stability. Convergence-based stopping criteria are often better than fixed episode counts, as they ensure the agent has learned a stable policy before training ends.
```python
import numpy as np
import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
q_table = np.zeros((env.observation_space.n, env.action_space.n))

# Parameters
alpha = 0.3
gamma = 0.99
epsilon = 0.4
n_episodes = 5000

# Initialize convergence tracking variables
rewards = []
avg_rewards = []
success_rate = []
q_value_deltas = []
policy_changes = []
successes = 0
old_q_table = np.zeros_like(q_table)  # For comparison

for episode in range(n_episodes):
    state, _ = env.reset()
    done = False
    total_reward = 0

    while not done:
        # Epsilon-greedy action selection
        if np.random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = np.argmax(q_table[state])

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Q-learning update
        best_next_value = np.max(q_table[next_state])
        q_table[state, action] += alpha * (reward + gamma * best_next_value - q_table[state, action])

        state = next_state
        total_reward += reward

    # Track metrics after episode ends
    rewards.append(total_reward)
    if reward > 0:  # goal reached
        successes += 1

    # Calculate Q-value delta (average change in Q-table)
    q_delta = np.mean(np.abs(q_table - old_q_table))
    q_value_deltas.append(q_delta)

    # Calculate policy changes
    current_policy = np.argmax(q_table, axis=1)
    previous_policy = np.argmax(old_q_table, axis=1)
    policy_change = np.sum(current_policy != previous_policy)
    policy_changes.append(policy_change)

    # Update old Q-table for next comparison
    old_q_table = q_table.copy()

    # Print progress every 500 episodes
    if (episode + 1) % 500 == 0:
        print(f"Episode {episode + 1}: "
              f"Avg Reward = {np.mean(rewards[-500:]):.2f}, "
              f"Success Rate = {successes / 500:.2f}, "
              f"Q-Delta = {np.mean(q_value_deltas[-500:]):.5f}, "
              f"Policy Changes = {np.mean(policy_changes[-500:]):.1f}")
        success_rate.append(successes / 500)
        avg_rewards.append(np.mean(rewards[-500:]))
        successes = 0

print("Training completed!")
```
Reward hacking happens when agents exploit flaws in the reward function to maximize reward without completing the task as intended. To prevent this, designing thoughtful and robust reward structures is essential, ensuring agents focus on achieving desired outcomes rather than exploiting loopholes.
Agents are literal optimizers - they find the easiest way to maximize the reward signal, which may not align with human intentions. The agent doesn’t understand the true goal; it only sees the numerical reward.
Here are some concrete examples of reward hacking: a boat-racing agent rewarded for hitting score targets learned to circle endlessly collecting respawning power-ups instead of finishing the race; simulated robots rewarded for forward velocity have learned to fall or lunge forward rather than walk; and a cleaning agent rewarded per piece of dirt collected could dump dirt back out and re-collect it.
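A toy sketch (an entirely hypothetical reward design, not from any lesson environment) shows the failure mode: if we rewarded any tile change on a FrozenLake-style grid instead of only reaching the goal, an agent could farm unbounded reward by oscillating:

```python
def hackable_reward(prev_state, state):
    """Hypothetical shaped reward: +1 for changing tiles at all."""
    return 1.0 if state != prev_state else 0.0

# An agent can collect reward forever by bouncing between two tiles,
# making zero progress toward the actual goal
total = 0.0
prev = 0
for step in range(10):
    state = 4 if prev == 0 else 0  # oscillate between tile 0 and tile 4
    total += hackable_reward(prev, state)
    prev = state

print(total)  # 10.0 reward, zero task progress
```

Rewarding only the true outcome (reaching the goal), or penalizing each step, removes this loophole.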
Q-Learning in Action

Q-learning updates Q-values using the best possible future reward, combining old knowledge with new experience to learn optimal action values. The update formula is:
Q(s,a) ← (1-α)Q(s,a) + α(reward + γ·max Q(s',a'))

where the maximum is taken over all actions a' available in the next state, regardless of which action the agent actually takes next.
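A small numeric sketch (all values made up for illustration) highlights how this differs from SARSA: Q-learning always backs up the best next action, even when the epsilon-greedy agent actually explores a worse one:

```python
import numpy as np

alpha, gamma = 0.1, 0.99
q_sa = 0.5                               # current estimate Q(s, a)
reward = 0.0
q_next = np.array([0.2, 0.8, 0.3, 0.1])  # Q(s', .) for all four actions

# Q-learning: use max over next actions (off-policy)
q_learning = (1 - alpha) * q_sa + alpha * (reward + gamma * np.max(q_next))
print(round(q_learning, 4))  # 0.5292

# Compare: SARSA with an exploratory next action a'=2 (Q=0.3) updates lower
sarsa = (1 - alpha) * q_sa + alpha * (reward + gamma * q_next[2])
print(round(sarsa, 4))       # 0.4797
```

The gap between the two updates is exactly why Q-learning tends to be more optimistic and SARSA more conservative.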
```python
import numpy as np
import random
import gymnasium as gym

# Q-learning implementation from the lesson
env = gym.make("FrozenLake-v1", is_slippery=False, render_mode="ansi")
env.action_space.seed(42)
state, info = env.reset(seed=42)

# Parameters
alpha = 0.3    # Learning rate
gamma = 0.99   # Discount factor
epsilon = 0.4  # Exploration rate
n_episodes = 5000

# Initialize Q-table
q_table = np.zeros((env.observation_space.n, env.action_space.n))
successes = 0

for episode in range(n_episodes):
    state, _ = env.reset()
    done = False

    while not done:
        # Epsilon-greedy action selection
        action = env.action_space.sample() if random.random() < epsilon else np.argmax(q_table[state])

        # Take action and observe results
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated

        # Q-LEARNING UPDATE RULE (key part!)
        best_next_value = np.max(q_table[next_state])
        q_table[state, action] += alpha * (reward + gamma * best_next_value - q_table[state, action])

        state = next_state

    if reward > 0:
        successes += 1

    if (episode + 1) % 500 == 0:
        print(f"Episode {episode+1}: Successes in last 500 episodes = {successes}")
        successes = 0

print("Q-learning training completed!")
```