Intermediate Reinforcement Learning

MAB Overview

The Multi-Armed Bandit (MAB) is a classic reinforcement learning problem where an agent repeatedly chooses among multiple actions (arms) to maximize cumulative reward. Each action yields a reward drawn from an unknown distribution.

In product recommendation, each product is an “arm.” The agent recommends a product to a user and receives a reward if the user clicks. The goal is to learn which product yields the highest click-through rate (CTR) over time.

Key Assumptions

  • Stationary rewards: The reward distribution for each arm doesn’t change over time.
  • Independent pulls: Each choice is independent of prior ones.
  • Known action space: All available actions (arms) are known in advance.
  • Immediate feedback: The agent observes a reward right after pulling an arm.

Thanks to the law of large numbers, the estimated CTRs converge to the true CTRs as each arm is sampled more often. In real-world scenarios, however, time and cost constraints make unlimited sampling impractical, so MAB strategies aim to find a good arm quickly by balancing exploration and exploitation.
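To make the exploration-exploitation trade-off concrete, here is a minimal sketch of an epsilon-greedy agent on a simulated three-armed Bernoulli bandit. The click probabilities in true_ctrs and the hyperparameters are made-up values for illustration only.

import numpy as np

# Hypothetical true click-through rates (unknown to the agent)
true_ctrs = np.array([0.10, 0.05, 0.18])
n_arms = len(true_ctrs)

Q = np.zeros(n_arms)   # Estimated CTR per arm
N = np.zeros(n_arms)   # Number of pulls per arm
epsilon = 0.1          # Exploration rate
rng = np.random.default_rng(0)

for t in range(10_000):
    # Explore with probability epsilon, otherwise exploit the best estimate
    if rng.random() < epsilon:
        arm = int(rng.integers(n_arms))
    else:
        arm = int(np.argmax(Q))
    reward = float(rng.random() < true_ctrs[arm])  # 1 if clicked, else 0
    N[arm] += 1
    Q[arm] += (reward - Q[arm]) / N[arm]           # Incremental mean update

print("Estimated CTRs:", Q.round(3))  # Approaches true_ctrs as pulls grow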

Python Custom RL Environments

Discover how to build a custom reinforcement learning environment using Python! By subclassing gym.Env, you can define custom states, rewards, and transitions with the __init__, reset(), and step() methods. This enables simulations tailored for specific tasks, such as click-through optimization.

import gymnasium as gym
import numpy as np
from gymnasium import spaces

class ProductRecommendationEnv(gym.Env):
    def __init__(self, product_ctrs):
        self.product_ctrs = product_ctrs
        self.n_products = len(product_ctrs)
        self.action_space = spaces.Discrete(self.n_products)
        self.observation_space = spaces.Discrete(1)  # Single dummy state

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return 0, {}

    def step(self, action):
        # Simulate a click with probability equal to the chosen product's CTR
        click = np.random.rand() < self.product_ctrs[action]
        reward = 10.0 if click else -10.0
        return 0, reward, False, False, {"clicked": click}
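A quick usage sketch of the environment above; the CTR values passed to the constructor are hypothetical.

env = ProductRecommendationEnv(product_ctrs=[0.10, 0.05, 0.18])
state, info = env.reset()

for _ in range(5):
    action = env.action_space.sample()  # Recommend a random product
    state, reward, terminated, truncated, info = env.step(action)
    print(f"Recommended product {action}, reward {reward}, clicked: {info['clicked']}")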

Python UCB Strategy

The Upper Confidence Bound (UCB) strategy selects actions by considering both the estimated reward value and the uncertainty in that estimate. By prioritizing options with high potential, it avoids purely random exploration, making learning more efficient for multi-armed bandit problems.

import numpy as np

def calculate_ucb(Q, N, t, c=2):
    # Exploitation term (Q) plus an exploration bonus that shrinks as N grows
    return Q + c * np.sqrt(np.log(t) / (N + 1e-5))

ucb_values = calculate_ucb(Q=np.array([0.25, 0.15, 0.15]),
                           N=np.array([40, 25, 35]),
                           t=101)
action = np.argmax(ucb_values)
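To see UCB at work over many rounds, the sketch below reuses the ProductRecommendationEnv and calculate_ucb defined above and keeps running estimates of Q and N. The number of rounds and the CTR values are illustrative assumptions.

env = ProductRecommendationEnv(product_ctrs=[0.10, 0.05, 0.18])
env.reset()
n_arms = env.action_space.n
Q = np.zeros(n_arms)  # Estimated mean reward per arm
N = np.zeros(n_arms)  # Pull counts per arm

for t in range(1, 5001):
    ucb_values = calculate_ucb(Q, N, t)
    action = int(np.argmax(ucb_values))
    _, reward, _, _, _ = env.step(action)
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]  # Incremental mean of rewards

print("Pull counts per arm:", N)  # The highest-CTR arm should dominate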

Understanding MDPs

Markov Decision Processes (MDPs) are a key framework in reinforcement learning. They offer a structured way to model sequential decision-making in terms of states, actions, rewards, and transitions. Importantly, MDPs rely on the Markov property: the next state depends only on the current state and action, not on the full history of past states.
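As a minimal illustration (the two states, two actions, probabilities, and rewards below are invented for demonstration), an MDP can be written down as a transition structure indexed only by the current state and action, which is exactly what the Markov property requires.

# Hypothetical two-state MDP
# P[state][action] = list of (probability, next_state, reward) tuples
P = {
    0: {0: [(1.0, 0, 0.0)],                  # Stay in state 0, no reward
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},  # Usually move to state 1
    1: {0: [(1.0, 0, 0.0)],
        1: [(1.0, 1, 2.0)]},
}
gamma = 0.9  # Discount factor

# The distribution over next states depends only on (state, action),
# never on how the agent arrived at the current state.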

Python Transition Probabilities

Transition probabilities define how likely an agent is to reach a particular next state from a given state-action pair. Deterministic transitions always lead to the same outcome, while stochastic transitions introduce randomness.

import gymnasium as gym

env = gym.make('Taxi-v3')
P = env.unwrapped.P  # Transition dictionary

# Check transitions for state 0 and action 0 in deterministic env
for prob, next_state, reward, done in P[0][0]:
    print(f"Prob: {prob}, Next State: {next_state}, Reward: {reward}, Done: {done}")

# Simulated stochastic transition model for illustration
P_stochastic = {
    0: {
        0: [
            (0.7, 1, -1, False),
            (0.3, 2, -1, False)
        ]
    }
}

for prob, next_state, reward, done in P_stochastic[0][0]:
    print(f"Prob: {prob}, Next State: {next_state}, Reward: {reward}")

Monte Carlo RL

Monte Carlo methods in reinforcement learning estimate value functions by sampling complete episodes and averaging the observed returns. These model-free approaches are handy when the environment's dynamics aren't known, since they learn purely from episodic experience rather than from a model.

from collections import defaultdict
import numpy as np

# Epsilon-greedy action selection (assumed helper; the original snippet
# references policy() without defining it)
def policy(q_table, state, epsilon, n_actions):
    if np.random.rand() < epsilon or not q_table[state]:
        return np.random.randint(n_actions)
    return max(q_table[state], key=q_table[state].get)

def monte_carlo_learning(env, n_episodes, epsilon=0.1, gamma=0.99):
    q_table = defaultdict(lambda: defaultdict(float))
    returns = defaultdict(list)
    total_reward = 0
    for episode in range(n_episodes):
        state, info = env.reset()
        episode_states = []
        episode_actions = []
        episode_rewards = []
        done = False
        # Generate one complete episode with the current epsilon-greedy policy
        while not done:
            action = policy(q_table, state, epsilon, env.action_space.n)
            next_state, reward, terminated, truncated, info = env.step(action)
            episode_states.append(state)
            episode_actions.append(action)
            episode_rewards.append(reward)
            done = terminated or truncated
            total_reward += reward
            state = next_state
        # Work backwards through the episode, accumulating discounted returns
        G = 0
        for t in reversed(range(len(episode_states))):
            state_t = episode_states[t]
            action_t = episode_actions[t]
            reward_t = episode_rewards[t]
            G = reward_t + gamma * G
            returns[(state_t, action_t)].append(G)
            q_table[state_t][action_t] = np.mean(returns[(state_t, action_t)])
    return q_table, returns
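As a usage sketch, the function above can be run on any episodic Gymnasium environment with discrete states and actions; the environment and episode count here are illustrative choices.

import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=False)
q_table, returns = monte_carlo_learning(env, n_episodes=5000)

# Inspect the learned action-value estimates for the start state
print(dict(q_table[0]))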

Monte Carlo Estimations

Monte Carlo methods estimate value functions from complete episodes of experience by averaging returns or using a learning rate.

  • First-Visit MC updates only the first occurrence of each state-action pair in an episode. Use First-Visit MC to avoid correlation and ensure unbiased estimation.
  • Every-Visit MC updates every occurrence of the state-action pair, allowing more frequent updates but potentially introducing bias. Use Every-Visit MC when faster learning is preferred and exact statistical properties are less critical.
# First-Visit MC (learning-rate update)
def first_visit_monte_carlo(q_table, episode_states, episode_actions, episode_rewards, alpha, gamma):
    G = 0
    first_visit_returns = {}
    for t in reversed(range(len(episode_states))):
        state_t = episode_states[t]
        action_t = episode_actions[t]
        reward_t = episode_rewards[t]
        G = reward_t + gamma * G
        # Iterating backwards and overwriting keeps the return of the earliest visit
        first_visit_returns[(state_t, action_t)] = G
    # Update each (state, action) pair once, using its first-visit return
    for (state_t, action_t), G_first in first_visit_returns.items():
        q_table[state_t, action_t] += alpha * (G_first - q_table[state_t, action_t])
    return q_table

# Every-Visit MC (learning-rate update)
def every_visit_monte_carlo(q_table, episode_states, episode_actions, episode_rewards, alpha, gamma):
    G = 0
    for t in reversed(range(len(episode_states))):
        state_t = episode_states[t]
        action_t = episode_actions[t]
        reward_t = episode_rewards[t]
        G = reward_t + gamma * G
        # Update every visit to the (state, action) pair
        q_table[state_t, action_t] += alpha * (G - q_table[state_t, action_t])
    return q_table
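For concreteness, the sketch below applies both update rules to the same hand-made episode; the table size, states, actions, and rewards are illustrative.

import numpy as np

n_states, n_actions = 4, 2
episode_states  = [0, 1, 0, 2]   # The pair (state 0, action 1) occurs twice
episode_actions = [1, 0, 1, 0]
episode_rewards = [0.0, 1.0, 0.0, 5.0]

q_first = first_visit_monte_carlo(np.zeros((n_states, n_actions)),
                                  episode_states, episode_actions,
                                  episode_rewards, alpha=0.5, gamma=0.9)
q_every = every_visit_monte_carlo(np.zeros((n_states, n_actions)),
                                  episode_states, episode_actions,
                                  episode_rewards, alpha=0.5, gamma=0.9)

# Because (0, 1) is visited twice, the two rules produce different estimates
print(q_first[0, 1], q_every[0, 1])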

Model-Free vs Model-Based

In reinforcement learning:

  • Model-free algorithms learn directly from environment interactions via trial and error. They do not require knowledge of transition probabilities or reward models. Examples include Q-learning, SARSA, and Monte Carlo methods. They are flexible but may learn more slowly, since they rely entirely on sampled experience.
  • Model-based algorithms use a known or learned model of the environment (i.e., transition and reward functions) to plan ahead and compute optimal policies through methods like Value Iteration or Policy Iteration. They use known dynamics to plan more efficiently but may require more computation.
# Example: Comparison of Model-Free and Model-Based Approaches

# Model-Free: Q-learning
import gymnasium as gym
import numpy as np

env = gym.make("CliffWalking-v0")
q_table = np.zeros((env.observation_space.n, env.action_space.n))
alpha = 0.1
gamma = 0.99
epsilon = 0.1

state, _ = env.reset()
done = False
while not done:
    # Epsilon-greedy action selection
    if np.random.rand() < epsilon:
        action = env.action_space.sample()
    else:
        action = np.argmax(q_table[state])
    next_state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    # Temporal-difference update toward the bootstrapped target
    q_table[state, action] += alpha * (
        reward + gamma * np.max(q_table[next_state]) - q_table[state, action]
    )
    state = next_state

# Model-Based: Value Iteration
env = gym.make("CliffWalking-v0")
P = env.unwrapped.P  # Transition model
n_states = env.observation_space.n
n_actions = env.action_space.n
gamma = 0.99
theta = 1e-6
V = np.zeros(n_states)

# Value Iteration loop
while True:
    delta = 0
    for s in range(n_states):
        v = V[s]
        # Bellman optimality update: best expected value over all actions
        V[s] = max(
            sum(prob * (reward + gamma * V[next_s])
                for prob, next_s, reward, done in P[s][a])
            for a in range(n_actions)
        )
        delta = max(delta, abs(v - V[s]))
    if delta < theta:
        break

Value Iteration in Python

Value Iteration is a model-based dynamic programming method used in reinforcement learning to compute the optimal value function. It uses the Bellman Optimality Equation to iteratively update state values until convergence. Once optimal values are computed, the optimal policy can be derived by choosing the best action in each state.

Summary:

  • Requires a known transition model (i.e., model-based)
  • Iteratively updates values until convergence
  • Extracts the optimal policy by acting greedily with respect to the final state values (a sketch of this step follows the code below)
  • Best suited for small, fully observable environments with tabular state/action spaces
import gymnasium as gym
import numpy as np

env = gym.make("Taxi-v3")
P = env.unwrapped.P  # Transition model
n_states = env.observation_space.n
n_actions = env.action_space.n

def value_iteration(max_iterations, gamma, theta):
    V = np.zeros(n_states)
    Q = np.zeros((n_states, n_actions))
    for iteration in range(max_iterations):
        delta = 0
        old_V = V.copy()
        for state in range(n_states):
            for action in range(n_actions):
                action_value = 0
                for prob, next_state, reward, done in P[state][action]:
                    if done:  # Terminal state
                        action_value += prob * reward
                    else:  # Non-terminal state
                        action_value += prob * (reward + gamma * V[next_state])
                Q[state, action] = action_value
            V[state] = np.max(Q[state])
            delta = max(delta, abs(old_V[state] - V[state]))
        # Check convergence
        if delta < theta:
            break
    return V, Q
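As noted in the summary, once value iteration has converged, the optimal policy is recovered by acting greedily with respect to the computed action values. A minimal sketch, using the value_iteration function above with illustrative hyperparameters:

V, Q = value_iteration(max_iterations=1000, gamma=0.99, theta=1e-6)
policy = np.argmax(Q, axis=1)  # Best action for each state

# Roll out one episode with the derived policy
state, _ = env.reset()
done = False
total_reward = 0
while not done:
    state, reward, terminated, truncated, _ = env.step(int(policy[state]))
    total_reward += reward
    done = terminated or truncated
print("Episode reward:", total_reward)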

Learn more on Codecademy

  • Learn reinforcement learning fundamentals and build learning agents with Gymnasium in this hands-on Python course (Intermediate, 2 hours, certificate available).