The Multi-Armed Bandit (MAB) is a classic reinforcement learning problem in which an agent repeatedly chooses among multiple actions (arms) to maximize cumulative reward. Each action yields a reward drawn from an unknown distribution.
In product recommendation, each product is an “arm.” The agent recommends a product to a user and receives a reward if the user clicks. The goal is to learn which product yields the highest click-through rate (CTR) over time.
Key Assumptions
Thanks to the law of large numbers, estimated CTRs converge to the true CTRs as each arm is sampled more often. In real-world scenarios, however, time and cost constraints make unlimited sampling impractical, so MAB strategies aim to find a good arm quickly while wasting as few recommendations as possible.
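As a concrete illustration of this trade-off between sampling cost and estimate quality, here is a minimal epsilon-greedy simulation sketch (the CTR values, epsilon, and number of steps are illustrative assumptions):

import numpy as np

true_ctrs = np.array([0.05, 0.12, 0.08])   # unknown to the agent
n_arms = len(true_ctrs)
Q = np.zeros(n_arms)                        # estimated CTR per arm
N = np.zeros(n_arms)                        # times each arm was recommended
epsilon = 0.1

for t in range(10_000):
    if np.random.rand() < epsilon:
        arm = np.random.randint(n_arms)     # explore a random arm
    else:
        arm = int(np.argmax(Q))             # exploit the current best estimate
    reward = float(np.random.rand() < true_ctrs[arm])   # 1 if clicked, else 0
    N[arm] += 1
    Q[arm] += (reward - Q[arm]) / N[arm]    # incremental mean update

print("Estimated CTRs:", Q.round(3))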
Python Custom RL Environments
Discover how to build a custom reinforcement learning environment using Python! By subclassing gym.Env, you can define custom states, rewards, and transitions with the __init__, reset(), and step() methods. This enables simulations tailored for specific tasks, such as click-through optimization.
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class ProductRecommendationEnv(gym.Env):
    def __init__(self, product_ctrs):
        self.product_ctrs = product_ctrs              # true CTR per product (arm)
        self.n_products = len(product_ctrs)
        self.action_space = spaces.Discrete(self.n_products)
        self.observation_space = spaces.Discrete(1)   # single dummy state

    def reset(self, seed=None, options=None):
        return 0, {}

    def step(self, action):
        click = np.random.rand() < self.product_ctrs[action]
        reward = 10.0 if click else -10.0
        return 0, reward, False, False, {"clicked": click}
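A short usage sketch, assuming the class above (the CTR values are illustrative assumptions):

env = ProductRecommendationEnv(product_ctrs=[0.05, 0.12, 0.08])
obs, info = env.reset()
# Recommend product 1 and observe whether it was clicked
obs, reward, terminated, truncated, info = env.step(action=1)
print(reward, info["clicked"])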
Python UCB Strategy
The Upper Confidence Bound (UCB) strategy selects actions by considering both the estimated reward and the uncertainty of that estimate. By focusing on options with high potential, it avoids purely random exploration, which makes learning more efficient in multi-armed bandit problems.
import numpy as np

def calculate_ucb(Q, N, t, c=2):
    # Q: estimated value per arm, N: pull counts, t: current timestep,
    # c: exploration strength; 1e-5 avoids division by zero for unpulled arms
    return Q + c * np.sqrt(np.log(t) / (N + 1e-5))

ucb_values = calculate_ucb(
    Q=np.array([0.25, 0.15, 0.15]),
    N=np.array([40, 25, 35]),
    t=101,
)
action = np.argmax(ucb_values)
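Putting it together, here is a sketch of running UCB against the ProductRecommendationEnv defined earlier (the CTR values and step count are assumptions; note that Q here estimates the average +10/-10 reward rather than the raw CTR):

import numpy as np

env = ProductRecommendationEnv(product_ctrs=[0.05, 0.12, 0.08])
env.reset()
n_arms = env.n_products
Q = np.zeros(n_arms)   # estimated average reward per arm
N = np.zeros(n_arms)   # number of times each arm was pulled

for t in range(1, 5001):
    action = int(np.argmax(calculate_ucb(Q, N, t)))
    _, reward, _, _, _ = env.step(action)
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]   # incremental mean update

print("Pull counts per arm:", N)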
Markov Decision Processes (MDPs) are central to reinforcement learning. They offer a structured way to model scenarios where decisions depend on the current state and the chosen action. Importantly, MDPs rely on the Markov property: the next state depends only on the present state and action, not on the full history.
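In symbols, a standard way to state the Markov property is:

P(S_{t+1} = s' | S_t = s, A_t = a, S_{t-1}, A_{t-1}, ..., S_0, A_0) = P(S_{t+1} = s' | S_t = s, A_t = a)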
Python Transition Probabilities
Transition probabilities define how likely an agent is to reach a particular next state from a given state-action pair. Deterministic transitions always lead to the same outcome, while stochastic transitions introduce randomness.
import gymnasium as gym

env = gym.make('Taxi-v3')
P = env.unwrapped.P  # Transition dictionary

# Check transitions for state 0 and action 0 in deterministic env
for prob, next_state, reward, done in P[0][0]:
    print(f"Prob: {prob}, Next State: {next_state}, Reward: {reward}, Done: {done}")

# Simulated stochastic transition model for illustration
P_stochastic = {
    0: {
        0: [
            (0.7, 1, -1, False),
            (0.3, 2, -1, False),
        ]
    }
}
for prob, next_state, reward, done in P_stochastic[0][0]:
    print(f"Prob: {prob}, Next State: {next_state}, Reward: {reward}")
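As a small follow-up sketch, assuming the P_stochastic dictionary above, a stochastic transition could be sampled like this:

import numpy as np

transitions = P_stochastic[0][0]
probs = [prob for prob, _, _, _ in transitions]
idx = np.random.choice(len(transitions), p=probs)   # pick an outcome by its probability
prob, next_state, reward, done = transitions[idx]
print(f"Sampled next state: {next_state}, reward: {reward}")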
Monte Carlo methods in reinforcement learning estimate value functions by sampling complete episodes and averaging the observed returns. These model-free approaches are useful when the environment's dynamics are unknown, since they learn from episodic experience without requiring a transition model.
from collections import defaultdict
import numpy as np

def monte_carlo_learning(env, n_episodes, epsilon=0.1, gamma=0.99):
    # `policy` is an epsilon-greedy action-selection helper (sketched below)
    q_table = defaultdict(lambda: defaultdict(float))
    returns = defaultdict(list)
    total_reward = 0
    for episode in range(n_episodes):
        state, info = env.reset()
        episode_states = []
        episode_actions = []
        episode_rewards = []
        done = False
        # Generate one complete episode
        while not done:
            action = policy(q_table, state, epsilon)
            next_state, reward, terminated, truncated, info = env.step(action)
            episode_states.append(state)
            episode_actions.append(action)
            episode_rewards.append(reward)
            done = terminated or truncated
            total_reward += reward
            state = next_state
        # Walk the episode backwards, accumulating discounted returns
        G = 0
        for t in reversed(range(len(episode_states))):
            state_t = episode_states[t]
            action_t = episode_actions[t]
            reward_t = episode_rewards[t]
            G = reward_t + gamma * G
            returns[(state_t, action_t)].append(G)
            q_table[state_t][action_t] = np.mean(returns[(state_t, action_t)])
    return q_table, returns
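The function above relies on a `policy` helper that is not shown. A minimal epsilon-greedy sketch consistent with the call `policy(q_table, state, epsilon)` could look like this (the `n_actions` default is an assumption; 6 matches Taxi-v3):

import numpy as np

def policy(q_table, state, epsilon, n_actions=6):
    # Epsilon-greedy selection over the action-value estimates for `state`
    if np.random.rand() < epsilon or not q_table[state]:
        return int(np.random.randint(n_actions))          # explore (or no estimates yet)
    return max(q_table[state], key=q_table[state].get)    # exploit the best estimate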
Monte Carlo methods estimate value functions from complete episodes of experience, either by averaging all observed returns (as above) or by updating estimates with a fixed learning rate. First-visit and every-visit variants differ in whether a state-action pair is updated once per episode or every time it appears, as shown below.
# First-Visit MC (learning rate)
def first_visit_monte_carlo(q_table, episode_states, episode_actions, episode_rewards, alpha, gamma):
    visited_state_actions = set()
    G = 0
    for t in reversed(range(len(episode_states))):
        state_t = episode_states[t]
        action_t = episode_actions[t]
        reward_t = episode_rewards[t]
        G = reward_t + gamma * G
        # Update only first visit
        if (state_t, action_t) not in visited_state_actions:
            visited_state_actions.add((state_t, action_t))
            q_table[state_t, action_t] += alpha * (G - q_table[state_t, action_t])
    return q_table

# Every-Visit MC (learning rate)
def every_visit_monte_carlo(q_table, episode_states, episode_actions, episode_rewards, alpha, gamma):
    G = 0
    for t in reversed(range(len(episode_states))):
        state_t = episode_states[t]
        action_t = episode_actions[t]
        reward_t = episode_rewards[t]
        G = reward_t + gamma * G
        # Update every visit
        q_table[state_t, action_t] += alpha * (G - q_table[state_t, action_t])
    return q_table
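A minimal usage sketch with a toy two-step episode (all values below are illustrative assumptions; q_table is a NumPy array indexed by (state, action)):

import numpy as np

q_table = np.zeros((4, 2))          # 4 states, 2 actions
states = [0, 1]
actions = [1, 0]
rewards = [0.0, 1.0]

q_table = first_visit_monte_carlo(q_table, states, actions, rewards, alpha=0.1, gamma=0.9)
print(q_table)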
In reinforcement learning, model-free methods such as Q-learning learn action values directly from interaction without a transition model, while model-based methods such as Value Iteration plan using the environment's transition probabilities. The comparison below applies both to the same CliffWalking environment.
# Example: Comparison of Model-Free and Model-Based Approaches

# Model-Free: Q-learning
import gymnasium as gym
import numpy as np

env = gym.make("CliffWalking-v0")
q_table = np.zeros((env.observation_space.n, env.action_space.n))
alpha = 0.1
gamma = 0.99
epsilon = 0.1

state, _ = env.reset()
done = False
while not done:
    # Epsilon-greedy action selection
    if np.random.rand() < epsilon:
        action = env.action_space.sample()
    else:
        action = np.argmax(q_table[state])
    next_state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
    # Temporal-difference update toward the greedy target
    q_table[state, action] += alpha * (reward + gamma * np.max(q_table[next_state]) - q_table[state, action])
    state = next_state

# Model-Based: Value Iteration
env = gym.make("CliffWalking-v0")
P = env.unwrapped.P  # Transition model
n_states = env.observation_space.n
n_actions = env.action_space.n
gamma = 0.99
theta = 1e-6
V = np.zeros(n_states)

# Value Iteration loop
while True:
    delta = 0
    for s in range(n_states):
        v = V[s]
        # Bellman optimality update: best expected return over all actions
        V[s] = max(
            sum(prob * (reward + gamma * V[next_s])
                for prob, next_s, reward, done in P[s][a])
            for a in range(n_actions)
        )
        delta = max(delta, abs(v - V[s]))
    if delta < theta:
        break
Python Value Iteration
Value Iteration is a model-based dynamic programming method used in reinforcement learning to compute the optimal value function. It uses the Bellman Optimality Equation to iteratively update state values until convergence. Once the optimal values are computed, the optimal policy can be derived by choosing the best action in each state.
Summary: for each state, compute every action value from the transition model, set the state value to the maximum action value, and stop once the largest change across states falls below the threshold theta. The code below applies this to Taxi-v3.
import gymnasium as gym
import numpy as np

env = gym.make("Taxi-v3")
P = env.unwrapped.P  # Transition model
n_states = env.observation_space.n
n_actions = env.action_space.n

def value_iteration(max_iterations, gamma, theta):
    V = np.zeros(n_states)
    Q = np.zeros((n_states, n_actions))
    for iteration in range(max_iterations):
        delta = 0
        old_V = V.copy()
        for state in range(n_states):
            for action in range(n_actions):
                action_value = 0
                for prob, next_state, reward, done in P[state][action]:
                    if done:  # Terminal state
                        action_value += prob * reward
                    else:  # Non-terminal state
                        action_value += prob * (reward + gamma * V[next_state])
                Q[state, action] = action_value
            V[state] = np.max(Q[state])
            delta = max(delta, abs(old_V[state] - V[state]))
        # Check convergence
        if delta < theta:
            break
    return V, Q
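A brief usage sketch (the hyperparameter values are assumptions); as described above, the optimal policy follows by taking the best action in each state:

V, Q = value_iteration(max_iterations=1000, gamma=0.99, theta=1e-6)
optimal_policy = np.argmax(Q, axis=1)   # greedy action for each state
print(optimal_policy[:10])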