【RL】Q-learning算法简单讲解

朔漠君

966人浏览 · 2025-11-05 18:55:23

朔漠君 · 2025-11-05 18:55:23 发布

Q-learning描述

一种无模型的基于价值的强化学习，比如你被蒙眼放入一个迷宫汇总，目标是找到宝藏。只能通过摸索前行，每次撞墙就会扣分，没撞墙就奖励一下，找到宝藏就大幅奖励一下，经过无数次尝试和失败后，你的大脑就会逐渐形成一张“认知地图”（Q-Table）。强化学习的入门 hello world
其核心问题是 在一个完全未知的环境中，智能体如何通过试错来学会选择最优行动，以达到最终目标。

核心概念

除了智能体、环境、动作、奖励、策略都是常规强化学习内容，需注意点是价值函数使用动作-价值函数 Bellman Equation
一个状态-动作对的价值（Q值），等于它带来的即时奖励，加上后续所有可能状态的价值按概率的折扣总和。
最优Q值的贝尔曼方程如下：
$Q^{*}(s,a)=E[R(s,a)+\gamma max_{a^{'}}Q^{*}(s^{'},a^{'})]$

$s^{'}是执行动作a后达到的新的状态$
$γ∈[0,1]\gamma \in [0,1]$ 是折扣因子决定对未来奖励的重视程度
$max a^{'}Q(s^{'},a^{'})$ 表示新状态 $s^{'}$ 下能获得的最大Q值
公式可调整为：
$Q(s_t,a_t) <- Q(s_t,a_t)+\alpha [r+\gamma max_{a^{t+1}}Q(s_{t+1},a^{t+1})-Q(s_t,a_t)]$
$Q(s_t,a_t)$ 当前状态s和动作a的Q值
$α\alpha$ 控制更新步长
$r$ 执行动作a后获取的即时奖励
$γ\gamma$ 折扣因子表示未来奖励的重要性
$max_{a^{t+1}}Q(s_{t+1},a^{t+1})$ 在下一状态中，所有可能动作的最大Q值
TD目标（Temporal Difference target） $r+γmaxat+1Q(st+1,at+1)r+\gamma max_{a^{t+1}}Q(s_{t+1},a^{t+1})$
TD误差：TD目标减去当前Q值

例子

假设我们有一个简单的环境，有两个状态（状态0和状态1）和两个动作（动作0和动作1）。
初始值Q表都为0，我们设置学习率 $α=0.1\alpha=0.1$ 折扣因子 $γ=0.9\gamma=0.9$

状态	动作0	动作1
0	0	0
1	0	0

step1：从状态0开始

当前状态s=0
假设我们选择到动作0
执行动作时获得即时奖励 $r = 1$ ，并转移到下一个状态 $s^{'}=1$
更新Q(0,0)
– 当前Q(0,0)=0
– 计算TD目标: $r+γmaxat+1Q(1,at+1)r+\gamma max_{a^{t+1}}Q(1,a^{t+1})$
— 在状态1中，最大Q值: $max_{a^{t+1}}Q(1,a^{t+1})=max(0,0)=0$
— 所以TD目标= $1 + 0.9 * 1 = 1$
– TD误差 1-0=1
– 更新 $Q (0, 0) = 0 + 0.1 * 1 = 0.1$
step2: 依次计算

状态	动作0	动作1
0	0.1	0
1	0	0.209

python实现Q-learning求解CVRP问题

import numpy as np
import random

# ====== CVRP环境 ======
class CVRPEnv:
    def __init__(self, dist_matrix, demands, capacity):
        self.dist = dist_matrix
        self.demands = demands
        self.capacity = capacity
        self.n_customers = len(demands) - 1  # 不含仓库
        self.reset()

    def reset(self):
        self.visited = [False] * len(self.demands)
        self.visited[0] = True
        self.current_node = 0
        self.remaining_capacity = self.capacity
        self.total_distance = 0
        return (self.current_node, self.remaining_capacity, tuple(self.visited))

    def get_valid_actions(self):
        valid = []
        for i in range(1, len(self.demands)):
            if (not self.visited[i]) and (self.demands[i] <= self.remaining_capacity):
                valid.append(i)
        # 若没有可行客户，则允许返回仓库
        if not valid:
            valid = [0]
        return valid

    def step(self, action):
        distance = self.dist[self.current_node][action]
        self.total_distance += distance
        reward = -distance

        # 到达仓库则重置容量
        if action == 0:
            self.remaining_capacity = self.capacity
        else:
            self.visited[action] = True
            self.remaining_capacity -= self.demands[action]

        self.current_node = action
        done = all(self.visited)
        next_state = (self.current_node, self.remaining_capacity, tuple(self.visited))
        return next_state, reward, done

# ====== 构造简单CVRP ======
dist_matrix = np.array([
    [0, 2, 9, 10, 7, 3],
    [2, 0, 8, 5, 6, 4],
    [9, 8, 0, 7, 3, 10],
    [10, 5, 7, 0, 4, 6],
    [7, 6, 3, 4, 0, 5],
    [3, 4, 10, 6, 5, 0],
])
demands = [0, 2, 3, 1, 2, 4]
capacity = 7

env = CVRPEnv(dist_matrix, demands, capacity)

# ====== Q-learning参数 ======
alpha = 0.1     # 学习率
gamma = 0.9     # 折扣因子
epsilon = 0.2   # 探索率
episodes = 2000

Q = {}

def get_Q(state, action):
    return Q.get((state, action), 0.0)

def set_Q(state, action, value):
    Q[(state, action)] = value

# ====== 训练过程 ======
for ep in range(episodes):
    state = env.reset()
    done = False
    while not done:
        valid_actions = env.get_valid_actions()
        print(valid_actions)
        # ε-greedy 策略
        if random.random() < epsilon:
            action = random.choice(valid_actions)
        else:
            q_values = [get_Q(state, a) for a in valid_actions]
            action = valid_actions[np.argmax(q_values)]

        next_state, reward, done = env.step(action)
        print(action,state,next_state,done)
        # 更新Q值
        next_valid_actions = env.get_valid_actions()
        max_next_q = max([get_Q(next_state, a) for a in next_valid_actions]) if next_valid_actions else 0
        old_q = get_Q(state, action)
        new_q = old_q + alpha * (reward + gamma * max_next_q - old_q)
        set_Q(state, action, new_q)

        state = next_state

    if (ep + 1) % 200 == 0:
        print(f"Episode {ep+1}/{episodes} 完成, 总行驶距离: {env.total_distance:.2f}")
# ====== 测试策略 ======
state = env.reset()
done = False
route = [0]
while not done:
    valid_actions = env.get_valid_actions()
    q_values = [get_Q(state, a) for a in valid_actions]
    action = valid_actions[np.argmax(q_values)]
    state, reward, done = env.step(action)
    route.append(action)

print("\n最优近似路径:", route)
print("总行驶距离:", env.total_distance)

2048 AI社区

有“AI”的1024 = 2048，欢迎大家加入2048 AI社区

更多推荐

小红书多工具集成模式实战：如何连接 CLI/MCP/API 构建统一工作流

/ 新增工具只需实现统一接口本文详细解析了多工具集成的架构设计与实现方案，涵盖 CLI、MCP、API 三种技术形态的集成模式，以及工具链编排、状态管理、错误处理和降级策略。这种设计使得系统能够灵活整合各种外部工具，构建高效的自动化工作流。

2048 AI社区

GitHub 热门项目日报 | 2026-06-12 ~ 2026-06-13

在代码的洪流中，每48小时都是一次技术的微型大爆炸。2026年6月的第二周，GitHub的热门榜单再次向我们揭示了开发者社区的脉搏：Python与JavaScript的“双雄对决”并未因时间推移而褪色，反而在AI应用落地与Web全栈进化的交汇点上，迸发出更为激烈的火花。本周，25个高热度项目不仅刷新了Star记录，更重新定义了效率与体验的边界。从智能代理的自动化突破到前端框架的性能极限压榨，我们看