【RL】Q-learning算法简单讲解
·
Q-learning描述
一种无模型的基于价值的强化学习,比如你被蒙眼放入一个迷宫汇总,目标是找到宝藏。只能通过摸索前行,每次撞墙就会扣分,没撞墙就奖励一下,找到宝藏就大幅奖励一下,经过无数次尝试和失败后,你的大脑就会逐渐形成一张“认知地图”(Q-Table)。 强化学习的入门 hello world
其核心问题是 在一个完全未知的环境中,智能体如何通过试错来学会选择最优行动,以达到最终目标。
核心概念
除了智能体、环境、动作、奖励、策略都是常规强化学习内容,需注意点是价值函数 使用动作-价值函数 Bellman Equation
一个状态-动作对的价值(Q值),等于它带来的即时奖励,加上后续所有可能状态的价值按概率的折扣总和。
最优Q值的贝尔曼方程如下:
Q∗(s,a)=E[R(s,a)+γmaxa′Q∗(s′,a′)] Q^{*}(s,a)=E[R(s,a)+\gamma max_{a^{'}}Q^{*}(s^{'},a^{'})] Q∗(s,a)=E[R(s,a)+γmaxa′Q∗(s′,a′)]
- s′是执行动作a后达到的新的状态s^{'}是执行动作a后达到的新的状态s′是执行动作a后达到的新的状态
- γ∈[0,1]\gamma \in [0,1]γ∈[0,1]是折扣因子决定对未来奖励的重视程度
- maxa′Q(s′,a′)max a^{'}Q(s^{'},a^{'})maxa′Q(s′,a′)表示新状态s′s^{'}s′ 下能获得的最大Q值
公式可调整为:
Q(st,at)<−Q(st,at)+α[r+γmaxat+1Q(st+1,at+1)−Q(st,at)] Q(s_t,a_t) <- Q(s_t,a_t)+\alpha [r+\gamma max_{a^{t+1}}Q(s_{t+1},a^{t+1})-Q(s_t,a_t)] Q(st,at)<−Q(st,at)+α[r+γmaxat+1Q(st+1,at+1)−Q(st,at)] - Q(st,at)Q(s_t,a_t)Q(st,at) 当前状态s和动作a的Q值
- α\alphaα 控制更新步长
- rrr 执行动作a后获取的即时奖励
- γ\gammaγ 折扣因子 表示未来奖励的重要性
- maxat+1Q(st+1,at+1)max_{a^{t+1}}Q(s_{t+1},a^{t+1})maxat+1Q(st+1,at+1) 在下一状态中,所有可能动作的最大Q值
- TD目标(Temporal Difference target) r+γmaxat+1Q(st+1,at+1)r+\gamma max_{a^{t+1}}Q(s_{t+1},a^{t+1})r+γmaxat+1Q(st+1,at+1)
- TD误差:TD目标减去当前Q值
例子
假设我们有一个简单的环境,有两个状态(状态0和状态1)和两个动作(动作0和动作1)。
初始值Q表都为0,我们设置学习率α=0.1\alpha=0.1α=0.1 折扣因子γ=0.9\gamma=0.9γ=0.9
| 状态 | 动作0 | 动作1 |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 0 | 0 |
step1:从状态0开始
- 当前状态s=0
- 假设我们选择到动作0
- 执行动作时 获得即时奖励 r=1r=1r=1 ,并转移到下一个状态s′=1s^{'}=1s′=1
- 更新Q(0,0)
– 当前Q(0,0)=0
– 计算TD目标:r+γmaxat+1Q(1,at+1)r+\gamma max_{a^{t+1}}Q(1,a^{t+1})r+γmaxat+1Q(1,at+1)
— 在状态1中,最大Q值:maxat+1Q(1,at+1)=max(0,0)=0max_{a^{t+1}}Q(1,a^{t+1})=max(0,0)=0maxat+1Q(1,at+1)=max(0,0)=0
— 所以TD目标=1+0.9∗1=11+0.9 * 1=11+0.9∗1=1
– TD误差 1-0=1
– 更新 Q(0,0)=0+0.1∗1=0.1Q(0,0)=0+0.1 * 1=0.1Q(0,0)=0+0.1∗1=0.1
step2: 依次计算
| 状态 | 动作0 | 动作1 |
|---|---|---|
| 0 | 0.1 | 0 |
| 1 | 0 | 0.209 |
python实现Q-learning求解CVRP问题
import numpy as np
import random
# ====== CVRP环境 ======
class CVRPEnv:
def __init__(self, dist_matrix, demands, capacity):
self.dist = dist_matrix
self.demands = demands
self.capacity = capacity
self.n_customers = len(demands) - 1 # 不含仓库
self.reset()
def reset(self):
self.visited = [False] * len(self.demands)
self.visited[0] = True
self.current_node = 0
self.remaining_capacity = self.capacity
self.total_distance = 0
return (self.current_node, self.remaining_capacity, tuple(self.visited))
def get_valid_actions(self):
valid = []
for i in range(1, len(self.demands)):
if (not self.visited[i]) and (self.demands[i] <= self.remaining_capacity):
valid.append(i)
# 若没有可行客户,则允许返回仓库
if not valid:
valid = [0]
return valid
def step(self, action):
distance = self.dist[self.current_node][action]
self.total_distance += distance
reward = -distance
# 到达仓库则重置容量
if action == 0:
self.remaining_capacity = self.capacity
else:
self.visited[action] = True
self.remaining_capacity -= self.demands[action]
self.current_node = action
done = all(self.visited)
next_state = (self.current_node, self.remaining_capacity, tuple(self.visited))
return next_state, reward, done
# ====== 构造简单CVRP ======
dist_matrix = np.array([
[0, 2, 9, 10, 7, 3],
[2, 0, 8, 5, 6, 4],
[9, 8, 0, 7, 3, 10],
[10, 5, 7, 0, 4, 6],
[7, 6, 3, 4, 0, 5],
[3, 4, 10, 6, 5, 0],
])
demands = [0, 2, 3, 1, 2, 4]
capacity = 7
env = CVRPEnv(dist_matrix, demands, capacity)
# ====== Q-learning参数 ======
alpha = 0.1 # 学习率
gamma = 0.9 # 折扣因子
epsilon = 0.2 # 探索率
episodes = 2000
Q = {}
def get_Q(state, action):
return Q.get((state, action), 0.0)
def set_Q(state, action, value):
Q[(state, action)] = value
# ====== 训练过程 ======
for ep in range(episodes):
state = env.reset()
done = False
while not done:
valid_actions = env.get_valid_actions()
print(valid_actions)
# ε-greedy 策略
if random.random() < epsilon:
action = random.choice(valid_actions)
else:
q_values = [get_Q(state, a) for a in valid_actions]
action = valid_actions[np.argmax(q_values)]
next_state, reward, done = env.step(action)
print(action,state,next_state,done)
# 更新Q值
next_valid_actions = env.get_valid_actions()
max_next_q = max([get_Q(next_state, a) for a in next_valid_actions]) if next_valid_actions else 0
old_q = get_Q(state, action)
new_q = old_q + alpha * (reward + gamma * max_next_q - old_q)
set_Q(state, action, new_q)
state = next_state
if (ep + 1) % 200 == 0:
print(f"Episode {ep+1}/{episodes} 完成, 总行驶距离: {env.total_distance:.2f}")
# ====== 测试策略 ======
state = env.reset()
done = False
route = [0]
while not done:
valid_actions = env.get_valid_actions()
q_values = [get_Q(state, a) for a in valid_actions]
action = valid_actions[np.argmax(q_values)]
state, reward, done = env.step(action)
route.append(action)
print("\n最优近似路径:", route)
print("总行驶距离:", env.total_distance)
更多推荐


所有评论(0)