第6章 强化学习在自动驾驶决策规划中的关键作用
摘要:本章探讨强化学习在自动驾驶决策规划中的应用。首先介绍了强化学习的基本概念与原理,包括智能体、环境、状态、动作等关键要素,以及马尔可夫决策过程和Q-learning算法,并通过Python代码示例展示了Q-learning在简单交通场景中的实现。随后分析了深度强化学习的主流算法,特别是DQN及其关键技术(经验回放和目标网络),并提供了基于PyTorch的车道保持任务实现代码。接着讨论了基于PPO的超车决策方法,以及结合ROS的自动驾驶决策系统架构与实现。这些方法为自动驾驶系统在复杂交通环境中的决策规划提供了理论基础与工程参考。
6.1 强化学习基础概念与原理
强化学习是机器学习的一个重要分支,其核心思想是智能体通过与环境交互来学习最优行为策略。在自动驾驶领域,强化学习能够帮助车辆在复杂交通环境中做出合理的决策。
强化学习的基本框架包含以下几个关键要素:
- 智能体(Agent):学习并做出决策的主体,在自动驾驶中即为车辆控制系统
- 环境(Environment):智能体交互的外部世界,包括道路、其他车辆、行人等
- 状态(State):环境在特定时刻的描述
- 动作(Action):智能体可以执行的操作
- 奖励(Reward):环境对智能体动作的反馈信号
- 策略(Policy):从状态到动作的映射关系
强化学习的数学基础是马尔可夫决策过程(MDP),其目标是通过最大化累积奖励来学习最优策略。值函数包括状态值函数V(s)和动作值函数Q(s,a),后者表示在状态s下执行动作a后能获得的期望累积奖励。
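用公式表示,动作值函数的定义与Q-learning的时序差分更新分别为(α为学习率,γ为折扣因子,s'为执行动作a后到达的下一状态):

Q^{\pi}(s,a) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t+1} \,\middle|\, s_{0}=s,\ a_{0}=a\right]

Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]

后一个更新式正对应下文代码中 QLearningAgent.learn 的实现逻辑。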
代码实例:以下是一个使用Python实现的简单Q-learning算法示例,模拟自动驾驶车辆在交叉路口的选择行为。
import numpy as np
import random

class TrafficEnvironment:
    def __init__(self):
        self.states = ['approach', 'intersection', 'clear']
        self.actions = ['slow_down', 'maintain_speed', 'accelerate']
        self.current_state = 'approach'
        self.rewards = {
            'approach': {'slow_down': -1, 'maintain_speed': 2, 'accelerate': -3},
            'intersection': {'slow_down': 3, 'maintain_speed': -2, 'accelerate': -5},
            'clear': {'slow_down': -1, 'maintain_speed': 1, 'accelerate': 2}
        }
        self.transitions = {
            'approach': {'slow_down': 'intersection', 'maintain_speed': 'intersection', 'accelerate': 'intersection'},
            'intersection': {'slow_down': 'clear', 'maintain_speed': 'clear', 'accelerate': 'clear'},
            'clear': {'slow_down': 'approach', 'maintain_speed': 'approach', 'accelerate': 'approach'}
        }

    def reset(self):
        self.current_state = 'approach'
        return self.current_state

    def step(self, action):
        reward = self.rewards[self.current_state][action]
        self.current_state = self.transitions[self.current_state][action]
        return self.current_state, reward, self.current_state == 'clear'

class QLearningAgent:
    def __init__(self, states, actions, learning_rate=0.1, discount_factor=0.9, exploration_rate=0.1):
        self.q_table = {state: {action: 0 for action in actions} for state in states}
        self.lr = learning_rate
        self.gamma = discount_factor
        self.epsilon = exploration_rate
        self.actions = actions

    def choose_action(self, state):
        if random.uniform(0, 1) < self.epsilon:
            return random.choice(self.actions)
        else:
            return max(self.q_table[state].items(), key=lambda x: x[1])[0]

    def learn(self, state, action, reward, next_state):
        current_q = self.q_table[state][action]
        max_next_q = max(self.q_table[next_state].values())
        new_q = current_q + self.lr * (reward + self.gamma * max_next_q - current_q)
        self.q_table[state][action] = new_q

# 训练过程
env = TrafficEnvironment()
agent = QLearningAgent(env.states, env.actions)

for episode in range(100):
    state = env.reset()
    total_reward = 0
    done = False
    while not done:
        action = agent.choose_action(state)
        next_state, reward, done = env.step(action)
        agent.learn(state, action, reward, next_state)
        state = next_state
        total_reward += reward
    if episode % 20 == 0:
        print(f"Episode {episode}, Total Reward: {total_reward}")

# 测试训练结果
print("\n训练后的Q表:")
for state in env.states:
    print(f"{state}: {agent.q_table[state]}")
这个示例展示了强化学习在简单交通场景中的应用,车辆通过学习在不同状态下选择最优的速度控制策略。
6.2 主流强化学习算法原理与实现
深度强化学习将深度神经网络与强化学习相结合,能够处理高维状态空间问题。在自动驾驶中,常用的算法包括DQN、DDPG、PPO等。
深度Q网络(DQN) 使用神经网络来近似Q函数,解决了传统Q-learning在高维状态空间中的维度灾难问题。DQN的关键技术包括经验回放和目标网络,前者通过存储和重复利用经验来提高数据效率,后者通过固定目标Q网络来稳定训练过程。
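目标网络的作用可以从DQN的损失函数中看出。训练时最小化的目标为(θ为在线网络参数,θ⁻为定期同步的目标网络参数,D为经验回放缓冲区):

L(\theta) = \mathbb{E}_{(s,a,r,s') \sim D}\left[\left( r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta) \right)^{2}\right]

由于目标值由参数在一段时间内保持不变的θ⁻计算,训练目标不会随每一步参数更新而剧烈波动,从而缓解了训练不稳定的问题。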
代码实例:以下是一个使用PyTorch实现的DQN算法,用于自动驾驶中的车道保持任务。
import torch
import torch.nn as nn
import torch.optim as optim
import random
import numpy as np
from collections import deque

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super(DQN, self).__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, action_dim)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)

class LaneKeepingEnvironment:
    def __init__(self):
        self.state_dim = 4   # [横向位置, 横向速度, 航向角, 航向角速度]
        self.action_dim = 3  # [左转, 保持, 右转]
        self.reset()

    def reset(self):
        self.lateral_position = random.uniform(-1, 1)
        self.lateral_velocity = 0
        self.heading_angle = random.uniform(-0.5, 0.5)
        self.heading_velocity = 0
        return self.get_state()

    def get_state(self):
        return np.array([self.lateral_position, self.lateral_velocity,
                         self.heading_angle, self.heading_velocity], dtype=np.float32)

    def step(self, action):
        # 模拟简化的车辆横向动力学
        steering_map = {-1: -0.1, 0: 0, 1: 0.1}
        steering = steering_map[action - 1]  # 将动作0,1,2映射到-1,0,1
        # 更新状态
        self.heading_velocity = steering
        self.heading_angle += self.heading_velocity
        self.lateral_velocity = np.sin(self.heading_angle)
        self.lateral_position += self.lateral_velocity
        # 计算奖励:偏离车道中心与航向偏差越小,奖励越高
        reward = -abs(self.lateral_position) - 0.1 * abs(self.heading_angle)
        # 检查是否结束
        done = abs(self.lateral_position) > 2 or abs(self.heading_angle) > 1
        return self.get_state(), reward, done, {}

class DQNAgent:
    def __init__(self, state_dim, action_dim):
        self.state_dim = state_dim
        self.action_dim = action_dim
        self.memory = deque(maxlen=10000)  # 经验回放缓冲区
        self.batch_size = 32
        self.gamma = 0.99
        self.epsilon = 1.0
        self.epsilon_min = 0.01
        self.epsilon_decay = 0.995
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model = DQN(state_dim, action_dim).to(self.device)
        self.target_model = DQN(state_dim, action_dim).to(self.device)  # 目标网络
        self.optimizer = optim.Adam(self.model.parameters(), lr=0.001)
        self.update_target_model()

    def update_target_model(self):
        self.target_model.load_state_dict(self.model.state_dict())

    def remember(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def act(self, state):
        # epsilon-greedy:以epsilon的概率随机探索,否则选取Q值最大的动作
        if np.random.rand() <= self.epsilon:
            return random.randrange(self.action_dim)
        state_tensor = torch.FloatTensor(state).unsqueeze(0).to(self.device)
        with torch.no_grad():
            q_values = self.model(state_tensor)
        return int(q_values.argmax(dim=1).item())

    def replay(self):
        if len(self.memory) < self.batch_size:
            return
        batch = random.sample(self.memory, self.batch_size)
        # 先转为numpy数组再构造张量,避免逐个拼接带来的性能问题
        states = torch.FloatTensor(np.array([e[0] for e in batch])).to(self.device)
        actions = torch.LongTensor([e[1] for e in batch]).to(self.device)
        rewards = torch.FloatTensor([e[2] for e in batch]).to(self.device)
        next_states = torch.FloatTensor(np.array([e[3] for e in batch])).to(self.device)
        dones = torch.BoolTensor([e[4] for e in batch]).to(self.device)
        # 当前Q值与由目标网络计算的目标Q值
        current_q_values = self.model(states).gather(1, actions.unsqueeze(1))
        next_q_values = self.target_model(next_states).max(1)[0].detach()
        target_q_values = rewards + (self.gamma * next_q_values * ~dones)
        loss = nn.MSELoss()(current_q_values.squeeze(), target_q_values)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        if self.epsilon > self.epsilon_min:
            self.epsilon *= self.epsilon_decay

# 训练循环
env = LaneKeepingEnvironment()
agent = DQNAgent(env.state_dim, env.action_dim)

for episode in range(1000):
    state = env.reset()
    total_reward = 0
    done = False
    step_count = 0
    while not done and step_count < 200:  # 限制单回合最大步数,避免回合无限延长
        action = agent.act(state)
        next_state, reward, done, _ = env.step(action)
        agent.remember(state, action, reward, next_state, done)
        state = next_state
        total_reward += reward
        step_count += 1
        agent.replay()
    if episode % 100 == 0:
        agent.update_target_model()
        print(f"Episode: {episode}, Total Reward: {total_reward:.2f}, Epsilon: {agent.epsilon:.3f}")
在ROS环境中,可以将强化学习算法集成到自动驾驶系统中。以下是一个C++ ROS节点的示例框架,用于实现基于强化学习的决策模块。
#include <ros/ros.h>
#include <std_msgs/Float32MultiArray.h>
#include <sensor_msgs/Image.h>
#include <nav_msgs/Path.h>
#include <geometry_msgs/Twist.h>
#include <torch/script.h>  // 使用torch::jit加载TorchScript模型

class RLDecisionNode {
private:
    ros::NodeHandle nh;
    ros::Subscriber state_sub;
    ros::Publisher cmd_pub;
    // 强化学习模型(TorchScript格式)
    torch::jit::script::Module rl_model;

public:
    RLDecisionNode() {
        state_sub = nh.subscribe("/vehicle_state", 1, &RLDecisionNode::stateCallback, this);
        cmd_pub = nh.advertise<geometry_msgs::Twist>("/cmd_vel", 1);
        // 加载预训练的PyTorch模型
        try {
            rl_model = torch::jit::load("/path/to/trained_model.pt");
        } catch (const c10::Error& e) {
            ROS_ERROR("无法加载模型: %s", e.what());
        }
    }

    void stateCallback(const std_msgs::Float32MultiArray::ConstPtr& msg) {
        // 将状态数据转换为Tensor
        std::vector<float> state_data = msg->data;
        auto options = torch::TensorOptions().dtype(torch::kFloat32);
        torch::Tensor state_tensor = torch::from_blob(state_data.data(),
                                                      {1, static_cast<int>(state_data.size())},
                                                      options);
        // 使用模型进行推理
        std::vector<torch::jit::IValue> inputs;
        inputs.push_back(state_tensor);
        auto output = rl_model.forward(inputs).toTensor();
        // 解析输出并发布控制命令
        geometry_msgs::Twist cmd;
        auto action = output.argmax(1).item<int>();
        // 根据动作生成控制命令
        switch (action) {
            case 0:  // 左转
                cmd.angular.z = 0.5;
                cmd.linear.x = 1.0;
                break;
            case 1:  // 直行
                cmd.angular.z = 0.0;
                cmd.linear.x = 1.5;
                break;
            case 2:  // 右转
                cmd.angular.z = -0.5;
                cmd.linear.x = 1.0;
                break;
        }
        cmd_pub.publish(cmd);
    }
};

int main(int argc, char** argv) {
    ros::init(argc, argv, "rl_decision_node");
    RLDecisionNode node;
    ros::spin();
    return 0;
}
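上面的C++节点假设模型已经导出为TorchScript格式(.pt文件)。以6.2节训练得到的DQN网络为例,一种可能的导出方式如下(示例草图:其中agent.model、输入维度4和保存路径均沿用前文示例,仅作演示):

import torch

# 将训练好的策略网络导出为TorchScript,供C++端torch::jit::load加载(示例草图)
# 假设agent.model为6.2节训练得到的DQN网络,状态维度为4
agent.model.eval()
example_input = torch.zeros(1, 4).to(agent.device)
traced_model = torch.jit.trace(agent.model, example_input)
traced_model.save("/path/to/trained_model.pt")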
6.3 基于强化学习的自动驾驶决策方法
强化学习在自动驾驶决策层面临的主要挑战包括连续状态空间、多智能体交互和安全约束等问题。为了解决这些问题,研究人员提出了多种改进算法。
近端策略优化(PPO) 是一种基于策略梯度的算法,通过限制策略更新的幅度来保证训练的稳定性。PPO特别适合处理连续动作空间问题,如车辆的速度和转向控制。
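PPO的核心是截断(clipped)的代理目标函数(其中r_t(θ)为新旧策略的概率比,\hat{A}_t为优势估计,ε为截断系数,常取0.1~0.2):

L^{CLIP}(\theta) = \mathbb{E}_{t}\left[\min\left( r_{t}(\theta)\hat{A}_{t},\ \mathrm{clip}\left(r_{t}(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_{t} \right)\right], \quad r_{t}(\theta) = \frac{\pi_{\theta}(a_{t}\mid s_{t})}{\pi_{\theta_{\mathrm{old}}}(a_{t}\mid s_{t})}

通过对概率比进行截断,单次更新不会使策略偏离旧策略过远,这正是PPO训练稳定的来源。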
代码实例:以下是一个使用Stable Baselines3库实现的PPO算法,用于自动驾驶中的超车决策任务。
import gym
from gym import spaces
import numpy as np
from stable_baselines3 import PPO
from stable_baselines3.common.env_checker import check_env

class OvertakingEnvironment(gym.Env):
    def __init__(self):
        super(OvertakingEnvironment, self).__init__()
        # 动作空间:[-1, 1]表示转向,[0, 1]表示油门,[0, 1]表示刹车
        self.action_space = spaces.Box(
            low=np.array([-1, 0, 0]),
            high=np.array([1, 1, 1]),
            dtype=np.float32
        )
        # 状态空间:自车横向位置、自车速度、前车相对距离、前车速度、对向车辆相对距离、对向车辆速度
        # 边界需覆盖下面仿真中实际出现的取值范围
        self.observation_space = spaces.Box(
            low=np.array([-10, 0, -50, 0, -100, 0]),
            high=np.array([10, 30, 100, 30, 100, 30]),
            dtype=np.float32
        )
        self.reset()

    def reset(self):
        # 初始化状态
        self.ego_position = 0.0                 # 自车横向位置
        self.ego_speed = 15.0                   # 自车速度 m/s
        self.front_vehicle_position = 20.0      # 前车距离
        self.front_vehicle_speed = 10.0         # 前车速度
        self.opposite_vehicle_position = -50.0  # 对向车辆距离
        self.opposite_vehicle_speed = 15.0      # 对向车辆速度
        return self._get_obs()

    def _get_obs(self):
        return np.array([
            self.ego_position,
            self.ego_speed,
            self.front_vehicle_position,
            self.front_vehicle_speed,
            self.opposite_vehicle_position,
            self.opposite_vehicle_speed
        ], dtype=np.float32)

    def step(self, action):
        steering, throttle, brake = action
        # 更新自车状态
        self.ego_position += steering * 0.1
        self.ego_speed += (throttle - brake) * 2.0
        self.ego_speed = np.clip(self.ego_speed, 0, 30)
        # 更新其他车辆状态
        self.front_vehicle_position -= (self.ego_speed - self.front_vehicle_speed) * 0.1
        self.opposite_vehicle_position += self.opposite_vehicle_speed * 0.1
        # 计算奖励
        reward = 0
        # 进度奖励
        reward += self.ego_speed * 0.1
        # 安全惩罚
        if abs(self.ego_position) > 3.5:  # 偏离车道
            reward -= 10
        if self.front_vehicle_position < 5:  # 跟车太近
            reward -= 20
        if (self.opposite_vehicle_position > -10 and
                self.ego_position < -1 and
                self.front_vehicle_position < 15):  # 危险超车
            reward -= 50
        # 成功超车奖励
        if self.front_vehicle_position < -5 and self.ego_position > -1:
            reward += 100
        # 检查是否结束
        done = False
        if (self.front_vehicle_position < 2 or
                (abs(self.opposite_vehicle_position) < 5 and abs(self.ego_position) > 2) or
                abs(self.ego_position) > 4):
            done = True
            reward -= 100
        # 如果车辆超出视野则重置场景
        if self.front_vehicle_position < -20:
            self.front_vehicle_position = 50
            self.front_vehicle_speed = np.random.uniform(8, 12)
        if self.opposite_vehicle_position > 50:
            self.opposite_vehicle_position = -50
            self.opposite_vehicle_speed = np.random.uniform(12, 18)
        return self._get_obs(), reward, done, {}

# 创建环境和训练模型
env = OvertakingEnvironment()
check_env(env)  # 检查环境是否符合gym标准
model = PPO("MlpPolicy", env, verbose=1, learning_rate=0.0003, n_steps=2048)
model.learn(total_timesteps=100000)
model.save("ppo_overtaking")

# 测试训练好的模型
obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
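训练好的模型也可以重新加载,并借助Stable Baselines3自带的评估工具得到平均回报(示例写法,评估回合数可按需调整):

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

# 加载之前保存的模型并评估平均回报(示例写法)
model = PPO.load("ppo_overtaking")
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=10)
print(f"平均回报: {mean_reward:.2f} +/- {std_reward:.2f}")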
6.4 自动驾驶决策系统架构与实现
自动驾驶决策系统需要综合考虑感知、预测、规划和控制等多个模块。基于强化学习的决策系统通常采用分层架构,将高层决策与底层控制分离。
在真实自动驾驶系统中,决策模块需要处理以下关键问题:
- 行为决策:跟车、换道、超车、路口通行等
- 运动规划:生成平滑、安全的轨迹
- 风险评估:预测其他交通参与者的行为并评估风险
代码实例:以下是一个基于ROS的自动驾驶决策系统框架,集成强化学习决策模块。
#!/usr/bin/env python3
import rospy
from std_msgs.msg import Float32MultiArray
from geometry_msgs.msg import Twist, PoseStamped
from nav_msgs.msg import Path
import numpy as np
import torch

class AutonomousDecisionSystem:
    def __init__(self):
        rospy.init_node('autonomous_decision_system')
        # 订阅者
        self.perception_sub = rospy.Subscriber('/perception/objects', Float32MultiArray,
                                               self.perception_callback)
        self.localization_sub = rospy.Subscriber('/localization/pose', PoseStamped,
                                                 self.localization_callback)
        self.global_plan_sub = rospy.Subscriber('/global_plan', Path,
                                                self.global_plan_callback)
        # 发布者
        self.cmd_pub = rospy.Publisher('/cmd_vel', Twist, queue_size=1)
        self.local_plan_pub = rospy.Publisher('/local_plan', Path, queue_size=1)
        # 强化学习模型
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.rl_policy = None
        try:
            self.rl_policy = torch.jit.load('/path/to/rl_policy.pt')
            self.rl_policy.to(self.device)
        except Exception as e:
            rospy.logerr(f"无法加载强化学习策略: {e}")
        # 系统状态
        self.ego_state = None
        self.objects = []
        self.global_path = None
        self.current_goal = None
        # 控制参数
        self.control_rate = rospy.Rate(10)  # 10Hz

    def perception_callback(self, msg):
        # 处理感知数据
        self.objects = self.parse_objects(msg.data)

    def localization_callback(self, msg):
        # 更新自车状态
        pose = msg.pose
        self.ego_state = {
            'x': pose.position.x,
            'y': pose.position.y,
            'yaw': self.quaternion_to_yaw(pose.orientation),
            'v': 0.0,  # 需要从其他话题获取
            'a': 0.0   # 需要从其他话题获取
        }

    def global_plan_callback(self, msg):
        # 更新全局路径
        self.global_path = msg.poses
        if self.global_path:
            self.current_goal = self.global_path[0]

    def quaternion_to_yaw(self, quat):
        # 四元数转偏航角
        x, y, z, w = quat.x, quat.y, quat.z, quat.w
        siny_cosp = 2 * (w * z + x * y)
        cosy_cosp = 1 - 2 * (y * y + z * z)
        return np.arctan2(siny_cosp, cosy_cosp)

    def parse_objects(self, data):
        # 解析感知数据
        objects = []
        # 假设数据格式为 [id, x, y, vx, vy, ...] 的数组
        return objects

    def get_state_representation(self):
        # 将当前状态转换为强化学习模型输入
        if not self.ego_state or not self.global_path:
            return None
        state_features = []
        # 自车状态
        state_features.extend([self.ego_state['x'], self.ego_state['y'],
                               self.ego_state['yaw'], self.ego_state['v']])
        # 目标状态
        if self.current_goal:
            goal_x = self.current_goal.pose.position.x
            goal_y = self.current_goal.pose.position.y
            dx = goal_x - self.ego_state['x']
            dy = goal_y - self.ego_state['y']
            state_features.extend([dx, dy])
        else:
            state_features.extend([0, 0])
        # 障碍物信息(取最近的几个障碍物)
        for obj in self.objects[:5]:  # 最多5个障碍物
            state_features.extend([obj['x'], obj['y'], obj['vx'], obj['vy']])
        # 填充到固定长度
        while len(state_features) < 50:  # 假设状态维度为50
            state_features.append(0.0)
        return torch.FloatTensor(state_features[:50]).unsqueeze(0).to(self.device)

    def rl_decision(self, state):
        # 使用强化学习模型进行决策
        with torch.no_grad():
            action = self.rl_policy(state)
        return action.cpu().numpy()[0]

    def safety_check(self, action):
        # 安全性检查
        steering, acceleration = action[0], action[1]
        if abs(steering) > 0.8:  # 转向过大
            steering = np.clip(steering, -0.8, 0.8)
        if acceleration > 3.0:  # 加速度过大
            acceleration = 3.0
        elif acceleration < -5.0:  # 减速度过大
            acceleration = -5.0
        return np.array([steering, acceleration])

    def run(self):
        while not rospy.is_shutdown():
            # 仅在模型加载成功且定位和全局路径可用时进行决策
            if self.rl_policy is not None and self.ego_state and self.global_path:
                # 获取状态表示
                state = self.get_state_representation()
                if state is not None:
                    # 强化学习决策
                    raw_action = self.rl_decision(state)
                    # 安全性检查
                    safe_action = self.safety_check(raw_action)
                    # 发布控制命令
                    cmd_msg = Twist()
                    cmd_msg.linear.x = safe_action[1]   # 加速度
                    cmd_msg.angular.z = safe_action[0]  # 转向
                    self.cmd_pub.publish(cmd_msg)
            self.control_rate.sleep()

if __name__ == '__main__':
    try:
        ads = AutonomousDecisionSystem()
        ads.run()
    except rospy.ROSInterruptException:
        pass
对应的C++ ROS节点实现:
#include <ros/ros.h>
#include <torch/script.h>
#include <torch/torch.h>
#include <geometry_msgs/Twist.h>
#include <geometry_msgs/PoseStamped.h>
#include <nav_msgs/Path.h>
#include <std_msgs/Float32MultiArray.h>

class RLDecisionNode {
public:
    RLDecisionNode() : nh_("~") {
        // 初始化订阅者和发布者
        cmd_pub_ = nh_.advertise<geometry_msgs::Twist>("/cmd_vel", 1);
        perception_sub_ = nh_.subscribe("/perception/objects", 1,
                                        &RLDecisionNode::perceptionCallback, this);
        // 加载PyTorch模型
        try {
            module_ = torch::jit::load("/path/to/model.pt");
        } catch (const c10::Error& e) {
            ROS_ERROR("Failed to load model: %s", e.what());
        }
        // 初始化定时器
        decision_timer_ = nh_.createTimer(ros::Duration(0.1),
                                          &RLDecisionNode::decisionCallback, this);
    }

private:
    ros::NodeHandle nh_;
    ros::Publisher cmd_pub_;
    ros::Subscriber perception_sub_;
    ros::Timer decision_timer_;
    torch::jit::script::Module module_;
    std::vector<float> current_state_;

    void perceptionCallback(const std_msgs::Float32MultiArray::ConstPtr& msg) {
        // 更新当前状态
        current_state_ = msg->data;
    }

    void decisionCallback(const ros::TimerEvent& event) {
        if (current_state_.empty()) return;
        // 准备输入张量
        auto options = torch::TensorOptions().dtype(torch::kFloat32);
        torch::Tensor state_tensor = torch::from_blob(current_state_.data(),
                                                      {1, static_cast<int>(current_state_.size())},
                                                      options);
        // 前向传播
        std::vector<torch::jit::IValue> inputs;
        inputs.push_back(state_tensor);
        auto output = module_.forward(inputs).toTensor();
        // 发布控制命令
        geometry_msgs::Twist cmd;
        auto action_data = output.accessor<float, 2>();
        cmd.linear.x = action_data[0][0];   // 加速度
        cmd.angular.z = action_data[0][1];  // 转向
        cmd_pub_.publish(cmd);
    }
};

int main(int argc, char** argv) {
    ros::init(argc, argv, "rl_decision_node");
    RLDecisionNode node;
    ros::spin();
    return 0;
}
6.5 参考文献与扩展阅读
- Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
- Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529-533.
- Lillicrap, T. P., et al. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- Schulman, J., et al. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- ROS Wiki: http://wiki.ros.org/
- PyTorch Documentation: https://pytorch.org/docs/stable/
- Stable Baselines3 Documentation: https://stable-baselines3.readthedocs.io/
这些文献和资源为深入理解强化学习在自动驾驶中的应用提供了理论基础和实践指导。通过结合理论学习和代码实践,可以更好地掌握强化学习在复杂自动驾驶场景中的决策能力。