1. DQN (Deep Q-Network)

Core Concept

DQN, published by DeepMind in 2015, combines deep learning with Q-learning and was the first algorithm to reach human-level performance on Atari games.

Basic Principle

# The core Q-learning update rule
Q(s,a) = Q(s,a) + α[r + γ·max_a' Q(s',a') - Q(s,a)]
#        current Q value   reward   discounted max future Q value   (the bracketed term is the TD error)
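
As a concrete illustration of this update, here is a minimal tabular Q-learning step in Python; the state/action counts and hyperparameter values are placeholders chosen for the example, not taken from any particular task.

import numpy as np

n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))   # the Q-table: one value per (state, action) pair

def q_update(s, a, r, s_next):
    # TD target: immediate reward plus the discounted best future value
    td_target = r + gamma * Q[s_next].max()
    td_error = td_target - Q[s, a]    # the bracketed term in the formula above
    Q[s, a] += alpha * td_error

q_update(s=0, a=1, r=1.0, s_next=3)   # one example transition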

Network Architecture

import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)  # one Q-value per action
        )
    
    def forward(self, state):
        return self.network(state)  # Q(s,a1), Q(s,a2), ..., Q(s,an)
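
A quick usage sketch (the dimensions and the random input below are placeholders): the network returns one Q-value per action, and the greedy action is simply the index of the largest one.

import torch

q_net = DQN(state_dim=4, action_dim=2)              # e.g. a CartPole-sized problem (illustrative)
state = torch.randn(1, 4)                           # a dummy observation batch of size 1
greedy_action = q_net(state).argmax(dim=1).item()   # index of the highest Q-value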

Two Key Innovations

  1. Experience Replay (a minimal buffer implementation is sketched below)

    # Store a transition
    replay_buffer.store(state, action, reward, next_state, done)
    
    # Sample random minibatches; this breaks the correlation between consecutive samples
    batch = replay_buffer.sample(batch_size=32)
    
  2. Target Network

    # Two networks: an online (evaluation) network and a target network
    q_eval = DQN(state_dim, action_dim)      # updated every training step
    q_target = DQN(state_dim, action_dim)    # synchronized periodically (e.g. every 1000 steps)
    
    # Computing the TD target with the target network makes training more stable
    target = reward + gamma * q_target(next_state).max()
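
A minimal replay buffer sketch that would back the replay_buffer / memory calls used in this section; treating transitions as plain tuples in a deque is an assumption of this sketch, not a detail from the original DQN paper.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform random sampling breaks the correlation between consecutive steps
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)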
    

Training Procedure

def train_dqn():
    step = 0
    for episode in range(num_episodes):
        state = env.reset()
        done = False
        
        while not done:
            # ε-greedy action selection
            if random.random() < epsilon:
                action = random.choice(action_space)
            else:
                action = q_eval(state).argmax().item()
            
            # Execute the action
            next_state, reward, done = env.step(action)
            
            # Store the transition
            memory.push(state, action, reward, next_state, done)
            state = next_state
            step += 1
            
            # Sample a minibatch from the replay buffer and learn
            if len(memory) > batch_size:
                batch = memory.sample(batch_size)
                
                # Q-values of the actions that were actually taken
                q_values = q_eval(batch.states).gather(1, batch.actions)
                
                # TD target uses the target network; no gradient flows through it
                with torch.no_grad():
                    next_q_values = q_target(batch.next_states).max(dim=1, keepdim=True)[0]
                    target_q = batch.rewards + gamma * (1 - batch.dones) * next_q_values
                
                loss = F.mse_loss(q_values, target_q)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            
            # Periodically synchronize the target network with the online network
            if step % target_update_interval == 0:
                q_target.load_state_dict(q_eval.state_dict())
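
The loop above treats epsilon as a constant; in practice it is usually annealed from a high value toward a small one so that exploration decreases over training. A simple linear schedule sketch (the start, end, and decay-horizon values are illustrative):

eps_start, eps_end, eps_decay_steps = 1.0, 0.05, 50_000

def epsilon_at(step):
    # linear anneal from eps_start down to eps_end, then hold constant
    frac = min(step / eps_decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)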

Pros and Cons

Pros:

  • Handles high-dimensional inputs (e.g. images)
  • Good sample efficiency (past experience can be reused)
  • Relatively simple and easy to implement

Cons:

  • Limited to discrete action spaces
  • Prone to overestimating Q-values
  • Sensitive to hyperparameters

2. DDPG (Deep Deterministic Policy Gradient)

Core Concept

DDPG extends the ideas behind DQN to continuous action spaces and is often described as "DQN for continuous action spaces".

Algorithm Characteristics

  • Actor-Critic architecture: the Actor produces actions and the Critic evaluates them
  • Deterministic policy: given a state, it outputs a single concrete action rather than a probability distribution (the corresponding policy gradient is written out below)
  • Off-policy: it can learn from data collected by any behavior policy
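
For reference, the deterministic policy gradient behind the Actor update can be written in the same comment style as the Q-learning formula above:

# Deterministic policy gradient (the Actor update)
∇_θ J ≈ E_s[ ∇_a Q(s, a)|_{a=μ_θ(s)} · ∇_θ μ_θ(s) ]
# In code this becomes actor_loss = -critic(s, actor(s)).mean(), backpropagated through both networks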

Network Architecture

import copy
import torch.nn as nn

class DDPG:
    def __init__(self, state_dim, action_dim):
        # Actor network: state → action
        self.actor = nn.Sequential(
            nn.Linear(state_dim, 400),
            nn.ReLU(),
            nn.Linear(400, 300),
            nn.ReLU(),
            nn.Linear(300, action_dim),
            nn.Tanh()  # squashes the output into [-1, 1]
        )
        
        # Critic network: (state, action) → Q-value
        self.critic = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400),
            nn.ReLU(),
            nn.Linear(400, 300),
            nn.ReLU(),
            nn.Linear(300, 1)  # a single Q-value
        )
        
        # Target networks (updated via soft updates)
        self.actor_target = copy.deepcopy(self.actor)
        self.critic_target = copy.deepcopy(self.critic)
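
Because the Critic above is a plain nn.Sequential over a concatenated input, calling it means joining the state and action along the feature dimension. A small usage sketch with made-up dimensions:

import torch

agent = DDPG(state_dim=3, action_dim=1)
states = torch.randn(32, 3)                                    # a dummy batch of states
actions = agent.actor(states)                                  # deterministic actions in [-1, 1]
q_values = agent.critic(torch.cat([states, actions], dim=-1))  # shape: (32, 1)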

Key Techniques

  1. Ornstein-Uhlenbeck noise (exploration mechanism)

    import numpy as np

    class OUNoise:
        def __init__(self, action_dim, theta=0.15, sigma=0.2):
            self.theta = theta
            self.sigma = sigma
            self.state = np.zeros(action_dim)
        
        def sample(self):
            # mean-reverting step: drift back toward zero plus independent Gaussian noise per dimension
            dx = self.theta * (-self.state) + self.sigma * np.random.randn(*self.state.shape)
            self.state += dx
            return self.state
    
    # Explore by adding the noise to the deterministic action
    action = actor(state) + ou_noise.sample()
    
  2. Soft Update

    # Blend the online network into the target network slowly; tau is typically 0.001
    def soft_update(target, source, tau):
        for target_param, param in zip(target.parameters(), source.parameters()):
            target_param.data.copy_(tau * param.data + (1-tau) * target_param.data)
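
One practical detail the snippets above gloss over: adding exploration noise can push the action outside the Tanh actor's [-1, 1] range, so it is commonly clipped before being sent to the environment. A small sketch reusing the OUNoise class and the agent object from the earlier examples:

import numpy as np
import torch

ou_noise = OUNoise(action_dim=1)
state = torch.randn(1, 3)                                  # dummy observation; dimensions are illustrative
action = agent.actor(state).detach().numpy()[0]            # deterministic action from the Actor
action = np.clip(action + ou_noise.sample(), -1.0, 1.0)    # keep the noisy action inside the valid range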
    

Training Algorithm

def train_ddpg():
    for episode in range(num_episodes):
        state = env.reset()
        done = False
        
        while not done:
            # Actor picks an action, plus exploration noise
            action = actor(state) + noise.sample()
            next_state, reward, done = env.step(action)
            
            # Store the transition in the replay buffer
            replay_buffer.push(state, action, reward, next_state, done)
            state = next_state
            
            # Sample a minibatch and train
            if len(replay_buffer) > batch_size:
                batch = replay_buffer.sample(batch_size)
                
                # Update the Critic (the TD target uses the target networks)
                with torch.no_grad():
                    next_actions = actor_target(batch.next_states)
                    target_q = batch.rewards + gamma * (1 - batch.dones) * critic_target(batch.next_states, next_actions)
                current_q = critic(batch.states, batch.actions)
                critic_loss = F.mse_loss(current_q, target_q)
                critic_optimizer.zero_grad()
                critic_loss.backward()
                critic_optimizer.step()
                
                # Update the Actor: maximize the Critic's estimate of the Actor's action
                actor_loss = -critic(batch.states, actor(batch.states)).mean()
                actor_optimizer.zero_grad()
                actor_loss.backward()
                actor_optimizer.step()
                
                # Softly update the target networks
                soft_update(actor_target, actor, tau=0.001)
                soft_update(critic_target, critic, tau=0.001)
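
The loop above assumes separate optimizers for the Actor and the Critic. A typical setup might look like the following; the learning rates are commonly used values, not ones specified in this write-up:

import torch

actor_optimizer = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_optimizer = torch.optim.Adam(critic.parameters(), lr=1e-3)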

Pros and Cons

Pros:

  • Handles continuous action spaces
  • Reasonably stable training
  • Well suited to tasks such as robot control

Cons:

  • Sensitive to hyperparameters
  • Limited exploration ability
  • Prone to getting stuck in local optima

3. SAC (Soft Actor-Critic)

Core Concept

SAC is based on maximum-entropy reinforcement learning: its objective is to maximize expected reward while also maximizing the entropy (randomness) of the policy.

Distinctive Advantage

# SAC objective function
J(π) = E[ Σ_t ( r_t + α·H(π(·|s_t)) ) ]
#               reward   entropy regularization term (encourages exploration)
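
The entropy term is what becomes the familiar -α·log π(a|s) in the losses below; written out in the same comment style:

# Entropy of the policy at a state s
H(π(·|s)) = E_{a~π}[ -log π(a|s) ]
# Maximizing reward + α·H therefore amounts to adding a -α·log_prob bonus for every sampled action,
# which is exactly the term that appears in the Q targets and the policy loss further down.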

Network Architecture

import copy
import torch
import torch.nn as nn

class SAC:
    def __init__(self, state_dim, action_dim):
        # Policy network (outputs a Gaussian distribution over actions)
        self.policy = GaussianPolicy(state_dim, action_dim)
        
        # Twin Q-networks (reduce overestimation); QNetwork maps (state, action) → Q-value,
        # analogous to the DDPG critic above
        self.q1 = QNetwork(state_dim, action_dim)
        self.q2 = QNetwork(state_dim, action_dim)
        self.q1_target = copy.deepcopy(self.q1)
        self.q2_target = copy.deepcopy(self.q2)
        
        # Automatically tuned temperature parameter
        self.log_alpha = nn.Parameter(torch.zeros(1))
        self.alpha = self.log_alpha.exp()
        
class GaussianPolicy(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU()
        )
        self.mean = nn.Linear(256, action_dim)
        self.log_std = nn.Linear(256, action_dim)
        
    def forward(self, state):
        features = self.network(state)
        mean = self.mean(features)
        log_std = self.log_std(features)
        log_std = torch.clamp(log_std, -20, 2)  # keep the standard deviation in a sane range
        return mean, log_std.exp()
    
    def sample(self, state):
        mean, std = self.forward(state)
        normal = torch.distributions.Normal(mean, std)
        z = normal.rsample()    # reparameterized sample (keeps gradients flowing)
        action = torch.tanh(z)  # squash into [-1, 1]
        
        # Log-probability, corrected for the tanh change of variables
        log_prob = normal.log_prob(z) - torch.log(1 - action.pow(2) + 1e-6)
        return action, log_prob.sum(-1, keepdim=True)
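
The correction on the last line comes from the change-of-variables formula for a = tanh(z); written out in the same comment style:

# Change of variables for a = tanh(z):
# log π(a|s) = log N(z; μ, σ) − Σ_i log(1 − tanh(z_i)²)
# The 1e-6 in the code is only for numerical stability when tanh(z) gets close to ±1.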

Key Innovations

  1. Maximum-entropy framework

    # Policy loss: minimize α·log π(a|s) − Q(s,a), i.e. maximize the Q-value plus an entropy bonus
    policy_loss = (alpha * log_prob - min(q1, q2)).mean()
    
  2. Automatic temperature tuning

    # Adjust α (the temperature) automatically
    target_entropy = -action_dim  # heuristic target entropy
    alpha_loss = -(self.log_alpha * (log_prob + target_entropy).detach()).mean()
    
  3. Twin Q-networks

    # Take the smaller of the two Q-values to reduce overestimation
    min_q = torch.min(q1(state, action), q2(state, action))
    

Complete Training Procedure

def train_sac():
    for step in range(total_steps):
        # Collect one environment transition
        state = env.get_state()
        action, log_prob = policy.sample(state)
        next_state, reward, done = env.step(action)
        replay_buffer.push(state, action, reward, next_state, done)
        
        # Sample a minibatch
        batch = replay_buffer.sample(batch_size)
        
        # 1. Update the Q-functions
        with torch.no_grad():
            next_action, next_log_prob = policy.sample(batch.next_states)
            q1_next = q1_target(batch.next_states, next_action)
            q2_next = q2_target(batch.next_states, next_action)
            min_q_next = torch.min(q1_next, q2_next)
            # The entropy term is folded into the target
            target_q = batch.rewards + gamma * (1 - batch.dones) * (min_q_next - alpha * next_log_prob)
        
        current_q1 = q1(batch.states, batch.actions)
        current_q2 = q2(batch.states, batch.actions)
        q1_loss = F.mse_loss(current_q1, target_q)
        q2_loss = F.mse_loss(current_q2, target_q)
        q_optimizer.zero_grad()
        (q1_loss + q2_loss).backward()
        q_optimizer.step()
        
        # 2. Update the policy
        new_actions, log_probs = policy.sample(batch.states)
        q1_new = q1(batch.states, new_actions)
        q2_new = q2(batch.states, new_actions)
        min_q_new = torch.min(q1_new, q2_new)
        policy_loss = (alpha * log_probs - min_q_new).mean()
        policy_optimizer.zero_grad()
        policy_loss.backward()
        policy_optimizer.step()
        
        # 3. Update the temperature parameter
        alpha_loss = -(log_alpha * (log_probs + target_entropy).detach()).mean()
        alpha_optimizer.zero_grad()
        alpha_loss.backward()
        alpha_optimizer.step()
        alpha = log_alpha.exp().detach()
        
        # 4. Softly update the target Q-networks
        soft_update(q1_target, q1, tau)
        soft_update(q2_target, q2, tau)

Pros and Cons

Pros:

  • Automatically balances exploration and exploitation
  • Very stable training
  • Good sample efficiency
  • No manual tuning of exploration parameters needed

Cons:

  • Higher computational cost
  • Relatively complex to implement
  • Needs more memory (several networks)

4. Algorithm Comparison Summary

Property                  | DQN            | DDPG                       | SAC
Action space              | Discrete       | Continuous                 | Continuous
Policy type               | Value-based    | Deterministic policy       | Stochastic policy
Exploration               | ε-greedy       | OU noise                   | Automatic entropy tuning
Number of networks        | 2 (Q + target) | 4 (actor/critic + targets) | 5 (policy + 4 Q-networks)
Stability                 | Medium         | Medium                     | High
Sample efficiency         | Medium         | Medium                     | High
Implementation difficulty | Simple         | Medium                     | Complex
Typical applications      | Game AI        | Robot control              | Complex manipulation tasks

5. Which Algorithm to Choose

  1. Choose DQN when actions are discrete, you want a simple implementation, and computational resources are limited
  2. Choose DDPG for continuous control in fairly deterministic environments of moderate complexity
  3. Choose SAC when you need high stability, automatic exploration, and are tackling complex tasks

These three algorithms trace the development of deep reinforcement learning from discrete to continuous actions and from simple to more complex methods; each has its own domain of application.
