1. DQN (Deep Q-Network)

Core Concept

DQN, published by DeepMind in 2015, combines deep learning with Q-learning and was the first algorithm to reach human-level performance on Atari games.

Basic Principle

# The core Q-learning update rule
Q(s,a) = Q(s,a) + α[r + γ·max_a' Q(s',a') - Q(s,a)]
#        current Q value   reward   discounted max future Q value   (the bracketed term is the TD error)
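
As a concrete illustration of this update, here is a minimal tabular Q-learning step in Python; the state/action counts and hyperparameter values are placeholders chosen for the example, not taken from any particular task.

import numpy as np

n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.99
Q = np.zeros((n_states, n_actions))   # the Q-table: one value per (state, action) pair

def q_update(s, a, r, s_next):
    # TD target: immediate reward plus the discounted best future value
    td_target = r + gamma * Q[s_next].max()
    td_error = td_target - Q[s, a]    # the bracketed term in the formula above
    Q[s, a] += alpha * td_error

q_update(s=0, a=1, r=1.0, s_next=3)   # one example transition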

Network Architecture

import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim)  # one Q-value per action
        )
    
    def forward(self, state):
        return self.network(state)  # Q(s,a1), Q(s,a2), ..., Q(s,an)
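
A quick usage sketch (the dimensions and the random input below are placeholders): the network returns one Q-value per action, and the greedy action is simply the index of the largest one.

import torch

q_net = DQN(state_dim=4, action_dim=2)              # e.g. a CartPole-sized problem (illustrative)
state = torch.randn(1, 4)                           # a dummy observation batch of size 1
greedy_action = q_net(state).argmax(dim=1).item()   # index of the highest Q-value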

Two Key Innovations

  1. Experience Replay (a minimal buffer implementation is sketched below)

    # Store a transition
    replay_buffer.store(state, action, reward, next_state, done)
    
    # Sample random minibatches; this breaks the correlation between consecutive samples
    batch = replay_buffer.sample(batch_size=32)
    
  2. Target Network

    # Two networks: an online (evaluation) network and a target network
    q_eval = DQN(state_dim, action_dim)      # updated every training step
    q_target = DQN(state_dim, action_dim)    # synchronized periodically (e.g. every 1000 steps)
    
    # Computing the TD target with the target network makes training more stable
    target = reward + gamma * q_target(next_state).max()
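
A minimal replay buffer sketch that would back the replay_buffer / memory calls used in this section; treating transitions as plain tuples in a deque is an assumption of this sketch, not a detail from the original DQN paper.

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped automatically

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform random sampling breaks the correlation between consecutive steps
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)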
    

Training Procedure

def train_dqn():
    step = 0
    for episode in range(num_episodes):
        state = env.reset()
        done = False
        
        while not done:
            # ε-greedy action selection
            if random.random() < epsilon:
                action = random.choice(action_space)
            else:
                action = q_eval(state).argmax().item()
            
            # Execute the action
            next_state, reward, done = env.step(action)
            
            # Store the transition
            memory.push(state, action, reward, next_state, done)
            state = next_state
            step += 1
            
            # Sample a minibatch from the replay buffer and learn
            if len(memory) > batch_size:
                batch = memory.sample(batch_size)
                
                # Q-values of the actions that were actually taken
                q_values = q_eval(batch.states).gather(1, batch.actions)
                
                # TD target uses the target network; no gradient flows through it
                with torch.no_grad():
                    next_q_values = q_target(batch.next_states).max(dim=1, keepdim=True)[0]
                    target_q = batch.rewards + gamma * (1 - batch.dones) * next_q_values
                
                loss = F.mse_loss(q_values, target_q)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
            
            # Periodically synchronize the target network with the online network
            if step % target_update_interval == 0:
                q_target.load_state_dict(q_eval.state_dict())
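
The loop above treats epsilon as a constant; in practice it is usually annealed from a high value toward a small one so that exploration decreases over training. A simple linear schedule sketch (the start, end, and decay-horizon values are illustrative):

eps_start, eps_end, eps_decay_steps = 1.0, 0.05, 50_000

def epsilon_at(step):
    # linear anneal from eps_start down to eps_end, then hold constant
    frac = min(step / eps_decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)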

Pros and Cons

Pros:

  • Handles high-dimensional inputs (e.g. images)
  • Good sample efficiency (past experience can be reused)
  • Relatively simple and easy to implement

Cons:

  • Limited to discrete action spaces
  • Prone to overestimating Q-values
  • Sensitive to hyperparameters

2. DDPG (Deep Deterministic Policy Gradient)

Core Concept

DDPG extends the ideas behind DQN to continuous action spaces and is often described as "DQN for continuous action spaces".

Algorithm Characteristics

  • Actor-Critic architecture: the Actor produces actions and the Critic evaluates them
  • Deterministic policy: given a state, it outputs a single concrete action rather than a probability distribution (the corresponding policy gradient is written out below)
  • Off-policy: it can learn from data collected by any behavior policy
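
For reference, the deterministic policy gradient behind the Actor update can be written in the same comment style as the Q-learning formula above:

# Deterministic policy gradient (the Actor update)
∇_θ J ≈ E_s[ ∇_a Q(s, a)|_{a=μ_θ(s)} · ∇_θ μ_θ(s) ]
# In code this becomes actor_loss = -critic(s, actor(s)).mean(), backpropagated through both networks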

Network Architecture

import copy
import torch.nn as nn

class DDPG:
    def __init__(self, state_dim, action_dim):
        # Actor network: state → action
        self.actor = nn.Sequential(
            nn.Linear(state_dim, 400),
            nn.ReLU(),
            nn.Linear(400, 300),
            nn.ReLU(),
            nn.Linear(300, action_dim),
            nn.Tanh()  # squashes the output into [-1, 1]
        )
        
        # Critic network: (state, action) → Q-value
        self.critic = nn.Sequential(
            nn.Linear(state_dim + action_dim, 400),
            nn.ReLU(),
            nn.Linear(400, 300),
            nn.ReLU(),
            nn.Linear(300, 1)  # a single Q-value
        )
        
        # Target networks (updated via soft updates)
        self.actor_target = copy.deepcopy(self.actor)
        self.critic_target = copy.deepcopy(self.critic)
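
Because the Critic above is a plain nn.Sequential over a concatenated input, calling it means joining the state and action along the feature dimension. A small usage sketch with made-up dimensions:

import torch

agent = DDPG(state_dim=3, action_dim=1)
states = torch.randn(32, 3)                                    # a dummy batch of states
actions = agent.actor(states)                                  # deterministic actions in [-1, 1]
q_values = agent.critic(torch.cat([states, actions], dim=-1))  # shape: (32, 1)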

Key Techniques

  1. Ornstein-Uhlenbeck noise (exploration mechanism)

    import numpy as np

    class OUNoise:
        def __init__(self, action_dim, theta=0.15, sigma=0.2):
            self.theta = theta
            self.sigma = sigma
            self.state = np.zeros(action_dim)
        
        def sample(self):
            # mean-reverting step: drift back toward zero plus independent Gaussian noise per dimension
            dx = self.theta * (-self.state) + self.sigma * np.random.randn(*self.state.shape)
            self.state += dx
            return self.state
    
    # Explore by adding the noise to the deterministic action
    action = actor(state) + ou_noise.sample()
    
  2. Soft Update

    # Blend the online network into the target network slowly; tau is typically 0.001
    def soft_update(target, source, tau):
        for target_param, param in zip(target.parameters(), source.parameters()):
            target_param.data.copy_(tau * param.data + (1-tau) * target_param.data)
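
One practical detail the snippets above gloss over: adding exploration noise can push the action outside the Tanh actor's [-1, 1] range, so it is commonly clipped before being sent to the environment. A small sketch reusing the OUNoise class and the agent object from the earlier examples:

import numpy as np
import torch

ou_noise = OUNoise(action_dim=1)
state = torch.randn(1, 3)                                  # dummy observation; dimensions are illustrative
action = agent.actor(state).detach().numpy()[0]            # deterministic action from the Actor
action = np.clip(action + ou_noise.sample(), -1.0, 1.0)    # keep the noisy action inside the valid range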
    

Training Algorithm

def train_ddpg():
    for episode in range(num_episodes):
        state = env.reset()
        done = False
        
        while not done:
            # Actor picks an action, plus exploration noise
            action = actor(state) + noise.sample()
            next_state, reward, done = env.step(action)
            
            # Store the transition in the replay buffer
            replay_buffer.push(state, action, reward, next_state, done)
            state = next_state
            
            # Sample a minibatch and train
            if len(replay_buffer) > batch_size:
                batch = replay_buffer.sample(batch_size)
                
                # Update the Critic (the TD target uses the target networks)
                with torch.no_grad():
                    next_actions = actor_target(batch.next_states)
                    target_q = batch.rewards + gamma * (1 - batch.dones) * critic_target(batch.next_states, next_actions)
                current_q = critic(batch.states, batch.actions)
                critic_loss = F.mse_loss(current_q, target_q)
                critic_optimizer.zero_grad()
                critic_loss.backward()
                critic_optimizer.step()
                
                # Update the Actor: maximize the Critic's estimate of the Actor's action
                actor_loss = -critic(batch.states, actor(batch.states)).mean()
                actor_optimizer.zero_grad()
                actor_loss.backward()
                actor_optimizer.step()
                
                # Softly update the target networks
                soft_update(actor_target, actor, tau=0.001)
                soft_update(critic_target, critic, tau=0.001)
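
The loop above assumes separate optimizers for the Actor and the Critic. A typical setup might look like the following; the learning rates are commonly used values, not ones specified in this write-up:

import torch

actor_optimizer = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_optimizer = torch.optim.Adam(critic.parameters(), lr=1e-3)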

Pros and Cons

Pros:

  • Handles continuous action spaces
  • Reasonably stable training
  • Well suited to tasks such as robot control

Cons:

  • Sensitive to hyperparameters
  • Limited exploration ability
  • Prone to getting stuck in local optima

3. SAC (Soft Actor-Critic)

Core Concept

SAC is based on maximum-entropy reinforcement learning: its objective is to maximize expected reward while also maximizing the entropy (randomness) of the policy.

Distinctive Advantage

# SAC objective function
J(π) = E[ Σ_t ( r_t + α·H(π(·|s_t)) ) ]
#               reward   entropy regularization term (encourages exploration)
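
The entropy term is what becomes the familiar -α·log π(a|s) in the losses below; written out in the same comment style:

# Entropy of the policy at a state s
H(π(·|s)) = E_{a~π}[ -log π(a|s) ]
# Maximizing reward + α·H therefore amounts to adding a -α·log_prob bonus for every sampled action,
# which is exactly the term that appears in the Q targets and the policy loss further down.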

Network Architecture

import copy
import torch
import torch.nn as nn

class SAC:
    def __init__(self, state_dim, action_dim):
        # Policy network (outputs a Gaussian distribution over actions)
        self.policy = GaussianPolicy(state_dim, action_dim)
        
        # Twin Q-networks (reduce overestimation); QNetwork maps (state, action) → Q-value,
        # analogous to the DDPG critic above
        self.q1 = QNetwork(state_dim, action_dim)
        self.q2 = QNetwork(state_dim, action_dim)
        self.q1_target = copy.deepcopy(self.q1)
        self.q2_target = copy.deepcopy(self.q2)
        
        # Automatically tuned temperature parameter
        self.log_alpha = nn.Parameter(torch.zeros(1))
        self.alpha = self.log_alpha.exp()
        
class GaussianPolicy(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU()
        )
        self.mean = nn.Linear(256, action_dim)
        self.log_std = nn.Linear(256, action_dim)
        
    def forward(self, state):
        features = self.network(state)
        mean = self.mean(features)
        log_std = self.log_std(features)
        log_std = torch.clamp(log_std, -20, 2)  # keep the standard deviation in a sane range
        return mean, log_std.exp()
    
    def sample(self, state):
        mean, std = self.forward(state)
        normal = torch.distributions.Normal(mean, std)
        z = normal.rsample()    # reparameterized sample (keeps gradients flowing)
        action = torch.tanh(z)  # squash into [-1, 1]
        
        # Log-probability, corrected for the tanh change of variables
        log_prob = normal.log_prob(z) - torch.log(1 - action.pow(2) + 1e-6)
        return action, log_prob.sum(-1, keepdim=True)
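
The correction on the last line comes from the change-of-variables formula for a = tanh(z); written out in the same comment style:

# Change of variables for a = tanh(z):
# log π(a|s) = log N(z; μ, σ) − Σ_i log(1 − tanh(z_i)²)
# The 1e-6 in the code is only for numerical stability when tanh(z) gets close to ±1.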

Key Innovations

  1. Maximum-entropy framework

    # Policy loss: minimize α·log π(a|s) − Q(s,a), i.e. maximize the Q-value plus an entropy bonus
    policy_loss = (alpha * log_prob - min(q1, q2)).mean()
    
  2. Automatic temperature tuning

    # Adjust α (the temperature) automatically
    target_entropy = -action_dim  # heuristic target entropy
    alpha_loss = -(self.log_alpha * (log_prob + target_entropy).detach()).mean()
    
  3. Twin Q-networks

    # Take the smaller of the two Q-values to reduce overestimation
    min_q = torch.min(q1(state, action), q2(state, action))
    

Complete Training Procedure

def train_sac():
    for step in range(total_steps):
        # Collect one environment transition
        state = env.get_state()
        action, log_prob = policy.sample(state)
        next_state, reward, done = env.step(action)
        replay_buffer.push(state, action, reward, next_state, done)
        
        # Sample a minibatch
        batch = replay_buffer.sample(batch_size)
        
        # 1. Update the Q-functions
        with torch.no_grad():
            next_action, next_log_prob = policy.sample(batch.next_states)
            q1_next = q1_target(batch.next_states, next_action)
            q2_next = q2_target(batch.next_states, next_action)
            min_q_next = torch.min(q1_next, q2_next)
            # The entropy term is folded into the target
            target_q = batch.rewards + gamma * (1 - batch.dones) * (min_q_next - alpha * next_log_prob)
        
        current_q1 = q1(batch.states, batch.actions)
        current_q2 = q2(batch.states, batch.actions)
        q1_loss = F.mse_loss(current_q1, target_q)
        q2_loss = F.mse_loss(current_q2, target_q)
        q_optimizer.zero_grad()
        (q1_loss + q2_loss).backward()
        q_optimizer.step()
        
        # 2. Update the policy
        new_actions, log_probs = policy.sample(batch.states)
        q1_new = q1(batch.states, new_actions)
        q2_new = q2(batch.states, new_actions)
        min_q_new = torch.min(q1_new, q2_new)
        policy_loss = (alpha * log_probs - min_q_new).mean()
        policy_optimizer.zero_grad()
        policy_loss.backward()
        policy_optimizer.step()
        
        # 3. Update the temperature parameter
        alpha_loss = -(log_alpha * (log_probs + target_entropy).detach()).mean()
        alpha_optimizer.zero_grad()
        alpha_loss.backward()
        alpha_optimizer.step()
        alpha = log_alpha.exp().detach()
        
        # 4. Softly update the target Q-networks
        soft_update(q1_target, q1, tau)
        soft_update(q2_target, q2, tau)

Pros and Cons

Pros:

  • Automatically balances exploration and exploitation
  • Very stable training
  • Good sample efficiency
  • No manual tuning of exploration parameters needed

Cons:

  • Higher computational cost
  • Relatively complex to implement
  • Needs more memory (several networks)

4. Algorithm Comparison Summary

Property                  | DQN            | DDPG                       | SAC
Action space              | Discrete       | Continuous                 | Continuous
Policy type               | Value-based    | Deterministic policy       | Stochastic policy
Exploration               | ε-greedy       | OU noise                   | Automatic entropy tuning
Number of networks        | 2 (Q + target) | 4 (actor/critic + targets) | 5 (policy + 4 Q-networks)
Stability                 | Medium         | Medium                     | High
Sample efficiency         | Medium         | Medium                     | High
Implementation difficulty | Simple         | Medium                     | Complex
Typical applications      | Game AI        | Robot control              | Complex manipulation tasks

5. Which Algorithm to Choose

  1. Choose DQN when actions are discrete, you want a simple implementation, and computational resources are limited
  2. Choose DDPG for continuous control in fairly deterministic environments of moderate complexity
  3. Choose SAC when you need high stability, automatic exploration, and are tackling complex tasks

These three algorithms trace the development of deep reinforcement learning from discrete to continuous actions and from simple to more complex methods; each has its own domain of application.
