
Environment initialization script: automatically installs dependencies and starts a virtual display

import sys, os
if 'google.colab' in sys.modules and not os.path.exists('.setup_complete'):
    !wget -q https://raw.githubusercontent.com/yandexdataschool/Practical_RL/master/setup_colab.sh -O- | bash

    !touch .setup_complete

# This code creates a virtual display to draw game images on.
# It will have no effect if your machine has a monitor.
if not os.environ.get("DISPLAY"):
    !bash ../xvfb start
    os.environ['DISPLAY'] = ':1'

Import dependencies

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Install Gymnasium

!pip install gymnasium

Render one frame as an RGB array, and print observation_space / action_space to see the input and output dimensions

import gymnasium as gym

env = gym.make("MountainCar-v0", render_mode="rgb_array")
env.reset()

plt.imshow(env.render())
print("Observation space:", env.observation_space)
print("Action space:", env.action_space)
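For MountainCar-v0, the printed spaces are Box([-1.2, -0.07], [0.6, 0.07], (2,), float32) and Discrete(3). A quick numpy check of those bounds (the bounds are restated here from the Gymnasium documentation, not read from the env object):

```python
import numpy as np

# Documented MountainCar-v0 bounds: position in [-1.2, 0.6], velocity in [-0.07, 0.07]
low = np.array([-1.2, -0.07])
high = np.array([0.6, 0.07])
n_actions = 3  # Discrete(3): 0 = push left, 1 = no push, 2 = push right

obs = np.array([-0.5, 0.0])  # a typical initial observation
assert np.all((low <= obs) & (obs <= high)), "observation must lie inside the Box"
print("observation within bounds; actions are 0..", n_actions - 1)
```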

The three main Gymnasium interfaces

• reset — obs, info = env.reset(seed=?) — returns the initial observation plus an info dict; call it to start a new episode.
• step — obs, r, terminated, truncated, info = env.step(a) — returns a 5-tuple (mind the unpacking); advances the environment by one step.
• render — rgb = env.render() — returns the current frame (in rgb_array mode) or opens a window; use it for visualization.
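These three signatures can be mimicked by a minimal stand-in class (ToyEnv is made up purely for illustration; it only copies the API shape, it is not a real environment):

```python
# ToyEnv is a made-up stand-in that only copies the Gymnasium API shape.
class ToyEnv:
    def reset(self, seed=None):
        self.t = 0
        self.state = 0.0
        return self.state, {}                 # obs, info

    def step(self, action):
        self.t += 1
        self.state += action                  # pretend dynamics
        terminated = self.state >= 3          # the env itself declares "done"
        truncated = self.t >= 10              # a time limit cuts the episode off
        return self.state, -1.0, terminated, truncated, {}   # the 5-tuple

    def render(self):
        return [[0, 0], [0, 0]]               # a real env returns an RGB array here

env_demo = ToyEnv()
obs, info = env_demo.reset(seed=0)
obs, r, term, trunc, info = env_demo.step(1)
print(obs, r, term, trunc)  # → 1.0 -1.0 False False
```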

The 5 elements returned by step:
new_observation: the new state after the action; fed to the agent for the next frame.
reward: the immediate score for this step.
terminated: the environment itself declares the episode over.
truncated: the episode was cut off by the time limit.
info: extra debugging information; safe to ignore for now.
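The standard way to consume this 5-tuple is a loop that stops on either flag. A self-contained sketch with a stand-in step function (fake_step is invented for illustration; no Gymnasium needed):

```python
# fake_step stands in for env.step: it never terminates on its own,
# but is truncated after 5 steps.
def fake_step(t):
    terminated = False                  # the toy task never finishes by itself
    truncated = t >= 5                  # ...but the time limit cuts it off
    return t, -1.0, terminated, truncated, {}

total_reward, t = 0.0, 0
while True:
    obs, reward, terminated, truncated, info = fake_step(t)
    total_reward += reward
    t += 1
    if terminated or truncated:         # an episode ends for EITHER reason
        break

print(t, total_reward)  # → 6 -6.0
```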

Reset ("rewind") the environment twice with different seeds

# Set seed to reproduce initial state in stochastic environment
obs0, info = env.reset(seed=0)
print("initial observation code:", obs0)

obs0, info = env.reset(seed=1)
print("initial observation code:", obs0)

# Note: in MountainCar, observation is just two numbers: car position and velocity
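What seeding buys you can be sketched without the env at all: MountainCar-v0 draws its start position uniformly from [-0.6, -0.4] with zero velocity, and seeding the RNG makes that draw reproducible (toy_reset is a stand-in for illustration, not the Gymnasium API):

```python
import random

# toy_reset mimics the documented MountainCar-v0 reset distribution:
# position uniform in [-0.6, -0.4], velocity 0. Same seed -> same draw.
def toy_reset(seed):
    rng = random.Random(seed)
    return (rng.uniform(-0.6, -0.4), 0.0)

assert toy_reset(0) == toy_reset(0)   # same seed -> identical initial state
assert toy_reset(0) != toy_reset(1)   # different seeds -> different states
print(toy_reset(0), toy_reset(1))
```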


Send action 2 (push right) into the environment and unpack the five returned values:

  • new_obs: the position is about 0.0008 larger than before, so the car really did move a little to the right;
  • reward: the immediate score for this step (in MountainCar it is -1 everywhere except the goal);
  • terminated: still False, the flag has not been reached yet;
  • truncated: also False, the step budget is not used up.

In short: we manually pressed "right" for one frame and printed the new state to see the change.

print("taking action 2 (right)")
new_obs, reward, terminated, truncated, _ = env.step(2)

print("new observation code:", new_obs)
print("reward:", reward)
print("is game over?:", terminated)
print("is game truncated due to time limit?:", truncated)

# Note: as you can see, the car has moved to the right slightly (around 0.0005)


Hand-coded policy exercise:
You are given a MountainCar. The default code just keeps accelerating to the right, but the slope is too steep: gravity drags the car back to the left, and it never reaches the flag.
Goal: without any RL algorithm, use hard-coded logic (if/else, loops, exploiting momentum, swinging back and forth) to make the car reach the flag on the far right.

The environment is defined as follows

from IPython import display

# Create env manually to set time limit. Please don't change this.
TIME_LIMIT = 250
env = gym.wrappers.TimeLimit(
    gym.make("MountainCar-v0", render_mode="rgb_array"),
    max_episode_steps=TIME_LIMIT + 1,
)
actions = {"left": 0, "stop": 1, "right": 2}
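What the TimeLimit wrapper does can be sketched with two toy classes (the class names here are made up for illustration; the real wrapper is gym.wrappers.TimeLimit): count the steps and force `truncated` once max_episode_steps is reached.

```python
# Toy sketch of a step-counting wrapper (names invented for illustration).
class NeverEndingEnv:
    def reset(self, seed=None):
        return 0, {}
    def step(self, action):
        return 0, -1.0, False, False, {}      # never terminates on its own

class ToyTimeLimit:
    def __init__(self, env, max_episode_steps):
        self.env, self.limit = env, max_episode_steps
    def reset(self, seed=None):
        self.t = 0
        return self.env.reset(seed=seed)
    def step(self, action):
        obs, r, terminated, truncated, info = self.env.step(action)
        self.t += 1
        if self.t >= self.limit:
            truncated = True                  # the wrapper, not the env, ends it
        return obs, r, terminated, truncated, info

env_demo = ToyTimeLimit(NeverEndingEnv(), max_episode_steps=3)
env_demo.reset()
flags = [env_demo.step(0)[3] for _ in range(3)]
print(flags)  # → [False, False, True]
```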

Only the velocity matters: accelerate the car in the direction it is already moving.

def policy(obs, t):
    # The observation is (position, velocity); the time step t is also
    # available, but this policy only needs the sign of the velocity.
    position, velocity = obs

    # Swing strategy: always accelerate in the direction the car is already
    # moving, like pumping a swing, so each pass builds up more momentum
    # until the car can climb the right hill.
    if velocity > 0:
        return actions["right"]
    else:
        return actions["left"]
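To see why the swing strategy works while "always right" does not, the MountainCar-v0 update equations (re-typed here from the Gymnasium source, so treat the constants as an assumption) can be rolled out headlessly, without rendering or the env object:

```python
import math

def mc_step(position, velocity, action):
    # MountainCar-v0 update rule: thrust of +/-0.001 plus a gravity term
    # from the hill shape sin(3 * position).
    velocity += (action - 1) * 0.001 - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))   # velocity is clipped
    position += velocity
    position = max(-1.2, min(0.6, position))     # position is clipped
    if position == -1.2 and velocity < 0:        # inelastic left wall
        velocity = 0.0
    return position, velocity

def rollout(policy, max_steps=250):
    position, velocity = -0.5, 0.0               # a start inside the usual [-0.6, -0.4] range
    for t in range(max_steps):
        position, velocity = mc_step(position, velocity, policy(position, velocity))
        if position >= 0.5:                      # the flag
            return t + 1
    return None                                  # did not reach the flag in time

always_right = lambda p, v: 2
swing = lambda p, v: 2 if v > 0 else 0           # accelerate with the current velocity

print("always right reaches the flag in:", rollout(always_right))  # None: never gets there
print("swing policy reaches the flag in:", rollout(swing), "steps")
```

Pushing in the direction of motion always adds energy, so each swing climbs higher until the car clears the right hill; constant right thrust (0.001) can never beat the gravity term (up to 0.0025) on its own.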

Reset the environment and play the episode

plt.figure(figsize=(4, 3))
display.clear_output(wait=True)

obs, _ = env.reset()
for t in range(TIME_LIMIT):
    plt.gca().clear()

    action = policy(obs, t)  # Call your policy
    obs, reward, terminated, truncated, _ = env.step(
        action
    )  # Pass the action chosen by the policy to the environment

    # We don't do anything with reward here because MountainCar is a very simple environment,
    # and reward is a constant -1. Therefore, your goal is to end the episode as quickly as possible.

    # Draw game image on display.
    plt.imshow(env.render())

    display.display(plt.gcf())
    display.clear_output(wait=True)

    if terminated or truncated:
        print("Well done!")
        break
else:
    print("Time limit exceeded. Try again.")

display.clear_output(wait=True)


Verify that the task was solved

assert obs[0] > 0.47
print("You solved it!")

Task complete
