
Environment initialization script: automatically installs dependencies and starts a virtual display

import sys, os
if 'google.colab' in sys.modules and not os.path.exists('.setup_complete'):
    !wget -q https://raw.githubusercontent.com/yandexdataschool/Practical_RL/master/setup_colab.sh -O- | bash

    !touch .setup_complete

# This code creates a virtual display to draw game images on.
# It will have no effect if your machine has a monitor.
if not os.environ.get("DISPLAY"):
    !bash ../xvfb start
    os.environ['DISPLAY'] = ':1'

Import dependencies

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Install Gymnasium

!pip install gymnasium

Render one frame as an RGB array, and print observation_space / action_space to see the input and output dimensions

import gymnasium as gym

env = gym.make("MountainCar-v0", render_mode="rgb_array")
env.reset()

plt.imshow(env.render())
print("Observation space:", env.observation_space)
print("Action space:", env.action_space)
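For MountainCar-v0, the printed spaces are Box([-1.2, -0.07], [0.6, 0.07], (2,), float32) and Discrete(3). A quick numpy check of those bounds (the bounds are restated here from the Gymnasium documentation, not read from the env object):

```python
import numpy as np

# Documented MountainCar-v0 bounds: position in [-1.2, 0.6], velocity in [-0.07, 0.07]
low = np.array([-1.2, -0.07])
high = np.array([0.6, 0.07])
n_actions = 3  # Discrete(3): 0 = push left, 1 = no push, 2 = push right

obs = np.array([-0.5, 0.0])  # a typical initial observation
assert np.all((low <= obs) & (obs <= high)), "observation must lie inside the Box"
print("observation within bounds; actions are 0..", n_actions - 1)
```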

The three main Gymnasium interfaces

• reset — obs, info = env.reset(seed=?) — returns the initial observation plus an info dict; call it to start a new episode.
• step — obs, r, terminated, truncated, info = env.step(a) — returns a 5-tuple (mind the unpacking); advances the environment by one step.
• render — rgb = env.render() — returns the current frame (in rgb_array mode) or opens a window; use it for visualization.
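These three signatures can be mimicked by a minimal stand-in class (ToyEnv is made up purely for illustration; it only copies the API shape, it is not a real environment):

```python
# ToyEnv is a made-up stand-in that only copies the Gymnasium API shape.
class ToyEnv:
    def reset(self, seed=None):
        self.t = 0
        self.state = 0.0
        return self.state, {}                 # obs, info

    def step(self, action):
        self.t += 1
        self.state += action                  # pretend dynamics
        terminated = self.state >= 3          # the env itself declares "done"
        truncated = self.t >= 10              # a time limit cuts the episode off
        return self.state, -1.0, terminated, truncated, {}   # the 5-tuple

    def render(self):
        return [[0, 0], [0, 0]]               # a real env returns an RGB array here

env_demo = ToyEnv()
obs, info = env_demo.reset(seed=0)
obs, r, term, trunc, info = env_demo.step(1)
print(obs, r, term, trunc)  # → 1.0 -1.0 False False
```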

The 5 elements returned by step:
new_observation: the new state after the action; fed to the agent for the next frame.
reward: the immediate score for this step.
terminated: the environment itself declares the episode over.
truncated: the episode was cut off by the time limit.
info: extra debugging information; safe to ignore for now.
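The standard way to consume this 5-tuple is a loop that stops on either flag. A self-contained sketch with a stand-in step function (fake_step is invented for illustration; no Gymnasium needed):

```python
# fake_step stands in for env.step: it never terminates on its own,
# but is truncated after 5 steps.
def fake_step(t):
    terminated = False                  # the toy task never finishes by itself
    truncated = t >= 5                  # ...but the time limit cuts it off
    return t, -1.0, terminated, truncated, {}

total_reward, t = 0.0, 0
while True:
    obs, reward, terminated, truncated, info = fake_step(t)
    total_reward += reward
    t += 1
    if terminated or truncated:         # an episode ends for EITHER reason
        break

print(t, total_reward)  # → 6 -6.0
```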

Reset ("rewind") the environment twice with different seeds

# Set seed to reproduce initial state in stochastic environment
obs0, info = env.reset(seed=0)
print("initial observation code:", obs0)

obs0, info = env.reset(seed=1)
print("initial observation code:", obs0)

# Note: in MountainCar, observation is just two numbers: car position and velocity
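What seeding buys you can be sketched without the env at all: MountainCar-v0 draws its start position uniformly from [-0.6, -0.4] with zero velocity, and seeding the RNG makes that draw reproducible (toy_reset is a stand-in for illustration, not the Gymnasium API):

```python
import random

# toy_reset mimics the documented MountainCar-v0 reset distribution:
# position uniform in [-0.6, -0.4], velocity 0. Same seed -> same draw.
def toy_reset(seed):
    rng = random.Random(seed)
    return (rng.uniform(-0.6, -0.4), 0.0)

assert toy_reset(0) == toy_reset(0)   # same seed -> identical initial state
assert toy_reset(0) != toy_reset(1)   # different seeds -> different states
print(toy_reset(0), toy_reset(1))
```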


Send action 2 (push right) into the environment and unpack the five returned values:

  • new_obs: the position is about 0.0008 larger than before, so the car really did move a little to the right;
  • reward: the immediate score for this step (in MountainCar it is -1 everywhere except the goal);
  • terminated: still False, the flag has not been reached yet;
  • truncated: also False, the step budget is not used up.

In short: we manually pressed "right" for one frame and printed the new state to see the change.

print("taking action 2 (right)")
new_obs, reward, terminated, truncated, _ = env.step(2)

print("new observation code:", new_obs)
print("reward:", reward)
print("is game over?:", terminated)
print("is game truncated due to time limit?:", truncated)

# Note: as you can see, the car has moved to the right slightly (around 0.0005)


Hand-coded policy exercise:
You are given a MountainCar. The default code just keeps accelerating to the right, but the slope is too steep: gravity drags the car back to the left, and it never reaches the flag.
Goal: without any RL algorithm, use hard-coded logic (if/else, loops, exploiting momentum, swinging back and forth) to make the car reach the flag on the far right.

The environment is defined as follows

from IPython import display

# Create env manually to set time limit. Please don't change this.
TIME_LIMIT = 250
env = gym.wrappers.TimeLimit(
    gym.make("MountainCar-v0", render_mode="rgb_array"),
    max_episode_steps=TIME_LIMIT + 1,
)
actions = {"left": 0, "stop": 1, "right": 2}
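What the TimeLimit wrapper does can be sketched with two toy classes (the class names here are made up for illustration; the real wrapper is gym.wrappers.TimeLimit): count the steps and force `truncated` once max_episode_steps is reached.

```python
# Toy sketch of a step-counting wrapper (names invented for illustration).
class NeverEndingEnv:
    def reset(self, seed=None):
        return 0, {}
    def step(self, action):
        return 0, -1.0, False, False, {}      # never terminates on its own

class ToyTimeLimit:
    def __init__(self, env, max_episode_steps):
        self.env, self.limit = env, max_episode_steps
    def reset(self, seed=None):
        self.t = 0
        return self.env.reset(seed=seed)
    def step(self, action):
        obs, r, terminated, truncated, info = self.env.step(action)
        self.t += 1
        if self.t >= self.limit:
            truncated = True                  # the wrapper, not the env, ends it
        return obs, r, terminated, truncated, info

env_demo = ToyTimeLimit(NeverEndingEnv(), max_episode_steps=3)
env_demo.reset()
flags = [env_demo.step(0)[3] for _ in range(3)]
print(flags)  # → [False, False, True]
```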

Only the velocity matters: accelerate the car in the direction it is already moving.

def policy(obs, t):
    # The observation is (position, velocity); the time step t is also
    # available, but this policy only needs the sign of the velocity.
    position, velocity = obs

    # Swing strategy: always accelerate in the direction the car is already
    # moving, like pumping a swing, so each pass builds up more momentum
    # until the car can climb the right hill.
    if velocity > 0:
        return actions["right"]
    else:
        return actions["left"]
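To see why the swing strategy works while "always right" does not, the MountainCar-v0 update equations (re-typed here from the Gymnasium source, so treat the constants as an assumption) can be rolled out headlessly, without rendering or the env object:

```python
import math

def mc_step(position, velocity, action):
    # MountainCar-v0 update rule: thrust of +/-0.001 plus a gravity term
    # from the hill shape sin(3 * position).
    velocity += (action - 1) * 0.001 - 0.0025 * math.cos(3 * position)
    velocity = max(-0.07, min(0.07, velocity))   # velocity is clipped
    position += velocity
    position = max(-1.2, min(0.6, position))     # position is clipped
    if position == -1.2 and velocity < 0:        # inelastic left wall
        velocity = 0.0
    return position, velocity

def rollout(policy, max_steps=250):
    position, velocity = -0.5, 0.0               # a start inside the usual [-0.6, -0.4] range
    for t in range(max_steps):
        position, velocity = mc_step(position, velocity, policy(position, velocity))
        if position >= 0.5:                      # the flag
            return t + 1
    return None                                  # did not reach the flag in time

always_right = lambda p, v: 2
swing = lambda p, v: 2 if v > 0 else 0           # accelerate with the current velocity

print("always right reaches the flag in:", rollout(always_right))  # None: never gets there
print("swing policy reaches the flag in:", rollout(swing), "steps")
```

Pushing in the direction of motion always adds energy, so each swing climbs higher until the car clears the right hill; constant right thrust (0.001) can never beat the gravity term (up to 0.0025) on its own.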

Reset the environment and play the episode

plt.figure(figsize=(4, 3))
display.clear_output(wait=True)

obs, _ = env.reset()
for t in range(TIME_LIMIT):
    plt.gca().clear()

    action = policy(obs, t)  # Call your policy
    obs, reward, terminated, truncated, _ = env.step(
        action
    )  # Pass the action chosen by the policy to the environment

    # We don't do anything with reward here because MountainCar is a very simple environment,
    # and reward is a constant -1. Therefore, your goal is to end the episode as quickly as possible.

    # Draw game image on display.
    plt.imshow(env.render())

    display.display(plt.gcf())
    display.clear_output(wait=True)

    if terminated or truncated:
        print("Well done!")
        break
else:
    print("Time limit exceeded. Try again.")

display.clear_output(wait=True)


Verify that the task was solved

assert obs[0] > 0.47
print("You solved it!")

Task complete
