Notes from this course:

强化学习的数学原理 (Mathematical Foundations of Reinforcement Learning), 中国大学MOOC(慕课)

The companion book is more detailed than the lecture slides, so it is worth reading alongside them.

I recommend going straight to the book; this post compiles material from both the slides & the book.

state: the status of the agent with respect to the environment

state space: the set of all states; an agent can have many possible states

action: a choice the agent can make in a state; different states can have different sets of actions

action space of a state: the set of all possible actions of a state

state transition: when taking an action, the agent may move from one state to another. It can be deterministic or stochastic.

tabular representation: we can use a table to describe the state transition, but can only represent deterministic cases

policy tells the agent what actions to take at a state. It can be deterministic or stochastic (a probability distribution over all possible actions)

tabular representation of a policy can represent either deterministic or stochastic cases.

reward: a real number obtained after taking an action; it can represent encouragement or punishment, and even a positive reward can mean punishment. In fact, it is the relative reward values, rather than the absolute values, that determine encouragement or discouragement. Reward can be interpreted as a human-machine interface, through which we can guide the agent to behave as we expect.

a trajectory is a state-action-reward chain, and it can be infinite. We therefore introduce a discount rate \gamma \in (0,1): the discounted sum of rewards then becomes finite, and \gamma balances the far- and near-future rewards.

the return of a trajectory is the sum of all the rewards collected along it; returns can be used to evaluate whether a policy is good or not
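The discounted return can be sketched in a few lines of Python. This is only an illustration: the reward sequence and the value of gamma below are made-up.

```python
def discounted_return(rewards, gamma):
    """Sum r_1 + gamma * r_2 + gamma**2 * r_3 + ... along a finite trajectory."""
    g = 0.0
    for r in reversed(rewards):  # fold from the end: G = r + gamma * G_next
        g = r + gamma * g
    return g

# A reward of 1 received on the fourth step is discounted by gamma**3.
print(discounted_return([0, 0, 0, 1], gamma=0.9))
```

The backward fold uses exactly the recursion G_t = R_{t+1} + \gamma G_{t+2...} that the discount rate introduces.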

episode: when interacting with the environment following a policy, the agent may stop at some terminal state. The resulting trajectory is called an episode.

episodic/continuing tasks can be treated in a unified mathematical way by converting episodic tasks to continuing tasks.

Markov decision process:

1) Markov property: memoryless property of a stochastic process.

Mathematically, it means that

p(s_{t+1}|s_t,a_t,s_{t-1},a_{t-1},...,s_0,a_0)=p(s_{t+1}|s_t,a_t),\\ p(r_{t+1}|s_t,a_t,s_{t-1},a_{t-1},...,s_0,a_0)=p(r_{t+1}|s_t,a_t)

Equation indicates that the next state or reward depends merely on the current state and action and is independent of the previous ones.

2) policy: in state s, the probability of choosing action a is \pi(a|s). It holds that \sum_{a\in{\mathcal A}(s)}\pi(a|s)=1 for any s \in {\mathcal S}.

3) state / action / reward / state transition probability / reward probability

state transition probability: in state s, when taking action a, the probability of transitioning to state s' is p(s'|s,a). It holds that \sum_{s'\in {\mathcal S}}p(s'|s,a)=1 for any (s,a).

reward probability: in state s, when taking action a, the probability of obtaining reward r is p(r|s,a). It holds that \sum_{r\in {\mathcal R}(s,a)}p(r|s,a)=1 for any (s,a).

4) a Markov decision process becomes a Markov process once the policy is given

model or dynamics: p(s'|s,a) and p(r|s,a) for all (s,a). The model can be either stationary or nonstationary: a stationary model does not change over time, while a nonstationary model may vary over time.
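A model can be represented directly as the two probability tables p(s'|s,a) and p(r|s,a). The two-state example below is entirely made-up; the point is only that every (s,a) pair carries two normalized distributions.

```python
# A toy, hand-made stationary model (states, actions, and probabilities are illustrative).
# p_state[s][a] maps next states s' to p(s'|s,a); p_reward[s][a] maps rewards r to p(r|s,a).
p_state = {
    "s1": {"right": {"s2": 1.0}, "stay": {"s1": 1.0}},
    "s2": {"left": {"s1": 0.8, "s2": 0.2}, "stay": {"s2": 1.0}},
}
p_reward = {
    "s1": {"right": {0: 1.0}, "stay": {-1: 1.0}},
    "s2": {"left": {0: 0.5, 1: 0.5}, "stay": {1: 1.0}},
}

# Both distributions must sum to one for every (s, a) pair.
for s, acts in p_state.items():
    for a, dist in acts.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-9
        assert abs(sum(p_reward[s][a].values()) - 1.0) < 1e-9
print("valid stationary model")
```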

I. a specific example

v1=r1+γv2

v2=r2+γv3

v3=r3+γv4

v4=r4+γv1

which can be written compactly as

v=r+γPv

Though simple, the Bellman equation demonstrates the core idea:

the value of one state relies on the values of other states.
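The four-state example above can be checked numerically. The rewards below are made-up values; the fixed-point iteration v \leftarrow r + \gamma P v recovers the same v_1 as the closed form obtained by unrolling the cycle.

```python
# Toy system v_i = r_i + gamma * v_{i+1} on a 4-state cycle (illustrative rewards).
gamma = 0.9
r = [0.0, 1.0, 1.0, 1.0]           # r1..r4
v = [0.0] * 4
for _ in range(1000):               # fixed-point iteration v <- r + gamma * P v
    v = [r[i] + gamma * v[(i + 1) % 4] for i in range(4)]

# Unrolling v1 = r1 + g*r2 + g^2*r3 + g^3*r4 + g^4*v1 gives the closed form:
v1 = (r[0] + gamma * r[1] + gamma**2 * r[2] + gamma**3 * r[3]) / (1 - gamma**4)
print(v, v1)
```

The iteration converges because each sweep shrinks the error by the factor \gamma < 1.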


Because the policy and the model can both be stochastic, even if S_t and A_t are known, R_{t+1} and S_{t+1} are still random variables.

state value:

  • introduce some notations

Consider a sequence of time steps t=0,1,2,... At time t, the agent is in state S_t, and the action taken following a policy \pi is A_t. The next state is S_{t+1}, and the immediate reward obtained is R_{t+1}. This process can be expressed concisely as 

S_t \xrightarrow{A_t} S_{t+1},R_{t+1}

Note that these are all random variables.

Moreover, S_t,S_{t+1} \in {\mathcal S}, A_t \in {\mathcal A}(S_t), and R_{t+1} \in {\mathcal R}(S_t,A_t)

  • By definition, the discounted return along the trajectory is

G_t \doteq R_{t+1} + \gamma R_{t+2}+\gamma^2R_{t+3}+...

  • Since G_t is a random variable, we can calculate its expected value, which defines the state value:

v_\pi(s) \doteq {\mathbb E}[G_t|S_t=s]

It depends on s: it is a conditional expectation with the condition that the state starts from s.

It depends on \pi: for a different policy, the state value can be different.

It does not depend on t.

It represents the value of a state: if the state value is greater, the policy is better, because greater cumulative rewards can be obtained. If both the policy and the system model are deterministic, the return can be used to evaluate policies. By contrast, when either the policy or the system model is stochastic, starting from the same state may generate different trajectories, so the state value (an expectation over trajectories) should be used to evaluate policies.

The state value is the expected return starting from that state and following policy \pi thereafter
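Since the state value is an expectation, it can be estimated by averaging sampled returns. The sketch below assumes a made-up two-state chain with the policy already baked into the transition function, and truncates trajectories at a horizon where \gamma^k is negligible.

```python
import random

# Monte Carlo estimate of v_pi(s) = E[G_t | S_t = s] on an illustrative two-state chain.
gamma = 0.5
random.seed(0)

def step(s):
    """Stochastic model under a fixed policy: from s, return (next_state, reward)."""
    if s == "A":
        return ("B", 0.0) if random.random() < 0.7 else ("A", 1.0)
    return ("A", 1.0)            # from B, always move back to A with reward 1

def sample_return(s, horizon=50):
    g, discount = 0.0, 1.0
    for _ in range(horizon):     # truncate: gamma**50 is negligible
        s, r = step(s)
        g += discount * r
        discount *= gamma
    return g

estimate = sum(sample_return("A") for _ in range(20000)) / 20000
print(estimate)                  # approximates v_pi("A")
```

Solving the Bellman equation for this chain by hand gives v_\pi(A) = 0.65/0.675 \approx 0.963, which the sample average approaches.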

II. Bellman equation: Derivation

the Bellman equation can be used to calculate state value

the Bellman equation describes the relationship among the values of all states.

Writing G_t=R_{t+1}+\gamma G_{t+1}, the state value splits into two terms:

v_\pi(s)={\mathbb E}[R_{t+1}|S_t=s]+\gamma{\mathbb E}[G_{t+1}|S_t=s]

the first term is the expectation of the immediate rewards:

{\mathbb E}[R_{t+1}|S_t=s]=\sum\limits_{a\in{\mathcal A}(s)}\pi(a|s)\sum\limits_{r\in{\mathcal R}(s,a)}p(r|s,a)r

r depends on s, a, and s'; however, since s' also depends on s and a, we can equivalently write r as a function of s and a.

the second term is the expectation of the future rewards:

{\mathbb E}[G_{t+1}|S_t=s]=\sum\limits_{s'\in{\mathcal S}}v_\pi(s')\sum\limits_{a\in{\mathcal A}(s)}p(s'|s,a)\pi(a|s)=\sum\limits_{a\in{\mathcal A}(s)}\pi(a|s)\sum\limits_{s'\in{\mathcal S}}p(s'|s,a)v_\pi(s')

these two formulas are the same because of the interchangeability of double summation

the first formula: for every possible s', take its value v_\pi(s') and multiply it by the total probability of transitioning from s to s'

the second formula: for each action, calculate the expected value of the next state, then weight it by the probability of selecting that action

The condition S_t=s can be dropped because of the memoryless Markov property: given S_{t+1}=s', S_t=s provides no additional information, so {\mathbb E}[G_{t+1}|S_t=s,S_{t+1}=s']={\mathbb E}[G_{t+1}|S_{t+1}=s']=v_\pi(s')

Therefore, we have the Bellman equation in elementwise form:

v_\pi(s)=\sum\limits_{a\in{\mathcal A}(s)}\pi(a|s)\left[\sum\limits_{r\in{\mathcal R}(s,a)}p(r|s,a)r+\gamma\sum\limits_{s'\in{\mathcal S}}p(s'|s,a)v_\pi(s')\right],\quad \forall s\in{\mathcal S} \qquad (1)

Bellman equation: characterizes the relationship among the state-value functions of different states.

It consists of two terms, the immediate reward term and the future reward term.

III. Bellman equation: Matrix-vector form

The above elementwise form is valid for every state; that means there are |{\mathcal S}| such equations.

r_\pi(s) denotes the mean of the immediate rewards: r_\pi(s)=\sum\limits_{a\in{\mathcal A}(s)}\pi(a|s)\sum\limits_{r\in{\mathcal R}(s,a)}p(r|s,a)r

p_\pi(s'|s) is the probability of transitioning from s to s' under policy \pi: p_\pi(s'|s)=\sum\limits_{a\in{\mathcal A}(s)}\pi(a|s)p(s'|s,a)

Suppose that the states are indexed as s_i with i=1,\dots,n, where n=|{\mathcal S}|. For state s_i, (1) can be written as v_\pi(s_i)=r_\pi(s_i)+\gamma \sum\limits_{s_j\in{\mathcal S}}p_\pi(s_j|s_i)v_\pi(s_j)

Let v_\pi=[v_\pi(s_1),...,v_\pi(s_n)]^T\in {\mathbb R}^n, r_\pi=[r_\pi(s_1),...,r_\pi(s_n)]^T\in {\mathbb R}^n, and P_\pi \in {\mathbb R}^{n\times n} with [P_\pi]_{ij}=p_\pi(s_j|s_i). Then the Bellman equation can be written in matrix-vector form as

v_\pi=r_\pi+\gamma P_\pi v_\pi

where v_\pi is the unknown to be solved, and r_\pi, P_\pi are known.

properties of P_\pi: 1. P_\pi \geq 0 elementwise; 2. P_\pi is a stochastic matrix, meaning that the entries in every row sum to one.

IV. Bellman equation: solving the state values (policy evaluation)

Solving the state values from the Bellman equation is called policy evaluation. The closed-form solution is v_\pi=(I-\gamma P_\pi)^{-1}r_\pi; in practice, the iterative solution v_{k+1}=r_\pi+\gamma P_\pi v_k is used, which converges to v_\pi as k\to\infty.
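Policy evaluation can be sketched with the iteration v_{k+1} = r_\pi + \gamma P_\pi v_k. The three-state r_\pi and P_\pi below are made-up numbers chosen so that the rows of P_\pi sum to one.

```python
# Iterative policy evaluation: solve v_pi = r_pi + gamma * P_pi v_pi.
gamma = 0.9
r_pi = [1.0, 0.0, -1.0]                    # illustrative mean immediate rewards
P_pi = [[0.5, 0.5, 0.0],                   # illustrative transition matrix; rows sum to 1
        [0.0, 0.5, 0.5],
        [0.5, 0.0, 0.5]]

v = [0.0, 0.0, 0.0]
for _ in range(500):                       # contraction with factor gamma < 1
    v = [r_pi[i] + gamma * sum(P_pi[i][j] * v[j] for j in range(3)) for i in range(3)]

# At the fixed point, the Bellman residual v - (r_pi + gamma * P_pi v) is ~0.
residual = max(abs(v[i] - (r_pi[i] + gamma * sum(P_pi[i][j] * v[j] for j in range(3))))
               for i in range(3))
print(v, residual)
```

The same v could be obtained in closed form by inverting (I - \gamma P_\pi); the iteration avoids the matrix inverse and is what scales to large state spaces.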

V. Action value

state value: the average return the agent can get starting from a state

action value: the average return the agent can get starting from a state and taking an action

q_\pi(s,a) \doteq {\mathbb E}[G_t|S_t=s,A_t=a].

v_\pi(s)=\sum\limits_{a\in{\mathcal A}(s)}\pi(a|s)q_\pi(s,a) \qquad (2)

(2) shows how to obtain state values from action values.

an action value relies on the values of the next states that the agent may transition to after taking the action.

q_\pi(s,a)=\sum\limits_{r \in{\mathcal R}(s,a)}p(r|s,a)r+\gamma \sum\limits_{s' \in {\mathcal S}}p(s'|s,a)v_\pi(s') \qquad (4)

(4) shows how to obtain action values from state values.

The Bellman equation in terms of action values

substituting (2) into (4) yields

q_\pi(s,a)=\sum\limits_{r \in{\mathcal R}}p(r|s,a)r+\gamma \sum\limits_{s' \in {\mathcal S}}p(s'|s,a) \sum\limits_{a'\in{\mathcal A}(s')}\pi(a'|s')q_\pi(s',a')
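Equation (4) is easy to sketch in code. The two-state model below is made-up: the state values are chosen to be consistent with a policy that always takes a1, so q(s, a1) reproduces v(s), while a2 (never selected by that policy) still gets a well-defined action value.

```python
# Sketch of (4): q_pi(s,a) = sum_r p(r|s,a) r + gamma * sum_s' p(s'|s,a) v_pi(s').
gamma = 0.9
states, actions = ["s1", "s2"], ["a1", "a2"]
exp_r = {("s1", "a1"): 0.0, ("s1", "a2"): 1.0,      # expected immediate reward
         ("s2", "a1"): 1.0, ("s2", "a2"): 0.0}
p = {("s1", "a1"): {"s2": 1.0}, ("s1", "a2"): {"s1": 1.0},   # p(s'|s,a)
     ("s2", "a1"): {"s2": 1.0}, ("s2", "a2"): {"s1": 1.0}}
# State values consistent with the policy "always take a1":
# v(s2) = 1 + 0.9*v(s2) = 10, v(s1) = 0 + 0.9*v(s2) = 9.
v = {"s1": 9.0, "s2": 10.0}

def q(s, a):
    return exp_r[(s, a)] + gamma * sum(prob * v[sp] for sp, prob in p[(s, a)].items())

for s in states:
    for a in actions:
        print(s, a, q(s, a))
```

Note that q("s1", "a2") = 1 + 0.9*9 = 9.1 exceeds v("s1") = 9 even though a2 is never selected, which is exactly why the tips below insist that unchosen actions still have action values worth exploring.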

tips:

  • state value is a function of the state and the policy

  • action value depends on the action, the state, and the policy

  • even if an action would not be selected by a policy, it still has an action value. Since the purpose of reinforcement learning is to find optimal policies, we must keep exploring all actions to determine better actions for each state.
