Notes from this course:

强化学习的数学原理 (Mathematical Foundations of Reinforcement Learning), 中国大学MOOC(慕课)

The companion book is more detailed than the lecture slides, so it is worth reading alongside them.

I recommend going straight to the book; this post compiles material from both the slides & the book.

state: the status of the agent with respect to the environment

state space: the set of all states; an agent can have many possible states

action: a choice the agent can make in a state; different states can have different sets of actions

action space of a state: the set of all possible actions of a state

state transition: when taking an action, the agent may move from one state to another. It can be deterministic or stochastic.

tabular representation: we can use a table to describe the state transition, but can only represent deterministic cases

policy tells the agent what actions to take at a state. It can be deterministic or stochastic (a probability distribution over all possible actions)

tabular representation of a policy can represent either deterministic or stochastic cases.

reward: a real number obtained after taking an action; it can represent encouragement or punishment, and even a positive reward can mean punishment. In fact, it is the relative reward values, rather than the absolute values, that determine encouragement or discouragement. Reward can be interpreted as a human-machine interface, through which we can guide the agent to behave as we expect.

a trajectory is a state-action-reward chain, and it can be infinite. We therefore introduce a discount rate \gamma \in (0,1): the discounted sum of rewards then becomes finite, and \gamma balances the far- and near-future rewards.

the return of a trajectory is the sum of all the rewards collected along it; returns can be used to evaluate whether a policy is good or not
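The discounted return can be sketched in a few lines of Python. This is only an illustration: the reward sequence and the value of gamma below are made-up.

```python
def discounted_return(rewards, gamma):
    """Sum r_1 + gamma * r_2 + gamma**2 * r_3 + ... along a finite trajectory."""
    g = 0.0
    for r in reversed(rewards):  # fold from the end: G = r + gamma * G_next
        g = r + gamma * g
    return g

# A reward of 1 received on the fourth step is discounted by gamma**3.
print(discounted_return([0, 0, 0, 1], gamma=0.9))
```

The backward fold uses exactly the recursion G_t = R_{t+1} + \gamma G_{t+2...} that the discount rate introduces.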

episode: when interacting with the environment following a policy, the agent may stop at some terminal state. The resulting trajectory is called an episode.

episodic/continuing tasks can be treated in a unified mathematical way by converting episodic tasks to continuing tasks.

Markov decision process:

1) Markov property: memoryless property of a stochastic process.

Mathematically, it means that

p(s_{t+1}|s_t,a_t,s_{t-1},a_{t-1},...,s_0,a_0)=p(s_{t+1}|s_t,a_t),\\ p(r_{t+1}|s_t,a_t,s_{t-1},a_{t-1},...,s_0,a_0)=p(r_{t+1}|s_t,a_t)

Equation indicates that the next state or reward depends merely on the current state and action and is independent of the previous ones.

2) policy: in state s, the probability of choosing action a is \pi(a|s). It holds that \sum_{a\in{\mathcal A}(s)}\pi(a|s)=1 for any s \in {\mathcal S}.

3) state / action / reward / state transition probability / reward probability

state transition probability: in state s, when taking action a, the probability of transitioning to state s' is p(s'|s,a). It holds that \sum_{s'\in {\mathcal S}}p(s'|s,a)=1 for any (s,a).

reward probability: in state s, when taking action a, the probability of obtaining reward r is p(r|s,a). It holds that \sum_{r\in {\mathcal R}(s,a)}p(r|s,a)=1 for any (s,a).

4) a Markov decision process becomes a Markov process once the policy is given

model or dynamics: p(s'|s,a) and p(r|s,a) for all (s,a). The model can be either stationary or nonstationary: a stationary model does not change over time, while a nonstationary model may vary over time.
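A model can be represented directly as the two probability tables p(s'|s,a) and p(r|s,a). The two-state example below is entirely made-up; the point is only that every (s,a) pair carries two normalized distributions.

```python
# A toy, hand-made stationary model (states, actions, and probabilities are illustrative).
# p_state[s][a] maps next states s' to p(s'|s,a); p_reward[s][a] maps rewards r to p(r|s,a).
p_state = {
    "s1": {"right": {"s2": 1.0}, "stay": {"s1": 1.0}},
    "s2": {"left": {"s1": 0.8, "s2": 0.2}, "stay": {"s2": 1.0}},
}
p_reward = {
    "s1": {"right": {0: 1.0}, "stay": {-1: 1.0}},
    "s2": {"left": {0: 0.5, 1: 0.5}, "stay": {1: 1.0}},
}

# Both distributions must sum to one for every (s, a) pair.
for s, acts in p_state.items():
    for a, dist in acts.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-9
        assert abs(sum(p_reward[s][a].values()) - 1.0) < 1e-9
print("valid stationary model")
```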

I. a specific example

v1=r1+γv2

v2=r2+γv3

v3=r3+γv4

v4=r4+γv1

which can be written compactly as

v=r+γPv

Though simple, the Bellman equation demonstrates the core idea:

the value of one state relies on the values of other states.
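The four-state example above can be checked numerically. The rewards below are made-up values; the fixed-point iteration v \leftarrow r + \gamma P v recovers the same v_1 as the closed form obtained by unrolling the cycle.

```python
# Toy system v_i = r_i + gamma * v_{i+1} on a 4-state cycle (illustrative rewards).
gamma = 0.9
r = [0.0, 1.0, 1.0, 1.0]           # r1..r4
v = [0.0] * 4
for _ in range(1000):               # fixed-point iteration v <- r + gamma * P v
    v = [r[i] + gamma * v[(i + 1) % 4] for i in range(4)]

# Unrolling v1 = r1 + g*r2 + g^2*r3 + g^3*r4 + g^4*v1 gives the closed form:
v1 = (r[0] + gamma * r[1] + gamma**2 * r[2] + gamma**3 * r[3]) / (1 - gamma**4)
print(v, v1)
```

The iteration converges because each sweep shrinks the error by the factor \gamma < 1.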


Because the policy and the model can both be stochastic, even if S_t and A_t are known, R_{t+1} and S_{t+1} are still random variables.

state value:

  • introduce some notations

Consider a sequence of time steps t=0,1,2,... At time t, the agent is in state S_t, and the action taken following a policy \pi is A_t. The next state is S_{t+1}, and the immediate reward obtained is R_{t+1}. This process can be expressed concisely as 

S_t \xrightarrow{A_t} S_{t+1},R_{t+1}

Note that these are all random variables.

Moreover, S_t,S_{t+1} \in {\mathcal S}, A_t \in {\mathcal A}(S_t), and R_{t+1} \in {\mathcal R}(S_t,A_t)

  • By definition, the discounted return along the trajectory is

G_t \doteq R_{t+1} + \gamma R_{t+2}+\gamma^2R_{t+3}+...

  • Since G_t is a random variable, we can calculate its expected value, which defines the state value:

v_\pi(s) \doteq {\mathbb E}[G_t|S_t=s]

It depends on s: it is a conditional expectation with the condition that the state starts from s.

It depends on \pi: for a different policy, the state value can be different.

It does not depend on t.

It represents the value of a state: if the state value is greater, the policy is better, because greater cumulative rewards can be obtained. If both the policy and the system model are deterministic, the return can be used to evaluate policies. By contrast, when either the policy or the system model is stochastic, starting from the same state may generate different trajectories, so the state value (an expectation over trajectories) should be used to evaluate policies.

The state value is the expected return starting from that state and following policy \pi thereafter
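Since the state value is an expectation, it can be estimated by averaging sampled returns. The sketch below assumes a made-up two-state chain with the policy already baked into the transition function, and truncates trajectories at a horizon where \gamma^k is negligible.

```python
import random

# Monte Carlo estimate of v_pi(s) = E[G_t | S_t = s] on an illustrative two-state chain.
gamma = 0.5
random.seed(0)

def step(s):
    """Stochastic model under a fixed policy: from s, return (next_state, reward)."""
    if s == "A":
        return ("B", 0.0) if random.random() < 0.7 else ("A", 1.0)
    return ("A", 1.0)            # from B, always move back to A with reward 1

def sample_return(s, horizon=50):
    g, discount = 0.0, 1.0
    for _ in range(horizon):     # truncate: gamma**50 is negligible
        s, r = step(s)
        g += discount * r
        discount *= gamma
    return g

estimate = sum(sample_return("A") for _ in range(20000)) / 20000
print(estimate)                  # approximates v_pi("A")
```

Solving the Bellman equation for this chain by hand gives v_\pi(A) = 0.65/0.675 \approx 0.963, which the sample average approaches.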

II. Bellman equation: Derivation

the Bellman equation can be used to calculate state value

the Bellman equation describes the relationship among the values of all states.

Writing G_t=R_{t+1}+\gamma G_{t+1}, the state value splits into two terms:

v_\pi(s)={\mathbb E}[R_{t+1}|S_t=s]+\gamma{\mathbb E}[G_{t+1}|S_t=s]

the first term is the expectation of the immediate rewards:

{\mathbb E}[R_{t+1}|S_t=s]=\sum\limits_{a\in{\mathcal A}(s)}\pi(a|s)\sum\limits_{r\in{\mathcal R}(s,a)}p(r|s,a)r

r depends on s, a, and s'; however, since s' also depends on s and a, we can equivalently write r as a function of s and a.

the second term is the expectation of the future rewards:

{\mathbb E}[G_{t+1}|S_t=s]=\sum\limits_{s'\in{\mathcal S}}v_\pi(s')\sum\limits_{a\in{\mathcal A}(s)}p(s'|s,a)\pi(a|s)=\sum\limits_{a\in{\mathcal A}(s)}\pi(a|s)\sum\limits_{s'\in{\mathcal S}}p(s'|s,a)v_\pi(s')

these two formulas are the same because of the interchangeability of double summation

the first formula: for every possible s', take its value v_\pi(s') and multiply it by the total probability of transitioning from s to s'

the second formula: for each action, calculate the expected value of the next state, then weight it by the probability of selecting that action

The condition S_t=s can be dropped because of the memoryless Markov property: given S_{t+1}=s', S_t=s provides no additional information, so {\mathbb E}[G_{t+1}|S_t=s,S_{t+1}=s']={\mathbb E}[G_{t+1}|S_{t+1}=s']=v_\pi(s')

Therefore, we have the Bellman equation in elementwise form:

v_\pi(s)=\sum\limits_{a\in{\mathcal A}(s)}\pi(a|s)\left[\sum\limits_{r\in{\mathcal R}(s,a)}p(r|s,a)r+\gamma\sum\limits_{s'\in{\mathcal S}}p(s'|s,a)v_\pi(s')\right],\quad \forall s\in{\mathcal S} \qquad (1)

Bellman equation: characterizes the relationship among the state-value functions of different states.

It consists of two terms, the immediate reward term and the future reward term.

III. Bellman equation: Matrix-vector form

The above elementwise form is valid for every state; that means there are |{\mathcal S}| such equations.

r_\pi(s) denotes the mean of the immediate rewards: r_\pi(s)=\sum\limits_{a\in{\mathcal A}(s)}\pi(a|s)\sum\limits_{r\in{\mathcal R}(s,a)}p(r|s,a)r

p_\pi(s'|s) is the probability of transitioning from s to s' under policy \pi: p_\pi(s'|s)=\sum\limits_{a\in{\mathcal A}(s)}\pi(a|s)p(s'|s,a)

Suppose that the states are indexed as s_i with i=1,\dots,n, where n=|{\mathcal S}|. For state s_i, (1) can be written as v_\pi(s_i)=r_\pi(s_i)+\gamma \sum\limits_{s_j\in{\mathcal S}}p_\pi(s_j|s_i)v_\pi(s_j)

Let v_\pi=[v_\pi(s_1),...,v_\pi(s_n)]^T\in {\mathbb R}^n, r_\pi=[r_\pi(s_1),...,r_\pi(s_n)]^T\in {\mathbb R}^n, and P_\pi \in {\mathbb R}^{n\times n} with [P_\pi]_{ij}=p_\pi(s_j|s_i). Then the Bellman equation can be written in matrix-vector form as

v_\pi=r_\pi+\gamma P_\pi v_\pi

where v_\pi is the unknown to be solved, and r_\pi, P_\pi are known.

properties of P_\pi: 1. P_\pi \geq 0 elementwise; 2. P_\pi is a stochastic matrix, meaning that the entries in every row sum to one.

IV. Bellman equation: solving the state values (policy evaluation)

Solving the state values from the Bellman equation is called policy evaluation. The closed-form solution is v_\pi=(I-\gamma P_\pi)^{-1}r_\pi; in practice, the iterative solution v_{k+1}=r_\pi+\gamma P_\pi v_k is used, which converges to v_\pi as k\to\infty.
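Policy evaluation can be sketched with the iteration v_{k+1} = r_\pi + \gamma P_\pi v_k. The three-state r_\pi and P_\pi below are made-up numbers chosen so that the rows of P_\pi sum to one.

```python
# Iterative policy evaluation: solve v_pi = r_pi + gamma * P_pi v_pi.
gamma = 0.9
r_pi = [1.0, 0.0, -1.0]                    # illustrative mean immediate rewards
P_pi = [[0.5, 0.5, 0.0],                   # illustrative transition matrix; rows sum to 1
        [0.0, 0.5, 0.5],
        [0.5, 0.0, 0.5]]

v = [0.0, 0.0, 0.0]
for _ in range(500):                       # contraction with factor gamma < 1
    v = [r_pi[i] + gamma * sum(P_pi[i][j] * v[j] for j in range(3)) for i in range(3)]

# At the fixed point, the Bellman residual v - (r_pi + gamma * P_pi v) is ~0.
residual = max(abs(v[i] - (r_pi[i] + gamma * sum(P_pi[i][j] * v[j] for j in range(3))))
               for i in range(3))
print(v, residual)
```

The same v could be obtained in closed form by inverting (I - \gamma P_\pi); the iteration avoids the matrix inverse and is what scales to large state spaces.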

V. Action value

state value: the average return the agent can get starting from a state

action value: the average return the agent can get starting from a state and taking an action

q_\pi(s,a) \doteq {\mathbb E}[G_t|S_t=s,A_t=a].

v_\pi(s)=\sum\limits_{a\in{\mathcal A}(s)}\pi(a|s)q_\pi(s,a) \qquad (2)

(2) shows how to obtain state values from action values.

an action value relies on the values of the next states that the agent may transition to after taking the action.

q_\pi(s,a)=\sum\limits_{r \in{\mathcal R}(s,a)}p(r|s,a)r+\gamma \sum\limits_{s' \in {\mathcal S}}p(s'|s,a)v_\pi(s') \qquad (4)

(4) shows how to obtain action values from state values.

The Bellman equation in terms of action values

substituting (2) into (4) yields

q_\pi(s,a)=\sum\limits_{r \in{\mathcal R}}p(r|s,a)r+\gamma \sum\limits_{s' \in {\mathcal S}}p(s'|s,a) \sum\limits_{a'\in{\mathcal A}(s')}\pi(a'|s')q_\pi(s',a')
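Equation (4) is easy to sketch in code. The two-state model below is made-up: the state values are chosen to be consistent with a policy that always takes a1, so q(s, a1) reproduces v(s), while a2 (never selected by that policy) still gets a well-defined action value.

```python
# Sketch of (4): q_pi(s,a) = sum_r p(r|s,a) r + gamma * sum_s' p(s'|s,a) v_pi(s').
gamma = 0.9
states, actions = ["s1", "s2"], ["a1", "a2"]
exp_r = {("s1", "a1"): 0.0, ("s1", "a2"): 1.0,      # expected immediate reward
         ("s2", "a1"): 1.0, ("s2", "a2"): 0.0}
p = {("s1", "a1"): {"s2": 1.0}, ("s1", "a2"): {"s1": 1.0},   # p(s'|s,a)
     ("s2", "a1"): {"s2": 1.0}, ("s2", "a2"): {"s1": 1.0}}
# State values consistent with the policy "always take a1":
# v(s2) = 1 + 0.9*v(s2) = 10, v(s1) = 0 + 0.9*v(s2) = 9.
v = {"s1": 9.0, "s2": 10.0}

def q(s, a):
    return exp_r[(s, a)] + gamma * sum(prob * v[sp] for sp, prob in p[(s, a)].items())

for s in states:
    for a in actions:
        print(s, a, q(s, a))
```

Note that q("s1", "a2") = 1 + 0.9*9 = 9.1 exceeds v("s1") = 9 even though a2 is never selected, which is exactly why the tips below insist that unchosen actions still have action values worth exploring.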

tips:

  • state value is a function of the state and the policy

  • action value depends on the action, the state, and the policy

  • even if an action would not be selected by a policy, it still has an action value. Since the purpose of reinforcement learning is to find optimal policies, we must keep exploring all actions to determine better actions for each state.
