Written by Hyungkyu Kim
on June 23, 2018

Lecture 2: Markov Decision Process

Overview

개념

MDP는 RL에서 Goal에 도달하기위한 핵심 구조.
“Markov decision processes formally describe an environment for reinforcement learning”
“Where the environment is fully observable”( Partially observable problems can be converted into MDPs)

정의

$\mathbb{P}[S_{t+1} | S_t] = \mathbb{P}[S_{t+1} | S_1,...,S_t]$

” The future is independent of the past given the present”
다음 state를 만들때 현재 state를 통해 생성될 확률과, 현재포함 모든 과거 state를 통해 생성될 확률이 같다

용어

agent: The learner and decision maker.
environment: The thing it interacts with, comprising everything outside the agent

“The agent–environment boundary represents the limit of the agent’s absolute control, not of its knowledge”

재밌는점은 agent와 environment와의 boundary는 물리적인 경계가 아니다. 로봇의 모터나, 관절이 환경으로 보여질 수 있다. 일반적으로 agent에의해 임의적으로 변경이 불가능한것은 environment라고 본다. 매우 복잡하게 동작하는 로봇같은경우 여러개의 agent로 만들어질 수 있다. agent는 각자의 목적을 위해서 움직이는데 상위 decision making을 위해서 low-level agent들의 결과를 이용도 가능하고, 다른 agent들은 RL이 아닌 다른 알고리즘일 수도 있다. 한 agent가 동작할 때 다른 agent들은 environment로 본다.

MDP framework은 유연한 구조이기 때문에 다양한 방식으로 적용이가능하다. 반드시 시간이 물리적인 고정시간일 필요없으며, action또한 low-level부터 high-level decision making까지 다양하게 사용가능. state 또한 마찬가지. 이 state, action, reward를 어떻게 설정하느냐가 매우중요.

reward 설정 시 주의할 부분은 reward의 최대화 할때 우리의 최종 Goal을 달성하는 방식으로 되어야한다. 예를들어 체스게임에서 winning이 reward로 설정되어야지, 지엽적인 문제들 상대방 말을 딴다던지 특정상황을 컨트롤한다던지가 리워드가되면 안된다. 그렇게 하게되면 진짜 goal인 winning 보다 이런 subgoal들을 노리게 된다.

State Transition Matrix

한 state s에서 다음 스테이트 s’으로 전이될 확률을 나타냄

$P_{ss'} = \mathbb{P}[S_{t+1} = s1 | S_t = s]$

행렬로 한번에 모든 transition 관계 표현 가능 -> 각 row 합은 1

$P = \textrm{from} \stackrel{\mbox{to}}{\begin{bmatrix} P_{11} & \cdots & P_{1n} \\ \vdots & \ddots \\ P_{n1} & \cdots & P_{nn} \end{bmatrix}}$ Art text

Markov Process

A Markov process is a memoryless random process, i.e. a sequence of random states S1, S2, … with the Markov property.
아직 action이나 value 개념 없음

정의

A Markov Process (or Markov Chain) is a tuple <S,Pi>
- S is a (finite) set of states
- P is a state transition probability matrix,

$P_{ss'} = \mathbb{P}[S_{t+1} = s1 | S_t = s]$

Art text

Markov Reward Process(MRP)

A Markov reward process is a Markov chain with values
액션 개념은 없음

정의

A Markov Reward Process is a tuple hS,P, R, γi
S is a finite set of states
P is a state transition probability matrix,

$P_{ss'} = \mathbb{P}[S_{t+1} = s1 | S_t = s]$

R is a reward function, Rs = E [Rt+1 St = s]
γ is a discount factor, γ ∈ [0, 1]

1. return

특정 시퀀스를 따랐을때 얻을 수 있는 리워드의 합산

정의

The return Gt is the total discounted reward from time-step t.

$G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + ... + R_T$

task에 따라 리워드 합산을 조금 다르게 세팅할 수 있다. 크게 2가지. 타임 시퀀스가 무한대로 가버리면 discount factor $\gamma$ 가 없다면 무한대로 가버린다.

episodic task: when the agent–environment interaction breaks naturally into subsequences

$G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + ... + R_T$

continuing tasks: when the agent–environment interaction goes on continually without limit

$G_t \doteq R_t+1 + {\gamma}R_{t+2} + \gamma^2R_{t+3} + ... = \sum_{k=0}^{\infty}\gamma^kR_{t+k+1}$

2. discount factor

Mathematically convenient to discount rewards
Avoids infinite returns in cyclic Markov processes
즉시 들어오는 리워드를 중시할지 미래의 리워드를 중시할지 설계자의 의도를 담은 변수 -> 0에 가까워질수록 근시안적으로, 1에 가까워 질수록 미래 지향적으로

3. Value Function

Value Function은 Bellman Equation을 씀. 수식을 보면 나오지만 기대값을 구하는 것이므로 얻을 수 있는 value들의 평균을 값으로 갖는다

$\begin{align} v(s) &= \mathbb{E} [Gt| S_t = s] \\&= \mathbb{E} [R_{t+1} + \gamma R_{t+2} + \gamma^2R_{t+3} + ... | S_t = s] \\&= \mathbb{E} [R_{t+1} + \gamma (R_{t+2} + \gamma^2R_{t+3} + ... )| S_t = s] \\&= \mathbb{E} [R_{t+1} + \gamma G_{t+1}| S_t = s] \\&= \mathbb{E} [R_{t+1} + \gamma v S_{t+1}| S_t = s] \end{align} \\\downarrow \\v(s) = R_s + \gamma \sum_{s' \in S} P_{ss'}v(s')$

Art text 위에서부터 아래로 갈수록 수식을 recursive한 형태로 정리. 결국 정리된 의미를 보자면 현재 value는 즉시 얻을 수있는 reward + 미래의 reward의 합산으로 표현된다. 예제 위의 그래프에서 주의할점은 immediate reward는 원으로 표현된 현재 state안의 숫자가 아니라는 점. 원 안의 숫자는 계산된 현재 state의 value이지 “앞으로 얻을 “ reward가 아니다. 원 우측 하단의 R= … 가 immediate reward.

Bellman Equation in Matrix Form

행렬형태로 변환해서 풀 수 있음

$v = R + \gamma Pv \\\downarrow \\\begin{bmatrix} v_{1} \\\vdots \\v_{n} \end{bmatrix} = \begin{bmatrix} R_{1} \\\vdots \\R_{n} \end{bmatrix} + \gamma \begin{bmatrix} P_{11} & \cdots & P_{1n} \\\vdots \\P_{n1} & \cdots & P_{nn} \end{bmatrix} \begin{bmatrix} v_{1} \\\vdots \\v_{n} \end{bmatrix}$

이를 수식적으로 정리하면 한번에 계산이 가능하다

$\begin{align} v &= R + \gamma Pv\\ (l - \gamma P)v &= R\\ v &= (l - \gamma P)^{-1}R \end{align}$

그러나 계산 복잡도가 $O(n^3)$ , 작은 MRP에서만 적용가능하다.

MDP

Markov reward process with decisions. MRP에 action 추가

정의

A Markov Decision Process is a tuple hS, A,P, R, γi
- S is a finite set of states
- A is a finite set of actions
- P is a state transition probability matrix, $P_{ss'}^a = \mathbb{P}[S_{t+1} = s' | S_t = s, A_t =a ]$
- R is a reward function, $R_{s}^a = \mathbb{E}[R_{t+1} = s' | S_t = s, A_t =a ]$
- γ is a discount factor γ ∈ [0, 1].

action 개념이 들어가면서 action이 선택될 확률도 함께 고려되어야 하기때문에, 모든 수식에 추가된다. 또한 이 확률은 policy에 의해서 결정되는데 이를 어떻게 효과적으로 업데이트할 것인지가 관건.

1. Policies

스테이트에서 액션 pick할 확률

정의

A policy π is a distribution over actions given states,

$π(a|s) = P [At = a | St = s]$

이 policy 라는 부분이 들어가면서 state transition 부분이 변화가 생긴다

$P_{s,s'}^{\pi} = \sum_{a \in A} \pi (a|S) P_{ss'}^a$

크게 2파트로 볼 수 있을듯하다. 이 2파트의 곱을 통해 다음 state로 전이될 확률이 계산됨

$P_{ss'}^a$ : 선택한 action에 따라 다음 state로 변환될 확률. 원 transition 함수. 환경에서 결정되는 요인(?)
π(a S): 정책에따라 현재 state에서 해당 action을 선택할 확률. agent가 선택의 주체

transition 함수는 stationary하나, policy는 update 가능

2. value function

policy가 들어가면서 action에대한 확률도 고려가 된다. state 의 value를 구하는 function과 action의 value를 구하는 function 2가지가 있다

우선 state value function

$v_{\pi}(s) = \mathbb{E}_{\pi}[G_t | S_t = s] = \sum_{a \in A}\pi(a|s)q_{\pi}(s,a)$

Art text

그리고 action value

$q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t | S_t = s, A_t = a] = R_s^a + \gamma \sum_{s' \in S}P_{ss'}^av_{\pi}(s')$

Art text

개인적으로 위 2가지 개념을 해석할때 서로 서로 재귀적인 느낌이라 어려웠다. 실제 예제를 보면서 생각해보자 Art text state의 value는 실제 value 수치들을 들고 있고, action value의 역할은 해당 action을 택했을 때 도달 할 수 있는 state의 value들을 합해주는 역할이라 생각된다

3. Optimal Value Function

반복을 통해 최적의 state value, action value function을 찾아낸다. 이 optimal 값들을 찾아내면 MDP는 solved됐다고 한다.

정의

The optimal state-value function v∗(s) is the maximum value function over all policies

$v_*(s) = \max_{\pi} v_{\pi} (s)$

The optimal action-value function q∗(s, a) is the maximum action-value function over all policies

$q_*(s, a) = \max_{\pi} q_{\pi} (s, a)$

optimal 값들을 찾는 과정은 다음과 같다. 주의할 점은 여기서의 $\pi'$ 은 next step 의 ‘이 아니다. 시간순서가 아닌 단순히 다르다는 것을 나타내는말.

$\pi ≥ \pi' \quad \textrm{if} \quad v_{\pi}(s) ≥ v_{\pi'}(s), \forall s$

수식을 보자면, 크게 2가지가 계속 반복되면서 수행된다.

현재 policy를 통해 모든 state들의 value를 만듦
state들의 value를 바탕으로 policy를 update
value값이 더이상 좋은 것이 안나올때까지 반복

Art text 여기서 값들을 구할때는 항상 goal 에서 역순으로 구해온다. 그렇기때문에 모든 state의 그래프가 없는 초기 상황에서 이를 예측하기위해 몬테카를로 같은 방법을 이용해서 구한다.

최적화 관련 루틴은 카파시가 만든 REINFORCEjs에서 돌려보자.

← → Top

Lecture 2: Markov Decision Process

Overview

Contents

State Transition Matrix

Markov Process

Markov Reward Process(MRP)

1. return

2. discount factor

3. Value Function

Bellman Equation in Matrix Form

MDP

1. Policies

2. value function

3. Optimal Value Function