Markov Decision Process

Consider an agent interacting with an environment: at each time step the agent observes a state S, selects an action A, and the environment responds with a reward R and a new state.

$S_t$, $R_t$, and $A_t$ denote the state, the reward given, and the action taken at time step $t$.

[Figure: agent-environment interaction diagram]
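
As an illustration, this interaction loop can be sketched in code. The `env.reset`/`env.step` interface and the `choose_action` function below are assumed placeholders for this sketch, not part of any particular library:

```python
# A minimal sketch of the agent-environment loop.
# `env` and `choose_action` are hypothetical placeholders passed in by the caller.
def run_episode(env, choose_action, max_steps=1000):
    state = env.reset()                               # S_0
    trajectory = []
    for t in range(max_steps):
        action = choose_action(state)                 # A_t
        next_state, reward, done = env.step(action)   # S_{t+1}, R_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
        if done:                                      # terminal state reached (episodic task)
            break
    return trajectory
```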

Markov Property

An environment has the Markov property if and only if the next state $S_{t+1}$ of the system depends only on the current state $S_t$ and the current action $A_t$, and not on past states or actions.

$$P\{R_{t+1}=r, S_{t+1}=s' \mid S_0, A_0, R_1, \ldots, S_{t-1}, A_{t-1}, R_t, S_t, A_t\} = P\{R_{t+1}=r, S_{t+1}=s' \mid S_t, A_t\}$$

 
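
To make this concrete, the one-step dynamics can be stored as a table indexed only by the current state and action, so that nothing earlier in the history is needed. A minimal sketch, with made-up states, actions, rewards, and probabilities:

```python
# Dynamics p(s', r | s, a) of a tiny two-state MDP, indexed only by (s, a):
# by the Markov property, nothing before time t is needed to define them.
# (States, actions, rewards, and probabilities are made up for illustration.)
p = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],                      # entries are (s', r, prob)
    ("s0", "go"):   [("s1", 1.0, 0.9), ("s0", 0.0, 0.1)],
    ("s1", "stay"): [("s1", 0.0, 1.0)],
    ("s1", "go"):   [("s0", 0.0, 1.0)],
}

def transition_prob(s_next, r, s, a):
    """P{S_{t+1}=s', R_{t+1}=r | S_t=s, A_t=a} -- depends only on (s, a)."""
    return sum(prob for (sp, rew, prob) in p[(s, a)] if sp == s_next and rew == r)
```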

Policy

The policy $\pi_t(a \mid s)$ at time step $t$ is the probability of selecting action $a$ in state $s$.
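
A stochastic policy can be represented as a mapping from each state to a distribution over actions; a minimal sketch, with made-up states, actions, and probabilities:

```python
import random

# A stochastic policy pi(a | s) as a table (illustrative values only).
policy = {
    "s0": {"stay": 0.2, "go": 0.8},
    "s1": {"stay": 0.5, "go": 0.5},
}

def sample_action(pi, s):
    """Draw an action a with probability pi(a | s)."""
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]
```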

Episodic & Continuing Tasks

Episodic tasks are those that last a finite number of steps and end in a terminal state.

Continuing tasks are those that do not end and have no terminal state.

Return

Given the rewards $R_t$ received by the agent, the return $G_t$ is the quantity we aim to maximize at any step.

Total Reward

$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \ldots = \sum_{k=0}^{\infty} R_{t+k+1}$$

Discounted Reward

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

where $0 \le \gamma \le 1$ is the discount rate.
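
As a small worked sketch, the discounted return can be computed directly from a finite list of rewards; the reward values and $\gamma$ below are arbitrary:

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = sum_k gamma^k * R_{t+k+1}, computed over a finite reward list."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Example with arbitrary rewards: gamma = 0.9 weights later rewards less.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
```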

Value & Action-Value Functions

The value function of a state $s$ under a policy $\pi$ is the expected return when starting in state $s$ and following $\pi$ thereafter:

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$$

The action-value function of a state $s$ and action $a$ under a policy $\pi$ is the expected return when starting in state $s$, taking action $a$, and following $\pi$ thereafter:

$$q_\pi(s,a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$$
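
Since both functions are expectations of the return, they can be estimated by averaging returns over many sampled episodes (a Monte Carlo estimate). A sketch for $v_\pi(s)$, assuming a hypothetical `env` with `reset_to` and `step` methods and the `sample_action` helper sketched earlier:

```python
def mc_estimate_value(env, pi, s, gamma=0.9, n_episodes=1000, max_steps=100):
    """Monte Carlo estimate of v_pi(s): average the discounted return over many
    rollouts that start in state s and follow pi thereafter.
    `env.reset_to` and `env.step` are hypothetical helpers for this sketch."""
    total = 0.0
    for _ in range(n_episodes):
        state = env.reset_to(s)            # start the rollout in state s
        g, discount = 0.0, 1.0
        for _ in range(max_steps):
            action = sample_action(pi, state)
            state, reward, done = env.step(action)
            g += discount * reward
            discount *= gamma
            if done:
                break
        total += g
    return total / n_episodes
```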

Bellman Equation

$$
\begin{aligned}
v_\pi(s) &= \mathbb{E}_\pi[G_t \mid S_t = s] \\
&= \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \\
&= \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \gamma v_\pi(s')\right]
\end{aligned}
$$
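
Because the Bellman equation expresses $v_\pi(s)$ in terms of the values of successor states, it can be turned into an update rule and applied repeatedly until the values stop changing (iterative policy evaluation). A minimal sketch over tabular dynamics `p` and a policy `pi` in the same format as the earlier illustrative tables:

```python
def policy_evaluation(p, pi, states, gamma=0.9, theta=1e-8):
    """Repeatedly apply the Bellman equation for v_pi as an update
    until the largest change falls below theta."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # v(s) <- sum_a pi(a|s) sum_{s', r} p(s', r | s, a) [r + gamma * v(s')]
            new_v = sum(
                prob_a * prob * (r + gamma * v[s_next])
                for a, prob_a in pi[s].items()
                for (s_next, r, prob) in p[(s, a)]
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            break
    return v

# Using the illustrative `p` and `policy` tables from the earlier sketches:
# print(policy_evaluation(p, policy, states=["s0", "s1"]))
```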