CS229 Notes 12: Reinforcement Learning

1 Markov decision process (MDP)

  • An MDP is a tuple \((S, A, \{P_{sa}\}, \gamma, R)\):

    • \(S\): state space
    • \(A\): action space
    • \(P_{sa}\): state transition probabilities. For each state \(s\) and action \(a\), \(P_{sa}\) is a distribution over the state space: if we take action \(a\) in state \(s\), then the next state satisfies \(s' \sim P_{sa}\).
    • \(\gamma\): discount factor, which discounts future rewards
    • \(R\): reward function, \(S\times A \to \mathbb{R}\). Sometimes it is a function of the state only, i.e. \(S \to \mathbb{R}\); these notes use that convention.
  • Goal of RL: maximize the expected sum of discounted rewards \[E[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots]\]

  • Policy: \(\pi : S \to A\), the action to take in a given state: \[a = \pi(s)\]

  • Value function for a policy \(\pi\): \[ V^\pi(s) = E[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \mid s_0=s, \pi] \]

  • Bellman Equation: Discrete value case

    • immediate reward + the expected discounted value of the successor state: \[\begin{aligned} V^\pi(s) &= E[R(s_0) + \gamma V^\pi(s_1) \mid s_0=s, \pi] \\ &= R(s) + \gamma E_{s'\sim P_{s\pi(s)}}[V^\pi(s')] \\ &= R(s) + \gamma\sum_{s'} P_{s\pi(s)}(s')V^\pi(s') \end{aligned}\]
    • For a finite-state MDP (\(|S| < \infty\)), this gives \(|S|\) linear equations in the \(|S|\) unknowns \(V^\pi(s)\), which can be solved exactly (see the sketch at the end of this section).
  • Optimal value function \[V^*(s) = \max_\pi V^\pi(s) \]

    \[V^*(s) = R(s) + \max_{a\in A} {\gamma\sum_{s'} P_{sa}(s')V^*(s')}\]

    • i.e., the best achievable expected sum of discounted rewards starting from state \(s\)

    • define the optimal policy \(\pi^*\) \[\pi^*(s) = \argmax_{a\in A}{\sum_{s'} P_{sa}(s')V^*(s')}\]

    • \(V^*(s) = V^{\pi^*}(s)\), the optimal value function is the value function under the optimal policy. Note that the optimal policy does not depend on the initial state.
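As a concrete illustration of the linear solve mentioned above, here is a minimal numpy sketch (not from the notes; the tabular arrays and the toy MDP are made up for illustration) that computes \(V^\pi\) by rewriting the Bellman equations as \((I - \gamma P_\pi)V^\pi = R\):

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma):
    """Solve the Bellman equations V = R + gamma * P_pi V exactly.

    P:  array of shape (|S|, |A|, |S|), P[s, a, s'] = P_sa(s')
    R:  array of shape (|S|,), reward for each state
    pi: array of shape (|S|,), action index chosen in each state
    """
    n_states = P.shape[0]
    # Transition matrix induced by the policy: row s is P_{s, pi(s)}.
    P_pi = P[np.arange(n_states), pi]                     # shape (|S|, |S|)
    # (I - gamma * P_pi) V = R  ->  linear solve for V^pi.
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)

# Toy 2-state, 2-action MDP (made-up numbers, for illustration only).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([1.0, 0.0])
V = evaluate_policy(P, R, pi=np.array([0, 1]), gamma=0.9)
```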

2 Value Iteration and Policy Iteration

Two algorithms for finding the optimal value function and the corresponding optimal policy.

Value Iteration

Consider finite \(S, A\) MDPs. Assume that we know the \(P_{sa}\) and the reward function \(R\):

For each state \(s\), initialize \(V(s) = 0\);

Repeat until convergence: for every state \(s\), apply the Bellman backup \[V(s) := R(s) + \max_{a\in A}{\gamma \sum_{s'}{P_{sa}(s')V(s')}}\]

Synchronous update: compute the new values for all states first, then update them all at once.

Asynchronous update: loop over the states one at a time, updating each \(V(s)\) in place.

Either way, \(V\) converges to \(V^*\), and the greedy policy with respect to \(V\) converges to \(\pi^*\).
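A minimal sketch of synchronous value iteration, assuming the same tabular representation as above (\(P\) of shape \(|S|\times|A|\times|S|\), \(R\) of shape \(|S|\)); the function name and tolerance are illustrative, not from the notes:

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Synchronous value iteration: V(s) := R(s) + max_a gamma * sum_s' P_sa(s') V(s')."""
    n_states = P.shape[0]
    V = np.zeros(n_states)                     # initialize V(s) = 0
    while True:
        # Q[s, a] = R(s) + gamma * sum_s' P[s, a, s'] * V(s')
        Q = R[:, None] + gamma * P @ V
        V_new = Q.max(axis=1)                  # Bellman backup for every state at once
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    pi = Q.argmax(axis=1)                      # greedy policy w.r.t. the converged V
    return V, pi
```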

Policy Iteration

Initialize \(\pi\) randomly.

Repeat until convergence:

Policy evaluation: compute \(V := V^\pi\) by solving the Bellman equations as a linear system.

Policy improvement: for each state \(s\), greedily set \(\pi(s) := \argmax_{a\in A}{\sum_{s'} P_{sa}(s')V(s')}\).
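A corresponding policy-iteration sketch under the same assumed tabular setup; the policy-evaluation step is the exact linear solve from Section 1:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Alternate exact policy evaluation with greedy policy improvement."""
    n_states, n_actions, _ = P.shape
    pi = np.random.randint(n_actions, size=n_states)     # random initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = R for V = V^pi.
        P_pi = P[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # Policy improvement: greedy with respect to V.
        pi_new = (P @ V).argmax(axis=1)
        if np.array_equal(pi_new, pi):                    # converged: policy is stable
            return V, pi
        pi = pi_new
```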

3 Learning a model for an MDP

Estimate \(P_{sa}, R\)

\(S, A, \gamma\) are known, but \(P_{sa}, R\) are unknown.

We execute trials in the MDP, record the resulting trajectories, and estimate \(P_{sa}\) and \(R\) by maximum likelihood (MLE).

In the discrete, finite case, the MLE of the transition distribution is the empirical transition frequency: \[P_{sa}(s') = \frac{\#\{\text{times we took action } a \text{ in state } s \text{ and got to } s'\}}{\#\{\text{times we took action } a \text{ in state } s\}}\]

The immediate reward \(R(s)\) is estimated as the average of the rewards observed in state \(s\).
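A sketch of these MLE estimates, assuming the observed data is given as a list of \((s, a, r, s')\) tuples (that format, and the uniform fallback for never-tried state-action pairs, are choices made here for illustration):

```python
import numpy as np

def estimate_model(transitions, n_states, n_actions):
    """MLE estimates of P_sa and R from observed (s, a, r, s') tuples."""
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sum = np.zeros(n_states)
    reward_n = np.zeros(n_states)
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sum[s] += r
        reward_n[s] += 1
    # P_sa(s') = #(s, a -> s') / #(s, a); fall back to uniform if (s, a) was never tried.
    totals = counts.sum(axis=2, keepdims=True)
    P_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)
    # R(s) = average observed reward in state s.
    R_hat = reward_sum / np.maximum(reward_n, 1)
    return P_hat, R_hat
```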

Algorithm for MDP: combining value iteration and \(P_{sa}\) estimation

Repeat: (1) execute the current policy \(\pi\) for several trials; (2) update the MLE estimates of \(P_{sa}\) and \(R\) from the observed transitions; (3) run value iteration with the estimated model to obtain a new \(V\); (4) set \(\pi\) to the greedy policy with respect to \(V\).

In step (3), the value function does not have to be re-initialized to 0 after updating \(P_{sa}\); warm-starting value iteration from the previous \(V\) makes it converge faster.
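A sketch of this combined loop, reusing the estimate_model helper above and assuming a hypothetical run_trials(pi) function that executes the current policy in the environment and returns the observed \((s, a, r, s')\) transitions; note the inner value iteration is warm-started from the previous \(V\):

```python
import numpy as np

def learn_mdp(run_trials, n_states, n_actions, gamma, n_rounds=20):
    """Alternate model estimation and value iteration (sketch; run_trials is hypothetical)."""
    pi = np.random.randint(n_actions, size=n_states)       # random initial policy
    V = np.zeros(n_states)
    transitions = []
    for _ in range(n_rounds):
        transitions += run_trials(pi)                       # (1) execute pi, collect data
        P_hat, R_hat = estimate_model(transitions, n_states, n_actions)  # (2) MLE model
        for _ in range(100):                                # (3) value iteration, warm start:
            V = (R_hat[:, None] + gamma * P_hat @ V).max(axis=1)  # backups start from old V
        pi = (P_hat @ V).argmax(axis=1)                     # (4) greedy policy w.r.t. V
    return V, pi
```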

4 Continuous state MDPs

The state space can be infinite, e.g. \(S = \mathbb{R}^d\), a possibly high-dimensional real vector space.

Discretization

When the dimensionality is low (up to roughly 4, or with care 6, dimensions), discretizing the state space is feasible.

Curse of dimensionality: with \(k\) discretization levels per dimension, the table has \(k^d\) states, which grows exponentially in \(d\) (e.g., \(k=10\) bins in \(d=6\) dimensions already gives \(10^6\) states). The action space is usually lower-dimensional and can often be discretized.
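A tiny sketch of grid discretization (the bounds and bin count are illustrative): each dimension of the state is binned into \(k\) intervals and the bin indices are combined into a single discrete state index, so the resulting table has \(k^d\) entries:

```python
import numpy as np

def discretize(s, low, high, k):
    """Map a continuous state s in [low, high]^d to a single index in {0, ..., k^d - 1}."""
    # Bin each dimension into k intervals.
    bins = np.clip(((s - low) / (high - low) * k).astype(int), 0, k - 1)
    # Combine per-dimension bin indices into one flat index (row-major).
    return int(np.ravel_multi_index(bins, (k,) * len(s)))

# d = 2, k = 10: a 10 x 10 = 100-state grid. For d = 6 this would already be 10^6 states.
idx = discretize(np.array([0.3, -0.7]), low=np.array([-1.0, -1.0]),
                 high=np.array([1.0, 1.0]), k=10)
```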

Value function approximation

Instead of discretizing, approximate the value function directly as a parametric function of the state.

Model/simulator

Assume we have a model/simulator for the MDP that maps \((s_t, a_t) \mapsto s_{t+1}\).

  • It can be a physics simulation engine.
  • Or it can be learned from sampled trajectories (Monte Carlo simulation): \(s_{t+1} = f(s_t, a_t)\).

    • Linear form: \(s_{t+1} = As_t + Ba_t\).
    • Optimize the parameters by minimizing the \(\ell_2\) loss, where \(\hat{s}_{t+1}^{(i)} = As_t^{(i)} + Ba_t^{(i)}\) is the model's prediction: \[\argmin_{A,B} \sum_{i=1}^n\sum_{t=0}^{T-1} {\big|\big| s_{t+1}^{(i)} - \hat{s}_{t+1}^{(i)}\big|\big|^2 }\]
    • Non-linear form: use non-linear feature mappings \(\phi_s(s_t), \phi_a(a_t)\).
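A sketch of fitting the linear model \(s_{t+1} \approx As_t + Ba_t\) by least squares; stacking \([s_t, a_t]\) into one design matrix lets a single lstsq call recover both \(A\) and \(B\) (array shapes and names are assumptions for illustration):

```python
import numpy as np

def fit_linear_model(states, actions, next_states):
    """Least-squares fit of s_{t+1} ~ A s_t + B a_t.

    states:      array of shape (N, d_s)  -- s_t for each observed transition
    actions:     array of shape (N, d_a)  -- a_t
    next_states: array of shape (N, d_s)  -- s_{t+1}
    """
    X = np.hstack([states, actions])               # each row is [s_t, a_t]
    # Solve min_W ||X W - next_states||^2; W stacks A^T on top of B^T.
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    d_s = states.shape[1]
    A, B = W[:d_s].T, W[d_s:].T                    # recover A (d_s x d_s) and B (d_s x d_a)
    return A, B
```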

Fitted value iteration

Sample \(n\) states \(s^{(1)},\dots,s^{(n)}\) from \(S\). For each sampled state \(s^{(i)}\) and each action \(a\), estimate \(q(a) \approx R(s^{(i)}) + \gamma E_{s'\sim P_{s^{(i)}a}}[V(s')]\) by averaging over \(k\) next states drawn from the simulator (\(k=1\) suffices if the simulator is deterministic).

Set \(y^{(i)} = \max_a q(a)\), an estimate of the optimal value \(V^*(s^{(i)})\).

Fit the value function so that \(V(s^{(i)}) \approx y^{(i)}\) using supervised learning (e.g., linear regression on features of the state), updating the parameters; repeat until they converge.
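A sketch of fitted value iteration with a linear value function \(V(s) = \theta^\top \phi(s)\); the feature map phi, the simulator sim(s, a), the reward function R, the sampled states, and the finite action set are all assumed inputs for illustration:

```python
import numpy as np

def fitted_value_iteration(samples, actions, sim, R, phi, gamma, n_iters=50, k=1):
    """Fitted value iteration with a linear approximation V(s) = theta^T phi(s).

    samples: list of sampled states s^(1), ..., s^(n)
    actions: finite set of actions to maximize over
    sim:     simulator, sim(s, a) returns a sampled next state s'
    R:       reward function R(s)
    phi:     feature map, phi(s) is a 1-D numpy array
    """
    Phi = np.array([phi(s) for s in samples])           # design matrix, shape (n, p)
    theta = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = []
        for s in samples:
            # q(a) ~= R(s) + gamma * E[V(s')], expectation estimated with k sampled s'.
            q = [R(s) + gamma * np.mean([phi(sim(s, a)) @ theta for _ in range(k)])
                 for a in actions]
            y.append(max(q))                            # y^(i) = max_a q(a)
        # Supervised step: fit theta so that theta^T phi(s^(i)) ~ y^(i).
        theta, *_ = np.linalg.lstsq(Phi, np.array(y), rcond=None)
    return theta
```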