diff --git "a/Cap\303\255tulo 5/Ex5.4.ipynb" "b/Cap\303\255tulo 5/Ex5.4.ipynb"
new file mode 100644
index 0000000..dcf964b
--- /dev/null
+++ "b/Cap\303\255tulo 5/Ex5.4.ipynb"
@@ -0,0 +1,64 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The pseudocode for Monte Carlo ES is inefficient because, for each state–action pair, it maintains a list of all returns and repeatedly calculates their mean. It would be more efficient to use techniques similar to those explained in Section 2.4 to maintain just the mean and a count (for each state–action pair) and update them incrementally. Describe how the pseudocode would be altered to achieve this."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Monte Carlo ES (Exploring Starts) Incremental Implementation\n",
+    "\n",
+    "#### Initialize:\n",
+    "$\\:\\:\\:\\:\\:\\;\\;\\;\\;\\:\\:\\: \\space \\pi(s) \\in A(s) \\text{ (arbitrarily), for all}\\: s \\in S $\n",
+    "\n",
+    "$\\:\\:\\:\\:\\:\\;\\;\\;\\;\\:\\:\\: Q(s, a) \\in \\mathbb{R} \\text{ (arbitrarily), for all}\\: s \\in S, a \\in A(s)$\n",
+    "\n",
+    "$\\:\\:\\:\\:\\:\\;\\;\\;\\;\\:\\:\\: N(s, a) \\gets 0, \\text{for all}\\: s \\in S, a \\in A(s)$\n",
+    "\n",
+    "#### Loop forever (for each episode):\n",
+    "$\\:\\:\\:\\:\\:\\;\\;\\;\\;\\:\\:\\: \\text{Choose}\\: S_0 \\in S, A_0 \\in A(S_0) \\:\\text{randomly such that all pairs have probability > 0}$\n",
+    "$\\:\\:\\:\\:\\:\\;\\;\\;\\;\\:\\:\\: \\text{Generate an episode from $S_0, A_0$, following $\\pi: S_0, A_0, R_1, ..., S_{T-1}, A_{T-1}, R_T$ }$\n",
+    "\n",
+    "$\\:\\:\\:\\:\\:\\;\\;\\;\\;\\:\\:\\: G \\gets 0$\n",
+    "\n",
+    "$\\:\\:\\:\\:\\:\\;\\;\\;\\;\\:\\:\\: \\text{Loop for each step of episode, $t = T-1, T-2,..., 0:$}$\n",
+    "\n",
+    "$\\:\\:\\:\\:\\:\\;\\;\\;\\;\\:\\:\\:\\:\\:\\:\\:\\: G \\gets \\gamma G + R_{t+1}$\n",
+    "\n",
+    "$\\:\\:\\:\\:\\:\\;\\;\\;\\;\\:\\:\\:\\:\\:\\:\\:\\: \\text{Unless the pair $S_t, A_t$ appears in $S_0, A_0, S_1, A_1,...,S_{t-1},A_{t-1}$:}$\n",
+    "\n",
+    "$\\:\\:\\:\\:\\:\\;\\;\\;\\;\\:\\:\\:\\:\\:\\:\\:\\:\\:\\:\\:\\: N(S_t, A_t) \\gets N(S_t, A_t) + 1 $\n",
+    "\n",
+    "$\\:\\:\\:\\:\\:\\;\\;\\;\\;\\:\\:\\:\\:\\:\\:\\:\\:\\:\\:\\:\\: Q(S_t, A_t) \\gets Q(S_t, A_t) + \\frac{1}{N(S_t, A_t)}[G - Q(S_t, A_t)]$\n",
+    "\n",
+    "$\\:\\:\\:\\:\\:\\;\\;\\;\\;\\:\\:\\:\\:\\:\\:\\:\\:\\:\\:\\:\\: \\pi(S_t) \\gets \\text{argmax}_aQ(S_t, a)$\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}