In the last blog post, we talked about how Monte Carlo can solve the evaluation problem in a model-free environment.
But Monte Carlo has its downsides. It is not ideal for continuing tasks, i.e. sequences without a natural end. Take the cooling system of a nuclear reactor, for example. Maybe it’s just me… but it would be great if this system worked day and night, seven days a week. No?
We would like an estimate of how good a state is before we run many complete episodes. This makes a lot of sense for our nuclear reactor example, where we don’t want to wait until the episode reaches the end state “extinction by explosion” before the value of a bad state is updated. …
In the previous blog post, we used dynamic programming to find a solution for problems where the model of the environment is known by the agent. However, in real-world problems, the model is not always given. This is one reason why making money on the stock market, for example, is so difficult: we neither have complete information about the environment nor any idea about the transition probabilities in advance.
In this post, we will talk about how to use Monte Carlo to solve the prediction/evaluation problem, that is, how to obtain the value function without a model of the MDP.
In this post, we will use Monte Carlo to evaluate a policy π only. This means we are looking for the state-value function v_π(s) for a given policy π. The same approach would work for the action-value function q_π(s, a) as well, but for simplicity, we concentrate on v_π(s) here. …
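To make this concrete, here is a minimal sketch of first-visit Monte Carlo prediction. The episode format (a list of `(state, reward)` pairs produced by following π) and the helper name `mc_prediction` are my own illustrative assumptions, not from the original post:

```python
from collections import defaultdict

def mc_prediction(generate_episode, num_episodes, gamma=1.0):
    """Estimate v_pi from sampled episodes.

    generate_episode() must return a list of (state, reward) pairs
    obtained by following policy pi until the episode terminates.
    """
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    V = defaultdict(float)

    for _ in range(num_episodes):
        episode = generate_episode()
        G = 0.0
        # Walk the episode backwards, accumulating the return G.
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = gamma * G + reward
            # First-visit: only count the earliest occurrence of a state.
            if state not in (s for s, _ in episode[:t]):
                returns_sum[state] += G
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V
```

The estimate for each state is simply the average of the returns observed after its first visit in each episode; by the law of large numbers, this average converges to v_π(s) as the number of episodes grows.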
In the last post, we talked about some fundamentals of Reinforcement Learning and MDPs. Now, we are going to describe how to solve an MDP by finding the optimal policy using dynamic programming.
The environment for a planning problem is fully observable by an agent. A fully observable MDP, therefore, is the basis for solving planning problems.
Think, for example, of a game of poker where the agent sees the cards of all opponents and knows the order of all cards in the deck.
There are two cases of planning problems:
One of them is the prediction problem. Here, the input is a fully observable MDP and a policy π; as output, we obtain the state-value function v_π. With the value function, we can evaluate the future under a given policy π: we can say how good it is to be in a state when following policy π. …
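One standard way to solve the prediction problem with dynamic programming is iterative policy evaluation, sketched below. The transition encoding (`transitions[s]` as a list of `(probability, next_state, reward)` tuples under the fixed policy π) is an assumption made for this example:

```python
def policy_evaluation(states, transitions, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation for a known MDP under a fixed policy.

    transitions[s] is a list of (prob, next_state, reward) tuples
    describing the dynamics that result from following pi in state s.
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman expectation backup for state s.
            v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in transitions[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:  # stop once the largest update is tiny
            break
    return V
```

For example, a tiny MDP with one state that deterministically moves to a terminal state with reward 1 gives v_π(s) = 1: the sweep converges as soon as no state value changes by more than `theta`.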
Reinforcement learning is based on the reward hypothesis.
All goals can be described by the maximisation of the expected cumulative reward
The word cumulative is important here because it allows an agent to take actions that generate a low immediate reward but likely lead to a higher reward in the long run.
Let’s think about an untrained person as our agent wanting to lose some weight. This person, let’s call him Mr Willy, has the choice of going to the gym or eating a whole cake every day.
Now, the immediate reward of going to the gym would be really low for Mr Willy, because it is hard work and not really enjoyable… and he is kinda lazy. …
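Mr Willy’s dilemma can be made concrete with the discounted return. The reward numbers and the discount factor γ = 0.9 below are made-up values for illustration only:

```python
def discounted_return(rewards, gamma=0.9):
    """Return G = r_0 + gamma*r_1 + gamma^2*r_2 + ..."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

# Hypothetical reward sequences for Mr Willy:
gym = [-1, -1, -1, 10]   # hard work now, big payoff later
cake = [3, 3, 3, -20]    # tasty now, costly later

discounted_return(gym)   # ≈ 4.58
discounted_return(cake)  # ≈ -6.45
```

Even though the cake generates higher immediate rewards, the cumulative (discounted) reward of going to the gym is higher, which is exactly why an agent maximising the return would pick the gym.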
Just head to npm and type in something like “datepicker”. You will see something like this:
Wouldn’t it be great to have a single datepicker which you can just use with every library or framework?
The technology behind web components allows you to create your own element natively in the browser, without using any framework (!), and use it in your document just like any other HTML element as
This is a great advantage, especially for elements that should be shared between projects using different frameworks, or for projects where no complex framework is needed but you still want some structure. …