Become a member
Sign in

…ink of a Policy gradient network is one which produces explicit outputs. In the case of our bandit, we don’t need to condition these outputs on any state. As such, our network will consist of just a set of weights, with each corresponding to each of the …

Simple Reinforcement Learning in Tensorflow: Part 1 - Two-armed Bandit
3.1K
27
Arthur Juliani
Ran Feldesh
Ran Feldesh
Sep 2, 2018 · 1 min read

Implemented by two networks?

    Ran Feldesh

    Written by

    Ran Feldesh

    Write the first response

    Discover Medium

    Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch

    Make Medium yours

    Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore

    Become a member

    Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade
    AboutHelpLegal