Reinforcement Learning
======================

We provide a full re-implementation of the PPO algorithm [1]_. The policy supports three types of input features:

- ``obs``: the observation from the base environment.
- ``hidden``: the hidden representations of the novice.
- ``dist``: the output softmax distribution of the novice.

Users can specify any non-empty combination of these feature types (e.g., ``obs_hidden``, ``obs_hidden_dist``).

References
----------

.. [1] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. "Proximal Policy Optimization Algorithms." arXiv preprint arXiv:1707.06347, 2017.
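
Example
-------

The feature-type combinations described above could be handled along the following lines. This is a minimal sketch, not the project's actual code: the names ``VALID_FEATURES``, ``parse_feature_spec``, and ``build_policy_input`` are hypothetical. It assumes that feature names are joined with underscores (as in ``obs_hidden``) and that the selected features are concatenated into a single flat input vector, which the text above does not confirm.

```python
# Hypothetical helper names; the real implementation may differ.
VALID_FEATURES = ("obs", "hidden", "dist")


def parse_feature_spec(spec):
    """Split a spec such as "obs_hidden" into a tuple of feature names."""
    names = tuple(spec.split("_")) if spec else ()
    if not names:
        raise ValueError("feature spec must name at least one feature type")
    unknown = [name for name in names if name not in VALID_FEATURES]
    if unknown:
        raise ValueError(f"unknown feature type(s): {unknown}")
    if len(set(names)) != len(names):
        raise ValueError("feature types must not repeat")
    return names


def build_policy_input(spec, features):
    """Concatenate the selected feature vectors in the order given by spec.

    Assumption: each entry of `features` is a flat list of floats and the
    policy input is their plain concatenation.
    """
    policy_input = []
    for name in parse_feature_spec(spec):
        policy_input.extend(features[name])
    return policy_input


if __name__ == "__main__":
    features = {
        "obs": [1.0, 2.0],     # observation from the base environment
        "hidden": [0.5],       # hidden representation of the novice
        "dist": [0.9, 0.1],    # softmax output distribution of the novice
    }
    print(build_policy_input("obs_dist", features))
```

Under these assumptions, an empty spec or an unrecognized name such as ``obs_foo`` raises ``ValueError``, which mirrors the requirement that the combination be non-empty and drawn from the three supported feature types.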