Reinforcement Learning
We provide a full re-implementation of the PPO algorithm [1]. The policy supports three types of input features:
obs: the observation from the base environment.
hidden: the hidden representations of the novice.
dist: the output softmax distribution of the novice.
Users can specify any non-empty combination of these feature types (e.g., obs_hidden, obs_hidden_dist).
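As an illustration of how such a combined name could be interpreted, here is a minimal sketch. The helpers `parse_feature_spec` and `build_policy_input` are hypothetical and not part of the library. The sketch also assumes that the selected features are flat vectors joined by concatenation, which the text above does not state.

```python
from typing import Dict, List, Sequence

# Feature names taken from the list above.
VALID_FEATURES = ("obs", "hidden", "dist")


def parse_feature_spec(spec: str) -> List[str]:
    """Split a spec such as "obs_hidden" into feature names (hypothetical helper)."""
    names = spec.split("_") if spec else []
    if not names:
        raise ValueError("feature spec must name at least one feature type")
    for name in names:
        if name not in VALID_FEATURES:
            raise ValueError(f"unknown feature type: {name!r}")
    if len(set(names)) != len(names):
        raise ValueError(f"duplicate feature type in spec: {spec!r}")
    return names


def build_policy_input(spec: str, features: Dict[str, Sequence[float]]) -> List[float]:
    """Concatenate the selected feature vectors in spec order.

    Concatenation is an assumption made for this sketch; the actual policy
    may combine its input features differently.
    """
    policy_input: List[float] = []
    for name in parse_feature_spec(spec):
        policy_input.extend(features[name])
    return policy_input
```

With this sketch, `parse_feature_spec("obs_hidden_dist")` yields the three feature names in order, while an empty or unknown spec raises `ValueError`, matching the requirement that the combination be non-empty.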