Reinforcement Learning

We provide a full re-implementation of the Proximal Policy Optimization (PPO) algorithm [1]. The policy accepts three types of input features:

  • obs: the observation from the base environment.

  • hidden: the hidden representations of the novice.

  • dist: the output softmax distribution of the novice.

Users can specify any non-empty combination of these feature types by joining their names with underscores (e.g., obs_hidden, obs_hidden_dist).
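
As an illustration of how such a feature specification might be handled, the following minimal sketch parses a spec string and assembles the policy input. The function names (parse_feature_spec, build_policy_input), the validation rules, and the choice to concatenate the selected features into a single vector are assumptions made for this example; they are not taken from the actual implementation.

```python
from typing import Dict, List, Sequence

# The three feature types described above.
FEATURE_TYPES = ("obs", "hidden", "dist")


def parse_feature_spec(spec: str) -> List[str]:
    """Split an underscore-joined spec such as 'obs_hidden' into feature names.

    Hypothetical helper: the real code base may parse specs differently.
    """
    names = spec.split("_") if spec else []
    if not names:
        raise ValueError("feature spec must name at least one feature type")
    for name in names:
        if name not in FEATURE_TYPES:
            raise ValueError(f"unknown feature type: {name!r}")
    if len(set(names)) != len(names):
        raise ValueError(f"duplicate feature type in spec: {spec!r}")
    return names


def build_policy_input(spec: str, features: Dict[str, Sequence[float]]) -> List[float]:
    """Concatenate the selected feature vectors in the order given by the spec.

    Concatenation is an assumption; the policy could combine features another way.
    """
    policy_input: List[float] = []
    for name in parse_feature_spec(spec):
        policy_input.extend(features[name])
    return policy_input


# Example usage with made-up feature values.
example_features = {
    "obs": [0.1, 0.2],      # observation from the base environment
    "hidden": [0.5],        # hidden representation of the novice
    "dist": [0.7, 0.3],     # softmax output distribution of the novice
}
example_input = build_policy_input("obs_dist", example_features)
```

Under these assumptions, the spec obs_dist selects the observation and the novice's output distribution and skips the hidden representation, so example_input holds the two obs values followed by the two dist values.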

References