Add a New Environment ===================== Duo works with any environment that can be converted into a Stable Baselines3 (SB3) environment. This includes both Gym and Gymnasium environments. In this tutorial, you will learn how to: - Intergreate a `Gymnasium `_ environment (`MiniGrid `_) into the Duo pipeline - Add a custom policy model - Train a policy on the new environment using PPO The code of this example is at `examples/minigrid_yrc.py `_. .. _add-env: 0. Install requirements for Minigrid environments ------------------------------------------------ .. code-block:: bash pip install -r requirements/requirements_minigrid.txt 1. Add a New Gym Environment ---------------------------- We will use the MiniGrid environments ``DistShift1-v0`` and ``DistShift2-v0``. ``DistShift1-v0`` will be used to train the novice agent, and ``DistShift2-v0`` the expert agent. We will then train a policy to coordinate the two agents on ``DistShift2-v0``. The DistShift environments originate from the paper `AI Safety Gridworlds `_ (Leike et al., 2017). The task is to reach the goal location while avoiding deadly lava. The agent always starts at the top-left corner, and the goal is at the top-right corner. The lava is distributed differently in the two variants. Episode returns are between 0 and 1. .. figure:: ../images/distshift.jpg :width: 100% :alt: DistShift environments DistShift1-v0 (left) and DistShift2-v0 (right). Source: `Leike et al., 2017 `_ 1.1. Define and Register Environment Configuration ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ By defining configuration dataclass for your new environment, you can customize it using YAML or command-line flags. Here is a simple configuration class that lets you set the number of parallel environments and choose the training and test tasks: .. code-block:: python @dataclass class MiniGridConfig: name: str = "minigrid" num_envs: int = 8 seed: int = 0 train: Optional[str] = "DistShift2-v0" test_easy: Optional[str] = "DistShift1-v0" test_hard: Optional[str] = "DistShift2-v0" Next, register this configuration class with Duo: .. code-block:: python duo_ai.register_environment(MiniGridConfig.name, MiniGridConfig) Once registered, you can override the default parameters using YAML or command-line flags. For example, specify ``env.num_env=8`` or ``env.train=DistShift1-v0`` on the command line. .. note:: Registration must happen before creating the ``config`` object, so the configuration parser includes the registered arguments. 1.2. Convert a Gymnasium Environment to Stable Baselines3 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Duo's PPOAlgorithm expects SB3 environments, which have the following features (see the `SB3 documentation `_): - The environment resets automatically when an episode ends or is truncated. The returned observation at that time is the first observation of the next episode. - The ``reset()`` method returns only an observation. - The ``step()`` method returns a tuple ``(obs, reward, done, info)`` (the original Gym API). Below is sample code to convert a MiniGrid (Gymnasium) environment to an SB3 environment: .. code-block:: python import gymnasium as gym from minigrid.wrappers import ImgObsWrapper from stable_baselines3.common.env_util import make_vec_env def make_base_env(config, split, render_mode="rgb_array"): # config is an instance of MiniGridConfig env_id = f"MiniGrid-{getattr(config, split)}" # env_fn returns a new environment instance def env_fn(env_id=env_id, render_mode=render_mode): return ImgObsWrapper(gym.make(env_id, render_mode=render_mode)) return make_vec_env(env_fn, n_envs=config.num_envs, seed=config.seed) .. _add-model: 2. Add a New Policy Model ------------------------- We need a custom model to process observations from the newly added environment. As with the environment, you can cutomize the model using YAML or command-line flags, by defining a configuration dataclass and registering it with Duo. Here is an example model class, used for the novice, expert, and coordination policy: .. code-block:: python @dataclass class MiniGridPPOModelConfig: name: str = "minigrid_ppo" class MiniGridPPOModel(nn.Module): config_cls = MiniGridPPOModelConfig def __init__(self, config, env): # some code # The model must have these attributes for CoordEnv: self.hidden_dim = 128 self.logit_dim = env.action_space.n def forward(self, obs): # some code # Register model class with Duo duo_ai.register_model("minigrid_ppo", MiniGridPPOModel) .. note:: The model class must have a ``config_cls`` attribute that points to the configuration dataclass. ``CoordEnv`` requires the model to have ``hidden_dim`` and ``logit_dim`` attributes. .. _run-experiments: 3. Run Experiments ------------------ We are now ready to train a coordination policy to help the novice efficiently leverage assistance from the expert while performing the ``DistShift2-v0`` task. We provide a configuration file at `configs/minigrid_ppo.yaml`: .. code-block:: yaml name: "minigrid_ppo" seed: 10 env: "minigrid" policy: name: "ppo" model: "minigrid_ppo" algorithm: name: "ppo" log_freq: 10 save_freq: 0 num_steps: 512 total_timesteps: 500000 update_epochs: 4 gamma: 0.99 gae_lambda: 0.95 num_minibatches: 8 clip_coef: 0.2 norm_adv: true clip_vloss: true vf_coef: 0.5 ent_coef: 0.01 max_grad_norm: 0.5 learning_rate: 0.00025 critic_pretrain_steps: 0 anneal_lr: false log_action_id: 1 evaluation: num_episodes: 32 max_num_steps: 50 temperature: 1.0 log_action_id: 1 train_novice: "experiments/minigrid_novice/best_test_easy.ckpt" train_expert: "experiments/minigrid_expert/best_test_hard.ckpt" **Train and evaluate the novice:** .. code-block:: bash python -u examples/minigrid_yrc.py \ --config configs/minigrid_ppo.yaml \ --mode train \ --type agent \ overwrite=1 \ name=minigrid_novice \ env.name=minigrid \ env.train=DistShift1-v0 Example output:: [0:02:08 INFO]: BEST test_easy so far [0:02:08 INFO]: Steps: 549 Episode length: mean 17.16 min 17.00 max 18.00 Reward: mean 0.94 ± 0.00 Base Reward: mean 0.00 ± 0.00 Action 1 fraction: 0.06 [0:02:08 INFO]: BEST test_hard so far [0:02:08 INFO]: Steps: 1096 Episode length: mean 34.25 min 2.00 max 50.00 Reward: mean 0.00 ± 0.00 Base Reward: mean 0.00 ± 0.00 Action 1 fraction: 0.15 As expected, the novice performs well on ``DistShift1-v0`` poorly on ``DistShift2-v0`` (see ``Reward``, not ``Base Reward`` on the ``test_hard`` split). **Next, train and evaluate the expert:** .. code-block:: bash python -u examples/minigrid_yrc.py \ --config configs/minigrid_ppo.yaml \ --mode train \ --type agent \ overwrite=1 \ name=minigrid_expert \ env.name=minigrid \ env.train=DistShift2-v0 Example output:: [0:01:53 INFO]: BEST test_easy so far [0:01:53 INFO]: Steps: 587 Episode length: mean 18.34 min 15.00 max 32.00 Reward: mean 0.93 ± 0.00 Base Reward: mean 0.00 ± 0.00 Action 1 fraction: 0.07 [0:01:53 INFO]: BEST test_hard so far [0:01:53 INFO]: Steps: 634 Episode length: mean 19.81 min 19.00 max 24.00 Reward: mean 0.93 ± 0.00 Base Reward: mean 0.00 ± 0.00 Action 1 fraction: 0.10 The expert performs well on both task variants. **Finally, train the coordination policy:** .. code-block:: bash GYM_BACKEND=gymnasium python -u examples/minigrid_yrc.py \ --config configs/minigrid_ppo.yaml \ --mode train \ --type coord \ overwrite=1 \ name=minigrid_coord \ env.name=minigrid \ env.train=DistShift2-v0 .. note:: Since we are using the Gymnasium version of MiniGrid, the environment variable ``GYM_BACKEND=gymnasium`` must be set so that Duo initializes CoordEnv correctly. Example output:: [0:05:42 INFO]: BEST test_easy so far [0:05:42 INFO]: Steps: 571 Episode length: mean 17.84 min 15.00 max 20.00 Reward: mean 0.84 ± 0.02 Base Reward: mean 0.94 ± 0.00 Action 1 fraction: 0.27 [0:05:42 INFO]: BEST test_hard so far [0:05:42 INFO]: Steps: 656 Episode length: mean 20.50 min 19.00 max 25.00 Reward: mean 0.74 ± 0.01 Base Reward: mean 0.93 ± 0.00 Action 1 fraction: 0.46 As seen, the learned coordination policy enables the novice to request help only 46% of the time while achieving expert-level performance (0.93) on ``DistShift2-v0`` (see ``Base Reward`` on ``test_hard``; meanwhile, ``Reward`` reflects the base reward substracted by coordination cost).