Add a New Environment
=====================
Duo works with any environment that can be converted into a Stable Baselines3 (SB3) environment.
This includes both Gym and Gymnasium environments.
In this tutorial, you will learn how to:
- Integrate a `Gymnasium `_ environment (`MiniGrid `_) into the Duo pipeline
- Add a custom policy model
- Train a policy on the new environment using PPO
The code for this example is at `examples/minigrid_yrc.py `_.
.. _add-env:
0. Install Requirements for MiniGrid Environments
-------------------------------------------------
.. code-block:: bash
pip install -r requirements/requirements_minigrid.txt
1. Add a New Gym Environment
----------------------------
We will use the MiniGrid environments ``DistShift1-v0`` and ``DistShift2-v0``.
``DistShift1-v0`` will be used to train the novice agent, and ``DistShift2-v0`` the expert agent.
We will then train a policy to coordinate the two agents on ``DistShift2-v0``.
The DistShift environments originate from the paper `AI Safety Gridworlds `_ (Leike et al., 2017).
The task is to reach the goal location while avoiding deadly lava. The agent always starts at the top-left corner, and the goal is at the top-right corner.
The lava is distributed differently in the two variants. Episode returns are between 0 and 1.
.. figure:: ../images/distshift.jpg
:width: 100%
:alt: DistShift environments
DistShift1-v0 (left) and DistShift2-v0 (right).
Source: `Leike et al., 2017 `_
1.1. Define and Register Environment Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
By defining a configuration dataclass for your new environment, you can customize it using YAML or command-line flags.
Here is a simple configuration class that lets you set the number of parallel environments and choose the training and test tasks:
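.. code-block:: python

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class MiniGridConfig:
        name: str = "minigrid"
        num_envs: int = 8  # number of parallel environments
        seed: int = 0
        # Task used for each split
        train: Optional[str] = "DistShift2-v0"
        test_easy: Optional[str] = "DistShift1-v0"
        test_hard: Optional[str] = "DistShift2-v0"
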
Next, register this configuration class with Duo:
.. code-block:: python
duo_ai.register_environment(MiniGridConfig.name, MiniGridConfig)
Once registered, you can override the default parameters using YAML or command-line flags.
For example, specify ``env.num_envs=16`` or ``env.train=DistShift1-v0`` on the command line.
.. note::
Registration must happen before creating the ``config`` object, so the configuration parser includes the registered arguments.
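
To illustrate what such an override does, the sketch below applies dotted ``key=value`` strings to a dataclass instance. It is not Duo's configuration parser: ``apply_overrides`` is a hypothetical helper written for this example, and the dataclass is a trimmed copy of ``MiniGridConfig``.

.. code-block:: python

    from dataclasses import dataclass, fields, replace
    from typing import Optional


    @dataclass
    class MiniGridConfig:
        name: str = "minigrid"
        num_envs: int = 8
        seed: int = 0
        train: Optional[str] = "DistShift2-v0"


    def apply_overrides(config, overrides, prefix="env."):
        """Hypothetical helper: return a copy of config with overrides applied."""
        known = {f.name for f in fields(config)}
        updates = {}
        for item in overrides:
            key, _, raw = item.partition("=")
            if not key.startswith(prefix):
                continue  # the flag belongs to another config group
            field_name = key[len(prefix):]
            if field_name not in known:
                raise KeyError(f"unknown option: {key}")
            # Cast to int when the default is an int; keep strings otherwise.
            current = getattr(config, field_name)
            updates[field_name] = int(raw) if isinstance(current, int) else raw
        return replace(config, **updates)


    config = apply_overrides(
        MiniGridConfig(), ["env.num_envs=16", "env.train=DistShift1-v0"]
    )
    print(config.num_envs, config.train)

Returning a modified copy with ``dataclasses.replace`` leaves the defaults untouched, which keeps the example easy to reason about.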
1.2. Convert a Gymnasium Environment to Stable Baselines3
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Duo's ``PPOAlgorithm`` expects SB3 environments, which have the following features
(see the `SB3 documentation `_):
- The environment resets automatically when an episode ends or is truncated. The returned observation at that time is the first observation of the next episode.
- The ``reset()`` method returns only an observation.
- The ``step()`` method returns a tuple ``(obs, reward, done, info)`` (the original Gym API).
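
To make the contrast with the Gymnasium API concrete, here is a toy sketch that needs no external packages. ``ToyGymnasiumEnv`` and ``SB3StyleEnv`` are hypothetical classes written for this illustration. In SB3 itself, automatic resets are handled by its vectorized environment classes, which also store the final observation of an episode under ``info["terminal_observation"]``.

.. code-block:: python

    class ToyGymnasiumEnv:
        """Toy stand-in for a Gymnasium env: every episode lasts three steps."""

        def __init__(self):
            self.t = 0

        def reset(self):
            self.t = 0
            return self.t, {}  # Gymnasium style: (obs, info)

        def step(self, action):
            self.t += 1
            terminated = self.t >= 3
            reward = 1.0 if terminated else 0.0
            # Gymnasium style: (obs, reward, terminated, truncated, info)
            return self.t, reward, terminated, False, {}


    class SB3StyleEnv:
        """Hypothetical wrapper exposing the three features listed above."""

        def __init__(self, env):
            self.env = env

        def reset(self):
            obs, _info = self.env.reset()
            return obs  # observation only

        def step(self, action):
            obs, reward, terminated, truncated, info = self.env.step(action)
            done = terminated or truncated
            if done:
                # Keep the last observation, then reset automatically.
                info["terminal_observation"] = obs
                obs, _info = self.env.reset()
            return obs, reward, done, info  # original Gym 4-tuple


    env = SB3StyleEnv(ToyGymnasiumEnv())
    first_obs = env.reset()
    transitions = [env.step(0) for _ in range(3)]
    print(transitions[-1])  # (0, 1.0, True, {'terminal_observation': 3})

On the third step the episode ends, so the returned observation (``0``) already belongs to the next episode.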
Below is sample code to convert a MiniGrid (Gymnasium) environment to an SB3 environment:
.. code-block:: python

    import gymnasium as gym
    from minigrid.wrappers import ImgObsWrapper
    from stable_baselines3.common.env_util import make_vec_env

    def make_base_env(config, split, render_mode="rgb_array"):
        # config is an instance of MiniGridConfig;
        # split is one of its task fields: "train", "test_easy", or "test_hard"
        env_id = f"MiniGrid-{getattr(config, split)}"

        # env_fn returns a new environment instance
        def env_fn(env_id=env_id, render_mode=render_mode):
            return ImgObsWrapper(gym.make(env_id, render_mode=render_mode))

        return make_vec_env(env_fn, n_envs=config.num_envs, seed=config.seed)

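
The environment ID is built from the task name stored in the split's configuration field. The sketch below isolates that step. ``resolve_env_id`` is a hypothetical helper, and a ``SimpleNamespace`` stands in for a ``MiniGridConfig`` instance so that the snippet runs without MiniGrid installed.

.. code-block:: python

    from types import SimpleNamespace

    # Stand-in for a MiniGridConfig instance (assumption for this example).
    config = SimpleNamespace(
        train="DistShift2-v0",
        test_easy="DistShift1-v0",
        test_hard="DistShift2-v0",
    )


    def resolve_env_id(config, split):
        """Hypothetical helper: map a split name to a MiniGrid environment ID."""
        task = getattr(config, split, None)
        if task is None:
            raise ValueError(f"no task configured for split {split!r}")
        return f"MiniGrid-{task}"


    print(resolve_env_id(config, "test_easy"))  # MiniGrid-DistShift1-v0

Failing early on an unknown split gives a clearer error than passing ``MiniGrid-None`` to ``gym.make``.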
.. _add-model:
2. Add a New Policy Model
-------------------------
We need a custom model to process observations from the newly added environment.
As with the environment, you can customize the model using YAML or command-line flags by defining a configuration dataclass and registering it with Duo.
Here is an example model class, used for the novice, expert, and coordination policies:
.. code-block:: python

    from dataclasses import dataclass

    import duo_ai
    import torch.nn as nn

    @dataclass
    class MiniGridPPOModelConfig:
        name: str = "minigrid_ppo"

    class MiniGridPPOModel(nn.Module):
        config_cls = MiniGridPPOModelConfig

        def __init__(self, config, env):
            super().__init__()
            # some code
            # The model must have these attributes for CoordEnv:
            self.hidden_dim = 128
            self.logit_dim = env.action_space.n

        def forward(self, obs):
            # some code
            ...

    # Register model class with Duo
    duo_ai.register_model("minigrid_ppo", MiniGridPPOModel)

.. note::
The model class must have a ``config_cls`` attribute that points to the configuration dataclass.
``CoordEnv`` requires the model to have ``hidden_dim`` and ``logit_dim`` attributes.
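
A quick way to catch a missing attribute before training is to check the interface explicitly. The helper below is a hypothetical sketch, not part of Duo; the dummy classes exist only to exercise it, and the value ``7`` is an arbitrary placeholder.

.. code-block:: python

    REQUIRED_CLASS_ATTRS = ("config_cls",)
    REQUIRED_INSTANCE_ATTRS = ("hidden_dim", "logit_dim")


    def missing_model_attributes(model):
        """Hypothetical helper: list the required attributes a model lacks."""
        missing = [a for a in REQUIRED_CLASS_ATTRS if not hasattr(type(model), a)]
        missing += [a for a in REQUIRED_INSTANCE_ATTRS if not hasattr(model, a)]
        return missing


    class DummyConfig:
        name = "dummy"


    class GoodModel:
        config_cls = DummyConfig

        def __init__(self):
            self.hidden_dim = 128
            self.logit_dim = 7


    class IncompleteModel:
        config_cls = DummyConfig

        def __init__(self):
            self.hidden_dim = 128  # logit_dim is deliberately left out


    print(missing_model_attributes(GoodModel()))        # []
    print(missing_model_attributes(IncompleteModel()))  # ['logit_dim']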
.. _run-experiments:
3. Run Experiments
------------------
We are now ready to train a coordination policy to help the novice efficiently leverage assistance from the expert while performing the ``DistShift2-v0`` task.
We provide a configuration file at ``configs/minigrid_ppo.yaml``:
.. code-block:: yaml

    name: "minigrid_ppo"
    seed: 10
    env: "minigrid"
    policy:
      name: "ppo"
      model: "minigrid_ppo"
    algorithm:
      name: "ppo"
      log_freq: 10
      save_freq: 0
      num_steps: 512
      total_timesteps: 500000
      update_epochs: 4
      gamma: 0.99
      gae_lambda: 0.95
      num_minibatches: 8
      clip_coef: 0.2
      norm_adv: true
      clip_vloss: true
      vf_coef: 0.5
      ent_coef: 0.01
      max_grad_norm: 0.5
      learning_rate: 0.00025
      critic_pretrain_steps: 0
      anneal_lr: false
      log_action_id: 1
    evaluation:
      num_episodes: 32
      max_num_steps: 50
      temperature: 1.0
      log_action_id: 1
    train_novice: "experiments/minigrid_novice/best_test_easy.ckpt"
    train_expert: "experiments/minigrid_expert/best_test_hard.ckpt"

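
To get a feel for these settings, the sketch below derives the rollout and minibatch sizes they imply. It assumes the common PPO convention in which each update collects ``num_steps`` transitions from every parallel environment, with the default ``num_envs = 8``; Duo's exact bookkeeping may differ.

.. code-block:: python

    # Values taken from the configuration above and MiniGridConfig defaults.
    num_envs = 8
    num_steps = 512
    total_timesteps = 500_000
    num_minibatches = 8
    update_epochs = 4

    # Transitions collected per policy update (assumed convention).
    batch_size = num_envs * num_steps
    # Transitions per gradient step.
    minibatch_size = batch_size // num_minibatches
    # Number of full policy updates that fit in the step budget.
    num_updates = total_timesteps // batch_size
    # Gradient steps taken on each collected batch.
    gradient_steps_per_update = update_epochs * num_minibatches

    print(batch_size, minibatch_size, num_updates, gradient_steps_per_update)
    # 4096 512 122 32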
**Train and evaluate the novice:**
.. code-block:: bash
python -u examples/minigrid_yrc.py \
--config configs/minigrid_ppo.yaml \
--mode train \
--type agent \
overwrite=1 \
name=minigrid_novice \
env.name=minigrid \
env.train=DistShift1-v0
Example output::
[0:02:08 INFO]: BEST test_easy so far
[0:02:08 INFO]: Steps: 549
Episode length: mean 17.16 min 17.00 max 18.00
Reward: mean 0.94 ± 0.00
Base Reward: mean 0.00 ± 0.00
Action 1 fraction: 0.06
[0:02:08 INFO]: BEST test_hard so far
[0:02:08 INFO]: Steps: 1096
Episode length: mean 34.25 min 2.00 max 50.00
Reward: mean 0.00 ± 0.00
Base Reward: mean 0.00 ± 0.00
Action 1 fraction: 0.15
As expected, the novice performs well on ``DistShift1-v0`` but poorly on ``DistShift2-v0`` (see ``Reward``, not ``Base Reward``, on the ``test_hard`` split).
**Next, train and evaluate the expert:**
.. code-block:: bash
python -u examples/minigrid_yrc.py \
--config configs/minigrid_ppo.yaml \
--mode train \
--type agent \
overwrite=1 \
name=minigrid_expert \
env.name=minigrid \
env.train=DistShift2-v0
Example output::
[0:01:53 INFO]: BEST test_easy so far
[0:01:53 INFO]: Steps: 587
Episode length: mean 18.34 min 15.00 max 32.00
Reward: mean 0.93 ± 0.00
Base Reward: mean 0.00 ± 0.00
Action 1 fraction: 0.07
[0:01:53 INFO]: BEST test_hard so far
[0:01:53 INFO]: Steps: 634
Episode length: mean 19.81 min 19.00 max 24.00
Reward: mean 0.93 ± 0.00
Base Reward: mean 0.00 ± 0.00
Action 1 fraction: 0.10
The expert performs well on both task variants.
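
If you want to compare runs programmatically, the reward lines can be pulled out of the log text with a regular expression. The parser below is a hypothetical sketch that assumes the log format shown above; it calls the number after ``±`` the "spread" because the tutorial does not say which statistic it is.

.. code-block:: python

    import re

    LOG = """\
    [0:01:53 INFO]: BEST test_hard so far
    [0:01:53 INFO]: Steps: 634
    Episode length: mean 19.81 min 19.00 max 24.00
    Reward: mean 0.93 ± 0.00
    Base Reward: mean 0.00 ± 0.00
    Action 1 fraction: 0.10
    """

    # Anchored at the line start so "Reward" and "Base Reward" stay distinct.
    METRIC = re.compile(r"^\s*(Reward|Base Reward): mean (-?\d+\.\d+) ± (\d+\.\d+)\s*$")


    def parse_rewards(text):
        """Hypothetical helper: map metric name to (mean, spread)."""
        results = {}
        for line in text.splitlines():
            match = METRIC.match(line)
            if match:
                name, mean, spread = match.groups()
                results[name] = (float(mean), float(spread))
        return results


    print(parse_rewards(LOG))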
**Finally, train the coordination policy:**
.. code-block:: bash
GYM_BACKEND=gymnasium python -u examples/minigrid_yrc.py \
--config configs/minigrid_ppo.yaml \
--mode train \
--type coord \
overwrite=1 \
name=minigrid_coord \
env.name=minigrid \
env.train=DistShift2-v0
.. note::
Since we are using the Gymnasium version of MiniGrid, the environment variable ``GYM_BACKEND=gymnasium`` must be set so that Duo initializes ``CoordEnv`` correctly.
Example output::
[0:05:42 INFO]: BEST test_easy so far
[0:05:42 INFO]: Steps: 571
Episode length: mean 17.84 min 15.00 max 20.00
Reward: mean 0.84 ± 0.02
Base Reward: mean 0.94 ± 0.00
Action 1 fraction: 0.27
[0:05:42 INFO]: BEST test_hard so far
[0:05:42 INFO]: Steps: 656
Episode length: mean 20.50 min 19.00 max 25.00
Reward: mean 0.74 ± 0.01
Base Reward: mean 0.93 ± 0.00
Action 1 fraction: 0.46
As the output shows, the learned coordination policy lets the novice request help only 46% of the time while still achieving expert-level performance (0.93) on ``DistShift2-v0`` (see ``Base Reward`` on ``test_hard``). ``Reward``, in contrast, is the base reward minus the coordination cost.
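
Reading ``Reward`` as the base reward minus the coordination cost, the cost implied by the example output can be recovered by subtraction. The snippet below is only arithmetic on the logged numbers. The dictionary keys are our own labels, with ``help_fraction`` standing for ``Action 1 fraction``; how Duo computes the cost internally is not covered here.

.. code-block:: python

    # Numbers copied from the coordination-policy output above.
    results = {
        "test_easy": {"reward": 0.84, "base_reward": 0.94, "help_fraction": 0.27},
        "test_hard": {"reward": 0.74, "base_reward": 0.93, "help_fraction": 0.46},
    }

    # Implied coordination cost = base reward - reward, rounded to log precision.
    implied_cost = {
        split: round(r["base_reward"] - r["reward"], 2)
        for split, r in results.items()
    }
    print(implied_cost)  # {'test_easy': 0.1, 'test_hard': 0.19}

The harder split shows both a higher help fraction and a higher implied cost, which is consistent with the cost growing with the number of help requests.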