Add a New Environment¶

Duo works with any environment that can be converted into a Stable Baselines3 (SB3) environment. This includes both Gym and Gymnasium environments.

In this tutorial, you will learn how to:

Intergreate a Gymnasium environment (MiniGrid) into the Duo pipeline
Add a custom policy model
Train a policy on the new environment using PPO

The code of this example is at examples/minigrid_yrc.py.

0. Install requirements for Minigrid environments¶

pip install -r requirements/requirements_minigrid.txt

1. Add a New Gym Environment¶

We will use the MiniGrid environments DistShift1-v0 and DistShift2-v0. DistShift1-v0 will be used to train the novice agent, and DistShift2-v0 the expert agent. We will then train a policy to coordinate the two agents on DistShift2-v0.

The DistShift environments originate from the paper AI Safety Gridworlds (Leike et al., 2017). The task is to reach the goal location while avoiding deadly lava. The agent always starts at the top-left corner, and the goal is at the top-right corner. The lava is distributed differently in the two variants. Episode returns are between 0 and 1.

1.1. Define and Register Environment Configuration¶

By defining configuration dataclass for your new environment, you can customize it using YAML or command-line flags.

Here is a simple configuration class that lets you set the number of parallel environments and choose the training and test tasks:

@dataclass
class MiniGridConfig:
    name: str = "minigrid"
    num_envs: int = 8
    seed: int = 0
    train: Optional[str] = "DistShift2-v0"
    test_easy: Optional[str] = "DistShift1-v0"
    test_hard: Optional[str] = "DistShift2-v0"

Next, register this configuration class with Duo:

duo_ai.register_environment(MiniGridConfig.name, MiniGridConfig)

Once registered, you can override the default parameters using YAML or command-line flags. For example, specify env.num_env=8 or env.train=DistShift1-v0 on the command line.

Note

Registration must happen before creating the config object, so the configuration parser includes the registered arguments.

1.2. Convert a Gymnasium Environment to Stable Baselines3¶

Duo’s PPOAlgorithm expects SB3 environments, which have the following features (see the SB3 documentation):

The environment resets automatically when an episode ends or is truncated. The returned observation at that time is the first observation of the next episode.
The reset() method returns only an observation.
The step() method returns a tuple (obs, reward, done, info) (the original Gym API).

Below is sample code to convert a MiniGrid (Gymnasium) environment to an SB3 environment:

import gymnasium as gym
from minigrid.wrappers import ImgObsWrapper
from stable_baselines3.common.env_util import make_vec_env

def make_base_env(config, split, render_mode="rgb_array"):
    # config is an instance of MiniGridConfig
    env_id = f"MiniGrid-{getattr(config, split)}"

    # env_fn returns a new environment instance
    def env_fn(env_id=env_id, render_mode=render_mode):
        return ImgObsWrapper(gym.make(env_id, render_mode=render_mode))

    return make_vec_env(env_fn, n_envs=config.num_envs, seed=config.seed)

2. Add a New Policy Model¶

We need a custom model to process observations from the newly added environment.

As with the environment, you can cutomize the model using YAML or command-line flags, by defining a configuration dataclass and registering it with Duo.

Here is an example model class, used for the novice, expert, and coordination policy:

@dataclass
class MiniGridPPOModelConfig:
    name: str = "minigrid_ppo"

class MiniGridPPOModel(nn.Module):
    config_cls = MiniGridPPOModelConfig

    def __init__(self, config, env):
        # some code
        # The model must have these attributes for CoordEnv:
        self.hidden_dim = 128
        self.logit_dim = env.action_space.n

    def forward(self, obs):
        # some code

# Register model class with Duo
duo_ai.register_model("minigrid_ppo", MiniGridPPOModel)

Note

The model class must have a config_cls attribute that points to the configuration dataclass.

CoordEnv requires the model to have hidden_dim and logit_dim attributes.

3. Run Experiments¶

We are now ready to train a coordination policy to help the novice efficiently leverage assistance from the expert while performing the DistShift2-v0 task.

We provide a configuration file at configs/minigrid_ppo.yaml:

name: "minigrid_ppo"
seed: 10

env: "minigrid"

policy:
  name: "ppo"
  model: "minigrid_ppo"

algorithm:
  name: "ppo"
  log_freq: 10
  save_freq: 0
  num_steps: 512
  total_timesteps: 500000
  update_epochs: 4
  gamma: 0.99
  gae_lambda: 0.95
  num_minibatches: 8
  clip_coef: 0.2
  norm_adv: true
  clip_vloss: true
  vf_coef: 0.5
  ent_coef: 0.01
  max_grad_norm: 0.5
  learning_rate: 0.00025
  critic_pretrain_steps: 0
  anneal_lr: false
  log_action_id: 1

evaluation:
  num_episodes: 32
  max_num_steps: 50
  temperature: 1.0
  log_action_id: 1

train_novice: "experiments/minigrid_novice/best_test_easy.ckpt"
train_expert: "experiments/minigrid_expert/best_test_hard.ckpt"

Train and evaluate the novice:

python -u examples/minigrid_yrc.py \
    --config configs/minigrid_ppo.yaml \
    --mode train \
    --type agent \
    overwrite=1 \
    name=minigrid_novice \
    env.name=minigrid \
    env.train=DistShift1-v0

Example output:

[0:02:08 INFO]: BEST test_easy so far
[0:02:08 INFO]:    Steps:         549
   Episode length: mean   17.16  min   17.00  max   18.00
   Reward:         mean 0.94 ± 0.00
   Base Reward:    mean 0.00 ± 0.00
   Action 1 fraction:    0.06

[0:02:08 INFO]: BEST test_hard so far
[0:02:08 INFO]:    Steps:         1096
   Episode length: mean   34.25  min    2.00  max   50.00
   Reward:         mean 0.00 ± 0.00
   Base Reward:    mean 0.00 ± 0.00
   Action 1 fraction:    0.15

As expected, the novice performs well on DistShift1-v0 poorly on DistShift2-v0 (see Reward, not Base Reward on the test_hard split).

Next, train and evaluate the expert:

python -u examples/minigrid_yrc.py \
    --config configs/minigrid_ppo.yaml \
    --mode train \
    --type agent \
    overwrite=1 \
    name=minigrid_expert \
    env.name=minigrid \
    env.train=DistShift2-v0

Example output:

[0:01:53 INFO]: BEST test_easy so far
[0:01:53 INFO]:    Steps:         587
   Episode length: mean   18.34  min   15.00  max   32.00
   Reward:         mean 0.93 ± 0.00
   Base Reward:    mean 0.00 ± 0.00
   Action 1 fraction:    0.07

[0:01:53 INFO]: BEST test_hard so far
[0:01:53 INFO]:    Steps:         634
   Episode length: mean   19.81  min   19.00  max   24.00
   Reward:         mean 0.93 ± 0.00
   Base Reward:    mean 0.00 ± 0.00
   Action 1 fraction:    0.10

The expert performs well on both task variants.

Finally, train the coordination policy:

GYM_BACKEND=gymnasium python -u examples/minigrid_yrc.py \
    --config configs/minigrid_ppo.yaml \
    --mode train \
    --type coord \
    overwrite=1 \
    name=minigrid_coord \
    env.name=minigrid \
    env.train=DistShift2-v0

Note

Since we are using the Gymnasium version of MiniGrid, the environment variable GYM_BACKEND=gymnasium must be set so that Duo initializes CoordEnv correctly.

Example output:

[0:05:42 INFO]: BEST test_easy so far
[0:05:42 INFO]:    Steps:         571
   Episode length: mean   17.84  min   15.00  max   20.00
   Reward:         mean 0.84 ± 0.02
   Base Reward:    mean 0.94 ± 0.00
   Action 1 fraction:    0.27

[0:05:42 INFO]: BEST test_hard so far
[0:05:42 INFO]:    Steps:         656
   Episode length: mean   20.50  min   19.00  max   25.00
   Reward:         mean 0.74 ± 0.01
   Base Reward:    mean 0.93 ± 0.00
   Action 1 fraction:    0.46

As seen, the learned coordination policy enables the novice to request help only 46% of the time while achieving expert-level performance (0.93) on DistShift2-v0 (see Base Reward on test_hard; meanwhile, Reward reflects the base reward substracted by coordination cost).