duo_ai.core.environment

Classes

CoordinationConfig

Configuration for coordination environment parameters.

CoordEnv

Environment for coordinating between novice and expert policies.

GeneralCoordEnv

Coordination environment supporting recurrent policies.

Module Contents

class duo_ai.core.environment.CoordinationConfig

Configuration for coordination environment parameters.

Parameters:
  • expert_query_cost_weight (float, optional) – The cost coefficient for querying the expert policy. Default is 0.4.

  • switch_agent_cost_weight (float, optional) – The cost coefficient for switching between agents. Default is 0.0.

  • temperature (float, optional) – The temperature parameter for action sampling. Default is 1.0.

Examples

>>> config = CoordinationConfig()
expert_query_cost_weight: float = 0.4
switch_agent_cost_weight: float = 0.0
temperature: float = 1.0
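As a rough illustration of overriding these defaults, the sketch below uses a stand-in dataclass, `CoordinationConfigSketch`, that mirrors the three documented fields. The stand-in is hypothetical; whether the real `CoordinationConfig` accepts keyword arguments in the same way is an assumption.

```python
from dataclasses import dataclass, asdict


@dataclass
class CoordinationConfigSketch:
    """Hypothetical stand-in mirroring the documented fields and defaults."""

    expert_query_cost_weight: float = 0.4
    switch_agent_cost_weight: float = 0.0
    temperature: float = 1.0


# Defaults match the values listed above.
default_config = CoordinationConfigSketch()

# Make expert queries more expensive and penalize switching as well.
custom_config = CoordinationConfigSketch(
    expert_query_cost_weight=0.8,
    switch_agent_cost_weight=0.1,
)

print(asdict(default_config))
print(asdict(custom_config))
```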
class duo_ai.core.environment.CoordEnv(config: CoordinationConfig, base_env: gymnasium.Env, novice: duo.core.Policy, expert: duo.core.Policy, open_novice: bool = True, open_expert: bool = False)

Bases: gymnasium.Env

Environment for coordinating between novice and expert policies.

This class wraps a base environment and enables switching between a novice and expert policy, applying costs for expert queries and agent switching.

Examples

>>> config = CoordinationConfig()
>>> base_env = gym.make(...)
>>> novice = ...
>>> expert = ...
>>> env = CoordEnv(config, base_env, novice, expert)
config_cls
NOVICE = 0
EXPERT = 1
config
base_env
novice
expert
open_novice = True
open_expert = False
action_space
observation_space
expert_query_cost_per_action = None
switch_agent_cost_per_action = None
property num_envs: int

Number of parallel environments.

Returns:

Number of parallel environments.

Return type:

int

Examples

>>> n = env.num_envs
set_costs(base_penalty: float) -> None

Set the cost per action for expert queries and agent switching.

Parameters:

base_penalty (float) – The base per-action reward value from which the expert-query and agent-switching costs are derived.

Return type:

None

Examples

>>> env.set_costs(0.05)
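One plausible reading of `set_costs`, given the weights in `CoordinationConfig`, is that each per-action cost is the corresponding weight multiplied by `base_penalty`. The sketch below illustrates that reading only; the helper `derive_costs` is hypothetical, and the multiplication rule is an assumption that this page does not confirm.

```python
def derive_costs(base_penalty, expert_query_cost_weight=0.4, switch_agent_cost_weight=0.0):
    """Hypothetical helper: scale a base per-action penalty by each cost weight."""
    # Assumption: cost per action = weight * base_penalty.
    expert_query_cost_per_action = expert_query_cost_weight * base_penalty
    switch_agent_cost_per_action = switch_agent_cost_weight * base_penalty
    return expert_query_cost_per_action, switch_agent_cost_per_action


# With the default weights, only expert queries carry a cost.
expert_cost, switch_cost = derive_costs(0.05)
print(expert_cost, switch_cost)
```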
reset() -> Dict[str, Any]

Reset the coordination environment to an initial state.

Returns:

The initial observation of the environment, including:
  • "base_obs": The initial observation from the base environment.

  • "novice_hidden": Numpy array of hidden features from the novice policy.

  • "novice_logits": Numpy array of output logits from the novice policy.

  • "expert_hidden": Numpy array of hidden features from the expert policy (if open_expert).

  • "expert_logits": Numpy array of output logits from the expert policy (if open_expert).

Return type:

dict

Examples

>>> obs = env.reset()
_reset_agents(done: numpy.ndarray) -> None

Reset the internal state of the novice and expert agents.

Parameters:

done (numpy.ndarray) – Boolean array indicating which episodes in a batch require a reset.

Return type:

None

Examples

>>> env._reset_agents(np.array([True, False]))
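A boolean mask is a common way to reset per-environment state in a batch. The sketch below shows the idea with NumPy indexing; it is not the library's implementation, and the `hidden` array and its shape are assumptions made for illustration.

```python
import numpy as np

# Hypothetical recurrent state: one hidden vector per parallel environment.
hidden = np.ones((3, 4), dtype=np.float32)

# Episodes 0 and 2 just ended; episode 1 is still running.
done = np.array([True, False, True])

# Zero the hidden state only where an episode ended.
hidden[done] = 0.0
```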
step(action: numpy.ndarray) -> Tuple[Dict[str, Any], numpy.ndarray, numpy.ndarray, List[Dict[str, Any]]]

Advance the environment by one step using the provided action.

Parameters:

action (numpy.ndarray) – Array with one entry per parallel environment indicating which agent acts: NOVICE (0) or EXPERT (1).

Returns:

  • obs (dict) –

    The next observation of the environment, including:
    • "base_obs": The observation from the base environment.

    • "novice_hidden": Numpy array of hidden features from the novice policy.

    • "novice_logits": Numpy array of output logits from the novice policy.

    • "expert_hidden": Numpy array of hidden features from the expert policy (if open_expert).

    • "expert_logits": Numpy array of output logits from the expert policy (if open_expert).

  • reward (numpy.ndarray) – The reward(s) obtained from the environment after taking the action.

  • done (numpy.ndarray) – Boolean flag(s) indicating whether the episode has ended for each environment.

  • info (list of dict) – Additional information from the environment for each agent or environment instance.

Raises:

Exception – Propagates any exceptions raised by the underlying environment’s step method.

Examples

>>> obs, reward, done, info = env.step(action)
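To show how the four-element return value might be consumed in a batched rollout, the sketch below uses a toy stand-in, `ToyCoordEnv`, that only mimics the documented return shapes. The class, its reward rule, and the fixed horizon are all hypothetical and do not come from the library.

```python
import numpy as np

NOVICE, EXPERT = 0, 1


class ToyCoordEnv:
    """Hypothetical stand-in that only mimics the documented return shapes."""

    def __init__(self, num_envs=2, horizon=3):
        self.num_envs = num_envs
        self.horizon = horizon
        self.t = 0

    def _obs(self):
        return {"base_obs": np.zeros((self.num_envs, 4), dtype=np.float32)}

    def reset(self):
        self.t = 0
        return self._obs()

    def step(self, action):
        self.t += 1
        # Toy reward: 1.0 per step, minus 0.4 wherever the expert was queried.
        reward = 1.0 - 0.4 * (action == EXPERT)
        # All environments end together once the horizon is reached.
        done = np.full(self.num_envs, self.t >= self.horizon)
        info = [{} for _ in range(self.num_envs)]
        return self._obs(), reward, done, info


env = ToyCoordEnv()
obs = env.reset()
total = np.zeros(env.num_envs)
done = np.zeros(env.num_envs, dtype=bool)
while not done.all():
    # Environment 0 always uses the novice, environment 1 always the expert.
    action = np.array([NOVICE, EXPERT])
    obs, reward, done, info = env.step(action)
    total += reward
```

In this toy setup the environment that always queries the expert accumulates less reward, which is the trade-off a coordination policy would have to weigh.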
_compute_base_action(action: numpy.ndarray) -> numpy.ndarray

Compute the environment-specific action for each agent.

Parameters:

action (numpy.ndarray) – Array indicating which agent (novice or expert) acts for each environment.

Returns:

Array of actions to be passed to the base environment.

Return type:

numpy.ndarray

Examples

>>> base_action = env._compute_base_action(action)
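The selection described here can be pictured as a per-environment choice between two proposed actions. The sketch below uses `numpy.where` for that choice; the arrays `novice_action` and `expert_action` are hypothetical inputs, and how the library actually obtains them (for example by sampling from logits with `temperature`) is not shown.

```python
import numpy as np

NOVICE, EXPERT = 0, 1

# Hypothetical base-environment actions proposed by each policy, one per env.
novice_action = np.array([2, 0, 1])
expert_action = np.array([3, 3, 3])

# Coordination action: which agent acts in each environment.
action = np.array([NOVICE, EXPERT, NOVICE])

# Take the expert's proposal where the expert was chosen, else the novice's.
base_action = np.where(action == EXPERT, expert_action, novice_action)
print(base_action.tolist())  # [2, 3, 1]
```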
_get_obs() -> Dict[str, Any]

Return the current observation for the coordination environment.

Returns:

A dictionary containing:
  • "base_obs": The current observation from the base environment.

  • "novice_hidden": Numpy array of hidden features from the novice policy (if open_novice).

  • "novice_logits": Numpy array of output logits from the novice policy (if open_novice).

  • "expert_hidden": Numpy array of hidden features from the expert policy (if open_expert).

  • "expert_logits": Numpy array of output logits from the expert policy (if open_expert).

Return type:

dict

Examples

>>> obs = env._get_obs()
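The effect of `open_novice` and `open_expert` on the returned keys can be sketched as follows. The helper `build_obs` and the `policy_out` dictionaries are hypothetical; only the key names and the two flags come from the description above.

```python
import numpy as np


def build_obs(base_obs, novice_out, expert_out, open_novice=True, open_expert=False):
    """Hypothetical helper: assemble the observation dict described above."""
    obs = {"base_obs": base_obs}
    if open_novice:
        obs["novice_hidden"] = novice_out["hidden"]
        obs["novice_logits"] = novice_out["logits"]
    if open_expert:
        obs["expert_hidden"] = expert_out["hidden"]
        obs["expert_logits"] = expert_out["logits"]
    return obs


# Assumed shapes, chosen only for illustration.
num_envs, hidden_dim, num_actions = 2, 8, 5
policy_out = {
    "hidden": np.zeros((num_envs, hidden_dim), dtype=np.float32),
    "logits": np.zeros((num_envs, num_actions), dtype=np.float32),
}

# With the default flags, only the novice's internals are exposed.
obs = build_obs(np.zeros((num_envs, 4)), policy_out, policy_out)
print(sorted(obs))  # ['base_obs', 'novice_hidden', 'novice_logits']
```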
_get_reward(base_reward: numpy.ndarray, action: numpy.ndarray, done: numpy.ndarray) -> numpy.ndarray

Compute the reward for the current step, including costs for expert queries and agent switching.

Parameters:
  • base_reward (numpy.ndarray) – The base reward from the environment.

  • action (numpy.ndarray) – The action(s) taken (novice or expert).

  • done (numpy.ndarray) – Boolean flag(s) indicating whether the episode has ended for each environment.

Returns:

The computed reward(s) after applying costs.

Return type:

numpy.ndarray

Examples

>>> reward = env._get_reward(base_reward, action, done)
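A minimal sketch of cost-adjusted rewards, assuming the expert-query cost is charged whenever the expert acts and the switching cost whenever the acting agent differs from the previous step. The function `shaped_reward` and the `prev_action` argument are hypothetical. This page does not say how `done` enters the computation, so the sketch leaves it out.

```python
import numpy as np

NOVICE, EXPERT = 0, 1


def shaped_reward(base_reward, action, prev_action, expert_cost, switch_cost):
    """Hypothetical reward: base reward minus query and switch costs."""
    reward = base_reward.astype(np.float64)
    # Charge the query cost in every environment where the expert acts.
    reward = reward - expert_cost * (action == EXPERT)
    # Charge the switch cost where the acting agent changed since last step.
    reward = reward - switch_cost * (action != prev_action)
    return reward


base_reward = np.array([1.0, 1.0, 1.0])
prev_action = np.array([NOVICE, NOVICE, EXPERT])
action = np.array([NOVICE, EXPERT, EXPERT])

reward = shaped_reward(base_reward, action, prev_action, expert_cost=0.25, switch_cost=0.5)
print(reward.tolist())  # [1.0, 0.25, 0.75]
```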
close() -> None

Close the coordination environment and release any resources held.

Return type:

None

Examples

>>> env.close()
class duo_ai.core.environment.GeneralCoordEnv(config: CoordinationConfig, base_env: gymnasium.Env, novice: duo.core.Policy, expert: duo.core.Policy, open_novice: bool = True, open_expert: bool = False)

Bases: CoordEnv

Coordination environment supporting recurrent policies.

This class supports policies that maintain a hidden state across steps, but can be less efficient for stateless policies than CoordEnv.

Examples

>>> config = CoordinationConfig()
>>> base_env = gym.make(...)
>>> novice = ...
>>> expert = ...
>>> env = GeneralCoordEnv(config, base_env, novice, expert)
_compute_agents_action() -> numpy.ndarray

Compute the actions for both novice and expert agents, supporting recurrent policies.

Returns:

Array of actions to be passed to the base environment.

Return type:

numpy.ndarray

Examples

>>> base_action = env._compute_agents_action()
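One reason recurrent policies need separate handling is that a policy's hidden state must advance on every step, even when its action is not the one executed. The sketch below illustrates that idea with a toy policy; `ToyRecurrentPolicy` and its `act` method are hypothetical, and querying both policies on every step is an assumption about how such support could work, not a description of the library's code.

```python
import numpy as np

NOVICE, EXPERT = 0, 1


class ToyRecurrentPolicy:
    """Hypothetical policy whose hidden state counts the steps it has seen."""

    def __init__(self, num_envs, fixed_action):
        self.hidden = np.zeros(num_envs, dtype=np.int64)
        self.fixed_action = fixed_action

    def act(self, base_obs):
        # Update the hidden state on every call, then emit one action per env.
        self.hidden += 1
        return np.full(base_obs.shape[0], self.fixed_action)


num_envs = 2
novice = ToyRecurrentPolicy(num_envs, fixed_action=0)
expert = ToyRecurrentPolicy(num_envs, fixed_action=1)
base_obs = np.zeros((num_envs, 4))

for action in (np.array([NOVICE, EXPERT]), np.array([EXPERT, EXPERT])):
    # Query both policies every step so neither hidden state falls behind,
    # even in environments where that policy's action is discarded.
    novice_action = novice.act(base_obs)
    expert_action = expert.act(base_obs)
    base_action = np.where(action == EXPERT, expert_action, novice_action)
```

Running both policies on every step costs extra computation, which fits the note above that this class can be less efficient than CoordEnv for stateless policies.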
_compute_base_action(action: numpy.ndarray) -> numpy.ndarray

Compute the environment-specific action for each agent, supporting recurrent policies.

Parameters:

action (numpy.ndarray) – Array indicating which agent (novice or expert) acts for each environment.

Returns:

Array of actions to be passed to the base environment.

Return type:

numpy.ndarray

Examples

>>> base_action = env._compute_base_action(action)
_get_obs() -> Dict[str, Any]

Return the current observation for the coordination environment, supporting recurrent policies.

Returns:

A dictionary containing:
  • "base_obs": The current observation from the base environment.

  • "novice_hidden": Numpy array of hidden features from the novice policy (if open_novice).

  • "novice_logits": Numpy array of output logits from the novice policy (if open_novice).

  • "expert_hidden": Numpy array of hidden features from the expert policy (if open_expert).

  • "expert_logits": Numpy array of output logits from the expert policy (if open_expert).

Return type:

dict

Examples

>>> obs = env._get_obs()