duo_ai.core.environment

Classes

`CoordinationConfig`	Configuration for coordination environment parameters.
`CoordEnv`	Environment for coordinating between novice and expert policies.
`GeneralCoordEnv`	Coordination environment supporting recurrent policies.

Module Contents

class duo_ai.core.environment.CoordinationConfig

Configuration for coordination environment parameters.

Parameters:

expert_query_cost_weight (float, optional) – The cost coefficient for querying the expert policy. Default is 0.4.
switch_agent_cost_weight (float, optional) – The cost coefficient for switching between agents. Default is 0.0.
temperature (float, optional) – The temperature parameter for action sampling. Default is 1.0.

Examples

>>> config = CoordinationConfig()
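The documented fields and defaults can be mirrored with a small stand-in, which is handy for experimenting without the package installed. This is a minimal sketch: `CoordinationConfigSketch` is a hypothetical name, and whether the real `CoordinationConfig` is a dataclass is an assumption not confirmed by this page.

```python
from dataclasses import dataclass, replace


@dataclass
class CoordinationConfigSketch:
    """Hypothetical stand-in mirroring the documented fields and defaults."""

    expert_query_cost_weight: float = 0.4
    switch_agent_cost_weight: float = 0.0
    temperature: float = 1.0


default_cfg = CoordinationConfigSketch()

# Override two fields: also charge for switching, and use a lower temperature.
custom_cfg = replace(default_cfg, switch_agent_cost_weight=0.1, temperature=0.5)
```

Using `dataclasses.replace` leaves `default_cfg` unchanged, so several variants can be derived from one baseline.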

expert_query_cost_weight: float = 0.4

switch_agent_cost_weight: float = 0.0

temperature: float = 1.0
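To illustrate what a sampling temperature typically does, the sketch below draws actions from a softmax over logits divided by the temperature. The function `sample_with_temperature` is hypothetical and is not taken from the library; how `CoordEnv` actually applies `temperature` is not specified here.

```python
import numpy as np


def sample_with_temperature(logits, temperature, rng):
    """Sample one action index per row from softmax(logits / temperature)."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max(axis=-1, keepdims=True)  # subtract row max for stability
    probs = np.exp(scaled)
    probs /= probs.sum(axis=-1, keepdims=True)
    return np.array([rng.choice(len(p), p=p) for p in probs])


rng = np.random.default_rng(0)
logits = np.array([[2.0, 0.5, 0.1], [0.0, 0.0, 5.0]])

# temperature = 1.0 samples from the plain softmax distribution.
actions = sample_with_temperature(logits, temperature=1.0, rng=rng)

# A very small temperature concentrates almost all probability on the argmax.
near_greedy = sample_with_temperature(logits, temperature=0.01, rng=rng)
```

Values above 1.0 flatten the distribution and make sampling more exploratory.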

class duo_ai.core.environment.CoordEnv(config: CoordinationConfig, base_env: gymnasium.Env, novice: duo.core.Policy, expert: duo.core.Policy, open_novice: bool = True, open_expert: bool = False)

Bases: gymnasium.Env

Environment for coordinating between novice and expert policies.

This class wraps a base environment and enables switching between a novice and expert policy, applying costs for expert queries and agent switching.

Examples

>>> config = CoordinationConfig()
>>> base_env = gym.make(...)
>>> novice = ...
>>> expert = ...
>>> env = CoordEnv(config, base_env, novice, expert)

config_cls

NOVICE = 0

EXPERT = 1

config

base_env

novice

expert

open_novice = True

open_expert = False

action_space

observation_space

expert_query_cost_per_action = None

switch_agent_cost_per_action = None

property num_envs: int

Number of parallel environments.

Returns: Number of parallel environments.
Return type: int

Examples

>>> n = env.num_envs

set_costs(base_penalty: float) → None

Set the cost per action for expert queries and agent switching.

Parameters: base_penalty (float) – The reward value per action, used as the base for the expert-query and agent-switching costs per action.
Return type: None

Examples

>>> env.set_costs(0.05)
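One plausible reading of `set_costs` is that each per-action cost is the corresponding config weight multiplied by `base_penalty`. The sketch below encodes that assumption only; `derive_costs` is a hypothetical helper, and the formula is not confirmed by this page.

```python
def derive_costs(base_penalty, expert_query_cost_weight=0.4, switch_agent_cost_weight=0.0):
    """Hypothetical: scale a base per-action penalty by each configured weight."""
    expert_query_cost_per_action = expert_query_cost_weight * base_penalty
    switch_agent_cost_per_action = switch_agent_cost_weight * base_penalty
    return expert_query_cost_per_action, switch_agent_cost_per_action


# With the documented default weights, only expert queries carry a cost.
query_cost, switch_cost = derive_costs(0.05)
```

Under this reading, `expert_query_cost_per_action` and `switch_agent_cost_per_action` stay `None` until `set_costs` has been called.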

reset() → Dict[str, Any]

Reset the coordination environment to an initial state.

Returns:

The initial observation of the environment, including:

"base_obs": The initial observation from the base environment.
"novice_hidden": Numpy array of hidden features from the novice policy.
"novice_logits": Numpy array of output logits from the novice policy.
"expert_hidden": Numpy array of hidden features from the expert policy (if open_expert).
"expert_logits": Numpy array of output logits from the expert policy (if open_expert).

Return type:

dict

Examples

>>> obs = env.reset()

_reset_agents(done: numpy.ndarray) → None

Reset the internal state of the novice and expert agents.

Parameters: done (numpy.ndarray) – Boolean array indicating which episodes in a batch require a reset.
Return type: None

Examples

>>> env._reset_agents(np.array([True, False]))
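A per-environment reset driven by a boolean `done` mask is commonly written with `np.where`. The sketch below shows that pattern on hypothetical state (`prev_action`, `hidden`); which state the real `_reset_agents` clears is not documented here.

```python
import numpy as np


def reset_where_done(prev_action, hidden, done, initial_action=0):
    """Hypothetical: reset per-environment state only where an episode ended."""
    prev_action = np.where(done, initial_action, prev_action)
    # Broadcast the mask over the feature axis of the hidden state.
    hidden = np.where(done[:, None], 0.0, hidden)
    return prev_action, hidden


done = np.array([True, False])
prev_action, hidden = reset_where_done(
    prev_action=np.array([1, 1]),
    hidden=np.ones((2, 3)),
    done=done,
)
```

Environments whose entry in `done` is `False` keep their state untouched, which matters when parallel episodes end at different times.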

step(action: numpy.ndarray) → Tuple[Dict[str, Any], numpy.ndarray, numpy.ndarray, List[Dict[str, Any]]]

Advance the environment by one step using the provided action.

Parameters:

action (numpy.ndarray) – Array indicating, for each environment, which agent acts (NOVICE = 0 or EXPERT = 1).

Returns:

obs (dict) –
The next observation of the environment, including:
- "base_obs": The observation from the base environment.
- "novice_hidden": Numpy array of hidden features from the novice policy.
- "novice_logits": Numpy array of output logits from the novice policy.
- "expert_hidden": Numpy array of hidden features from the expert policy (if open_expert).
- "expert_logits": Numpy array of output logits from the expert policy (if open_expert).
reward (numpy.ndarray) – The reward(s) obtained from the environment after taking the action.
done (numpy.ndarray) – Boolean flag(s) indicating whether the episode has ended for each environment.
info (list of dict) – Additional information from the environment for each agent or environment instance.

Raises:

Exception – Propagates any exceptions raised by the underlying environment’s step method.

Examples

>>> obs, reward, done, info = env.step(action)
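The four-element return value and the batched `done` array suggest a rollout loop like the one below. `StubCoordEnv` is a hypothetical stand-in that mimics only the documented return shapes; its dynamics and the 0.02 query cost are made up for illustration.

```python
import numpy as np


class StubCoordEnv:
    """Hypothetical stand-in that mimics the documented reset/step return shapes."""

    num_envs = 2

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def _obs(self):
        return {"base_obs": np.zeros((self.num_envs, 4))}

    def reset(self):
        self.t = 0
        return self._obs()

    def step(self, action):
        self.t += 1
        # Made-up reward: 1 per step, minus 0.02 wherever the expert (1) is chosen.
        reward = np.ones(self.num_envs) - 0.02 * (action == 1)
        done = np.full(self.num_envs, self.t >= self.horizon)
        info = [{} for _ in range(self.num_envs)]
        return self._obs(), reward, done, info


env = StubCoordEnv()
obs = env.reset()
total = np.zeros(env.num_envs)
done = np.zeros(env.num_envs, dtype=bool)
while not done.all():
    action = np.array([0, 1])  # env 0 uses the novice, env 1 queries the expert
    obs, reward, done, info = env.step(action)
    total += reward
```

Note that the documented signature returns a single `done` array rather than separate `terminated` and `truncated` flags.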

_compute_base_action(action: numpy.ndarray) → numpy.ndarray

Compute the environment-specific action for each agent.

Parameters: action (numpy.ndarray) – Array indicating which agent (novice or expert) acts for each environment.
Returns: Array of actions to be passed to the base environment.
Return type: numpy.ndarray

Examples

>>> base_action = env._compute_base_action(action)
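Conceptually, the coordination action picks, per environment, whose proposed action reaches the base environment. A minimal sketch of that selection, assuming both policies have already proposed an action (`select_base_action` is a hypothetical helper, not the library's implementation):

```python
import numpy as np

NOVICE, EXPERT = 0, 1  # same values as the documented class constants


def select_base_action(action, novice_action, expert_action):
    """Pick the expert's action where action == EXPERT, else the novice's."""
    return np.where(action == EXPERT, expert_action, novice_action)


action = np.array([NOVICE, EXPERT, EXPERT, NOVICE])
novice_action = np.array([3, 3, 1, 0])
expert_action = np.array([2, 4, 4, 2])
base_action = select_base_action(action, novice_action, expert_action)
```

Here `base_action` takes entries 0 and 3 from the novice and entries 1 and 2 from the expert.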

_get_obs() → Dict[str, Any]

Return the current observation for the coordination environment.

Returns:

A dictionary containing:

"base_obs": The current observation from the base environment.
"novice_hidden": Numpy array of hidden features from the novice policy (if open_novice).
"novice_logits": Numpy array of output logits from the novice policy (if open_novice).
"expert_hidden": Numpy array of hidden features from the expert policy (if open_expert).
"expert_logits": Numpy array of output logits from the expert policy (if open_expert).

Return type:

dict

Examples

>>> obs = env._get_obs()
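The conditional keys listed above can be pictured as follows. This is a sketch of the dictionary layout only: `build_obs` and its `(hidden, logits)` tuple inputs are hypothetical, and the array shapes are arbitrary.

```python
import numpy as np


def build_obs(base_obs, novice_out, expert_out, open_novice=True, open_expert=False):
    """Hypothetical: assemble the observation dict, exposing internals only when 'open'."""
    obs = {"base_obs": base_obs}
    if open_novice:
        obs["novice_hidden"], obs["novice_logits"] = novice_out
    if open_expert:
        obs["expert_hidden"], obs["expert_logits"] = expert_out
    return obs


novice_out = (np.zeros((2, 8)), np.zeros((2, 3)))  # (hidden, logits), made-up shapes
expert_out = (np.ones((2, 8)), np.ones((2, 3)))

default_obs = build_obs(np.zeros((2, 4)), novice_out, expert_out)
full_obs = build_obs(np.zeros((2, 4)), novice_out, expert_out, open_expert=True)
```

With the documented defaults (`open_novice=True`, `open_expert=False`), only the novice's internals appear alongside `"base_obs"`.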

_get_reward(base_reward: numpy.ndarray, action: numpy.ndarray, done: numpy.ndarray) → numpy.ndarray

Compute the reward for the current step, including costs for expert queries and agent switching.

Parameters:

base_reward (numpy.ndarray) – The base reward from the environment.
action (numpy.ndarray) – The action(s) taken (novice or expert).
done (numpy.ndarray) – Boolean flag(s) indicating whether the episode has ended for each environment.

Returns:

The computed reward(s) after applying costs.

Return type:

numpy.ndarray

Examples

>>> reward = env._get_reward(base_reward, action, done)
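A common way to apply such costs is to subtract the query cost wherever the expert acts and the switch cost wherever the chosen agent differs from the previous step. The sketch below assumes exactly that; `shaped_reward` and `prev_action` are hypothetical, and the handling of `done` is deliberately omitted because it is not described here.

```python
import numpy as np

NOVICE, EXPERT = 0, 1  # same values as the documented class constants


def shaped_reward(base_reward, action, prev_action, query_cost, switch_cost):
    """Hypothetical: base reward minus expert-query and agent-switch costs."""
    reward = np.asarray(base_reward, dtype=np.float64).copy()
    reward -= query_cost * (action == EXPERT)        # pay for each expert query
    reward -= switch_cost * (action != prev_action)  # pay for each agent switch
    return reward


reward = shaped_reward(
    base_reward=np.array([1.0, 0.0, 1.0]),
    action=np.array([NOVICE, EXPERT, EXPERT]),
    prev_action=np.array([NOVICE, NOVICE, EXPERT]),
    query_cost=0.02,
    switch_cost=0.01,
)
```

In this example the second environment pays both costs, while the third pays only the query cost because it stayed with the expert.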

close() → None

Close the coordination environment and release any resources held.

Return type: None

Examples

>>> env.close()

class duo_ai.core.environment.GeneralCoordEnv(config: CoordinationConfig, base_env: gymnasium.Env, novice: duo.core.Policy, expert: duo.core.Policy, open_novice: bool = True, open_expert: bool = False)

Bases: CoordEnv

Coordination environment supporting recurrent policies.

This class supports policies that maintain a hidden state across steps, but can be less efficient for stateless policies than CoordEnv.

Examples

>>> config = CoordinationConfig()
>>> base_env = gym.make(...)
>>> novice = ...
>>> expert = ...
>>> env = GeneralCoordEnv(config, base_env, novice, expert)

_compute_agents_action() → numpy.ndarray

Compute the actions for both novice and expert agents, supporting recurrent policies.

Returns: Array of actions to be passed to the base environment.
Return type: numpy.ndarray

Examples

>>> base_action = env._compute_agents_action()
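One reason a recurrent setup can cost more is that both policies may need to be stepped at every time step so their hidden states stay current, even when only one action is used. The toy sketch below illustrates that idea under this assumption; `CountingPolicy` and `compute_agents_action` are hypothetical and do not reflect the library's code.

```python
import numpy as np


class CountingPolicy:
    """Hypothetical recurrent policy: its hidden state counts the steps it has seen."""

    def __init__(self, num_envs, offset):
        self.hidden = np.zeros(num_envs, dtype=int)
        self.offset = offset

    def act(self, obs):
        self.hidden += 1  # the state advances on every call
        return self.hidden + self.offset


def compute_agents_action(novice, expert, obs):
    # Both policies are stepped, even if only one action is used per environment.
    return novice.act(obs), expert.act(obs)


novice = CountingPolicy(num_envs=2, offset=0)
expert = CountingPolicy(num_envs=2, offset=100)
for chosen in (np.array([0, 0]), np.array([0, 1])):
    novice_action, expert_action = compute_agents_action(novice, expert, obs=None)
    base_action = np.where(chosen == 1, expert_action, novice_action)
```

After two steps the expert's hidden state has advanced twice, although its action was selected only once.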

_compute_base_action(action: numpy.ndarray) → numpy.ndarray

Compute the environment-specific action for each agent, supporting recurrent policies.

Parameters: action (numpy.ndarray) – Array indicating which agent (novice or expert) acts for each environment.
Returns: Array of actions to be passed to the base environment.
Return type: numpy.ndarray

Examples

>>> base_action = env._compute_base_action(action)

_get_obs() → Dict[str, Any]

Return the current observation for the coordination environment, supporting recurrent policies.

Returns:

A dictionary containing:

"base_obs": The current observation from the base environment.
"novice_hidden": Numpy array of hidden features from the novice policy (if open_novice).
"novice_logits": Numpy array of output logits from the novice policy (if open_novice).
"expert_hidden": Numpy array of hidden features from the expert policy (if open_expert).
"expert_logits": Numpy array of output logits from the expert policy (if open_expert).

Return type:

dict

Examples

>>> obs = env._get_obs()