duo_ai.core.environment
Classes
CoordinationConfig | Configuration for coordination environment parameters.
CoordEnv | Environment for coordinating between novice and expert policies.
GeneralCoordEnv | Coordination environment supporting recurrent policies.
Module Contents
- class duo_ai.core.environment.CoordinationConfig
Configuration for coordination environment parameters.
- Parameters:
expert_query_cost_weight (float, optional) – The cost coefficient for querying the expert policy. Default is 0.4.
switch_agent_cost_weight (float, optional) – The cost coefficient for switching between agents. Default is 0.0.
temperature (float, optional) – The temperature parameter for action sampling. Default is 1.0.
Examples
>>> config = CoordinationConfig()
- expert_query_cost_weight: float = 0.4
- switch_agent_cost_weight: float = 0.0
- temperature: float = 1.0
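The role of `temperature` can be illustrated with a small sketch. The stand-in dataclass below only mirrors the documented fields and defaults, and `sample_actions` is a hypothetical helper that is not part of duo_ai. It assumes the temperature divides the logits before a softmax, which is the conventional meaning; the source does not state how the library applies it.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ConfigSketch:
    """Stand-in mirroring the documented fields; not the library class."""

    expert_query_cost_weight: float = 0.4
    switch_agent_cost_weight: float = 0.0
    temperature: float = 1.0


def sample_actions(logits, temperature, rng):
    """Sample one action per row from temperature-scaled logits (assumed usage)."""
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max(axis=-1, keepdims=True)  # stabilise the softmax
    probs = np.exp(scaled)
    probs /= probs.sum(axis=-1, keepdims=True)
    # Draw one categorical sample per environment.
    return np.array([rng.choice(len(p), p=p) for p in probs])


config = ConfigSketch(temperature=0.5)
rng = np.random.default_rng(0)
logits = np.array([[2.0, 0.5, 0.1], [0.0, 3.0, 0.0]])
actions = sample_actions(logits, config.temperature, rng)
```

Lower temperatures concentrate probability on the highest logit, while higher temperatures move the distribution toward uniform.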
- class duo_ai.core.environment.CoordEnv(config: CoordinationConfig, base_env: gymnasium.Env, novice: duo.core.Policy, expert: duo.core.Policy, open_novice: bool = True, open_expert: bool = False)
Bases: gymnasium.Env
Environment for coordinating between novice and expert policies.
This class wraps a base environment and enables switching between a novice and expert policy, applying costs for expert queries and agent switching.
Examples
>>> config = CoordinationConfig()
>>> base_env = gym.make(...)
>>> novice = ...
>>> expert = ...
>>> env = CoordEnv(config, base_env, novice, expert)
- config_cls
- NOVICE = 0
- EXPERT = 1
- config
- base_env
- novice
- expert
- open_novice = True
- open_expert = False
- action_space
- observation_space
- expert_query_cost_per_action = None
- switch_agent_cost_per_action = None
- property num_envs: int
Number of parallel environments.
- Returns:
Number of parallel environments.
- Return type:
int
Examples
>>> n = env.num_envs
- set_costs(base_penalty: float) -> None
Set the cost per action for expert queries and agent switching.
- Parameters:
base_penalty (float) – The per-action reward value used as the base for the expert-query and agent-switching costs.
- Return type:
None
Examples
>>> env.set_costs(0.05)
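One plausible reading of `set_costs`, given the weight fields in `CoordinationConfig`, is that each per-action cost is the corresponding weight multiplied by `base_penalty`. The sketch below assumes exactly that. The source does not state the formula, and `costs_from_penalty` is a hypothetical helper, not a library function.

```python
def costs_from_penalty(base_penalty, expert_query_cost_weight=0.4,
                       switch_agent_cost_weight=0.0):
    """Return (expert_query_cost_per_action, switch_agent_cost_per_action).

    Assumption: each cost is its config weight times base_penalty.
    """
    expert_query_cost = expert_query_cost_weight * base_penalty
    switch_agent_cost = switch_agent_cost_weight * base_penalty
    return expert_query_cost, switch_agent_cost


# With the documented default weights, only expert queries carry a cost.
query_cost, switch_cost = costs_from_penalty(0.05)
```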
- reset() -> Dict[str, Any]
Reset the coordination environment to an initial state.
- Returns:
- The initial observation of the environment, including:
"base_obs": The initial observation from the base environment.
"novice_hidden": NumPy array of hidden features from the novice policy (if open_novice).
"novice_logits": NumPy array of output logits from the novice policy (if open_novice).
"expert_hidden": NumPy array of hidden features from the expert policy (if open_expert).
"expert_logits": NumPy array of output logits from the expert policy (if open_expert).
- Return type:
dict
Examples
>>> obs = env.reset()
- _reset_agents(done: numpy.ndarray) -> None
Reset the internal state of the novice and expert agents.
- Parameters:
done (numpy.ndarray) – Boolean array indicating which episodes in a batch require a reset.
- Return type:
None
Examples
>>> env._reset_agents(np.array([True, False]))
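The boolean `done` mask selects which batch entries are reset. The following is a minimal sketch of that masking pattern, assuming the agent state is a NumPy array with one row per environment and that a reset means zeroing the row. Both are assumptions, and `reset_hidden` is a hypothetical helper.

```python
import numpy as np


def reset_hidden(hidden, done):
    """Zero the hidden-state rows of environments whose episode ended."""
    hidden = hidden.copy()  # leave the caller's array untouched
    hidden[done] = 0.0      # boolean mask indexes the batch dimension
    return hidden


hidden = np.ones((2, 3))
done = np.array([True, False])
new_hidden = reset_hidden(hidden, done)
```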
- step(action: numpy.ndarray) -> Tuple[Dict[str, Any], numpy.ndarray, numpy.ndarray, List[Dict[str, Any]]]
Advance the environment by one step using the provided action.
- Parameters:
action (numpy.ndarray) – Array indicating which agent acts in each environment (NOVICE = 0 or EXPERT = 1).
- Returns:
obs (dict) –
- The next observation of the environment, including:
"base_obs": The observation from the base environment.
"novice_hidden": NumPy array of hidden features from the novice policy (if open_novice).
"novice_logits": NumPy array of output logits from the novice policy (if open_novice).
"expert_hidden": NumPy array of hidden features from the expert policy (if open_expert).
"expert_logits": NumPy array of output logits from the expert policy (if open_expert).
reward (numpy.ndarray) – The reward(s) obtained from the environment after taking the action.
done (numpy.ndarray) – Boolean flag(s) indicating whether the episode has ended for each environment.
info (list of dict) – Additional information from the environment for each agent or environment instance.
- Raises:
Exception – Propagates any exceptions raised by the underlying environment’s step method.
Examples
>>> obs, reward, done, info = env.step(action)
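The `action` passed to `step` is a per-environment choice between the two agents. One simple way to produce it from an observation is a confidence threshold on the novice's logits, sketched below. The rule, the `choose_agent` helper, and the `threshold` value are all hypothetical illustrations rather than anything the library prescribes, and the sketch assumes `open_novice` is enabled so that "novice_logits" is present in the observation.

```python
import numpy as np

NOVICE, EXPERT = 0, 1  # same values as CoordEnv.NOVICE and CoordEnv.EXPERT


def choose_agent(novice_logits, threshold=0.7):
    """Pick EXPERT where the novice's top softmax probability is below threshold."""
    z = novice_logits - novice_logits.max(axis=-1, keepdims=True)
    probs = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    confident = probs.max(axis=-1) >= threshold
    return np.where(confident, NOVICE, EXPERT)


# Stand-in for obs["novice_logits"] with two parallel environments.
novice_logits = np.array([[4.0, 0.0, 0.0], [0.1, 0.0, 0.2]])
action = choose_agent(novice_logits)
```

Here the first environment keeps the novice because its distribution is sharply peaked, while the second defers to the expert because its logits are nearly flat.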
- _compute_base_action(action: numpy.ndarray) -> numpy.ndarray
Compute the environment-specific action for each agent.
- Parameters:
action (numpy.ndarray) – Array indicating which agent (novice or expert) acts for each environment.
- Returns:
Array of actions to be passed to the base environment.
- Return type:
numpy.ndarray
Examples
>>> base_action = env._compute_base_action(action)
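The mapping from agent choice to base-environment action can be sketched as an element-wise selection. This version assumes both agents' candidate actions are already available as arrays; the actual method may query only the selected policy, which the source does not specify. `select_base_action` is a hypothetical name.

```python
import numpy as np

NOVICE, EXPERT = 0, 1  # same values as CoordEnv.NOVICE and CoordEnv.EXPERT


def select_base_action(action, novice_action, expert_action):
    """Take the expert's action where action == EXPERT, else the novice's."""
    return np.where(action == EXPERT, expert_action, novice_action)


action = np.array([NOVICE, EXPERT, EXPERT])
novice_action = np.array([2, 2, 0])
expert_action = np.array([1, 3, 4])
base_action = select_base_action(action, novice_action, expert_action)
```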
- _get_obs() -> Dict[str, Any]
Return the current observation for the coordination environment.
- Returns:
- A dictionary containing:
"base_obs": The current observation from the base environment.
"novice_hidden": NumPy array of hidden features from the novice policy (if open_novice).
"novice_logits": NumPy array of output logits from the novice policy (if open_novice).
"expert_hidden": NumPy array of hidden features from the expert policy (if open_expert).
"expert_logits": NumPy array of output logits from the expert policy (if open_expert).
- Return type:
dict
Examples
>>> obs = env._get_obs()
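The conditional keys described above can be sketched as follows. `build_obs` is a hypothetical helper, and passing each policy's features as a `(hidden, logits)` pair is an assumption made for the illustration.

```python
import numpy as np


def build_obs(base_obs, novice_out, expert_out,
              open_novice=True, open_expert=False):
    """Assemble the observation dict, exposing only the opened agents."""
    obs = {"base_obs": base_obs}
    if open_novice:
        obs["novice_hidden"], obs["novice_logits"] = novice_out
    if open_expert:
        obs["expert_hidden"], obs["expert_logits"] = expert_out
    return obs


base_obs = np.zeros((2, 4))
novice_out = (np.ones((2, 8)), np.zeros((2, 3)))   # (hidden, logits)
expert_out = (np.ones((2, 8)), np.zeros((2, 3)))
obs = build_obs(base_obs, novice_out, expert_out)  # documented defaults
```

With the documented defaults (`open_novice=True`, `open_expert=False`), the expert's features are withheld from the observation.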
- _get_reward(base_reward: numpy.ndarray, action: numpy.ndarray, done: numpy.ndarray) -> numpy.ndarray
Compute the reward for the current step, including costs for expert queries and agent switching.
- Parameters:
base_reward (numpy.ndarray) – The base reward from the environment.
action (numpy.ndarray) – The action(s) taken (novice or expert).
done (numpy.ndarray) – Boolean flag(s) indicating whether the episode has ended for each environment.
- Returns:
The computed reward(s) after applying costs.
- Return type:
numpy.ndarray
Examples
>>> reward = env._get_reward(base_reward, action, done)
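A minimal sketch of how the two costs might enter the reward, assuming the query cost is subtracted whenever the expert acts and the switch cost whenever the chosen agent differs from the previous step. The exact formula is not given in the source. `coordination_reward` and its `prev_action` argument are hypothetical, and the handling of `done` is omitted because the source does not describe it.

```python
import numpy as np

NOVICE, EXPERT = 0, 1  # same values as CoordEnv.NOVICE and CoordEnv.EXPERT


def coordination_reward(base_reward, action, prev_action,
                        query_cost, switch_cost):
    """Subtract assumed per-action costs from the base reward."""
    reward = base_reward.astype(np.float64)          # astype returns a copy
    reward -= query_cost * (action == EXPERT)        # expert was queried
    reward -= switch_cost * (action != prev_action)  # acting agent changed
    return reward


base_reward = np.array([1.0, 1.0, 0.0])
action = np.array([NOVICE, EXPERT, EXPERT])
prev_action = np.array([NOVICE, NOVICE, EXPERT])
reward = coordination_reward(base_reward, action, prev_action,
                             query_cost=0.02, switch_cost=0.01)
```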
- class duo_ai.core.environment.GeneralCoordEnv(config: CoordinationConfig, base_env: gymnasium.Env, novice: duo.core.Policy, expert: duo.core.Policy, open_novice: bool = True, open_expert: bool = False)
Bases: CoordEnv
Coordination environment supporting recurrent policies.
This class supports policies that maintain a hidden state across steps, but can be less efficient for stateless policies than CoordEnv.
Examples
>>> config = CoordinationConfig()
>>> base_env = gym.make(...)
>>> novice = ...
>>> expert = ...
>>> env = GeneralCoordEnv(config, base_env, novice, expert)
- _compute_agents_action() -> numpy.ndarray
Compute the actions for both novice and expert agents, supporting recurrent policies.
- Returns:
Array of actions to be passed to the base environment.
- Return type:
numpy.ndarray
Examples
>>> base_action = env._compute_agents_action()
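Recurrent policies carry state from step to step, so a policy that is skipped on some steps would see an incomplete history. The sketch below illustrates the idea of advancing both agents on every step, which is one way to read the efficiency trade-off noted for this class. `ToyRecurrentPolicy` and `compute_agents_action` are hypothetical stand-ins, not the library's policy interface or implementation.

```python
import numpy as np


class ToyRecurrentPolicy:
    """Hypothetical policy whose hidden state counts the steps it has seen."""

    def __init__(self, num_envs, offset):
        self.hidden = np.zeros(num_envs)
        self.offset = offset

    def act(self, obs):
        self.hidden += 1  # state advances on every call
        return (obs + self.offset).astype(int)


def compute_agents_action(novice, expert, obs):
    """Call both policies so neither hidden state falls behind."""
    return novice.act(obs), expert.act(obs)


novice = ToyRecurrentPolicy(num_envs=2, offset=0)
expert = ToyRecurrentPolicy(num_envs=2, offset=10)
obs = np.array([1, 2])
for _ in range(3):
    novice_action, expert_action = compute_agents_action(novice, expert, obs)
```

After three steps both hidden states have advanced three times, regardless of which agent's action would be forwarded to the base environment.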
- _compute_base_action(action: numpy.ndarray) -> numpy.ndarray
Compute the environment-specific action for each agent, supporting recurrent policies.
- Parameters:
action (numpy.ndarray) – Array indicating which agent (novice or expert) acts for each environment.
- Returns:
Array of actions to be passed to the base environment.
- Return type:
numpy.ndarray
Examples
>>> base_action = env._compute_base_action(action)
- _get_obs() -> Dict[str, Any]
Return the current observation for the coordination environment, supporting recurrent policies.
- Returns:
- A dictionary containing:
"base_obs": The current observation from the base environment.
"novice_hidden": NumPy array of hidden features from the novice policy (if open_novice).
"novice_logits": NumPy array of output logits from the novice policy (if open_novice).
"expert_hidden": NumPy array of hidden features from the expert policy (if open_expert).
"expert_logits": NumPy array of output logits from the expert policy (if open_expert).
- Return type:
dict
Examples
>>> obs = env._get_obs()