duo_ai.core.environment
=======================

.. py:module:: duo_ai.core.environment


Classes
-------

.. autoapisummary::

   duo_ai.core.environment.CoordinationConfig
   duo_ai.core.environment.CoordEnv
   duo_ai.core.environment.GeneralCoordEnv


Module Contents
---------------

.. py:class:: CoordinationConfig

   Configuration for coordination environment parameters.

   :param expert_query_cost_weight: The cost coefficient for querying the expert policy. Default is 0.4.
   :type expert_query_cost_weight: float, optional
   :param switch_agent_cost_weight: The cost coefficient for switching between agents. Default is 0.0.
   :type switch_agent_cost_weight: float, optional
   :param temperature: The temperature parameter for action sampling. Default is 1.0.
   :type temperature: float, optional

   .. rubric:: Examples

   >>> config = CoordinationConfig()


   .. py:attribute:: expert_query_cost_weight
      :type:  float
      :value: 0.4


   .. py:attribute:: switch_agent_cost_weight
      :type:  float
      :value: 0.0


   .. py:attribute:: temperature
      :type:  float
      :value: 1.0


.. py:class:: CoordEnv(config: CoordinationConfig, base_env: gymnasium.Env, novice: duo.core.Policy, expert: duo.core.Policy, open_novice: bool = True, open_expert: bool = False)

   Bases: :py:obj:`gymnasium.Env`


   Environment for coordinating between novice and expert policies.

   This class wraps a base environment and enables switching between a novice and expert policy,
   applying costs for expert queries and agent switching.

   .. rubric:: Examples

   >>> config = CoordinationConfig()
   >>> base_env = gym.make(...)
   >>> novice = ...
   >>> expert = ...
   >>> env = CoordEnv(config, base_env, novice, expert)


   .. py:attribute:: config_cls


   .. py:attribute:: NOVICE
      :value: 0


   .. py:attribute:: EXPERT
      :value: 1


   .. py:attribute:: config


   .. py:attribute:: base_env


   .. py:attribute:: novice


   .. py:attribute:: expert


   .. py:attribute:: open_novice
      :value: True


   .. py:attribute:: open_expert
      :value: False


   .. py:attribute:: action_space


   .. py:attribute:: observation_space


   .. py:attribute:: expert_query_cost_per_action
      :value: None


   .. py:attribute:: switch_agent_cost_per_action
      :value: None


   .. py:property:: num_envs
      :type: int


      Number of parallel environments.

      :returns: Number of parallel environments.
      :rtype: int

      .. rubric:: Examples

      >>> n = env.num_envs


   .. py:method:: set_costs(base_penalty: float) -> None

      Set the cost per action for expert queries and agent switching.

      :param base_penalty: The reward value per action.
      :type base_penalty: float

      :rtype: None

      .. rubric:: Examples

      >>> env.set_costs(0.05)


   .. py:method:: reset() -> Dict[str, Any]

      Reset the coordination environment to an initial state.

      :returns: The initial observation of the environment, including:

                - "base_obs": The initial observation from the base environment.
                - "novice_hidden": Numpy array of hidden features from the novice policy.
                - "novice_logits": Numpy array of output logits from the novice policy.
                - "expert_hidden": Numpy array of hidden features from the expert policy (if open_expert).
                - "expert_logits": Numpy array of output logits from the expert policy (if open_expert).
      :rtype: dict

      .. rubric:: Examples

      >>> obs = env.reset()


   .. py:method:: _reset_agents(done: numpy.ndarray) -> None

      Reset the internal state of the novice and expert agents.

      :param done: Boolean array indicating which episodes in a batch require a reset.
      :type done: numpy.ndarray

      :rtype: None

      .. rubric:: Examples

      >>> env._reset_agents(np.array([True, False]))


   .. py:method:: step(action: numpy.ndarray) -> Tuple[Dict[str, Any], numpy.ndarray, numpy.ndarray, List[Dict[str, Any]]]

      Advance the environment by one step using the provided action.

      :param action: The action(s) to take in the environment. Should be a numpy array indicating which agent acts.
      :type action: numpy.ndarray

      :returns: * **obs** (*dict*) -- The next observation of the environment, including:

                  - "base_obs": The observation from the base environment.
                  - "novice_hidden": Numpy array of hidden features from the novice policy.
                  - "novice_logits": Numpy array of output logits from the novice policy.
                  - "expert_hidden": Numpy array of hidden features from the expert policy (if open_expert).
                  - "expert_logits": Numpy array of output logits from the expert policy (if open_expert).

                * **reward** (*numpy.ndarray*) -- The reward(s) obtained from the environment after taking the action.
                * **done** (*numpy.ndarray*) -- Boolean flag(s) indicating whether the episode has ended for each environment.
                * **info** (*list of dict*) -- Additional information from the environment for each agent or environment instance.

      :raises Exception: Propagates any exceptions raised by the underlying environment's `step` method.

      .. rubric:: Examples

      >>> obs, reward, done, info = env.step(action)


   .. py:method:: _compute_base_action(action: numpy.ndarray) -> numpy.ndarray

      Compute the environment-specific action for each agent.

      :param action: Array indicating which agent (novice or expert) acts for each environment.
      :type action: numpy.ndarray

      :returns: Array of actions to be passed to the base environment.
      :rtype: numpy.ndarray

      .. rubric:: Examples

      >>> base_action = env._compute_base_action(action)


   .. py:method:: _get_obs() -> Dict[str, Any]

      Return the current observation for the coordination environment.

      :returns: A dictionary containing:

                - "base_obs": The current observation from the base environment.
                - "novice_hidden": Numpy array of hidden features from the novice policy (if open_novice).
                - "novice_logits": Numpy array of output logits from the novice policy (if open_novice).
                - "expert_hidden": Numpy array of hidden features from the expert policy (if open_expert).
                - "expert_logits": Numpy array of output logits from the expert policy (if open_expert).
      :rtype: dict

      .. rubric:: Examples

      >>> obs = env._get_obs()


   .. py:method:: _get_reward(base_reward: numpy.ndarray, action: numpy.ndarray, done: numpy.ndarray) -> numpy.ndarray

      Compute the reward for the current step, including costs for expert queries and agent switching.

      :param base_reward: The base reward from the environment.
      :type base_reward: numpy.ndarray
      :param action: The action(s) taken (novice or expert).
      :type action: numpy.ndarray
      :param done: Boolean flag(s) indicating whether the episode has ended for each environment.
      :type done: numpy.ndarray

      :returns: The computed reward(s) after applying costs.
      :rtype: numpy.ndarray

      .. rubric:: Examples

      >>> reward = env._get_reward(base_reward, action, done)


   .. py:method:: close() -> None

      Close the coordination environment and release any resources held.

      :rtype: None

      .. rubric:: Examples

      >>> env.close()


.. py:class:: GeneralCoordEnv(config: CoordinationConfig, base_env: gymnasium.Env, novice: duo.core.Policy, expert: duo.core.Policy, open_novice: bool = True, open_expert: bool = False)

   Bases: :py:obj:`CoordEnv`


   Coordination environment supporting recurrent policies.

   This class supports policies that maintain a hidden state across steps, but can be less efficient
   for stateless policies than `CoordEnv`.

   .. rubric:: Examples

   >>> config = CoordinationConfig()
   >>> base_env = gym.make(...)
   >>> novice = ...
   >>> expert = ...
   >>> env = GeneralCoordEnv(config, base_env, novice, expert)


   .. py:method:: _compute_agents_action() -> numpy.ndarray

      Compute the actions for both novice and expert agents, supporting recurrent policies.

      :returns: Array of actions to be passed to the base environment.
      :rtype: numpy.ndarray

      .. rubric:: Examples

      >>> base_action = env._compute_agents_action()


   .. py:method:: _compute_base_action(action: numpy.ndarray) -> numpy.ndarray

      Compute the environment-specific action for each agent, supporting recurrent policies.

      :param action: Array indicating which agent (novice or expert) acts for each environment.
      :type action: numpy.ndarray

      :returns: Array of actions to be passed to the base environment.
      :rtype: numpy.ndarray

      .. rubric:: Examples

      >>> base_action = env._compute_base_action(action)


   .. py:method:: _get_obs() -> Dict[str, Any]

      Return the current observation for the coordination environment, supporting recurrent policies.

      :returns: A dictionary containing:

                - "base_obs": The current observation from the base environment.
                - "novice_hidden": Numpy array of hidden features from the novice policy (if open_novice).
                - "novice_logits": Numpy array of output logits from the novice policy (if open_novice).
                - "expert_hidden": Numpy array of hidden features from the expert policy (if open_expert).
                - "expert_logits": Numpy array of output logits from the expert policy (if open_expert).
      :rtype: dict

      .. rubric:: Examples

      >>> obs = env._get_obs()
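
The reference above says that ``set_costs`` sets per-action costs and that ``_get_reward`` applies costs for expert queries and agent switching, but it does not give the formulas. The following is a minimal, dependency-free sketch of one plausible reading, not the library's implementation. It assumes that each per-action cost is the corresponding config weight multiplied by ``base_penalty``, and that costs are subtracted from the base reward. The helpers ``per_action_costs`` and ``coordination_reward``, the ``prev_action`` argument, and the local ``CoordinationConfig`` stand-in are hypothetical names introduced for illustration, and plain lists replace numpy arrays.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Same values as the documented CoordEnv.NOVICE / CoordEnv.EXPERT attributes.
NOVICE, EXPERT = 0, 1


@dataclass
class CoordinationConfig:
    """Local stand-in that mirrors the documented defaults."""
    expert_query_cost_weight: float = 0.4
    switch_agent_cost_weight: float = 0.0
    temperature: float = 1.0


def per_action_costs(config: CoordinationConfig, base_penalty: float) -> Tuple[float, float]:
    # ASSUMPTION: per-action cost = weight * base_penalty (not confirmed by the docs).
    expert_cost = config.expert_query_cost_weight * base_penalty
    switch_cost = config.switch_agent_cost_weight * base_penalty
    return expert_cost, switch_cost


def coordination_reward(
    base_reward: List[float],
    action: List[int],
    prev_action: List[int],
    expert_cost: float,
    switch_cost: float,
) -> List[float]:
    # ASSUMPTION: costs are subtracted from the base reward, per environment.
    rewards = []
    for r, a, p in zip(base_reward, action, prev_action):
        if a == EXPERT:
            r -= expert_cost   # the expert was queried on this step
        if a != p:
            r -= switch_cost   # the acting agent changed since the last step
        rewards.append(r)
    return rewards


# Three parallel environments; a non-zero switch weight is chosen for illustration.
config = CoordinationConfig(switch_agent_cost_weight=0.1)
expert_cost, switch_cost = per_action_costs(config, base_penalty=0.5)
rewards = coordination_reward(
    base_reward=[1.0, 1.0, 1.0],
    action=[NOVICE, EXPERT, EXPERT],
    prev_action=[NOVICE, NOVICE, EXPERT],
    expert_cost=expert_cost,
    switch_cost=switch_cost,
)
print(rewards)
```

Under these assumptions, the first environment keeps its full base reward, the second pays both the expert-query and the switch cost, and the third pays only the expert-query cost. The actual ``_get_reward`` also receives ``done``, whose role is not specified in the reference and is left out of this sketch.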