duo_ai.core.environment
=======================

.. py:module:: duo_ai.core.environment


Classes
-------

.. autoapisummary::

   duo_ai.core.environment.CoordinationConfig
   duo_ai.core.environment.CoordEnv
   duo_ai.core.environment.GeneralCoordEnv


Module Contents
---------------

.. py:class:: CoordinationConfig

   Configuration for coordination environment parameters.

   :param expert_query_cost_weight: The cost coefficient for querying the expert policy. Default is 0.4.
   :type expert_query_cost_weight: float, optional
   :param switch_agent_cost_weight: The cost coefficient for switching between agents. Default is 0.0.
   :type switch_agent_cost_weight: float, optional
   :param temperature: The temperature parameter for action sampling. Default is 1.0.
   :type temperature: float, optional

   .. rubric:: Examples

   >>> from duo_ai.core.environment import CoordinationConfig
   >>> config = CoordinationConfig()
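
   As a rough illustration of what ``temperature`` usually controls, the sketch below
   pairs a hypothetical stand-in dataclass (``SketchConfig``, not the real class) that
   mirrors the documented defaults with a temperature-scaled softmax. Whether
   ``duo_ai`` applies the temperature in exactly this way is an assumption.

   .. code-block:: python

      from dataclasses import dataclass

      import numpy as np


      @dataclass
      class SketchConfig:
          """Hypothetical stand-in mirroring the documented defaults."""

          expert_query_cost_weight: float = 0.4
          switch_agent_cost_weight: float = 0.0
          temperature: float = 1.0


      def action_probs(logits: np.ndarray, temperature: float) -> np.ndarray:
          """Softmax over the last axis of ``logits / temperature``."""
          scaled = logits / temperature
          # Subtract the row maximum for numerical stability.
          scaled = scaled - scaled.max(axis=-1, keepdims=True)
          exp = np.exp(scaled)
          return exp / exp.sum(axis=-1, keepdims=True)


      logits = np.array([[2.0, 1.0, 0.0]])
      # A lower temperature concentrates probability on the largest logit.
      sharp = action_probs(logits, SketchConfig(temperature=0.5).temperature)
      flat = action_probs(logits, SketchConfig(temperature=2.0).temperature)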


   .. py:attribute:: expert_query_cost_weight
      :type:  float
      :value: 0.4


   .. py:attribute:: switch_agent_cost_weight
      :type:  float
      :value: 0.0


   .. py:attribute:: temperature
      :type:  float
      :value: 1.0


.. py:class:: CoordEnv(config: CoordinationConfig, base_env: gymnasium.Env, novice: duo.core.Policy, expert: duo.core.Policy, open_novice: bool = True, open_expert: bool = False)

   Bases: :py:obj:`gymnasium.Env`


   Environment for coordinating between novice and expert policies.

   This class wraps a base environment and enables switching between a novice and expert policy,
   applying costs for expert queries and agent switching.

   .. rubric:: Examples

   >>> import gymnasium as gym
   >>> from duo_ai.core.environment import CoordinationConfig, CoordEnv
   >>> config = CoordinationConfig()
   >>> base_env = gym.make(...)
   >>> novice = ...
   >>> expert = ...
   >>> env = CoordEnv(config, base_env, novice, expert)


   .. py:attribute:: config_cls


   .. py:attribute:: NOVICE
      :value: 0


   .. py:attribute:: EXPERT
      :value: 1


   .. py:attribute:: config


   .. py:attribute:: base_env


   .. py:attribute:: novice


   .. py:attribute:: expert


   .. py:attribute:: open_novice
      :value: True


   .. py:attribute:: open_expert
      :value: False


   .. py:attribute:: action_space


   .. py:attribute:: observation_space


   .. py:attribute:: expert_query_cost_per_action
      :value: None


   .. py:attribute:: switch_agent_cost_per_action
      :value: None


   .. py:property:: num_envs
      :type: int


      Number of parallel environments.

      :returns: Number of parallel environments.
      :rtype: int

      .. rubric:: Examples

      >>> n = env.num_envs


   .. py:method:: set_costs(base_penalty: float) -> None

      Set the cost per action for expert queries and agent switching.

      :param base_penalty: The base per-action reward value used to set ``expert_query_cost_per_action`` and ``switch_agent_cost_per_action``.
      :type base_penalty: float

      :rtype: None

      .. rubric:: Examples

      >>> env.set_costs(0.05)
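
      The exact scaling rule is not stated here. The sketch below shows one plausible
      reading, assuming each per-action cost is the matching weight from the
      configuration multiplied by ``base_penalty``. ``SketchConfig`` and
      ``sketch_costs`` are hypothetical names used only for illustration.

      .. code-block:: python

         from dataclasses import dataclass


         @dataclass
         class SketchConfig:
             """Hypothetical stand-in holding the two documented cost weights."""

             expert_query_cost_weight: float = 0.4
             switch_agent_cost_weight: float = 0.0


         def sketch_costs(config: SketchConfig, base_penalty: float) -> tuple:
             """Assumed rule: per-action cost = weight * base_penalty."""
             expert_cost = config.expert_query_cost_weight * base_penalty
             switch_cost = config.switch_agent_cost_weight * base_penalty
             return expert_cost, switch_cost


         expert_cost, switch_cost = sketch_costs(SketchConfig(), base_penalty=0.05)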


   .. py:method:: reset() -> Dict[str, Any]

      Reset the coordination environment to an initial state.

      :returns:

                The initial observation of the environment, including:
                    - "base_obs": The initial observation from the base environment.
                    - "novice_hidden": Numpy array of hidden features from the novice policy (if open_novice).
                    - "novice_logits": Numpy array of output logits from the novice policy (if open_novice).
                    - "expert_hidden": Numpy array of hidden features from the expert policy (if open_expert).
                    - "expert_logits": Numpy array of output logits from the expert policy (if open_expert).
      :rtype: dict

      .. rubric:: Examples

      >>> obs = env.reset()


   .. py:method:: _reset_agents(done: numpy.ndarray) -> None

      Reset the internal state of the novice and expert agents.

      :param done: Boolean array indicating which episodes in a batch require a reset.
      :type done: numpy.ndarray

      :rtype: None

      .. rubric:: Examples

      >>> import numpy as np
      >>> env._reset_agents(np.array([True, False]))
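
      How the agents store their internal state is not documented here. As a minimal
      sketch, assuming a recurrent state kept as one array row per parallel
      environment, a boolean ``done`` mask can clear only the finished episodes.
      ``reset_hidden`` is a hypothetical helper, not part of ``duo_ai``.

      .. code-block:: python

         import numpy as np


         def reset_hidden(hidden: np.ndarray, done: np.ndarray) -> np.ndarray:
             """Return a copy of ``hidden`` with rows zeroed where ``done`` is True."""
             hidden = hidden.copy()
             hidden[done] = 0.0  # boolean mask selects the finished episodes
             return hidden


         hidden = np.ones((2, 3))
         new_hidden = reset_hidden(hidden, np.array([True, False]))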


   .. py:method:: step(action: numpy.ndarray) -> Tuple[Dict[str, Any], numpy.ndarray, numpy.ndarray, List[Dict[str, Any]]]

      Advance the environment by one step using the provided action.

      :param action: Array with one entry per parallel environment selecting which agent acts: ``NOVICE`` (0) or ``EXPERT`` (1).
      :type action: numpy.ndarray

      :returns: * **obs** (*dict*) --

                  The next observation of the environment, including:
                      - "base_obs": The observation from the base environment.
                      - "novice_hidden": Numpy array of hidden features from the novice policy (if open_novice).
                      - "novice_logits": Numpy array of output logits from the novice policy (if open_novice).
                      - "expert_hidden": Numpy array of hidden features from the expert policy (if open_expert).
                      - "expert_logits": Numpy array of output logits from the expert policy (if open_expert).
                * **reward** (*numpy.ndarray*) -- The reward(s) obtained from the environment after taking the action.
                * **done** (*numpy.ndarray*) -- Boolean flag(s) indicating whether the episode has ended for each environment.
                * **info** (*list of dict*) -- Additional information from the environment, one dictionary per environment instance.

      :raises Exception: Propagates any exceptions raised by the underlying environment's ``step`` method.

      .. rubric:: Examples

      >>> obs, reward, done, info = env.step(action)
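
      A rollout loop over the documented four-tuple might look like the sketch below.
      ``ToyCoordEnv`` is a hypothetical stand-in that only mimics the documented
      ``reset``/``step`` return shapes, and its rewards are made up. The random
      choice between agents stands in for a learned coordination policy.

      .. code-block:: python

         import numpy as np

         NOVICE, EXPERT = 0, 1  # documented values of CoordEnv.NOVICE / CoordEnv.EXPERT


         class ToyCoordEnv:
             """Hypothetical stand-in with the documented reset/step signatures."""

             def __init__(self, num_envs: int = 2, horizon: int = 3):
                 self.num_envs = num_envs
                 self.horizon = horizon
                 self.t = 0

             def _obs(self) -> dict:
                 return {"base_obs": np.zeros((self.num_envs, 4))}

             def reset(self) -> dict:
                 self.t = 0
                 return self._obs()

             def step(self, action: np.ndarray):
                 self.t += 1
                 # Made-up rewards: acting with the expert earns less.
                 reward = np.where(action == EXPERT, 0.5, 1.0)
                 done = np.full(self.num_envs, self.t >= self.horizon)
                 info = [{} for _ in range(self.num_envs)]
                 return self._obs(), reward, done, info


         env = ToyCoordEnv()
         rng = np.random.default_rng(0)
         obs = env.reset()
         total_reward = np.zeros(env.num_envs)
         done = np.zeros(env.num_envs, dtype=bool)
         while not done.all():
             # One coordination decision per parallel environment.
             action = rng.integers(low=NOVICE, high=EXPERT + 1, size=env.num_envs)
             obs, reward, done, info = env.step(action)
             total_reward += reward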


   .. py:method:: _compute_base_action(action: numpy.ndarray) -> numpy.ndarray

      Compute the base-environment action for each parallel environment, taken from whichever agent ``action`` selects.

      :param action: Array indicating which agent (novice or expert) acts for each environment.
      :type action: numpy.ndarray

      :returns: Array of actions to be passed to the base environment.
      :rtype: numpy.ndarray

      .. rubric:: Examples

      >>> base_action = env._compute_base_action(action)
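
      Conceptually, this is a per-environment selection between two candidate
      actions. The sketch below assumes both agents have already proposed an action
      for every environment. ``select_base_action`` and its arguments are
      hypothetical, and the real method may query only the selected agent.

      .. code-block:: python

         import numpy as np

         NOVICE, EXPERT = 0, 1  # documented values of CoordEnv.NOVICE / CoordEnv.EXPERT


         def select_base_action(
             action: np.ndarray,
             novice_action: np.ndarray,
             expert_action: np.ndarray,
         ) -> np.ndarray:
             """Pick the expert's action where ``action == EXPERT``, else the novice's."""
             return np.where(action == EXPERT, expert_action, novice_action)


         action = np.array([NOVICE, EXPERT, EXPERT])
         novice_action = np.array([3, 3, 3])
         expert_action = np.array([7, 7, 7])
         base_action = select_base_action(action, novice_action, expert_action)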


   .. py:method:: _get_obs() -> Dict[str, Any]

      Return the current observation for the coordination environment.

      :returns:

                A dictionary containing:
                    - "base_obs": The current observation from the base environment.
                    - "novice_hidden": Numpy array of hidden features from the novice policy (if open_novice).
                    - "novice_logits": Numpy array of output logits from the novice policy (if open_novice).
                    - "expert_hidden": Numpy array of hidden features from the expert policy (if open_expert).
                    - "expert_logits": Numpy array of output logits from the expert policy (if open_expert).
      :rtype: dict

      .. rubric:: Examples

      >>> obs = env._get_obs()
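
      The conditional keys listed above can be pictured with the following sketch.
      ``build_obs`` is a hypothetical helper, and passing each agent's output as a
      ``(hidden, logits)`` pair is an assumption made for illustration.

      .. code-block:: python

         import numpy as np


         def build_obs(base_obs, novice_out, expert_out, open_novice=True, open_expert=False):
             """Assemble the observation dict, adding agent features only when opened."""
             obs = {"base_obs": base_obs}
             if open_novice:
                 obs["novice_hidden"] = novice_out[0]
                 obs["novice_logits"] = novice_out[1]
             if open_expert:
                 obs["expert_hidden"] = expert_out[0]
                 obs["expert_logits"] = expert_out[1]
             return obs


         features = (np.zeros((2, 8)), np.zeros((2, 4)))  # (hidden, logits) placeholders
         sketch_obs = build_obs(np.zeros((2, 4)), features, features)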


   .. py:method:: _get_reward(base_reward: numpy.ndarray, action: numpy.ndarray, done: numpy.ndarray) -> numpy.ndarray

      Compute the reward for the current step, including costs for expert queries and agent switching.

      :param base_reward: The base reward from the environment.
      :type base_reward: numpy.ndarray
      :param action: Array selecting the agent that acted in each environment (``NOVICE`` or ``EXPERT``).
      :type action: numpy.ndarray
      :param done: Boolean flag(s) indicating whether the episode has ended for each environment.
      :type done: numpy.ndarray

      :returns: The computed reward(s) after applying costs.
      :rtype: numpy.ndarray

      .. rubric:: Examples

      >>> reward = env._get_reward(base_reward, action, done)
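
      The precise formula is not given here. One plausible form, shown as a sketch
      under stated assumptions, subtracts the expert-query cost whenever the expert
      acts and the switching cost whenever the acting agent changes. ``sketch_reward``
      and ``prev_action`` are hypothetical, and the role of ``done`` is left out
      because it is not specified.

      .. code-block:: python

         import numpy as np

         NOVICE, EXPERT = 0, 1  # documented values of CoordEnv.NOVICE / CoordEnv.EXPERT


         def sketch_reward(base_reward, action, prev_action, expert_cost, switch_cost):
             """Assumed rule: base reward minus query and switching penalties."""
             reward = base_reward.astype(float)  # astype returns a copy by default
             reward -= expert_cost * (action == EXPERT)
             reward -= switch_cost * (action != prev_action)
             return reward


         base_reward = np.array([1.0, 1.0, 1.0])
         action = np.array([NOVICE, EXPERT, EXPERT])
         prev_action = np.array([NOVICE, NOVICE, EXPERT])
         shaped = sketch_reward(base_reward, action, prev_action, 0.02, 0.01)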


   .. py:method:: close() -> None

      Close the coordination environment and release any resources held.

      :rtype: None

      .. rubric:: Examples

      >>> env.close()


.. py:class:: GeneralCoordEnv(config: CoordinationConfig, base_env: gymnasium.Env, novice: duo.core.Policy, expert: duo.core.Policy, open_novice: bool = True, open_expert: bool = False)

   Bases: :py:obj:`CoordEnv`


   Coordination environment supporting recurrent policies.

   This class supports policies that maintain a hidden state across steps. For stateless policies,
   it can be less efficient than :py:class:`CoordEnv`.

   .. rubric:: Examples

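   >>> import gymnasium as gym
   >>> from duo_ai.core.environment import CoordinationConfig, GeneralCoordEnv
   >>> config = CoordinationConfig()
   >>> base_env = gym.make(...)
   >>> novice = ...
   >>> expert = ...
   >>> env = GeneralCoordEnv(config, base_env, novice, expert)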


   .. py:method:: _compute_agents_action() -> numpy.ndarray

      Compute the actions for both novice and expert agents, supporting recurrent policies.

      :returns: Array of actions to be passed to the base environment.
      :rtype: numpy.ndarray

      .. rubric:: Examples

      >>> base_action = env._compute_agents_action()


   .. py:method:: _compute_base_action(action: numpy.ndarray) -> numpy.ndarray

      Compute the environment-specific action for each agent, supporting recurrent policies.

      :param action: Array indicating which agent (novice or expert) acts for each environment.
      :type action: numpy.ndarray

      :returns: Array of actions to be passed to the base environment.
      :rtype: numpy.ndarray

      .. rubric:: Examples

      >>> base_action = env._compute_base_action(action)


   .. py:method:: _get_obs() -> Dict[str, Any]

      Return the current observation for the coordination environment, supporting recurrent policies.

      :returns:

                A dictionary containing:
                    - "base_obs": The current observation from the base environment.
                    - "novice_hidden": Numpy array of hidden features from the novice policy (if open_novice).
                    - "novice_logits": Numpy array of output logits from the novice policy (if open_novice).
                    - "expert_hidden": Numpy array of hidden features from the expert policy (if open_expert).
                    - "expert_logits": Numpy array of output logits from the expert policy (if open_expert).
      :rtype: dict

      .. rubric:: Examples

      >>> obs = env._get_obs()