duo_ai.core
===========

.. py:module:: duo_ai.core


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/duo_ai/core/algorithm/index
   /autoapi/duo_ai/core/config/index
   /autoapi/duo_ai/core/environment/index
   /autoapi/duo_ai/core/evaluator/index
   /autoapi/duo_ai/core/policy/index


Classes
-------

.. autoapisummary::

   duo_ai.core.Algorithm
   duo_ai.core.CoordEnv
   duo_ai.core.Evaluator
   duo_ai.core.Policy


Package Contents
----------------

.. py:class:: Algorithm

   Bases: :py:obj:`abc.ABC`


   Abstract base class for all algorithms in the Duo framework.

   This class defines the interface that all algorithm implementations must follow.

   .. rubric:: Examples

   >>> class MyAlgorithm(Algorithm):
   ...     def train(self, *args, **kwargs):
   ...         pass


   .. py:method:: train(*args, **kwargs) -> None
      :abstractmethod:


      Train the model or algorithm using the provided arguments.

      :param \*args: Variable length argument list.
      :param \*\*kwargs: Arbitrary keyword arguments.

      :rtype: None

      .. rubric:: Examples

      >>> algo.train(data)

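      A minimal sketch of a concrete subclass, assuming nothing beyond the
      abstract ``train`` signature documented above. The ``Algorithm`` class
      below is a local stand-in so the snippet runs without ``duo_ai``
      installed; ``CountingAlgorithm`` and its ``steps`` attribute are
      hypothetical names used only for illustration.

      .. code-block:: python

         from abc import ABC, abstractmethod


         class Algorithm(ABC):
             """Local stand-in mirroring the documented interface."""

             @abstractmethod
             def train(self, *args, **kwargs) -> None:
                 ...


         class CountingAlgorithm(Algorithm):
             """Hypothetical subclass that only counts training calls."""

             def __init__(self) -> None:
                 self.steps = 0

             def train(self, *args, **kwargs) -> None:
                 # A real implementation would update a policy here.
                 self.steps += 1


         algo = CountingAlgorithm()
         algo.train([1, 2, 3])

         # The abstract base itself cannot be instantiated.
         try:
             Algorithm()
             abstract_instantiable = True
         except TypeError:
             abstract_instantiable = False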

.. py:class:: CoordEnv(config: CoordinationConfig, base_env: gymnasium.Env, novice: duo_ai.core.Policy, expert: duo_ai.core.Policy, open_novice: bool = True, open_expert: bool = False)

   Bases: :py:obj:`gymnasium.Env`


   Environment for coordinating between novice and expert policies.

   This class wraps a base environment and enables switching between a novice and expert policy,
   applying costs for expert queries and agent switching.

   .. rubric:: Examples

   >>> config = CoordinationConfig()
   >>> base_env = gym.make(...)
   >>> novice = ...
   >>> expert = ...
   >>> env = CoordEnv(config, base_env, novice, expert)


   .. py:attribute:: config_cls


   .. py:attribute:: NOVICE
      :value: 0


   .. py:attribute:: EXPERT
      :value: 1


   .. py:attribute:: config


   .. py:attribute:: base_env


   .. py:attribute:: novice


   .. py:attribute:: expert


   .. py:attribute:: open_novice
      :value: True


   .. py:attribute:: open_expert
      :value: False


   .. py:attribute:: action_space


   .. py:attribute:: observation_space


   .. py:attribute:: expert_query_cost_per_action
      :value: None


   .. py:attribute:: switch_agent_cost_per_action
      :value: None


   .. py:property:: num_envs
      :type: int


      Number of parallel environments.

      :returns: Number of parallel environments.
      :rtype: int

      .. rubric:: Examples

      >>> n = env.num_envs


   .. py:method:: set_costs(base_penalty: float) -> None

      Set the cost per action for expert queries and agent switching.

      :param base_penalty: Base penalty from which the per-action costs for expert queries and agent switching are derived.
      :type base_penalty: float

      :rtype: None

      .. rubric:: Examples

      >>> env.set_costs(0.05)


   .. py:method:: reset() -> Dict[str, Any]

      Reset the coordination environment to an initial state.

      :returns:

                The initial observation of the environment, including:
                    - "base_obs": The initial observation from the base environment.
                    - "novice_hidden": Numpy array of hidden features from the novice policy (if open_novice).
                    - "novice_logits": Numpy array of output logits from the novice policy (if open_novice).
                    - "expert_hidden": Numpy array of hidden features from the expert policy (if open_expert).
                    - "expert_logits": Numpy array of output logits from the expert policy (if open_expert).
      :rtype: dict

      .. rubric:: Examples

      >>> obs = env.reset()


   .. py:method:: _reset_agents(done: numpy.ndarray) -> None

      Reset the internal state of the novice and expert agents.

      :param done: Boolean array indicating which episodes in a batch require a reset.
      :type done: numpy.ndarray

      :rtype: None

      .. rubric:: Examples

      >>> env._reset_agents(np.array([True, False]))


   .. py:method:: step(action: numpy.ndarray) -> Tuple[Dict[str, Any], numpy.ndarray, numpy.ndarray, List[Dict[str, Any]]]

      Advance the environment by one step using the provided action.

      :param action: Numpy array with one entry per environment indicating which agent acts (``CoordEnv.NOVICE`` = 0 or ``CoordEnv.EXPERT`` = 1).
      :type action: numpy.ndarray

      :returns: * **obs** (*dict*) --

                  The next observation of the environment, including:
                      - "base_obs": The observation from the base environment.
                      - "novice_hidden": Numpy array of hidden features from the novice policy (if open_novice).
                      - "novice_logits": Numpy array of output logits from the novice policy (if open_novice).
                      - "expert_hidden": Numpy array of hidden features from the expert policy (if open_expert).
                      - "expert_logits": Numpy array of output logits from the expert policy (if open_expert).
                * **reward** (*numpy.ndarray*) -- The reward(s) obtained from the environment after taking the action.
                * **done** (*numpy.ndarray*) -- Boolean flag(s) indicating whether the episode has ended for each environment.
                * **info** (*list of dict*) -- Additional information from the environment for each agent or environment instance.

      :raises Exception: Propagates any exceptions raised by the underlying environment's `step` method.

      .. rubric:: Examples

      >>> obs, reward, done, info = env.step(action)


   .. py:method:: _compute_base_action(action: numpy.ndarray) -> numpy.ndarray

      Compute the action to pass to the base environment, for each environment, from the agent selected by ``action``.

      :param action: Array indicating which agent (novice or expert) acts for each environment.
      :type action: numpy.ndarray

      :returns: Array of actions to be passed to the base environment.
      :rtype: numpy.ndarray

      .. rubric:: Examples

      >>> base_action = env._compute_base_action(action)

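      The selection step can be sketched as follows. This is an illustration
      only: how ``CoordEnv`` obtains each policy's proposed action is not
      documented here, so the sketch assumes both proposals are already
      available as arrays. ``select_base_action``, ``novice_action`` and
      ``expert_action`` are hypothetical names.

      .. code-block:: python

         import numpy as np

         NOVICE, EXPERT = 0, 1  # values documented on CoordEnv


         def select_base_action(action, novice_action, expert_action):
             """Pick the novice or expert proposal per environment (hypothetical helper)."""
             action = np.asarray(action)
             # Where the coordinator chose EXPERT, take the expert's proposal;
             # everywhere else, take the novice's.
             return np.where(action == EXPERT, expert_action, novice_action)


         action = np.array([NOVICE, EXPERT, EXPERT])
         novice_action = np.array([3, 3, 3])
         expert_action = np.array([7, 8, 9])
         base_action = select_base_action(action, novice_action, expert_action)
         # base_action is [3, 8, 9]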

   .. py:method:: _get_obs() -> Dict[str, Any]

      Return the current observation for the coordination environment.

      :returns:

                A dictionary containing:
                    - "base_obs": The current observation from the base environment.
                    - "novice_hidden": Numpy array of hidden features from the novice policy (if open_novice).
                    - "novice_logits": Numpy array of output logits from the novice policy (if open_novice).
                    - "expert_hidden": Numpy array of hidden features from the expert policy (if open_expert).
                    - "expert_logits": Numpy array of output logits from the expert policy (if open_expert).
      :rtype: dict

      .. rubric:: Examples

      >>> obs = env._get_obs()


   .. py:method:: _get_reward(base_reward: numpy.ndarray, action: numpy.ndarray, done: numpy.ndarray) -> numpy.ndarray

      Compute the reward for the current step, including costs for expert queries and agent switching.

      :param base_reward: The base reward from the environment.
      :type base_reward: numpy.ndarray
      :param action: The action(s) taken (novice or expert).
      :type action: numpy.ndarray
      :param done: Boolean flag(s) indicating whether the episode has ended for each environment.
      :type done: numpy.ndarray

      :returns: The computed reward(s) after applying costs.
      :rtype: numpy.ndarray

      .. rubric:: Examples

      >>> reward = env._get_reward(base_reward, action, done)

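      One plausible way to combine the costs is sketched below. The exact
      formula used by ``CoordEnv`` is not stated in this reference, so
      everything here is an assumption: costs are subtracted from the base
      reward, the query cost applies on every expert step, and the switch
      cost applies whenever the acting agent changes. ``coordination_reward``
      and ``prev_action`` are hypothetical, ``query_cost`` and
      ``switch_cost`` stand in for ``expert_query_cost_per_action`` and
      ``switch_agent_cost_per_action``, and the documented ``done`` argument
      is omitted.

      .. code-block:: python

         import numpy as np

         NOVICE, EXPERT = 0, 1  # values documented on CoordEnv


         def coordination_reward(base_reward, action, prev_action, query_cost, switch_cost):
             """Subtract assumed per-step costs from the base reward (hypothetical helper)."""
             action = np.asarray(action)
             prev_action = np.asarray(prev_action)
             reward = np.asarray(base_reward, dtype=float).copy()
             # Assumed: every step handled by the expert pays the query cost.
             reward -= query_cost * (action == EXPERT)
             # Assumed: changing the acting agent between steps pays the switch cost.
             reward -= switch_cost * (action != prev_action)
             return reward


         reward = coordination_reward(
             base_reward=np.array([1.0, 1.0, 0.0]),
             action=np.array([NOVICE, EXPERT, EXPERT]),
             prev_action=np.array([NOVICE, NOVICE, EXPERT]),
             query_cost=0.5,
             switch_cost=0.25,
         )
         # reward is [1.0, 0.25, -0.5]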

   .. py:method:: close() -> None

      Close the coordination environment and release any resources held.

      :rtype: None

      .. rubric:: Examples

      >>> env.close()


.. py:class:: Evaluator(config: EvaluatorConfig, env: gym.Env)

   Evaluator for running policy evaluation on environments and summarizing results.

   .. rubric:: Examples

   >>> evaluator = Evaluator(EvaluatorConfig(), env)
   >>> summary = evaluator.evaluate(policy)


   .. py:attribute:: config_cls


   .. py:attribute:: config


   .. py:attribute:: env


   .. py:method:: evaluate(policy: duo_ai.core.Policy, num_episodes: Optional[int] = None) -> Dict[str, Any]

      Evaluate a policy on the environment and summarize the results.

      :param policy: The policy to evaluate. Must implement an `act` method and have a `.model` attribute.
      :type policy: duo_ai.core.Policy
      :param num_episodes: Number of episodes to run. If None, uses value from config.
      :type num_episodes: int, optional

      :returns: A dictionary mapping split names to summary statistics for each evaluation.
      :rtype: dict

      .. rubric:: Examples

      >>> summary = evaluator.evaluate(policy, num_episodes=100)
      >>> print(summary['reward_mean'])


   .. py:method:: _eval_one_iteration(policy: duo_ai.core.Policy, env: gym.Env) -> None

      Run a single evaluation iteration for the policy on the environment.

      :param policy: The policy to evaluate.
      :type policy: duo_ai.core.Policy
      :param env: The environment instance to evaluate on.
      :type env: gym.Env

      :rtype: None


.. py:class:: Policy

   Bases: :py:obj:`abc.ABC`


   Abstract base class for all policies in the Duo framework.

   This class defines the interface that all policy implementations must follow.

   .. rubric:: Examples

   >>> class MyPolicy(Policy):
   ...     def act(self, obs):
   ...         return ...
   ...     def reset(self, done):
   ...         pass
   ...     def set_params(self, params):
   ...         pass
   ...     def get_params(self):
   ...         return {}
   ...     def train(self):
   ...         pass
   ...     def eval(self):
   ...         pass


   .. py:method:: act(obs: Any, *args: Any, **kwargs: Any) -> torch.Tensor
      :abstractmethod:


      Select an action based on the given observation.

      :param obs: The current observation from the environment.
      :type obs: Any
      :param \*args: Additional positional arguments.
      :type \*args: Any
      :param \*\*kwargs: Additional keyword arguments.
      :type \*\*kwargs: Any

      :returns: The selected action. The format depends on the policy implementation.
      :rtype: torch.Tensor

      .. rubric:: Examples

      >>> action = policy.act(obs)


   .. py:method:: reset(done: numpy.ndarray) -> None
      :abstractmethod:


      Reset the internal state of the policy.

      This method should be overridden by subclasses to implement any necessary
      logic for resetting the policy's state to its initial configuration, such as
      clearing hidden states or episode-specific variables.

      :param done: Boolean array indicating which episodes in a batch require a reset.
      :type done: numpy.ndarray

      :rtype: None

      .. rubric:: Examples

      >>> policy.reset(done)

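      For a policy that keeps per-environment recurrent state, a masked reset
      might look like the sketch below. ``RecurrentState`` and its ``hidden``
      buffer are hypothetical; the only thing taken from this reference is
      that ``done`` is a boolean array with one entry per episode in the batch.

      .. code-block:: python

         import numpy as np


         class RecurrentState:
             """Hypothetical holder for per-environment hidden state."""

             def __init__(self, num_envs: int, hidden_dim: int) -> None:
                 self.hidden = np.zeros((num_envs, hidden_dim))

             def reset(self, done: np.ndarray) -> None:
                 # Zero the rows of finished episodes; leave the others untouched.
                 done = np.asarray(done, dtype=bool)
                 self.hidden[done] = 0.0


         state = RecurrentState(num_envs=2, hidden_dim=3)
         state.hidden[:] = 1.0  # pretend both episodes have accumulated state
         state.reset(np.array([True, False]))
         # Row 0 is cleared, row 1 keeps its values.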

   .. py:method:: set_params(params: Dict[str, Any]) -> None
      :abstractmethod:


      Set the parameters of the policy.

      This method should be overridden by subclasses to update the policy's parameters
      based on the provided dictionary, such as loading model weights or hyperparameters.

      :param params: A dictionary containing the new parameters for the policy.
      :type params: dict

      :rtype: None

      .. rubric:: Examples

      >>> policy.set_params(params)


   .. py:method:: get_params() -> Dict[str, Any]
      :abstractmethod:


      Return the current parameters of the policy.

      This method should be overridden by subclasses to return the relevant parameters
      of the policy, such as model weights or hyperparameters.

      :returns: A dictionary containing the current parameters of the policy.
      :rtype: dict

      .. rubric:: Examples

      >>> params = policy.get_params()

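      Together with ``set_params``, this allows parameters to be copied from
      one policy to another. The sketch below illustrates that round trip
      with a hypothetical ``ParamStore`` class and a made-up ``"weights"``
      key. Copying on both read and write is a design choice of the sketch,
      not something this reference requires.

      .. code-block:: python

         import copy
         from typing import Any, Dict


         class ParamStore:
             """Hypothetical object exposing the get/set parameter pair."""

             def __init__(self) -> None:
                 self._params: Dict[str, Any] = {"weights": [0.0, 0.0]}

             def get_params(self) -> Dict[str, Any]:
                 # Return a copy so callers cannot mutate internal state.
                 return copy.deepcopy(self._params)

             def set_params(self, params: Dict[str, Any]) -> None:
                 # Store a copy so later changes by the caller have no effect.
                 self._params = copy.deepcopy(params)


         source = ParamStore()
         source.set_params({"weights": [0.1, 0.2]})

         target = ParamStore()
         target.set_params(source.get_params())

         # Mutating a snapshot leaves the source untouched.
         snapshot = source.get_params()
         snapshot["weights"][0] = 99.0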

   .. py:method:: train() -> None
      :abstractmethod:


      Set the policy to training mode.

      This method should be overridden by subclasses to implement any necessary
      logic for preparing the policy for training, such as enabling dropout or switching batch normalization layers to training mode.

      :rtype: None

      .. rubric:: Examples

      >>> policy.train()


   .. py:method:: eval() -> None
      :abstractmethod:


      Set the policy to evaluation mode.

      This method should be overridden by subclasses to implement any necessary
      logic for preparing the policy for evaluation, such as disabling dropout or switching batch normalization layers to inference mode.

      :rtype: None

      .. rubric:: Examples

      >>> policy.eval()