duo_ai.policies
===============

.. py:module:: duo_ai.policies


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/duo_ai/policies/always/index
   /autoapi/duo_ai/policies/logit/index
   /autoapi/duo_ai/policies/ppo/index
   /autoapi/duo_ai/policies/pyod/index
   /autoapi/duo_ai/policies/random/index


Attributes
----------

.. autoapisummary::

   duo_ai.policies.registry


Classes
-------

.. autoapisummary::

   duo_ai.policies.AlwaysPolicy
   duo_ai.policies.LogitPolicy
   duo_ai.policies.PPOPolicy
   duo_ai.policies.PyODPolicy
   duo_ai.policies.RandomPolicy


Package Contents
----------------

.. py:class:: AlwaysPolicy(config: AlwaysPolicyConfig, env: gym.Env)

   Bases: :py:obj:`duo_ai.core.policy.Policy`


   Policy that always selects the same agent (novice or expert) for every action.

   .. rubric:: Examples

   >>> policy = AlwaysPolicy(AlwaysPolicyConfig(agent="novice"), env)
   >>> obs = ...
   >>> action = policy.act(obs)


   .. py:attribute:: config_cls


   .. py:attribute:: choice


   .. py:attribute:: device


   .. py:attribute:: config


   .. py:method:: act(obs: Any, temperature: Optional[float] = None) -> torch.Tensor

      Select the constant action for a batch of observations.

      :param obs: Batch of observations. If a dict, it must contain the key ``'base_obs'``.
      :type obs: dict or numpy.ndarray
      :param temperature: Unused. Included for API compatibility.
      :type temperature: float, optional

      :returns: Tensor of constant actions (agent indices) for the batch.
      :rtype: torch.Tensor

      :raises ValueError: If ``obs`` is neither a dict nor a NumPy array.

      .. rubric:: Examples

      >>> action = policy.act(obs)
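
      The input handling described above can be sketched without ``duo_ai`` or ``torch``. This is an illustration under stated assumptions, not the library's implementation: the agent indices ``NOVICE`` and ``EXPERT`` and the helper ``constant_actions`` are hypothetical, plain lists stand in for NumPy arrays, and the batch size is assumed to be the leading dimension of the observations.

      .. code-block:: python

         from typing import Any, List

         NOVICE, EXPERT = 0, 1  # hypothetical agent indices, not taken from duo_ai


         def constant_actions(obs: Any, choice: int) -> List[int]:
             """Return one copy of ``choice`` per observation in the batch."""
             if isinstance(obs, dict):
                 batch = obs["base_obs"]  # dict inputs carry the batch under 'base_obs'
             elif isinstance(obs, (list, tuple)):
                 batch = obs  # stand-in for a NumPy array in this sketch
             else:
                 raise ValueError(f"unsupported observation type: {type(obs).__name__}")
             return [choice] * len(batch)


         actions = constant_actions(
             {"base_obs": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]}, EXPERT
         )

      The returned list has one entry per observation, which mirrors the documented "constant actions for the batch" behaviour.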


   .. py:method:: reset(done: numpy.ndarray) -> None

      Reset the policy state at episode boundaries.

      :param done: Boolean array indicating which episodes in a batch require a reset.
      :type done: numpy.ndarray

      :rtype: None

      .. rubric:: Examples

      >>> policy.reset(done)


   .. py:method:: get_params() -> Dict[str, Any]

      Get the current parameters of the policy.

      :returns: Dictionary of policy parameters.
      :rtype: dict

      .. rubric:: Examples

      >>> params = policy.get_params()


   .. py:method:: set_params(params: Dict[str, Any]) -> None

      Set the parameters of the policy.

      :param params: Dictionary of policy parameters to set.
      :type params: dict

      :rtype: None

      .. rubric:: Examples

      >>> policy.set_params(params)


   .. py:method:: train() -> None

      Set the policy to training mode.

      :rtype: None

      .. rubric:: Examples

      >>> policy.train()


   .. py:method:: eval() -> None

      Set the policy to evaluation mode.

      :rtype: None

      .. rubric:: Examples

      >>> policy.eval()


.. py:class:: LogitPolicy(config: LogitPolicyConfig, env: gym.Env)

   Bases: :py:obj:`duo_ai.core.policy.Policy`


   Policy that selects actions based on logit confidence metrics and thresholds.

   .. rubric:: Examples

   >>> policy = LogitPolicy(LogitPolicyConfig(), env)
   >>> obs = ...
   >>> action = policy.act(obs)


   .. py:attribute:: config_cls


   .. py:attribute:: config


   .. py:attribute:: params


   .. py:attribute:: device


   .. py:attribute:: EXPERT


   .. py:method:: act(obs: Dict[str, Any], temperature: Optional[float] = None) -> torch.Tensor

      Select actions based on confidence scores and threshold.

      :param obs: Observation dictionary containing 'novice_logits'.
      :type obs: dict
      :param temperature: Unused. Included for API compatibility.
      :type temperature: float, optional

      :returns: Tensor of selected actions, one per batch element, each choosing either the expert or the novice.
      :rtype: torch.Tensor

      .. rubric:: Examples

      >>> action = policy.act(obs)
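
      A minimal sketch of threshold-based selection, assuming that a confidence score below the threshold hands control to the expert. The comparison direction, the agent indices ``NOVICE`` and ``EXPERT``, and the helper ``select_by_threshold`` are assumptions made for illustration; they are not confirmed by this reference.

      .. code-block:: python

         from typing import List, Sequence

         NOVICE, EXPERT = 0, 1  # hypothetical agent indices, not taken from duo_ai


         def select_by_threshold(confidence: Sequence[float], threshold: float) -> List[int]:
             """Pick the expert wherever the confidence falls below the threshold."""
             # Assumption: low confidence means the novice should defer to the expert.
             return [EXPERT if c < threshold else NOVICE for c in confidence]


         actions = select_by_threshold([0.95, 0.40, 0.70], threshold=0.7)

      With this convention a score equal to the threshold stays with the novice; a real implementation may draw the boundary differently.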


   .. py:method:: compute_confidence(logits: torch.Tensor) -> torch.Tensor

      Compute confidence scores from logits using the configured metric.

      :param logits: Logits tensor from the policy.
      :type logits: torch.Tensor

      :returns: Confidence scores for each sample in the batch.
      :rtype: torch.Tensor

      :raises NotImplementedError: If the configured metric is not recognized.

      .. rubric:: Examples

      >>> score = policy.compute_confidence(logits)
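
      The metrics available through the configuration are not listed in this reference. As a hedged illustration, the sketch below computes three confidence scores that are commonly derived from logits: maximum softmax probability, top-two margin, and negative entropy. The metric names and the helper functions are hypothetical, and the code works on a single sample given as a plain list with at least two logits.

      .. code-block:: python

         import math
         from typing import Dict, List, Sequence


         def softmax(logits: Sequence[float]) -> List[float]:
             """Numerically stable softmax over one sample's logits."""
             peak = max(logits)
             exps = [math.exp(x - peak) for x in logits]
             total = sum(exps)
             return [e / total for e in exps]


         def confidence_scores(logits: Sequence[float]) -> Dict[str, float]:
             """Return several confidence scores; higher always means more confident."""
             probs = softmax(logits)
             top = sorted(probs, reverse=True)
             entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
             return {
                 "max_prob": top[0],            # probability of the most likely class
                 "margin": top[0] - top[1],     # gap between the two most likely classes
                 "neg_entropy": -entropy,       # 0 for a one-hot distribution, lower when flat
             }


         peaked = confidence_scores([4.0, 0.0, 0.0])
         flat = confidence_scores([0.0, 0.0, 0.0])

      All three scores are larger for ``peaked`` than for ``flat``, so any of them could be compared against a single threshold.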


   .. py:method:: reset(done: numpy.ndarray) -> None

      Reset the policy state at episode boundaries.

      :param done: Boolean array indicating which episodes in a batch require a reset.
      :type done: numpy.ndarray

      :rtype: None

      .. rubric:: Examples

      >>> policy.reset(done)


   .. py:method:: get_params() -> Dict[str, Any]

      Get the current parameters of the policy.

      :returns: Dictionary of policy parameters.
      :rtype: dict

      .. rubric:: Examples

      >>> params = policy.get_params()


   .. py:method:: set_params(params: Dict[str, Any]) -> None

      Set the parameters of the policy.

      :param params: Dictionary of policy parameters to set.
      :type params: dict

      :rtype: None

      :raises KeyError: If a parameter key is not recognized by the policy.

      .. rubric:: Examples

      >>> policy.set_params({'threshold': 0.7})


   .. py:method:: train() -> None

      Set the policy to training mode.

      :rtype: None

      .. rubric:: Examples

      >>> policy.train()


   .. py:method:: eval() -> None

      Set the policy to evaluation mode.

      :rtype: None

      .. rubric:: Examples

      >>> policy.eval()


.. py:class:: PPOPolicy(config: PPOPolicyConfig, env: gym.Env)

   Bases: :py:obj:`duo_ai.core.policy.Policy`


   PPO policy that wraps a model and provides action selection and parameter management.

   .. rubric:: Examples

   >>> policy = PPOPolicy(PPOPolicyConfig(), env)
   >>> obs = ...
   >>> action = policy.act(obs)


   .. py:attribute:: config_cls


   .. py:attribute:: model


   .. py:attribute:: config


   .. py:method:: reset(done: numpy.ndarray) -> None

      Reset the policy state at episode boundaries.

      :param done: Boolean array indicating which episodes in a batch require a reset.
      :type done: numpy.ndarray

      :rtype: None

      .. rubric:: Examples

      >>> policy.reset(done)


   .. py:method:: act(obs: Any, temperature: float = 1.0, return_model_output: bool = False) -> Any

      Select an action based on the observation and temperature.

      :param obs: Observation input to the policy.
      :type obs: Any
      :param temperature: Sampling temperature. If 0, selects the argmax action. Default is 1.0.
      :type temperature: float, optional
      :param return_model_output: If True, also return the model output. Default is False.
      :type return_model_output: bool, optional

      :returns: **action** -- Selected action, or ``(action, model_output)`` if ``return_model_output`` is True.
      :rtype: torch.Tensor or tuple

      .. rubric:: Examples

      >>> action = policy.act(obs)
      >>> action, model_output = policy.act(obs, return_model_output=True)
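
      The role of ``temperature`` can be sketched as follows. The only behaviour taken from this reference is that a temperature of 0 selects the argmax action; sampling from a softmax over ``logits / temperature`` for positive temperatures is the usual convention and is assumed here. The helper ``sample_action`` is hypothetical and handles one sample given as a plain list.

      .. code-block:: python

         import math
         import random
         from typing import Optional, Sequence


         def sample_action(
             logits: Sequence[float],
             temperature: float = 1.0,
             rng: Optional[random.Random] = None,
         ) -> int:
             """Return an action index: greedy at temperature 0, sampled otherwise."""
             indices = range(len(logits))
             if temperature == 0:
                 # Documented behaviour: temperature 0 selects the argmax action.
                 return max(indices, key=lambda i: logits[i])
             rng = rng or random.Random()
             # Assumption: positive temperatures rescale the logits before softmax sampling.
             scaled = [x / temperature for x in logits]
             peak = max(scaled)
             weights = [math.exp(x - peak) for x in scaled]  # unnormalised softmax weights
             return rng.choices(indices, weights=weights, k=1)[0]


         greedy = sample_action([0.2, 1.5, -0.3], temperature=0)
         sampled = sample_action([0.2, 1.5, -0.3], temperature=1.0, rng=random.Random(0))

      Lower positive temperatures concentrate probability on the highest logit, and higher temperatures flatten the distribution. Negative temperatures are not handled in this sketch.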


   .. py:method:: set_params(params: Dict[str, Any]) -> None

      Set the model parameters from a state dictionary.

      :param params: State dictionary of model parameters.
      :type params: dict

      :rtype: None

      .. rubric:: Examples

      >>> policy.set_params(params)


   .. py:method:: get_params() -> Dict[str, Any]

      Get the current model parameters as a state dictionary.

      :returns: State dictionary of model parameters.
      :rtype: dict

      .. rubric:: Examples

      >>> params = policy.get_params()


   .. py:method:: train() -> None

      Set the policy/model to training mode.

      :rtype: None

      .. rubric:: Examples

      >>> policy.train()


   .. py:method:: eval() -> None

      Set the policy/model to evaluation mode.

      :rtype: None

      .. rubric:: Examples

      >>> policy.eval()


.. py:class:: PyODPolicy(config: PyODPolicyConfig, env: gym.Env)

   Bases: :py:obj:`duo_ai.core.policy.Policy`


   Policy that selects actions from out-of-distribution (OOD) scores computed by a PyOD outlier detector.

   .. rubric:: Examples

   >>> policy = PyODPolicy(PyODPolicyConfig(), env)
   >>> obs = ...
   >>> action = policy.act(obs)


   .. py:attribute:: config_cls


   .. py:attribute:: config


   .. py:attribute:: threshold
      :value: None


   .. py:attribute:: device


   .. py:attribute:: clf


   .. py:attribute:: feature_type


   .. py:attribute:: EXPERT


   .. py:method:: _get_pyod_class(config: PyODPolicyConfig) -> type

      Dynamically import and return the PyOD class specified in the config.

      :param config: Configuration object for the policy.
      :type config: PyODPolicyConfig

      :returns: The PyOD class to instantiate.
      :rtype: type

      :raises ImportError: If the specified class cannot be imported.

      .. rubric:: Examples

      >>> cls = policy._get_pyod_class(config)


   .. py:method:: reset(done: numpy.ndarray) -> None

      Reset the policy state at episode boundaries.

      :param done: Boolean array indicating which episodes in a batch require a reset.
      :type done: numpy.ndarray

      :rtype: None

      .. rubric:: Examples

      >>> policy.reset(done)


   .. py:method:: _make_input(obs: Dict[str, Any]) -> numpy.ndarray

      Construct the input feature array for the PyOD model from the observation.

      :param obs: Observation dictionary containing required features.
      :type obs: dict

      :returns: Concatenated feature array for the PyOD model.
      :rtype: numpy.ndarray

      :raises AssertionError: If no features are selected for PyOD input.

      .. rubric:: Examples

      >>> inp = policy._make_input(obs)
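
      A rough sketch of the concatenation step, assuming that the selected features are stored in the observation dictionary as per-sample vectors. The key ``'novice_features'`` and the helper ``make_input`` are hypothetical, and nested lists stand in for NumPy arrays; how the real method chooses features is not described in this reference.

      .. code-block:: python

         from typing import Dict, List, Sequence


         def make_input(
             obs: Dict[str, Sequence[Sequence[float]]],
             feature_keys: Sequence[str],
         ) -> List[List[float]]:
             """Concatenate the selected per-sample feature vectors into one row each."""
             # Mirrors the documented AssertionError when no features are selected.
             assert feature_keys, "no features selected for the detector input"
             batch_size = len(obs[feature_keys[0]])
             rows: List[List[float]] = []
             for i in range(batch_size):
                 row: List[float] = []
                 for key in feature_keys:
                     row.extend(obs[key][i])  # append this feature's vector for sample i
                 rows.append(row)
             return rows


         obs = {
             "novice_logits": [[1.0, -1.0], [0.5, 0.5]],
             "novice_features": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],  # hypothetical key
         }
         x = make_input(obs, ["novice_features", "novice_logits"])

      Each row of ``x`` has one entry per concatenated feature dimension, which is the two-dimensional layout that outlier detectors typically expect.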


   .. py:method:: fit(data: Dict[str, Any]) -> None

      Fit the PyOD model using the provided data.

      :param data: Data dictionary containing features for fitting the model.
      :type data: dict

      :rtype: None

      .. rubric:: Examples

      >>> policy.fit(data)


   .. py:method:: get_train_scores() -> numpy.ndarray

      Get the OOD decision scores from the PyOD model after fitting.

      :returns: Array of decision scores for the training data.
      :rtype: numpy.ndarray

      .. rubric:: Examples

      >>> scores = policy.get_train_scores()


   .. py:method:: act(obs: Dict[str, Any], temperature: Optional[float] = None) -> torch.Tensor

      Select actions based on OOD scores from the PyOD model.

      :param obs: Observation dictionary containing required features.
      :type obs: dict
      :param temperature: Unused. Included for API compatibility.
      :type temperature: float, optional

      :returns: Tensor of selected actions, one per batch element, each choosing either the expert or the novice.
      :rtype: torch.Tensor

      .. rubric:: Examples

      >>> action = policy.act(obs)
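
      One way the threshold could relate to the training scores is sketched below: pick a percentile of the scores obtained after fitting, then defer to the expert for any new sample that scores above it. Both the percentile rule and the comparison direction are assumptions, as are the agent indices and helper names. Plain lists stand in for the score arrays that a fitted PyOD detector would supply.

      .. code-block:: python

         import math
         from typing import List, Sequence

         NOVICE, EXPERT = 0, 1  # hypothetical agent indices, not taken from duo_ai


         def percentile_threshold(train_scores: Sequence[float], q: float) -> float:
             """Nearest-rank percentile of the training scores, with 0 < q <= 100."""
             ordered = sorted(train_scores)
             rank = max(1, math.ceil(q * len(ordered) / 100))
             return ordered[rank - 1]


         def select_by_ood_score(scores: Sequence[float], threshold: float) -> List[int]:
             """Pick the expert wherever the outlier score exceeds the threshold."""
             # Assumption: a higher score means the sample looks more out-of-distribution.
             return [EXPERT if s > threshold else NOVICE for s in scores]


         train_scores = [0.1, 0.2, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9, 1.5]
         threshold = percentile_threshold(train_scores, q=90)
         actions = select_by_ood_score([0.3, 1.2, 0.9], threshold)

      Raising ``q`` makes the policy call the expert less often, since fewer samples exceed the threshold.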


   .. py:method:: set_params(params: Dict[str, Any]) -> None

      Set the parameters of the policy.

      :param params: Dictionary of policy parameters to set.
      :type params: dict

      :rtype: None

      .. rubric:: Examples

      >>> policy.set_params({'threshold': 0.5, 'clf': clf})


   .. py:method:: get_params() -> Dict[str, Any]

      Get the current parameters of the policy.

      :returns: Dictionary of policy parameters.
      :rtype: dict

      .. rubric:: Examples

      >>> params = policy.get_params()


   .. py:method:: train() -> None

      Set the PyOD model to training mode if applicable.

      :rtype: None

      .. rubric:: Examples

      >>> policy.train()


   .. py:method:: eval() -> None

      Set the PyOD model to evaluation mode if applicable.

      :rtype: None

      .. rubric:: Examples

      >>> policy.eval()


.. py:class:: RandomPolicy(config: RandomPolicyConfig, env: gym.Env)

   Bases: :py:obj:`duo_ai.core.policy.Policy`


   Policy that selects the expert action with a fixed probability.

   .. rubric:: Examples

   >>> policy = RandomPolicy(RandomPolicyConfig(prob=0.7), env)
   >>> obs = ...
   >>> action = policy.act(obs)


   .. py:attribute:: config_cls


   .. py:attribute:: prob


   .. py:attribute:: device


   .. py:attribute:: EXPERT


   .. py:attribute:: config


   .. py:method:: act(obs: object, temperature: Optional[float] = None) -> torch.Tensor

      Select actions randomly based on the configured probability.

      :param obs: Batch of observations. If a dict, it must contain the key ``'base_obs'``.
      :type obs: dict or numpy.ndarray
      :param temperature: Unused. Included for API compatibility.
      :type temperature: float, optional

      :returns: Tensor of selected actions, one per batch element, each choosing either the expert or the novice.
      :rtype: torch.Tensor

      :raises ValueError: If ``obs`` is neither a dict nor a NumPy array.

      .. rubric:: Examples

      >>> action = policy.act(obs)
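
      Selecting the expert with a fixed probability amounts to one independent Bernoulli draw per batch element. The sketch below illustrates this with the standard library only; the agent indices and the helper ``random_actions`` are hypothetical and do not come from ``duo_ai``.

      .. code-block:: python

         import random
         from typing import List, Optional

         NOVICE, EXPERT = 0, 1  # hypothetical agent indices, not taken from duo_ai


         def random_actions(
             batch_size: int,
             prob: float,
             rng: Optional[random.Random] = None,
         ) -> List[int]:
             """Draw one action per batch element, choosing the expert with probability ``prob``."""
             rng = rng or random.Random()
             # rng.random() is uniform on [0, 1), so the comparison is true with probability prob.
             return [EXPERT if rng.random() < prob else NOVICE for _ in range(batch_size)]


         actions = random_actions(10_000, prob=0.7, rng=random.Random(0))
         expert_rate = sum(actions) / len(actions)

      Over a large batch, ``expert_rate`` should land close to ``prob``. Passing a seeded ``random.Random`` keeps the draws reproducible.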


   .. py:method:: reset(done: numpy.ndarray) -> None

      Reset the policy state at episode boundaries.

      :param done: Boolean array indicating which episodes in a batch require a reset.
      :type done: numpy.ndarray

      :rtype: None

      .. rubric:: Examples

      >>> policy.reset(done)


   .. py:method:: set_params(params: dict) -> None

      Set the parameters of the policy.

      :param params: Dictionary of policy parameters to set.
      :type params: dict

      :rtype: None

      .. rubric:: Examples

      >>> policy.set_params({'prob': 0.5})


   .. py:method:: get_params() -> dict

      Get the current parameters of the policy.

      :returns: Dictionary of policy parameters.
      :rtype: dict

      .. rubric:: Examples

      >>> params = policy.get_params()


   .. py:method:: train() -> None

      Set the policy to training mode.

      :rtype: None

      .. rubric:: Examples

      >>> policy.train()


   .. py:method:: eval() -> None

      Set the policy to evaluation mode.

      :rtype: None

      .. rubric:: Examples

      >>> policy.eval()


.. py:data:: registry