duo_ai.policies
===============

.. py:module:: duo_ai.policies


Submodules
----------

.. toctree::
   :maxdepth: 1

   /autoapi/duo_ai/policies/always/index
   /autoapi/duo_ai/policies/logit/index
   /autoapi/duo_ai/policies/ppo/index
   /autoapi/duo_ai/policies/pyod/index
   /autoapi/duo_ai/policies/random/index


Attributes
----------

.. autoapisummary::

   duo_ai.policies.registry


Classes
-------

.. autoapisummary::

   duo_ai.policies.AlwaysPolicy
   duo_ai.policies.LogitPolicy
   duo_ai.policies.PPOPolicy
   duo_ai.policies.PyODPolicy
   duo_ai.policies.RandomPolicy


Package Contents
----------------

.. py:class:: AlwaysPolicy(config: AlwaysPolicyConfig, env: gym.Env)

   Bases: :py:obj:`duo_ai.core.policy.Policy`


   Policy that always selects the same agent (novice or expert) for every action.

   .. rubric:: Examples

   >>> policy = AlwaysPolicy(AlwaysPolicyConfig(agent="novice"), env)
   >>> obs = ...
   >>> action = policy.act(obs)


   .. py:attribute:: config_cls


   .. py:attribute:: choice


   .. py:attribute:: device


   .. py:attribute:: config


   .. py:method:: act(obs: Any, temperature: Optional[float] = None) -> torch.Tensor

      Select the constant action for a batch of observations.

      :param obs: Batch of observations. If dict, must contain 'base_obs'.
      :type obs: dict or np.ndarray
      :param temperature: Unused. Included for API compatibility.
      :type temperature: float, optional

      :returns: Tensor of constant actions (agent indices) for the batch.
      :rtype: torch.Tensor

      :raises ValueError: If obs is not a dict or numpy array.

      .. rubric:: Examples

      >>> action = policy.act(obs)


   .. py:method:: reset(done: numpy.ndarray) -> None

      Reset the policy state at episode boundaries.

      :param done: Boolean array indicating which episodes in a batch require a reset.
      :type done: np.ndarray

      :rtype: None

      .. rubric:: Examples

      >>> policy.reset(done)


   .. py:method:: get_params() -> Dict[str, Any]

      Get the current parameters of the policy.

      :returns: Dictionary of policy parameters.
      :rtype: dict

      .. rubric:: Examples

      >>> params = policy.get_params()


   .. py:method:: set_params(params: Dict[str, Any]) -> None

      Set the parameters of the policy.

      :param params: Dictionary of policy parameters to set.
      :type params: dict

      :rtype: None

      .. rubric:: Examples

      >>> policy.set_params(params)


   .. py:method:: train() -> None

      Set the policy to training mode.

      :rtype: None

      .. rubric:: Examples

      >>> policy.train()


   .. py:method:: eval() -> None

      Set the policy to evaluation mode.

      :rtype: None

      .. rubric:: Examples

      >>> policy.eval()


.. py:class:: LogitPolicy(config: LogitPolicyConfig, env: gym.Env)

   Bases: :py:obj:`duo_ai.core.policy.Policy`


   Policy that selects actions based on logit confidence metrics and thresholds.

   .. rubric:: Examples

   >>> policy = LogitPolicy(LogitPolicyConfig(), env)
   >>> obs = ...
   >>> action = policy.act(obs)


   .. py:attribute:: config_cls


   .. py:attribute:: config


   .. py:attribute:: params


   .. py:attribute:: device


   .. py:attribute:: EXPERT


   .. py:method:: act(obs: Dict[str, Any], temperature: Optional[float] = None) -> torch.Tensor

      Select actions based on confidence scores and threshold.

      :param obs: Observation dictionary containing 'novice_logits'.
      :type obs: dict
      :param temperature: Unused. Included for API compatibility.
      :type temperature: float, optional

      :returns: Tensor of selected actions (expert or not) for the batch.
      :rtype: torch.Tensor

      .. rubric:: Examples

      >>> action = policy.act(obs)


   .. py:method:: compute_confidence(logits: torch.Tensor) -> torch.Tensor

      Compute confidence scores from logits using the configured metric.

      :param logits: Logits tensor from the policy.
      :type logits: torch.Tensor

      :returns: Confidence scores for each sample in the batch.
      :rtype: torch.Tensor

      :raises NotImplementedError: If the configured metric is not recognized.

      .. rubric:: Examples

      >>> score = policy.compute_confidence(logits)


   .. py:method:: reset(done: numpy.ndarray) -> None

      Reset the policy state at episode boundaries.

      :param done: Boolean array indicating which episodes in a batch require a reset.
      :type done: numpy.ndarray

      :rtype: None

      .. rubric:: Examples

      >>> policy.reset(done)


   .. py:method:: get_params() -> Dict[str, Any]

      Get the current parameters of the policy.

      :returns: Dictionary of policy parameters.
      :rtype: dict

      .. rubric:: Examples

      >>> params = policy.get_params()


   .. py:method:: set_params(params: Dict[str, Any]) -> None

      Set the parameters of the policy.

      :param params: Dictionary of policy parameters to set.
      :type params: dict

      :rtype: None

      :raises KeyError: If a parameter key is not recognized by the policy.

      .. rubric:: Examples

      >>> policy.set_params({'threshold': 0.7})


   .. py:method:: train() -> None

      Set the policy to training mode.

      :rtype: None

      .. rubric:: Examples

      >>> policy.train()


   .. py:method:: eval() -> None

      Set the policy to evaluation mode.

      :rtype: None

      .. rubric:: Examples

      >>> policy.eval()


.. py:class:: PPOPolicy(config: PPOPolicyConfig, env: gym.Env)

   Bases: :py:obj:`duo_ai.core.policy.Policy`


   Policy class for PPO, wrapping a model and providing action selection and parameter management.

   .. rubric:: Examples

   >>> policy = PPOPolicy(PPOPolicyConfig(), env)
   >>> obs = ...
   >>> action = policy.act(obs)


   .. py:attribute:: config_cls


   .. py:attribute:: model


   .. py:attribute:: config


   .. py:method:: reset(done: numpy.ndarray) -> None

      Reset the policy state at episode boundaries.

      :param done: Boolean array indicating which episodes in a batch require a reset.
      :type done: numpy.ndarray

      :rtype: None

      .. rubric:: Examples

      >>> policy.reset(done)


   .. py:method:: act(obs: Any, temperature: float = 1.0, return_model_output: bool = False) -> Any

      Select an action based on the observation and temperature.

      :param obs: Observation input to the policy.
      :type obs: Any
      :param temperature: Sampling temperature. If 0, selects the argmax action. Default is 1.0.
      :type temperature: float, optional
      :param return_model_output: If True, also return the model output. Default is False.
      :type return_model_output: bool, optional

      :returns: **action** -- Selected action, or (action, model_output) if return_model_output is True.
      :rtype: torch.Tensor or tuple

      .. rubric:: Examples

      >>> action = policy.act(obs)
      >>> action, model_output = policy.act(obs, return_model_output=True)


   .. py:method:: set_params(params: Dict[str, Any]) -> None

      Set the model parameters from a state dictionary.

      :param params: State dictionary of model parameters.
      :type params: dict

      :rtype: None

      .. rubric:: Examples

      >>> policy.set_params(params)


   .. py:method:: get_params() -> Dict[str, Any]

      Get the current model parameters as a state dictionary.

      :returns: State dictionary of model parameters.
      :rtype: dict

      .. rubric:: Examples

      >>> params = policy.get_params()


   .. py:method:: train() -> None

      Set the policy/model to training mode.

      :rtype: None

      .. rubric:: Examples

      >>> policy.train()


   .. py:method:: eval() -> None

      Set the policy/model to evaluation mode.

      :rtype: None

      .. rubric:: Examples

      >>> policy.eval()


.. py:class:: PyODPolicy(config: PyODPolicyConfig, env: gym.Env)

   Bases: :py:obj:`duo_ai.core.policy.Policy`


   Policy that uses a PyOD outlier detector for action selection based on OOD scores.

   .. rubric:: Examples

   >>> policy = PyODPolicy(PyODPolicyConfig(), env)
   >>> obs = ...
   >>> action = policy.act(obs)


   .. py:attribute:: config_cls


   .. py:attribute:: config


   .. py:attribute:: threshold
      :value: None


   .. py:attribute:: device


   .. py:attribute:: clf


   .. py:attribute:: feature_type


   .. py:attribute:: EXPERT


   .. py:method:: _get_pyod_class(config: PyODPolicyConfig) -> type

      Dynamically import and return the PyOD class specified in the config.

      :param config: Configuration object for the policy.
      :type config: PyODPolicyConfig

      :returns: The PyOD class to instantiate.
      :rtype: type

      :raises ImportError: If the specified class cannot be imported.

      .. rubric:: Examples

      >>> cls = policy._get_pyod_class(config)


   .. py:method:: reset(done: numpy.ndarray) -> None

      Reset the policy state at episode boundaries.

      :param done: Boolean array indicating which episodes in a batch require a reset.
      :type done: numpy.ndarray

      :rtype: None

      .. rubric:: Examples

      >>> policy.reset(done)


   .. py:method:: _make_input(obs: Dict[str, Any]) -> numpy.ndarray

      Construct the input feature array for the PyOD model from the observation.

      :param obs: Observation dictionary containing required features.
      :type obs: dict

      :returns: Concatenated feature array for the PyOD model.
      :rtype: np.ndarray

      :raises AssertionError: If no features are selected for PyOD input.

      .. rubric:: Examples

      >>> inp = policy._make_input(obs)


   .. py:method:: fit(data: Dict[str, Any]) -> None

      Fit the PyOD model using the provided data.

      :param data: Data dictionary containing features for fitting the model.
      :type data: dict

      :rtype: None

      .. rubric:: Examples

      >>> policy.fit(data)


   .. py:method:: get_train_scores() -> numpy.ndarray

      Get the OOD decision scores from the PyOD model after fitting.

      :returns: Array of decision scores for the training data.
      :rtype: np.ndarray

      .. rubric:: Examples

      >>> scores = policy.get_train_scores()


   .. py:method:: act(obs: Dict[str, Any], temperature: Optional[float] = None) -> torch.Tensor

      Select actions based on OOD scores from the PyOD model.

      :param obs: Observation dictionary containing required features.
      :type obs: dict
      :param temperature: Unused. Included for API compatibility.
      :type temperature: float, optional

      :returns: Tensor of selected actions (expert or not) for the batch.
      :rtype: torch.Tensor

      .. rubric:: Examples

      >>> action = policy.act(obs)


   .. py:method:: set_params(params: Dict[str, Any]) -> None

      Set the parameters of the policy.

      :param params: Dictionary of policy parameters to set.
      :type params: dict

      :rtype: None

      .. rubric:: Examples

      >>> policy.set_params({'threshold': 0.5, 'clf': clf})


   .. py:method:: get_params() -> Dict[str, Any]

      Get the current parameters of the policy.

      :returns: Dictionary of policy parameters.
      :rtype: dict

      .. rubric:: Examples

      >>> params = policy.get_params()


   .. py:method:: train() -> None

      Set the PyOD model to training mode if applicable.

      :rtype: None

      .. rubric:: Examples

      >>> policy.train()


   .. py:method:: eval() -> None

      Set the PyOD model to evaluation mode if applicable.

      :rtype: None

      .. rubric:: Examples

      >>> policy.eval()


.. py:class:: RandomPolicy(config: RandomPolicyConfig, env: gym.Env)

   Bases: :py:obj:`duo_ai.core.policy.Policy`


   Policy that selects the expert action with a fixed probability.

   .. rubric:: Examples

   >>> policy = RandomPolicy(RandomPolicyConfig(prob=0.7), env)
   >>> obs = ...
   >>> action = policy.act(obs)


   .. py:attribute:: config_cls


   .. py:attribute:: prob


   .. py:attribute:: device


   .. py:attribute:: EXPERT


   .. py:attribute:: config


   .. py:method:: act(obs: object, temperature: Optional[float] = None) -> torch.Tensor

      Select actions randomly based on the configured probability.

      :param obs: Batch of observations. If dict, must contain 'base_obs'.
      :type obs: dict or np.ndarray
      :param temperature: Unused. Included for API compatibility.
      :type temperature: float, optional

      :returns: Tensor of selected actions (expert or not) for the batch.
      :rtype: torch.Tensor

      :raises ValueError: If obs is not a dict or numpy array.

      .. rubric:: Examples

      >>> action = policy.act(obs)


   .. py:method:: reset(done: numpy.ndarray) -> None

      Reset the policy state at episode boundaries.

      :param done: Boolean array indicating which episodes in a batch require a reset.
      :type done: np.ndarray

      :rtype: None

      .. rubric:: Examples

      >>> policy.reset(done)


   .. py:method:: set_params(params: dict) -> None

      Set the parameters of the policy.

      :param params: Dictionary of policy parameters to set.
      :type params: dict

      :rtype: None

      .. rubric:: Examples

      >>> policy.set_params({'prob': 0.5})


   .. py:method:: get_params() -> dict

      Get the current parameters of the policy.

      :returns: Dictionary of policy parameters.
      :rtype: dict

      .. rubric:: Examples

      >>> params = policy.get_params()


   .. py:method:: train() -> None

      Set the policy to training mode.

      :rtype: None

      .. rubric:: Examples

      >>> policy.train()


   .. py:method:: eval() -> None

      Set the policy to evaluation mode.

      :rtype: None

      .. rubric:: Examples

      >>> policy.eval()


.. py:data:: registry
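
The threshold-based selection that ``LogitPolicy.act`` and ``compute_confidence`` describe can be illustrated with a small standalone sketch. It does not import ``duo_ai`` and is not the library's implementation. Several details are assumptions made for illustration: the confidence metric (maximum softmax probability), the rule that a row with confidence below the threshold is routed to the expert, the action indices ``NOVICE = 0`` and ``EXPERT = 1``, and the helper names ``max_softmax_confidence`` and ``choose_agents``. Plain Python lists stand in for the ``torch.Tensor`` batches used by the real policies.

```python
import math
from typing import List, Sequence

# Hypothetical action indices; the real values of EXPERT are not shown in this reference.
NOVICE, EXPERT = 0, 1


def max_softmax_confidence(logits: Sequence[float]) -> float:
    """Return the largest softmax probability for one row of logits.

    Assumed metric: the reference only says the metric is configurable.
    """
    peak = max(logits)
    # Subtract the peak before exponentiating for numerical stability.
    exps = [math.exp(x - peak) for x in logits]
    return max(exps) / sum(exps)


def choose_agents(batch_logits: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Pick EXPERT for rows whose confidence is strictly below the threshold.

    Assumed rule: low novice confidence defers to the expert.
    """
    return [
        EXPERT if max_softmax_confidence(row) < threshold else NOVICE
        for row in batch_logits
    ]


# First row is sharply peaked (confident); second row is nearly uniform.
batch = [[4.0, 0.0, 0.0], [0.1, 0.0, 0.2]]
actions = choose_agents(batch, threshold=0.7)
print(actions)
```

With a threshold of 0.7, the peaked first row stays with the novice and the nearly uniform second row goes to the expert. Under this assumed rule, raising the threshold sends more steps to the expert.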