duo_ai.algorithms

Submodules

Attributes

Classes

  • AlwaysAlgorithm – Algorithm that always returns the same action, regardless of input.

  • LogitAlgorithm – Algorithm for tuning confidence-based policies using logit thresholds and temperatures.

  • PPOAlgorithm – Proximal Policy Optimization (PPO) algorithm implementation.

  • PyODAlgorithm – Algorithm for out-of-distribution (OOD) detection using PyOD models.

  • RandomAlgorithm – Algorithm that searches for the best probability parameter to maximize evaluation reward.

Package Contents

class duo_ai.algorithms.AlwaysAlgorithm(config: AlwaysAlgorithmConfig)

Bases: duo_ai.core.algorithm.Algorithm

Algorithm that always returns the same action, regardless of input.

Examples

>>> algo = AlwaysAlgorithm(AlwaysAlgorithmConfig())
config_cls
train(policy: duo.core.Policy, env: gym.Env, validators: Dict[str, duo.core.Evaluator]) -> None

Run the AlwaysAlgorithm training procedure.

This method evaluates the provided policy in the given environment using the specified evaluators. The AlwaysAlgorithm always returns the same action, regardless of the input observation.

Parameters:
  • policy (duo.core.Policy) – The policy instance to use for generating actions.

  • env (gym.Env) – The environment in which the policy is evaluated.

  • validators (dict of str to duo.core.Evaluator) – Dictionary mapping split names to evaluator instances for evaluation.

Return type:

None

Examples

>>> algorithm = AlwaysAlgorithm(AlwaysAlgorithmConfig())
>>> algorithm.train(policy, env, validators)
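
As a rough illustration of evaluating a fixed-action policy on several splits, here is a minimal sketch. `ConstantPolicy`, `make_evaluator`, and `evaluate_all` are hypothetical names, and the evaluators are plain callables rather than `duo.core.Evaluator` instances, so the real classes may behave differently.

```python
from typing import Any, Callable, Dict, Sequence


class ConstantPolicy:
    """Hypothetical stand-in policy that ignores the observation."""

    def __init__(self, action: int) -> None:
        self.action = action

    def act(self, observation: Any) -> int:
        return self.action


def make_evaluator(observations: Sequence[Any], target: int) -> Callable[[ConstantPolicy], float]:
    """Toy evaluator (assumption): reward is the fraction of steps whose action equals `target`."""

    def evaluate(policy: ConstantPolicy) -> float:
        hits = [policy.act(obs) == target for obs in observations]
        return sum(hits) / len(hits)

    return evaluate


def evaluate_all(policy: ConstantPolicy,
                 validators: Dict[str, Callable[[ConstantPolicy], float]]) -> Dict[str, float]:
    """Run every evaluator once and collect one score per split name."""
    return {split: evaluate(policy) for split, evaluate in validators.items()}


validators = {
    "val": make_evaluator([0, 1, 2, 3], target=1),
    "test": make_evaluator([4, 5], target=0),
}
results = evaluate_all(ConstantPolicy(action=1), validators)
```

Because the action never changes, there is nothing to optimize in this sketch; it only reports a score for each split.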
class duo_ai.algorithms.LogitAlgorithm(config: LogitAlgorithmConfig)

Bases: duo_ai.core.Algorithm

Algorithm for tuning confidence-based policies using logit thresholds and temperatures.

Examples

>>> algo = LogitAlgorithm(LogitAlgorithmConfig())
config_cls
config
train(policy: duo.core.Policy, env: gym.Env, validators: Dict[str, duo.core.Evaluator]) -> None

Train the LogitAlgorithm by searching for the best threshold and temperature parameters based on rollout scores and evaluation results.

Parameters:
  • policy (duo.core.Policy) – The policy to be trained and evaluated.

  • env (gym.Env) – The environment used for training and rollouts.

  • validators (dict of str to duo.core.Evaluator) – Dictionary mapping split names to evaluator instances for evaluation.

Return type:

None

Examples

>>> algorithm = LogitAlgorithm(LogitAlgorithmConfig())
>>> algorithm.train(policy, env, validators)
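
The search over threshold and temperature described above can be pictured as an exhaustive grid search. The sketch below is an assumption about the general shape of such a search, not the library's code: `search_best` and `toy_evaluate` are hypothetical, and the toy reward stands in for an evaluator call.

```python
from itertools import product
from typing import Callable, Optional, Sequence, Tuple


def search_best(
    evaluate: Callable[[float, float], float],
    thresholds: Sequence[float],
    temperatures: Sequence[float],
) -> Tuple[Optional[Tuple[float, float]], float]:
    """Try every (threshold, temperature) pair and keep the highest-reward one."""
    best_params: Optional[Tuple[float, float]] = None
    best_reward = float("-inf")
    for temperature, threshold in product(temperatures, thresholds):
        reward = evaluate(threshold, temperature)
        if reward > best_reward:
            best_params, best_reward = (threshold, temperature), reward
    return best_params, best_reward


def toy_evaluate(threshold: float, temperature: float) -> float:
    """Toy reward (assumption) that peaks at threshold=0.6, temperature=1.0."""
    return -((threshold - 0.6) ** 2) - (temperature - 1.0) ** 2


best_params, best_reward = search_best(
    toy_evaluate,
    thresholds=[0.2, 0.4, 0.6, 0.8],
    temperatures=[0.5, 1.0, 2.0],
)
```

In practice the candidate thresholds would presumably be derived from the rollout scores rather than fixed by hand.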
save_checkpoint(policy: duo.core.Policy, name: str) -> None

Save the current policy configuration and parameters to a checkpoint file.

Parameters:
  • policy (duo.core.Policy) – The policy whose parameters are to be saved.

  • name (str) – Name for the checkpoint file.

Return type:

None

Examples

>>> algorithm.save_checkpoint(policy, "best_test")
_generate_scores(env: gym.Env, policy: duo.core.Policy, temperature: float, num_rollouts: int) -> list

Generate confidence scores by rolling out the policy in the environment.

Parameters:
  • env (gym.Env) – The environment used for rollouts.

  • policy (duo.core.Policy) – The policy to be evaluated.

  • temperature (float) – Temperature parameter for action selection.

  • num_rollouts (int) – Total number of rollout episodes to generate.

Returns:

scores – List of confidence scores collected from rollouts.

Return type:

list of float

Examples

>>> scores = self._generate_scores(env, policy, 1.0, 128)
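
The page does not define how a confidence score is computed. One common choice, assumed here purely for illustration, is the maximum temperature-scaled softmax probability at each step. The names `confidence` and `generate_scores` are hypothetical, and the logits are given as plain lists instead of being produced by environment rollouts.

```python
import math
from typing import List, Sequence


def confidence(logits: Sequence[float], temperature: float) -> float:
    """Maximum softmax probability after dividing the logits by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    # Subtracting the peak keeps exp() numerically stable; the largest term becomes exp(0) = 1.
    exps = [math.exp(s - peak) for s in scaled]
    return max(exps) / sum(exps)


def generate_scores(rollout_logits: Sequence[Sequence[Sequence[float]]],
                    temperature: float) -> List[float]:
    """Flatten per-episode, per-step logits into one list of confidence scores."""
    return [confidence(step, temperature)
            for episode in rollout_logits
            for step in episode]


rollout_logits = [[[0.0, 0.0], [2.0, 0.0]], [[2.0, 0.0]]]
scores = generate_scores(rollout_logits, temperature=1.0)
```

A higher temperature flattens the distribution, so the same logits yield a lower confidence score.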
class duo_ai.algorithms.PPOAlgorithm(config: PPOAlgorithmConfig)

Bases: duo_ai.core.Algorithm

Proximal Policy Optimization (PPO) algorithm implementation.

Examples

>>> algo = PPOAlgorithm(PPOAlgorithmConfig())
>>> algo.train(policy, env, validators)
config_cls
config
_initialize() -> None

Initialize PPO training state, buffers, optimizer, and logging.

Return type:

None

train(policy: duo.policies.PPOPolicy, env: gymnasium.Env, validators: Dict[str, duo.core.Evaluator]) -> None

Train the PPO algorithm on the specified environment(s) using the provided policy.

This method performs multiple training iterations, periodically evaluates the policy, logs statistics, and saves checkpoints for the best and last models.

Parameters:
  • policy (duo.policies.PPOPolicy) – The policy to be trained.

  • env (gymnasium.Env) – The environment instance for training.

  • validators (dict of str to duo.core.Evaluator) – Dictionary mapping split names to evaluator instances for evaluation.

Return type:

None

Examples

>>> algorithm.train(policy, env, validators)
_train_once() -> None

Perform a single training iteration of PPO, including trajectory collection, advantage computation, and policy/value updates.

Return type:

None
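
The policy update in PPO is usually driven by the clipped surrogate objective. The sketch below shows that standard objective in dependency-free Python. It is not taken from this library: `clipped_surrogate_loss` and `clip_eps` are hypothetical names, and the actual implementation presumably operates on `torch.Tensor` batches and adds value and entropy terms.

```python
import math
from typing import Sequence


def clipped_surrogate_loss(
    new_log_probs: Sequence[float],
    old_log_probs: Sequence[float],
    advantages: Sequence[float],
    clip_eps: float = 0.2,
) -> float:
    """Negative mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        ratio = math.exp(new_lp - old_lp)  # probability ratio pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the minimum makes the objective a pessimistic bound on the update.
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)


# With a ratio of 2 and a positive advantage, the gain is capped at 1 + clip_eps.
capped_loss = clipped_surrogate_loss([math.log(2.0)], [0.0], [1.0])
```

Clipping removes the incentive to move the new policy far from the one that collected the data, which is why several update epochs can reuse the same batch of trajectories.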

_update_learning_rate() -> None

Update the learning rate for the optimizer, optionally annealing it over time.

Return type:

None
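
The annealing schedule is not specified on this page. A linear decay toward zero is a common choice in PPO implementations and is assumed in the sketch below; `annealed_learning_rate` and its parameters are hypothetical names.

```python
def annealed_learning_rate(
    base_lr: float,
    iteration: int,
    total_iterations: int,
    anneal: bool = True,
) -> float:
    """Return the learning rate to use at `iteration` (0-based)."""
    if total_iterations <= 0:
        raise ValueError("total_iterations must be positive")
    if not anneal:
        return base_lr
    # Fraction of training still remaining: 1.0 at the start, approaching 0.0 at the end.
    remaining = 1.0 - iteration / total_iterations
    return base_lr * remaining


halfway_lr = annealed_learning_rate(3e-4, iteration=50, total_iterations=100)
```

With a PyTorch optimizer, the returned value would typically be written into each entry of `optimizer.param_groups` under the `"lr"` key.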

_compute_advantages_and_returns() -> Tuple[torch.Tensor, torch.Tensor]

Compute advantages and returns using Generalized Advantage Estimation (GAE).

Returns:

  • advantages (torch.Tensor) – Advantage estimates for each step.

  • returns (torch.Tensor) – Computed returns for each step.

Examples

>>> adv, ret = algo._compute_advantages_and_returns()
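
For reference, GAE computes the temporal-difference residual δ_t = r_t + γ V(s_{t+1}) - V(s_t) and accumulates A_t = δ_t + γ λ A_{t+1} backwards through the trajectory. The sketch below implements that recursion on plain Python lists. `compute_gae` and its arguments are hypothetical, since the method documented here takes no arguments and returns tensors.

```python
from typing import List, Sequence, Tuple


def compute_gae(
    rewards: Sequence[float],
    values: Sequence[float],
    dones: Sequence[bool],
    last_value: float,
    gamma: float = 0.99,
    lam: float = 0.95,
) -> Tuple[List[float], List[float]]:
    """Return (advantages, returns); dones[t] is True if the episode ended at step t."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value  # bootstrap value for the state after the final step
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # TD residual; bootstrapping is cut when the episode terminated at step t.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    # Returns are the regression targets for the value function.
    returns = [adv + val for adv, val in zip(advantages, values)]
    return advantages, returns


advantages, returns = compute_gae(
    rewards=[1.0, 1.0],
    values=[0.0, 0.0],
    dones=[False, True],
    last_value=5.0,
    gamma=1.0,
    lam=1.0,
)
```

In this toy call the episode ends at the second step, so `last_value` is ignored and the advantages reduce to the undiscounted reward-to-go.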
save_checkpoint(policy: duo.policies.PPOPolicy, name: str) -> None

Save the current policy and optimizer state to a checkpoint file.

Parameters:
  • policy (duo.policies.PPOPolicy) – The policy to save.

  • name (str) – Name for the checkpoint file.

Return type:

None

Examples

>>> algo.save_checkpoint(policy, "last")
load_checkpoint(policy: duo.policies.PPOPolicy, load_path: str) -> None

Load policy and optimizer state from a checkpoint file.

Parameters:
  • policy (duo.policies.PPOPolicy) – The policy to load parameters into.

  • load_path (str) – Path to the checkpoint file.

Return type:

None

Examples

>>> algo.load_checkpoint(policy, "checkpoint.ckpt")
class duo_ai.algorithms.PyODAlgorithm(config: PyODAlgorithmConfig)

Bases: duo_ai.core.Algorithm

Algorithm for out-of-distribution (OOD) detection using PyOD models.

Examples

>>> algo = PyODAlgorithm(PyODAlgorithmConfig())
config_cls
config
random
train(policy: duo.policies.PPOPolicy, env: gym.Env, validators: Dict[str, duo.core.Evaluator]) -> None

Train the PyODAlgorithm by searching for the best threshold parameter that maximizes evaluation reward.

Parameters:
  • policy (duo.policies.PPOPolicy) – The policy to be evaluated and tuned.

  • env (gym.Env) – The environment instance for training and data generation.

  • validators (dict of str to duo.core.Evaluator) – Dictionary mapping split names to evaluator instances for evaluation.

Return type:

None

Examples

>>> algorithm = PyODAlgorithm(PyODAlgorithmConfig())
>>> algorithm.train(policy, env, validators)
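
To make the threshold search concrete without depending on PyOD, the sketch below swaps in a tiny z-score detector that mimics the `fit` / `decision_function` interface of PyOD detectors. `ZScoreDetector` and `pick_threshold` are hypothetical, and classification accuracy is used as a stand-in for the evaluation reward that the real algorithm maximizes.

```python
import statistics
from typing import List, Sequence


class ZScoreDetector:
    """Hypothetical stand-in for a PyOD detector: higher score means more anomalous."""

    def fit(self, values: Sequence[float]) -> "ZScoreDetector":
        self.mean = statistics.fmean(values)
        self.std = statistics.pstdev(values) or 1.0  # avoid dividing by zero
        return self

    def decision_function(self, values: Sequence[float]) -> List[float]:
        return [abs(v - self.mean) / self.std for v in values]


def pick_threshold(scores: Sequence[float], is_ood: Sequence[bool],
                   candidates: Sequence[float]) -> float:
    """Return the candidate threshold that labels the most samples correctly."""

    def accuracy(threshold: float) -> float:
        correct = sum((s > threshold) == flag for s, flag in zip(scores, is_ood))
        return correct / len(scores)

    return max(candidates, key=accuracy)


detector = ZScoreDetector().fit([1.0, 1.1, 0.9, 1.0, 1.05, 0.95])
held_out = [1.0, 0.9, 3.0, -2.0]
labels = [False, False, True, True]
ood_scores = detector.decision_function(held_out)
best_threshold = pick_threshold(ood_scores, labels, candidates=[0.5, 2.0, 50.0])
```

A threshold that is too low flags in-distribution inputs, and one that is too high misses real outliers, so the middle candidate wins here.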
save_checkpoint(policy: duo.policies.PPOPolicy, name: str) -> None

Save the current policy configuration and parameters to a checkpoint file.

Parameters:
  • policy (duo.policies.PPOPolicy) – The policy whose parameters are to be saved.

  • name (str) – Name for the checkpoint file.

Return type:

None

Examples

>>> algorithm.save_checkpoint(policy, "best_test")
_generate_data(env: gym.Env, policy: duo.policies.PPOPolicy, temperature: float, num_rollouts: int, accept_rate: float) -> dict

Generate data for OOD detection by rolling out the policy in the environment.

Parameters:
  • env (gym.Env) – The environment used for rollouts.

  • policy (duo.policies.PPOPolicy) – The policy to be evaluated.

  • temperature (float) – Temperature parameter for action selection.

  • num_rollouts (int) – Total number of rollout episodes to generate.

  • accept_rate (float) – Acceptance rate for sampling data during rollouts.

Returns:

data – Dictionary containing collected data arrays for each feature.

Return type:

dict

Examples

>>> data = self._generate_data(env, policy, 1.0, 128, 0.05)
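
One plausible reading of `accept_rate`, assumed here, is that each rollout step is kept independently with that probability, which thins out highly correlated consecutive steps. The sketch below uses the hypothetical name `collect_features` and takes pre-recorded steps as dictionaries instead of interacting with an environment.

```python
import random
from typing import Any, Dict, List

Step = Dict[str, Any]


def collect_features(rollouts: List[List[Step]], accept_rate: float,
                     rng: random.Random) -> Dict[str, List[Any]]:
    """Keep each step with probability `accept_rate` and group values by feature name."""
    data: Dict[str, List[Any]] = {}
    for episode in rollouts:
        for step in episode:
            # rng.random() is uniform on [0, 1), so this keeps a step with probability accept_rate.
            if rng.random() >= accept_rate:
                continue
            for name, value in step.items():
                data.setdefault(name, []).append(value)
    return data


rollouts = [
    [{"obs": 0.1, "logit": 2.0}, {"obs": 0.2, "logit": 1.5}],
    [{"obs": 0.3, "logit": 0.7}],
]
kept_all = collect_features(rollouts, accept_rate=1.0, rng=random.Random(0))
kept_none = collect_features(rollouts, accept_rate=0.0, rng=random.Random(0))
```

Passing an explicit `random.Random` instance keeps the subsampling reproducible across runs.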
class duo_ai.algorithms.RandomAlgorithm(config: RandomAlgorithmConfig)

Bases: duo_ai.core.Algorithm

Algorithm that searches for the best probability parameter to maximize evaluation reward.

Examples

>>> algo = RandomAlgorithm(RandomAlgorithmConfig())
config_cls
config
train(policy: duo.policies.PPOPolicy, env: gym.Env, validators: Dict[str, duo.core.Evaluator]) -> None

Train the RandomAlgorithm by searching for the best probability parameter that maximizes evaluation reward.

Parameters:
  • policy (duo.policies.PPOPolicy) – The policy to be evaluated and tuned.

  • env (gym.Env) – The environment instance for training and data generation.

  • validators (dict of str to duo.core.Evaluator) – Dictionary mapping split names to evaluator instances for evaluation.

Return type:

None

Examples

>>> algorithm = RandomAlgorithm(RandomAlgorithmConfig())
>>> algorithm.train(policy, env, validators)
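
The page does not say what the probability parameter controls. The sketch below assumes it is the chance of picking one of two actions at each step, and estimates the reward of each candidate by seeded sampling. `act`, `estimate_reward`, `search_probability`, and the toy reward are all hypothetical.

```python
import random
from typing import Callable, Sequence


def act(prob: float, rng: random.Random) -> int:
    """Return action 1 with probability `prob`, otherwise action 0."""
    return 1 if rng.random() < prob else 0


def estimate_reward(prob: float, reward_fn: Callable[[int], float],
                    num_steps: int = 1000, seed: int = 0) -> float:
    """Average reward over `num_steps` sampled actions, using a fixed seed."""
    rng = random.Random(seed)
    return sum(reward_fn(act(prob, rng)) for _ in range(num_steps)) / num_steps


def search_probability(candidates: Sequence[float],
                       reward_fn: Callable[[int], float]) -> float:
    """Return the candidate probability with the highest estimated reward."""
    return max(candidates, key=lambda p: estimate_reward(p, reward_fn))


def toy_reward(action: int) -> float:
    """Toy reward (assumption): only action 1 pays off."""
    return 1.0 if action == 1 else 0.0


best_prob = search_probability([0.0, 0.5, 1.0], toy_reward)
```

Reusing the same seed for every candidate means all candidates are compared on the same random draws, which reduces noise in the comparison.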
save_checkpoint(policy: duo.policies.PPOPolicy, name: str) -> None

Save the current policy configuration and parameters to a checkpoint file.

Parameters:
  • policy (duo.policies.PPOPolicy) – The policy whose parameters are to be saved.

  • name (str) – Name for the checkpoint file.

Return type:

None

Examples

>>> algorithm.save_checkpoint(policy, "best_test")
duo_ai.algorithms.registry