duo_ai.algorithms
Submodules
Attributes
Classes
AlwaysAlgorithm | Algorithm that always returns the same action, regardless of input.
LogitAlgorithm | Algorithm for tuning confidence-based policies using logit thresholds and temperatures.
PPOAlgorithm | Proximal Policy Optimization (PPO) algorithm implementation.
PyODAlgorithm | Algorithm for out-of-distribution (OOD) detection using PyOD models.
RandomAlgorithm | Algorithm that searches for the best probability parameter to maximize evaluation reward.
Package Contents
- class duo_ai.algorithms.AlwaysAlgorithm(config: AlwaysAlgorithmConfig)
Bases: duo_ai.core.algorithm.Algorithm
Algorithm that always returns the same action, regardless of input.
Examples
>>> algo = AlwaysAlgorithm(AlwaysAlgorithmConfig())
- config_cls
- train(policy: duo.core.Policy, env: gym.Env, validators: Dict[str, duo.core.Evaluator]) -> None
Run the AlwaysAlgorithm training procedure.
This method evaluates the provided policy in the given environment using the specified evaluators. The AlwaysAlgorithm always returns the same action, regardless of the input observation.
- Parameters:
policy (duo.core.Policy) – The policy instance to use for generating actions.
env (gym.Env) – The environment in which the policy is evaluated.
validators (dict of str to duo.core.Evaluator) – Dictionary mapping split names to evaluator instances for evaluation.
- Return type:
None
Examples
>>> algorithm = AlwaysAlgorithm(AlwaysAlgorithmConfig())
>>> algorithm.train(policy, env, validators)
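The "same action regardless of input" behavior can be illustrated with a minimal sketch. `ConstantPolicy` and its `act` method are hypothetical names used only for illustration; they are not part of the documented `duo_ai` API.

```python
class ConstantPolicy:
    """Hypothetical stand-in for a policy that ignores its observation."""

    def __init__(self, action):
        self.action = action

    def act(self, observation):
        # The observation is deliberately unused.
        return self.action


policy = ConstantPolicy(action=1)
observations = [[0.1, 0.2], [5.0, -3.0], None]
actions = [policy.act(obs) for obs in observations]
```

Every entry in `actions` equals the configured action, whatever the observation was.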
- class duo_ai.algorithms.LogitAlgorithm(config: LogitAlgorithmConfig)
Bases: duo_ai.core.Algorithm
Algorithm for tuning confidence-based policies using logit thresholds and temperatures.
Examples
>>> algo = LogitAlgorithm(LogitAlgorithmConfig())
- config_cls
- config
- train(policy: duo.core.Policy, env: gym.Env, validators: Dict[str, duo.core.Evaluator]) -> None
Train the LogitAlgorithm by searching for the best threshold and temperature parameters based on rollout scores and evaluation results.
- Parameters:
policy (duo.core.Policy) – The policy to be trained and evaluated.
env (gym.Env) – The environment used for training and rollouts.
validators (dict of str to duo.core.Evaluator) – Dictionary mapping split names to evaluator instances for evaluation.
- Return type:
None
Examples
>>> algorithm = LogitAlgorithm(LogitAlgorithmConfig())
>>> algorithm.train(policy, env, validators)
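As a rough illustration of a threshold and temperature search, the sketch below scores each sample by its largest temperature-scaled softmax probability and grid-searches both parameters. Everything here is an assumption for illustration: the helper names, the toy `samples`, and the reward rule (1.0 for a correct confident action, 0.0 for a wrong one, 0.5 when confidence falls below the threshold) are not taken from `duo_ai`.

```python
import math
from itertools import product


def max_softmax_confidence(logits, temperature):
    """Largest softmax probability after dividing the logits by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    return max(exps) / sum(exps)


def grid_search(evaluate, thresholds, temperatures):
    """Return (best_score, threshold, temperature) over the full grid."""
    best = None
    for threshold, temperature in product(thresholds, temperatures):
        score = evaluate(threshold, temperature)
        if best is None or score > best[0]:
            best = (score, threshold, temperature)
    return best


# Hypothetical data: (logits, whether acting on them would be correct).
samples = [([2.0, 0.0], True), ([0.2, 0.0], False), ([3.0, 0.0], True)]


def evaluate(threshold, temperature):
    """Toy reward: act when confident enough, otherwise take a fixed 0.5."""
    total = 0.0
    for logits, correct in samples:
        if max_softmax_confidence(logits, temperature) >= threshold:
            total += 1.0 if correct else 0.0
        else:
            total += 0.5
    return total


best = grid_search(evaluate, thresholds=[0.5, 0.7, 0.9], temperatures=[0.5, 1.0, 2.0])
```

Lower temperatures sharpen the softmax, so the same threshold admits more samples; searching both parameters jointly lets the threshold separate confident from unconfident cases.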
- save_checkpoint(policy: duo.core.Policy, name: str) -> None
Save the current policy configuration and parameters to a checkpoint file.
- Parameters:
policy (duo.core.Policy) – The policy whose parameters are to be saved.
name (str) – Name for the checkpoint file.
- Return type:
None
Examples
>>> self.save_checkpoint(policy, "best_test")
- _generate_scores(env: gym.Env, policy: duo.core.Policy, temperature: float, num_rollouts: int) -> list
Generate confidence scores by rolling out the policy in the environment.
- Parameters:
env (gym.Env) – The environment used for rollouts.
policy (duo.core.Policy) – The policy to be evaluated.
temperature (float) – Temperature parameter for action selection.
num_rollouts (int) – Total number of rollout episodes to generate.
- Returns:
scores – List of confidence scores collected from rollouts.
- Return type:
list of float
Examples
>>> scores = self._generate_scores(env, policy, 1.0, 128)
- class duo_ai.algorithms.PPOAlgorithm(config: PPOAlgorithmConfig)
Bases: duo_ai.core.Algorithm
Proximal Policy Optimization (PPO) algorithm implementation.
Examples
>>> algo = PPOAlgorithm(PPOAlgorithmConfig())
>>> algo.train(policy, env, validators)
- config_cls
- config
- _initialize() -> None
Initialize PPO training state, buffers, optimizer, and logging.
- Return type:
None
- train(policy: duo.policies.PPOPolicy, env: gymnasium.Env, validators: Dict[str, duo.core.Evaluator]) -> None
Train the provided policy with PPO on the specified environment(s).
This method performs multiple training iterations, periodically evaluates the policy, logs statistics, and saves checkpoints for the best and last models.
- Parameters:
policy (duo.policies.PPOPolicy) – The policy to be trained.
env (gym.Env) – The environment instance for training.
validators (dict of str to duo.core.Evaluator) – Dictionary mapping split names to evaluator instances for evaluation.
- Return type:
None
Examples
>>> algorithm.train(policy, env, validators)
- _train_once() -> None
Perform a single training iteration of PPO, including trajectory collection, advantage computation, and policy/value updates.
- Return type:
None
- _update_learning_rate() -> None
Update the learning rate for the optimizer, optionally annealing it over time.
- Return type:
None
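The source only says the learning rate is optionally annealed; it does not name the schedule. A minimal sketch, assuming the linear decay common in PPO implementations, is shown below. The function name and its parameters are hypothetical.

```python
def annealed_learning_rate(base_lr, iteration, total_iterations, anneal=True):
    """Linearly decay base_lr toward zero; return base_lr when annealing is off."""
    if not anneal:
        return base_lr
    fraction_remaining = 1.0 - iteration / total_iterations
    return base_lr * fraction_remaining


# Learning rate at the start of each of 4 training iterations.
schedule = [annealed_learning_rate(1.0, i, 4) for i in range(4)]
```

In a PyTorch training loop, the resulting value would typically be written into each entry of the optimizer's `param_groups` before the update step.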
- _compute_advantages_and_returns() -> Tuple[torch.Tensor, torch.Tensor]
Compute advantages and returns using Generalized Advantage Estimation (GAE).
- Returns:
advantages (torch.Tensor) – Advantage estimates for each step.
returns (torch.Tensor) – Computed returns for each step.
Examples
>>> adv, ret = algo._compute_advantages_and_returns()
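For reference, the standard GAE recursion can be sketched in plain Python. This is not the library's implementation: the documented method takes no arguments and returns `torch.Tensor` objects, whereas the hypothetical `compute_gae` below takes explicit lists and assumes `dones[t]` marks that the episode ended after step `t`.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, gae_lambda=0.95):
    """Return (advantages, returns) for one trajectory using GAE."""
    advantages = [0.0] * len(rewards)
    next_advantage = 0.0
    next_value = last_value  # value estimate for the state after the final step
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; bootstrapping is cut at episode boundaries.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of future TD errors.
        next_advantage = delta + gamma * gae_lambda * not_done * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    # Returns are the advantages added back onto the value baseline.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns


adv, ret = compute_gae(
    rewards=[1.0, 1.0],
    values=[0.0, 0.0],
    dones=[False, True],
    last_value=10.0,
    gamma=1.0,
    gae_lambda=1.0,
)
```

With `gamma` and `gae_lambda` both set to 1 and zero value estimates, the advantages reduce to the undiscounted reward-to-go, and `last_value` is ignored because the final step is terminal.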
- save_checkpoint(policy: duo.policies.PPOPolicy, name: str) -> None
Save the current policy and optimizer state to a checkpoint file.
- Parameters:
policy (duo.policies.PPOPolicy) – The policy to save.
name (str) – Name for the checkpoint file.
- Return type:
None
Examples
>>> algo.save_checkpoint(policy, "last")
- load_checkpoint(policy: duo.policies.PPOPolicy, load_path: str) -> None
Load policy and optimizer state from a checkpoint file.
- Parameters:
policy (duo.policies.PPOPolicy) – The policy to load parameters into.
load_path (str) – Path to the checkpoint file.
- Return type:
None
Examples
>>> algo.load_checkpoint(policy, "checkpoint.ckpt")
- class duo_ai.algorithms.PyODAlgorithm(config: PyODAlgorithmConfig)
Bases: duo_ai.core.Algorithm
Algorithm for out-of-distribution (OOD) detection using PyOD models.
Examples
>>> algo = PyODAlgorithm(PyODAlgorithmConfig())
- config_cls
- config
- random
- train(policy: duo.policies.PPOPolicy, env: gym.Env, validators: Dict[str, duo.core.Evaluator]) -> None
Train the PyODAlgorithm by searching for the best threshold parameter that maximizes evaluation reward.
- Parameters:
policy (duo.policies.PPOPolicy) – The policy to be evaluated and tuned.
env (gym.Env) – The environment instance for training and data generation.
validators (dict of str to duo.core.Evaluator) – Dictionary mapping split names to evaluator instances for evaluation.
- Return type:
None
Examples
>>> algorithm = PyODAlgorithm(PyODAlgorithmConfig())
>>> algorithm.train(policy, env, validators)
- save_checkpoint(policy: duo.policies.PPOPolicy, name: str) -> None
Save the current policy configuration and parameters to a checkpoint file.
- Parameters:
policy (duo.policies.PPOPolicy) – The policy whose parameters are to be saved.
name (str) – Name for the checkpoint file.
- Return type:
None
Examples
>>> self.save_checkpoint(policy, "best_test")
- _generate_data(env: gym.Env, policy: duo.policies.PPOPolicy, temperature: float, num_rollouts: int, accept_rate: float) -> dict
Generate data for OOD detection by rolling out the policy in the environment.
- Parameters:
env (gym.Env) – The environment used for rollouts.
policy (duo.policies.PPOPolicy) – The policy to be evaluated.
temperature (float) – Temperature parameter for action selection.
num_rollouts (int) – Total number of rollout episodes to generate.
accept_rate (float) – Acceptance rate for sampling data during rollouts.
- Returns:
data – Dictionary containing collected data arrays for each feature.
- Return type:
dict
Examples
>>> data = self._generate_data(env, policy, 1.0, 128, 0.05)
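One plausible reading of `accept_rate` is that each rollout step is kept independently with that probability, and the kept steps are gathered into one list per feature. The sketch below assumes exactly that; the function name, the feature names, and the Bernoulli sampling rule are illustrative assumptions rather than documented behavior.

```python
import random


def subsample_steps(steps, accept_rate, rng):
    """Keep each step with probability accept_rate and group values by feature."""
    data = {}
    for step in steps:
        if rng.random() >= accept_rate:
            continue  # step rejected
        for name, value in step.items():
            data.setdefault(name, []).append(value)
    return data


rng = random.Random(0)  # seeded for reproducibility
# Hypothetical per-step features collected during rollouts.
steps = [{"obs": i, "logit": 0.1 * i} for i in range(1000)]
data = subsample_steps(steps, accept_rate=0.05, rng=rng)
```

Subsampling of this kind keeps the training set for an outlier detector small and reduces correlation between consecutive steps of the same episode.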
- class duo_ai.algorithms.RandomAlgorithm(config: RandomAlgorithmConfig)
Bases: duo_ai.core.Algorithm
Algorithm that searches for the best probability parameter to maximize evaluation reward.
Examples
>>> algo = RandomAlgorithm(RandomAlgorithmConfig())
- config_cls
- config
- train(policy: duo.policies.PPOPolicy, env: gym.Env, validators: Dict[str, duo.core.Evaluator]) -> None
Train the RandomAlgorithm by searching for the best probability parameter that maximizes evaluation reward.
- Parameters:
policy (duo.policies.PPOPolicy) – The policy to be evaluated and tuned.
env (gym.Env) – The environment instance for training and data generation.
validators (dict of str to duo.core.Evaluator) – Dictionary mapping split names to evaluator instances for evaluation.
- Return type:
None
Examples
>>> algorithm = RandomAlgorithm(RandomAlgorithmConfig())
>>> algorithm.train(policy, env, validators)
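The probability search can be pictured as a one-dimensional sweep: build a policy for each candidate probability, evaluate it, and keep the best. The sketch below is an assumption-laden illustration; `make_random_policy`, `search_probability`, and the toy reward function are hypothetical and stand in for real evaluator calls.

```python
import random


def make_random_policy(prob, rng):
    """Return a policy that picks action 1 with probability prob, else action 0."""
    def act(observation):
        return 1 if rng.random() < prob else 0
    return act


def search_probability(evaluate, candidates):
    """Return the candidate probability with the highest evaluation reward."""
    return max(candidates, key=evaluate)


def toy_reward(prob):
    """Hypothetical reward curve peaking at prob = 0.3."""
    return -(prob - 0.3) ** 2


best_prob = search_probability(toy_reward, [0.0, 0.1, 0.2, 0.3, 0.4])
policy = make_random_policy(best_prob, random.Random(0))
```

In practice the evaluation would run the policy through an evaluator on a validation split rather than a closed-form curve.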
- save_checkpoint(policy: duo.policies.PPOPolicy, name: str) -> None
Save the current policy configuration and parameters to a checkpoint file.
- Parameters:
policy (duo.policies.PPOPolicy) – The policy whose parameters are to be saved.
name (str) – Name for the checkpoint file.
- Return type:
None
Examples
>>> self.save_checkpoint(policy, "best_test")
- duo_ai.algorithms.registry