Logit Algorithms
The novice computes a confidence score based on its output logits. It issues a help request whenever the score is below a threshold.
Notation:
\(z = (z_1, \dots, z_{|A|})\) are the logits computed by the novice.
\(p = \mathrm{Softmax}(z)\) is the probability vector.
\(p^{\downarrow}\) denotes the elements of \(p\) sorted in descending order.
Supported metrics:
max_logit: The maximum logit value \(\max_i z_i\)
max_prob[1]: The maximum probability \(\max_i p_i\)
margin[2]: The difference between the highest and second-highest probabilities \(p_1^{\downarrow} - p_2^{\downarrow}\)
entropy[3]: The negative entropy of the action distribution \(\sum_i p_i \ln p_i\)
energy[4]: The log-sum-exp of the logits \(\ln \sum_i \exp(z_i)\)
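The five metrics can be sketched in a few lines of NumPy. This is an illustrative sketch, not the library's implementation: the function names `confidence_scores` and `should_request_help` are hypothetical, and the code assumes a single 1-D logit vector with at least two finite entries.

```python
import numpy as np


def confidence_scores(z):
    """Compute all five confidence metrics for one logit vector z.

    Hypothetical helper for illustration; assumes z is 1-D, finite,
    and has at least two entries (needed for the margin).
    """
    z = np.asarray(z, dtype=float)

    # Numerically stable log-sum-exp: subtract the max before exponentiating.
    z_max = z.max()
    lse = z_max + np.log(np.exp(z - z_max).sum())

    # Softmax via log-probabilities: log p_i = z_i - logsumexp(z).
    log_p = z - lse
    p = np.exp(log_p)

    # p sorted in descending order (p^down in the notation above).
    p_sorted = np.sort(p)[::-1]

    return {
        "max_logit": float(z_max),
        "max_prob": float(p_sorted[0]),
        "margin": float(p_sorted[0] - p_sorted[1]),
        # Negative entropy: sum_i p_i ln p_i (always <= 0).
        "entropy": float(np.sum(p * log_p)),
        # Log-sum-exp of the logits.
        "energy": float(lse),
    }


def should_request_help(score, threshold):
    """Request help whenever the confidence score is below the threshold."""
    return score < threshold
```

For every metric, a larger value indicates higher confidence, so the same "below the threshold" rule applies to all of them. For example, `should_request_help(confidence_scores([0.0, 0.0])["max_prob"], 0.9)` evaluates to `True`, because a uniform distribution over two actions has a maximum probability of 0.5.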
A challenge in this approach is choosing an appropriate threshold. We address it with the following adaptive procedure:
Exploration: Use the novice to explore the training environment, generating a set of states \(\mathcal{S}_{\text{train}}\).
Score Computation: For each state \(s \in \mathcal{S}_{\text{train}}\), compute its confidence score \(c(s)\). This results in a pool of confidence scores \(\mathcal{C} = \{c(s) \mid s \in \mathcal{S}_{\text{train}}\}\).
Threshold Selection: Consider the \(n\)-th percentiles of \(\mathcal{C}\) as candidate thresholds (\(n = 0, 10, \dots, 100\)).
Validation: For each candidate threshold, construct the policy that requests help whenever the confidence score falls below that threshold, and evaluate its performance on the validation tasks.
Test-Time Selection: Select the policy that yields the best validation performance and use it during testing.
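The threshold selection and validation steps above can be sketched as follows. This is a minimal sketch under stated assumptions: `candidate_thresholds` and `select_threshold` are hypothetical names, and `evaluate_on_validation` is a hypothetical user-supplied callback that builds the thresholded policy, runs it on the validation tasks, and returns a scalar where higher is better.

```python
import numpy as np


def candidate_thresholds(scores, step=10):
    """Return the 0th, 10th, ..., 100th percentiles of the score pool C."""
    percentiles = np.arange(0, 101, step)
    return np.percentile(np.asarray(scores, dtype=float), percentiles)


def select_threshold(scores, evaluate_on_validation):
    """Pick the candidate threshold with the best validation performance.

    evaluate_on_validation(threshold) -> float is an assumed callback:
    it should evaluate the policy that requests help whenever the
    confidence score is below `threshold`. Higher return values are
    treated as better.
    """
    best_threshold, best_performance = None, -np.inf
    for threshold in candidate_thresholds(scores):
        performance = evaluate_on_validation(float(threshold))
        # Strict comparison keeps the earliest (lowest) threshold on ties.
        if performance > best_performance:
            best_threshold, best_performance = float(threshold), performance
    return best_threshold, best_performance
```

Because help is requested only when the score is strictly below the threshold, the 0th-percentile candidate never triggers a request on any training state, while the 100th-percentile candidate triggers one on every training state except those attaining the maximum score. The candidates therefore span the range from almost never asking to almost always asking.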