Logit Algorithms

The novice computes a confidence score from its output logits and issues a help request whenever the score falls below a threshold.

Notation:

  • \(z = (z_1, \dots, z_{|A|})\) are the logits computed by the novice.

  • \(p = \mathrm{Softmax}(z)\) is the probability vector.

  • \(p^{\downarrow}\) denotes the elements of \(p\) sorted in descending order.

Supported metrics:

  • max_logit: The maximum logit value \(\max_i z_i\)

  • max_prob [1]: The maximum probability \(\max_i p_i\)

  • margin [2]: The difference between the highest and second-highest probabilities \(p_1^{\downarrow} - p_2^{\downarrow}\)

  • entropy [3]: The negative entropy of the action distribution \(\sum_i p_i \ln p_i\), negated so that a higher score corresponds to a more peaked distribution

  • energy [4]: The log-sum-exp of the logits \(\ln \sum_i \exp(z_i)\)
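
As an illustration, the five scores and the help-request rule can be sketched in a few lines of Python. This is a minimal sketch rather than the implementation used in the experiments: the function names `softmax`, `confidence`, and `should_request_help` are hypothetical, the logits are assumed to be a plain list of floats with at least two entries, and the log-sum-exp is computed with the usual max-shift for numerical stability.

```python
import math


def softmax(z):
    """Convert logits z into probabilities, shifting by max(z) for stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(z, metric):
    """Return the confidence score of logits z under the named metric.

    Assumes len(z) >= 2 (the margin needs a second-highest probability).
    """
    p = softmax(z)
    if metric == "max_logit":
        return max(z)
    if metric == "max_prob":
        return max(p)
    if metric == "margin":
        top = sorted(p, reverse=True)  # p sorted in descending order
        return top[0] - top[1]
    if metric == "entropy":
        # Negative entropy: sum_i p_i ln p_i (terms with p_i = 0 contribute 0).
        return sum(pi * math.log(pi) for pi in p if pi > 0.0)
    if metric == "energy":
        # ln sum_i exp(z_i), computed as m + ln sum_i exp(z_i - m).
        m = max(z)
        return m + math.log(sum(math.exp(v - m) for v in z))
    raise ValueError(f"unknown metric: {metric}")


def should_request_help(z, metric, threshold):
    """Request help whenever the confidence score is below the threshold."""
    return confidence(z, metric) < threshold
```

For two equal logits, for example, `confidence([0.0, 0.0], "max_prob")` is 0.5 and the margin is 0, so any threshold above those values triggers a help request.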

A challenge with this approach is choosing an appropriate threshold. We address it with the following adaptive procedure:

  1. Exploration: Use the novice to explore the training environment, generating a set of states \(\mathcal{S}_{\text{train}}\).

  2. Score Computation: For each state \(s \in \mathcal{S}_{\text{train}}\), compute its confidence score \(c(s)\). This results in a pool of confidence scores \(\mathcal{C} = \{c(s) \mid s \in \mathcal{S}_{\text{train}}\}\).

  3. Threshold Selection: Take the \(n\)-th percentiles of \(\mathcal{C}\), for \(n = 0, 10, \dots, 100\), as candidate thresholds.

  4. Validation: For each candidate threshold, construct the policy that requests help whenever the confidence score falls below that threshold, and evaluate its performance on the validation tasks.

  5. Test-Time Selection: Select the policy that yields the best validation performance and use it during testing.
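
Steps 3 to 5 of the procedure can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: `percentile` and `select_threshold` are hypothetical names, percentiles are computed with linear interpolation between ranks (the text does not specify an interpolation rule), and `evaluate` is an assumed user-supplied callable that builds the policy for a given threshold and returns its validation performance, with higher values being better.

```python
def percentile(sorted_scores, n):
    """n-th percentile (0 <= n <= 100) of an ascending list, linearly interpolated."""
    if not sorted_scores:
        raise ValueError("score pool is empty")
    rank = (len(sorted_scores) - 1) * n / 100.0
    lo = int(rank)  # rank is non-negative, so int() floors it
    hi = min(lo + 1, len(sorted_scores) - 1)
    frac = rank - lo
    return sorted_scores[lo] + frac * (sorted_scores[hi] - sorted_scores[lo])


def select_threshold(scores, evaluate, percentiles=range(0, 101, 10)):
    """Pick the candidate threshold with the best validation performance.

    scores:   pool of confidence scores collected on training states
    evaluate: hypothetical callable, threshold -> validation performance
    Returns (best_threshold, best_performance); ties keep the lowest percentile.
    """
    ordered = sorted(scores)
    best_threshold, best_value = None, float("-inf")
    for n in percentiles:
        threshold = percentile(ordered, n)
        value = evaluate(threshold)
        if value > best_value:
            best_threshold, best_value = threshold, value
    return best_threshold, best_value
```

With a strict "below threshold" rule, the 0th-percentile candidate never requests help on the pooled training states, while the 100th-percentile candidate requests help on every pooled state except those attaining the maximum score, so the candidates span the range from nearly autonomous to nearly always assisted.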

References