Reviewer #1
Questions
1. Summarize the contributions made in the paper with your own words
The paper proposes to use Thompson Sampling to select informative samples for training an outlier detector using outlier exposure. For this, the authors make use of Bayesian linear regression. The method is relatively simple yet seemingly effective and outperforms a relevant set of baseline methods.
2. Novelty, relevance, significance
The problem of outlier detection is a widely studied problem that is relevant for the application of deep learning methods (and general ML methods) to real world applications. The application of Thompson Sampling to outlier mining for more efficient usage of outlier samples and improved outlier detection is novel and seems to significantly improve the performance.
3. Soundness
The method proposed in the paper is theoretically sound. The experiments seem to indicate superior performance compared to relevant baselines.
4. Quality of writing/presentation
The paper is well written and easy to follow. However, the presentation could be improved by some minor changes (full list below).
5. Literature
The paper covers the most important relevant work. However, it is missing some more recent improvements to methods that do not require auxiliary data, such as [1].
[1]: Zhang, Hongjie, et al. "Hybrid models for open set recognition." European Conference on Computer Vision. Springer, Cham, 2020.
6. Basis of review (how much of the paper did you read)?
I checked the full paper and parts of the appendix. However, I did not check the theoretical analysis in depth.
7. Summary
Post-rebuttal update:
I thank the authors for the extensive response. I believe that the paper tackles an interesting problem with a novel solution. I believe that the manuscript in the current form would be of interest to the wider research community. Nevertheless, I encourage the authors to consider R2's suggestion of expanding the experiments beyond CIFAR10/100.
----
Overall, this paper is well written and easy to follow, and it tackles the important problem of detecting OOD data.
- The paper presents a novel method for outlier mining, improving the outlier detection performance.
- The paper uses relevant baselines in the experiment and shows favorable results for the proposed method.
- Most relevant background is covered.
- The interpretation of the experiments would benefit from a more detailed depiction of the individual results. Currently only aggregated results are shown. Also, the baseline method results are not 100% aligned with the originally published results.
8. Miscellaneous minor issues
- In Figure 1, what are the green shaded regions? Are those the class boundaries? Why is this different from 1.a?
- The explanation of the method would benefit from a note that mentions that the posterior sampling and update is performed as a separate step of the main training.
- What prior is used for the linear regression?
- How is the posterior logit variance chosen? What's the value?
- The update of the Bayesian linear regression seems to be more of an analytical computation rather than an "update" of some parameters. This could be rephrased or clarified.
- How are the margin parameters chosen? What's the impact of them?
- Would this method be more effective with other types of outliers too?
- Why does Table 1 not show standard deviations or so, given that Fig 3. shows them?
- Fig. 3 is a bit hard to read - I'd suggest changing the color scheme to something appropriate for color-blindness and also use different markers for the different methods.
- When mentioning that POEM benefits from early stopping: what are the settings there? Does this mean that only POEM is stopped while regular training continues?
- What does the training with more outliers test? Is this referring to more outliers in the overall pool or a bigger outlier buffer? Did you perform any tests on using a smaller set? It'd be more insightful to run experiments with fewer outliers.
As a separate question - would there be an option to use Thompson Sampling without the Bayesian linear regression? E.g., could one turn the predictive probabilities / logits into a probability of being on the decision boundary and then sample outliers according to p(boundary)?
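To make the suggestion concrete, here is a minimal sketch of what such a posterior-free variant might look like, assuming energy-style scores where values near zero mean "near the decision boundary". All names and the score-to-probability mapping are hypothetical, not from the paper:

```python
import numpy as np

def sample_by_boundary_prob(energies, k, temperature=1.0, rng=None):
    """Hypothetical Thompson-free variant: map each outlier's
    energy/logit score to an (unnormalized) 'near-boundary' weight
    and sample k outliers proportionally, instead of drawing a
    model from a Bayesian linear regression posterior."""
    rng = np.random.default_rng() if rng is None else rng
    # Scores near zero = near the assumed ID/OOD decision boundary;
    # turn distance from the boundary into a sampling weight.
    weights = np.exp(-np.abs(energies) / temperature)
    p_boundary = weights / weights.sum()
    # Sample k distinct pool indices according to p(boundary).
    return rng.choice(len(energies), size=k, replace=False, p=p_boundary)
```

Such a scheme keeps some exploration through the stochastic sampling itself, but unlike Thompson sampling it does not represent uncertainty over the scoring model.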
10. [R] Phase 1 recommendation. Should the paper progress to phase 2?
Yes
Reviewer #2
Questions
1. Summarize the contributions made in the paper with your own words
Building on the outlier exposure framework, the paper introduces a posterior-based sampling approach to select more relevant examples for model regularization. The idea is to estimate the posterior by performing Bayesian linear regression on a neural network feature vector. Sampling from the posterior is then performed using Thompson sampling in order to balance the exploration/exploitation trade-off. As a result, the decision boundary between in- and out-of-distribution regions is better approximated. The paper demonstrates empirical improvements over competing methods. Finally, the paper theoretically analyzes the benefit of outlier mining with high boundary scores for the simple mixture-of-Gaussians case.
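For concreteness, the two pieces this summary describes can be sketched in NumPy: a closed-form Bayesian linear regression posterior over score weights, and a Thompson sampling step that draws one model from that posterior and ranks the outlier pool under it. Function names, shapes, and hyperparameters below are my own assumptions, not the paper's implementation:

```python
import numpy as np

def posterior_update(features, targets, noise_var=1.0, prior_var=1.0):
    """Closed-form Bayesian linear regression posterior with prior
    N(0, prior_var * I) and observation noise variance noise_var:
    Sigma = (X^T X / s^2 + I / p)^-1,  mu = Sigma X^T y / s^2."""
    d = features.shape[1]
    precision = features.T @ features / noise_var + np.eye(d) / prior_var
    Sigma = np.linalg.inv(precision)
    mu = Sigma @ features.T @ targets / noise_var
    return mu, Sigma

def thompson_select(features, mu, Sigma, k):
    """Thompson sampling step: draw one weight vector from the
    posterior N(mu, Sigma) (exploration), score the outlier pool
    under the sampled model, and keep the k candidates with the
    highest boundary scores (exploitation)."""
    w = np.random.multivariate_normal(mu, Sigma)
    scores = features @ w
    return np.argsort(scores)[-k:]
```

The appeal of this combination is that the posterior is analytical (a single linear solve per update), so the exploration/exploitation balance comes essentially for free on top of the feature extractor.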
2. Novelty, relevance, significance
Given that outlier exposure and outlier mining are known approaches, the contribution this paper makes is incremental. At the same time, the Thompson-sampling based strategy seems to consistently improve results in the performed experiments. Hence, the paper is a novel combination of known methods.
3. Soundness
Both the methodological and the experimental part appear to be sufficiently well supported and sound.
4. Quality of writing/presentation
The writing is generally clear and easy to follow. See first weakness in summary below for an exception.
5. Literature
Related work appears to be sufficiently discussed.
6. Basis of review (how much of the paper did you read)?
I read the main part of the paper and skimmed the appendix.
7. Summary
Strengths:
- The general problem of OOD detection is an important problem in reliable machine learning.
- The Thompson sampling approach presented in the paper makes sure that the outlier exposure regularizer is confronted with informative samples. This ensures a more sample-efficient learning of the in-distribution boundary.
- Performance gains using this approach seem to consistently improve over competing methods.
- While limited on mixtures of Gaussians, the theoretical section analyzes the benefit of outlier mining.
Weaknesses / Open Questions:
- When reading the caption of Figure 2 and the text in lines 221-223 I was initially confused as to why both in-distribution and out-of-distribution samples should go into the classification branch. This is cleared up in the next paragraph, which explains that the outliers are used for regularization. I think it would be better to stress this early on and remind readers that this classification branch uses an outlier exposure loss. Figure 2 should be updated with a corresponding depiction as well.
- I wonder whether the authors have considered choosing other acquisition functions popular in the reinforcement learning and Bayesian optimization literature (such as entropy-based methods or UCB) and why Thompson sampling was eventually picked.
- SSD+ uses a different hyper-parameter configuration than the rest of the experiments. It would be great if the authors could provide a justification for this deviation.
- While the paper compares with a decent amount of competing approaches, it would have been great if the paper also contained more results on other datasets besides CIFAR-10 and CIFAR-100 as the in-distribution model.
- It is great to see error bars for POEM in Table 1. While I tend to believe that the presented approach consistently outperforms other approaches, all competing methods lack such error bars. This makes it hard to determine whether the results are statistically significant.
- The paper could also benefit from a more detailed discussion around the particular choice of posterior approximation (Bayesian linear regression on neural network features VS Bayesian neural network VS Deep Ensembles, for example). I would not hold this point against the paper too much though, as the current tradeoff between computational tractability and OOD detectability seems to be well balanced.
====== Post-rebuttal update ======
Thanks to the authors for their response! My concerns appear to have been mostly addressed and I feel the paper is a valid contribution to the OOD literature. However, I still think that additional experiments with other datasets as ID data beyond CIFAR-10 and CIFAR-100 could further strengthen the empirical results.
8. Miscellaneous minor issues
- The term "outlier mining" should be defined in its first introduction on line 11 right column.
Reviewer #3
Questions
1. Summarize the contributions made in the paper with your own words
The authors propose a method for mining outlier examples, out of a much larger set, to train an out-of-distribution classifier. The key idea is to make use of Thompson sampling, which balances exploration and exploitation better than a similar method based on greedy sampling, namely NTOM.
2. Novelty, relevance, significance
The main novelty is the use of Thompson sampling to improve the selection of a subset of outliers, making the training of an OOD detector more efficient.
The contributions are somewhat narrow, but there is a community that might benefit from these findings. The proposed method achieves better results than the other SOTA methods, which is a strong argument in my opinion.
3. Soundness
I think the research is relevant and presents strong evidence for a method that can be applied to OOD detection when a set of outliers is available to be treated as OOD during training. But that is not the case in a plethora of real-world applications, so the applicability of the approach can be limited.
However, the research is mostly sound and most choices are well explained. I highlight some exceptions that looked awkward to me, such as the lack of a more scientifically sound framework for the metrics and for setting the rejection threshold. I suggest the authors take a look at Equal Error Rate and False Acceptance Rate, which, in my opinion, make the comparison of different systems fairer.
4. Quality of writing/presentation
The paper is well organized and well written. There are some typos, but they rarely appear in the text. The only one I caught was 'posteior' instead of 'posterior' in line 162.
5. Literature
I think they cover what has been done previously with the same datasets, but the paper lacks related work done in other fields, especially for NLP applications. I suggest the authors take a look at recent editions of ACL and EMNLP to find other interesting works and datasets. A few of these references are:
- Chen and You, GOLD: Improving Out-of-Scope Detection in Dialogues using Data Augmentation, EMNLP 2021
- Zhou et al., Contrastive Out-of-Distribution Detection for Pretrained Transformers, EMNLP 2021
- Li et al., kFolden: k-Fold Ensemble for Out-Of-Distribution Detection, EMNLP 2021
- Nimah et al, ProtoInfoMax: Prototypical Networks with Mutual Information Maximization for Out-of-Domain Detection, Findings of EMNLP 2021
- Xu et al, Unsupervised Out-of-Domain Detection via Pre-trained Transformers, ACL 2021
- Shen et al, Enhancing the generalization for Intent Classification and Out-of-Domain Detection in SLU, ACL 2021
- Zhan et al, Out-of-Scope Intent Detection with Self-Supervision and Discriminative Training, ACL 2021.
- Zeng et al, Modeling Discriminative Representations for Out-of-Domain Detection with Supervised Contrastive Learning, ACL 2021.
6. Basis of review (how much of the paper did you read)?
I read the full paper, but have not thoroughly checked the equations and proofs.
7. Summary
I think the paper presents interesting results that can be valuable to applications where data is abundant, and the presentation is good.
But there are a few weak points that, if addressed, can considerably improve the presentation. For example, it would be helpful to cover literature and datasets from problems that are 'open-world' from the start, such as intent classification in dialogue systems. Additionally, making a major limitation of the proposed approach clearer, namely the need for a set of outliers to calibrate the system, can also help the reader understand the applicability of the proposed approach.
10. [R] Phase 1 recommendation. Should the paper progress to phase 2?
Yes
Reviewer #4
Questions
1. Summarize the contributions made in the paper with your own words
A Thompson sampling-based method for training-time auxiliary outlier selection is proposed to learn better out-of-distribution (OOD) detectors. This method selects auxiliary outliers near the OOD decision boundary (exploitation) on sampled posterior models (which allows exploration). Building on recent advances in OOD detection models trained with auxiliary datasets, this work demonstrates superior sample efficiency and competitive detection performance through experiments.
2. Novelty, relevance, significance
- This paper contributes some new ideas to advance the field.
- The main contribution lies in the novel idea of framing outlier mining as a sequential decision-making problem, thus drawing our attention to the possibility of leveraging established sequential decision-making techniques. This paper also views the previous work NTOM through this lens, categorizing it as a greedy sampling strategy. Therefore, I can see this work as a natural extension that advances from random/greedy sampling to one that balances exploration and exploitation. Could the authors then explain the choice of Thompson sampling over other well-known acquisition functions, such as EI, UCB, etc.?
- Energy-regularized training is directly adopted to utilize the outliers identified in the mining step; there is no contribution to the training procedure. However, the idea of interleaving the feature extraction update and the posterior model update is significant in producing an effective end-to-end framework.
3. Soundness
- It seems that the availability of a suitable auxiliary outlier dataset could affect the performance of learned models. For example, an auxiliary dataset that is very different from the in-distribution dataset would have limited effectiveness to produce a useful OOD decision boundary in practice. How do you resolve this issue? This is also related to the claim about the consistency of performance over different auxiliary outlier datasets in Appendix B, is there an experiment setting where auxiliary outliers are "more different" from the in-distribution data than the OOD test dataset? To me, this is important because the OOD test dataset should remain unknown.
- [Sec 3.2] A Bayesian linear regressor is used to model the boundary score. A fixed logit value of (3 + Gaussian noise) is used as the target value of auxiliary outliers. I do not see how the fitted model then helps to distinguish outliers closer to the ID-OOD boundary (i.e., outliers with a high boundary score).
- [Clarification] When updating the fixed-sized queue at each epoch, do you just add the updated features of the new samples selected during this round? Or do you update the whole queue of size M (including the previous epochs)? This is of concern because only the features from the updated neural network should be used for posterior modelling.
- [Sec 4.2] It is claimed that POEM utilizes outliers more effectively than existing approaches. Referring to Figure 3, I would expect a faster descent in FPR at the very beginning instead of only after 60 epochs. What does the curve look like before 60 epochs?
- [Appendix B] Why is FPR95 the only metric reported for "the choice of auxiliary dataset" study? And why is AUROC the only metric reported for "the effect of pool size" study?
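On the boundary-score question above (Sec 3.2), one toy reading of the fixed-logit construction is the following, purely illustrative and not necessarily the paper's exact scheme: if ID features are regressed toward one constant and outlier features toward another, then an outlier whose predicted logit lands between the two targets, near zero, sits near the fitted ID-OOD boundary. The labeling of ID samples with -c and the ridge solve below are my own assumptions:

```python
import numpy as np

def boundary_scores(feats_id, feats_out, c=3.0, noise=1.0, seed=0):
    """Toy illustration: label ID features -c and outlier features +c
    (each with Gaussian noise), fit a ridge regression (equivalently,
    the Bayesian linear regression posterior mean with unit prior and
    noise variance), and score each pool outlier by how close its
    predicted logit is to zero, i.e. to the fitted ID/OOD boundary."""
    rng = np.random.default_rng(seed)
    X = np.vstack([feats_id, feats_out])
    y = np.concatenate([
        -c + noise * rng.standard_normal(len(feats_id)),
         c + noise * rng.standard_normal(len(feats_out)),
    ])
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + np.eye(d), X.T @ y)
    # Higher score = predicted logit nearer zero = nearer the boundary.
    return -np.abs(feats_out @ w)
```

Under this reading, the fixed target does not itself encode "boundary-ness"; it is the magnitude of the fitted model's prediction, relative to the targets, that separates near-boundary from far-away outliers.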
4. Quality of writing/presentation
- The paper is generally well written. The delivery of ideas is clear and the organization of the paper is good. I especially like that the paper starts with empirical motivations and corresponding experiments, then ends off with an additional section on theoretical analysis.
- For experiments run repeatedly, the standard error/deviation should be clearly stated. For example, state them in Table 1 (other methods), Table 3, Table 4, Table 5. Error bars or error regions should be included in Figure 3 as well.
5. Literature
The authors place the paper well in the context of current research: OOD detection with outlier exposure that is carefully mined considering both exploration and exploitation.
6. Basis of review (how much of the paper did you read)?
I read the full paper, including all the proofs.
7. Summary
- [pros] The idea of extending outlier mining to balance exploration and exploitation is natural. The end-to-end framework that interleaves feature extraction update and posterior model update is sound. I also really appreciate the theoretical analysis involving the implication of high boundary score outliers, offering insights for future works looking deeper into more general cases.
- [cons] The choice of the methodology could be better justified. A deeper and more complete ablation study about the choice of auxiliary outlier set might be needed because auxiliary outliers lie in the crux of this proposed framework (this concern also extends to all outlier exposure related methods). To make the paper sound, some presentations of the experimental results need to be improved to better support the claims of significant performance improvement.
8. Miscellaneous minor issues
- [Line 256] Do you mean $N(0, \sigma^2)$ for the Gaussian distribution instead?
- [Line 363] The training time of POEM is "shorter" instead of "smaller".
- [Line 418] Should it be "FNR() = FPR()"?
- [Equation 9] Did you miss a negative sign in (9)?
- [Line 773] Write "n" in math mode.
- [Line 796] Same as above, should it be "FNR() = FPR()"?
- [Line 805] First draw a scalar g "from" a uniform distribution ...
- [Equation 14] Missing a bracket at the second line of equality.