Reviewer #2 Questions

1. {Summary} Please briefly summarize the main claims/contributions of the paper in your own words. (Please do not include your evaluation of the paper here.)
This paper identifies a major challenge for OoD detection where spurious correlation dominates. This is an important problem because in many circumstances, such as fair machine learning, spurious correlations can degrade performance and lead to unethical predictions. The authors also analyze this problem from a theoretical perspective and prove that it is inherently hard to solve.

================ Response after rebuttal ================
I have read the rebuttal. I will not change my score, but the following is for scientific discussion. The binary-classification theoretical formulation may not be enough to reveal the whole picture. Since this paper considers spurious correlation, the labels on the training and test sets should be the same in order to construct spurious correlations. If the labels differ, the shift should instead be a diversity shift according to [1].

2. {Novelty} How novel are the concepts, problems addressed, or methods introduced in the paper?
Good: The paper makes non-trivial advances over the current state-of-the-art.

3. {Soundness} Is the paper technically sound?
Good: The paper appears to be technically sound, but I have not carefully checked the details.

4. {Impact} How do you rate the likely impact of the paper on the AI research community?
Good: The paper is likely to have high impact within a subfield of AI OR moderate impact across more than one subfield of AI.

5. {Clarity} Is the paper well-organized and clearly written?
Good: The paper is well organized but the presentation could be improved.

6. {Evaluation} If applicable, are the main claims well supported by experiments?
Good: The experimental evaluation is adequate, and the results convincingly support the main claims.

7. {Resources} If applicable, how would you rate the new resources (code, data sets) the paper contributes? (It might help to consult the paper's reproducibility checklist.)
Good: The shared resources are likely to be very useful to other AI researchers.

8. {Reproducibility} Are the results (e.g., theorems, experimental results) in the paper easily reproducible? (It may help to consult the paper's reproducibility checklist.)
Good: key resources (e.g., proofs, code, data) are available and key details (e.g., proofs, experimental setup) are sufficiently well-described for competent researchers to confidently reproduce the main results.

9. {Ethical Considerations} Does the paper adequately address the applicable ethical considerations, e.g., responsible data collection and use (e.g., informed consent, privacy), possible societal harm (e.g., exacerbating injustice or discrimination due to algorithmic bias), etc.?
Good: The paper adequately addresses most, but not all, of the applicable ethical considerations.

10. {Reasons to Accept} Please list the key strengths of the paper (explain and summarize your rationale for your evaluations with respect to questions 1-9 above).
This paper identifies a major and practical challenge for OoD detection. The challenge is caused by spurious correlations present in the data and may be hard to solve.

11. {Reasons to Reject} Please list the key weaknesses of the paper (explain and summarize your rationale for your evaluations with respect to questions 1-9 above).
The mathematical definition of spurious correlation seems to be missing.
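For concreteness, the diversity-shift and correlation-shift quantities from [1] could serve as such definitions. As far as I recall (a rough paraphrase from memory, not a verbatim statement; please check [1] for the exact formulation), for a feature variable z with training density p and test density q they are:

```latex
% Rough paraphrase of the shift measures in [1] (from memory, not verbatim).
% S is the region where the two feature densities do not overlap, T where they do.
\begin{align*}
S &= \{\, z : p(z)\,q(z) = 0 \,\}, \qquad T = \{\, z : p(z)\,q(z) \neq 0 \,\},\\
D_{\mathrm{div}}(p,q) &= \frac{1}{2}\int_{S} \bigl|\, p(z) - q(z) \,\bigr|\, dz,\\
D_{\mathrm{cor}}(p,q) &= \frac{1}{2}\int_{T} \sqrt{p(z)\,q(z)}\,
  \sum_{y \in \mathcal{Y}} \bigl|\, p(y \mid z) - q(y \mid z) \,\bigr|\, dz.
\end{align*}
```

Under this reading, spurious OOD (same label set, shifted environmental features) would roughly correspond to correlation shift, while non-spurious OOD (novel semantic content) would roughly correspond to diversity shift; stating the paper's categories in these terms would make them precise.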
Moreover, the categorization into spurious and non-spurious OOD types is not rigorous. Instead of these subjective categories, using terms such as correlation shift and diversity shift from [1], which come with mathematical definitions, may be more useful.
[1] https://arxiv.org/abs/2106.03721

12. {Questions for the Authors} Please provide questions that you would like the authors to answer during the author feedback period. Please number them.
1. What is the mathematical definition of spurious OoD and non-spurious OoD? Is it the same as correlation shift and diversity shift in [1]?

13. {Detailed Feedback for the Authors} Please provide other detailed, constructive feedback to the authors.
1. Formulate mathematical definitions of spurious and non-spurious OOD.

14. (OVERALL EVALUATION) Please provide your overall evaluation of the paper, carefully weighing the reasons to accept and the reasons to reject the paper. Ideally, we should have:
- No more than 25% of the submitted papers in (Accept + Strong Accept + Very Strong Accept + Award Quality) categories;
- No more than 20% of the submitted papers in (Strong Accept + Very Strong Accept + Award Quality) categories;
- No more than 10% of the submitted papers in (Very Strong Accept + Award Quality) categories;
- No more than 1% of the submitted papers in the Award Quality category.
Accept: Technically solid paper, with high impact on at least one sub-area of AI or moderate to high impact on more than one area of AI, with good to excellent evaluation, resources, reproducibility, and no unaddressed ethical considerations.

Reviewer #3 Questions

1. {Summary} Please briefly summarize the main claims/contributions of the paper in your own words. (Please do not include your evaluation of the paper here.)
The paper addresses the problem of identifying OOD inputs for classification tasks implemented by neural networks. The authors propose different definitions of OOD inputs and provide a theoretical analysis that demonstrates why detecting specific classes of OOD inputs is challenging in practice.

2. {Novelty} How novel are the concepts, problems addressed, or methods introduced in the paper?
Good: The paper makes non-trivial advances over the current state-of-the-art.

3. {Soundness} Is the paper technically sound?
Good: The paper appears to be technically sound, but I have not carefully checked the details.

4. {Impact} How do you rate the likely impact of the paper on the AI research community?
Good: The paper is likely to have high impact within a subfield of AI OR moderate impact across more than one subfield of AI.

5. {Clarity} Is the paper well-organized and clearly written?
Excellent: The paper is well-organized and clearly written.

6. {Evaluation} If applicable, are the main claims well supported by experiments?
Excellent: The experimental evaluation is comprehensive and the results are compelling.

7. {Resources} If applicable, how would you rate the new resources (code, data sets) the paper contributes? (It might help to consult the paper's reproducibility checklist.)
Excellent: The shared resources are likely to have a broad impact on one or more sub-areas of AI.

8. {Reproducibility} Are the results (e.g., theorems, experimental results) in the paper easily reproducible? (It may help to consult the paper's reproducibility checklist.)
Excellent: key resources (e.g., proofs, code, data) are available and key details (e.g., proof sketches, experimental setup) are comprehensively described for competent researchers to confidently and easily reproduce the main results.

9. {Ethical Considerations} Does the paper adequately address the applicable ethical considerations, e.g., responsible data collection and use (e.g., informed consent, privacy), possible societal harm (e.g., exacerbating injustice or discrimination due to algorithmic bias), etc.?
Not Applicable: The paper does not have any ethical considerations to address.

10. {Reasons to Accept} Please list the key strengths of the paper (explain and summarize your rationale for your evaluations with respect to questions 1-9 above).
The paper addresses an interesting problem, and it is well written and clear. An interesting theoretical analysis, as well as the experiments, appears to support the claims made in the paper.

11. {Reasons to Reject} Please list the key weaknesses of the paper (explain and summarize your rationale for your evaluations with respect to questions 1-9 above).
--

12. {Questions for the Authors} Please provide questions that you would like the authors to answer during the author feedback period. Please number them.
--

13. {Detailed Feedback for the Authors} Please provide other detailed, constructive feedback to the authors.
--

14. (OVERALL EVALUATION) Please provide your overall evaluation of the paper, carefully weighing the reasons to accept and the reasons to reject the paper. Ideally, we should have:
- No more than 25% of the submitted papers in (Accept + Strong Accept + Very Strong Accept + Award Quality) categories;
- No more than 20% of the submitted papers in (Strong Accept + Very Strong Accept + Award Quality) categories;
- No more than 10% of the submitted papers in (Very Strong Accept + Award Quality) categories;
- No more than 1% of the submitted papers in the Award Quality category.
Accept: Technically solid paper, with high impact on at least one sub-area of AI or moderate to high impact on more than one area of AI, with good to excellent evaluation, resources, reproducibility, and no unaddressed ethical considerations.

20. I acknowledge that I have read the author's rebuttal and made whatever changes to my review where necessary.
Agreement accepted

Reviewer #4 Questions

1. {Summary} Please briefly summarize the main claims/contributions of the paper in your own words. (Please do not include your evaluation of the paper here.)
The authors highlight a problem with OOD detection methods based on classification models learned using empirical risk minimization (ERM). These models do not distinguish between input features that are well correlated with the output label but spurious and those that are causal (invariant). Invariant features remain reliable predictors of the label in any environment, whereas the predictive power of spurious features may vary depending on the environment. OOD detection methods that leverage the learned model are vulnerable to OOD examples that exhibit spurious features and no invariant features (called "spurious" OOD in the paper), since these examples are easily mistaken for ID. In contrast, "non-spurious" OOD examples, which exhibit neither spurious nor invariant features, are typically correctly detected as OOD. The distinction between these two types of OOD is illustrated in the paper using some simple example datasets to show the effect on maximum softmax probability (MSP), a baseline OOD detection method.
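For readers less familiar with that baseline, MSP scores an input by the largest softmax probability of the classifier and flags low-confidence inputs as OOD. A minimal sketch (my own illustration with a placeholder model and threshold, not the authors' code):

```python
import torch
import torch.nn.functional as F

def msp_score(model: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Maximum softmax probability (MSP) for a batch of inputs x."""
    with torch.no_grad():
        logits = model(x)                    # shape: [batch, num_classes]
        probs = F.softmax(logits, dim=-1)
        return probs.max(dim=-1).values      # higher = more ID-like

def flag_ood(model: torch.nn.Module, x: torch.Tensor,
             threshold: float = 0.5) -> torch.Tensor:
    """Flag inputs whose MSP confidence falls below a chosen threshold."""
    return msp_score(model, x) < threshold
```

The failure mode described above is exactly that a spurious OOD input carrying only the environmental feature can still receive a high MSP confidence and therefore pass as ID.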
Then it is shown that the effect still holds for other OOD detection methods, but that feature-based methods (Mahalanobis/Gram) are less affected by this than output-based methods (MSP/ODIN/energy score). Finally, a theoretical exposition is given to explain why ERM-based learned models are always necessarily vulnerable to this problem.

2. {Novelty} How novel are the concepts, problems addressed, or methods introduced in the paper?
Fair: The paper contributes some new ideas.

3. {Soundness} Is the paper technically sound?
Good: The paper appears to be technically sound, but I have not carefully checked the details.

4. {Impact} How do you rate the likely impact of the paper on the AI research community?
Fair: The paper is likely to have moderate impact within a subfield of AI.

5. {Clarity} Is the paper well-organized and clearly written?
Good: The paper is well organized but the presentation could be improved.

6. {Evaluation} If applicable, are the main claims well supported by experiments?
Fair: The experimental evaluation is weak: important baselines are missing, or the results do not adequately support the main claims.

7. {Resources} If applicable, how would you rate the new resources (code, data sets) the paper contributes? (It might help to consult the paper's reproducibility checklist.)
Fair: The shared resources are likely to be moderately useful to other AI researchers.

8. {Reproducibility} Are the results (e.g., theorems, experimental results) in the paper easily reproducible? (It may help to consult the paper's reproducibility checklist.)
Good: key resources (e.g., proofs, code, data) are available and key details (e.g., proofs, experimental setup) are sufficiently well-described for competent researchers to confidently reproduce the main results.

9. {Ethical Considerations} Does the paper adequately address the applicable ethical considerations, e.g., responsible data collection and use (e.g., informed consent, privacy), possible societal harm (e.g., exacerbating injustice or discrimination due to algorithmic bias), etc.?
Not Applicable: The paper does not have any ethical considerations to address.

10. {Reasons to Accept} Please list the key strengths of the paper (explain and summarize your rationale for your evaluations with respect to questions 1-9 above).
• An important problem of OOD detection is highlighted and illustrated.
• Some interesting observations are made about the problem based on testing.
• A compelling theoretical exposition shows how the problem is inevitable.

11. {Reasons to Reject} Please list the key weaknesses of the paper (explain and summarize your rationale for your evaluations with respect to questions 1-9 above).
The observations (in the "results and insights" subsections of Sections 3 and 4), while interesting, remain only weakly supported by testing (as discussed in 6 above). Furthermore, explanations or hypotheses for these observations are lacking. For example, in Section 3, it is observed that detection performance on both spurious and non-spurious OOD worsens with increasing correlation of inputs with spurious features, but no potential reason for this surprising result is proposed. As another example, while the central observation of the paper, that feature-based OOD methods are less affected by spurious OOD, is analyzed with visualizations and histograms (Figure 3), no explanation for this phenomenon is hypothesized or explored.
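To make the feature-based vs. output-based distinction concrete, here is a rough sketch of the two score families (my own paraphrase of the standard formulations, not the paper's implementation; the feature extractor, class means, and shared precision matrix are assumed to be fitted on ID training data):

```python
import torch

def energy_score(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """Output-based: free energy of the logits; ID inputs tend to have
    lower (more negative) energy, so -energy serves as the confidence."""
    return -T * torch.logsumexp(logits / T, dim=-1)

def mahalanobis_score(features: torch.Tensor,
                      class_means: torch.Tensor,   # [num_classes, d]
                      precision: torch.Tensor      # [d, d] shared inverse covariance
                      ) -> torch.Tensor:
    """Feature-based: Mahalanobis distance of penultimate-layer features
    to the closest class-conditional Gaussian; larger = more OOD-like."""
    diff = features.unsqueeze(1) - class_means.unsqueeze(0)   # [batch, C, d]
    dists = torch.einsum('bcd,de,bce->bc', diff, precision, diff)
    return dists.min(dim=-1).values
```

A sketch like this only locates where each family reads the network (logit space vs. feature space); it does not by itself explain why the latter is more robust to spurious OOD, which is the explanation I find missing in the paper.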
The formal setting, namely the general vulnerability of ERM to spurious features that vary across environments and its relevance to OOD, is not new (e.g., Arjovsky, M.; Bottou, L.; Gulrajani, I.; and Lopez-Paz, D. 2019. Invariant Risk Minimization. arXiv preprint arXiv:1907.02893). The theoretical exposition that spurious OOD is unavoidable may be novel. The concrete illustration of the problem and the demonstration on existing model-based OOD methods are novel, as is the observation that feature-based methods are less affected by this phenomenon. OOD detection is an important issue in AI fields related to learning, but although the paper highlights a particularly relevant problem, it does not shed light on how to fix it. The observation that feature-based OOD methods are less affected is not elaborated far enough to draw conclusions. In addition, this observation is based on a very limited amount of testing, so it is not clear to what extent it can be considered a general finding. The simple tests done in the paper do illustrate the highlighted problem; however, it is difficult to use these results to draw general conclusions or to support the observations made in the corresponding "results and insights" sections. For example, in Section 3, it is observed that detection performance on both spurious and non-spurious OOD worsens with increasing correlation of inputs with spurious features. More testing is warranted before this observation can be taken as a statement of a general property of OOD.

12. {Questions for the Authors} Please provide questions that you would like the authors to answer during the author feedback period. Please number them.
1. Table 1: Which OOD detection method is assumed here? It is not clear.
2. Figure 3a: What visualization method is used here (e.g., t-SNE)?
3. Table 3: What do the red and blue numbers mean?

13. {Detailed Feedback for the Authors} Please provide other detailed, constructive feedback to the authors.
The paper is generally well written and easy to follow. Some details are omitted that require extra work to figure out; see my questions for examples.
Section 3, spelling: "bold male".
Section 4, feature-based vs. output-based OOD detection: the distinction between feature-based and output-based OOD detection is not clearly defined in the paper but can be inferred from later text (however, readers would need prior knowledge of OOD detection methods).

14. (OVERALL EVALUATION) Please provide your overall evaluation of the paper, carefully weighing the reasons to accept and the reasons to reject the paper. Ideally, we should have:
- No more than 25% of the submitted papers in (Accept + Strong Accept + Very Strong Accept + Award Quality) categories;
- No more than 20% of the submitted papers in (Strong Accept + Very Strong Accept + Award Quality) categories;
- No more than 10% of the submitted papers in (Very Strong Accept + Award Quality) categories;
- No more than 1% of the submitted papers in the Award Quality category.
Borderline reject: Technically solid paper where reasons to reject, e.g., lack of novelty, outweigh reasons to accept, e.g., good evaluation. Please use sparingly.

20. I acknowledge that I have read the author's rebuttal and made whatever changes to my review where necessary.
Agreement accepted

Reviewer #5 Questions

1. {Summary} Please briefly summarize the main claims/contributions of the paper in your own words. (Please do not include your evaluation of the paper here.)
The paper tackles the problem of out-of-distribution (OOD) detection from a novel perspective. The authors propose to separate invariant features from environmental ones and introduce the notions of spurious and non-spurious (i.e., conventional) OOD. They demonstrate the impact of spurious correlation on OOD detection.

2. {Novelty} How novel are the concepts, problems addressed, or methods introduced in the paper?
Good: The paper makes non-trivial advances over the current state-of-the-art.

3. {Soundness} Is the paper technically sound?
Good: The paper appears to be technically sound, but I have not carefully checked the details.

4. {Impact} How do you rate the likely impact of the paper on the AI research community?
Good: The paper is likely to have high impact within a subfield of AI OR moderate impact across more than one subfield of AI.

5. {Clarity} Is the paper well-organized and clearly written?
Excellent: The paper is well-organized and clearly written.

6. {Evaluation} If applicable, are the main claims well supported by experiments?
Excellent: The experimental evaluation is comprehensive and the results are compelling.

7. {Resources} If applicable, how would you rate the new resources (code, data sets) the paper contributes? (It might help to consult the paper's reproducibility checklist.)
Excellent: The shared resources are likely to have a broad impact on one or more sub-areas of AI.

8. {Reproducibility} Are the results (e.g., theorems, experimental results) in the paper easily reproducible? (It may help to consult the paper's reproducibility checklist.)
Excellent: key resources (e.g., proofs, code, data) are available and key details (e.g., proof sketches, experimental setup) are comprehensively described for competent researchers to confidently and easily reproduce the main results.

9. {Ethical Considerations} Does the paper adequately address the applicable ethical considerations, e.g., responsible data collection and use (e.g., informed consent, privacy), possible societal harm (e.g., exacerbating injustice or discrimination due to algorithmic bias), etc.?
Not Applicable: The paper does not have any ethical considerations to address.

10. {Reasons to Accept} Please list the key strengths of the paper (explain and summarize your rationale for your evaluations with respect to questions 1-9 above).
The approach is novel and potentially useful for the broader community. The work provides important insights into OOD detection, e.g., that the detection mechanisms should focus on the latent space rather than the output. The authors provide good intuition and formally show how spurious correlations can trick the predictor.

11. {Reasons to Reject} Please list the key weaknesses of the paper (explain and summarize your rationale for your evaluations with respect to questions 1-9 above).
A summary of the results from the supplementary material would be helpful in the main text. Too many important results are relegated to the supplementary material.

12. {Questions for the Authors} Please provide questions that you would like the authors to answer during the author feedback period. Please number them.
1. When the spurious correlation is increased from 0.7 to 0.9, the difference in performance is not as significant as between 0.5 and 0.7. How do the authors explain this, and can they comment on a possible threshold for the spurious correlation at which performance is still acceptable?
2. Are 4 runs enough for averaging the results of these experiments?
Comments after the authors' response: the authors have fully answered the questions.
13. {Detailed Feedback for the Authors} Please provide other detailed, constructive feedback to the authors.
- Even though it is understandable what M_inv and M_e are, they are not defined in the text.

14. (OVERALL EVALUATION) Please provide your overall evaluation of the paper, carefully weighing the reasons to accept and the reasons to reject the paper. Ideally, we should have:
- No more than 25% of the submitted papers in (Accept + Strong Accept + Very Strong Accept + Award Quality) categories;
- No more than 20% of the submitted papers in (Strong Accept + Very Strong Accept + Award Quality) categories;
- No more than 10% of the submitted papers in (Very Strong Accept + Award Quality) categories;
- No more than 1% of the submitted papers in the Award Quality category.
Accept: Technically solid paper, with high impact on at least one sub-area of AI or moderate to high impact on more than one area of AI, with good to excellent evaluation, resources, reproducibility, and no unaddressed ethical considerations.

20. I acknowledge that I have read the author's rebuttal and made whatever changes to my review where necessary.
Agreement accepted