September 22, 2014

Meng (2014)

  • Jeremy Wu (2012) motivated Meng's questions about multi-source inference
  • How could we use a non-random sample?
  • “Is an 80% non-random sample ‘better’ than a 5% random sample in measurable terms? 90%? 95%? 99%?”
  • Meng (2014) defines multi-source inference as that arising in "situations where we need to draw inference by using data coming from different sources and some (but not all) of which were not collected for inference purposes."

Meng (2014), §45.4.1: Large absolute size or large relative size?

  • Two sources
  • Administrative record (\(f_a\) fraction of population)
  • Simple random sample (SRS) (\(f_s\) fraction of population)
  • \(f_a \gg f_s\)
  • How to combine maximal information from both sources?
  • Wu (2012) asks how the relative information varies with the ratio \(f_a/f_s\)

Finite population study: comparing MSE of two estimators

  • \(\lbrace x_1, ..., x_N\rbrace\) is our population of size \(N\)
    • \(R_i\) indicator for presence of \(x_i\) in administrative record
    • \(I_i\) indicator for presence of \(x_i\) in SRS
    • \(n_a = \sum_{i=1}^N R_i \gg n_s = \sum_{i=1}^N I_i\)
  • \(\bar{x}_a = \frac{\sum_{i=1}^Nx_iR_i}{n_a}\)
  • \(\bar{x}_s = \frac{\sum_{i=1}^Nx_iI_i}{n_s}\)

Probit model

  • \(R_i = 1_{Z_i \le \alpha + \beta x_i}\)
  • \(Z_i \sim N(0,1)\text{ iid}\) (latent refusal tendency of \(i^{th}\) individual)
  • \(\beta\): strength of the self-selection mechanism (\(\beta \neq 0\) means non-ignorable missing data); a simulation sketch follows this list
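
A minimal simulation sketch of this setup (not from Meng 2014; \(N\), \(n_s\), \(\alpha\), and \(\beta\) below are illustrative assumptions): generate a finite population, apply the probit selection rule, draw an SRS, and compare the two sample means.

```python
# Sketch: probit self-selection vs. SRS on one simulated finite population.
# All parameter values are illustrative, not taken from Meng (2014).
import numpy as np

rng = np.random.default_rng(0)
N, n_s = 100_000, 500                        # population size, SRS size
alpha, beta = 0.5, 1.0                       # beta != 0: non-ignorable selection

x = rng.normal(size=N)                       # finite population values
R = rng.normal(size=N) <= alpha + beta * x   # R_i = 1{Z_i <= alpha + beta*x_i}
srs = rng.choice(N, size=n_s, replace=False) # SRS (as a set of indices)

xbar_N, xbar_a, xbar_s = x.mean(), x[R].mean(), x[srs].mean()
print(f"f_a = {R.mean():.3f}")
print(f"xbar_a - xbar_N = {xbar_a - xbar_N:+.4f}  (systematic bias)")
print(f"xbar_s - xbar_N = {xbar_s - xbar_N:+.4f}  (sampling noise only)")
```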

MSE for \(\bar{x}_s\)

  • \(\bar{x}_s\) is unbiased for \(\bar{x}_N\)
  • \(Var(\bar{x}_s) = \frac{1-f_s}{n_s}S_N^2(x)\) (checked by simulation after this list)
  • \(S_N^2(x) = \frac{1}{N-1}\sum_{i=1}^N(x_i - \bar{x}_N)^2\)
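
A quick Monte Carlo check of this variance formula (a sketch; the simulated population and the sizes are arbitrary choices):

```python
# Sketch: check Var(xbar_s) = (1 - f_s)/n_s * S_N^2(x) over repeated SRS draws.
import numpy as np

rng = np.random.default_rng(1)
N, n_s = 10_000, 400                          # illustrative sizes
x = rng.normal(size=N)                        # arbitrary finite population

theory = (1 - n_s / N) / n_s * x.var(ddof=1)  # (1 - f_s)/n_s * S_N^2(x)
sim = np.var([x[rng.choice(N, n_s, replace=False)].mean()
              for _ in range(5_000)])
print(f"theory = {theory:.3e}, simulated = {sim:.3e}")
```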

MSE for \(\bar{x}_a\)

  • biased, because \(R_i\) depends on \(x_i\) (self-selection)
  • Assume (1) \(N\) is very large and (2) \(f_a = \frac{n_a}{N}\) is not near 0
  • Then the squared bias dominates: \(\text{MSE}(\bar{x}_a) \approx \text{bias}^2(\bar{x}_a)\), since the variance of \(\bar{x}_a\) shrinks as \(n_a\) grows
  • Large \(N\) and large \(n_a\) allow simplification:

\[\text{MSE}(\bar{x}_a) \approx \left\{ \frac{\sum_{i=1}^N(x_i - \bar{x}_N)\,p(x_i)}{\sum_{i=1}^N p(x_i)} \right\}^2\]

  • \(p(x_i) = \mathbb{E}(R_i|x_i) = \Phi(\alpha + \beta x_i)\); evaluated numerically in the sketch below
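
A sketch evaluating this large-\(N\) approximation directly, with an illustrative simulated population standing in for \(\lbrace x_1, ..., x_N\rbrace\):

```python
# Sketch: plug p(x_i) = Phi(alpha + beta*x_i) into the weighted-deviation
# approximation for the bias of xbar_a. Parameters are illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
N, alpha, beta = 1_000_000, 0.5, 1.0
x = rng.normal(size=N)

p = norm.cdf(alpha + beta * x)                 # p(x_i) = E(R_i | x_i)
bias = np.sum((x - x.mean()) * p) / np.sum(p)  # approximate bias of xbar_a
print(f"MSE(xbar_a) ~ bias^2 = {bias**2:.5f}")
```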

Assume the finite population is an SRS from a superpopulation \(X \sim N(\mu,\sigma^2)\)

  • By the law of large numbers,

\[ \text{bias}(\bar{x}_a) = \frac{\text{cov}(X, p(X))}{\mathbb{E}(p(X))} = \frac{\sigma\, \mathbb{E}(Z\Phi(\tilde{\alpha}+ \tilde{\beta}Z))}{\mathbb{E}(\Phi(\tilde{\alpha}+ \tilde{\beta}Z))} = \frac{\sigma \tilde{\beta}}{\sqrt{1+\tilde{\beta}^2}}\,\text{MR}\!\left(\frac{\tilde{\alpha}}{\sqrt{1+\tilde{\beta}^2}}\right)\]

  • \(\tilde{\alpha} = \alpha + \beta\mu\) and \(\tilde{\beta} = \beta\sigma\), from writing \(X = \mu + \sigma Z\) with \(Z \sim N(0,1)\)

  • \(MR(a) = \frac{\phi(a)}{\Phi(a)}\); the closed form is checked numerically in the sketch below
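
A sketch checking this closed form against a direct Monte Carlo evaluation of \(\text{cov}(X, p(X))/\mathbb{E}(p(X))\); the parameter values are again illustrative:

```python
# Sketch: closed-form bias, sigma*bt/sqrt(1+bt^2) * MR(at/sqrt(1+bt^2)),
# vs. Monte Carlo cov(X, p(X)) / E[p(X)]. Parameters are illustrative.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
mu, sigma, alpha, beta = 0.0, 1.0, 0.5, 1.0
at, bt = alpha + beta * mu, beta * sigma     # alpha-tilde, beta-tilde

s = np.sqrt(1 + bt**2)
closed = sigma * bt / s * norm.pdf(at / s) / norm.cdf(at / s)

X = rng.normal(mu, sigma, 2_000_000)         # superpopulation draws
p = norm.cdf(alpha + beta * X)               # p(X) = Phi(alpha + beta*X)
mc = np.mean((X - X.mean()) * p) / p.mean()  # cov(X, p(X)) / E[p(X)]

print(f"closed form = {closed:.5f}, Monte Carlo = {mc:.5f}")
```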

Comparing MSEs

\[\frac{\text{MSE}(\bar{x}_a)}{\sigma^2} \approx \frac{\text{bias}^2(\bar{x}_a)}{\sigma^2} = \frac{\phi^2(z_{f_a})}{f_a^2} \frac{\tilde{\beta}^2}{1+\tilde{\beta}^2} = \frac{\tilde{\beta}^2}{1+\tilde{\beta}^2}\frac{e^{-z^2_{f_a}}}{2\pi f_a^2}\]

  • \(z_{f_a}\) is the \(f_a\) quantile of \(N(0,1)\) (i.e., \(\Phi(z_{f_a}) = f_a\)), since \(\mathbb{E}(p(X)) = f_a\) implies \(\frac{\tilde{\alpha}}{\sqrt{1+\tilde{\beta}^2}} = z_{f_a}\)

\[\frac{\text{MSE}(\bar{x}_s)}{\sigma^2} = \frac{1}{n_s} - \frac{1}{N}\approx \frac{1}{n_s}\]
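
These two formulas make the opening question (“Is an 80% non-random sample ‘better’ than a 5% random sample?”) quantitative. A sketch, with \(N\) and \(\tilde{\beta}\) as illustrative assumptions:

```python
# Sketch: MSE(xbar_a)/sigma^2 for several f_a, vs. MSE(xbar_s)/sigma^2 for a
# 5% SRS. N and beta-tilde are illustrative assumptions.
from scipy.stats import norm

N, bt = 1_000_000, 1.0
for f_a in (0.80, 0.90, 0.95, 0.99):
    z = norm.ppf(f_a)                        # z_{f_a}: Phi(z_{f_a}) = f_a
    mse_a = (bt**2 / (1 + bt**2)) * norm.pdf(z)**2 / f_a**2
    print(f"f_a = {f_a:.2f}: MSE(xbar_a)/sigma^2 ~ {mse_a:.2e}")

n_s = int(0.05 * N)                          # a 5% SRS
print(f"f_s = 0.05: MSE(xbar_s)/sigma^2 ~ {1 / n_s:.2e}")
```

With these (assumed) values, even the 99% non-random sample loses to the 5% SRS, previewing the point below that non-sampling error only vanishes as \(f_a \to 1\).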

Sampling vs. non-sampling errors

  • Non-sampling error becomes arbitrarily small only when \(f_a \to 1\); a large \(n_a\) alone is insufficient
  • Sampling error can be made arbitrarily small merely by making \(n_s\) large, even if \(f_s \to 0\) as \(N \to \infty\)

  • Big data may be “big” because of their size relative to the (finite) population, rather than their absolute size

Sufficient condition for \(\text{MSE}(\bar{x}_a)< \text{MSE}(\bar{x}_s)\)

\[f_a > \frac{n_s\rho^2_N(x,p)}{1+n_s\rho^2_N(x,p)}\]

  • \(\rho_N(x,p)\): finite-population correlation between \(x\) and \(p(x)\)
  • estimate \(\rho_N(x,p)\) from the SRS; a numeric illustration of the threshold follows this list
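
A numeric illustration of the threshold (a sketch; \(n_s\) and the \(\rho\) values are arbitrary):

```python
# Sketch: the f_a coverage threshold above which MSE(xbar_a) < MSE(xbar_s).
n_s = 500                                    # illustrative SRS size
for rho in (0.01, 0.05, 0.10, 0.50):         # candidate rho_N(x, p) values
    thresh = n_s * rho**2 / (1 + n_s * rho**2)
    print(f"rho_N(x, p) = {rho:.2f}: need f_a > {thresh:.3f}")
```

Even modest correlation is demanding: with \(n_s = 500\) and \(\rho_N(x,p) = 0.1\), the administrative record must cover more than 83% of the population.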

Meng (2014): Question 1

  • Given partial knowledge of the recording/response mechanism for a (large) biased sample, what is the optimal way to create an intentionally biased sub-sampling scheme to counter-balance the original bias so the resulting sub-sample is guaranteed to be less biased than the original biased sample in terms of the sample mean, or other estimators, or predictive power?

Meng (2014): Question 2

  • What should be the key considerations when combining small random samples with large non-random samples, and what are the sensible “corner-cutting” guidelines when facing resource constraints? How can the combined data help to estimate \(\rho_N(x, p)\)? In what ways can such estimators aid multi-source inference?

Meng (2014): Question 3

  • What are theoretically sound and practically useful defect indices for prediction, hypothesis testing, model checking, clustering, classification, etc., as counterparts to the defect index for estimation, \(\rho_N(x,p)\)? What are their roles in determining information bounds for multi-source inference? What are the relevant information measures for multi-source inference?

Extra question

  • How can we develop the study of multi-source inference in the personalized (genomic) medicine context?
  • Administrative record: electronic medical records (EMR)?
  • SRS: ?
  • How close to an SRS is the sample in, say, a GWAS or a TCGA study?
  • Personalized medicine's goals center on prediction

References

Meng, Xiao-Li. 2014. “A Trio of Inference Problems That Could Win You a Nobel Prize in Statistics (If You Help Fund It).” In Past, Present, and Future of Statistical Science. CRC Press.

Wu, Jeremy. 2012. “21st Century Statistical Systems.” http://jeremyswu.blogspot.com/2012/08/abstract-combination-of-traditional.html.