Causal Inference: Identification Under Strong Ignorability

Author

Hyunseung Kang

Published

April 10, 2024

Abstract

In many settings, complete randomization of the treatment is infeasible. In this lecture, we discuss a popular way to identify causal effects based on strong ignorability, also referred to as conditional exchangeability or selection on observables. Simply put, strong ignorability states that conditional on some pre-treatment covariates, treatment is randomized to individuals. Under strong ignorability, we show that the following causal quantities are identified: the average treatment effect and the optimal static, single-time treatment policy.

Review: Causal Identification Under Complete Randomized Experiment

Suppose we are interested in identifying the average treatment effect (ATE), \(\mathbb{E}[Y(1)-Y(0)]\), where \(Y(a)\) is the counterfactual outcome under treatment \(a\). Under an ideal, complete randomized experiment, the following assumptions are satisfied:

  • (A1,SUTVA): \(Y = A Y(1) + (1-A) Y(0)\)
  • (A2, Randomization of \(A\)): \(A \perp X, Y(1), Y(0)\)
  • (A3, Positivity/Overlap): \(0< P(A=1) < 1\)

Assumptions (A2) and (A3) can also be interpreted as consequences of missing completely at random (MCAR) in the missing data literature. Specifically, consider the following data table where the outcome \(Y\) is lung function (high is good), \(A\) is smoking status (i.e. \(A = 1\) is daily smoker, \(A=0\) is never smoker), and \(X\) is age.

|       | \(Y(1)\) | \(Y(0)\) | \(Y\) | \(A\) | \(X_{\rm Age}\) |
|-------|----------|----------|-------|-------|-----------------|
| John  | NA       | 0.9      | 0.9   | 0     | 38              |
| Sally | 0.8      | NA       | 0.8   | 1     | 30              |
| Kate  | NA       | 0.6      | 0.6   | 0     | 23              |
| Jason | 0.6      | NA       | 0.6   | 1     | 26              |

To identify the column mean of \(Y(1)\) (i.e. \(\mathbb{E}[Y(1)]\)) in the presence of missing entries, one approach would be to take the mean of the non-missing values (i.e. \(\mathbb{E}[Y | A=1]\)). This approach is valid if the NAs were missing completely at random. More formally, we have \(\mathbb{E}[Y(1)] = \mathbb{E}[Y | A=1]\) if the “missingness” indicator \(A\) is independent of \(Y(1)\) (i.e. \(A \perp Y(1)\)).
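To make this concrete, here is a minimal sketch (Python with pandas; the data frame simply re-encodes the toy table above) of estimating \(\mathbb{E}[Y(1)]\) by the mean of the non-missing values:

```python
# A minimal sketch of identifying E[Y(1)] under MCAR: take the mean of Y
# among the rows where Y(1) is observed (i.e. A = 1). Data from the toy table.
import pandas as pd

df = pd.DataFrame({
    "name": ["John", "Sally", "Kate", "Jason"],
    "Y":    [0.9, 0.8, 0.6, 0.6],
    "A":    [0, 1, 0, 1],
    "age":  [38, 30, 23, 26],
})

# Mean of Y among the treated: (0.8 + 0.6) / 2 = 0.7, which identifies
# E[Y(1)] when the missingness indicator A is independent of Y(1).
mean_Y1 = df.loc[df["A"] == 1, "Y"].mean()
print(mean_Y1)  # 0.7 (up to floating point)
```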

Motivation Behind Strong Ignorability

Summary of Stratified/Blocked Randomized Experiment

In most randomized experiments, treatment is not randomized completely at random. Often, treatment is randomized within pre-defined blocks of individuals, where the blocks are defined by covariates \(X\). This type of randomized experiment is broadly known as a stratified or blocked experiment. Some examples include

  • From the above table, suppose we divide individuals based on whether they are less than 30 years old or at least 30 years old; in other words, the two blocks are defined by age. Treatment is randomly assigned within each block.
  • Consider a hypothetical randomized experiment to assess the causal effect of a new drug for reducing blood pressure. Suppose we recruited identical twins and we randomly assign treatment within each twin pair. Here, each twin pair defines a block.

The treatment probabilities can differ across blocks; some blocks may have a higher probability of receiving treatment than others. But, within each block, the treatment is assigned randomly.
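As a quick illustration, here is a small simulation sketch of a blocked design (the block-specific treatment probabilities 0.7 and 0.3 are made-up values):

```python
# A simulation sketch of a stratified/blocked design: the treatment
# probability differs across the two age blocks, but within each block
# treatment is assigned completely at random.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
under_30 = rng.integers(0, 2, size=n)        # block indicator X
p_treat = np.where(under_30 == 1, 0.7, 0.3)  # block-specific probabilities (assumed values)
A = rng.binomial(1, p_treat)                 # randomization within each block

# The empirical treatment rates recover the block-specific probabilities.
print(A[under_30 == 1].mean())  # roughly 0.7
print(A[under_30 == 0].mean())  # roughly 0.3
```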

We can formalize the assumptions of a stratified randomized experiment as follows:

  • (A1, SUTVA): \(Y = A Y(1) + (1-A) Y(0)\)
  • (A2c, Conditional randomization of \(A\)): \(A \perp Y(1), Y(0) | X\)
  • (A3c, Positivity/Overlap): \(0 < P(A=1 | X=x) < 1\) for all \(x\)

The “c” in the assumptions reflects that they are conditional versions of the original assumptions (A2) and (A3). Specifically, assumption (A2c) states that \(A\) is now conditionally independent of \(Y(1), Y(0)\) given \(X\). Assumption (A3c) states that the treatment probability may depend on \(X\) and, for every value of \(x\), must be strictly between \(0\) and \(1\).

For example, if \(X\) is only age:

  • (A2c) states that conditional on the age group, treatment \(A\) is randomly assigned to individuals within that age group.
  • (A3c) states that conditional on the age group, each person has a non-zero probability of receiving treatment or control. Note that the treatment probability may be different across age groups, i.e. \(P(A=1 | X_{\rm under 30} = 1)\) may not be equal to \(P(A = 1 | X_{\rm under 30} = 0)\) where \(X_{\rm under 30} = I(X_{\rm age} < 30)\).

Other remarks about the assumptions:

  • The function \(P(A=1 | X=x)\) is so famous that it’s called the propensity score (Rosenbaum and Rubin (1983)). It gets its name because it quantifies the propensity for an individual with covariates \(X\) to take the treatment. We’ll discuss the propensity score in greater detail later.
  • Assumptions (A2c) and (A3c) are jointly known as strong ignorability (Rosenbaum and Rubin (1983)).

Connection to Complete Randomized Experiments

Intuitively, if the treatment is randomized completely at random to everyone, then the treatment is also randomized within any pre-defined subgroups based on covariates. This section formally proves the connection between a complete randomized experiment and a stratified randomized experiment.

First, we show that assumption (A2) implies assumption (A2c). To do this, it’s useful to prove the following assertion about three random variables \(A\), \(B\), and \(C\): \[ A \perp B,C \Rightarrow A\perp B | C \] Note that \(A\perp B,C\) is equivalent to \(\mathbb{P}(A | B,C) = \mathbb{P}(A)\). Also, \(A \perp B,C\) implies \(A \perp C\), which is equivalent to \(\mathbb{P}(A | C) = \mathbb{P}(A)\). Putting them together, we have \(\mathbb{P}(A|B,C) = \mathbb{P}(A | C)\), or equivalently, \(A \perp B | C\).

We can then set \(A = A\), \(B= \{Y(1),Y(0)\}\), and \(C = X\) and apply the assertion to arrive at \[\begin{align*} {\rm (A2)}: A\perp Y(1), Y(0), X \Rightarrow {\rm (A2c)}: A \perp Y(1), Y(0) \mid X \end{align*}\]

Second, we show that assumptions (A3) and (A2) imply assumption (A3c). From (A2), we have \(\mathbb{P}(A =1) = \mathbb{P}(A=1|X=x)\) for all \(x\). Then directly applying (A3) gives us \(0< \mathbb{P}(A=1) = \mathbb{P}(A=1 |X=x) < 1\) for all \(x\).

Note that without (A2), it’s not necessarily the case that (A3) implies (A3c). This is because by the law of total probability, we have \[\begin{align*} \mathbb{P}(A=1) &= \int_{x}\mathbb{P}(A=1 | X=x)f(x)dx && \text{Law of total probability} \end{align*}\] where \(f(x)\) is the density of \(X\). Even if (A3), i.e. \(0 < \mathbb{P}(A =1) < 1\), holds, it’s possible that for some values of \(x\), we have \(\mathbb{P}(A=1 | X=x) = 0\) or \(\mathbb{P}(A=1| X=x)=1\). In the extreme case with a discrete \(X\), (A3) still allows \(\mathbb{P}(A=1 | X=x)=1\) for a single value of \(x\) with \(\mathbb{P}(X=x) \in (0,1)\) and \(\mathbb{P}(A=1 | X\neq x)=0\), thereby violating (A3c) at every value of \(x\).
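Here is a discrete numeric sketch of this failure mode (the stratum probabilities are made-up values):

```python
# (A3) can hold while (A3c) fails: with two strata of X,
# P(A=1 | X=0) = 0 and P(A=1 | X=1) = 1 violate (A3c) in every stratum,
# yet the marginal P(A=1) = 0*0.5 + 1*0.5 = 0.5 satisfies 0 < P(A=1) < 1.
p_x = {0: 0.5, 1: 0.5}          # distribution of X (assumed values)
p_a_given_x = {0: 0.0, 1: 1.0}  # treatment probabilities violating (A3c)
p_a = sum(p_a_given_x[x] * p_x[x] for x in p_x)
print(p_a)  # 0.5, so (A3) holds even though (A3c) fails everywhere
```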

Connection to Missing Data

Similar to the case under a complete randomized experiment, the identification assumptions (A2c) and (A3c) have a connection to the missing at random (MAR) assumption in the missing data literature. To illustrate, let’s consider again the table above, except we partition the rows based on whether someone is less than 30 years old or not.

|       | \(Y(1)\) | \(Y(0)\) | \(Y\) | \(A\) | \(X_{\rm under\ 30}\) |
|-------|----------|----------|-------|-------|------------------------|
| John  | NA       | 0.9      | 0.9   | 0     | 0                      |
| Sally | 0.8      | NA       | 0.8   | 1     | 0                      |
| Kate  | NA       | 0.6      | 0.6   | 0     | 1                      |
| Jason | 0.6      | NA       | 0.6   | 1     | 1                      |

Assumption (A2c) states that within the rows of the table where \(X\)s are identical (i.e. conditional on \(X\)), the missingness indicator \(A\) is completely independent of the columns \(Y(1), Y(0)\). Assumption (A3c) states that within the rows of the table where \(X\)s are identical, some values of \(Y(1)\) (or \(Y(0)\)) are observed and this holds for every value of \(X\).

Identification Under Strong Ignorability

Identification of the ATE

Under a stratified randomized experiment, it’s relatively straightforward to identify the ATE among a subgroup defined by \(X\), i.e. \(\tau(x) = \mathbb{E}[Y(1)-Y(0) |X=x]\); this is known as the conditional average treatment effect (CATE). Intuitively, identification is achieved by considering a smaller table where we only have individuals with \(X =x\). Then, similar to a completely randomized experiment, we can identify \(\mathbb{E}[Y(1)|X=x]\) by taking the average of the observed \(Y(1)\). More formally, for any \(x\), we have

\[\begin{align*} \mathbb{E}[Y | A=1,X=x] &= \mathbb{E}[AY(1) + (1-A) Y(0) | A=1,X=x] && \text{(A1)} \\ &= \mathbb{E}[Y(1) | A=1,X=x] && \text{Property of conditional expectation} \\ &= \mathbb{E}[Y(1) | X=x] && \text{(A2c)} \end{align*}\] Assumption (A3c) ensures that the conditioning event \(\{A=1,X=x\}\) is well-defined.

Then, using the law of total expectation, we can also identify the unconditional mean \(\mathbb{E}[Y(1)]\) as follows \[\begin{align*} \mathbb{E}[Y(1)] &= \mathbb{E}[\mathbb{E}[Y(1) | X]] && \text{Law of total expectation} \\ &= \mathbb{E}[\mathbb{E}[Y | A=1,X]] && \text{Argument from above} \end{align*}\] Repeating the above argument identifies \(\mathbb{E}[Y(0)] = \mathbb{E}[\mathbb{E}[Y | A=0,X]]\) and thus, the ATE is identified via

\[ \mathbb{E}[Y(1) - Y(0)] = \mathbb{E}[\mathbb{E}[Y | A=1,X]] - \mathbb{E}[\mathbb{E}[Y | A=0,X]] \]

Throughout the discussion above, we assumed that \(f(x)>0\) for all \(x\), where \(f(x)\) is the density of the random variable \(X\). Positivity of the density function \(f\) ensures that the conditional expectations that condition on \(X\) are well-defined. Without this assumption, there may be some \(x\) values for which \(f(x)=0\) and thus \(\mathbb{E}[Y(1) | X=x]\) is not well-defined. Or, more crudely stated, you cannot take the average of \(Y(1)\) among a subgroup of individuals with \(X=x\) that is never observed.

Similarly, we stated above that assumption (A3c) ensures that \(\mathbb{E}[Y | A=1,X=x]\) is well-defined; here, we provide the rationale. By the definition of conditional probability, we have \(\mathbb{P}(A=1|X=x) = \mathbb{P}(A=1,X=x) / \mathbb{P}(X=x)\). If \(f(x) > 0\) and \(\mathbb{P}(A=1|X=x) > 0\), we must have \(\mathbb{P}(A =1,X=x) > 0\) and thus, the conditioning event \(\{A=1,X=x\}\) has a non-zero probability of occurring.
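Before moving on, here is a minimal plug-in sketch of the identification formula \(\mathbb{E}[\mathbb{E}[Y \mid A=1,X]] - \mathbb{E}[\mathbb{E}[Y \mid A=0,X]]\) for a discrete \(X\), where the conditional means are estimated by within-stratum sample means (the simulated treatment and outcome models are assumed for illustration):

```python
# A plug-in sketch of E[Y(1) - Y(0)] = E[E[Y | A=1, X]] - E[E[Y | A=0, X]]
# with a binary X. Strong ignorability holds by construction: A depends
# only on X, and the true ATE is 1.0.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 5000
X = rng.integers(0, 2, size=n)
A = rng.binomial(1, np.where(X == 1, 0.7, 0.3))  # treatment randomized within strata
Y = 1.0 * A + 0.5 * X + rng.normal(size=n)

df = pd.DataFrame({"X": X, "A": A, "Y": Y})
mu = df.groupby(["A", "X"])["Y"].mean()          # estimates of E[Y | A=a, X=x]
p_x = df["X"].value_counts(normalize=True)       # estimates of P(X=x)
ate = sum((mu.loc[(1, x)] - mu.loc[(0, x)]) * p_x[x] for x in p_x.index)
print(ate)  # close to the true ATE of 1.0
```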

Identification of the Average Treatment Effect Among the Treated (ATT)

In addition to the ATE, there is another popular causal estimand called the average treatment effect among the treated (ATT), or formally \[ ATT = \mathbb{E}[Y(1) - Y(0) \mid A = 1] \] In words, ATT is the average treatment effect among individuals who received treatment. Or, in the context of the data table where there are four individuals, i.e.:

|       | \(Y(1)\) | \(Y(0)\) | \(Y\) | \(A\) | \(X_{\rm Age}\) |
|-------|----------|----------|-------|-------|-----------------|
| John  | NA       | 0.9      | 0.9   | 0     | 38              |
| Sally | 0.8      | NA       | 0.8   | 1     | 30              |
| Kate  | NA       | 0.6      | 0.6   | 0     | 23              |
| Jason | 0.6      | NA       | 0.6   | 1     | 26              |

the ATT represents the average of the differences \(Y(1)-Y(0)\) among Sally and Jason, both of whom were treated. Note that the ATT is different from the ATE, which averages the differences \(Y(1)-Y(0)\) over both treated and untreated individuals.

When the treatment \(A\) is completely randomized (i.e. assumption (A2) where \(A \perp Y(1),Y(0), X\)), the ATT equals the ATE. A bit more formally, assumption (A2) implies that \(A \perp Y(1) -Y(0)\) and thus

\[ \mathbb{E}[Y(1) - Y(0) ] = \mathbb{E}[Y(1) - Y(0) \mid A=1] \]

A feature of the ATT is that it can be identified under a weaker version of strong ignorability, i.e. \[\begin{align*} {\rm (A2c.0):} \quad{} A \perp Y(0) \mid X \end{align*}\] Comparing assumptions (A2c) and (A2c.0), we do not need the conditional independence between \(A\) and \(Y(1)\). Or, in the context of missing data, we only need the missingness indicator \(A\) to be independent of the column \(Y(0)\), not necessarily of the column \(Y(1)\).

But, in my experience, when investigators debate whether such assumptions are plausible in an observational study, the practical difference between (A2c) and (A2c.0) is minor.

To prove identification, the term \(\mathbb{E}[Y(1) \mid A=1]\) in the ATT can be identified with just assumption (A1):

\[\begin{align*} \mathbb{E}[Y(1) \mid A =1] &= \mathbb{E}[\mathbb{E}[Y(1) \mid A =1,X] \mid A=1] && \text{Law of total expectation} \\ &= \mathbb{E}[\mathbb{E}[Y \mid A =1,X] \mid A=1] && \text{(A1)} \end{align*}\] In contrast, the term \(\mathbb{E}[Y(0) \mid A=1]\) in the ATT can be identified with (A1), (A2c.0), and (A3c):

\[\begin{align*} \mathbb{E}[Y(0) \mid A =1] &= \mathbb{E}[\mathbb{E}[Y(0) \mid A =1,X]\mid A=1] && \text{Law of total expectation} \\ &= \mathbb{E}[\mathbb{E}[Y(0) \mid A = 0,X]\mid A=1] && \text{(A2c.0) and (A3c)} \\ &= \mathbb{E}[\mathbb{E}[Y \mid A = 0,X] \mid A=1] && \text{(A1)} \end{align*}\]

Hence, under (A1), (A2c.0), and (A3c), we can identify the ATT via \[ \mathbb{E}[Y(1) - Y(0) \mid A = 1] = \mathbb{E}[\mathbb{E}[Y \mid A =1,X] \mid A=1] - \mathbb{E}[\mathbb{E}[Y \mid A = 0,X]\mid A= 1] \]
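Here is the analogous plug-in sketch for the ATT, continuing with the simulated df from the ATE sketch above; the only change is that the outer average uses the covariate distribution of the treated:

```python
# A plug-in sketch of the ATT identification formula: the inner conditional
# means are the same as for the ATE, but the outer average uses P(X=x | A=1).
mu = df.groupby(["A", "X"])["Y"].mean()                           # E[Y | A=a, X=x]
p_x_trt = df.loc[df["A"] == 1, "X"].value_counts(normalize=True)  # P(X=x | A=1)
att = sum((mu.loc[(1, x)] - mu.loc[(0, x)]) * p_x_trt[x] for x in p_x_trt.index)
print(att)  # close to 1.0 here because the simulated treatment effect is constant
```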

Identification of Other Measures of Causal Effects: Causal Relative Risk (CRR) and Causal Odds Ratio (COR)

Other causal estimands can be identified under assumptions (A1), (A2c), and (A3c). Here, we list some popular causal estimands when the outcome is binary and prove that they are identified.

Under a binary outcome, some popular causal estimands are the causal relative risk (CRR) and the causal odds ratio (COR): \[\begin{align*} {\rm CRR} &= \frac{\mathbb{E}[Y(1)]}{\mathbb{E}[Y(0)]} = \frac{\mathbb{P}(Y(1) = 1)}{\mathbb{P}(Y(0) = 1)} \\ {\rm COR} &= \frac{\frac{\mathbb{P}(Y(1) = 1)}{1 - \mathbb{P}(Y(1)= 1)}}{ \frac{\mathbb{P}(Y(0) = 1)}{1 - \mathbb{P}(Y(0) = 1)}} \end{align*}\] Even though odds ratios are very popular, there are some issues with defining causal odds ratios (or, more generally, odds ratios), and I would recommend using CRRs instead of CORs unless the scientific question is expressed in terms of odds ratios. Note that the original ATE \(\mathbb{E}[Y(1) - Y(0)]\), or a linear contrast of the outcomes, also works for binary outcomes.

Identification of the CRR and the COR proceeds by identifying \(\mathbb{E}[Y(a)]\), the quantities that make up the CRR and the COR. In particular, we have \[\begin{align*} \mathbb{E}[Y(a)] &= \mathbb{E}[\mathbb{E}[Y(a) \mid X]] && \text{Law of total expectation}\\ &= \mathbb{E}[\mathbb{E}[Y(a) \mid A = a, X]] && \text{(A2c) and (A3c)} \\ &= \mathbb{E}[\mathbb{E}[Y \mid A = a, X]] && \text{(A1)} \end{align*}\] We need (A3c) to ensure that the conditioning event \(\{A=a,X\}\) is well-defined. Then, under (A1), (A2c), and (A3c), the CRR and the COR are identified as \[\begin{align*} {\rm CRR} &= \frac{\mathbb{E}[\mathbb{E}[Y \mid A = 1, X]] }{\mathbb{E}[\mathbb{E}[Y \mid A = 0, X]] } \\ {\rm COR} &= \frac{ \frac{\mathbb{E}[\mathbb{E}[Y \mid A = 1, X]]}{1-\mathbb{E}[\mathbb{E}[Y \mid A = 1, X]]} }{ \frac{\mathbb{E}[\mathbb{E}[Y \mid A = 0, X]] }{1-\mathbb{E}[\mathbb{E}[Y \mid A = 0, X]]}} \end{align*}\]
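As a sketch, the same plug-in strategy applies with a binary outcome (the outcome model below is an assumed toy model): estimate \(\mathbb{E}[Y(a)]\) by \(\mathbb{E}[\mathbb{E}[Y \mid A=a,X]]\) and form the ratios.

```python
# A plug-in sketch of the CRR and COR with a binary outcome and binary X.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 20000
X = rng.integers(0, 2, size=n)
A = rng.binomial(1, np.where(X == 1, 0.7, 0.3))
Y = rng.binomial(1, 0.2 + 0.3 * A + 0.1 * X)  # assumed toy model: P(Y(1)=1)=0.55, P(Y(0)=1)=0.25

df = pd.DataFrame({"X": X, "A": A, "Y": Y})
mu = df.groupby(["A", "X"])["Y"].mean()
p_x = df["X"].value_counts(normalize=True)
EY1 = sum(mu.loc[(1, x)] * p_x[x] for x in p_x.index)  # estimate of E[Y(1)]
EY0 = sum(mu.loc[(0, x)] * p_x[x] for x in p_x.index)  # estimate of E[Y(0)]

crr = EY1 / EY0                              # true value: 0.55 / 0.25 = 2.2
cor = (EY1 / (1 - EY1)) / (EY0 / (1 - EY0))  # true value: about 3.67
print(crr, cor)
```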

Greenland, Robins, and Pearl (1999) and Hernán, Clayton, and Keiding (2011) discuss how the causal odds ratio can have paradoxical behaviors due to the non-collapsibility of the odds ratio. They also discuss how the causal odds ratio can be challenging to estimate with data. These discussions are often connected to Simpson’s paradox.

Identification of Single, Static, Optimal Treatment Regime/Policy (OTR)

In personalized medicine, the goal is to operationalize how doctors assign treatment to patients by developing an optimal treatment assignment policy, where each patient receives the treatment that maximizes his or her expected outcome. A bit more formally, consider a policy function \(\pi : \mathcal{X} \to \{0,1\}\) which assigns either treatment (i.e. \(1\)) or control (i.e. \(0\)) based on the individual’s characteristics \(X \in \mathcal{X}\). The goal is to find the best \(\pi\), denoted as \(\pi_{\rm OTR}\), that maximizes the expected counterfactual outcome under that policy:

\[ \pi_{\rm OTR} = \underset{\pi : \mathcal{X} \to \{0,1\} }{\rm argmax } \ \mathbb{E}[Y(\pi(X))] \] The term \(Y(\pi(X))\) is the counterfactual outcome if treatment is assigned based on the policy \(\pi\) and can be written as \[ Y(\pi(X))= Y(1) I(\pi(X) = 1) + Y(0) I(\pi(X) = 0) \] The term \(\mathbb{E}[Y(\pi(X))]\), which takes an average of the counterfactual outcome under policy \(\pi\), is frequently referred to as the value of \(\pi\). For example,

  • The value of a policy that always assigns treatment, i.e. \(\pi(X) = 1\), is \(\mathbb{E}[Y(\pi(X))] = \mathbb{E}[Y(1)]\)
  • The value of a policy that always assigns control, i.e. \(\pi(X) = 0\), is \(\mathbb{E}[Y(\pi(X))] = \mathbb{E}[Y(0)]\)

Given any policy \(\pi\), we can identify its value under assumptions (A1), (A2c), and (A3c): \[\begin{align*} \mathbb{E}[Y(\pi(X))] &= \mathbb{E}[Y(1) I(\pi(X) = 1) + Y(0) I(\pi(X) = 0)] && \text{Definition} \\ &= \mathbb{E}[ \mathbb{E}[Y(1) I(\pi(X) = 1) + Y(0) I(\pi(X) = 0) \mid X]] && \text{Law of total expectation} \\ &= \mathbb{E}[I(\pi(X) = 1)\mathbb{E}[Y(1) \mid X] + I(\pi(X) =0) \mathbb{E}[Y(0) \mid X]] && \text{Property of conditional expectations} \\ &= \mathbb{E}[I(\pi(X) = 1)\mathbb{E}[Y \mid A=1,X] + I(\pi(X) =0) \mathbb{E}[Y \mid A=0, X]] && \text{(A1), (A2c), and (A3c); see the identification of the ATE above} \end{align*}\] Once we have identified the value of any policy, identifying the optimal policy \(\pi_{\rm OTR}\) requires no additional identifying assumptions. Specifically, let \(\mu_a(x) = \mathbb{E}[Y \mid A=a,X=x]\). Then, we can rewrite \(\pi_{\rm OTR}\) as

\[\begin{align*} \pi_{\rm OTR} &= \underset{\pi }{\rm argmax } \ \mathbb{E}[Y(\pi(X))] \\ &= \underset{\pi }{\rm argmax } \ \mathbb{E}[I(\pi(X) = 1) \mu_1(X) + I(\pi(X) =0) \mu_0(X) ] && \text{From the equality above} \\ &=\underset{\pi }{\rm argmax } \ \mathbb{E}[\pi(X) \mu_1(X) + (1-\pi(X)) \mu_0(X) ] && \text{Because the output of $\pi(X)$ is either $1$ or $0$} \\ &= \underset{\pi }{\rm argmax } \ \mathbb{E}[\pi(X) (\mu_1(X) - \mu_0(X))] && \text{Dropped $\mathbb{E}[\mu_0(X)]$ since it's a constant} \end{align*}\] and the maximizer is \(\pi_{\rm OTR}(X) = I(\mu_1(X) - \mu_0(X) \geq 0)\); see the explanation below. In words, the optimal treatment policy for a person with characteristic \(X\) is to check whether the expected outcome among those with characteristic \(X\) is larger under treatment (i.e. \(\mu_1(X)\)) or under control (i.e. \(\mu_0(X)\)). If \(\mu_1(X) \geq \mu_0(X)\), the optimal policy treats the person; if \(\mu_1(X) < \mu_0(X)\), the optimal policy assigns control.

The claim about the maximizer can be a bit tricky to understand if you’ve never seen it before. To illustrate, let \(\Delta(X) = \mu_1(X) - \mu_0(X)\). Then, the objective in the last line becomes

\[\begin{align*} &\mathbb{E}[\pi(X) (\mu_1(X) - \mu_0(X))] \\ =& \mathbb{E}[\pi(X) \Delta(X) \{I(\Delta(X) \geq 0) + I(\Delta(X) < 0) \} ] && \text{Using the fact that $1 = I(\Delta(X) \geq 0) + I(\Delta(X) < 0)$ } \\ =& \underbrace{\mathbb{E}[\pi(X) \Delta(X)I(\Delta(X) \geq 0)]}_{\text{non-negative}} +\underbrace{\mathbb{E}[\pi(X) \Delta(X) I(\Delta(X) < 0) ]}_{\text{non-positive}} \end{align*}\] To find the \(\pi\) that maximizes the above expression, we need to

  • Set \(\pi(X) =0\) whenever \(\Delta(X) < 0\); this turns the second term, which is non-positive, into zero.
  • Set \(\pi(X) = 1\) whenever \(\Delta(X) \geq 0\) in order to maximize the first term, which is non-negative. In particular, if \(\pi(X) = 0\) when \(\Delta(X) >0\), the first term is not maximized.

Combining these two observations, we arrive at \(\pi_{\rm OTR}(X) = I(\Delta(X) \geq 0)\).
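To make the rule concrete, here is a sketch of the plug-in policy \(\hat\pi_{\rm OTR}(x) = I(\hat\mu_1(x) - \hat\mu_0(x) \geq 0)\), where the conditional means are estimated by within-stratum sample means (the simulated sign-flipping outcome model is assumed for illustration):

```python
# A sketch of the plug-in optimal treatment rule pi_OTR(x) = I(mu_1(x) - mu_0(x) >= 0).
# The assumed toy model gives a treatment effect of +1 when X = 0 and -1 when X = 1,
# so the optimal rule treats only the X = 0 group.
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 10000
X = rng.integers(0, 2, size=n)
A = rng.binomial(1, 0.5, size=n)              # completely randomized for simplicity
Y = A * (1.0 - 2.0 * X) + rng.normal(size=n)

df = pd.DataFrame({"X": X, "A": A, "Y": Y})
mu = df.groupby(["A", "X"])["Y"].mean()       # estimates of mu_a(x)
delta = {x: mu.loc[(1, x)] - mu.loc[(0, x)] for x in (0, 1)}
pi_otr = {x: int(delta[x] >= 0) for x in (0, 1)}
print(delta)   # roughly {0: 1.0, 1: -1.0}
print(pi_otr)  # {0: 1, 1: 0}: treat only the X = 0 group
```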

For more information on this topic, see NC State’s course on Dynamic Treatment Regimes.

Observational Studies and Strong Ignorability

A large number of works in causal inference have sought to frame the study of causal effects in observational studies as a version of a stratified randomized experiment where assumptions (A1), (A2c), and (A3c) are satisfied. Specifically, these works assume that given the measured pre-treatment covariates \(X\), the treatment \(A\) can be considered “as-if” random, akin to a stratified randomized experiment. Some other interpretations of this approach to studying observational studies are:

  • We measured all the confounders in the observational study (i.e. \(X\)) and these variables satisfy (A2c) and (A3c) above.
  • There are no unmeasured confounders, denoted as \(U\), that can influence the propensity for someone to be treated (or receive control). A bit more formally, we do not have the case where \[A \perp Y(1), Y(0) | X,U \quad{} \text{ but } \quad{} A \not\perp Y(1), Y(0) | X\]
  • The self-selection into treatment (or control) does not depend on anything except \(X\).
  • If assumptions (A2c) and (A3c) hold in an observational study, we must adjust/control for \(X\) in order to identify the average treatment effect.

In my opinion, these are strong assumptions for an observational study; in later lectures, we’ll discuss identification when strong ignorability does not hold.

Also, one of the key differences between a stratified randomized experiment (or, in general, any randomized experiment) and an observational study is that in a randomized experiment, the propensity score \(e(X)\) is known to the investigator. In contrast, in an observational study, \(e(X)\) is not known since individuals’ selection into treatment is not controlled by the investigator. We’ll discuss the implications of this in later lectures on estimation.

For more discussions about examining observational studies from the lens of a randomized experiment, see Cochran (1965), Rubin (2007), and a very recent, nice article by Small (2024). One major takeaway from these readings is that investigators should blind themselves to the outcome, akin to a randomized experiment where the investigator does not see the outcome at the same time as the treatment variable \(A\) or the covariates \(X\).

Central Role of the Propensity Score \(\mathbb{P}(A=1 | X)\)

Rosenbaum and Rubin (1983) showed that the propensity score \(e(X) = \mathbb{P}(A=1 | X) \in (0,1)\) plays a critical role in identification and estimation of causal effects. Here, we highlight the two most important properties of the propensity score.

Consider any function \(b(X)\) of the covariates. This function \(b\) is called a balancing score if conditional on \(b(X)\), the treatment is independent of \(X\), i.e.  \[ A \perp X | b(X) \] A couple of remarks:

  • A trivial function \(b\) that satisfies this condition is the identity function \(b(X) = X\).
  • Theorem 1 of Rosenbaum and Rubin (1983) shows that the propensity score \(e(X)\) is a balancing score.

Rosenbaum and Rubin (1983) proved two very interesting results about the propensity score.

Theorem 2 of Rosenbaum and Rubin (1983): \(b(X)\) is a balancing score if and only if \(b(X)\) is finer than the propensity score \(e(X)\), i.e. if there exists a function \(g\) where \(e(X) = g(b(X))\).

Some remarks:

  • The propensity score contains the “smallest” amount of information needed to achieve \(A \perp X | b(X)\); that is, the propensity score is the coarsest balancing score.
  • To intuitively check this, consider setting \(b(X) =X\). This is not only a balancing score, but it also provides much more information (i.e. finer information) than the propensity score \(e\), which is a number between \(0\) and \(1\). Also, in the notation of Theorem 2, we can take \(g = e\) so that \(e(X) = g(b(X))\).

Also notice that in a completely randomized trial where (A2) and (A3) held, we had \(A \perp X\) and covariates were balanced. But, under the conditional versions (A2c) and (A3c), we now have \(A \perp X \mid e(X)\) or covariates are balanced conditional on the propensity score \(e(X)\).
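Here is a simulation sketch of this balancing property (the logistic treatment model is assumed for illustration): marginally, the covariates differ sharply by treatment arm, but within strata of the estimated propensity score they are approximately balanced.

```python
# A sketch of the balancing property A ⊥ X | e(X): compare covariate means
# by treatment arm, marginally and within strata of the estimated propensity score.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 20000
X = rng.normal(size=(n, 2))
e = 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))  # assumed true propensity score
A = rng.binomial(1, e)

e_hat = LogisticRegression().fit(X, A).predict_proba(X)[:, 1]
df = pd.DataFrame({"A": A, "X1": X[:, 0], "X2": X[:, 1],
                   "stratum": pd.qcut(e_hat, 5, labels=False)})

print(df.groupby("A")["X1"].mean())               # imbalanced marginally
print(df.groupby(["stratum", "A"])["X1"].mean())  # approximately balanced within strata
```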

Theorem 3 of Rosenbaum and Rubin (1983): Let \(e(X) = \mathbb{P}(A=1 | X)\). Then, \[ {\rm (A1)} + {\rm (A2c)} + {\rm (A3c)} \Rightarrow A \perp Y(1), Y(0) \mid e(X) \text{ and } 0 < \mathbb{P}(A=1 | e(X)) < 1 \]

Technically, Rosenbaum and Rubin (1983) proved this for all balancing scores, not just the propensity score. But the implication for the propensity score is the most useful for understanding the ideas. Specifically:

  • If (A1),(A2c), and (A3c) hold for the entire \(X\), then these assumptions also hold for a scalar summary of \(X\) in the form of the propensity score \(e(X)\). In other words, the propensity score is the “best” dimension reducing score of \(X\) for the purpose of causal inference.
  • If we look at the proof of identification for the ATE, we can identify the ATE via \[ \mathbb{E}[Y(1) - Y(0)] = \mathbb{E}[\mathbb{E}[Y \mid A=1,e(X)]] - \mathbb{E}[\mathbb{E}[Y \mid A=0,e(X)]] \] The proof follows directly from the proof of the identification of the ATE where we replace \(X\) with \(e(X)\). A version of this equality is in Theorem 4 of Rosenbaum and Rubin (1983); see the sketch after this list.
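As a final sketch (continuing the simulated example above, with an assumed outcome model whose true ATE is 1.0), we can estimate the ATE by conditioning on the estimated propensity score alone, here by coarsening \(\hat e(X)\) into quintile strata:

```python
# A sketch of ATE identification via the propensity score: average the
# within-stratum treated-vs-control differences, weighting by stratum size.
df["Y"] = 1.0 * df["A"] + df["X1"] + 0.5 * df["X2"] + rng.normal(size=n)  # true ATE is 1.0

means = df.groupby(["stratum", "A"])["Y"].mean().unstack()   # rows: strata; columns: A
w = df["stratum"].value_counts(normalize=True).sort_index()  # stratum proportions
print(((means[1] - means[0]) * w).sum())  # close to 1.0, up to coarse-stratification bias
```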

References

Cochran, William G. 1965. “The Planning of Observational Studies of Human Populations.” Journal of the Royal Statistical Society. Series A (General) 128 (2): 234–66.
Greenland, Sander, James M Robins, and Judea Pearl. 1999. “Confounding and Collapsibility in Causal Inference.” Statistical Science 14 (1): 29–46.
Hernán, Miguel A, David Clayton, and Niels Keiding. 2011. “The Simpson’s Paradox Unraveled.” International Journal of Epidemiology 40 (3): 780–85.
Rosenbaum, Paul, and Donald Rubin. 1983. “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika 70 (1): 41–55.
Rubin, Donald B. 2007. “The Design Versus the Analysis of Observational Studies for Causal Effects: Parallels with the Design of Randomized Trials.” Statistics in Medicine 26 (1): 20–36.
Small, Dylan S. 2024. “Protocols for Observational Studies: Methods and Open Problems.” arXiv Preprint arXiv:2403.19807.