Causal Inference: Identification of Causal Effects (Missing Data & Randomized Experiment)

Author

Hyunseung Kang

Published

January 28, 2025

Concepts Covered Today

  • Three causal identifying assumptions: (a) SUTVA/consistency, (b) strong ignorability, and (c) positivity/overlap.
  • Motivation of assumptions based on missing data (MCAR, MAR)
  • Motivation of assumptions based on a randomized experiment
  • Covariate balance
  • References
    • Chapter 2 of Hernán and Robins (2020)

Review from Last Week

We are interested in the causal effect of a treatment (versus no treatment/control) on an outcome.

We used the counterfactual/potential outcomes to define causal effects.

  • \(Y(1)\): the counterfactual outcome if, contrary to fact, the study unit was treated.
  • \(Y(0)\): the counterfactual outcome if, contrary to fact, the study unit was not treated (i.e., control)
|       | \(Y(1)\) | \(Y(0)\) |
|-------|----------|----------|
| John  | 0.54     | 0.94     |
| Sally | 0.91     | 0.91     |
| Kate  | 0.81     | 0.60     |
| Jason | 0.60     | 0.84     |

From the fundamental problem of causal inference, we generally cannot observe both \(Y(1),Y(0)\).

Also, the counterfactual outcomes differ from the observed outcomes \(Y\) in that the observed outcome is a realization for a particular value of the treatment assignment \(A\).

|       | \(Y\) | \(A\) |
|-------|-------|-------|
| John  | 0.94  | 0     |
| Sally | 0.91  | 0     |
| Kate  | 0.81  | 1     |
| Jason | 0.60  | 1     |

How do we learn about \(Y(1), Y(0)\) from the observable data \(Y,A\)?

Our First Assumption for Causal Identification: SUTVA/Causal Consistency

First, let’s make the following assumption known as stable unit treatment value assumption (SUTVA) (Rubin (1980)) or causal consistency (page 4 of Hernán and Robins (2020)):

\[Y = AY(1) + (1-A) Y(0).\]

It’s also common to rewrite the above assumption as

\[Y = Y(A), \quad{} \text{ or if } A=a, \text{ then } Y = Y(a).\]

The latter version covers the case when the treatment \(A\) is not binary (e.g., discrete, continuous).

In words, SUTVA states the observed outcome \(Y\) is one realization of the counterfactual outcomes \(Y(a)\) based on the observed value of the treatment \(A\).
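As a quick numerical illustration, the SUTVA equation can be checked against the counterfactual and observed-data tables above (a minimal sketch in R; the numbers are taken from the John/Sally/Kate/Jason tables):

```r
# SUTVA: the observed outcome is the counterfactual selected by A
Y1 <- c(0.54, 0.91, 0.81, 0.60)  # Y(1) for John, Sally, Kate, Jason
Y0 <- c(0.94, 0.91, 0.60, 0.84)  # Y(0) for John, Sally, Kate, Jason
A  <- c(0, 0, 1, 1)              # observed treatment assignments
Y  <- A * Y1 + (1 - A) * Y0      # SUTVA: Y = A*Y(1) + (1-A)*Y(0)
Y                                # 0.94 0.91 0.81 0.60, the observed outcomes
```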

More subtly, SUTVA implies two “mini” assumptions.

  • There are no multiple versions of treatment.
  • There is no interference, a term coined by Cox (1958).

No Multiple Versions of Treatment

It’s useful to understand this assumption by studying the case when the assumption is violated.

Let’s go back to the smoking example from the first lecture where we defined the causal effect of daily smoking (i.e., treatment) versus never smoking (i.e., control) on lung function.

Daily smoking may include different types of daily smokers.

  1. Is a daily smoker a person who smokes at least one cigarette per day?
  2. Is a daily smoker a person who smokes at least one pack of cigarettes per day?
  3. Is a daily smoker a person who smokes during all time in their lives, including during pregnancy?
  4. Is a daily smoker a person who vapes every day?

We can define counterfactual outcomes for all types of daily smokers:

  • \(Y(1)\): counterfactual outcome under definition 1 of a daily smoker
  • \(Y(2)\): counterfactual outcome under definition 2 of a daily smoker
  • \(Y(k)\): counterfactual outcome under definition \(k\) of a daily smoker

By assuming SUTVA, we eliminate these variations in the counterfactuals. Formally,

\[Y(1) = Y(2) = \ldots =Y(k).\]

In words, SUTVA implies that the lung function of a daily smoker who smokes at least one cigarette per day (i.e., \(Y(1)\)) is equal to the lung function of the same daily smoker living in the same environment except that he/she smokes at least one pack of cigarettes per day (i.e., \(Y(2)\)).

  • The no-multiple-versions-of-treatment assumption does not imply that the counterfactual outcome under control, \(Y(0)\), equals the counterfactual outcome under treatment, i.e., it does not imply \(Y(1) = Y(0)\).
  • But, the assumption implies that there are no multiple versions of control.
    • In the data example, the counterfactual outcomes of different types of “never-smokers” are identical.
    • For example, a never-smoker could be someone who has never smoked since birth or someone who smokes only “rarely.”
    • Formally, \(Y(0) = Y(0') = Y(0^{''})\) where \(0, 0', 0^{''}\) represent different types of never-smokers.

There are settings where no multiple versions of treatment is plausible. For example,

  • The causal effect of taking the Wegovy/semaglutide weight-loss drug (i.e., treatment). Taking this drug is a well-defined intervention. Similarly, not taking the drug (i.e., control) is also a well-defined intervention.
  • In general, randomized controlled trials (RCTs) usually have unambiguous definitions of treatment and control.
  • The causal effect of increasing graduate students’ stipends in Fall 2024 (i.e., the treatment). The increase is a well-defined policy (e.g., a 5% increase).

Broadly speaking, SUTVA forces you to define meaningful \(Y(a)\); see the first lecture.

  • Some authors restrict counterfactual outcomes to be based on well-defined interventions, or “no causation without manipulation.”
  • See Holland (1986), Hernán and Taubman (2008), Cole and Frangakis (2009), VanderWeele (2009), and the first lecture notes on the “causal effect” of race.

No Interference

It is useful to understand this assumption with a counterexample.

Suppose we want to study the causal effect of getting the varicella vaccine (i.e., chickenpox vaccine) on getting the chickenpox. Let’s define the following counterfactual outcomes:

  • \(Y(1)\): John’s counterfactual chickenpox status if John gets vaccinated.
  • \(Y(0)\): John’s counterfactual chickenpox status if John doesn’t get vaccinated.
    • If \(Y(0)=0\), John did not get the chickenpox in the universe where he’s not vaccinated against it.
    • If \(Y(0)=1\), John got the chickenpox in the universe where he’s not vaccinated against it.

If the chickenpox vaccine is 100% effective for everyone, it’s likely that \(Y(1) = 0\).

Now, suppose John has a sibling Sally and let’s consider John’s counterfactual universe where he is not vaccinated and Sally’s vaccination status varies.

  1. John’s counterfactual chickenpox status when John is not vaccinated, but Sally is vaccinated.
  2. John’s counterfactual chickenpox status when John and Sally are both unvaccinated.

We can redefine the counterfactual outcomes to incorporate Sally’s vaccination status.

  1. \(Y(0,1)\): John’s counterfactual chickenpox status where the first \(0\) refers to John’s vaccination status (i.e., not vaccinated) and the second \(1\) refers to Sally’s vaccination status (i.e., vaccinated)
  2. \(Y(0,0)\): John’s counterfactual chickenpox status where the first \(0\) refers to John’s vaccination status (i.e., not vaccinated) and the second \(0\) refers to Sally’s vaccination status (i.e., not vaccinated)

SUTVA implies that John’s counterfactual outcome only depends on John’s vaccination status, not Sally’s vaccination status. Formally,

\[Y(0,1) = Y(0,0) = Y(0)\]

  • From our understanding of chickenpox and how contagious it is, no interference is an implausible assumption.
  • For example, if Sally is vaccinated, John will be less likely to get the chickenpox compared to when Sally is not vaccinated.
    • We can express this as \(Y(0,1) \leq Y(0,0)\)
    • Remember, the study unit’s outcome is \(1\) if the unit gets the chickenpox and \(0\) otherwise.
  • In general, no interference is unlikely to hold in vaccine studies and studies of peer/neighborhood/carryover effects.
    • Rosenbaum (2007) has a nice set of examples of when the no interference assumption is implausible.
    • There is a lot of ongoing work on this topic (e.g., Li and Wager (2022), Sävje, Aronow, and Hudgens (2021)).

There are settings where the no interference assumption is plausible.

  • The causal effect of taking Lipitor/atorvastatin cholesterol drug (i.e., treatment) on total cholesterol levels (i.e., outcome).
    • John’s cholesterol level will unlikely be affected by whether Sally takes the drug or not.
  • The causal effect of enrolling in a job training program (i.e., treatment) on employment.
    • John’s employment status will unlikely be affected by whether Sally enrolls in the training program or not.

We can write a more general version of the no interference assumption.

  • Formally, let \(i=1,\ldots,n\) index \(n\) study units.
  • Let \(Y_i(a_i,a_{-i}) \in \mathbb{R}\) denote the counterfactual outcome of unit \(i\) if unit \(i\) gets treatment status \(a_i \in \mathbb{R}\) and unit \(i\)’s peers get treatment status \(a_{-i} \in \mathbb{R}^{n-1}\).
  • No interference for all units states that for every \(i=1,\ldots,n\),

\[ Y_i(a_i,a_{-i}) = Y_i(a_i,a_{-i}') = Y_i(a_i)\]

for all \(a_i \in \mathbb{R}\), \(a_{-i}, a_{-i}' \in \mathbb{R}^{n-1}\).

Motivating the Other Assumptions for Causal Identification: A Missing Data Perspective

Once we assume SUTVA (i.e. \(Y= AY(1) + (1-A)Y(0)\)), the other assumptions for causal identification can be motivated by a connection to a missing data problem.

|       | \(Y(1)\) | \(Y(0)\) | \(Y\) | \(A\) |
|-------|----------|----------|-------|-------|
| John  | NA       | 0.94     | 0.94  | 0     |
| Sally | NA       | 0.91     | 0.91  | 0     |
| Kate  | 0.81     | NA       | 0.81  | 1     |
| Jason | 0.60     | NA       | 0.60  | 1     |

Under SUTVA, we only see one of the two counterfactual outcomes based on \(A\).

  • \(A\) serves as the “missingness” indicator where \(A=1\) implies \(Y(1)\) is observed and \(A=0\) implies \(Y(0)\) is observed.
  • \(Y\) is the “observed” value.

Assumption on Missingness Pattern

Suppose we are interested in the causal estimand \(\mathbb{E}[Y(1)]\) (i.e. the mean of the first column).

One approach to study it is to take the average of the “complete cases” (i.e., Kate and Jason’s \(Y(1)\)s).

  • Formally, we would identify \(\mathbb{E}[Y(1)]\) with \(\mathbb{E}[Y | A=1]\), the population mean of the observed outcome \(Y\) among \(A=1\).
  • This approach is valid if the entries of the first column are missing completely at random (MCAR).
    • For each row, whether \(Y(1)\) is missing is determined by the indicator \(A\), whose value is the result of an independent, identically distributed coin flip.
    • Someone essentially had a blindfold on and randomly erased some values of \(Y(1)\); the entries of \(Y(1)\) are missing completely by chance.

See here for an introduction to missing data.

Here, we illustrate MCAR through a small simulation study.

Consider a data table where the first column consists of \(Y(1)\) and the second column consists of \(A\), the missingness indicator where \(A=1\) indicates that \(Y(1)\) is not missing and \(A=0\) indicates that \(Y(1)\) is missing. Each row of the table represents an individual.

set.seed(1)
library(knitr)
n = 100; mu = 10
Y1 = rnorm(n,mu); 

A = as.numeric(runif(n) < 0.5)
dat = data.frame(Y1 = Y1,A=A);
dat[dat$A==0,"Y1"] = NA
knitr::kable(dat[1:10,],format="pipe",digits=2,align="l",
             caption = "First 10 rows of the data",
             col.names=c("Y(1)","A"))
First 10 rows of the data
| Y(1)  | A |
|-------|---|
| 9.37  | 1 |
| 10.18 | 1 |
| NA    | 0 |
| 11.60 | 1 |
| 10.33 | 1 |
| NA    | 0 |
| NA    | 0 |
| 10.74 | 1 |
| 10.58 | 1 |
| NA    | 0 |

MCAR states that there are no patterns in how the NAs appear in the table; the entries of \(Y(1)\) are missing completely by chance. Then, intuitively, taking the mean of \(Y(1)\) among the observed values should be a good estimate of the mean of all \(Y(1)\).

mean(dat[dat$A == 1,"Y1"]) # mean among non-missing values
[1] 10.10171
mean(Y1) #mean among all $Y_i(1)$ (missing and non-missing)
[1] 10.10889

The two means should be similar to each other, confirming our intuition.

Now, suppose the missingness is not random. For example, in the simulation below, all values of \(Y(1)\) greater than 10 are observed and all values of \(Y(1)\) less than 10 are missing. In this case, \(A\), the variable that indicates whether \(Y(1)\) is missing or not, depends on the value of \(Y(1)\) and \(A \not\perp Y(1)\).

A = as.numeric(Y1 > 10) # Y(1) is observed only if it exceeds 10
dat = data.frame(Y1 = Y1,A=A);
dat[dat$A==0,"Y1"] = NA
knitr::kable(dat[1:10,],format="pipe",digits=2,align="l",
             caption = "First 10 rows of the data",
             col.names=c("Y(1)","A"))
First 10 rows of the data
| Y(1)  | A |
|-------|---|
| NA    | 0 |
| 10.18 | 1 |
| NA    | 0 |
| 11.60 | 1 |
| 10.33 | 1 |
| NA    | 0 |
| 10.49 | 1 |
| 10.74 | 1 |
| 10.58 | 1 |
| NA    | 0 |

It’s useful to draw the distribution of the observed \(Y(1)\) versus the distribution of the entire (missing and non-missing) \(Y(1)\).

dY1 = density(Y1) #all of Y1
dY1_obs = density(dat$Y1,na.rm=TRUE) #observed Y1

plot(dY1,xlim=c(min(Y1),max(Y1)),ylim=c(0,max(dY1_obs$y)),
     main="Estimated densities",xlab="Y(1)")
lines(dY1_obs,col=2)
legend("topright", legend = c("All of Y(1)", "Observed Y(1)"),
       lty=1,col=1:2)

From the plot, it’s clear that the mean among the observed \(Y(1)\) will be a poor estimate of the mean of the entire \(Y(1)\) distribution; the mean of the observed \(Y(1)\) will be noticeably higher than the mean of the entire \(Y(1)\).

mean(dat[dat$A == 1,"Y1"]) # mean among non-missing values
[1] 10.76445
mean(Y1) #mean among all $Y_i(1)$ (missing and non-missing)
[1] 10.10889

Formal Statement of MCAR

Formally, MCAR can be stated as \[A \perp Y(1) \text{ and } 0 < \mathbb{P}(A=1)\]

  • \(A \perp Y(1)\) states that missingness is independent of \(Y(1)\)
    1. Missingness occurs completely at random in the rows of the first column, say by a flip of a random coin.
    2. Missingness doesn’t occur more frequently for lower values of \(Y(1)\); this would violate \(A \perp Y(1)\).
    3. Used in the context of causal inference, this assumption is sometimes referred to as (complete) exchangeability, ignorability, or complete randomization.
  • \(0 < \mathbb{P}(A=1)\) states that you have a non-zero probability of observing some entries of the column \(Y(1)\)
    1. If \(\mathbb{P}(A=1) =0\), then all entries of the column \(Y(1)\) are missing and we can’t learn anything about its column mean.
    2. Used in the context of causal inference, this assumption is sometimes referred to as positivity or overlap.

A natural question is whether you can assess, using only the observed data \(A,Y\), whether \(A \perp Y(1)\) holds. Without assumptions, this is impossible as \(Y(1)\) and \(Y\) are not linked. Even if they are linked via SUTVA, you cannot check \(A \perp Y(1)\) by simply checking \(A \perp Y\); see the next callout block.

More generally, assumptions involving counterfactuals like \(A \perp Y(1)\) must be verified by the study design or the process in which the data was generated. For example, in addition to obtaining the data \(A,Y\), if someone told us that \(A\) was generated from an independent, random flip of a coin, then we know \(A \perp Y(1)\). To put it differently, it is impossible to check whether \(A\) came from an independent, random flip of a coin based on the numerical values of \(A,Y\) alone as the values themselves do not tell you information about how \(A\) was generated.

Note that \(A \perp Y(1)\) (i.e. missingness indicator \(A\) for the \(Y(1)\) column is completely random) is not equivalent to \(A \perp Y\) (i.e. the missingness indicator \(A\) is not associated with \(Y\)), with or without SUTVA.

Without SUTVA, \(Y\) and \(Y(1)\) can be two completely different variables and thus, the two independence assumptions are generally not equivalent to each other. In other words, \(A \perp Y(1)\) makes an assumption about the counterfactual outcome whereas \(A \perp Y\) makes an assumption about the observed outcome.

With SUTVA, \(Y = AY(1) + (1-A)Y(0)\). But \(A \perp Y(1)\) still does not imply that \(Y\), which is a mixture of \(Y(1)\) and \(Y(0)\), is independent of \(A\). The easiest way to see this is by going back to the data table with \(Y(1), Y(0), Y, A\).

  • \(A \perp Y(1)\) states that there is a lack of relationship between the column of \(Y(1)\) and the column of \(A\).
  • \(A \perp Y\) states that there is a lack of relationship between the column \(Y\), which is a mix of \(Y(1)\) and \(Y(0)\) under SUTVA, and the column of \(A\).

One special case where \(A \perp Y(1)\) implies \(A \perp Y\) is when \(Y(1) = Y(0)\) for every unit, i.e., the treatment has no causal effect on anyone.

  • By SUTVA, \[Y = AY(1) + (1-A)Y(0) = AY(1) + (1-A)Y(1) = Y(1)\]
  • By the equality above, \(A \perp Y(1)\) implies \(A \perp Y\).
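A small simulation (a sketch; the parameter values are hypothetical) makes the distinction concrete: even when \(A\) is completely randomized, so that \(A \perp Y(1), Y(0)\), the observed \(Y\) is associated with \(A\) whenever the treatment has an effect.

```r
set.seed(1)
n  <- 10000
Y0 <- rnorm(n)               # counterfactual under control
Y1 <- Y0 + 2                 # counterfactual under treatment; constant effect of 2
A  <- rbinom(n, 1, 0.5)      # completely randomized: A independent of Y(0), Y(1)
Y  <- A * Y1 + (1 - A) * Y0  # SUTVA
cor(A, Y1)  # near 0: A and Y(1) are independent
cor(A, Y)   # far from 0: Y depends on A through the treatment effect
```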

Formal Proof of Causal Identification of \(\mathbb{E}[Y(1)]\)

Suppose SUTVA and MCAR hold:

  • (A1, SUTVA): \(Y = A Y(1) + (1-A) Y(0)\)
  • (A2, Complete randomization): \(A \perp Y(1)\)
  • (A3, Positivity): \(0 < \mathbb{P}(A=1)\)

Then, we can identify the causal estimand \(\mathbb{E}[Y(1)]\) by writing it as the following function of the observed data \(\mathbb{E}[Y | A=1]\): \[\begin{align*} \mathbb{E}[Y \mid A=1] &= \mathbb{E}[AY(1) + (1-A)Y(0) \mid A=1] && \text{(A1)} \\ &= \mathbb{E}[Y(1) \mid A=1] && \text{Algebra} \\ &= \mathbb{E}[Y(1)] && \text{(A2)} \end{align*}\] (A3) is used to ensure that \(\mathbb{E}[Y | A=1]\) is a well-defined quantity.

Technically speaking, to establish \(\mathbb{E}[Y | A=1] = \mathbb{E}[Y(1)]\), we only need mean independence:

\[\mathbb{E}[Y(1) | A=1] = \mathbb{E}[Y(1)] \]

Note that \(A \perp Y(1)\) implies mean independence above, but the converse is not true.
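To see that mean independence is strictly weaker, here is a small sketch (with hypothetical distributions) where \(\mathbb{E}[Y(1) \mid A]\) does not depend on \(A\), yet \(Y(1)\) and \(A\) are dependent through the spread of \(Y(1)\):

```r
set.seed(1)
n  <- 100000
A  <- rbinom(n, 1, 0.5)
Y1 <- rnorm(n, mean = 0, sd = ifelse(A == 1, 1, 3))  # spread depends on A
c(mean(Y1[A == 1]), mean(Y1[A == 0]))  # both near 0: mean independence holds
c(sd(Y1[A == 1]), sd(Y1[A == 0]))      # about 1 vs 3: Y(1) not independent of A
```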

Causal Identification of the ATE

In a similar vein, to identify the ATE \(\mathbb{E}[Y(1)-Y(0)]\), a natural approach would be to use \(\mathbb{E}[Y | A=1] - \mathbb{E}[Y | A=0]\).

This approach would be valid under the following variation of the MCAR assumption: \[A \perp Y(0),Y(1), \quad{} 0 < \mathbb{P}(A=1) < 1\]

  • The first part states that the treatment \(A\) is independent of \(Y(1), Y(0)\). This is also referred to as (complete) exchangeability, ignorability, or complete randomization in causal inference.
  • \(0 < \mathbb{P}(A=1) <1\) states that there is a non-zero probability of observing some entries from the columns of \(Y(1)\) and \(Y(0)\). This is (again) referred to as positivity or overlap in causal inference.

Formal Proof of Causal Identification of the ATE

Suppose SUTVA and MCAR hold:

  • (A1, SUTVA): \(Y = A Y(1) + (1-A) Y(0)\)
  • (A2, Ignorability): \(A \perp Y(1), Y(0)\)
  • (A3, Positivity): \(0 < \mathbb{P}(A=1) < 1\)

Then, we can identify the ATE from the observed data via: \[\begin{align*} &\mathbb{E}[Y|A=1] - \mathbb{E}[Y | A=0] \\ =& \mathbb{E}[AY(1) + (1-A)Y(0) | A=1] \\ & \quad{} - \mathbb{E}[AY(1) + (1-A)Y(0) | A=0] && \text{(A1)} \\ =& \mathbb{E}[Y(1)|A=1] - \mathbb{E}[Y(0) | A=0] && \text{Algebra} \\ =& \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] && \text{(A2)} \end{align*}\]

(A3) ensures that the conditioning events in \(\mathbb{E}[\cdot |A=0]\) and \(\mathbb{E}[\cdot |A=1]\) are well-defined.
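The identification result can be verified in a simulation where all three assumptions hold by construction (a sketch; the counterfactual distributions are hypothetical):

```r
set.seed(1)
n  <- 100000
Y0 <- rnorm(n, mean = 1)
Y1 <- rnorm(n, mean = 3)     # true ATE = E[Y(1) - Y(0)] = 2
A  <- rbinom(n, 1, 0.5)      # (A2) ignorability and (A3) positivity by design
Y  <- A * Y1 + (1 - A) * Y0  # (A1) SUTVA
mean(Y[A == 1]) - mean(Y[A == 0])  # difference in observed means, close to 2
mean(Y1 - Y0)                      # the (unobservable) ATE itself
```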

Interpreting the Causal Identification of the ATE

\[\underbrace{\mathbb{E}[Y|A=1] - \mathbb{E}[Y | A=0]}_{\text{Measure of Association}} = \underbrace{\mathbb{E}[Y(1)] - \mathbb{E}[Y(0)]}_{\text{Measure of Causation}}\]

This equality implies that under (A1,SUTVA), (A2, Ignorability), and (A3, Positivity), a measure of association between \(A\) and \(Y\) based on difference in means (i.e., the left-hand-side ) is equal to a measure of causation based on difference in counterfactual means (i.e., the right-hand side).

  • More concretely, suppose the difference in the population means of \(Y\) is \(0.5\) (i.e., \(\mathbb{E}[Y|A=1] - \mathbb{E}[Y | A=0] = 0.5\))
  • Then the difference in the means of the counterfactual outcomes is also \(0.5\) (i.e., \(\mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] = 0.5\))
  • In other words, under (A1,SUTVA), (A2, Ignorability), and (A3, Positivity), association can imply causation.

We can take this result a bit further by considering the setting where there is no association between \(A\) and \(Y\).

  • No association is equivalent to \(A \perp Y\).
    • This implies that there is no (linear) correlation between \(A\) and \(Y\) (i.e., the population \({\rm corr}(A,Y) = 0\))
    • In general, there is no dependence of any kind (linear or non-linear) between \(A\) and \(Y\).
  • An implication of \(A \perp Y\) is that \(\mathbb{E}[Y |A=1] = \mathbb{E}[Y|A=0] = \mathbb{E}[Y]\)
  • By the result above, if (A1,SUTVA), (A2, Ignorability), and (A3, Positivity) hold, \(\mathbb{E}[Y(1)] = \mathbb{E}[Y(0)]\).
  • In short, under (A1,SUTVA), (A2, Ignorability), and (A3, Positivity), no association implies no causation.

More broadly, as illustrated by both examples, association can imply certain causal claims if additional assumptions hold (e.g., (A1,SUTVA), (A2,Ignorability), and (A3, Positivity)).

  • Importantly, if SUTVA does not hold, there is no way to link the observed values \(A,Y\) to the counterfactual outcomes \(Y(1), Y(0)\)
  • Thus, as the old saying goes, association may not imply causation unless additional assumptions hold.

Motivating the Other Assumptions for Causal Identification: A Randomized Experiment

Consider an ideal, completely randomized trial/experiment (RCT) to assess the causal effect of a new drug (versus a control/placebo) on an outcome of interest.

  1. Enroll individuals to the experiment based on some enrollment criterion.
  2. Randomly assign some individuals to treatment (i.e., \(A=1\)) and others to control (i.e., \(A=0\))
  3. Observe outcomes \(Y\) from both groups.

RCTs have been referred to as the gold standard to study causal effects of a treatment on an outcome of interest. But why?

  • At a high level, an RCT recreates the parallel universe analogy.
  • Specifically, by randomization, all features about the study units are similar between the treated and the control groups.
    • The two groups are similar with respect to their measurable traits (\(X\))
    • The two groups are also similar with respect to their unmeasurable traits (\(U\))
  • Then, any difference in the outcome between the two groups can only be attributed to a difference in the treatment status, thus recreating the parallel universe analogy from our first lecture.

This was the “big” idea from Fisher in 1935, where he used randomization as the “reasoned basis” for causal inference. Paul Rosenbaum explains this more beautifully than I can in Chapter 2.3 of Rosenbaum (2020).

Formalizing RCTs with Counterfactual Outcomes

Consider the following data table.

|       | \(Y(1)\) | \(Y(0)\) | \(Y\) | \(A\) | X (Measured; age) | U (Unmeasured; environment) |
|-------|----------|----------|-------|-------|-------------------|------------------------------|
| John  | NA       | 0.94     | 0.94  | 0     | 23                | \(U_{\rm John}\)             |
| Sally | NA       | 0.91     | 0.91  | 0     | 27                | \(U_{\rm Sally}\)            |
| Kate  | 0.81     | NA       | 0.81  | 1     | 32                | \(U_{\rm Kate}\)             |
| Jason | 0.60     | NA       | 0.60  | 1     | 30                | \(U_{\rm Jason}\)            |

If the treatment \(A\) is completely randomized (as in an RCT), we would also have \(A \perp X, U\). More generally, we have

\[\text{(A2, Ignorability) } A \perp Y(1), Y(0), X, U.\]

Also, because there is at least one control unit and treated unit in an RCT, we have

\[\text{(A3, Positivity) } 0 < \mathbb{P}(A=1) < 1.\]

Even with the change in (A2, Ignorability) from before, the proof to identify the ATE in an RCT remains the same, i.e., \(\mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] = \mathbb{E}[Y|A=1] - \mathbb{E}[Y | A=0]\). This is because we only need \(A \perp Y(1), Y(0)\) for identification.

Covariate Balance

An important implication from randomization of treatment assignment is covariate balance.

Roughly speaking, we say that a covariate \(X\) (measured or unmeasured) is “balanced” between the treated and control groups if

\[\mathbb{P}(X |A=1) = \mathbb{P}(X | A=0)\]

  • In words, covariate balance states that the treated group and the control group have similar distributions of covariates.
  • Suppose measured and unmeasured covariates are balanced between the treated group and the control group.
  • Then on average, any difference in the outcome between the two groups can be attributed to the difference in their treatment status.

From the RCT motivation, it’s very obvious that covariate balance holds for both \(X\) and \(U\), i.e.

\[\mathbb{P}(X,U |A=1) = \mathbb{P}(X,U | A=0).\]

In general, covariates should be balanced between treated and control groups to make causal claims about the relationship between the outcome and the treatment.

  • As a result, it’s common to check for covariate balance in causal inference by comparing the means of \(X\)s among treated and control units (e.g. two-sample t-test of the mean of \(X\)).
  • It’s also common to do this for RCTs to verify that the randomization was done successfully.

In Chapter 9.1 of Rosenbaum (2020), Rosenbaum recommends using the pooled variance when computing the difference in means of a covariate between the treated group and the control group. Specifically, let \({\rm SD}(X)_{A=1}\) be the standard deviation of the covariate in the treated group and \({\rm SD}(X)_{A=0}\) be the standard deviation of the covariate in the control group. Then, Rosenbaum suggests the statistic

\[ \text{Standardized difference in means} = \frac{\bar{X}_{A=1}-\bar{X}_{A=0}}{\sqrt{ ({\rm SD}(X)_{A=1}^2 + {\rm SD}(X)_{A=0}^2)/2}} \]
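This statistic is straightforward to compute. Below is a sketch (the function name `std_diff` is ours, not from the text), applied to a randomized treatment where balance should hold:

```r
# Standardized difference in means, pooling the two group variances
std_diff <- function(X, A) {
  x1 <- X[A == 1]; x0 <- X[A == 0]
  (mean(x1) - mean(x0)) / sqrt((var(x1) + var(x0)) / 2)
}

set.seed(1)
n <- 500
A <- rbinom(n, 1, 0.5)             # randomized treatment
X <- rnorm(n, mean = 50, sd = 10)  # covariate generated independently of A
std_diff(X, A)  # near 0, as expected under randomization
```

In practice, absolute values below roughly 0.1 are commonly read as adequate balance, though the cutoff is a convention rather than a formal test.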

Note About Pre-treatment Covariates

We briefly mentioned that covariates \(X\) must precede treatment assignment, i.e.

  1. We collect \(X\) (i.e. baseline covariates)
  2. We assign treatment/control \(A\)
  3. We observe outcome \(Y\)

If they are post-treatment covariates, then the treatment can have a causal effect on both the outcome \(Y\) and the covariates \(X\).

In this case, it’s unclear whether the effect on \(Y\) arises directly from the treatment or through the treatment’s effect on \(X\). Studying this type of question is called causal mediation analysis.

In general, we don’t want to condition on post-treatment covariates \(X\) when the goal is to estimate the average treatment effect of \(A\) on \(Y\).

References

Cole, Stephen R, and Constantine E Frangakis. 2009. “The Consistency Statement in Causal Inference: A Definition or an Assumption?” Epidemiology 20 (1): 3–5.
Cox, David. 1958. Planning of Experiments. Wiley.
Hernán, Miguel, and James Robins. 2020. Causal Inference: What If. Boca Raton: Chapman & Hall/CRC.
Hernán, Miguel, and Sarah Taubman. 2008. “Does Obesity Shorten Life? The Importance of Well-Defined Interventions to Answer Causal Questions.” International Journal of Obesity 32 (3): S8–14.
Holland, Paul. 1986. “Statistics and Causal Inference.” Journal of the American Statistical Association 81 (396): 945–60.
Li, Shuangning, and Stefan Wager. 2022. “Random Graph Asymptotics for Treatment Effect Estimation Under Network Interference.” The Annals of Statistics 50 (4): 2334–58.
Rosenbaum, Paul. 2007. “Interference Between Units in Randomized Experiments.” Journal of the American Statistical Association 102 (477): 191–200.
———. 2020. Design of Observational Studies. Springer.
Rubin, Donald B. 1980. “Randomization Analysis of Experimental Data: The Fisher Randomization Test Comment.” Journal of the American Statistical Association 75 (371): 591–93.
Sävje, Fredrik, Peter Aronow, and Michael Hudgens. 2021. “Average Treatment Effects in the Presence of Unknown Interference.” Annals of Statistics 49 (2): 673.
VanderWeele, Tyler J. 2009. “Concerning the Consistency Assumption in Causal Inference.” Epidemiology 20 (6): 880–83.