The Experimental Ideal
Do hospitals have a positive impact on health? Let $Y_i$ be the observed health status of individual $i$ and let $D_i$ indicate whether they went to the hospital ($D_i = 1$) or not ($D_i = 0$).
$$
\textit{Potential Outcome} = \begin{cases}
Y_{1i} & \text{if } D_i = 1 \\\
Y_{0i} & \text{if } D_i = 0
\end{cases}$$
If we naively take the difference between the average health status of people who go to the hospital and people who don’t, we end up with the answer that going to the hospital actually makes you sicker. This is because of self-selection: people who go to the hospital are presumably sicker before they go than people who do not. We also can’t observe both $Y_{0i}$ and $Y_{1i}$ for a given person $i$, so we can never compare the two directly.
We can write the observed outcome as a function of the potential outcomes.
$$Y_i = \begin{cases}
Y_{1i} & \text{if } D_i = 1 \\\
Y_{0i} & \text{if } D_i = 0
\end{cases}$$
$$Y_i = Y_{0i} + (Y_{1i} - Y_{0i})D_i$$ That is, the observed outcome is $Y_{0i}$ unless the person goes to the hospital, in which case their treatment effect $Y_{1i} - Y_{0i}$ is added to the baseline.
$$E[Y_{1i}|D_i = 1] - E[Y_{0i}|D_i = 0]$$ This is the observed difference in average health between those who went to the hospital and those who did not, but it is a biased estimate of the treatment effect. If we add and subtract $E[Y_{0i}|D_i = 1]$, the expected baseline health of the people who were treated, we get:
$$E[Y_{1i}|D_i = 1] - E[Y_{0i}|D_i = 1] + E[Y_{0i}|D_i = 1] - E[Y_{0i}|D_i = 0]$$
The first two terms of the expression are the average treatment effect on those who were treated, $E[Y_{1i} - Y_{0i}|D_i = 1]$. The last two terms are what would have happened to the hospitalized had they not gone, minus the outcomes of the people who never went to the hospital. This difference constitutes the selection bias.
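The decomposition can be checked numerically. The following is a minimal sketch, not a model of real hospital data: the constant effect of 0.5, the normal baseline health, and the selection rule are all assumptions made for illustration. Because this is a simulation, both potential outcomes are visible, so each term of the decomposition can be computed directly.

```python
import random
from statistics import fmean

rng = random.Random(0)   # fixed seed so the run is reproducible
n = 100_000
true_effect = 0.5        # assumed constant effect Y1 - Y0 (hypothetical value)

y1_treated, y0_treated, y0_untreated = [], [], []
for _ in range(n):
    y0 = rng.gauss(0.0, 1.0)   # potential outcome without hospitalization
    y1 = y0 + true_effect      # potential outcome with hospitalization
    # Assumed self-selection rule: a noisy signal of poor baseline health
    # sends people to the hospital.
    goes_to_hospital = y0 + rng.gauss(0.0, 1.0) < -0.5
    if goes_to_hospital:
        y1_treated.append(y1)    # observed in real data
        y0_treated.append(y0)    # counterfactual, visible only in a simulation
    else:
        y0_untreated.append(y0)  # observed in real data

# E[Y1|D=1] - E[Y0|D=0]: the naive comparison
observed_diff = fmean(y1_treated) - fmean(y0_untreated)
# E[Y1|D=1] - E[Y0|D=1]: average treatment effect on the treated
att = fmean(y1_treated) - fmean(y0_treated)
# E[Y0|D=1] - E[Y0|D=0]: selection bias
selection_bias = fmean(y0_treated) - fmean(y0_untreated)

print(f"observed difference: {observed_diff:.3f}")
print(f"effect on treated:   {att:.3f}")
print(f"selection bias:      {selection_bias:.3f}")
```

Under these assumptions the selection bias is negative and larger in magnitude than the treatment effect, so the naive comparison has the wrong sign even though hospitalization helps everyone in the simulated population.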
Random Assignment Saves the Day
What we are after is the average causal effect of hospitalization on those who were hospitalized.
$$E[Y_{1i}|D_i = 1] - E[Y_{0i}|D_i = 1] = E[Y_{1i}-Y_{0i}|D_i = 1]$$ Previously we saw that there is selection bias when we compare observed outcomes. By randomly assigning $D_i$ within a population we get around this problem. Random assignment makes $D_i$ independent of the potential outcomes, and in particular of $Y_{0i}$. This means that the expectation of $Y_{0i}$ will be the same irrespective of the treatment $D_i$.
$$E[Y_{0i}] = E[Y_{0i}|D_i = 1] = E[Y_{0i}|D_i = 0]$$
When we plug this fact back into our decomposition of the observed difference
$$E[Y_{1i}|D_i = 1] - E[Y_{0i}|D_i = 1] + E[Y_{0i}|D_i = 1] - E[Y_{0i}|D_i = 0]$$
the last two terms (the selection bias) cancel each other out, leaving just
$$E[Y_{1i}|D_i = 1] - E[Y_{0i}|D_i = 1] = E[Y_{1i} - Y_{0i}]$$
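To see the selection bias term vanish, here is a small sketch in which treatment is assigned by a coin flip. The population, the average effect of 0.5, and the spread of individual effects are illustrative assumptions, not values from the text.

```python
import random
from statistics import fmean

rng = random.Random(1)   # fixed seed so the run is reproducible
n = 100_000
average_effect = 0.5     # assumed E[Y1 - Y0] (hypothetical value)

y1_treated, y0_treated, y0_control = [], [], []
for _ in range(n):
    y0 = rng.gauss(0.0, 1.0)                  # baseline health
    y1 = y0 + rng.gauss(average_effect, 0.5)  # effect varies across people
    if rng.random() < 0.5:      # coin flip, unrelated to y0 or y1
        y1_treated.append(y1)   # observed
        y0_treated.append(y0)   # counterfactual, visible only in a simulation
    else:
        y0_control.append(y0)   # observed

# E[Y1|D=1] - E[Y0|D=0]: the comparison an experimenter can actually compute
observed_diff = fmean(y1_treated) - fmean(y0_control)
# E[Y0|D=1] - E[Y0|D=0]: close to zero under random assignment
selection_bias = fmean(y0_treated) - fmean(y0_control)

print(f"observed difference: {observed_diff:.3f}")
print(f"selection bias:      {selection_bias:.3f}")
```

With a sample this large, the observed difference should land close to the assumed average effect and the selection bias term close to zero, up to sampling noise.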
Good experimental design is still hard (see the entire class I took on experimental design), but this solves one of the main issues of science.
Experiments can be estimated using a regression
$$Y_i = \alpha + \rho D_i + \eta_i$$
When we have selection bias, the treatment indicator $D_i$ will be correlated with the residual $\eta_i$, and the estimate of $\rho$ is biased. With random assignment this correlation disappears, and the regression estimates the causal effect $\rho$.
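A short sketch of this point, assuming a constant effect $\rho = 0.5$ and the same kind of made-up population as before. The helper names `ols_slope`, `simulate`, and `diff_in_means` are hypothetical. With a binary regressor, the OLS slope is numerically identical to the difference in group means, so the regression inherits whatever bias that comparison has.

```python
import random
from statistics import fmean

def ols_slope(d, y):
    """OLS slope from regressing y on a constant and d."""
    d_bar, y_bar = fmean(d), fmean(y)
    s_dy = sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
    s_dd = sum((di - d_bar) ** 2 for di in d)
    return s_dy / s_dd

def diff_in_means(d, y):
    """Mean outcome of the treated minus mean outcome of the untreated."""
    treated = [yi for di, yi in zip(d, y) if di == 1]
    control = [yi for di, yi in zip(d, y) if di == 0]
    return fmean(treated) - fmean(control)

def simulate(randomized, n=50_000, rho=0.5, seed=2):
    """Draw (d, y) with Y_i = Y_0i + rho * D_i under an assumed design."""
    rng = random.Random(seed)
    d, y = [], []
    for _ in range(n):
        y0 = rng.gauss(0.0, 1.0)
        if randomized:
            di = 1 if rng.random() < 0.5 else 0   # coin flip
        else:
            # Assumed self-selection: poor baseline health raises treatment odds.
            di = 1 if y0 + rng.gauss(0.0, 1.0) < -0.5 else 0
        d.append(di)
        y.append(y0 + rho * di)
    return d, y

d_rand, y_rand = simulate(randomized=True)
d_self, y_self = simulate(randomized=False)
rho_rand = ols_slope(d_rand, y_rand)
rho_self = ols_slope(d_self, y_self)

print(f"randomized:    rho_hat = {rho_rand:.3f}")
print(f"self-selected: rho_hat = {rho_self:.3f}")
```

Here $Y_{0i}$ plays the role of $\alpha + \eta_i$, so in the self-selected design $D_i$ is correlated with the residual by construction and the estimated slope is pulled away from 0.5.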
Including other covariates has some important effects
$$Y_i = \alpha + \rho D_i + X_i' \gamma + \eta_i$$
If the controls $X_i$ are uncorrelated with the treatment, then they will not affect the estimate of $\rho$. Their inclusion can still make the estimate more precise: because the controls help predict the outcome, the residuals shrink, and so do the standard errors.
Exercise: Show this using simulation
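One possible way to approach the exercise is a small Monte Carlo study. Everything here is an assumption chosen for illustration: a single covariate, $\rho = 0.5$, $\gamma = 2$, standard normal noise, 200 observations per experiment, and 500 repeated experiments. The helper names `rho_short` and `rho_long` are hypothetical. The sketch compares the spread of the estimates of $\rho$ with and without the control, rather than computing analytic standard errors.

```python
import random
from statistics import fmean, stdev

def rho_short(d, y):
    """Coefficient on d from regressing y on a constant and d."""
    d_bar, y_bar = fmean(d), fmean(y)
    s_dy = sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
    s_dd = sum((di - d_bar) ** 2 for di in d)
    return s_dy / s_dd

def rho_long(d, x, y):
    """Coefficient on d from regressing y on a constant, d, and x."""
    d_bar, x_bar, y_bar = fmean(d), fmean(x), fmean(y)
    dc = [di - d_bar for di in d]
    xc = [xi - x_bar for xi in x]
    yc = [yi - y_bar for yi in y]
    s_dd = sum(a * a for a in dc)
    s_xx = sum(a * a for a in xc)
    s_dx = sum(a * b for a, b in zip(dc, xc))
    s_dy = sum(a * b for a, b in zip(dc, yc))
    s_xy = sum(a * b for a, b in zip(xc, yc))
    # Solve the 2x2 normal equations for the coefficient on d (Cramer's rule).
    return (s_dy * s_xx - s_xy * s_dx) / (s_dd * s_xx - s_dx ** 2)

rng = random.Random(3)   # fixed seed so the run is reproducible
rho, gamma = 0.5, 2.0    # assumed true coefficients
n, reps = 200, 500       # sample size per experiment, number of experiments

short_estimates, long_estimates = [], []
for _ in range(reps):
    d = [1 if rng.random() < 0.5 else 0 for _ in range(n)]  # random assignment
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]             # drawn independently of d
    y = [1.0 + rho * di + gamma * xi + rng.gauss(0.0, 1.0)
         for di, xi in zip(d, x)]
    short_estimates.append(rho_short(d, y))
    long_estimates.append(rho_long(d, x, y))

print(f"without control: mean {fmean(short_estimates):.3f}, "
      f"sd {stdev(short_estimates):.3f}")
print(f"with control:    mean {fmean(long_estimates):.3f}, "
      f"sd {stdev(long_estimates):.3f}")
```

Both sets of estimates should be centered near the assumed $\rho$, while the estimates that include the control vary less from experiment to experiment, which is the precision gain described above.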