The Conditional Expectation Function (CEF)
The CEF for a dependent variable \(Y_i\), given a \(k \times 1\) vector of covariates \(X_i\) (with elements \(x_{ki}\)), is the expectation, or population average, of \(Y_i\) with \(X_i\) held fixed (think of the average in an infinitely large sample).
$$E[Y_i \vert X_i=x]$$
The CEF is a function of $X_i$ only. The potential outcomes framework presented an important special case of the CEF, with a binary regressor $D_i \in \{0, 1\}$. Because $X_i$ is random, the CEF $E[Y_i \vert X_i]$ is itself a random variable, but evaluating it at a particular value $X_i = x$ gives a fixed number.
Assume that $Y_i$ is continuous with a conditional probability density function $f_y(t\vert X_i=x)$ at $Y_i = t$. Then
$$E[Y_i \vert X_i = x] = \int tf_y(t\vert X_i = x) dt$$
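As a concrete instance (the standard bivariate normal example, stated here only as an illustration): if $(X_i, Y_i)$ is jointly normal with means $\mu_X, \mu_Y$, standard deviations $\sigma_X, \sigma_Y$, and correlation $\rho$, the integral above evaluates to a CEF that is linear in $x$,
$$E[Y_i \vert X_i = x] = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x - \mu_X)$$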
In the discrete case
$$E[Y_i \vert X_i = x] = \sum_{t} t \cdot P(Y_i = t|X_i=x)$$
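As a quick numerical illustration, here is a minimal simulation sketch of the discrete-case formula; the variables, sample size, and the schooling/earnings story are all hypothetical. The sample analogue of $E[Y_i \vert X_i = x]$ is simply the average of $Y_i$ within each cell of $X_i$.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical population: X_i is years of schooling, Y_i is log earnings
# whose mean shifts with X_i. All names and numbers here are illustrative.
n = 100_000
x = rng.integers(10, 16, size=n)                 # discrete covariate
y = 1.0 + 0.1 * x + rng.normal(0, 0.5, size=n)   # outcome

# Sample analogue of E[Y_i | X_i = x]: average Y_i within each cell of X_i.
cef = pd.Series(y).groupby(x).mean()
print(cef)   # one conditional mean per observed value of x
```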
Law of Iterated Expectations
$$E[Y_i] = E\big\{E[Y_i|X_i]\big\}$$
A proof of LIE for the continuous case
$$
\begin{equation}
\begin{split}
E\big\{E[Y_i \vert X_i ]\big\} &= \int E[Y_i \vert X_i = u] g_x(u) du \\\
&= \int \left[ \int tf_y(t\vert X_i = u) dt\right] g_x(u) du \\\
&= \int \int tf_y(t\vert X_i = u)g_x(u) dtdu \\\
&= \int t\left[\int f_y(t\vert X_i = u)g_x(u) du\right] dt \\\
&= \int t \left[ \int f_{xy}(u, t) du\right] dt \\\
&= \int t g_y(t)dt = E[Y_i]
\end{split}
\end{equation}
$$
where $g_x$ and $g_y$ are the marginal densities of $X_i$ and $Y_i$, and $f_{xy}(u, t) = f_y(t\vert X_i = u)g_x(u)$ is their joint density.
In the discrete case, I start by noting that
$$P(Y_i = y_t\vert X_i=x_j) = \frac{P(Y_i = y_t, X_i=x_j)}{P(X_i=x_j)}$$
$$
\begin{equation}
\begin{split}
E\big\{E[Y_i \vert X_i] \big\} &= \sum_{j=1}^{d_x} E[Y_i \vert X_i = x_j] \cdot P(X_i = x_j) \\\
& = \sum_{j=1}^{d_x}\left [\sum_{k=1}^{d_y} y_k\cdot P(Y_i = y_k|X_i=x_j) \right] \cdot P(X_i = x_j) \\\
& = \sum_{k=1}^{d_y} y_k\sum_{j=1}^{d_x} P(Y_i = y_k, X_i=x_j) \\\
& = \sum_{k=1}^{d_y} y_k P(Y_i = y_k) = E[Y_i]
\end{split}
\end{equation}
$$
where $d_x$ and $d_y$ denote the number of support points of $X_i$ and $Y_i$.
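Both versions of the LIE can be checked by simulation. Below is a minimal sketch under made-up parameters: the inner step computes $E[Y_i \vert X_i = x_j]$ cell by cell, and the outer step averages those conditional means over the distribution of $X_i$.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Illustrative discrete example checking E[Y_i] = E{ E[Y_i | X_i] }.
n = 200_000
x = rng.choice([0, 1, 2], size=n, p=[0.5, 0.3, 0.2])
y = 2.0 * x + rng.normal(0, 1, size=n)

cef = pd.Series(y).groupby(x).mean()              # inner step: E[Y_i | X_i = x_j]
p_x = pd.Series(x).value_counts(normalize=True)   # P(X_i = x_j)
lie = (cef * p_x).sum()                           # outer step: average over X_i

print(y.mean(), lie)   # the two numbers should agree up to sampling noise
```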
**The CEF Decomposition Property**
$$Y_i = E[Y_i | X_i] + \varepsilon_i$$ where the residual $\varepsilon_i$ satisfies two properties (both are checked numerically in the sketch below):
- (1) $\varepsilon_i$ is mean independent of $X_i$: $E[\varepsilon_i | X_i] = 0$
- (2) $\varepsilon_i$ is uncorrelated with any function of $X_i$
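A minimal simulation sketch of the two properties, with an assumed nonlinear CEF (all names and numbers are illustrative): the residual from the cell-by-cell conditional mean has approximately zero mean within every cell of $X_i$ and is approximately uncorrelated with an arbitrary function of $X_i$ such as $X_i^2$.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# Illustrative setup with a nonlinear CEF; all names and numbers are made up.
n = 200_000
x = rng.choice([1.0, 2.0, 3.0, 4.0], size=n)
y = np.sin(x) + rng.normal(0, 1, size=n)

# Residual from the cell-by-cell (sample) CEF.
cef = pd.Series(y).groupby(x).transform("mean")
eps = y - cef

# (1) mean independence: E[eps | X = x] is ~0 in every cell
print(pd.Series(eps).groupby(x).mean())
# (2) uncorrelated with any function of X, e.g. h(X) = X**2
print(np.cov(eps, x**2)[0, 1])   # ~0 up to sampling noise
```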
The CEF Prediction Property
Let $m(X_i)$ be any function of $X_i$; the CEF solves
$$E[Y_i \vert X_i ] = \arg \min_{m(X_i)}E[(Y_i - m(X_i))^2]$$
Proof
$$
\begin{split}
(Y_i - m(X_i))^2 &= ((Y_i - E[Y_i \vert X_i]) + (E[Y_i\vert X_i] - m(X_i)))^2 \\\
&=(Y_i - E[Y_i\vert X_i])^2 + 2[(Y_i - E[Y_i\vert X_i]) \times (E[Y_i \vert X_i] - m(X_i))] + (E[Y_i \vert X_i] - m(X_i))^2 \\\
&=\varepsilon_i^2 + 2h(X_i)\varepsilon_i + (E[Y_i\vert X_i] - m(X_i))^2
\end{split}
$$
where $h(X_i) \equiv E[Y_i \vert X_i] - m(X_i)$ and $\varepsilon_i = Y_i - E[Y_i \vert X_i]$ as in the decomposition property.
Taking the expectation, and noting that the cross term vanishes because $\varepsilon_i$ is uncorrelated with any function of $X_i$, we arrive at
$$\begin{split}
E[(Y_i - m(X_i))^2] &= E[\varepsilon_i^2] + 2E[h(X_i)\varepsilon_i] + E[(E[Y_i|X_i] - m(X_i))^2] \\\
& = E[\varepsilon_i^2] + 0 + E[(E[Y_i|X_i] - m(X_i))^2] \\\
& = E[\varepsilon_i^2] + E[E[Y_i|X_i]^2 ] - 2E[E[Y_i\vert X_i]m(X_i)] + E[m(X_i)^2]
\end{split}
$$
Taking the first-order condition with respect to $m(X_i)$
$$\begin{split}
\frac{\partial E[(Y_i - m(X_i))^2]}{\partial m(X_i)} = -2E[Y_i\vert X_i] + 2m(X_i) = 0 \\\
\therefore m(X_i) = E[Y_i\vert X_i]
\end{split}
$$
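The prediction property can also be illustrated numerically. In the sketch below (a hypothetical setup, not taken from these notes), the cell-by-cell conditional mean is compared against two other candidate predictors; the CEF attains the smallest mean squared error.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Illustrative comparison of the CEF against other candidate predictors m(X_i).
n = 200_000
x = rng.choice([0.0, 1.0, 2.0, 3.0], size=n)
y = x**2 + rng.normal(0, 1, size=n)

cef = pd.Series(y).groupby(x).transform("mean")   # E[Y_i | X_i], cell by cell
linear = 3.0 * x - 1.0                            # an arbitrary linear rule
constant = np.full(n, y.mean())                   # predict the overall mean

for name, m in [("CEF", cef), ("linear", linear), ("constant", constant)]:
    print(name, np.mean((y - m) ** 2))   # the CEF attains the smallest MSE
```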
The ANOVA Theorem
$$V(Y_i) = V(E[Y_i\vert X_i]) + E[V(Y_i\vert X_i)]$$ Proof
By the CEF decomposition property we know that
$$\varepsilon_i = Y_i - E[Y_i \vert X_i]$$ Because $\varepsilon_i$ and $E[Y_i \vert X_i]$ are uncorrelated by construction, $V(Y_i) = V(E[Y_i \vert X_i]) + V(\varepsilon_i)$, so it remains to show that $V(\varepsilon_i) = E[V(Y_i \vert X_i)]$. The variance of $\varepsilon_i$ can be written as:
$$\begin{split}V(\varepsilon_i) &= E(\varepsilon_i^2) - E(\varepsilon_i)^2 = E(\varepsilon_i^2) \\\
&= E\big[E[\varepsilon_i^2|X_i]\big] = E[V(Y_i \vert X_i)]
\end{split}$$
The second line applies the LIE to $\varepsilon_i^2$ and uses $E[\varepsilon_i^2 \vert X_i] = V(Y_i \vert X_i)$, which follows from $E[\varepsilon_i \vert X_i] = 0$.
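Finally, a minimal simulation sketch of the ANOVA theorem under assumed (heteroskedastic) parameters: the variance of the fitted CEF plus the average squared residual reproduces the total variance of $Y_i$.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Illustrative simulation check of V(Y_i) = V(E[Y_i|X_i]) + E[V(Y_i|X_i)].
n = 200_000
x = rng.choice([0.0, 1.0, 2.0], size=n)
y = 2.0 * x + rng.normal(0, 1 + x, size=n)        # conditional variance grows with x

cef = pd.Series(y).groupby(x).transform("mean")   # E[Y_i | X_i], observation by observation
explained = cef.var(ddof=0)                       # V(E[Y_i | X_i])
residual = np.mean((y - cef) ** 2)                # E[V(Y_i | X_i)], since E[eps | X_i] = 0

print(np.var(y), explained + residual)            # the two sides should match closely
```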