The importance of completeness in linear regressions is an often-discussed issue: leaving out relevant variables can make the coefficient estimates inconsistent.
Assume a complete linear model of the form:
z = a + bx + cy + ε,
where z is the dependent variable, x and y are the independent variables, and ε is the error term.
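As a quick sanity check of this setup, here is a minimal simulation sketch in Python (NumPy only). The coefficient values, the distributions, the non-zero mean of y, the sample size, and the random seed are all illustrative assumptions, not values from the article:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
a, b, c = 1.0, 2.0, 3.0          # illustrative "true" coefficients (assumed)

x = rng.normal(0.0, 1.0, n)
y = rng.normal(5.0, 1.0, n)      # note the non-zero mean of y
eps = rng.normal(0.0, 1.0, n)
z = a + b * x + c * y + eps

# Fit the complete model z ~ 1 + x + y by ordinary least squares.
X_full = np.column_stack([np.ones(n), x, y])
coef_full, *_ = np.linalg.lstsq(X_full, z, rcond=None)
print("estimated (a, b, c):", np.round(coef_full, 2))   # should be close to (1, 2, 3)
```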
Now we drop y and check which terms are affected. By removing one dimension we turn the regression plane into a regression line. Seen in the original R3, this line (the incomplete model) sits at the center of y, more precisely at ȳ, the mean of y. If y is left out, both "a" and ε therefore need a correction.
In the initial model (R3), "a" is the value of z at x = 0 and y = 0. To obtain the new intercept α, "a" must be shifted to ȳ:
α = a + cȳ.
For the residuals, the explanatory contribution of y disappears and is absorbed into a larger error term u:
u = ε + c(y – ȳ).
So the model
z = α + bx + u
expands to
z = a + cȳ + bx + ε + c(y – ȳ).
Removing the parentheses, the cȳ terms cancel and we recover the initial model for z.
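The intercept shift can also be checked numerically. The sketch below (same assumed coefficients and distributions as above) fits the short regression of z on x alone; because x and y are independent here, the estimated intercept should land near a + cȳ while the slope on x stays near b:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
a, b, c = 1.0, 2.0, 3.0

x = rng.normal(0.0, 1.0, n)
y = rng.normal(5.0, 1.0, n)                     # independent of x
z = a + b * x + c * y + rng.normal(0.0, 1.0, n)

# Short regression: z ~ 1 + x, leaving y out.
X_short = np.column_stack([np.ones(n), x])
(alpha_hat, b_hat), *_ = np.linalg.lstsq(X_short, z, rcond=None)

print("alpha_hat:", round(float(alpha_hat), 2),
      "vs. a + c*mean(y) =", round(float(a + c * y.mean()), 2))
print("b_hat:    ", round(float(b_hat), 2), "vs. b =", b)
```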
Now assume there is a correlation between x and y.
For the consistency of the coefficients this is not a problem in the initial (complete) model in R3, although it may cause multicollinearity and the resulting variance inflation. In the incomplete model in R2, however, there will be a correlation between the independent variable x and the error term u, so the estimate of b becomes inconsistent. Conversely, if there is no correlation between the omitted variable(s) and the variables kept in the model, consistency is not at risk, unless the endogeneity comes from measurement error or reverse causality. But that is another story…
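To see the omitted-variable problem in numbers, the following hedged sketch constructs y so that it is correlated with x (the 0.8 loading and all other values are assumptions for illustration). Dropping y then pushes part of cy into u, which correlates with x, and the short regression's slope on x drifts away from b by roughly c·cov(x, y)/var(x):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
a, b, c = 1.0, 2.0, 3.0

x = rng.normal(0.0, 1.0, n)
y = 5.0 + 0.8 * x + rng.normal(0.0, 1.0, n)     # y is correlated with x
z = a + b * x + c * y + rng.normal(0.0, 1.0, n)

# Short regression: z ~ 1 + x (y omitted).
X_short = np.column_stack([np.ones(n), x])
(_, b_short), *_ = np.linalg.lstsq(X_short, z, rcond=None)

# Full regression: z ~ 1 + x + y.
X_full = np.column_stack([np.ones(n), x, y])
(_, b_full, _), *_ = np.linalg.lstsq(X_full, z, rcond=None)

bias = c * np.cov(x, y)[0, 1] / np.var(x)       # textbook omitted-variable bias term
print("slope on x, full model: ", round(float(b_full), 2))   # close to b = 2
print("slope on x, short model:", round(float(b_short), 2))  # close to b + c*0.8 = 4.4
print("b + predicted bias:     ", round(float(b + bias), 2))
```

The full model still recovers b despite the correlation (that only inflates variances), while the short model does not, which is the distinction drawn in the paragraph above.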
Credit: Data Science Central By: Frank Raulf