What Makes a Bad Control Variable, and What Makes a Good One?


Written by @因果推断研究小组 (Causal Inference Research Group)

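Causal identification is receiving more and more attention, which makes the conditional independence assumption (CIA) especially important (the CIA itself can be discussed in the group). Today our group introduces the question of how to choose control variables: what counts as a bad control, and what counts as a good one? Bad controls are best left out of a regression model, because including them produces the familiar problem of selection bias.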

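So what exactly is a bad control? It is a variable that is itself affected by the explanatory variable of interest. In other words, it was not already predetermined at the time the explanatory variable took its value. Take the effect of education on income: should we control for occupation? Controlling for occupation means comparing people with different levels of education within the same occupation. Does that introduce selection bias? It does: education affects both a person's occupation and their income, so occupation is not predetermined relative to education. We therefore treat it as a bad control.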
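The occupation example can be checked with a small simulation. The sketch below is only an illustration under assumed numbers, none of which come from the book: a true college effect of 2.0 on wages, an unobserved `ability` term that raises both wages and the chance of a white collar job, and no direct effect of occupation on wages. College is assigned at random, so the raw difference in mean wages recovers the true effect, while the same comparison within an occupation does not.

```python
import random
from statistics import mean

TRUE_EFFECT = 2.0  # assumed causal effect of college on wages (illustrative)

def simulate(n=100_000, seed=0):
    """Return a list of (college, white_collar, wage) tuples."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        ability = rng.gauss(0.0, 1.0)      # unobserved by the researcher
        college = rng.random() < 0.5       # randomly assigned treatment
        # Occupation is an OUTCOME of college: a degree lowers the bar
        # for getting a white collar job.
        white_collar = ability + 1.0 * college + rng.gauss(0.0, 1.0) > 0.5
        # Wages depend on college and ability, not directly on occupation.
        wage = 10.0 + TRUE_EFFECT * college + 3.0 * ability + rng.gauss(0.0, 1.0)
        rows.append((college, white_collar, wage))
    return rows

def diff_in_means(rows, white_collar=None):
    """College minus non-college mean wage, optionally within one occupation."""
    if white_collar is not None:
        rows = [r for r in rows if r[1] == white_collar]
    treated = [wage for college, _, wage in rows if college]
    control = [wage for college, _, wage in rows if not college]
    return mean(treated) - mean(control)

rows = simulate()
overall = diff_in_means(rows)                      # no occupation control
within_white = diff_in_means(rows, white_collar=True)
within_blue = diff_in_means(rows, white_collar=False)

print(f"no control:          {overall:.2f}")
print(f"white collar only:   {within_white:.2f}")
print(f"blue collar only:    {within_blue:.2f}")
```

In this setup the within-occupation gaps come out well below the true effect. Non-graduates who still land a white collar job tend to have higher ability than graduates in the same job, so conditioning on occupation compares different kinds of people even though college itself was randomized.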

Below is an excerpt from a chapter of Mostly Harmless Econometrics: An Empiricist's Companion. If the summary above does not read smoothly, the original is worth a look.

We have made the point that control for covariates can make the conditional independence assumption more plausible. But more control is not always better. Some variables are bad controls and should not be included in a regression model even when their inclusion might be expected to change the short regression coefficients. Bad controls are variables that are themselves outcome variables in the notional experiment at hand. That is, bad controls might just as well be dependent variables too. Good controls are variables that we can think of as having been fixed at the time the regressor of interest was determined.

The essence of the bad control problem is a version of selection bias, albeit somewhat more subtle than the selection bias. To illustrate, suppose we are interested in the effects of a college degree on earnings and that people can work in one of two occupations, white collar and blue collar. A college degree clearly opens the door to higher-paying white collar jobs. Should occupation therefore be seen as an omitted variable in a regression of wages on schooling? After all, occupation is highly correlated with both education and pay. Perhaps it's best to look at the effect of college on wages for those within an occupation, say white collar only. The problem with this argument is that once we acknowledge the fact that college affects occupation, comparisons of wages by college degree status within an occupation are no longer apples-to-apples, even if college degree completion is randomly assigned.
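The apples-to-oranges point can be made explicit in potential-outcomes notation. The symbols here are assumptions introduced for illustration: $C_i$ is a randomly assigned college dummy, $W_i$ is a white collar dummy, $Y_i$ is earnings, and $\{Y_{1i}, Y_{0i}\}$ and $\{W_{1i}, W_{0i}\}$ are the potential earnings and potential occupations with and without a degree. Comparing earnings by degree status among white collar workers gives:

```latex
\begin{aligned}
E[Y_i \mid W_i = 1, C_i = 1] - E[Y_i \mid W_i = 1, C_i = 0]
  &= E[Y_{1i} \mid W_{1i} = 1, C_i = 1] - E[Y_{0i} \mid W_{0i} = 1, C_i = 0] \\
  &= E[Y_{1i} \mid W_{1i} = 1] - E[Y_{0i} \mid W_{0i} = 1] \\
  &= \underbrace{E[Y_{1i} - Y_{0i} \mid W_{1i} = 1]}_{\text{causal effect for those white collar with a degree}}
   + \underbrace{E[Y_{0i} \mid W_{1i} = 1] - E[Y_{0i} \mid W_{0i} = 1]}_{\text{selection bias}}
\end{aligned}
```

The second line uses random assignment of $C_i$. The selection term compares the no-degree earnings of two different groups: people who work white collar with a degree and people who work white collar without one. It is generally nonzero, because college changes who ends up in white collar jobs.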

[...] be outcomes in the causal nexus. In many cases, however, the timing is uncertain or unknown. In such cases, clear reasoning about causal channels requires explicit assumptions about what happened first, or the assertion that none of the control variables are themselves caused by the regressor of interest.
