This is the first in a new series of posts exploring assumptions behind various statistical tests and measures, with a focus on understanding what happens when those assumptions are violated.

In this first post, I’ll take a look at the use of *Analysis of Covariance* (ANCOVA) to statistically “control for” and “remove” the effects of an extraneous, third variable from a general linear model that describes the relationship between a dichotomous predictor variable and a continuous dependent measure.

Specifically, I’ll take a look at the appropriateness of using ANCOVA to help answer the following question:

Are people more likely to relocate to a new town or city if there is a four-year college or university within the same county?

To find out, I gathered estimates from the U.S. Census Bureau on annual net migration for each county or county equivalent in the United States (net migration = total number of people who move into a county each year – total number of people who move out of a county each year). I then cross-referenced these data with information gathered from IPEDS on whether each county possesses a four-year college or university.

Assuming prospective employees and students are most attracted to colleges and universities with graduate-level programs, I restricted my analysis to four-year public or private not-for-profit institutions within the following Carnegie Classifications:

- Research Universities (very high research activity)
- Research Universities (high research activity)
- Doctoral/Research Universities
- Master’s Colleges and Universities (larger programs)
- Master’s Colleges and Universities (medium programs)
- Master’s Colleges and Universities (smaller programs)

First, let’s look at the migration data.

Figure 1 below shows the annual net migration rate for each county or county equivalent in the United States, according to 2012-2013 estimates from the U.S. Census Bureau [1].

*Figure 1: Annual Net Migration Rate for Each County or County Equivalent in the United States.*

*\*Net migration rate is calculated as annual net migration per 1,000 county residents [2].*
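As a quick illustration of the two definitions above (net migration, and net migration rate per 1,000 residents), here is a minimal Python sketch. The function name and the example county figures are hypothetical and are not drawn from the Census data.

```python
def net_migration_rate(moved_in: int, moved_out: int, population: int) -> float:
    """Annual net migration per 1,000 county residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Net migration = people moving in minus people moving out.
    net_migration = moved_in - moved_out
    return 1000 * net_migration / population


# Hypothetical county: 5,200 arrivals, 4,900 departures, 150,000 residents.
example_rate = net_migration_rate(5200, 4900, 150_000)
print(example_rate)  # 2.0
```

A positive rate means more people arrived than left; a negative rate means the opposite.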

Figure 2 shows the distribution of four-year public and private not-for-profit institutions across the country. Out of all 3,143 counties in the country, 17.09% have at least one four-year college or university that is classified as Master’s level or higher.

*Figure 2: Four-Year Public or Private Colleges or Universities with Graduate-Level Programs.*

#### Do Counties with Four-Year Institutions Have a Higher Annual Net Migration Rate?

Indeed, they do.

A simple comparison reveals that counties with four-year graduate-level institutions have an average annual net migration rate of +1.43 people / 1,000 county residents. Meanwhile, counties without four-year graduate-level institutions have an average annual net migration rate of -0.72 people / 1,000 county residents.

*Figure 3: Mean net migration rate for counties with and without four-year colleges and universities.*

*\*Error bars reflect standard error of the mean.*
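The kind of comparison summarized in Figure 3 (a group mean with a standard error of the mean) can be sketched as follows. The rates below are made up for illustration and are not the county data; `mean_and_sem` is a hypothetical helper name.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_sem(values):
    """Return (mean, standard error of the mean) for a sample."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    # SEM = sample standard deviation divided by the square root of n.
    return mean(values), stdev(values) / sqrt(n)


# Made-up net migration rates (per 1,000 residents), for illustration only.
rates_with_college = [2.0, 4.0, -1.0, 3.0]
rates_without_college = [-2.0, 1.0, -3.0, 0.0]

m_with, sem_with = mean_and_sem(rates_with_college)
m_without, sem_without = mean_and_sem(rates_without_college)
print(f"with:    {m_with:.2f} (SEM {sem_with:.2f})")
print(f"without: {m_without:.2f} (SEM {sem_without:.2f})")
```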

But is this difference in annual net migration rate really due to the presence of four-year colleges and universities? It’s impossible to know for sure on the basis of this simple comparison.

Obviously, counties with four-year colleges and universities might differ from counties without four-year colleges and universities in a number of important ways.

For example, counties with four-year colleges and universities might also be more likely to have a better local economy, a better public school system, better paying jobs, and more job opportunities, among other things. We would surely expect any of these to contribute to a higher annual net migration rate.

#### Controlling for Extraneous Variables: A Cautionary Lesson on Analysis of Covariance (ANCOVA)

Although I don’t have data on any of the factors mentioned above, the data I have shows that counties with four-year colleges and universities are generally more populous than counties without four-year colleges and universities, as shown below in Figure 4.

*Figure 4: Median resident population for counties with and without four-year colleges and universities.*

*\*Error bars reflect interquartile range (IQR).*

Furthermore, counties with larger populations tend to have slightly higher annual net migration rates, as indicated by a small but statistically significant positive correlation between the two (*r* = 0.12, *p* < .0001).

So we might wonder – is the presence of a four-year college or university related to net migration *above and beyond* the influence of county population?

To determine this, one might look for a way to statistically “control for” and “remove” the effects of county population on annual net migration rates.*

*\*Although using annual net migration rate (defined here as net migration per 1,000 county residents) mostly adjusts for differences in county population already, it does not do so completely, as evidenced by the positive correlation between annual migration rate and county population.*

This brings us to the Analysis of Covariance (ANCOVA).

ANCOVA is a useful tool because it allows one to obtain a more precise estimate of the effect of a predictor variable (e.g., whether a given county has a four-year college or university) on a given dependent measure (e.g., annual net migration rate). It accomplishes this by mathematically removing the portion of variability in the dependent measure that is explained by a pre-selected third variable (called the *covariate*).

When used properly, ANCOVA can decrease the total variability in the dependent measure that needs to be explained. This has the effect of increasing statistical power, which thereby increases the chances of finding a statistically significant relationship between the predictor variable and the dependent measure.

Because ANCOVA is useful for removing “extra” variability from a measure, many researchers use this procedure to statistically “control for” pre-existing group differences on the covariate – differences that if left uncontrolled might offer a competing or less interesting explanation for why the predictor variable and the dependent measure are related.

So, to determine whether annual net migration rate is related to the presence of four-year colleges and universities after controlling for differences in county population, I conducted an Analysis of Covariance using annual net migration rate as the dependent measure, county type as the predictor variable (i.e., whether a given county has a four-year college or university), and county population as the covariate.*

*\*Because county population is highly positively skewed, I actually used the natural log transform of county population.*
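For reference, an ANCOVA of this kind can be written as an ordinary linear model. The symbols below are introduced here for illustration, and the equation assumes the standard ANCOVA form with a single common slope for the covariate (no group-by-covariate interaction):

```latex
y_i = \beta_0 + \beta_1 g_i + \beta_2 \ln(\mathrm{pop}_i) + \varepsilon_i
```

Here $y_i$ is the annual net migration rate for county $i$, $g_i$ is 1 if the county has a four-year college or university and 0 otherwise, and $\mathrm{pop}_i$ is the county's resident population. Under this form, $\beta_1$ is the difference between the two county types at any fixed value of the covariate.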

Figure 5 below shows the relationship between annual net migration rate (on the *y axis*) and county population (on the *x axis*), as well as the predicted values for annual net migration rate according to the best-fitting linear model.

*Figure 5: Relationship between Annual Net Migration Rate and Ln (County Population), for Counties with Four-Year Institutions and Counties without Four-Year Institutions.*

There are a couple of important things to point out about these results.

First, although statistically significant, the model provides a pretty lousy fit to the data, accounting for only 1.39% of the variability in annual net migration rates.

Second, annual net migration rate seems completely *unrelated* to whether a county possesses a four-year college or university.

When county population is held constant, the annual net migration rate for counties *with* a four-year college or university is -0.16 people per 1,000 county residents, whereas the annual net migration rate for counties *without* a four-year college or university is -0.39 people per 1,000 county residents, a difference that is far from statistically significant (*p* = 0.73).
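Covariate-adjusted group means like the ones just reported are typically obtained by fitting the linear model and then predicting each group's mean at the overall mean of the covariate. The sketch below assumes NumPy and a common-slope model. The function name and the eight made-up counties are hypothetical, so the numbers it produces have nothing to do with the results above.

```python
import numpy as np


def ancova_adjusted_means(y, group, covariate):
    """Fit y = b0 + b1*group + b2*covariate by ordinary least squares.

    Returns (adjusted mean for group 1, adjusted mean for group 0, b1),
    where each adjusted mean is the prediction at the overall covariate mean.
    """
    y = np.asarray(y, dtype=float)
    group = np.asarray(group, dtype=float)  # 0/1 indicator
    covariate = np.asarray(covariate, dtype=float)
    design = np.column_stack([np.ones_like(y), group, covariate])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    b0, b1, b2 = coef
    grand_mean = covariate.mean()
    adj_without = b0 + b2 * grand_mean
    adj_with = b0 + b1 + b2 * grand_mean
    return adj_with, adj_without, b1


# Made-up counties, for illustration only.
population = np.array([8e3, 2e4, 5e4, 1.5e5, 3e4, 9e4, 4e5, 1.2e6])
has_college = np.array([0, 0, 0, 0, 1, 1, 1, 1])
rate = np.array([-1.5, -0.4, 0.3, 1.1, -0.8, 0.2, 1.0, 2.1])

adj_with, adj_without, adjusted_diff = ancova_adjusted_means(
    rate, has_college, np.log(population)  # natural log, as in the post
)
print(f"adjusted mean (with):    {adj_with:.2f}")
print(f"adjusted mean (without): {adj_without:.2f}")
```

The gap between the two adjusted means equals the fitted coefficient on the group indicator, which is the quantity the ANCOVA significance test evaluates.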

So does this mean that four-year colleges and universities have little-to-no impact on county-level migration patterns after fully controlling for differences in county population?

Not necessarily.

As it turns out, there is a great deal of confusion about the appropriate use of ANCOVA, and several limitations in the analysis above prevent us from drawing strong conclusions.

#### Understanding the Limitations of ANCOVA

The analysis presented here suffers from two fatal flaws, which I intentionally included for the sake of illustrating the potential pitfalls of ANCOVA.

*(1) The predictor variable (county type) and the covariate (county population) are related.*

Although ANCOVA is commonly used to “control for” extraneous group differences, one of its critical assumptions is that the **predictor variable and the covariate are independent of one another**.

This is important because if the predictor variable and the covariate are related, then removing the effects of the covariate will ultimately lead to removing some of the effects of the predictor variable as well (or possibly even *all* of the effects of the predictor variable, depending on the degree of the relationship)!

On a related note, the predictor variable *should not* influence the covariate, as would necessarily be the case here if four-year colleges and universities do, in fact, drive up net migration.

Which brings us to the second problem with our ANCOVA.

*(2) The predictor variable is more strongly related to the covariate than to the dependent measure.*

The presence of four-year colleges and universities is more closely related to county population (*r* = 0.56, *p* < 0.0001) than to annual net migration rate (*r* = 0.07, *p* < 0.0001).

This occurred as a consequence of my intentional use of net migration *rate* (i.e., net migration per 1,000 county residents) as the dependent measure. Because net migration rate mostly adjusted for differences in county population already, the relationship between the dependent variable and the covariate is minimal.

And so because the covariate is only *weakly* related to the dependent measure, using ANCOVA to “control for” and “remove” the effect of county population removes relatively little variability in net migration rates, and therefore does little to increase statistical power.

Meanwhile, because the covariate is *strongly* related to the predictor variable, removing the effect of county population leads to the unfortunate side-effect of removing almost all of the effect of county type – the primary variable we’re interested in!
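The mechanism described above can be seen in a small simulation. This is a sketch under stated assumptions and not a reanalysis of the county data. It assumes NumPy, and it deliberately builds an extreme case in which group membership shifts the covariate and the outcome depends on the group *only* through the covariate. All names and effect sizes are hypothetical.

```python
import numpy as np


def group_coefficient(y, group, covariate=None):
    """OLS coefficient on the 0/1 group indicator, with or without a covariate."""
    columns = [np.ones(len(y)), group]
    if covariate is not None:
        columns.append(covariate)
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


rng = np.random.default_rng(0)
n = 5000
group = np.repeat([0.0, 1.0], n // 2)

# Assumption: group membership raises the covariate (predictor and covariate related).
covariate = 2.0 * group + rng.normal(size=n)
# Assumption: the outcome depends on the group only through the covariate.
outcome = 0.5 * covariate + rng.normal(size=n)

unadjusted = group_coefficient(outcome, group)
adjusted = group_coefficient(outcome, group, covariate)
print(f"unadjusted group difference: {unadjusted:.2f}")
print(f"adjusted group difference:   {adjusted:.2f}")
```

In this setup the raw group difference is substantial, yet it nearly vanishes once the covariate enters the model, because "removing" the covariate also removes the pathway by which the groups differ. Real data will rarely be this clean, but the direction of the distortion is the same.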

So what is the take-home message from all of this?

Two things:

1. Counties with four-year colleges and universities do, indeed, have higher annual net migration rates than counties without four-year colleges and universities. Perhaps this isn’t altogether surprising. However, because these findings are correlational in nature, we have no way of knowing for sure (at least not on the basis of the analyses reported here) whether four-year institutions of higher education play a direct, causal role in driving up annual net migration rates.

2. More generally, it is inappropriate to use ANCOVA to “control for” pre-existing group differences on an extraneous, third variable (especially when the covariate is more strongly related to the predictor variable than to the dependent measure). As such, we should be skeptical whenever we turn to the news and read about some newly published scientific study that examined the relationship between two variables while statistically “controlling for” or “removing” the effects of a seemingly uninteresting third variable (e.g., a study examining the relationship between marital status [*x*] and ratings of happiness in a romantic relationship [*y*] while statistically “controlling for” the number of years the couple has been together in the relationship [*z*]).

As Miller & Chapman make clear in their 2001 paper, there is often no way to statistically “control for” or “remove” real group differences on a potential covariate – at least not without altering the relationship between the predictor variable and the dependent measure in potentially problematic ways.

References:

[1] Estimates of the Components of Resident Population Change: April 1, 2010 to July 1, 2013

Source: U.S. Census Bureau, Population Division

Release Dates: For the United States, regions, divisions, states, and Puerto Rico Commonwealth, January 2014. For counties, metropolitan statistical areas, micropolitan statistical areas, metropolitan divisions, and combined statistical areas, March 2014.

[2] Annual Estimates of the Resident Population: April 1, 2010 to July 1, 2013

Source: U.S. Census Bureau, Population Division

Release Dates: For the United States, regions, divisions, states, and Puerto Rico Commonwealth, December 2013. For counties, municipios, metropolitan statistical areas, micropolitan statistical areas, metropolitan divisions, and combined statistical areas, March 2014. For Cities and Towns (Incorporated Places and Minor Civil Divisions), May 2014.