Analysis of variance (ANOVA) is a widely used statistical test in the behavioral and social sciences.

In a nutshell, ANOVA is used to evaluate differences between (at least) three group means to determine whether there is a “statistically significant” difference somewhere among them (i.e., a difference that is unlikely due to chance factors).

ANOVA is commonly used in conjunction with an experimental research design, in which a researcher randomly assigns participants to one of several groups and tests to see whether an experimental treatment variable leads to group differences on a given dependent measure.

For example, suppose a clinical psychologist wants to investigate whether Cognitive Behavioral Therapy (CBT) is more effective than Freudian Psychoanalysis at quickly alleviating symptoms of mild-to-moderate depression.

To evaluate the effectiveness of CBT, the researcher could conduct an experiment in which moderately depressed participants are randomly assigned to one of three groups – (1) a Treatment Group that receives CBT for three weeks; (2) a Control Group that receives Psychoanalysis for three weeks; or (3) a Control Group that does not receive any form of psychotherapy.

After three weeks, the researcher could then administer a standard depression survey and use ANOVA to compare average depression scores across the three groups.

In this example, the treatment variable is the type of psychotherapy that patients receive (CBT vs. Psychoanalysis vs. None) and the dependent measure is an individual’s level of depression, as measured by whatever depression survey the researcher uses.

**The Mechanics of ANOVA**

ANOVA is a form of null hypothesis testing because it estimates the probability of the observed data assuming the null hypothesis is true.

The *null hypothesis* states that the treatment variable has no effect on the dependent measure and that, therefore, the group means are equivalent.

If, however, the probability of the observed data under the null hypothesis is sufficiently low (p < .05 is the standard convention), we can reject the null hypothesis and conclude that the group means are *probably* not equal.

ANOVA estimates the probability of the data under the null hypothesis by computing a ratio of two variances – specifically, the ratio of the *between groups variance* to the *within groups variance*.

The *between groups variance* (also called Mean Square Treatment or Mean Square Between) reflects the extent to which the group means vary from each other and from the overall grand mean. In a properly designed experiment, group means are expected to vary from one another for two reasons: (1) the effect (if any) of the experimental treatment variable; and (2) random factors, such as measurement error and individual differences between subjects.

The *within groups variance* (also called Mean Square Error or Mean Square Within) reflects the extent to which individual scores vary from each other and from their respective group mean. Individual scores within each group are expected to vary from one another solely because of random factors, such as measurement error and individual differences between subjects.

Mean Square Treatment (MS_{T}) and Mean Square Error (MS_{e}) are calculated according to the formulas shown below.

*Formula 1: Mean Square Treatment (also called Mean Square Between):*

$$MS_{T} = \frac{n \sum_{j=1}^{a} (\bar{Y}_{.j} - \bar{Y}_{..})^2}{a - 1}$$

where *n* is the number of observations in each group, *a* is the number of group means, $\bar{Y}_{.j}$ is the mean for group *j*, and $\bar{Y}_{..}$ is the grand mean.

*Formula 2: Mean Square Error (also called Mean Square Within):*

$$MS_{e} = \frac{\sum_{j=1}^{a} \sum_{i=1}^{n} (Y_{ij} - \bar{Y}_{.j})^2}{a(n - 1)}$$

where *n* is the number of observations in each group, *a* is the number of group means, $Y_{ij}$ is the *i*th score in group *j*, and $\bar{Y}_{.j}$ is the mean for group *j*.

Under the null hypothesis (i.e., assuming the group means are equivalent), the ratio of *Mean Square Treatment* to *Mean Square Error* is distributed as the test statistic F.
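The two mean squares and their ratio can be computed by hand and checked against SciPy's built-in one-way ANOVA. The depression scores below are made-up numbers purely for illustration:

```python
import numpy as np
from scipy import stats

# Hypothetical depression scores for three groups of n = 5 each
# (fabricated numbers, used only to illustrate the arithmetic).
groups = [
    np.array([12.0, 15.0, 11.0, 14.0, 13.0]),  # CBT
    np.array([16.0, 18.0, 15.0, 17.0, 19.0]),  # Psychoanalysis
    np.array([20.0, 22.0, 19.0, 21.0, 23.0]),  # No treatment
]

a = len(groups)        # number of groups
n = len(groups[0])     # observations per group (balanced design)
group_means = np.array([g.mean() for g in groups])
grand_mean = np.concatenate(groups).mean()

# Between-groups variance: n times the squared deviations of the
# group means from the grand mean, over a - 1 degrees of freedom.
ms_treat = n * np.sum((group_means - grand_mean) ** 2) / (a - 1)

# Within-groups variance: pooled squared deviations of scores from
# their own group mean, over a * (n - 1) degrees of freedom.
ms_error = sum(np.sum((g - g.mean()) ** 2) for g in groups) / (a * (n - 1))

F = ms_treat / ms_error
p = stats.f.sf(F, a - 1, a * (n - 1))

# The hand-computed values should match scipy's one-way ANOVA.
F_scipy, p_scipy = stats.f_oneway(*groups)
print(F, p)
```

With these (deliberately well-separated) group means, the hand-computed F agrees with `scipy.stats.f_oneway` and is far greater than 1.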

Examples of several F distributions are shown below in Figure 1.

*Figure 1: Examples of F distributions with varying numerator and denominator degrees of freedom.*

If the experimental treatment variable has little to no effect on the dependent measure, then the group means will vary from one another for the same reason that individual scores vary from one another – namely because of purely random factors, such as measurement error and individual differences between subjects. As such, MS_{T} will be roughly equal to MS_{e} and the value for F will be close to 1.

On the other hand, if the experimental treatment variable affects the dependent measure, then group means will vary from one another because of random factors *and* because of the experimental treatment variable. This extra source of variability will cause the group means to be more variable than individual scores within groups. Therefore, when the treatment variable affects the dependent measure, MS_{T} will be greater than MS_{e} and the value for F will be greater than 1.

To determine whether the between groups variance is large enough to warrant “statistically significant” results, the computed F ratio must be compared to a critical value for F, which serves as a cutoff for deciding whether or not to reject the null hypothesis.

If the computed F ratio is larger than the critical F value, then we reject the null hypothesis that the group means are equal and conclude that there is a difference somewhere among them. On the other hand, if the computed F ratio is less than the critical F value, then we fail to reject the null hypothesis and conclude that the group means likely do not differ.

Note that the critical F value is based on three things: (1) degrees of freedom for the numerator of the F ratio (df numerator = *a*-1, where *a* is the number of groups); (2) degrees of freedom for the denominator of the F ratio (df denominator = *a*(*n*-1), where *a* is the number of groups and *n* is the number of observations in each group); and (3) the alpha level chosen by the experimenter.
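For example, the critical value for a design with *a* = 3 groups of *n* = 50 observations each can be looked up with the quantile function of SciPy's F distribution:

```python
from scipy import stats

a, n = 3, 50           # three groups of 50 observations each
alpha = 0.05

df_num = a - 1         # numerator degrees of freedom (a - 1 = 2)
df_den = a * (n - 1)   # denominator degrees of freedom (a(n - 1) = 147)

# Critical F: the value the computed F ratio must exceed to reject
# the null hypothesis at the chosen alpha level.
f_crit = stats.f.ppf(1 - alpha, df_num, df_den)
print(f_crit)
```

For df = (2, 147) and alpha = .05 this comes out to roughly 3.06.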

Alpha reflects the probability of observing a computed F ratio that is larger than the critical F value under the null hypothesis. Therefore, it is the probability of mistakenly rejecting the null hypothesis and concluding that the treatment variable affects the dependent measure when, in fact, it does not. This type of error, called a Type 1 error, will be discussed further below.

**Investigating the Assumptions Behind ANOVA**

Perhaps because of ANOVA's enormous popularity in the behavioral and social sciences, researchers occasionally use the test without considering the major assumptions behind it – and therefore without considering whether the data being analyzed violate those assumptions.

This is problematic because, as I’ll show below, violating the assumptions of ANOVA can have serious consequences, such as an inflated Type 1 error rate. As described above, the Type 1 error rate refers to the probability of mistakenly rejecting the null hypothesis. Therefore, when researchers use ANOVA inappropriately, they run the serious risk of concluding that their experimental treatment variable affects their dependent measure when, in fact, it might not.

*The Assumption of Homogeneity of Variances*

One of the most important assumptions of ANOVA is the assumption of *homogeneity of variances*. This refers to the assumption that group variances are roughly equal. When group variances are unequal, they are said to be *heterogeneous*.

To assess the importance of homogeneity of variances, I ran a series of Monte Carlo simulations, each involving a one-way ANOVA on three independent group means.

In the first set of simulations, group means were based on an equal number of observations (Balanced Data). In the second set of simulations, group means were based on an unequal number of observations (Unbalanced Data).

*Heterogeneity of Variances with Balanced Data*

In this first set of simulations, each group mean was based on 50 observations randomly drawn from a normal distribution with a mean of 0 and varying standard deviations.

The standard deviation for groups 1 and 2 was held constant at 1, whereas the standard deviation for group 3 varied from 1 to 5 in increments of 1.

For each increment in the standard deviation for group 3, I ran 1,000 simulations and counted the number of tests that yielded statistically significant results at the conventional .05 alpha level. Given that the population means were equal, this provides an estimate of the probability of making a Type 1 error as a function of increasing variability in group 3.
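A minimal sketch of this kind of simulation, using `scipy.stats.f_oneway` on null data generated with NumPy (the exact proportions will differ from run to run and from the results reported here):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def type1_rate(sds, ns, n_sims=1000, alpha=0.05):
    """Proportion of one-way ANOVAs on null data (all population means 0)
    that reject at the given alpha, for the given group SDs and sizes."""
    rejections = 0
    for _ in range(n_sims):
        groups = [rng.normal(0.0, sd, n) for sd, n in zip(sds, ns)]
        _, p = stats.f_oneway(*groups)
        if p < alpha:
            rejections += 1
    return rejections / n_sims

# Balanced design: SDs of 1 and 1 for groups 1 and 2,
# with the SD for group 3 stepping from 1 to 5.
for sd3 in range(1, 6):
    rate = type1_rate([1, 1, sd3], [50, 50, 50])
    print(sd3, rate)
```

When all three SDs equal 1, the rejection rate should hover near the nominal .05 level; as the SD for group 3 grows, the rate drifts upward.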

Figure 2 shows the proportion of simulations that yielded statistically significant results as a function of the ratio of the largest to smallest group variance.

*Figure 2: Probability of a Type 1 Error as a Function of the Ratio of the Largest to Smallest Group Variance (n_{1} = 50, n_{2} = 50, n_{3} = 50).*

*Heterogeneity of Variances with Unbalanced Data*

In the second set of simulations, the means for group 1 and group 2 were again based on 50 observations randomly drawn from a normal distribution with a mean of 0 and a standard deviation of 1.

However, the mean for group 3 was based on only 10 observations randomly drawn from a normal distribution with a mean of 0. The standard deviation for group 3 again varied from 1 to 5 in increments of 1.
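The balanced and unbalanced setups can be compared directly at the same 4:1 variance ratio (sd_{3} = 2), as a sketch; seeded simulation counts will vary somewhat from the exact proportions reported later in this section:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def type1_rate(sds, ns, n_sims=2000, alpha=0.05):
    """Proportion of one-way ANOVAs on null data (all population means 0)
    that reject at the given alpha level."""
    rejections = 0
    for _ in range(n_sims):
        groups = [rng.normal(0.0, sd, n) for sd, n in zip(sds, ns)]
        if stats.f_oneway(*groups).pvalue < alpha:
            rejections += 1
    return rejections / n_sims

# Same 4:1 ratio of largest to smallest variance (sd3 = 2), but the
# unbalanced design pairs the large variance with the small group.
balanced = type1_rate([1, 1, 2], [50, 50, 50])
unbalanced = type1_rate([1, 1, 2], [50, 50, 10])
print(balanced, unbalanced)
```

The unbalanced rejection rate comes out well above the balanced one, even though the variance ratio is identical in both designs.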

Figure 3 shows the proportion of simulations that yielded statistically significant results as a function of the ratio of the largest to smallest group variance when the mean for group 3 was based on only 10 observations.

*Figure 3: Probability of a Type 1 Error as a Function of the Ratio of the Largest to Smallest Group Variance (n_{1} = 50, n_{2} = 50, n_{3} = 10).*

Finally, Figure 4 shows the proportion of simulations that yielded statistically significant results as a function of varying the number of observations for group 3, with the ratio of the largest to smallest group variance held constant at around 4.

*Figure 4: Probability of a Type 1 Error as a Function of Group 3 Sample Size (n_{1} = 50, n_{2} = 50, sd_{1} = 1, sd_{2} = 1, sd_{3} = 2).*

As you can see by looking at Figures 2 and 3, heterogeneity of variances generally leads to an inflated Type 1 error rate – and so increases the chances of mistakenly concluding that group means differ.

Furthermore, as you can see by comparing Figures 2 and 3, heterogeneity of variances is particularly problematic with *unbalanced data* – when the number of observations differs between groups.

In fact, computer simulations by other researchers have shown that heterogeneity of variances with unbalanced data can either increase *or* *decrease* the Type 1 error rate depending on which group(s) possess greater variability.

When large variances are paired with small group sizes, heterogeneity of variance will increase the Type 1 error rate, as demonstrated here.

However, when large variances are paired with large group sizes, heterogeneity of variance will *decrease* the probability of a Type 1 error and increase the probability of a *Type 2* error – i.e., mistakenly *failing to reject* the null hypothesis when group means actually differ.

In general, prior research on this matter has shown that heterogeneity of variance does not pose a serious problem to ANOVA provided there are an equal number of observations in each group and provided the ratio of the largest to smallest variance is less than 4:1.

In the simulations presented here, a 4:1 ratio of variances with balanced data increased alpha by only .01 (from the nominal .05 level to .06). Meanwhile a 4:1 ratio of variances with unbalanced data increased alpha by .14 (from the nominal .05 level to .19).

So the take-home message is this:

With balanced data, ANOVA is generally robust to violations of the homogeneity of variance assumption (again, provided the ratio of the largest to smallest group variance is less than 4:1).

However, this is not true with unbalanced data, as even relatively small differences in group variances can be problematic. As such, when looking to analyze unbalanced data with unequal variances, it is best to seek an alternative to ANOVA, such as the Welch (F_{w}) Test.
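SciPy does not ship a Welch ANOVA, so the sketch below implements the standard Welch (F_{w}) formulas directly; treat it as an illustration rather than a vetted implementation (the `pingouin` package offers a ready-made `welch_anova`, for example):

```python
import numpy as np
from scipy import stats

def welch_anova(*groups):
    """Welch's F test for comparing group means without assuming
    equal variances. Returns (F_w, p). A sketch of the standard
    formulas, not a drop-in replacement for a vetted implementation."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    a = len(groups)
    ns = np.array([len(g) for g in groups])
    means = np.array([g.mean() for g in groups])
    variances = np.array([g.var(ddof=1) for g in groups])

    w = ns / variances                   # precision weights
    w_sum = w.sum()
    grand = (w * means).sum() / w_sum    # variance-weighted grand mean

    # Weighted between-groups term and the Welch correction factor.
    num = (w * (means - grand) ** 2).sum() / (a - 1)
    lam = 3.0 * ((1 - w / w_sum) ** 2 / (ns - 1)).sum() / (a ** 2 - 1)
    F_w = num / (1 + 2.0 * lam * (a - 2) / 3.0)

    df1 = a - 1
    df2 = 1.0 / lam                      # approximate denominator df
    return F_w, stats.f.sf(F_w, df1, df2)

# Unbalanced, heteroscedastic example: small group, large spread.
rng = np.random.default_rng(7)
g1 = rng.normal(0, 1, 50)
g2 = rng.normal(0, 1, 50)
g3 = rng.normal(0, 5, 10)
print(welch_anova(g1, g2, g3))
```

Unlike the classic F test, the weights n_{j}/s_{j}² let each group contribute in proportion to the precision of its mean, which is what keeps the Type 1 error rate near the nominal level when variances and sample sizes are unequal.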