Sunday, December 7, 2014

Sour Grapes And Sweet Mixed Models: Avoiding The Goldilocks Problem in Publishing

Longtime readers of this blog, and people who look in the archives (try 2010), know that I sometimes write poems. I think they're good, and my kinder readers, or those who know less about academic poetry -- and I certainly belong in both categories -- ask if I've tried to publish them. And I say I would if I could, but I know I can't. And also, I already have.

Before Gutenberg, people basically passed notes. It was democratic, it was slow, it could be subversive, but it wasn't easy to connect across long distances or outside of one's social networks. The invention of printing allowed a much wider and more rapid dissemination of all kinds of writing. At the same time, it concentrated power in the hands of publishers, who decided what to print and what not to. And made money off it.

Today, publishers still make these decisions, and they still make money off it, but they exist for a different reason. No longer necessary for getting your work "out there", publishing (as opposed to self-publishing) gives the writer prestige and hopefully career advancement, and it helps readers decide what is worth reading (or at least what to cite).

In today's world, where there is a lot out there even on obscure topics, and an overwhelming amount on popular ones, it is certainly necessary for judgements and decisions to be made. No one can read everything, and not everything is worth reading. However, we should be able to make these judgments and decisions for ourselves. In Anglo-American legal terms, we don't need, or want, "prior restraint".

Publishing is outdated and will eventually disappear. Pre-publication peer review will fall along with it and, in my view, be replaced with some version of post-self-publication peer review. But many more thoughtful people than me are out there debating what this new world might look like.

The revolution will be bad news for publishing houses, obviously, and will also pose problems for anyone wanting to evaluate academics the way they are now -- and thus for academics who need to be evaluated, promoted, given tenure, etc. Without getting lost in speculation about how these institutions might work in a post-publishing world, we can anticipate that they will work better than they do now, when they depend heavily on a flawed system of peer review and publishing.

At a minimum, an author asks one thing from the current system: "If my article be rejected, Lord, it is thy will, but please let not a blatantly shittier article be published in the same journal within six months." But the author's perspective is not only biased, but limited. We may think our article is good (or know we really need to publish it), but we are not interested in every other type of thing the journal might legitimately publish.

And while "not interested" may indeed sometimes mean "a waste of time for me to read", it doesn't necessarily mean "a waste of paper for the journal to print", because other readers with other interests are out there. (I'm setting aside the important fact that there is no paper, that any space restrictions have become intentional, not physical.)

The problem with pre-publication peer review is not that there are too many submissions of varying quality coming from too many different perspectives (where the relationship between quality and perspective is complicated), although there are, and this will make editors' jobs difficult as long as journals exist. The problem is that given this quantity and diversity, pre-publication peer review doesn't (and can't) employ nearly enough reviewers.

Potential reviewers will be more or less familiar with the topic of a submission, and (especially if they are more familiar with it) more or less already in agreement with what it says. Less familiar reviewers may overestimate the originality of the submission, while more familiar reviewers may overestimate the knowledge of the intended audience. There may be reviewers whose own work is praised or developed further in the submission, and others whose work is criticized.

Even when these reviewers all provide useful feedback, a publishing decision must be made based on their recommendations. And the more inconsistency there is between reviewers, or any raters, the more opinions you need to achieve a reliable outcome. Considering how much inter-reviewer variation exists in our field, I think two or three reviewers, no matter how carefully chosen, are not always enough. So it doesn't surprise me that certain things "fall through the cracks", while others do the opposite (whatever the metaphor would be). Luckily, in the future, we will all be reviewers, so this problem will be eliminated along with the journals and publishers.

Some years ago, having published an article pointing out one advantage of using mixed-effects models on sociolinguistic data, when speakers vary individually in their use of a variable -- namely the reduction of absurd levels of Type I error for between-speaker predictors (Johnson 2009) -- I was invited to say more on the topic at a panel at NWAV 38 in October 2009.

In this presentation, I reiterated the point about Type I error, and discussed three other advantages of mixed models: better estimation of the effects of between-speaker predictors (when some speakers have more data than others), better estimation of the effects of within-speaker predictors (when those predictors are not balanced across speakers), and better estimation of the effects of within-speaker predictors (in general, in logistic regression).

I wrote this up and submitted it to a journal (8/10), revised and resubmitted it twice (4/11, 2/12), and then submitted and re-submitted it to another journal (7/12, 4/13). While this process has greatly improved the manuscript, it still tries to make the same points as the NWAV presentation. While the reviewers did not challenge these points directly, they raised valid concerns about the manuscript's length (too much of it) and organization (too little of it), and about its appeal to the various potential readers of the respective journals.

For example, some judged it to be inaccessible, while for others, it read too much like a textbook. Another point of disagreement related to the value of using simulated data to illustrate statistical points, which I had done in Johnson 2009, and more recently here, here, here, here, and here.

When it came to the core content, the reviewers' opinions were equally divergent. I was told that fixed-effects models could be as good as (or even better than) mixed-effects models: "too new!" I was told that the mixed models I focus on, which don't contain random slopes, were not good enough: "too old!" And I was told that the mixed models I discussed were just fine -- "just right?" -- but that everyone knows this already: "d'oh!"

The editors gave detailed and thoughtful recommendations, both general and specific, on how to revamp the manuscript, including trying to find a middle ground between these apparently contradictory perspectives. But even if the article were published in a journal, any reader would still find themselves falling into one (or more) of these camps. Or, more likely, having a unique perspective of their own.

My perspective is that this article could be useful to some readers, and that readers should be able to decide for themselves whether it is. Clearly, not everyone already "knows" that we "should" use mixed models, since fixed-effects models are still seen in the wild. They rarely appear in the mixed-esque form devised by Paolillo 2013, but more often in their native (naive) incarnation, where any individual variation -- by speaker, by word, etc. -- is simply and majestically ignored. GoldVarb Bear made several appearances at NWAV 43.

Random intercepts are a stepping-stone, not an alternative, to random slopes (but see Bear, er, Barr et al. 2013). And even if the reduction of Type I error is the most important thing for some readers, the other advantages of random intercepts -- their effect on the regression coefficients themselves -- deserve to be known and discussed. To me, they are not very complicated to understand, but they are not obvious. Readers are welcome to disagree.

Of course, whether individual speakers and words actually have different rates of variation (let alone constraints on variation) is not a settled question in the first place. This article assumes (conservatively?) that they do, and I think most people would agree when it comes to speakers, while many would disagree when it comes to words. But a nice thing about mixed-effects models is that they work either way.

All of this is a way of introducing this document, and placing what is partially sour grapes in some kind of coherent frame. This article is distributed free of charge and without any restriction on its use. If it seems long, boring, and/or repetitive, you can skim it, and certainly don't print it! If something seems wrong, please let me know. And if it's helpful in your work, you can cite it if you like.

This particular article is certainly not a model of anything, just a few ideas, or observations, of a quantitative and methodological kind. But maybe we don't always need a publishing infrastructure to share our ideas -- or at least not if they're not our best ones.


Barr, Dale J., Roger Levy, Christoph Scheepers and Harry J. Tily. 2013. Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language 68: 255–278.

Johnson, Daniel Ezra. 2009. Getting off the GoldVarb standard: Introducing Rbrul for mixed-effects variable rule analysis. Language and Linguistics Compass 3(1): 359-383.

Johnson, Daniel Ezra. 2014. Progress in regression: Why natural language data calls for mixed-effects models. Self-published manuscript.

Paolillo, John C. 2013. Individual effects in variation analysis: Model, software, and research design. Language Variation and Change 25(1): 89-118.

Wednesday, December 3, 2014

Are You Talkin' To ME? In defense of mixed-effects models

At NWAV 43 in Chicago, Joseph Roy and Stephen Levey presented a poster calling for "caution" in the use of mixed-effects models in situations where the data is highly unbalanced, especially if some of the random-effects groups (speakers) have only a small number of observations (tokens).

One of their findings involved a model where a certain factor received a very high factor weight, like .999, which pushed the other factor weights in the group well below .500. Although I have been unable to look at their data, and so can't determine what caused this to happen, it reminded me that sum-contrast coefficients or factor weights can only be interpreted relative to the other ones in the same group.

An outlying coefficient C does not affect the difference between A and B. This is much easier to see if the coefficients are expressed in log-odds units. In log-odds, it seems obvious that the difference between A: -1 and B: +1 is the same as the difference between A': -5 and B': -3. The difference in each case is 2 log-odds.

Expressed as factor weights -- A: .269, B: .731; A': .007, B': .047 -- this equivalence is obscured, to say the least. It is impossible to consistently describe the difference between any two factor weights, if there are three or more in the group. To put it mildly, this is one of the disadvantages of using factor weights for reporting the results of logistic regressions.
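To make the arithmetic concrete, here is a minimal Python sketch that converts the log-odds values above into factor weights with the inverse-logit function. The third factor C and its value of 12 log-odds are hypothetical, chosen only so that centering the three coefficients reproduces A' and B'; the function names are illustrative as well.

```python
import math

def factor_weight(log_odds):
    """Inverse logit: convert a log-odds coefficient to a factor weight."""
    return 1 / (1 + math.exp(-log_odds))

def sum_contrast(coefs):
    """Center coefficients so that they sum to zero (unweighted sum contrasts)."""
    mean = sum(coefs.values()) / len(coefs)
    return {name: value - mean for name, value in coefs.items()}

# Two-factor group from the text: A = -1, B = +1 (already sums to zero).
two_factors = {"A": -1.0, "B": 1.0}

# Hypothetical outlying factor C = 12 (an assumption for illustration only).
# Centering shifts A and B to -5 and -3, but their difference is still 2.
three_factors = sum_contrast({"A": -1.0, "B": 1.0, "C": 12.0})

for label, group in (("without C", two_factors), ("with C", three_factors)):
    for name, value in group.items():
        print(f"{label}: {name} log-odds {value:+.1f}, "
              f"factor weight {factor_weight(value):.3f}")
    print(f"{label}: B - A = {group['B'] - group['A']:.1f} log-odds")
```

The log-odds difference between A and B is 2 in both groups, while the factor weights change from .269 and .731 to .007 and .047.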

Since factor weights (and the Varbrul program that produces them) have several other drawbacks, I am more interested in the (software-independent) question that Roy & Levey raise, about fitting mixed-effects models to unbalanced data. Even though handling unbalanced data is one of the main selling points of mixed models (Pinheiro & Bates 2000), Roy and Levey claim that such analyses "with less than 30-50 tokens per speaker, with at least 30-50 speakers, vastly overestimate variance", citing Moineddin et al. (2007).

However, Moineddin et al. actually only claim to find such an overestimate "when group size is small (e.g. 5)". In any case, the focus on group size points to the possibility that the small number of tokens for some speakers is the real issue, rather than the data imbalance itself.

Fixed-effects models like Varbrul's vastly underestimate speaker variance by not estimating it at all and assuming it to be zero. Therefore, they inflate the significance of between-speaker (social) factors. P-values associated with these factors are too low, increasing the rate of Type I error beyond the nominal 5% (this behavior is called "anti-conservative"). All things being equal, the more tokens there are per speaker, the worse the performance of a fixed-effects model will be (Johnson 2009).

With only 20 tokens per speaker, the advantage of the mixed-effects model can be small, but there is no sign that mixed models ever err in the opposite direction, by overestimating speaker variance -- at least, not in the balanced, simulated data sets of Johnson (2009). If they did, they would show p-values that are higher than they should be, resulting in Type I error rates below 5% (this behavior is called "conservative").

It is difficult to compare the performance of statistical models on real data samples (as Roy and Levey do for three Canadian English variables), because the true population parameters are never known. Simulations are a much better way to assess the consequences of a claim like this.

I simulated data from 20 "speakers" in two groups -- 10 "male", 10 "female" -- with a population gender effect of zero, and speaker effects normally distributed with a standard deviation of either zero (no individual-speaker effects), 0.1 log-odds (95% of speakers with input probabilities between .451 and .549), or 0.2 log-odds (95% of speakers between .403 and .597).
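The probability ranges quoted above can be checked with a short Python sketch. It assumes, as the symmetric ranges imply, that the population mean is 0 log-odds (an input probability of .5); the function name is made up for this illustration.

```python
import math
from statistics import NormalDist

def input_probability_range(speaker_sd, coverage=0.95):
    """Range of speaker input probabilities covering `coverage` of speakers.

    Assumes speaker effects are normal on the log-odds scale, centered on 0
    (input probability .5), as in the simulation described in the text.
    """
    z = NormalDist().inv_cdf(0.5 + coverage / 2)   # about 1.96 for 95%
    lower = 1 / (1 + math.exp(z * speaker_sd))     # inverse logit of -z * sd
    upper = 1 / (1 + math.exp(-z * speaker_sd))    # inverse logit of +z * sd
    return lower, upper

for sd in (0.0, 0.1, 0.2):
    low, high = input_probability_range(sd)
    print(f"speaker SD {sd:.1f} log-odds: {low:.3f} to {high:.3f}")
```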

The average number of tokens per speaker (N_s) ranged from 5 to 100. The number of tokens per speaker was either balanced (all speakers have N_s tokens), imbalanced (N_s * rnorm(20, 1, 0.5)), or very imbalanced (N_s * rnorm(20, 1, 1)). Each speaker had at least one token and no speaker had more than three times the average number of tokens.
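One way to generate token counts like these is sketched below in Python, as a stand-in for the R expressions just quoted. How the original simulation enforced the lower and upper limits is not stated, so the rounding and clamping here are assumptions, and the function name is hypothetical.

```python
import random

def tokens_per_speaker(n_avg, imbalance_sd, n_speakers=20, rng=None):
    """Draw a token count for each speaker.

    imbalance_sd = 0 gives balanced data; 0.5 and 1 correspond to the
    "imbalanced" and "very imbalanced" settings in the text.
    Assumption: counts are rounded, then clamped to [1, 3 * n_avg].
    """
    if rng is None:
        rng = random.Random()
    counts = []
    for _ in range(n_speakers):
        raw = n_avg * rng.gauss(1, imbalance_sd)
        counts.append(min(max(round(raw), 1), 3 * n_avg))
    return counts

rng = random.Random(43)  # arbitrary seed, only for repeatability
for sd in (0, 0.5, 1):
    print(sd, sorted(tokens_per_speaker(50, sd, rng=rng)))
```

Clamping shifts the realized mean slightly away from N_s when the imbalance is large, so this is only an approximation of the design described above.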

For each of these settings, 1000 datasets were generated and two models were fit to each dataset: a fixed-effects model with a predictor for gender (equivalent to the "cautious" Varbrul model that Roy & Levey implicitly recommend), and a mixed-effects (glmer) model with a predictor for gender and a random intercept for speaker. In each case, the drop1 function (a likelihood-ratio test) was used to calculate the Type I error rate -- the proportion of the 1000 models with p < .05 for gender. Because there is no real gender effect, if everything is working properly, this rate should always be 5%.
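For readers who want to see what the likelihood-ratio step amounts to, the following Python sketch computes a p-value from the maximized log-likelihoods of two nested models that differ by one parameter (here, gender), and then a Type I error rate from a list of such p-values. It is not the R code used for the simulations; the function names and the example log-likelihoods are invented for illustration.

```python
import math

def lrt_p_value(loglik_reduced, loglik_full):
    """P-value of a 1-df likelihood-ratio test between nested models."""
    deviance_drop = max(2 * (loglik_full - loglik_reduced), 0.0)
    # Survival function of a chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(deviance_drop / 2))

def type1_error_rate(p_values, alpha=0.05):
    """Share of null simulations in which the predictor came out significant."""
    return sum(p < alpha for p in p_values) / len(p_values)

# Made-up log-likelihoods: the full model improves the fit by 1.9205,
# a deviance drop of 3.841, which sits right at the .05 threshold.
print(lrt_p_value(-100.0, -98.0795))
```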

For each panel, the figure above plots the proportion of significant p-values (p < .05) obtained from the simulation, in blue for the fixed-effects model and in magenta for the mixed model. A loess smoothing line has been added to each panel. Again, since the true population gender difference is always zero, any result deemed significant is a Type I error. The figure shows that:

1) If there is no individual-speaker variation (left column), the fixed-effects model appears to behave properly, with 5% Type I error, and the mixed model is slightly conservative, with 4% Type I error. There is no effect of the average number of tokens per speaker (within each panel), nor is there any effect of data imbalance (between the rows of the figure).

2) If there is individual-speaker variation (center and right columns), the fixed-effects model error rate is always above 5%, and it increases roughly linearly in proportion to the number of tokens per speaker. The greater the individual-speaker variation, the faster the increase in the Type I error rate for the fixed-effects model, and therefore the larger the disadvantage compared with the mixed model.

The mixed model proportions are much closer to 5%. We do see a small increase in Type I error as the number of tokens per speaker increases; the mixed model goes from being slightly conservative (p-values too high, Type I error below 5%) to slightly anti-conservative (p-values too low, Type I error above 5%).

Finally, there is a small increase in Type I error associated with greater data imbalance across groups. However, this effect can be seen for both types of models. There is no evidence that mixed models are more susceptible to error from this source, either with a low or a high number of average tokens per speaker.
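The fixed-effects half of this pattern can be reproduced without any modeling library. The Python sketch below simulates balanced data with no gender effect and tests gender with a pooled two-proportion z-test, which ignores speakers entirely. This test is a stand-in: it is not the logistic regression and drop1 procedure described above (though for a single binary predictor the two should behave very similarly in large samples), and no mixed model is fit. All function names and the seed are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist

def pooled_test_p_value(n_tokens, speaker_sd, n_per_group=10, rng=None):
    """Simulate one null dataset; return the p-value of a pooled z-test."""
    if rng is None:
        rng = random.Random()
    hits = []
    for _group in range(2):
        group_hits = 0
        for _speaker in range(n_per_group):
            # No gender effect: each speaker's log-odds is drawn around 0.
            p = 1 / (1 + math.exp(-rng.gauss(0, speaker_sd)))
            group_hits += sum(rng.random() < p for _ in range(n_tokens))
        hits.append(group_hits)
    n = n_per_group * n_tokens                  # tokens per gender group
    pooled = (hits[0] + hits[1]) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0
    z = (hits[0] - hits[1]) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def type1_rate(n_tokens, speaker_sd, n_sims=400, seed=1):
    """Proportion of null datasets with p < .05 when speakers are ignored."""
    rng = random.Random(seed)
    significant = sum(
        pooled_test_p_value(n_tokens, speaker_sd, rng=rng) < 0.05
        for _ in range(n_sims)
    )
    return significant / n_sims
```

Calling type1_rate(100, 0.0) and type1_rate(100, 0.2) compares the case with no speaker variation against the case with a speaker standard deviation of 0.2 log-odds; the second rate should come out well above the nominal 5%, and it should grow as n_tokens increases.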

In summary, the simulation does not show any sign of markedly overconservative behavior from the mixed models, even when the number of tokens per speaker is low, and the degree of imbalance is high. This is likely to be because the mixed model is not "vastly overestimating" speaker variance in any general way, despite Roy & Levey's warnings to the contrary.

We can look at what is going on with these estimates of speaker variance, starting with a "middle-of-the-road" case where the average number of tokens per speaker is 50, the true individual-speaker standard deviation is 0.1, and there is no imbalance across groups.

For this balanced case, the fixed-effects model gives an overall Type I error rate of 6.4%, while the mixed model gives 4.4%. The mean estimate of the individual-speaker standard deviation, in the mixed model, is 0.063. Note that this average is an underestimate, not an overestimate, of the standard deviation in the population, which is 0.1.

Indeed, in 214 of the 1000 runs, the mixed model underestimated the speaker variance as much as it possibly could: it came out as zero. For these runs, the proportion of Type I error was higher: 6.1%, and similar to the fixed-effects model, as we would expect.

In 475 of the runs, the estimated speaker standard deviation was positive but still below 0.1, and the Type I error rate was 5.3%. And in 311 runs, it was indeed overestimated, that is, it was higher than 0.1. The Type I error rate for these runs was only 1.9%.
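As a quick consistency check, the three subsets of runs should average out to the overall mixed-model rate of 4.4%. A small Python sketch, using only the counts and rates reported above (the dictionary labels are mine):

```python
# (number of runs, Type I error rate in percent) for each subset of runs
subsets = {
    "estimate of zero": (214, 6.1),
    "positive but below 0.1": (475, 5.3),
    "above 0.1": (311, 1.9),
}

total_runs = sum(n for n, _ in subsets.values())
overall_rate = sum(n * rate for n, rate in subsets.values()) / total_runs

print(total_runs)              # 1000
print(round(overall_rate, 1))  # 4.4
```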

Mixed models can overestimate speaker variance -- incidentally, this is because of the sample data they are given, not because of some glitch -- and when this happens, the p-value for a between-speaker effect will be too high (conservative), compared to what we would calculate if the true variance in the population were known. However, at least as often, the opposite thing happens: the speaker variance is underestimated, resulting in p-values that are too low (anti-conservative). On average, though, the mixed-effects model does not behave in an overly conservative way.

If we make the same data quite unbalanced across groups (keeping the average of 50 tokens per speaker and the speaker standard deviation of 0.1), the Type I error rates rise to 8.3% for the fixed-effects model and 5.6% for the mixed model. So data imbalance does inflate Type I error, but mixed models still maintain a consistent advantage. And it is still as common for the mixed model to estimate zero speaker variance (35% of runs) as it is to overestimate the true variance (28% of runs).

I speculated above that small groups -- speakers with few tokens -- might pose more of a problem than unbalanced data itself. Keeping the population speaker standard deviation of 0.1, and the high level of data imbalance, but considering the case with only 10 tokens per speaker on average, we see that the Type I error rates are 4.5% for fixed, 3.0% for mixed.

The figure of 4.5% would probably average out close to 5%; it's within the range of error exhibited by the points on the figure above (top row, middle column). Recall that our simulations go as low as 5 tokens per speaker, and if there were only 1 token per speaker, no one would assail the accuracy of a fixed-effects model because it ignored individual-speaker variation (or, put another way, within-speaker correlation). But sociolinguistic studies with only a handful of observations per speaker or text are not that common, outside of New York department stores, rare discourse variables, and historical syntax.

For the mixed model, the Type I error rate is the lowest we have seen, even though only 28% of runs overestimated the speaker variance. Many of these overestimated it considerably, however, contributing to the overall conservative behavior.

Perhaps this is all that Roy & Levey intended by their admonition to use caution with mixed models. But a better target of caution might be any data set like this one: a binary linguistic variable, collected from 10 "men" and 10 "women", where two people contributed one token each, another contributed 2, another 4, etc., while others contributed 29 or 30 tokens. As much as we love "naturalistic" data, it is not hard to see that such a data set is far from ideal for answering the question of whether men or women use a linguistic variable more often. If we have to start with very unbalanced data sets, including groups with too few observations to reasonably generalize from, it is too much to expect that any one statistical procedure can always save us.

The simulations used here are idealized -- for one thing, they assume normal distributions of speaker effects -- but they are replicable, and can be tweaked and improved in any number of ways. Simulations are not meant to replicate all the complexities of "real data", but rather to allow the manipulation of known properties of the data. When comparing the performance of two models, it really helps to know the actual properties of what is being modeled. Attempting to use real data to compare the performance of models at best confuses sample and population, and at worst casts unwarranted doubt on reliable tools.


Johnson, D. E. 2009. Getting off the GoldVarb standard: introducing Rbrul for mixed-effects variable rule analysis. Language and Linguistics Compass 3/1: 359-383.

Moineddin, R., F. I. Matheson and R. H. Glazier. 2007. A simulation study of sample size for multilevel logistic regression models. BMC Medical Research Methodology 7(1): 34.

Pinheiro, J. C. and D. M. Bates. 2000. Mixed-effects models in S and S-PLUS. New York: Springer.

Roy, J. and S. Levey. 2014. Mixed-effects models and unbalanced sociolinguistic data: the need for caution. Poster presented at NWAV 43, Chicago.