Thursday, July 25, 2013

Random Slopes: Now That Rbrul Has Them, You May Want Them Too


I've made the first major update to Rbrul in a long time, adding support for random slopes. Models with random intercepts usually perform better than those without them, and a recent paper (Barr et al. 2013) has convincingly argued that for each fixed effect in a mixed model, one or more corresponding random slopes should be considered as well.

So what are random slopes and what benefits do they provide? If we start with the simple regression equation y = ax + b, the intercept is b and the slope is a. A random intercept allows b to vary; if the data is drawn from different speakers, each speaker essentially has their own value for b. A random slope allows each speaker to have their own value for a as well.
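To make this concrete, here is a minimal sketch in lme4 syntax (the package Rbrul builds on); the data frame td and its columns deletion, fol.seg, and speaker are hypothetical names:

    library(lme4)

    # Random intercept only: each speaker gets their own baseline (b),
    # but all speakers share a single following-segment effect (a).
    m_int <- glmer(deletion ~ fol.seg + (1 | speaker),
                   data = td, family = binomial)

    # Random intercept plus random slope: each speaker also gets
    # their own following-segment effect.
    m_slope <- glmer(deletion ~ fol.seg + (1 + fol.seg | speaker),
                     data = td, family = binomial)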

The sociolinguistic literature usually concedes that speakers can vary in their intercepts (average values or rates of application). But at least since Guy (1980), it has been suggested or assumed that the speakers in a community do not vary in their slopes (constraints). As we saw last week, though, in some data sets the effect of following consonant vs. vowel on t/d-deletion varies by speaker more than might be expected by chance.

In the Buckeye Corpus, the estimated standard deviation, across speakers, of this consonant-vs.-vowel slope was 0.70 log-odds; in the Philadelphia Neighborhood Corpus, it was 0.67. A simulation reproducing the number of speakers, number of tokens, balance of following segments, overall following-segment effect, and speaker intercepts produced a median standard deviation of only 0.10 for Ohio and 0.16 for Philadelphia. Speaker slopes as dispersed as the ones actually observed would occur very rarely by chance (Ohio, p < .001; Philadelphia, p = .003).[1]
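For readers who want to check this on their own data, the estimated by-speaker slope standard deviation can be read off a fitted lme4 model. A sketch, reusing the hypothetical m_slope model above (the term name "fol.segvowel" depends on how the factor is coded):

    # Extract the by-speaker slope SD (in log-odds) from the fit.
    vc <- as.data.frame(VarCorr(m_slope))
    subset(vc, grp == "speaker" & var1 == "fol.segvowel" & is.na(var2))$sdcor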

If rates and constraints can vary by speaker, it is important not to ignore speaker when analyzing the data. In assessing between-speaker effects – gender, class, age, etc. – ignoring speaker is equivalent to assuming that every token comes from a different speaker. This greatly overestimates the significance of these between-speaker effects (Johnson 2009). The same applies to intercepts (different rates between groups) and slopes (different constraints between groups). The figure below illustrates this.

By keeping track of speaker variation, random intercepts and slopes help provide accurate p-values (left). Without them, data gets "lumped" and p-values can be meaninglessly low (right).

There are other benefits to using random slopes when constraints might differ by speaker (or by word, or another grouping factor), especially if your data is unbalanced; these will not be discussed here. Mixed-effects models with random slopes not only control for by-speaker constraint variation, they also provide an estimate of its size. Mixed models with only random intercepts, like fixed-effects models, rather blindly assume the slope variation to be zero, and are only accurate if it really is. No doubt this "Shared Constraints Hypothesis" (Guy 2004) is roughly, qualitatively correct: for example, all 83 speakers from Ohio and Philadelphia showed more deletion before consonants than before vowels (except one speaker with only two tokens!). But the hypothesis has been taken for granted far more often than it has been supported with quantitative evidence.

Rbrul has always fit models with random intercepts, allowing users to stop assuming that individual speakers have equal rates of application of a binary variable (or the same average values of a continuous variable). Now Rbrul allows random slopes, so the Shared Constraints Hypothesis can be treated like the hypothesis it is, rather than an inflexible axiom built into our software. The new feature may not be working perfectly, so please send feedback to danielezrajohnson@gmail.com (or comment here) if you encounter any problems or have questions. Also feel free to be in touch if you have requests for other features to be added in the future!

[1] These models did not control for other within-subjects effects that could have increased the apparent diversity in the following-segment effect.


P.S. A major drawback to using random slopes is that models containing them can take a long time to fit, and sometimes they don't fit at all, causing "false convergences" and "singular convergences" that Rbrul reports with an "Error Message". There is not always a solution to this – see here and here for suggestions from Jaeger – but it is always a good idea to center any continuous variables, or at least keep the zero-point close to the center. For example, if you have a date-of-birth predictor, make 0 the year 1900 or 1950, not the year 0. Add random slopes one at a time so processing times don't get out of hand too quickly. Sonderegger has suggested dropping the correlation terms that lmer() estimates (by default) among the random effects. While this speeds up model fitting considerably, it seems to make the questionable assumption that the random effects are uncorrelated, so it has not been implemented.
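A sketch of these suggestions in lme4 syntax, again with hypothetical names (td, dob, fol.seg, speaker):

    # Center continuous predictors, e.g. date of birth around 1950.
    td$dob.c <- td$dob - 1950        # or: scale(td$dob, scale = FALSE)

    # Add random slopes one at a time, refitting as you go.
    m1 <- glmer(deletion ~ dob.c + fol.seg + (1 + fol.seg | speaker),
                data = td, family = binomial)

    # Sonderegger's suggestion, in lme4's double-bar syntax: '||' fits
    # uncorrelated random effects. It behaves as intended only for
    # numeric predictors, so a two-level factor is recoded first.
    td$fol.num <- ifelse(td$fol.seg == "vowel", 0.5, -0.5)
    m2 <- glmer(deletion ~ dob.c + fol.num + (1 + fol.num || speaker),
                data = td, family = binomial)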

P.P.S. Like lmer(), Rbrul will not stop you from adding a nonsensical random slope that does not vary within levels of the grouping factor. For example, a by-speaker slope for gender makes no sense because a given speaker is – at least traditionally – always the same gender. If speaker is the grouping factor, use random slopes that can vary within a speaker's data: style, topic, and most internal linguistic variables. If you are using word as a grouping factor, it is possible that different words show different gender effects; using a by-word slope for gender could be revealing.
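In formula terms (hypothetical column names again):

    #   ok:   (1 + style  | speaker)   # style varies within a speaker
    #   bad:  (1 + gender | speaker)   # gender is constant per speaker
    #   ok:   (1 + gender | word)      # gender effect may vary by word
    m_word <- glmer(deletion ~ gender + (1 + gender | word),
                    data = td, family = binomial)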

P.P.P.S. I also added the AIC (Akaike Information Criterion) to the model output. The AIC is the deviance plus two times the number of parameters. Comparing the AIC values of two models is an alternative to performing a likelihood-ratio test; the model with the lower AIC is supposed to be better.
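A sketch of both kinds of comparison, reusing the hypothetical models from above:

    # Lower AIC is preferred.
    AIC(m_int, m_slope)

    # Likelihood-ratio test of the same pair of nested models.
    anova(m_int, m_slope)

    # AIC by hand for an ML fit: deviance + 2 * number of parameters.
    deviance(m_int) + 2 * attr(logLik(m_int), "df")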


References:

Barr, Dale J., Roger Levy, Christoph Scheepers, and Harry J. Tily. 2013. Random effects structure for confirmatory hypothesis testing: keep it maximal. Journal of Memory and Language 68: 255-278.

Guy, Gregory R. 1980. Variation in the group and the individual: the case of final stop deletion. In W. Labov (ed.), Locating language in time and space. New York: Academic Press. 1-36.

Guy, Gregory R. 2004. Dialect unity, dialect contrast: the role of variable constraints. Talk presented at the Meertens Institute, August 2004.

Johnson, Daniel E. 2009. Getting off the GoldVarb standard: introducing Rbrul for mixed-effects variable rule analysis. Language and Linguistics Compass 3(1): 359-383.

1 comment:

  1. I have claimed that using a random effect does not assume that the corresponding grouping units (e.g. individual speakers) are variable, and also maintained that omitting the random effect does assume that they are not variable.

    If a group of speakers share an underlying rate of variation, of course they will differ in practice due to chance alone. In this case, including a random intercept can have two outcomes.

    First, lmer() may estimate the speaker standard deviation as 0, which means that the model is truly identical to one without the random effect. Second, the estimate may be a small positive number, in which case inference about between-speaker predictors will be slightly impaired (some Type II error).

    I did a simulation with 40 speakers, each of whom produced 300 tokens of a binary variable, all with an underlying probability of 0.5. (This is based on the size of the Buckeye t/d-deletion data set, but assumes no variation due to speaker or anything else.)

    I then ran a model consisting of a fixed-effect intercept and a by-speaker random intercept. I recorded the estimated speaker standard deviation and repeated the process 1000 times. The results were as follows:

    38.5% of the time: 0.
    22.7% of the time: between 0 and 0.01.
    21.8% of the time: between 0.01 and 0.05.
    17.0% of the time: between 0.05 and 0.12.
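    A minimal sketch of the null simulation just described, assuming lme4's glmer() (slow: it fits 1000 models):

        library(lme4)
        set.seed(1)

        sim_sd <- function() {
          d <- data.frame(speaker = factor(rep(1:40, each = 300)),
                          y = rbinom(40 * 300, 1, 0.5))   # true p = 0.5
          m <- glmer(y ~ 1 + (1 | speaker), data = d, family = binomial)
          as.data.frame(VarCorr(m))$sdcor[1]   # estimated speaker SD
        }

        sds <- replicate(1000, sim_sd())
        mean(sds == 0)   # share of runs where the SD is estimated at zero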

    So while including a random speaker intercept does not assume by-speaker variation, the fitted lmer() model does sometimes include a spurious speaker-variation term, though its size is always small. I doubt that this would lead to appreciable Type II error in practice.

    In all real data sets that I have analyzed, speaker variation is estimated to be much larger than the above values, usually at least 0.5 log-odds and often several times larger.

    To test the same question for random slopes, we can use the same data set and add a dummy binary predictor, with each value repeated 150 times per speaker. The real effect of this predictor was set at 2 log-odds (close to the average following-consonant/vowel difference in the Buckeye Corpus).

    Again, the speakers will diverge from this value somewhat, due simply to chance. The model estimates of the by-speaker slope standard deviation come out as follows:

    13.6% of the time: 0.
    30.1% of the time: between 0 and 0.01.
    21.1% of the time: between 0.01 and 0.05.
    24.0% of the time: between 0.05 and 0.1.
    11.2% of the time: between 0.1 and 0.23.
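    The slope version of the same sketch, with a 2 log-odds effect shared by every speaker:

        sim_slope_sd <- function() {
          speaker <- factor(rep(1:40, each = 300))
          x <- rep(rep(c(-0.5, 0.5), each = 150), 40)  # 150 tokens per level
          y <- rbinom(40 * 300, 1, plogis(2 * x))      # shared 2 log-odds effect
          m <- glmer(y ~ x + (1 + x | speaker),
                     data = data.frame(speaker, x, y), family = binomial)
          vc <- as.data.frame(VarCorr(m))
          vc$sdcor[vc$grp == "speaker" & vc$var1 == "x" & is.na(vc$var2)]
        }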

    If we obtain random slope standard deviations in this range, we should not reject the null hypothesis that speakers share constraints, and we may not want to retain the random slopes in our models. [1] If the estimates are much larger (as they were for both Ohio and Philadelphia, above), then we should definitely include random slopes.

    Only with more data from each individual can we find out if speakers in the same community really do differ in constraints as well as rates. Similarly with individual words, where the hypothesis of invariance may be even stronger. We may find that there is little reason to use invariance as a null hypothesis.

    [1] It is possible to formally test random effects, including slopes, for significance. Rbrul does not support this feature, because of a problem of circular reasoning. Sometimes a fixed effect is only significant in the absence of a random effect, but that random effect is only significant in the absence of the fixed effect. Rbrul considers it to be a reasonable and conservative procedure to automatically include random effects that reflect the structure of the data.
