Thursday, July 25, 2013

Random Slopes: Now That Rbrul Has Them, You May Want Them Too

I've made the first major update to Rbrul in a long time, adding support for random slopes. While in most cases, models with random intercepts perform better than those without them, a recent paper (Barr et al. 2013) has convincingly argued that for each fixed effect in a mixed model, one or more corresponding random slopes should also be considered.

So what are random slopes and what benefits do they provide? If we start with the simple regression equation y = ax + b, the intercept is b and the slope is a. A random intercept allows b to vary; if the data is drawn from different speakers, each speaker essentially has their own value for b. A random slope allows each speaker to have their own value for a as well.
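To make the distinction concrete, here is a minimal sketch in Python (standard library only; this is not Rbrul code). The function name, the population slope and intercept, and the standard deviations are all made up for illustration. Each simulated speaker gets their own b, and, when the slope standard deviation is non-zero, their own a as well.

```python
import random
import statistics

def simulate_speaker_lines(n_speakers, intercept_sd, slope_sd, seed=1):
    """Draw one (a, b) pair per speaker around a shared line y = a*x + b."""
    rng = random.Random(seed)
    pop_slope, pop_intercept = 1.5, -0.5  # hypothetical population values
    speakers = []
    for _ in range(n_speakers):
        b = pop_intercept + rng.gauss(0, intercept_sd)  # random intercept
        a = pop_slope + rng.gauss(0, slope_sd)          # random slope
        speakers.append((a, b))
    return speakers

# Random intercepts only: every speaker shares the same slope a.
intercepts_only = simulate_speaker_lines(40, intercept_sd=1.0, slope_sd=0.0)
# Random intercepts and slopes: a varies by speaker too.
with_slopes = simulate_speaker_lines(40, intercept_sd=1.0, slope_sd=0.7)

sd_without = statistics.pstdev([a for a, _ in intercepts_only])
sd_with = statistics.pstdev([a for a, _ in with_slopes])
print("slope SD, intercepts only:", sd_without)
print("slope SD, with random slopes:", round(sd_with, 2))
```

A model with only random intercepts corresponds to the first call: it builds in the assumption that the spread of a across speakers is exactly zero.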

The sociolinguistic literature usually concedes that speakers can vary in their intercepts (average values or rates of application). But at least since Guy (1980), it has been suggested or assumed that the speakers in a community do not vary in their slopes (constraints). As we saw last week, though, in some data sets the effect of following consonant vs. vowel on t/d-deletion varies by speaker more than might be expected by chance.

In the Buckeye Corpus, the estimated standard deviation, across speakers, of this consonant-vs.-vowel slope was 0.70 log-odds; in the Philadelphia Neighborhood Corpus, it was 0.67. A simulation reproducing the number of speakers, number of tokens, balance of following segments, overall following-segment effect, and speaker intercepts produced a median standard deviation of only 0.10 for Ohio and 0.16 for Philadelphia. Speaker slopes as dispersed as the ones actually observed would occur very rarely by chance (Ohio, p < .001; Philadelphia, p = .003).1

If rates and constraints can vary by speaker, it is important not to ignore speaker when analyzing the data. In assessing between-speaker effects – gender, class, age, etc. – ignoring speaker is equivalent to assuming that every token comes from a different speaker. This greatly overestimates the significance of these between-speaker effects (Johnson 2009). The same applies to intercepts (different rates between groups) and slopes (different constraints between groups). The figure below illustrates this.

By keeping track of speaker variation, random intercepts and slopes help provide accurate p-values (left). Without them, data gets "lumped" and p-values can be meaninglessly low (right).

Especially if your data is unbalanced, there are other benefits to using random slopes if your constraints might differ by speaker (or by word, or another grouping factor); these will not be discussed here. Mixed-effects models with random slopes not only control for by-speaker constraint variation, they also provide an estimate of its size. Mixed models with only random intercepts, like fixed-effects models, rather blindly assume the slope variation to be zero, and are only accurate if it really is. No doubt, this "Shared Constraints Hypothesis" (Guy 2004) is roughly, qualitatively correct: for example, all 83 speakers from Ohio and Philadelphia showed more deletion before consonants than before vowels (except one person with only two tokens!). But the hypothesis has been taken for granted far more often than it has been supported with quantitative evidence.

Rbrul has always fit models with random intercepts, allowing users to stop assuming that individual speakers have equal rates of application of a binary variable (or the same average values of a continuous variable). Now Rbrul allows random slopes, so the Shared Constraints Hypothesis can be treated like the hypothesis it is, rather than an inflexible axiom built into our software. The new feature may not be working perfectly, so please send feedback to (or comment here) if you encounter any problems or have questions. Also feel free to be in touch if you have requests for other features to be added in the future!

1. These models did not control for other within-subjects effects that could have increased the apparent diversity in the following-segment effect.

P.S. A major drawback to using random slopes is that models containing them can take a long time to fit, and sometimes they don't fit at all, causing "false convergences" and "singular convergences" that Rbrul reports with an "Error Message". There is not always a solution to this – see here and here for suggestions from Jaeger – but it is always a good idea to center any continuous variables, or at least keep the zero-point close to the center. For example, if you have a date-of-birth predictor, make 0 the year 1900 or 1950, not the year 0. Add random slopes one at a time so processing times don't get out of hand too quickly. Sonderegger has suggested dropping the correlation terms that lmer() estimates (by default) among the random effects. While this speeds up model fitting considerably, it seems to make the questionable assumption that the random effects are uncorrelated, so it has not been implemented.

P.P.S. Like lmer(), Rbrul will not stop you from adding a nonsensical random slope that does not vary within levels of the grouping factor. For example, a by-speaker slope for gender makes no sense because a given speaker is – at least traditionally – always the same gender. If speaker is the grouping factor, use random slopes that can vary within a speaker's data: style, topic, and most internal linguistic variables. If you are using word as a grouping factor, it is possible that different words show different gender effects; using a by-word slope for gender could be revealing.

P.P.P.S. I also added the AIC (Akaike Information Criterion) to the model output. The AIC is the deviance plus two times the number of parameters. Comparing the AIC values of two models is an alternative to performing a likelihood-ratio test. The model with lower AIC is supposed to be better.
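As a small illustration of that arithmetic, here is a sketch in Python (not Rbrul's own code). The deviances and parameter counts are invented for the example; the only thing taken from the text is the formula, deviance plus two times the number of parameters.

```python
def aic(deviance, n_params):
    """AIC as described above: deviance plus two times the number of parameters."""
    return deviance + 2 * n_params

def lr_statistic(deviance_smaller, deviance_larger):
    """Likelihood-ratio statistic for nested models: the drop in deviance."""
    return deviance_smaller - deviance_larger

# Hypothetical numbers: a random-intercept model, and the same model with a
# random slope added (one extra variance and one extra correlation parameter).
dev_intercepts, k_intercepts = 1000.0, 5
dev_slopes, k_slopes = 990.0, 7

aic_intercepts = aic(dev_intercepts, k_intercepts)
aic_slopes = aic(dev_slopes, k_slopes)
print("AIC, intercepts only:", aic_intercepts)
print("AIC, with random slope:", aic_slopes)
print("LR statistic:", lr_statistic(dev_intercepts, dev_slopes))
```

With these made-up values the larger model lowers the deviance by 10 at a cost of 2 extra parameters, so its AIC is lower by 6. Had the deviance dropped by less than 4, the penalty would have outweighed the improvement.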


Barr, Dale J., Roger Levy, Christoph Scheepers, and Harry J. Tily. 2013. Random effects structure for confirmatory hypothesis testing: keep it maximal. Journal of Memory and Language 68: 255-278.

Guy, Gregory R. 1980. Variation in the group and the individual: the case of final stop deletion. In W. Labov (ed.), Locating language in time and space. New York: Academic Press. 1-36.

Guy, Gregory R. 2004. Dialect unity, dialect contrast: the role of variable constraints. Talk presented at the Meertens Institute, August 2004.

Johnson, Daniel E. 2009. Getting off the GoldVarb standard: introducing Rbrul for mixed-effects variable rule analysis. Language and Linguistics Compass 3(1): 359-383.

Tuesday, July 16, 2013

Testing The Logistic Model Of Constant Constraint Effects: A Miniature Study

Sociolinguistics, like other fields, decided during the 1970s that logistic regression was the best way to analyze the effects of contextual factors on binary variables. Labov (1969) initially conceived of the probability of variable rule application as additive: p = p_0 + p_i + p_j + … . Cedergren & Sankoff (1974) introduced the multiplicative model: p = p_0 × p_i × p_j × … . But it was a slightly more complex equation that eventually prevailed: log(p/(1-p)) = log(p_0/(1-p_0)) + log(p_i/(1-p_i)) + log(p_j/(1-p_j)) + … .

Sankoff & Labov (1979: 195-6) note that this "logistic-linear" model "replaced the others in general use after 1974", although it was not publicly described until Rousseau & Sankoff (1978). It has specific advantages for sociolinguistics (treating favoring and disfavoring effects equally), but it is identical to the general form of logistic regression covered in e.g. Cox (1970). The VARBRUL and GoldVarb programs (Sankoff et al. 2012) apply Iterative Proportional Fitting to log-linear models (disallowing "knockouts" and continuous predictors). Such models are equivalent to logistic models, with identical outputs (pace Roy 2013).

The logistic transformation, devised by Verhulst in 1838, was first used to model population growth over time (Cramer 2002). If a population is limited by a maximum value (which we label 1.0), and the rate of increase is proportional to both the current level p and the remaining room for expansion 1-p, then the population, over time, will follow an S-shaped logistic curve, with a location parameter a and a slope parameter b:
p = exp(a+bt)/(1+exp(a+bt)). We see several logistic curves below.
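The growth rule and the curve can be checked against each other numerically. The sketch below is in Python, with parameter values chosen only for illustration. It evaluates the logistic curve and confirms that its rate of change matches b × p × (1-p), which is largest when p = 0.5.

```python
import math

def logistic(t, a, b):
    """The curve above: p = exp(a + b*t) / (1 + exp(a + b*t))."""
    return math.exp(a + b * t) / (1 + math.exp(a + b * t))

def numeric_rate(t, a, b, h=1e-6):
    """Central-difference estimate of dp/dt."""
    return (logistic(t + h, a, b) - logistic(t - h, a, b)) / (2 * h)

a, b = -2.0, 0.5      # illustrative location and slope parameters
midpoint = -a / b     # the time at which p reaches 0.5

for t in (0.0, midpoint, 8.0):
    p = logistic(t, a, b)
    # The growth rule: rate proportional to both p and the remaining room 1-p.
    print(t, round(p, 3), round(numeric_rate(t, a, b), 4), round(b * p * (1 - p), 4))
```

The two rate columns agree, and the rate at the midpoint (b/4) is the largest of the three, which is the "changes fastest in the middle" property that the rest of the post relies on.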

Its origin in population growth curves makes logistic regression a natural choice for analyzing discrete linguistic change, and it is extensively used in historical syntax (Kroch 1989) and increasingly in phonology (Fruehwald et al. 2009). However, if the independent variable is anything other than time, it is fair to ask whether its effect actually has the signature S-shape.

For social factors, which rarely resemble continuous variables, this is difficult to do. Labov's charts - juxtaposed in Silverstein (2003) - show that in mid-1960s New York, the effects of social class on the (r) variable and the (th) variable were quite different. For (r), the social classes toward the edges of the hierarchy are more dispersed; the lower classes (0 vs. 1) are further apart than the working classes (2-3 vs. 4-5). This is the opposite of a logistic curve, which always changes fastest in the middle. However, (th) shows a different pattern, which is more consistent with an S-curve: the lower and working classes are similar, with a large gap between them and the middle class groups. Finally, while it is hard to judge, neither variable appears to respond to contextual style in a clearly sigmoid manner.

Linguistic factors offer a better approach to the question. Rather than observe the shape of the response to a multi-level predictor - the levels of linguistic factors are often unordered - we can compare the size of binary linguistic constraints among speakers who vary in their overall rates of use. The idea that speakers in a community (and sometimes across community lines) use variables at different rates while sharing constraints began as a surprising observation (Labov 1966, G. Sankoff 1973, Guy 1980) but has become an assumption of the VARBRUL/GoldVarb paradigm (Guy 1991, Lim & Guy 2005, Meyerhoff & Walker 2007, Tagliamonte 2011; but see also Kay 1978, Kay & McDaniel 1979, Sankoff & Labov 1979).

Recently, speaker variation has motivated the introduction of mixed-effects models with random speaker intercepts. But if the variable in question is binary (and the regression is therefore logistic), a constant effect (in logistic terms) should have larger consequences (in percentage terms) in the middle of the range. If speaker A, in a certain context, shifts from 40% to 60% use, then speaker B should shift from 80% to 90% - not 100%. These two changes are equal in logistic units (called log-odds): log(.6/(1−.6)) − log(.4/(1−.4)) = log(.9/(1−.9)) − log(.8/(1−.8)) = 0.811.
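The arithmetic can be verified in a few lines. This is a Python sketch using natural logarithms; the helper names logit and inv_logit are my own labels for the two transformations.

```python
import math

def logit(p):
    """Log-odds of a proportion p."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Back-transform log-odds to a proportion."""
    return 1 / (1 + math.exp(-x))

shift_a = logit(0.6) - logit(0.4)   # speaker A: 40% -> 60%
shift_b = logit(0.9) - logit(0.8)   # speaker B: 80% -> 90%
print(round(shift_a, 3), round(shift_b, 3))

# Applying speaker A's log-odds shift to a speaker who starts at 80%:
print(round(inv_logit(logit(0.8) + shift_a), 3))
```

Both shifts correspond to multiplying the odds by 2.25, so the same effect moves a speaker 20 percentage points in the middle of the range but only 10 points near the top.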

In a classic paper, Guy (1980) compared the linguistic constraints on t/d-deletion for many individuals, but (despite the late year) presented factor weights from the multiplicative model, making it impossible to evaluate the relationship between rates and constraints from his tables. Therefore, we will use the following t/d data sets to attempt to address the issue:

Daleszynska (p.c.): 1,998 tokens from 30 Bequia speakers.
Labov et al. (2013): 14,992 tokens from 42 Philadelphia speakers.
Pitt et al. (2007): 13,664 tokens from 40 Central Ohio speakers.
Walker (p.c.): 4,022 tokens from 48 Toronto speakers.

The t/d-deletion predictor to be investigated is following consonant (west coast) vs. following vowel (west end). In most varieties of English, a following consonant favors deletion, while a following vowel disfavors it. Because the consonant-vowel difference has an articulatory basis, we might expect it to remain fairly constant across speakers. But if so, will it be constant in percentage terms, or in logistic terms? In fact, if deletion before consonants is a "late" phonetic process (caused by overlapping gestures?), we might observe a third pattern, where the effect would be smaller in proportion to the amount of deletion generated "earlier".1

That is, if a linguistic factor - e.g. following consonant vs. vowel in the case of t/d-deletion - has a constant effect in percentage terms, we find a horizontal line (1). If the constraint is constant in log-odds terms, as assumed by logistic regression, we see a curve with a maximum at 50% overall retention (2). If the effect arises from "extra" deletion before consonants, it increases in proportion to the overall retention rate (3).
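The three patterns can be sketched as functions of a speaker's overall retention rate. This Python sketch rests on simplifying assumptions of my own: the effect sizes (gap, delta, k) are arbitrary, "overall" is taken to be the midpoint between the two contexts on the relevant scale, and pattern (3) is modeled as a fixed fraction k of otherwise-retained tokens being deleted before consonants.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

def effect_constant_percentage(retention, gap=0.30):
    """(1) The same percentage-point difference at every rate: a flat line."""
    return gap

def effect_constant_log_odds(retention, delta=2.0):
    """(2) A fixed log-odds difference, centred on the speaker's overall rate."""
    centre = logit(retention)
    return inv_logit(centre + delta / 2) - inv_logit(centre - delta / 2)

def effect_extra_deletion(retention, k=0.5):
    """(3) A late process deletes a fraction k of tokens that were still retained."""
    return retention - retention * (1 - k)

for r in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(r,
          round(effect_constant_percentage(r), 3),
          round(effect_constant_log_odds(r), 3),
          round(effect_extra_deletion(r), 3))
```

Under these assumptions the second column stays flat, the third rises to a peak at 50% retention and falls off symmetrically, and the fourth grows in a straight line with the retention rate.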

Comparing the four community studies leads to some interesting results. There is a lot of variation between speakers in each community. We already knew that speakers varied in their overall rates of deletion, but the ranges here are wide. In Philadelphia, the median deletion level is 51%, but the range extends from 22% to 71% (considering people with more than 50 tokens). In the Ohio (Buckeye) corpus, the median rate was lower, 41%, with an even larger range, 18% to 73%. The Torontonians (with less deletion) and the Bequians (with more) also varied widely.

We also observe that speakers within communities differ in the observed following-segment effect. For example, in Philadelphia there were two speakers with very similar overall deletion rates, but one deleted 94% of the time before consonants and only 6% before vowels, while the other had 76% deletion before consonants and 20% before vowels. In Ohio, the consonant-vowel effect was smaller overall, with at least as much between-speaker variation: one speaker produced 79% deletion _#C and 4% _#V, while another produced 51% deletion _#C and 38% _#V. While it would require a statistical demonstration, this amount of divergence is probably more than would be expected by chance. If this is the case, even within speech communities, then we may need to take more care to model speaker constraint differences (for example, with random slopes).

Excludes speakers with <10 tokens, preceding /n/ and following /t, d/.

What about differences between communities? Clearly, the average deletion rates of the four communities differ: Bequia, the Caribbean island, has the most t/d deletion, while Canada's largest city shows the least. The white American communities are intermediate. Such differences are to be expected. What is more interesting is that the largest absolute following-segment effect is found in Philadelphia, where the data is closest to 50% average deletion. Ohio and Toronto, with around 40% deletion, show a smaller effect, in percentage terms. Bequia, with average deletion of nearly 90%, shows no clear following-segment effect at all. These findings are consistent with the logistic interpretation. The effect may be constant - but on the log-odds scale. On the percentage scale, it appears greatest in Philadelphia, where the median speaker shows a difference of 91% _#C vs. 16% _#V. But an effect this large - almost 4 log-odds - should show up in Bequia, yet it does not (of course, the Bequia variety is quite distinct from the others treated here; could it lack this basic constraint?).
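To see why an effect of this size ought to be visible even at Bequia's deletion rate, here is a rough Python sketch. The 91% and 16% figures are the Philadelphia medians quoted above; placing the Bequia speaker's midpoint at exactly 90% deletion on the log-odds scale is my own simplifying assumption.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

# Philadelphia median speaker: 91% deletion _#C vs. 16% _#V.
effect = logit(0.91) - logit(0.16)
print("effect in log-odds:", round(effect, 2))

# Assumption: the same effect, centred on 90% overall deletion.
centre = logit(0.90)
before_consonant = inv_logit(centre + effect / 2)
before_vowel = inv_logit(centre - effect / 2)
print("implied _#C:", round(before_consonant, 3))
print("implied _#V:", round(before_vowel, 3))
```

Under that assumption the implied gap is still more than 40 percentage points, so a constraint of Philadelphia's size would be hard to miss in Bequia if it were present.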

Within each community, the logistic model predicts the same thing: the closer a speaker is to 50% deletion overall, the larger the consonant-vowel difference should appear. The data suggest that this prediction is borne out, at least to a first approximation. In Philadelphia, Ohio, and Toronto, all the largest effects are found in the 40% - 60% range, and the smallest effects mostly occur outside that range. While the following-segment constraint differs across communities (larger in Philadelphia, smaller in Bequia), and probably across individual speakers, it seems to follow an inherent arch-shaped curve, similar to (2). A cubic approximation of this curve is superimposed below on the data from all four communities.

In conclusion, the evidence from four studies of t/d-deletion suggests that speaker effects and phonological effects combine additively on a logistic scale, supporting the standard variationist model. However, both rates and constraints can vary, not only between communities, but within them.

Thanks to Agata Daleszynska, Meredith Tamminga, and James Walker.

1. A diagonal also results from the "lexical exception" theory (Guy 2007), where t/d-deletion is bled when reduced forms are lexicalized, creating an' alongside and. When a word's underlying form may already be reduced, any contextual effects - like that of following segment - will be smaller in proportion. But if individual-word variation is part of the deletion process, we would expect the logistic curve (2) rather than the diagonal line (3).


Cedergren, Henrietta and David Sankoff. 1974. Variable rules: Performance as a statistical reflection of competence. Language 50(2): 333-355.

Cox, David R. 1970. The analysis of binary data. London: Methuen.

Cramer, Jan S. 2002. The origins of logistic regression. Tinbergen Institute Discussion Paper 119/4.

Fruehwald, Josef, Jonathan Gress-Wright, and Joel Wallenberg. 2009. Phonological rule change: the constant rate effect. Paper presented at North-Eastern Linguistic Society (NELS) 40, MIT.

Guy, Gregory R. 1980. Variation in the group and the individual: the case of final stop deletion. In W. Labov (ed.), Locating language in time and space. New York: Academic Press. 1-36.

Guy, Gregory R. 1991a. Explanation in variable phonology: an exponential model of morphological constraints. Language Variation and Change 3(1): 1-22.

Guy, Gregory R. 1991b. Contextual conditioning in variable lexical phonology. Language Variation and Change 3(2): 223-240.

Guy, Gregory R. 2007. Lexical exceptions in variable phonology. Penn Working Papers in Linguistics 13(2), Papers from NWAV 35.

Kay, Paul. 1978. Variable rules, community grammar and linguistic change. In D. Sankoff (ed.), Linguistic variation: models and methods. New York: Academic Press. 71-83.

Kay, Paul and Chad K. McDaniel. 1979. On the logic of variable rules. Language in Society 8(2): 151-187.

Labov, William. 1966. The social stratification of English in New York City. Washington, D.C.: Center for Applied Linguistics.

Labov, William. 1969. Contraction, deletion, and inherent variability of the English copula. Language 45(4): 715-762.

Labov, William et al. 2013. The Philadelphia Neighborhood Corpus of LING560 Studies.

Lim, Laureen T. and Gregory R. Guy. 2005. The limits of linguistic community: speech styles and variable constraint effects. Penn Working Papers in Linguistics 10(2), Papers from NWAVE 32. 157-170.

Meyerhoff, Miriam and James A. Walker. 2007. The persistence of variation in individual grammars: copula absence in ‘urban sojourners’ and their stay‐at‐home peers, Bequia (St. Vincent and the Grenadines). Journal of Sociolinguistics 11(3): 346-366.

Pitt, M. A. et al. 2007. Buckeye Corpus of Conversational Speech. Columbus, OH: Department of Psychology, Ohio State University.

Rousseau, Pascale and David Sankoff. 1978. Advances in variable rule methodology. In D. Sankoff (ed.), Linguistic variation: models and methods. New York: Academic Press. 57-69.

Roy, Joseph. 2013. Sociolinguistic Statistics: the intersection between statistical models, empirical data and sociolinguistic theory. Proceedings of Methods in Dialectology XIV in London, Ontario.

Sankoff, David, Sali Tagliamonte, and Eric Smith. 2012. Goldvarb LION: A variable rule application for Macintosh. Department of Linguistics, University of Toronto.

Sankoff, David and William Labov. 1979. On the uses of variable rules. Language in Society 8(2): 189-222.

Sankoff, Gillian. 1973. Above and beyond phonology in variable rules. In C.-J. N. Bailey & R. W. Shuy (eds), New ways of analyzing variation in English. Washington, D.C.: Georgetown University Press. 44-61.

Silverstein, Michael. 2003. Indexical order and the dialectics of sociolinguistic life. Language & Communication 23(3-4): 193-229.

Tagliamonte, Sali A. 2011. Variationist sociolinguistics: change, observation, interpretation. Hoboken, N.J.: Wiley.

Sunday, July 7, 2013

Does Neg- vs. Aux-Contraction Vary Geographically In England? A Miniature Study

On Friday Sam Kirkham and I met some sixth-formers [high school students] and gave them an introduction to our department and sociolinguistics in general. We decided to take advantage of the opportunity and use the students as unpaid research assistants. We designed a small questionnaire that they could give to each other, and to people waiting for a bus or sitting on the [unusually] sunny steps of Alexandra Square. To illustrate north-south differences, we included a few questions about TRAP-BATH and FOOT-STRUT. We also had this item:

This alternation, which has all but disappeared from US English, involves a choice between so-called negative contraction (I haven't been) and auxiliary [or operator] contraction (I've not been). The auxiliary in question can be is, are, have, has, had, will, or would (Varela Pérez 2013: 257); in this study we are looking at a single instance with have, an environment that favors negative contraction, compared to is or are.

Peter Trudgill was the first sociolinguist to suggest a geographic correlation for this variable, claiming that auxiliary contraction increases "the further north one goes" (1978: 13). However, his early proclamations of this sort have not always survived later scrutiny. As another example, Hughes & Trudgill (1979: 25) stated that the particle verb alternation (pour out the tea vs. pour the tea out) also patterned along a north-south continuum, but this was not at all borne out in an experimental study involving 145 UK (and Irish) speakers (Haddican & Johnson 2012).

Regarding contraction, studies have indeed found either no clear geographic correlation (Anderwald 2002, Tagliamonte & Smith 2002), or a weak one in the opposite direction, meaning that southerners may slightly prefer auxiliary contraction (Gasparrini 2001). However, nothing approaching a dialectological study of this variable has ever been conducted. For example, the eight places studied by Tagliamonte & Smith are scattered and to some extent intentionally unrepresentative of UK speech. In the present miniature study, we will achieve far less depth about each location, but a wider geographical coverage (though certainly unrepresentative in its own way).

We obtained 52 responses to the question on contraction, associated with 36 places of origin in England. The distribution was as follows: 23 people said "I haven't been to Ireland" was "much better", 16 said it was "slightly better", 7 said the two alternatives were "equally good", 3 said "I've not been to Ireland" was "slightly better", and 3 said it was "much better". This overall strong preference for NEG-contraction with have is in line with the literature. A simple way to address the geographical question is to divide the responses into three categories - South, Midlands, and North - and compare the responses in each group. Although the measurement scale of the question is ordinal, we will assume linearity and assign numerical scores, ranging from 0 for judging NEG-contraction "much better", to 4 for judging AUX-contraction "much better".

Using Wikipedia's traditional definition of the Midlands to divide the regions led to average scores that are, at the very least, suggestive of a difference in line with Trudgill's original formula.

South (11 responses): 0.55
Midlands (8 responses): 0.63
North (33 responses): 1.21
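As a consistency check on the scoring, the overall mean can be computed from the response distribution and compared with the regional means. This is a Python sketch; only the counts and the rounded regional averages reported above are used, since the individual responses are not reproduced here.

```python
# Scores: 0 = NEG-contraction "much better" ... 4 = AUX-contraction "much better".
counts = {0: 23, 1: 16, 2: 7, 3: 3, 4: 3}

n_responses = sum(counts.values())
total_score = sum(score * n for score, n in counts.items())
overall_mean = total_score / n_responses
print(n_responses, total_score, round(overall_mean, 2))

# Regional (responses, mean score) pairs as reported, rounded to two decimals.
regions = {"South": (11, 0.55), "Midlands": (8, 0.63), "North": (33, 1.21)}
weighted_mean = sum(n * m for n, m in regions.values()) / sum(n for n, _ in regions.values())
print(round(weighted_mean, 2))

# South and Midlands combined, for the two-group comparison with the North.
south_midlands = (11 * 0.55 + 8 * 0.63) / (11 + 8)
print(round(south_midlands, 2))
```

The response-weighted average of the three regional means agrees with the overall mean to within rounding, so the regional figures account for all 52 responses.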

If we combine the very similar South and Midlands regions and contrast their data with that from the North, it is initially unclear just how much evidence we have for a geographical difference. While a conventional t.test() returns a p-value of .03, the non-parametric wilcox.test() (or Mann-Whitney test, more appropriate here because the response is not only ordinal but quite skewed) gives p = .12, which would not be interpreted as statistically significant. However, we should also consider that none of the 19 respondents from the South and Midlands expressed a positive preference for AUX-contraction, while 6 of 33 Northern subjects did so. While dispreferred everywhere, AUX-contraction appears to be more acceptable in the North.

It is rarely a good idea to reduce a continuous variable to a set of discrete categories, and collapsing 36 distinct places into three regions is no exception, even though the division between North, Midlands and South has considerable historical precedent (the areas correspond roughly to the Northumbrian, Mercian, and Saxon kingdoms - and dialects - of the Old English period). If AUX-contraction really increases in a continuous manner "the further north one goes", then an analysis that treats latitude as a continuous variable will be more successful in revealing the effect. Incorporating the dimension of longitude as well, though it makes the statistics more complex, is potentially even more revealing.

The place names given by the respondents (usually cities or towns, sometimes counties) were entered into an online geocoder to obtain their latitudes and longitudes. There are many R packages (as well as other software) that could produce a map of this data; some options are described here. I found an outline map of England here, intended for use with the sp package, but I plotted it with ordinary 'base' R graphics (since I have yet to learn ggplot2, I do not know how to produce maps like this!). The only commands used for this map are plot(), points(), cluster.overplot() to separate the responses from the same place, and legend().

A basic spatial statistic called Moran's I is often run to establish whether the data show global spatial autocorrelation. Like any correlation, Moran's I can range from -1 to +1. A value of 0 would reflect a random spatial distribution of high and low values (dark and light points). A positive value means that similar values tend to cluster together, while a negative value means that high and low values are interspersed more than would be expected by chance (imagine the black and white squares on a chessboard). The statistic depends on a matrix of spatial weights; for example, all points within a certain distance could be considered neighbors, or the closest k points regardless of distance. Other, more gradual criteria can also be applied (see here and also Grieve et al. 2011).

I decided, somewhat arbitrarily, to use 5-nearest-neighbors as the threshold. If responses, on average, are more similar to their 5 nearest neighbors than to responses further away, then Moran's I should be positive. In fact, Moran's I is -0.102, which is associated with a p-value of .27. This means that the responses favoring AUX-contraction and NEG-contraction are not clustered, but in fact almost random in their spatial patterning. This conclusion is disappointing! On the bright side, a lack of spatial autocorrelation means that an ordinary regression can be performed with less fear of error. But an lm() model with latitude as a predictor is also not statistically significant (p = .27). Of course, such a model implies a gradual effect of latitude which to some extent goes against the idea of coherent dialect regions.
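For readers who want to see what the statistic computes, here is a minimal Python sketch of Moran's I with k-nearest-neighbor weights. It is not the code used for the analysis above: the function names are my own, the weights are row-standardized (one common choice), distances are plain Euclidean rather than geographic, and the toy grid stands in for the real responses. No significance test is included.

```python
import math

def knn_weights(points, k):
    """Row-standardized spatial weights: each point's k nearest neighbors get 1/k."""
    n = len(points)
    weights = [[0.0] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(points):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.hypot(points[j][0] - xi, points[j][1] - yi),
        )
        for j in others[:k]:
            weights[i][j] = 1.0 / k
    return weights

def morans_i(values, weights):
    """Moran's I: spatially weighted cross-products of deviations from the mean."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / w_sum) * (num / den)

# Toy data on a 4 x 4 grid (k = 4 here; the analysis above used k = 5).
grid = [(x, y) for x in range(4) for y in range(4)]
w = knn_weights(grid, k=4)

clustered = [0 if x < 2 else 1 for x, y in grid]   # two solid halves
chessboard = [(x + y) % 2 for x, y in grid]        # alternating squares

i_clustered = morans_i(clustered, w)
i_chessboard = morans_i(chessboard, w)
print(round(i_clustered, 3), round(i_chessboard, 3))
```

The two toy patterns land on opposite sides of zero: positive for the map split into two solid halves, negative for the chessboard.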

If a linguistic feature has wide variability in every community, then it is possible that global spatial autocorrelation will be low - especially with a small number of respondents - even though an overall geographical difference may exist. As this is a miniature study, we cannot pursue the debate further but can only note that if a small amount of crude data collected one afternoon in Lancaster can provide this much information, a larger collection effort could likely settle the question once and for all as to whether the preferred means of contraction has a geographic component. We will conclude by using the method of generalized additive modeling (mgcv package) to create a smoothed map of contraction preference.

Based on this plot, we would think that contraction varies geographically! But geographic patterns, like other types, can certainly arise by chance. To settle this question would require a dialectological investigation - that is, one conducted at many places. But the data collected on Friday, in a few hours, by sixth-form students, restores some faith in Peter Trudgill's conjecture, which may have been dismissed too hastily by linguists.


Anderwald, Lieselotte. 2002. Negation in Non-standard British English: Gaps, Regularizations, Asymmetries. London: Routledge.

Gasparrini, Désirée. 2001. It isn’t, it is not or it’s not? Regional Differences in Contraction in Spoken British English. Master’s thesis. University of Zürich.

Grieve, Jack, Dirk Speelman and Dirk Geeraerts. 2011. A statistical method for the identification and aggregation of regional linguistic variation. Language Variation and Change 23: 193-221.

Haddican, Bill and Daniel Ezra Johnson. 2012. Effects on the Particle Verb Alternation across English Dialects. University of Pennsylvania Working Papers in Linguistics 18(2): Article 5.

Hughes, Arthur and Peter Trudgill. 1979. English Accents and Dialects: An Introduction to Social and Regional Varieties of British English. London: Edward Arnold.

Tagliamonte, Sali and Jennifer Smith. 2002. 'Either it isn’t or it’s not': NEG/AUX Contraction in British Dialects. English World-Wide 23(2): 251-281.

Trudgill, Peter. 1978. Sociolinguistic patterns in British English. London: Edward Arnold.

Varela Pérez, José Ramón. 2013. Operator and negative contraction in spoken British English: a change in progress. In Bas Aarts, Joanne Close, and Geoffrey Leech (eds.), The Verb Phrase in English: Investigating Recent Language Change With Corpora. Cambridge University Press. 256-285.