Saturday, October 26, 2013

What Should We Do With Frequency?

UPDATE: According to Kyle Gorman (p.c.), the best psycholinguistic frequency measure is simply the rank (of frequency), which makes sense given a "serial access" model of lexical processing. This metric apparently outperforms both log-frequency and raw frequency in lexical decision tasks. The upshot for sociolinguistic frequency results remains to be determined.

Frequency is the count of how many times some relatively rare item, such as a particular word, occurs over a certain length of time, or in a text or corpus of a certain length. Such a count (often modeled as a Poisson variable) clearly can't be negative, but in a regression the combined effect of the independent variables can be positive or negative. One reason for using the logarithmic transformation of the count, when it is the dependent variable, is to avoid the problems resulting from this.

Also, in linear regression, effects on the dependent variable are additive, being proportional to the changes in the independent variables. But effects on counts are usually multiplicative, proportional to the size of the count itself. If we use the log link, instead of having to multiply the effects, we can add them, like we usually do.
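A small numerical illustration of the point about multiplicative effects becoming additive on the log scale. The counts and the doubling effect are made-up values chosen for the example, not data from any study.

```python
import math

# Hypothetical baseline counts for three words (illustrative values only).
baseline_counts = [5, 50, 500]
effect = 2.0  # a multiplicative effect: the count doubles

# On the raw scale, the change depends on the size of the count itself...
raw_changes = [count * effect - count for count in baseline_counts]

# ...but on the log scale, the same effect is a constant shift of log(2).
log_changes = [math.log(count * effect) - math.log(count) for count in baseline_counts]

print(raw_changes)
print(log_changes)
```

The raw changes grow with the baseline, while the log changes are the same for every word, which is what lets a regression simply add them.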

But when word frequency is an independent variable in a regression, the above concerns do not apply. Here we have to contend with another issue, which is that word frequencies form a highly skewed distribution. Even in large texts or corpora, a few words always comprise a substantial fraction of the total, while many words occur only once. Zipf's Law is more precise, stating that the most frequent word in a large text is twice as frequent as the second-most-frequent word, the 10th most frequent word twice as frequent as the 20th, and so forth. This "power law" says that a word's frequency, multiplied by its rank frequency, is a constant.
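To make the Zipfian relationship concrete, here is a minimal sketch of an idealized distribution in which frequency is exactly inversely proportional to rank. The top frequency of 60,000 is a hypothetical number; real corpora only approximate this pattern.

```python
import math

# Idealized Zipfian frequencies: frequency is inversely proportional to rank.
# top_frequency is a hypothetical count for the most frequent word.
top_frequency = 60000
ranks = range(1, 21)
frequencies = {rank: top_frequency / rank for rank in ranks}

# Frequency multiplied by rank gives the same constant at every rank.
products = [frequencies[rank] * rank for rank in ranks]
is_constant = all(math.isclose(product, top_frequency) for product in products)

# The 1st word is twice as frequent as the 2nd, the 10th twice as frequent as the 20th.
ratio_1_2 = frequencies[1] / frequencies[2]
ratio_10_20 = frequencies[10] / frequencies[20]

print(is_constant, ratio_1_2, ratio_10_20)
```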

For Zipf (1949), this relationship illustrated his "principle of least effort"; less frequent words were more difficult to access and therefore lazy humans avoided retrieving them. But both Zipf and others pointed out that similar relationships apply much more generally, including well outside the realm of human behavior: "the size of cities, the number of hits on websites, the magnitude of earthquakes and the diameters of moon craters have all been shown to follow power laws" (West 2008). Newman (2006) provides other interesting examples and offers six main mechanisms by which they can arise.

Newman (2006:13-14) refers to Cover and Thomas (1991:85), who show that, from an information-theory standpoint, the optimum length of a codeword is the negative logarithm of its probability. We can extend this, as Newman does, to the question of the distribution of word lengths, and we see that this in fact predicts a power-law distribution (if "length" is defined appropriately). Perhaps the optimum "semantic length" of words is behaving similarly, explaining the specific form of the inverse relationship between word frequency and rank frequency (of course, an inverse relationship of some kind exists by definition).
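The codeword-length result can be illustrated with a toy alphabet. The four symbols and their probabilities below are hypothetical, chosen as powers of 1/2 so that the optimal lengths come out as whole numbers of bits.

```python
import math

# Hypothetical symbol probabilities (powers of 1/2, so lengths are whole bits).
probabilities = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

# Optimal codeword length in bits: the negative base-2 logarithm of the probability.
code_lengths = {symbol: -math.log2(p) for symbol, p in probabilities.items()}

# The expected length under these lengths equals the entropy of the distribution.
expected_length = sum(p * code_lengths[symbol] for symbol, p in probabilities.items())

print(code_lengths)
print(expected_length)
```

The most probable symbol gets the shortest code, and halving a symbol's probability adds one bit to its length.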

We saw that when frequency is the dependent variable, it should be expressed in logarithmic units. But when frequency is an independent variable, whether to transform it is not so clear. Ideally, our choice should reflect our thoughts on how frequency might be represented internally. Indeed, comparing between two transformations, or between raw and transformed data, could possibly be used to distinguish between theories.

If words are merely "tagged" for frequency, a log-transformation might be convenient given the skewed distribution, because it decreases the effect of "leverage points" (independent-variable outliers). The distribution will still be skewed, but this is more or less OK, since it is an independent variable. But in a theory where each use of a word calls up the totality of the speaker or hearer's experience with that word, one would expect that raw frequency numbers would be more appropriate.

Transforming frequency as an independent variable does not change the model as such, only the coefficient values for frequency and any other variables interacting with it. However, as we know from Erker and Guy (2012), plots with raw frequency or log-frequency on the x-axis can look quite different, and the corresponding regression slopes and correlations can even switch from positive to negative or vice-versa, complicating the interpretation of the results.
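A sign flip of this kind is easy to reproduce with a toy data set. The five "words" below are invented for illustration (they are not Erker and Guy's data): four low-frequency words show a falling trend, and one very high-frequency word sits slightly above the average.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical word frequencies and some outcome measure for each word.
frequencies = [1, 2, 4, 8, 1000]
outcome = [10.0, 8.0, 6.0, 4.0, 7.5]

# Raw frequency: the single high-frequency word dominates as a leverage point.
r_raw = pearson(frequencies, outcome)

# Log frequency: the outlier is pulled in, and the falling trend takes over.
r_log = pearson([math.log(f) for f in frequencies], outcome)

print(r_raw, r_log)
```

With these values the raw-frequency correlation is positive and the log-frequency correlation is negative, even though the underlying data are identical.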

Perhaps a more clear-cut issue in regression is weighting. If we are interested in investigating the effect of frequency, whether on its original axis or on a transformed one, it seems unlikely that we would like our estimates to be affected more by high-frequency words than by low-frequency ones. This is not a problem in experiments where words are selected based on frequency, because they are then presented in a balanced way. But in studies of natural speech, especially if they include frequency as an independent variable, either a random word effect or explicit inverse-frequency weighting should be used to counteract the bias.


Cover, Thomas M. and Joy A. Thomas. 1991. Elements of information theory. New York: John Wiley & Sons.

Erker, Daniel and Gregory R. Guy. 2012. The role of lexical frequency in syntactic variability: variable subject personal pronoun expression in Spanish. Language 88(3): 526-557.

Newman, Mark. 2006. Power laws, Pareto distributions and Zipf’s law.

West, Marc. 2008. The mystery of Zipf.

Zipf, George Kingsley. 1949. Human Behavior and the Principle of Least Effort. Cambridge, Mass.: Addison-Wesley.

Saturday, October 5, 2013

If Individuals Follow The Exponential Hypothesis, Groups Don't (And Vice Versa)

If you haven't heard of the Exponential Hypothesis, read this, this and this, and if you want more, look here. Guy's paper inspired me to want to do this kind of linguistics. But now it seems that the patterns he so cleverly explained were just meaningless coincidences - leaving this, uncontested, as the most impressive quantitative LVC paper of all time. But I digress.

Sociolinguists have differed for aeons regarding the relationship between the individual and the group. Even Labov's clear statements along the lines that "language is not a property of the individual, but of the community" are qualified, or undermined, by defining a speech community as "a group of people who share a given set of norms of language" (see also pp. 206-210 of the same paper for a staunch defense of the study of individuals).

The typical variationist recognizes the practical need to combine data from a group of speakers, even if their theoretical goal is the analysis of individual grammars. After some years spent in ignorance of the statistical ramifications of this situation, they have now generally adopted mixed-effects regression modeling as a way to have their cake and eat it too.

But the Exponential Hypothesis is not well-equipped to bridge this gap. If each individual i retains final t/d at a rate of r_i for regular past tense forms, r_i^2 for weak past tense forms, and r_i^3 for monomorphemes - and if r_i varies by individual (as has always been conceded) - then the pooled data from all speakers can never show an exponential relationship.

I will demonstrate this under four assumptions of how speakers might vary: 1) the probability of retention, r, is normally distributed across the population; 2) the probability of retention is uniformly (evenly) distributed over a similar range; 3) the log-odds of retention - log(r / (1 - r)) - is normally distributed; 4) the log-odds of retention is uniformly distributed.

Using a central value for r of +2 log-odds (.881), and allowing speakers to vary with a standard deviation of 1 (in log-odds) or 0.15 (in probability), I obtained the following results, with 100,000 speakers in each simulation:

Probability Normal     Theoretical (Exponential)     Empirical (Group Mean)
Regular Past           .862                          .862
Weak Past              .743                          .759
Monomorpheme           .641                          .679

Probability Uniform    Theoretical (Exponential)     Empirical (Group Mean)
Regular Past           .862                          .862
Weak Past              .742                          .758
Monomorpheme           .639                          .680

Log-Odds Normal        Theoretical (Exponential)     Empirical (Group Mean)
Regular Past           .844                          .844
Weak Past              .712                          .728
Monomorpheme           .601                          .638

Log-Odds Uniform       Theoretical (Exponential)     Empirical (Group Mean)
Regular Past           .842                          .842
Weak Past              .710                          .724
Monomorpheme           .598                          .633

These simulations assume an equal amount of data from each speaker, and an equal balance of words from each speaker (which matters if individual words vary). If these conditions are not met, as in real data, the groups will likely deviate even more from the exponential pattern. Looking at it the other way round, the very existence of an exponential pattern in pooled data - as is found for t/d-deletion in English! - is evidence that the true Exponential Hypothesis, for individual grammars, is false.
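For readers who want to try this themselves, here is a rough sketch of one of the four scenarios (log-odds normally distributed, centered on +2 with a standard deviation of 1). It is a reconstruction under those stated assumptions rather than the original simulation code; the function name and the random seed are arbitrary choices.

```python
import math
import random

def simulate_log_odds_normal(n_speakers=100_000, center=2.0, sd=1.0, seed=1):
    """Draw each speaker's retention rate r from a normal distribution in log-odds."""
    rng = random.Random(seed)
    rates = [1 / (1 + math.exp(-rng.gauss(center, sd))) for _ in range(n_speakers)]
    # Empirical group means of r, r^2 and r^3 (each speaker follows the hypothesis).
    group_mean = {k: sum(r ** k for r in rates) / n_speakers for k in (1, 2, 3)}
    # What the hypothesis would predict if applied to the pooled regular-past rate.
    exponential = {k: group_mean[1] ** k for k in (1, 2, 3)}
    return exponential, group_mean

theoretical, empirical = simulate_log_odds_normal()
for power, label in [(1, "Regular Past"), (2, "Weak Past"), (3, "Monomorpheme")]:
    print(f"{label:<14} {theoretical[power]:.3f} {empirical[power]:.3f}")
```

Every simulated speaker obeys the exponential relationship exactly, yet the pooled weak-past and monomorpheme rates come out higher than the square and cube of the pooled regular-past rate.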

P.S. Why should this be, you ask? Let me try some math.

A function f(x) is strictly convex over an interval if the second derivative of the function is positive for all x in that interval.

Now let f(x) = x^n, where n > 1. The second derivative is n · (n-1) · x^(n-2). Since n > 1, both n and (n-1) are positive. If x is positive, x^(n-2) is positive, making the second derivative positive, which means that x^n is strictly convex over the whole interval 0 < x < ∞.

Jensen's inequality states that if x is a random variable (one that actually varies, rather than a constant) and f(x) is a strictly convex function, then f(E[x]) < E[f(x)]. That is, if we take the expected value of a variable over an interval, and then apply a strictly convex function to it, the result is always less than if we apply the function first, and then take the expected value of the outcome.

In our case, x is the probability of t/d retention, and like all probabilities, it lies on the interval between 0 and 1, where we know x^n is strictly convex. By Jensen's inequality, E[x]^n < E[x^n]. This means that if we take the mean rate of retention for a group of speakers, and raise it to some power, the result is always less than if we raise each speaker's rate to that power, and then take the mean.

Therefore, the theoretical exponential rate will always be less than the empirical group mean rate, which is what we observed in all the simulations above.
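The smallest possible case, a "group" of two speakers, shows the inequality at work. The two retention rates are hypothetical. For the squared (weak past) case, the gap between the two quantities is exactly the variance of the speakers' rates, which is why it vanishes only when the speakers do not differ.

```python
# Two hypothetical speakers with different regular-past retention rates.
rates = [0.7, 0.9]

mean_rate = sum(rates) / len(rates)

# Apply the hypothesis to the group mean: square of the mean.
square_of_mean = mean_rate ** 2

# Apply the hypothesis to each speaker, then pool: mean of the squares.
mean_of_squares = sum(r ** 2 for r in rates) / len(rates)

# The gap between the two equals the (population) variance of the rates.
variance = sum((r - mean_rate) ** 2 for r in rates) / len(rates)
gap = mean_of_squares - square_of_mean

print(square_of_mean, mean_of_squares, gap, variance)
```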

Tuesday, October 1, 2013

On Exactitude In Science: A Miniature Study On The Effects Of Typical And Current Context

In that Empire, the Art of Cartography attained such Perfection that the map of a single Province occupied the entirety of a City, and the map of the Empire, the entirety of a Province. In time, even these immense Maps no longer satisfied, and the Cartographers' Guilds surveyed a Map of the Empire whose Size was that of the Empire, and which coincided point for point with it. The following Generations, less addicted to the Study of Cartography, realized that that vast Map was useless, and not without some Pitilessness delivered it up to the Inclemencies of Sun and Winter. In the Deserts of the West, there remain tattered Ruins of that Map, inhabited by Animals and Beggars; in all the Land there is no other Relic of the Disciplines of Geography.  (J. L. Borges)

A scientific model makes predictions based on a number of variables and parameters. The more complex the model, the more accurate its predictions. But all things being equal, a simpler model is preferred. As Newton put it: "We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances."

Exemplar Theory makes predictions about the present or future based on an enormous amount of stored information about the past. For example, a speaker is said to pronounce a word by taking into account the thousands of ways he or she has produced and heard it before. If such feats of memory are possible - I ain't sayin' I don't believe Goldinger 1998 - we should not be surprised by the accuracy of models that rely on them. And if language can be shown to rely on them, so be it. But the abandonment of parsimony, in the absence of clear evidence, should be resisted. (See the comment by S. Brown below for an alternative view of this issue.)

The same phenomena can often be accounted for "equally well" by a more deductive traditional theory or by a more inductive, bottom-up approach. The chemical elements were assigned to groups (alkali metals, halogens, etc.) because of their similar physical properties and bonding behavior long before the nuclear basis for these similarities was discovered. In biology, the various taxonomic branchings of species can be thought of as a reflection and continuation of their historical evolution, but the differences exist on a synchronic level as well - in organisms' DNA.

In classical mechanics, if an object begins at rest, its current velocity can be determined by integrating its acceleration over time: v(t) = ∫ a(t) dt. This requires storing the details of acceleration at every point along the trajectory. If we ride a bus with our eyes closed and our fingers in our ears, we can estimate our present speed if we remember everything about our past accelerations.

A free-falling object, under the constant acceleration of gravity, g, has velocity v(t) = g · t. But a block sliding (without friction) down a ramp made up of various curved and angled sections, like a roller coaster, has an acceleration that changes with time. The acceleration at any moment is proportional to the sine of the ramp's angle of inclination, θ. Integrating, v(t) = g · ∫ sin(θ(t)) dt.

On a simple inclined plane, the angle is constant, so the acceleration is too. The velocity increases linearly: like free-fall, only slower. If the shape of the ramp is complicated, solving the integral of acceleration can be very difficult. (It might be beyond the capacity of the brain to calculate - but on a real roller coaster, we don't have to remember the past ten seconds to know how fast we are going now! We use other cues to accomplish that.)

But forgetting integration, we can solve for velocity in another way, showing that it depends only on the vertical height fallen: v = sqrt(2 · g · h). Obviously this is simpler than keeping track of complex and changing accelerations over time. This equation, rewritten as 1/2 · v^2 = g · h, also reflects the balance between kinetic and potential energy, one part of the important physical law of conservation of energy. Instead of a near-truism with a messy implementation, we have an elegant and powerful principle.
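The two routes to the same velocity can be compared numerically. The sketch below assumes a frictionless ramp whose angle varies with distance according to an invented function, ramp_angle; it steps through time accumulating acceleration, and then checks the result against the height-only formula.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2

def ramp_angle(distance):
    """Hypothetical ramp: the slope angle (in radians) varies along the track."""
    return 0.3 + 0.2 * math.sin(distance)

def slide(total_time, dt=1e-4):
    """Integrate the acceleration step by step for a block starting at rest."""
    velocity = distance = height_fallen = 0.0
    for _ in range(round(total_time / dt)):
        sin_theta = math.sin(ramp_angle(distance))
        new_velocity = velocity + G * sin_theta * dt  # a = g * sin(theta)
        mid_velocity = 0.5 * (velocity + new_velocity)
        distance += mid_velocity * dt                 # distance along the ramp
        height_fallen += mid_velocity * sin_theta * dt  # vertical drop
        velocity = new_velocity
    return velocity, height_fallen

# The "remember every acceleration" route...
v_integrated, h = slide(total_time=2.0)

# ...and the route that only needs the height fallen.
v_energy = math.sqrt(2 * G * h)

print(v_integrated, v_energy)
```

The integration needs thousands of remembered steps; the energy formula needs one number.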

Both expressions for velocity fit the same data, but to call the second an "emergent generalization", à la Bybee 2001, ignores its universality and demotivates the essential next step: the search for deeper explanations.

Admittedly, this physical allegory is unlikely to convince any exemplar theorists to recant. But we should realize that given its powerful and unexplanatory nature, any correct predictions made by ET do not constitute real evidence in its favor. We need to determine if the data also support an alternative theory, or at least find places where the data is more compatible with a weaker version of ET, rather than a stronger one.

A recently-discussed case suggests that with respect to phonological processes, individual words are not only influenced by their current contexts, but also, to a lesser degree, by their typical contexts (Guy, Hay & Walker 2008; Hay 2013). This is one of several recent studies by Hay and colleagues that show widespread and principled lexical variation, well beyond the idiosyncratic lexical exceptionalism sometimes acknowledged in the past, e.g. "when substantial lexical differences appear in variationist studies, they appear to be confined to a few lexical items" (Guy 2009).

The strong-ET interpretation is that all previous pronunciations of a word are stored in memory, and this gives us the typical-context distribution for each word. But if this is the case, the current-context effect must derive from something else: either from mechanical considerations or from analogy to other words. It can't also reflect sampling from sub-clouds of past exemplars, because that would cancel out the typical-context effect.

For words to be stored along with their environments is actually a weak version of word-specific phonetics (Pierrehumbert 2002). It is not that words are explicitly marked to behave differently; they only do so because of the environments they typically find themselves in. For Yang (2013: 6325), "these use asymmetries are unlikely to be linguistic but only mirror life." But whether they mirror life or reflect language-internal collocational structures, these asymmetries are not properties of individual words.

Under this model of word production - sampling from the population of stored tokens, then applying a constant multiplicative contextual effect - we observe the following pattern (in this case, the process is t/d-deletion, as in Hay 2013; the parameters are roughly based on real data):

[Figure: Exemplar Model: Contextual Effect Greatest When Pool Is Least Reduced]

This pattern has two main features: as words' typical contexts favor retention more, retention rates increase linearly before both V and C, with a widening gap between the two. From the Exemplar Theory perspective, when the pool of tokens contains mainly unreduced forms, the differences between constraints on reduction can be seen more clearly. But when many of the tokens in the pool are reduced already, the difference between pre-consonantal and pre-vocalic environments appears smaller. Such reduced constraint sizes are the natural result when a process of "pre-deletion" intersects with a set of rules or constraints that apply later, as discussed in this post, and in a slightly different sense in Guy 2007.

An alternative to storing every token is to say that words acquire biases from their contexts, and that these biases become properties of the words themselves. The source of a bias could be irrelevant to its representation - one word typically heard before consonants, another typically heard from African-American speakers, and another typically heard in fast speech contexts could all be marked for "extra deletion" in the same way.

From the point of view of parsimony, this is appealing. To figure out how a speaker might pronounce a word, the grammar would have to refer to a medium-sized list of by-word intercepts, but not search through a linguistic biography thick and complex enough to have been written by Robert Caro.

But the theoretical rubber needs to hit the empirical road, or else we are just spinning our wheels here. So, compared to the Stor-It-All model, does a stripped-down word-intercept approach make adequate predictions, or - dare we hope - even better ones? Are the predictions even that different?

If we assume that for binary variables, by-word intercepts (like by-speaker intercepts) combine with contextual effects additively on the log-odds scale (which seems more or less true), we obtain a pattern like this:

[Figure: Intercept Model: Typical Context Combines W/ Constant Current Context]

Although the two figures are not wildly different, we can see that in this case, there is no steady separation of the _V and _C effects as overall retention increases. The following-segment effect is constant in log-odds (by stipulation), and this manifests as a slight widening near the center of the distribution. The effects of current context and typical context are independent in this model, as opposed to what we saw above.
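The qualitative difference between the two models can be sketched in a few lines. All four parameter values below are invented for illustration (they are not the "roughly based on real data" values behind the figures), and the models are indexed directly by pool retention rate and by word intercept rather than by a word's typical context.

```python
import math

def logistic(x):
    return 1 / (1 + math.exp(-x))

def logit(p):
    return math.log(p / (1 - p))

# Hypothetical parameters, chosen only to illustrate the shapes of the two models.
K_V, K_C = 0.95, 0.70   # exemplar model: multiplicative retention factors for _V and _C
B_V, B_C = 0.75, -0.75  # intercept model: additive log-odds effects for _V and _C

def exemplar_prediction(pool_retention):
    """Sample from the stored pool, then apply a constant multiplicative context effect."""
    return pool_retention * K_V, pool_retention * K_C

def intercept_prediction(word_intercept):
    """Add a by-word intercept to a constant context effect on the log-odds scale."""
    return logistic(word_intercept + B_V), logistic(word_intercept + B_C)

pool_rates = [0.2, 0.4, 0.6, 0.8, 1.0]
exemplar_gaps = [v - c for v, c in map(exemplar_prediction, pool_rates)]

word_intercepts = [-2.0, -1.0, 0.0, 1.0, 2.0]
intercept_predictions = [intercept_prediction(b) for b in word_intercepts]
intercept_gaps = [v - c for v, c in intercept_predictions]
log_odds_gaps = [logit(v) - logit(c) for v, c in intercept_predictions]

print(exemplar_gaps)
print(intercept_gaps)
print(log_odds_gaps)
```

In the exemplar sketch the _V/_C gap keeps growing as the pool becomes less reduced; in the intercept sketch the gap is constant in log-odds and is widest, in probability terms, for words near the middle of the range.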

As usual, the Buckeye Corpus (Pitt et al. 2007) is a good proving ground for competing predictions of this kind. The Philadelphia Neighborhood Corpus has a similar amount of t/d coded (with more coming soon). Starting with Buckeye, I only included tokens of word-final t/d that were followed by a vowel or a consonant. I excluded all tokens with preceding /n/, in keeping with the sociolinguistic lore, "Beware the nasal flap!" I then restricted the analysis to words with at least 10 total tokens each - and excluded the word "just", because it had about eight times as many tokens as any of the other words. I was left with 2418 tokens of 69 word types.

Incidentally, there is no significant relationship between a word's overall deletion rate and its (log) frequency, whether the frequency measure is taken from the Buckeye Corpus itself (p = .15) or from the 51-million-word corpus of Brysbaert et al. 2013 (p = .37). The absence of a significant frequency effect on what is arguably a lenition process goes against a key tenet of Exemplar Theory (Bybee 2000, Pierrehumbert 2001), but the issue of frequency is not our main concern here.

I first plotted two linear regression lines, one for the pre-vocalic environments and one for the pre-consonantal environments. The regressions were weighted according to the number of tokens for each word. I then tried a quadratic rather than a linear regression. However, these curves did not provide a significantly better fit to the data - p(_V) = .57, p(_C) = .46 - so I retreated to the linear models. The straight lines plotted below look parallel; in fact the slope of the _V line is 0.301 and the slope of the _C line is 0.369. Since the lines converge slightly rather than diverging markedly, this data is less consistent with the exemplar model sketched above, and more consistent with the word-intercept model.
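For anyone unfamiliar with weighted regression, this is roughly what weighting by token count amounts to for a single straight line. The five words below are invented (they are not Buckeye values), and the function is a plain weighted least-squares fit, not the exact procedure used for the plots.

```python
def weighted_linear_fit(xs, ys, weights):
    """Weighted least-squares line: each point counts in proportion to its weight."""
    total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / total
    y_bar = sum(w * y for w, y in zip(weights, ys)) / total
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Hypothetical words: proportion of tokens before a vowel (typical context),
# retention rate in one current context, and number of tokens.
prop_prevocalic = [0.2, 0.4, 0.5, 0.7, 0.9]
retention = [0.45, 0.50, 0.55, 0.60, 0.70]
tokens = [12, 40, 25, 150, 18]

slope, intercept = weighted_linear_fit(prop_prevocalic, retention, tokens)
print(slope, intercept)
```

Words with many tokens pull the line toward themselves, which is appropriate here because their rates are estimated more reliably.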

[Figure: Buckeye Corpus: Parallel Lines Support Intercept Model, Not Exemplars]

One way to improve this analysis would be to use a larger corpus, at least for the x-axis, to more accurately estimate the proportion that a given word ending in t/d is followed by a vowel rather than a consonant. For example, the spoken section of COCA (Davies 2008-) is about 250 times larger than the Buckeye Corpus. Of course, for a few words the estimate from the local corpus might better represent those speakers' biases.

Turning finally to data from the Philadelphia Neighborhood Corpus, we see a fairly similar picture. Note that some of the words' left-right positions differ noticeably between the two studies. The word "most", despite having 150-200 tokens, occurs before a vowel 75% of the time in Philadelphia, but only 52% of the time in Ohio. It is hard to think what this could be besides sampling error, but if it is that, it casts some doubt on the reliability of these results, especially as most words have far fewer tokens.

[Figure: Philadelphia Neighborhood Corpus: Convergence, Not Exemplar Prediction]

Regarding the regression lines, there are two main differences. First, Philadelphia speakers delete much more before consonants than Ohio speakers, while there is no overall difference before vowels. This creates the greater following-segment effect noticed for Philadelphia before.

The second difference is that in Philadelphia, a word's typical context seems to barely affect its behavior before vowels. The slope before consonants, 0.317, is close to those observed in Ohio, but the slope before vowels is only 0.143 - not significantly different from zero (p = .14). Recall that under the exemplar model, the _V slope should always be larger than the _C slope; words almost always occurring before vowels - passed, walked, talked - should provide a pool of pristine, unreduced exemplars upon which the effects of current context should be most visible.

I have no explanation at present for the opposite trend being found in Philadelphia, but it is clear that neither the PNC data nor the Buckeye Corpus data show the quantitative patterns predicted by the exemplar theory model. This, and a general preference for parsimony - in storage, and arguably in computation (again, see S. Brown below) - points to typical-context effects being "ordinary" lexical effects. "[We] shall know a word by the company it keeps" (Firth 1957: 11), but we still have no reason to believe that the word itself knows all the company it has ever kept. And to find our way forward, we may not need a map at 1:1 scale.

Thanks: Stuart Brown, Kyle Gorman, Betsy Sneller, & Meredith Tamminga.


Borges, Jorge Luis. 1946. Del rigor en la ciencia. Los Anales de Buenos Aires 1(3): 53.

Brysbaert, Marc, Boris New and Emmanuel Keuleers. 2013. SUBTLEX-US frequency list with PoS information final text version. Available online at

Bybee, Joan. 2000. The phonology of the lexicon: evidence from lexical diffusion. In Michael Barlow and Suzanne Kemmer (eds.), Usage-based models of language. Stanford: CSLI. 65-85.

Bybee, Joan. 2001. Phonology and language use. Cambridge Studies in Linguistics 94. Cambridge: Cambridge University Press.

Davies, Mark. 2008-. The Corpus of Contemporary American English: 450 million words, 1990-present. Available online at

Firth, John R. 1957. A synopsis of linguistic theory, 1930-1955. In Studies in Linguistic Analysis, Special volume of the Philological Society. Oxford: Basil Blackwell.

Guy, Gregory. 2007. Lexical exceptions in variable phonology. Penn Working Papers in Linguistics 13(2), Papers from NWAV 35, Columbus.

Guy, Gregory. 2009. GoldVarb: Still the right tool. NWAV 38, Ottawa.

Guy, Gregory, Jennifer Hay and Abby Walker. 2008. Phonological, lexical, and frequency factors in coronal stop deletion in early New Zealand English. LabPhon 11, Wellington.

Hay, Jennifer. 2013. Producing and perceiving "living words". UKLVC 9, Sheffield.

Pierrehumbert, Janet. 2001. Exemplar dynamics: word frequency, lenition and contrast. In Joan Bybee and Paul Hopper (eds.), Frequency and the emergence of linguistic structure. Amsterdam: John Benjamins. 137-157.

Pierrehumbert, Janet. 2002. Word-specific phonetics. Laboratory Phonology 7. Berlin: Mouton de Gruyter. 101-139.

Pitt, Mark A. et al. 2007. Buckeye Corpus of Conversational Speech. Columbus: Department of Psychology, Ohio State University.

Yang, Charles. 2013. Ontogeny and phylogeny of language. Proceedings of the National Academy of Sciences 110(16): 6324-6327.

Thursday, July 25, 2013

Random Slopes: Now That Rbrul Has Them, You May Want Them Too

I've made the first major update to Rbrul in a long time, adding support for random slopes. While in most cases, models with random intercepts perform better than those without them, a recent paper (Barr et al. 2013) has convincingly argued that for each fixed effect in a mixed model, one or more corresponding random slopes should also be considered.

So what are random slopes and what benefits do they provide? If we start with the simple regression equation y = ax + b, the intercept is b and the slope is a. A random intercept allows b to vary; if the data is drawn from different speakers, each speaker essentially has their own value for b. A random slope allows each speaker to have their own value for a as well.
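As a toy illustration of the difference, the sketch below generates speakers who each get their own b (and optionally their own a) by adding normally distributed deviations to the population values. The function names and all numeric values are hypothetical; this shows the idea only, not how Rbrul or lmer() estimate such models.

```python
import random

def simulate_speakers(n_speakers, mean_intercept, mean_slope,
                      intercept_sd, slope_sd, seed=0):
    """Return one (intercept b, slope a) pair per speaker."""
    rng = random.Random(seed)
    return [
        (mean_intercept + rng.gauss(0, intercept_sd),
         mean_slope + rng.gauss(0, slope_sd))
        for _ in range(n_speakers)
    ]

def predict(speaker, x):
    """y = a * x + b, using this speaker's own a and b."""
    intercept, slope = speaker
    return slope * x + intercept

# Random intercepts only: every speaker shares the same slope a.
intercepts_only = simulate_speakers(5, mean_intercept=1.0, mean_slope=2.0,
                                    intercept_sd=0.5, slope_sd=0.0)

# Random intercepts and random slopes: a varies by speaker too.
with_slopes = simulate_speakers(5, mean_intercept=1.0, mean_slope=2.0,
                                intercept_sd=0.5, slope_sd=0.7)

for speaker in with_slopes:
    print(speaker, predict(speaker, 1.0))
```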

The sociolinguistic literature usually concedes that speakers can vary in their intercepts (average values or rates of application). But at least since Guy (1980), it has been suggested or assumed that the speakers in a community do not vary in their slopes (constraints). As we saw last week, though, in some data sets the effect of following consonant vs. vowel on t/d-deletion varies by speaker more than might be expected by chance.

In the Buckeye Corpus, the estimated standard deviation, across speakers, of this consonant-vs.-vowel slope was 0.70 log-odds; in the Philadelphia Neighborhood Corpus, it was 0.67. A simulation reproducing the number of speakers, number of tokens, balance of following segments, overall following-segment effect, and speaker intercepts produced a median standard deviation of only 0.10 for Ohio and 0.16 for Philadelphia. Speaker slopes as dispersed as the ones actually observed would occur very rarely by chance (Ohio, p < .001; Philadelphia, p = .003). [1]

If rates and constraints can vary by speaker, it is important not to ignore speaker when analyzing the data. In assessing between-speaker effects – gender, class, age, etc. – ignoring speaker is equivalent to assuming that every token comes from a different speaker. This greatly overestimates the significance of these between-speaker effects (Johnson 2009). The same applies to intercepts (different rates between groups) and slopes (different constraints between groups). The figure below illustrates this.

By keeping track of speaker variation, random intercepts and slopes help provide accurate p-values (left). Without them, data gets "lumped" and p-values can be meaninglessly low (right).

Especially if your data is unbalanced, there are other benefits to using random slopes if your constraints might differ by speaker (or by word, or another grouping factor); these will not be discussed here. Mixed-effects models with random slopes not only control for by-speaker constraint variation, they also provide an estimate of its size. Mixed models with only random intercepts, like fixed-effects models, rather blindly assume the slope variation to be zero, and are only accurate if it really is. No doubt, this "Shared Constraints Hypothesis" (Guy 2004) is roughly, qualitatively correct: for example, all 83 speakers from Ohio and Philadelphia showed more deletion before consonants than before vowels (except one person with only two tokens!) But the hypothesis has been taken for granted far more often than it has been supported with quantitative evidence.

Rbrul has always fit models with random intercepts, allowing users to stop assuming that individual speakers have equal rates of application of a binary variable (or the same average values of a continuous variable). Now Rbrul allows random slopes, so the Shared Constraints Hypothesis can be treated like the hypothesis it is, rather than an inflexible axiom built into our software. The new feature may not be working perfectly, so please send feedback (or comment here) if you encounter any problems or have questions. Also feel free to be in touch if you have requests for other features to be added in the future!

[1] These models did not control for other within-subjects effects that could have increased the apparent diversity in the following-segment effect.

P.S. A major drawback to using random slopes is that models containing them can take a long time to fit, and sometimes they don't fit at all, causing "false convergences" and "singular convergences" that Rbrul reports with an "Error Message". There is not always a solution to this – see here and here for suggestions from Jaeger – but it is always a good idea to center any continuous variables, or at least keep the zero-point close to the center. For example, if you have a date-of-birth predictor, make 0 the year 1900 or 1950, not the year 0. Add random slopes one at a time so processing times don't get out of hand too quickly. Sonderegger has suggested dropping the correlation terms that lmer() estimates (by default) among the random effects. While this speeds up model fitting considerably, it seems to make the questionable assumption that the random effects are uncorrelated, so it has not been implemented.
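Centering is a one-line operation. A minimal sketch with invented birth years, showing both options mentioned above: subtracting a round reference year near the middle of the data, or subtracting the sample mean.

```python
# Hypothetical dates of birth for five speakers.
birth_years = [1925, 1940, 1958, 1971, 1986]

# Option 1: make a round year near the middle of the data the zero-point.
REFERENCE_YEAR = 1950
centered_on_reference = [year - REFERENCE_YEAR for year in birth_years]

# Option 2: center on the sample mean, so the predictor averages exactly zero.
mean_year = sum(birth_years) / len(birth_years)
centered_on_mean = [year - mean_year for year in birth_years]

print(centered_on_reference)
print(centered_on_mean)
```

Either way, the intercept then describes a speaker born near the middle of the sample rather than one born in the year 0.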

P.P.S. Like lmer(), Rbrul will not stop you from adding a nonsensical random slope that does not vary within levels of the grouping factor. For example, a by-speaker slope for gender makes no sense because a given speaker is – at least traditionally – always the same gender. If speaker is the grouping factor, use random slopes that can vary within a speaker's data: style, topic, and most internal linguistic variables. If you are using word as a grouping factor, it is possible that different words show different gender effects; using a by-word slope for gender could be revealing.

P.P.P.S. I also added the AIC (Akaike Information Criterion) to the model output. The AIC is the deviance plus two times the number of parameters. Comparing the AIC values of two models is an alternative to performing a likelihood-ratio test. The model with lower AIC is supposed to be better.
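The calculation is simple enough to do by hand. A minimal sketch using the definition just given, with made-up deviances and parameter counts for two competing models:

```python
def aic(deviance, n_parameters):
    """Akaike Information Criterion: deviance plus two times the number of parameters."""
    return deviance + 2 * n_parameters

# Hypothetical models: the richer one fits slightly better but uses two more parameters.
simpler = aic(deviance=1520.4, n_parameters=4)
richer = aic(deviance=1516.9, n_parameters=6)

# The model with the lower AIC is preferred.
preferred = "simpler" if simpler < richer else "richer"
print(simpler, richer, preferred)
```

Here the improvement in deviance is too small to pay for the two extra parameters, so the simpler model has the lower AIC.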


Barr, Dale J., Roger Levy, Christoph Scheepers, and Harry J. Tily. 2013. Random effects structure for confirmatory hypothesis testing: keep it maximal. Journal of Memory and Language 68: 255-278.

Guy, Gregory R. 1980. Variation in the group and the individual: the case of final stop deletion. In W. Labov (ed.), Locating language in time and space. New York: Academic Press. 1-36.

Guy, Gregory R. 2004. Dialect unity, dialect contrast: the role of variable constraints. Talk presented at the Meertens Institute, August 2004.

Johnson, Daniel E. 2009. Getting off the GoldVarb standard: introducing Rbrul for mixed-effects variable rule analysis. Language and Linguistics Compass 3(1): 359-383.

Tuesday, July 16, 2013

Testing The Logistic Model Of Constant Constraint Effects: A Miniature Study

Sociolinguistics, like other fields, decided during the 1970s that logistic regression was the best way to analyze the effects of contextual factors on binary variables. Labov (1969) initially conceived of the probability of variable rule application as additive: p = p0 + pi + pj + … . Cedergren & Sankoff (1974) introduced the multiplicative model: p = p0 × pi × pj × … . But it was a slightly more complex equation that eventually prevailed: log(p/(1-p)) = log(p0/(1-p0)) + log(pi/(1-pi)) + log(pj/(1-pj)) + … .
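To make the contrast concrete, here is a small Python sketch of the three ways of combining effects. The function names and input values are illustrative assumptions, not taken from any of the cited papers; note that each model expects its effects on a different scale (deviations, multipliers, or factor weights where 0.5 is neutral).

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

def additive(p0, effects):
    # Additive model: effects are added to the input probability.
    return p0 + sum(effects)

def multiplicative(p0, effects):
    # Multiplicative model: effects multiply the input probability.
    return p0 * math.prod(effects)

def logistic(p0, weights):
    # Logistic-linear model: the log-odds of the weights are added.
    return inv_logit(logit(p0) + sum(logit(w) for w in weights))

# Hypothetical values.
print(additive(0.9, [0.2, 0.2]))        # exceeds 1, so not a valid probability
print(multiplicative(0.9, [0.9, 0.9]))  # shrinks, since each factor here is below 1
print(logistic(0.9, [0.9, 0.9]))        # approaches 1 but stays below it
print(logistic(0.6, [0.7, 0.4]))        # 0.6 and 0.4 cancel, leaving about 0.7
```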

Sankoff & Labov (1979: 195-6) note that this "logistic-linear" model "replaced the others in general use after 1974", although it was not publicly described until Rousseau & Sankoff (1978). It has specific advantages for sociolinguistics (treating favoring and disfavoring effects equally), but it is identical to the general form of logistic regression covered in e.g. Cox (1970). The VARBRUL and GoldVarb programs (Sankoff et al. 2012) apply Iterative Proportional Fitting to log-linear models (disallowing "knockouts" and continuous predictors). Such models are equivalent to logistic models, with identical outputs (pace Roy 2013).

The logistic transformation, devised by Verhulst in 1838, was first used to model population growth over time (Cramer 2002). If a population is limited by a maximum value (which we label 1.0), and the rate of increase is proportional to both the current level p and the remaining room for expansion 1-p, then the population, over time, will follow an S-shaped logistic curve, with a location parameter a and a slope parameter b:
p = exp(a+bt)/(1+exp(a+bt)). We see several logistic curves below.
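A quick numerical check of the growth property just described, sketched in Python with arbitrary values for a and b: the slope of the curve at any point equals b × p × (1-p), and the curve crosses 50% at t = -a/b.

```python
import math

def logistic_curve(t, a, b):
    """p = exp(a+bt) / (1 + exp(a+bt))"""
    return math.exp(a + b * t) / (1 + math.exp(a + b * t))

a, b = -2.0, 0.5  # hypothetical location and slope parameters
t, h = 3.0, 1e-6

p = logistic_curve(t, a, b)
# Central-difference estimate of the rate of increase at time t.
numeric_slope = (logistic_curve(t + h, a, b) - logistic_curve(t - h, a, b)) / (2 * h)
# Rate implied by "proportional to both p and 1-p".
analytic_slope = b * p * (1 - p)

midpoint = -a / b  # the time at which p reaches 0.5
print(numeric_slope, analytic_slope, logistic_curve(midpoint, a, b))
```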

Its origin in population growth curves makes logistic regression a natural choice for analyzing discrete linguistic change, and it is extensively used in historical syntax (Kroch 1989) and increasingly in phonology (Fruehwald et al. 2009). However, if the independent variable is anything other than time, it is fair to ask whether its effect actually has the signature S-shape.

For social factors, which rarely resemble continuous variables, this is difficult to do. Labov's charts - juxtaposed in Silverstein (2003) - show that in mid-1960s New York, the effects of social class on the (r) variable and the (th) variable were quite different. For (r), the social classes toward the edges of the hierarchy are more dispersed; the lower classes (0 vs. 1) are further apart than the working classes (2-3 vs. 4-5). This is the opposite of a logistic curve, which always changes fastest in the middle. However, (th) shows a different pattern, which is more consistent with an S-curve: the lower and working classes are similar, with a large gap between them and the middle class groups. Finally, while it is hard to judge, neither variable appears to respond to contextual style in a clearly sigmoid manner.

Linguistic factors offer a better approach to the question. Rather than observe the shape of the response to a multi-level predictor - the levels of linguistic factors are often unordered - we can compare the size of binary linguistic constraints among speakers who vary in their overall rates of use. The idea that speakers in a community (and sometimes across community lines) use variables at different rates while sharing constraints began as a surprising observation (Labov 1966, G. Sankoff 1973, Guy 1980) but has become an assumption of the VARBRUL/GoldVarb paradigm (Guy 1991, Lim & Guy 2005, Meyerhoff & Walker 2007, Tagliamonte 2011; but see also Kay 1978, Kay & McDaniel 1979, Sankoff & Labov 1979).

Recently, speaker variation has motivated the introduction of mixed-effects models with random speaker intercepts. But if the variable in question is binary (and the regression is therefore logistic), a constant effect (in logistic terms) should have larger consequences (in percentage terms) in the middle of the range. If speaker A, in a certain context, shifts from 40% to 60% use, then speaker B should shift from 80% to 90% - not 100%. These two changes are equal in logistic units (called log-odds): log(.6/(1−.6)) − log(.4/(1−.4)) = log(.9/(1−.9)) − log(.8/(1−.8)) = 0.811.
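The arithmetic can be checked in a few lines of Python (a sketch; the helper names are my own):

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

shift_a = logit(0.6) - logit(0.4)  # speaker A: 40% -> 60%
shift_b = logit(0.9) - logit(0.8)  # speaker B: 80% -> 90%
print(round(shift_a, 3), round(shift_b, 3))

# Applying speaker A's log-odds shift to a speaker who starts at 80%
# gives about 90%, not 100%.
print(inv_logit(logit(0.8) + shift_a))
```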

In a classic paper, Guy (1980) compared the linguistic constraints on t/d-deletion for many individuals, but (despite the late year) presented factor weights from the multiplicative model, making it impossible to evaluate the relationship between rates and constraints from his tables. Therefore, we will use the following t/d data sets to attempt to address the issue:

Daleszynska (p.c.): 1,998 tokens from 30 Bequia speakers.
Labov et al. (2013): 14,992 tokens from 42 Philadelphia speakers.
Pitt et al. (2007): 13,664 tokens from 40 Central Ohio speakers.
Walker (p.c.): 4,022 tokens from 48 Toronto speakers.

The t/d-deletion predictor to be investigated is following consonant (west coast) vs. following vowel (west end). In most varieties of English, a following consonant favors deletion, while a following vowel disfavors it. Because the consonant-vowel difference has an articulatory basis, we might expect it to remain fairly constant across speakers. But if so, will it be constant in percentage terms, or in logistic terms? In fact, if deletion before consonants is a "late" phonetic process (caused by overlapping gestures?), we might observe a third pattern, where the effect would be smaller in proportion to the amount of deletion generated "earlier".1

That is, if a linguistic factor - e.g. following consonant vs. vowel in the case of t/d-deletion - has a constant effect in percentage terms, we find a horizontal line (1). If the constraint is constant in log-odds terms, as assumed by logistic regression, we see a curve with a maximum at 50% overall retention (2). If the effect arises from "extra" deletion before consonants, it increases in proportion to the overall retention rate (3).
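The three predictions can be sketched as functions of a speaker's overall retention rate. This is only an illustration in Python: the effect sizes are invented, and for pattern (2) I assume that the overall rate sits midway, on the log-odds scale, between the pre-consonant and pre-vowel rates.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

def constant_percentage(retention, size=0.20):
    # (1) Same percentage-point difference whatever the overall rate.
    return size

def constant_log_odds(retention, size=1.0):
    # (2) Same log-odds difference, centered (by assumption) on the overall rate.
    centre = logit(retention)
    return inv_logit(centre + size / 2) - inv_logit(centre - size / 2)

def proportional_to_retention(retention, share=0.30):
    # (3) "Extra" deletion removes a fixed share of what was retained.
    return share * retention

for r in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(r,
          constant_percentage(r),
          round(constant_log_odds(r), 3),
          round(proportional_to_retention(r), 3))
```

Only pattern (2) is symmetric around 50%: the predicted difference at 20% retention is the same as at 80%, and smaller than at 50%.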

Comparing the four community studies leads to some interesting results. There is a lot of variation between speakers in each community. We already knew that speakers varied in their overall rates of deletion, but the ranges here are wide. In Philadelphia, the median deletion level is 51%, but the range extends from 22% to 71% (considering people with more than 50 tokens). In the Ohio (Buckeye) corpus, the median rate was lower, 41%, with an even larger range, 18% to 73%. The Torontonians (with less deletion) and the Bequians (with more) also varied widely.

We also observe that speakers within communities differ in the observed following-segment effect. For example, in Philadelphia there were two speakers with very similar overall deletion rates, but one deleted 94% of the time before consonants and only 6% before vowels, while the other had 76% deletion before consonants and 20% before vowels. In Ohio, the consonant-vowel effect was smaller overall, with at least as much between-speaker variation: one speaker produced 79% deletion _#C and 4% _#V, while another produced 51% deletion _#C and 38% _#V. While it would require a statistical demonstration, this amount of divergence is probably more than would be expected by chance. If this is the case, even within speech communities, then we may need to take more care to model speaker constraint differences (for example, with random slopes).

Excludes speakers with <10 tokens, preceding /n/ and following /t, d/.

What about differences between communities? Clearly, the average deletion rates of the four communities differ: Bequia, the Caribbean island, has the most t/d deletion, while Canada's largest city shows the least. The white American communities are intermediate. Such differences are to be expected. What is more interesting is that the largest absolute following-segment effect is found in Philadelphia, where the data is closest to 50% average deletion. Ohio and Toronto, with around 40% deletion, show a smaller effect, in percentage terms. Bequia, with average deletion of nearly 90%, shows no clear following-segment effect at all. These findings are consistent with the logistic interpretation. The effect may be constant - but on the log-odds scale. On the percentage scale, it appears greatest in Philadelphia, where the median speaker shows a difference of 91% _#C vs. 16% _#V. But an effect this large - almost 4 log-odds - should show up in Bequia, yet it does not (of course, the Bequia variety is quite distinct from the others treated here; could it lack this basic constraint?).
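To see why an effect of this size ought to be visible even at very high deletion rates, here is a rough Python calculation. It converts the Philadelphia median speaker's figures to log-odds and then applies the same effect to a hypothetical community at 90% overall deletion, assuming (my simplification) that the overall rate lies midway between the two contexts on the log-odds scale.

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def inv_logit(x):
    return 1 / (1 + math.exp(-x))

# Philadelphia median speaker: 91% deletion _#C vs. 16% _#V.
effect = logit(0.91) - logit(0.16)
print(round(effect, 2))  # just under 4 log-odds

# The same effect centered on 90% overall deletion (hypothetical).
centre = logit(0.90)
before_consonant = inv_logit(centre + effect / 2)
before_vowel = inv_logit(centre - effect / 2)
print(round(before_consonant, 3), round(before_vowel, 3))
```

Under these assumptions the pre-vowel rate would fall to a little over half, a gap that would be hard to miss.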

Within each community, the logistic model predicts the same thing: the closer a speaker is to 50% deletion overall, the larger the consonant-vowel difference should appear. The data suggest that this prediction is borne out, at least to a first approximation. In Philadelphia, Ohio, and Toronto, all the largest effects are found in the 40% - 60% range, and the smallest effects mostly occur outside that range. While the following-segment constraint differs across communities (larger in Philadelphia, smaller in Bequia), and probably across individual speakers, it seems to follow an inherent arch-shaped curve, similar to (2). A cubic approximation of this curve is superimposed below on the data from all four communities.

In conclusion, the evidence from four studies of t/d-deletion suggests that speaker effects and phonological effects combine additively on a logistic scale, supporting the standard variationist model. However, both rates and constraints can vary, not only between communities, but within them.

Thanks to Agata Daleszynska, Meredith Tamminga, and James Walker.

1. A diagonal also results from the "lexical exception" theory (Guy 2007), where t/d-deletion is bled when reduced forms are lexicalized, creating an' alongside and. When a word's underlying form may already be reduced, any contextual effects - like that of following segment - will be smaller in proportion. But if individual-word variation is part of the deletion process, we would expect the logistic curve (2) rather than the diagonal line (3).


Cedergren, Henrietta and David Sankoff. 1974. Variable rules: Performance as a statistical reflection of competence. Language 50(2): 333-355.

Cox, David R. 1970. The analysis of binary data. London: Methuen.

Cramer, Jan S. 2002. The origins of logistic regression. Tinbergen Institute Discussion Paper 119/4.

Fruehwald, Josef, Jonathan Gress-Wright, and Joel Wallenberg. 2009. Phonological rule change: the constant rate effect. Paper presented at North-Eastern Linguistic Society (NELS) 40, MIT.

Guy, Gregory R. 1980. Variation in the group and the individual: the case of final stop deletion. In W. Labov (ed.), Locating language in time and space. New York: Academic Press. 1-36.

Guy, Gregory R. 1991a. Explanation in variable phonology: an exponential model of morphological constraints. Language Variation and Change 3(1): 1-22.

Guy, Gregory R. 1991b. Contextual conditioning in variable lexical phonology. Language Variation and Change 3(2): 223-240.

Guy, Gregory R. 2007. Lexical exceptions in variable phonology. Penn Working Papers in Linguistics 13(2), Papers from NWAV 35.

Kay, Paul. 1978. Variable rules, community grammar and linguistic change. In D. Sankoff (ed.), Linguistic variation: models and methods. New York: Academic Press. 71-83.

Kay, Paul and Chad K. McDaniel. 1979. On the logic of variable rules. Language in Society 8(2): 151-187.

Labov, William. 1966. The social stratification of English in New York City. Washington, D.C.: Center for Applied Linguistics.

Labov, William. 1969. Contraction, deletion, and inherent variability of the English copula. Language 45(4): 715-762.

Labov, William et al. 2013. The Philadelphia Neighborhood Corpus of LING560 Studies.

Lim, Laureen T. and Gregory R. Guy. 2005. The limits of linguistic community: speech styles and variable constraint effects. Penn Working Papers in Linguistics 13.2, Papers from NWAVE 32. 157-170.

Meyerhoff, Miriam and James A. Walker. 2007. The persistence of variation in individual grammars: copula absence in ‘urban sojourners’ and their stay‐at‐home peers, Bequia (St. Vincent and the Grenadines). Journal of Sociolinguistics 11(3): 346-366.

Pitt, M. A. et al. 2007. Buckeye Corpus of Conversational Speech. Columbus, OH: Department of Psychology, Ohio State University.

Rousseau, Pascale and David Sankoff. 1978. Advances in variable rule methodology. In D. Sankoff (ed.), Linguistic Variation: Models and Methods. New York: Academic Press. 57-69.

Roy, Joseph. 2013. Sociolinguistic Statistics: the intersection between statistical models, empirical data and sociolinguistic theory. Proceedings of Methods in Dialectology XIV in London, Ontario.

Sankoff, David, Sali Tagliamonte, and Eric Smith. 2012. Goldvarb LION: A variable rule application for Macintosh. Department of Linguistics, University of Toronto.

Sankoff, David and William Labov. 1979. On the uses of variable rules. Language in Society 8(2): 189-222.

Sankoff, Gillian. 1973. Above and beyond phonology in variable rules. In C.-J. N. Bailey & R. W. Shuy (eds), New ways of analyzing variation in English. Washington, D.C.: Georgetown University Press. 44-61.

Silverstein, Michael. 2003. Indexical order and the dialectics of sociolinguistic life. Language & Communication 23(3-4): 193-229.

Tagliamonte, Sali A. 2011. Variationist sociolinguistics: change, observation, interpretation. Hoboken, N.J.: Wiley.

Sunday, July 7, 2013

Does Neg- vs. Aux-Contraction Vary Geographically In England? A Miniature Study

On Friday Sam Kirkham and I met some sixth-formers [high school students] and gave them an introduction to our department and sociolinguistics in general. We decided to take advantage of the opportunity and use the students as unpaid research assistants. We designed a small questionnaire that they could give to each other, and to people waiting for a bus or sitting on the [unusually] sunny steps of Alexandra Square. To illustrate north-south differences, we included a few questions about TRAP-BATH and FOOT-STRUT. We also had this item:

This alternation, which has all but disappeared from US English, involves a choice between so-called negative contraction (I haven't been) and auxiliary [or operator] contraction (I've not been). The auxiliary in question can be is, are, have, has, had, will, or would (Varela Pérez 2013: 257); in this study we are looking at a single instance with have, an environment that favors negative contraction, compared to is or are.

Peter Trudgill was the first sociolinguist to suggest a geographic correlation for this variable, claiming that auxiliary contraction increases "the further north one goes" (1978: 13). However, his early proclamations of this sort have not always survived later scrutiny. As another example, Hughes & Trudgill (1979: 25) stated that the particle verb alternation (pour out the tea vs. pour the tea out) also patterned along a north-south continuum, but this was not at all borne out in an experimental study involving 145 UK (and Irish) speakers (Haddican & Johnson 2012).

Regarding contraction, studies have indeed found either no clear geographic correlation (Anderwald 2002, Tagliamonte & Smith 2002), or a weak one in the opposite direction, meaning that southerners may slightly prefer auxiliary contraction (Gasparrini 2001). However, nothing approaching a dialectological study of this variable has ever been conducted. For example, the eight places studied by Tagliamonte & Smith are scattered and to some extent intentionally unrepresentative of UK speech. In the present miniature study, we will achieve far less depth about each location, but a wider geographical coverage (though certainly unrepresentative in its own way).

We obtained 52 responses to the question on contraction, associated with 36 places of origin in England. The distribution was as follows: 23 people said "I haven't been to Ireland" was "much better", 16 said it was "slightly better", 7 said the two alternatives were "equally good", 3 said "I've not been to Ireland" was "slightly better", and 3 said it was "much better". This overall strong preference for NEG-contraction with have is in line with the literature. A simple way to address the geographical question is to divide the responses into three categories - South, Midlands, and North - and compare the responses in each group. Although the measurement scale of the question is ordinal, we will assume linearity and assign numerical scores, ranging from 0 for judging NEG-contraction "much better", to 4 for judging AUX-contraction "much better".

Using Wikipedia's traditional definition of the Midlands to divide the regions led to average scores that are, at the very least, suggestive of a difference in line with Trudgill's original formula.

South (11 responses): 0.55
Midlands (8 responses): 0.63
North (33 responses): 1.21
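The scoring scheme and the regional averages can be cross-checked against each other. The Python sketch below uses only the counts and means reported above; the variable names are my own.

```python
# Score (0 = NEG-contraction "much better" ... 4 = AUX-contraction "much better")
# mapped to the number of respondents giving that answer.
counts = {0: 23, 1: 16, 2: 7, 3: 3, 4: 3}

n = sum(counts.values())
overall_mean = sum(score * k for score, k in counts.items()) / n
print(n, round(overall_mean, 2))

# Regional group sizes and (rounded) mean scores as reported.
regions = {"South": (11, 0.55), "Midlands": (8, 0.63), "North": (33, 1.21)}
weighted_mean = sum(size * mean for size, mean in regions.values()) / n
print(round(weighted_mean, 2))
```

The size-weighted average of the three regional means matches the overall mean up to rounding, as it should.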

If we combine the very similar South and Midlands regions and contrast their data with that from the North, it is initially unclear just how much evidence we have for a geographical difference. While a conventional t.test() returns a p-value of .03, the non-parametric wilcox.test() (or Mann-Whitney test, more appropriate here because the response is not only ordinal but quite skewed) gives p = .12, which would not be interpreted as statistically significant. However, we should also consider that none of the 19 respondents from the South and Midlands expressed a positive preference for AUX-contraction, while 6 of 33 Northern subjects did so. While dispreferred everywhere, AUX-contraction appears to be more acceptable in the North.
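The last observation can be given a rough probability. This is an extra back-of-the-envelope check, not one of the tests reported above: if the 6 respondents who preferred AUX-contraction were scattered at random among all 52, how likely is it that none of them would come from the 19 in the South and Midlands? This is a hypergeometric calculation, equivalent to a one-sided Fisher exact test; the function name is hypothetical.

```python
from math import comb

def prob_all_in_north(n_south_mid=19, n_north=33, n_aux=6):
    """Chance that every AUX-preferring respondent is Northern under random assignment."""
    total = n_south_mid + n_north
    return comb(n_north, n_aux) / comb(total, n_aux)

p_one_sided = prob_all_in_north()
print(round(p_one_sided, 3))  # about 0.054
```

On this reckoning the pattern is suggestive, but like the tests above it falls short of a decisive result.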

It is rarely a good idea to reduce a continuous variable to a set of discrete categories, and collapsing 36 distinct places into three regions is no exception, even though the division between North, Midlands and South has considerable historical precedent (the areas correspond roughly to the Northumbrian, Mercian, and Saxon kingdoms - and dialects - of the Old English period). If AUX-contraction really increases in a continuous manner "the further north one goes", then an analysis that treats latitude as a continuous variable will be more successful in revealing the effect. Incorporating the dimension of longitude as well, though it makes the statistics more complex, is potentially even more revealing.

The place names given by the respondents (usually cities or towns, sometimes counties) were entered into an online geocoder to obtain their latitudes and longitudes. There are many R packages (as well as other software) that could produce a map of this data; some options are described here. I found an outline map of England here, intended for use with the sp package, but I plotted it with ordinary 'base' R graphics (since I have yet to learn ggplot2, I do not know how to produce maps like this!). The only commands used for this map are plot(), points(), cluster.overplot() to separate the responses from the same place, and legend().

A basic spatial statistic called Moran's I is often run to establish whether the data show global spatial autocorrelation. Like any correlation, Moran's I can range from -1 to +1. A value of 0 would reflect a random spatial distribution of high and low values (dark and light points). A positive value means that similar values tend to cluster together, while a negative value means that high and low values are interspersed more than would be expected by chance (imagine the black and white squares on a chessboard). The statistic depends on a matrix of spatial weights; for example, all points within a certain distance could be considered neighbors, or the closest k points regardless of distance. Other, more gradual criteria can also be applied (see here and also Grieve et al. 2011).
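For readers who want to see what the statistic involves, here is a bare-bones Python sketch of Moran's I with k-nearest-neighbor weights. It is not the code used for this post: the function names and toy data are invented, distances are plain Euclidean (real latitudes and longitudes would call for great-circle distances), and no significance test is included.

```python
import math

def knn_weights(coords, k):
    """Binary weights: w[i][j] = 1 if j is among the k points nearest to i."""
    n = len(coords)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(coords[i], coords[j]))
        for j in others[:k]:
            w[i][j] = 1.0
    return w

def morans_i(values, w):
    """Moran's I: (n / sum of weights) * weighted cross-products / sum of squares."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in w)
    cross = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    return (n / total_weight) * cross / sum(d * d for d in dev)

# Toy data: two tight clusters with matching scores inside each cluster.
clustered_xy = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
clustered = morans_i([0, 0, 0, 4, 4, 4], knn_weights(clustered_xy, k=2))

# Toy data: evenly spaced points with alternating scores.
line_xy = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (5, 0)]
alternating = morans_i([0, 4, 0, 4, 0, 4], knn_weights(line_xy, k=2))

print(clustered, alternating)  # positive for clustering, negative for alternation
```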

I decided, somewhat arbitrarily, to use 5-nearest-neighbors as the threshold. If responses, on average, are more similar to their 5 nearest neighbors than to responses further away, then Moran's I should be positive. In fact, Moran's I is -0.102, which is associated with a p-value of .27. This means that the responses favoring AUX-contraction and NEG-contraction are not clustered, but in fact almost random in their spatial patterning. This conclusion is disappointing! On the bright side, a lack of spatial autocorrelation means that an ordinary regression can be performed with less fear of error. But an lm() model with latitude as a predictor is also not statistically significant (p = .27). Of course, such a model implies a gradual effect of latitude which to some extent goes against the idea of coherent dialect regions.

If a linguistic feature has wide variability in every community, then it is possible that global spatial autocorrelation will be low - especially with a small number of respondents - even though an overall geographical difference may exist. As this is a miniature study, we cannot pursue the debate further but can only note that if a small amount of crude data collected one afternoon in Lancaster can provide this much information, a larger collection effort could likely settle the question once and for all as to whether the preferred means of contraction has a geographic component. We will conclude by using the method of generalized additive modeling (mgcv package) to create a smoothed map of contraction preference.

Based on this plot, we would think that contraction varies geographically! But geographic patterns, like other types, can certainly arise by chance. To solve this question would require a dialectological investigation - that is, one conducted at many places. But the data collected on Friday, in a few hours, by sixth form students, restores some faith in Peter Trudgill's conjecture, which may have been dismissed too hastily by linguists.


Anderwald, Lieselotte. 2002. Negation in Non-standard British English: Gaps, Regularizations, Asymmetries. London: Routledge.

Gasparrini, Désirée. 2001. It isn’t, it is not or it’s not? Regional Differences in Contraction in Spoken British English. Master’s thesis. University of Zürich.

Grieve, Jack, Dirk Speelman and Dirk Geeraerts. 2011. A statistical method for the identification and aggregation of regional linguistic variation. Language Variation and Change 23: 193-221.

Haddican, Bill and Daniel Ezra Johnson. 2012. Effects on the Particle Verb Alternation across English Dialects. University of Pennsylvania Working Papers in Linguistics 18(2): Article 5.

Hughes, Arthur and Peter Trudgill. 1979. English Accents and Dialects: An Introduction to Social and Regional Varieties of British English. London: Edward Arnold.

Tagliamonte, Sali and Jennifer Smith. 2002. 'Either it isn’t or it’s not': NEG/AUX Contraction in British Dialects. English World-Wide 23(2): 251-281.

Trudgill, Peter. 1978. Sociolinguistic patterns in British English. London: Edward Arnold.

Varela Pérez, José Ramón. 2013. Operator and negative contraction in spoken British English: a change in progress. In Bas Aarts, Joanne Close, and Geoffrey Leech (eds.), The Verb Phrase in English: Investigating Recent Language Change With Corpora. Cambridge University Press. 256-285.

Sunday, June 23, 2013

What if my variable has more than two variants?

Since the dawn of time, sociolinguists have used logistic regression (or something very similar) to estimate the factors affecting binary linguistic variables. And linear regression has occasionally been applied when the dependent variable is a continuous numeric measurement. For the most part, however, variables with k categories (where k > 2) have been avoided, or approached by performing k-1 separate logistic regressions, a workaround that, while not ideal, makes a lot of sense for some variables.

Models specifically designed for multi-categorical (or "polytomous") responses do exist: if the k categories have a meaningful order, then some form of ordinal logistic regression may be appropriate. Otherwise, we can use a more general method, multinomial logistic regression, which basically optimizes k-1 binary models simultaneously. The R function mlogit() fits models of this type; its use is somewhat different from lm() or glm().
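The "k-1 binary models" idea can be illustrated in Python. In a multinomial logit model, each non-reference category gets its own linear predictor, interpreted as log-odds relative to the reference category, and the k probabilities follow from those k-1 numbers. This sketch shows only that last step, with invented values; it does not fit a model and is not how mlogit() is called.

```python
import math

def multinomial_probs(linear_predictors):
    """Turn k-1 log-odds (each vs. the reference category) into k probabilities."""
    exps = [math.exp(eta) for eta in linear_predictors]
    denom = 1 + sum(exps)
    # The reference category comes first, then the others in order.
    return [1 / denom] + [e / denom for e in exps]

# Hypothetical predictors: B is 2.5 times as likely as reference A, C is 1.5 times.
probs = multinomial_probs([math.log(2.5), math.log(1.5)])
print([round(p, 2) for p in probs])
```

With these inputs the three variants come out at roughly 20%, 50% and 30%, and the probabilities always sum to 1.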

Over the past ten years, many sociolinguists have realized the advantages of mixed-effects regression models. Estimates of between-speaker properties, including many of the traditional "social factors", cannot be accurate if the grouped - and often imbalanced - structure of the data is ignored. The same applies to individual words, where achieving balance in spontaneous speech is never possible, due to Zipf's Law. Mixed models can distinguish between group and individual influences, giving better estimates of effect size as well as statistical significance. The R function lmer() is a popular way to fit these models, but it does not perform multinomial logistic regression.

Mixed-effects multinomial logistic regression has been implemented in SAS (using the NLMIXED procedure) and in Stata (with the gllamm module). Some R packages with a Bayesian orientation can also fit such models, such as bayesm and MCMCglmm. Indeed, even mlogit() now advertises this functionality, although learning it does not appear to be entirely straightforward, at least from the point of view of an lmer() user.

However, if you are working with data that is well-balanced across speakers, and if the variables you're most interested in are all between-speaker variables, there is another option that might be appealing. This is the method known as CoDa, or compositional data analysis. If we aggregate the observations for each speaker (e.g. 20% variant A, 50% variant B, 30% variant C), we get a set of proportions that add up to 1; this is a composition. Each speaker produces a different composition, and we want to identify the factors most responsible for those differences.

Although compositional data is numeric, we can't perform (multivariate) linear regression on these numbers directly; the fact that each composition sums to 1 would introduce spurious negative correlations. Instead, CoDa transforms the k compositional parts to k-1 coordinates using log-ratios. Three variants can become two coordinates in several ways. One popular method makes the first coordinate axis proportional to log (A * B / C²), that is, to the log-ratio of the geometric mean of A and B to C, and the second axis proportional to log (B / A). We can then carry out various statistical operations on the transformed data - including linear regression and hypothesis testing.
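Here is a small Python sketch of one such transformation for three parts, assuming the usual isometric log-ratio form in which the first coordinate is proportional to log(sqrt(A × B) / C) and the second to log(B / A). The function names and scaling constants are my own choices for illustration, not taken from any particular CoDa package.

```python
import math

def ilr3(a, b, c):
    """Map a three-part composition to two log-ratio coordinates."""
    z1 = math.sqrt(2 / 3) * math.log(math.sqrt(a * b) / c)  # A and B together vs. C
    z2 = math.sqrt(1 / 2) * math.log(b / a)                 # B vs. A
    return z1, z2

def ilr3_inverse(z1, z2):
    """Map two coordinates back to proportions that sum to 1."""
    u = math.sqrt(2) * z2      # log(B / A)
    v = math.sqrt(3 / 2) * z1  # log(sqrt(A * B) / C)
    parts = [math.exp(-u / 2), math.exp(u / 2), math.exp(-v)]
    total = sum(parts)
    return tuple(p / total for p in parts)

# A speaker with 20% variant A, 50% variant B, 30% variant C.
z = ilr3(0.2, 0.5, 0.3)
print(z)
print(ilr3_inverse(*z))  # recovers the original proportions

# Only ratios matter: raw counts in the same proportions give the same coordinates.
print(ilr3(20, 50, 30))
```

Because the coordinates are unconstrained real numbers, ordinary regression can be run on them, and fitted values can be mapped back to the triangle with the inverse function.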

Because of this choice of coordinates, the results of regression are interpretable in the transformed space (with the x-axis, for example, showing the relationship between C and the other two variants, and the y-axis showing the relationship between A and B). Alternatively, the results can be transformed back into the original compositional space; a triangular "ternary diagram" is a useful display for this, especially in the case of three variants.

The diagrams below are just an example of how we can apply linear regression using a single predictor - age - with three-part compositional data on coda /r/ pronunciation in two Scottish towns. One of the variants, a tap or trill, is used mainly by older speakers. However, in Eyemouth, the apparent-time trend is towards an approximant realization, while in Gretna a zero variant is the norm for younger speakers. The two graphs are equivalent, but only using the logratio-transformed representation (on the right) can we test the significance of both changes in progress.

Multivariate F-tests (with Pillai traces) show that both communities are undergoing significant change. Univariate tests show that in Eyemouth, only one dimension of the change, that from tap/trill to approximant, is significant. In Gretna, it is the other dimension, from tap/trill/approximant to zero, that shows a significant change over time.


Croissant, Yves. Estimation of multinomial logit models in R: the mlogit packages.

Egozcue, Juan José et al. 2012. Simplicial regression: the normal model. Journal of applied probability and statistics 6(1-2): 87-108.

Gorman, Kyle and Daniel Ezra Johnson. 2013. Quantitative analysis. In Bayley, Cameron and Lucas (eds.), The Oxford Handbook of Sociolinguistics, 214-240.

Llamas, Carmen et al. 2008. Variable /r/ use along the Scottish-English border. Poster presented at NWAV 37.

Pawlowsky-Glahn, Vera et al. 2011. Lecture notes on compositional data analysis.

Saturday, March 9, 2013

To tweet, or not to tweet?

People have had to choose between different forms of communication ever since the first cavewoman asked herself, will he know what I mean if I paint two horses on the living room wall, or should I just tell him how I feel? Later times brought new dilemmas: hieroglyphs or demotic? Latin or the vernacular? Telegram or phone call? Email or text? And recently, Facebook or Twitter?

As a devoted Facebook user since February 13, 2005 (according to Timeline), and a newcomer to Twitter - not counting a period in 2009 when I had a locked profile and mainly exchanged messages with two friends who no longer talk to me in any medium - I began with a strong bias. And while I started to learn a lot from celebrities, stripper intellectuals, and brilliantly offensive amateur comedians, in terms of my own self-expression it seemed that Facebook still offered everything Twitter did, and more. Fine, no one appreciated the poke as much as I did, but I still wondered why anyone would want to restrict themselves to 140 characters (and less privacy) when they didn't have to?

I'm still not completely sure (and I still prefer Facebook), but I've come to understand that Facebook and Twitter are like men and women, or meth and coke: some people do one, some people do both, and some people do neither - and these are all legitimate choices. Some find Facebook to be "a shitty boring claustrophobic dollhouse", while others consider Tweets no more than "bursts of mental flatus". I can understand both comments, but I continue to enjoy both platforms.

But what to post on Twitter and what to post on Facebook? That is the question - even if no one answered it when I posted it on-line. The easy cop-out is to have your Twitter posts (or some of them) copied automatically to Facebook. This can be obnoxious because people who read you in both places will have to read it twice, and it's inefficient because any comments or discussion will end up split in half. But if your friends and followers don't overlap that much, it can be OK.

If your friends and followers don't overlap much, though, it's probably less likely that you want to say the exact same thing to both groups. So you can make a choice and tell either your friends or your followers what you want to tell them, or what they want to hear, or post wherever you think you'll get more comments, or likes/faves/retweets, or followers, or pokes, or piss the most people off, or whatever you feel compelled to do.

Having spent a lot of time in a small, friendly academic subfield, my circles overlap a lot. So I asked a colleague (and friend and Facebook friend and Twitter follower) how he resolves this dilemma himself. He said that on Twitter, he's more careful about what he says - it's public, after all - and for posts that might cause mild offense, he considers Facebook. I thought that was a brilliant solution.

Then I realized I had been pursuing the opposite strategy. Or my version of the opposite: if I was quite sure that something would be widely (or deeply) offensive, I'd deliberately choose Twitter. For the rest of my opinions, most of which are somewhat offensive too? Facebook. This way, I thought that two sides of my personality could find separate outlets. I was creating separate personas out of my language use, just like I had been taught. But on second thought, wouldn't my friends who also follow me think I was even more of an asshole than they already probably did?

Maybe I should just post to Google Plus. Or go paint some horses.