Tuesday, October 1, 2013

On Exactitude In Science: A Miniature Study On The Effects Of Typical And Current Context

In that Empire, the Art of Cartography attained such Perfection that the map of a single Province occupied the entirety of a City, and the map of the Empire, the entirety of a Province. In time, even these immense Maps no longer satisfied, and the Cartographers' Guilds surveyed a Map of the Empire whose Size was that of the Empire, and which coincided point for point with it. The following Generations, less addicted to the Study of Cartography, realized that that vast Map was useless, and not without some Pitilessness delivered it up to the Inclemencies of Sun and Winter. In the Deserts of the West, there remain tattered Ruins of that Map, inhabited by Animals and Beggars; in all the Land there is no other Relic of the Disciplines of Geography.  (J. L. Borges)

A scientific model makes predictions based on a number of variables and parameters. The more complex the model, the more accurate its predictions. But all things being equal, a simpler model is preferred. As Newton put it: "We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances."

Exemplar Theory makes predictions about the present or future based on an enormous amount of stored information about the past. For example, a speaker is said to pronounce a word by taking into account the thousands of ways he or she has produced and heard it before. If such feats of memory are possible - I ain't sayin' I don't believe Goldinger 1998 - we should not be surprised by the accuracy of models that rely on them. And if language can be shown to rely on them, so be it. But the abandonment of parsimony, in the absence of clear evidence, should be resisted. (See the comment by S. Brown below for an alternative view of this issue.)

The same phenomena can often be accounted for "equally well" by a more deductive traditional theory or by a more inductive, bottom-up approach. The chemical elements were assigned to groups (alkali metals, halogens, etc.) because of their similar physical properties and bonding behavior long before the nuclear basis for these similarities was discovered. In biology, the various taxonomic branchings of species can be thought of as a reflection and continuation of their historical evolution, but the differences exist on a synchronic level as well - in organisms' DNA.

In classical mechanics, if an object begins at rest, its current velocity can be determined by integrating its acceleration over time: v(t) = ∫ a(t) dt. By storing the details of acceleration at every point along a trajectory, we can always recover the present velocity. If we ride a bus with our eyes closed and our fingers in our ears, we can estimate our present speed as long as we remember everything about our past accelerations.

A free-falling object, under the constant acceleration of gravity, g, has velocity v(t) = g · t. But a block sliding (without friction) down a ramp made up of various curved and angled sections, like a roller coaster, has an acceleration that changes with time. The acceleration at any moment is proportional to the sine of the slope angle of the ramp, θ. Integrating, v(t) = g · ∫ sin(θ(t)) dt.

On a simple inclined plane, the angle is constant, so the acceleration is too. The velocity increases linearly: like free-fall, only slower. If the shape of the ramp is complicated, solving the integral of acceleration can be very difficult. (It might be beyond the capacity of the brain to calculate - but on a real roller coaster, we don't have to remember the past ten seconds to know how fast we are going now! We use other cues to accomplish that.)

But setting integration aside, we can solve for velocity in another way, showing that it depends only on the vertical height fallen: v = sqrt(2 · g · h). Obviously this is simpler than keeping track of complex and changing accelerations over time. This equation, rewritten as 1/2 · v² = g · h, also reflects the balance between kinetic and potential energy, one part of the important physical law of conservation of energy. Instead of a near-truism with a messy implementation, we have an elegant and powerful principle.
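To make the contrast concrete, here is a minimal numerical sketch in Python, with a made-up ramp profile (nothing here comes from a real roller coaster): it integrates a = g · sin(θ) step by step - the "remember every acceleration" route - then checks the result against the energy shortcut v = sqrt(2 · g · h).

```python
import math

g = 9.81  # gravitational acceleration, m/s^2

def theta(s):
    """Slope angle (radians) at distance s along a made-up curvy ramp."""
    return 0.5 + 0.4 * math.sin(s)

# Route 1: integrate the acceleration step by step
dt = 1e-4
s = v = h = 0.0   # distance along ramp, speed, vertical drop
while s < 5.0:    # slide 5 m along the ramp
    v += g * math.sin(theta(s)) * dt   # dv = g * sin(theta) * dt
    ds = v * dt
    h += math.sin(theta(s)) * ds       # dh = sin(theta) * ds
    s += ds

# Route 2: the energy shortcut needs only the total height fallen
v_energy = math.sqrt(2 * g * h)
print(v, v_energy)  # the two values agree closely
```

The second route throws away the entire acceleration history and keeps one number, the height fallen, yet lands on the same answer.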

Both expressions for velocity fit the same data, but to call the second an "emergent generalization", à la Bybee 2001, ignores its universality and demotivates the essential next step: the search for deeper explanations.

Admittedly, this physical allegory is unlikely to convince any exemplar theorists to recant. But we should realize that because ET is so powerful yet so unexplanatory, any correct predictions it makes do not constitute real evidence in its favor. We need to determine whether the data also support an alternative theory, or at least find places where the data are more compatible with a weaker version of ET than with a stronger one.

A recently-discussed case suggests that with respect to phonological processes, individual words are not only influenced by their current contexts, but also, to a lesser degree, by their typical contexts (Guy, Hay & Walker 2008; Hay 2013). This is one of several recent studies by Hay and colleagues that show widespread and principled lexical variation, well beyond the idiosyncratic lexical exceptionalism sometimes acknowledged in the past, e.g. "when substantial lexical differences appear in variationist studies, they appear to be confined to a few lexical items" (Guy 2009).

The strong-ET interpretation is that all previous pronunciations of a word are stored in memory, and this gives us the typical-context distribution for each word. But if this is the case, the current-context effect must derive from something else: either from mechanical considerations or from analogy to other words. It can't also reflect sampling from sub-clouds of past exemplars, because that would cancel out the typical-context effect.

For words to be stored along with their environments is actually a weak version of word-specific phonetics (Pierrehumbert 2002). It is not that words are explicitly marked to behave differently; they only do so because of the environments they typically find themselves in. For Yang (2013: 6325), "these use asymmetries are unlikely to be linguistic but only mirror life." But whether they mirror life or reflect language-internal collocational structures, these asymmetries are not properties of individual words.

Under this model of word production - sampling from the population of stored tokens, then applying a constant multiplicative contextual effect - we observe the following pattern (in this case, the process is t/d-deletion, as in Hay 2013; the parameters are roughly based on real data):

Exemplar Model: Contextual Effect Greatest When Pool Is Least Reduced

This pattern has two main features: as words' typical contexts favor retention more, retention rates increase linearly both before V and before C; and the gap between the two environments widens. From the Exemplar Theory perspective, when the pool of tokens contains mainly unreduced forms, the differences between constraints on reduction can be seen more clearly. But when many of the tokens in the pool are already reduced, the difference between pre-consonantal and pre-vocalic environments appears smaller. Such reduced constraint sizes are the natural result when a process of "pre-deletion" intersects with a set of rules or constraints that apply later, as discussed in this post, and in a slightly different sense in Guy 2007.
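A toy version of that exemplar model makes the arithmetic plain. The deletion rates below are hypothetical placeholders, not the parameters behind the figure: a sampled token that is already reduced stays reduced, and an unreduced one is deleted at a context-dependent rate.

```python
# p: proportion of a word's stored exemplars that retain t/d
# (its typical-context history). A sampled unreduced token is then
# deleted with a context-dependent probability; deletion is assumed
# likelier before consonants. Both rates are invented for illustration.
DEL_V = 0.15   # assumed deletion rate in pre-vocalic context
DEL_C = 0.45   # assumed deletion rate in pre-consonantal context

rows = []
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    ret_v = p * (1 - DEL_V)   # surface retention before vowels
    ret_c = p * (1 - DEL_C)   # surface retention before consonants
    rows.append((p, ret_v, ret_c, ret_v - ret_c))
    print(f"pool={p:.1f}  _V={ret_v:.2f}  _C={ret_c:.2f}  gap={ret_v - ret_c:.2f}")
# The gap grows linearly with p: the contextual effect is most
# visible when the pool is least reduced.
```

The gap between environments is p · (DEL_C - DEL_V), so it widens steadily as the pool becomes less reduced - the signature pattern of the multiplicative exemplar model.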

An alternative to storing every token is to say that words acquire biases from their contexts, and that these biases become properties of the words themselves. The source of a bias could be irrelevant to its representation - one word typically heard before consonants, another typically heard from African-American speakers, and another typically heard in fast speech contexts could all be marked for "extra deletion" in the same way.

From the point of view of parsimony, this is appealing. To figure out how a speaker might pronounce a word, the grammar would have to refer to a medium-sized list of by-word intercepts, but not search through a linguistic biography thick and complex enough to have been written by Robert Caro.

But the theoretical rubber needs to hit the empirical road, or else we are just spinning our wheels here. So, compared to the Stor-It-All model, does a stripped-down word-intercept approach make adequate predictions, or - dare we hope - even better ones? Are the predictions even that different?

If we assume that for binary variables, by-word intercepts (like by-speaker intercepts) combine with contextual effects additively on the log-odds scale (which seems more or less true), we obtain a pattern like this:

Intercept Model: Typical Context Combines W/ Constant Current Context

Although the two figures are not wildly different, we can see that in this case, there is no steady separation of the _V and _C effects as overall retention increases. The following-segment effect is constant in log-odds (by stipulation), and this manifests as a slight widening near the center of the distribution. The effects of current context and typical context are independent in this model, as opposed to what we saw above.
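The intercept model can be sketched the same way. Again the numbers are hypothetical, not fitted values: each word contributes an intercept, the following segment contributes a constant log-odds effect, and the two combine additively before being mapped back to probabilities.

```python
import math

def logistic(x):
    """Map log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

BETA_V = 0.8   # assumed log-odds bonus for retention before vowels

rows = []
for b_word in (-2, -1, 0, 1, 2):   # hypothetical by-word intercepts
    ret_c = logistic(b_word)            # retention before consonants
    ret_v = logistic(b_word + BETA_V)   # retention before vowels
    rows.append((b_word, ret_v, ret_c, ret_v - ret_c))
    print(f"intercept={b_word:+d}  _V={ret_v:.2f}  _C={ret_c:.2f}  gap={ret_v - ret_c:.2f}")
# The log-odds gap is constant, so in probability space the gap is
# widest near the middle of the range and shrinks at both extremes.
```

Unlike the exemplar sketch above, the gap here does not keep widening: it peaks near the center and narrows toward both floor and ceiling.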

As usual, the Buckeye Corpus (Pitt et al. 2007) is a good proving ground for competing predictions of this kind. The Philadelphia Neighborhood Corpus has a similar amount of t/d coded (with more coming soon). Starting with Buckeye, I only included tokens of word-final t/d that were followed by a vowel or a consonant. I excluded all tokens with preceding /n/, in keeping with the sociolinguistic lore, "Beware the nasal flap!" I then restricted the analysis to words with at least 10 total tokens each - and excluded the word "just", because it had about eight times as many tokens as any of the other words. I was left with 2418 tokens of 69 word types.
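For the record, those exclusions amount to a simple filter. The sketch below uses an invented token format (word, preceding segment, following-segment class) and a tiny synthetic token list, not the actual Buckeye field names or data:

```python
from collections import Counter

# Hypothetical tokens: (word, preceding_segment, following_class),
# where following_class is "V" (vowel), "C" (consonant), or "P" (pause).
tokens = [
    ("west", "s", "V"), ("west", "s", "C"),
    ("went", "n", "V"),    # excluded: preceding /n/ ("Beware the nasal flap!")
    ("least", "s", "P"),   # excluded: not followed by a vowel or consonant
    ("just", "s", "C"),    # excluded: vastly overrepresented word
] * 10

# Keep only pre-V and pre-C tokens; drop preceding /n/ and "just"
kept = [t for t in tokens
        if t[2] in ("V", "C") and t[1] != "n" and t[0] != "just"]

# Require at least 10 tokens per word type
counts = Counter(t[0] for t in kept)
kept = [t for t in kept if counts[t[0]] >= 10]

print(len(kept), sorted(set(t[0] for t in kept)))
```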

Incidentally, there is no significant relationship between a word's overall deletion rate and its (log) frequency, whether the frequency measure is taken from the Buckeye Corpus itself (p = .15) or from the 51-million-word corpus of Brysbaert et al. 2013 (p = .37). The absence of a significant frequency effect on what is arguably a lenition process goes against a key tenet of Exemplar Theory (Bybee 2000, Pierrehumbert 2001), but the issue of frequency is not our main concern here.

I first plotted two linear regression lines, one for the pre-vocalic environments and one for the pre-consonantal environments. The regressions were weighted according to the number of tokens for each word. I then tried a quadratic rather than a linear regression. However, these curves did not provide a significantly better fit to the data - p(_V) = .57, p(_C) = .46 - so I retreated to the linear models. The straight lines plotted below look parallel; in fact the slope of the _V line is 0.301 and the slope of the _C line is 0.369. Since the lines converge slightly rather than diverging markedly, this data is less consistent with the exemplar model sketched above, and more consistent with the word-intercept model.

Buckeye Corpus: Parallel Lines Support Intercept Model, Not Exemplars
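The weighted fits themselves were done with standard statistical software, but for concreteness, here is what a token-weighted least-squares line amounts to, checked on synthetic points (the numbers are made up, not the Buckeye estimates):

```python
def weighted_linear_fit(x, y, w):
    """Weighted least-squares fit of y = a + b*x; weights w are token counts."""
    W = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / W
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / W
    b = (sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
         / sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x)))
    a = ybar - b * xbar
    return a, b

# Synthetic check: points on the line y = 0.2 + 0.35*x recover it exactly
x = [0.1, 0.3, 0.5, 0.7, 0.9]       # e.g. proportion pre-vocalic
y = [0.2 + 0.35 * xi for xi in x]   # e.g. retention rate
w = [12, 40, 25, 18, 10]            # token counts per word
a, b = weighted_linear_fit(x, y, w)
print(a, b)
```

Weighting by token count simply keeps a word with 150 tokens from counting the same as a word with 10 when the line is fit.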

One way to improve this analysis would be to use a larger corpus, at least for the x-axis, to more accurately estimate the proportion that a given word ending in t/d is followed by a vowel rather than a consonant. For example, the spoken section of COCA (Davies 2008-) is about 250 times larger than the Buckeye Corpus. Of course, for a few words the estimate from the local corpus might better represent those speakers' biases.

Turning finally to data from the Philadelphia Neighborhood Corpus, we see a fairly similar picture. Note that some of the words' left-right positions differ noticeably between the two studies. The word "most", despite having 150-200 tokens, occurs before a vowel 75% of the time in Philadelphia, but only 52% of the time in Ohio. It is hard to think what this could be besides sampling error, but if it is that, it casts some doubt on the reliability of these results, especially as most words have far fewer tokens.

Philadelphia Neighborhood Corpus: Convergence, Not Exemplar Prediction

Regarding the regression lines, there are two main differences. First, Philadelphia speakers delete much more before consonants than Ohio speakers, while there is no overall difference before vowels. This creates the greater following-segment effect previously noted for Philadelphia.

The second difference is that in Philadelphia, a word's typical context seems to barely affect its behavior before vowels. The slope before consonants, 0.317, is close to those observed in Ohio, but the slope before vowels is only 0.143 - not significantly different from zero (p = .14). Recall that under the exemplar model, the _V slope should always be larger than the _C slope; words almost always occurring before vowels - passed, walked, talked - should provide a pool of pristine, unreduced exemplars upon which the effects of current context should be most visible.

I have no explanation at present for the opposite trend being found in Philadelphia, but it is clear that neither the PNC data nor the Buckeye Corpus data show the quantitative patterns predicted by the exemplar theory model. This, and a general preference for parsimony - in storage, and arguably in computation (again, see S. Brown below) - points to typical-context effects being "ordinary" lexical effects. "[We] shall know a word by the company it keeps" (Firth 1957: 11), but we still have no reason to believe that the word itself knows all the company it has ever kept. And to find our way forward, we may not need a map at 1:1 scale.

Thanks: Stuart Brown, Kyle Gorman, Betsy Sneller, & Meredith Tamminga.


Borges, Jorge Luis. 1946. Del rigor en la ciencia. Los Anales de Buenos Aires 1(3): 53.

Brysbaert, Marc, Boris New and Emmanuel Keuleers. 2013. SUBTLEX-US frequency list with PoS information final text version. Available online at http://expsy.ugent.be/subtlexus/.

Bybee, Joan. 2000. The phonology of the lexicon: evidence from lexical diffusion. In Michael Barlow and Suzanne Kemmer (eds.), Usage-based models of language. Stanford: CSLI. 65-85.

Bybee, Joan. 2001. Phonology and language use. Cambridge Studies in Linguistics 94. Cambridge: Cambridge University Press.

Davies, Mark. 2008-. The Corpus of Contemporary American English: 450 million words, 1990-present. Available online at http://corpus.byu.edu/coca/.

Firth, John R. 1957. A synopsis of linguistic theory, 1930-1955. In Studies in Linguistic Analysis, Special volume of the Philological Society. Oxford: Basil Blackwell.

Goldinger, Stephen D. 1998. Echoes of echoes? An episodic theory of lexical access. Psychological Review 105(2): 251-279.

Guy, Gregory. 2007. Lexical exceptions in variable phonology. Penn Working Papers in Linguistics 13(2), Papers from NWAV 35, Columbus.

Guy, Gregory. 2009. GoldVarb: Still the right tool. NWAV 38, Ottawa.

Guy, Gregory, Jennifer Hay and Abby Walker. 2008. Phonological, lexical, and frequency factors in coronal stop deletion in early New Zealand English. LabPhon 11, Wellington.

Hay, Jennifer. 2013. Producing and perceiving "living words". UKLVC 9, Sheffield.

Pierrehumbert, Janet. 2001. Exemplar dynamics: word frequency, lenition and contrast. In Joan Bybee and Paul Hopper (eds.), Frequency and the emergence of linguistic structure. Amsterdam: John Benjamins. 137-157.

Pierrehumbert, Janet. 2002. Word-specific phonetics. Laboratory Phonology 7. Berlin: Mouton de Gruyter. 101-139.

Pitt, Mark A. et al. 2007. Buckeye Corpus of Conversational Speech. Columbus: Department of Psychology, Ohio State University.

Yang, Charles. 2013. Ontogeny and phylogeny of language. Proceedings of the National Academy of Sciences 110(16): 6324-6327.


  1. (1 of 2) You've commented enough on my various posts, so it's time to return the favour. This isn't, of course, my area, and I don't really know the literature. I'm probably more sympathetic than you are to Exemplar Theory, although I share some of your concerns. But I do have a couple of points, mainly oriented around your use of the principle of parsimony.
    Firstly, I think the comparison with parsimony in physical explanations is a no-goer. Cognitive systems are necessarily evolved, and it is in the nature of evolution that an inefficient system which is just good enough is preserved. That inefficient system may even be repurposed at a subsequent stage of evolution for another use, for which it is even more inefficient. The point is, if it does the job at all, then nothing will evolve to do the job instead. As Dawkins puts it, you cannot climb down the slopes of Mount Improbable. Whatever cognitive processes preceded the language function will not necessarily have been fully efficient for their own purposes, and may have become even more inefficient as they were repurposed for linguistic use. This, to me, is one of the spectacular oversights of a large (for which read Chomskyist) portion of theoretical linguistics: the presumption that parsimony or efficiency is a suitable criterion with which to rationalise cognitive function independent of (or even contrary to) evidence is almost creationist in its perspective on the structure of the human mind.

    1. In this post (which is sort of pre-published actually), I guess I didn't end up making an argument from non-parsimony but rather tried to show that the wrong empirical predictions were made. The issue about parsimony was something I'd been thinking about before: if you have a model that is complex and flexible enough to predict basically whatever data is observed, then how can you argue against it? Suppose I say that we store detailed exemplars of every time we've ever said "good morning" and also remember the associated time, temperature, humidity, and the eye color of the interlocutor. I hope you agree that eventually the storage task becomes implausible, but how can you ever show that it's not happening?

    2. This is the problem of all social science. Proper, hard science can get away with being entirely empirical. The Feynman “shut up and calculate” school of instrumentalism. Of course, it's more interesting to try and be a realist about hard science, but the roads it's currently pointing down lead us into such a conceptual quagmire that I do sympathise with those who just wish to model the phenomena, damn whether quantum mechanics requires locality at the expense of causality, or vice versa. However, in hard science, the principle of efficiency in terms of reducing the number of variables seems reasonable. You may shave away at physics as much as you please with Occam's razor. My point was that it is a false comparison to want to do that with evolved, and therefore necessarily inefficient, systems. We must put that particular ontological blade away, to a great extent.
      As in social science this instrumentalism doesn't work, we have the constant problem that we do not know which variables are confounding or not. A possible solution to this is to mix the empirical studies with a rationalised, realist theory that rules out certain variables (such as eye colour, or time of day). My beef firstly is that, to my mind, the majority of realist theories currently postulated for linguistics are largely barking up the wrong tree in that they are (however modified from the original generative view) atomistic: they presuppose all utterances to be pieced together in the same way (whatever that way is) from the same level of lexical unit regardless of whether or not the individual locutor has used them 1000 times or zero times before, and that (as I previously said) they privilege the individual utterance over our total communicative behaviour; and secondly that either, per Noam, there is too much emphasis on those theories, their supposed intuitive precision, and damn the evidence, or, per any number of postgraduate socio-linguistic studies (my own probably not excluded), too much of a tendency to be content with aggregating a bunch of data, performing a few regressions, and going ta-da! The kind of thing you've done here we need more of, and in a more structured way. Jen's PhD was very good along these lines as well. And, indeed, Greg seems to me to be a Guy who is very much in the right place regarding the interface of empiricism and rational theorising. It would be lovely to think I could contribute something substantive to this methodological problem, and more than just random critiques on blogs. But I'm far too late to this game, by the time I got up on all the literature it would be time to retire. Plus, apparently, I can't do any of that study shit without bleeding out my eyes anyway.

    3. Re: this part: "atomistic: they presuppose all utterances to be pieced together in the same way (whatever that way is) from the same level of lexical unit regardless of whether or not the individual locutor has used them 1000 times or zero times before, they privilege the individual utterance over our total communicative behaviour".

      I think some interesting work could be done there, without too much effort and revamping of research plans. To continue with this example, we could ask if there is more or less variation in final t/d pronunciation if the word and the following word form a more frequent collocation as opposed to a less frequent one. I think you (Stuart) would predict less, while exemplar theory, I'm not sure.

      I'm working on the particle verb alternation, which has variation spanning different amounts of structure. The identity of the verb, the particle, the verb-particle combination, and the verb-particle-object combination all seem to have their own variation associated, not to mention other factors. I'm not saying I know what theory can account for this, but the exemplar TYPE of theory - more or less "repeat what you've heard" - seems like a weak type of explanation, at best. Could be the way the brain works, as I said, though.

      I'm surprised at so much mentioning of generativism and Chomsky. Besides the usage-based infection that has felled a mind or two at York - cross the Pennines, and you'll see that NC is treated as a joke and the "cognitive theories" that people work with, as far as I understand them (approximately not at all), have a very different way of looking at things. Maybe you don't like it either.

  2. (2 of 2) Secondly, even were we to accept efficiency as a criterion underlying our linguistic models, you have conflated retrieval from a massive database with operational complexity. Retrieval from a database, no matter how large, ultimately consists of a single online function. It will, of course, be dependent for its success and speed upon the quality of the index; but that index may be structured and reordered off-line. But for every reduction in the size of the database, you have to posit a corresponding increase in the number of online processes required. An old style, generative view of construction is thoroughly parsimonious in what it allows the database (at the most parsimonious, lexemes only), but requires a correspondingly large number of online processes to be performed even to utter the most simple and often-repeated phrases. For me, what I see in exemplar theory (to stick with the computational analogy, though I am wary of it for the reasons above) is something corresponding more closely to an index structuring mechanism than the database itself. Crucially for me, the processing of the exemplars does not have to take place online. Current research on sleep, memory, and learnt behaviour suggests that much of our information structure and corresponding learnt behaviour is to a large extent reorganised during sleep; I see no reason why this should not extend to linguistic data.
    Personally, and as I say this is really not my area, I find it hard to believe that we have anything other than the laziest of parsing mechanisms, requiring the minimum of online processes. If there is any parsimony to be looked for it is there, and not in the database. A huge database is really not a problem: we have a spectacular number of nonlinguistic memories which we have no issues organising and retrieving. As recently discussed with you in another place, I have a rather automated view of the human mind and so I think that the originality involved in linguistic production is massively overstressed. I rather feel that our database probably extends to a whole bunch of complete utterances in prototype form, and merely a few substitution rules (to maintain that dodgy computer analogy, regexes) with which to insert occasionally varying lexical items.
    Far too many linguists have grown up learning from Chomskyist textbooks, which all start lauding the wonderful and infinite productivity of the language function. When I write my big book of linguistic theory, it will start as follows: “The vast majority of what you say, you have said before. When you say ‘Good morning,’ you have said it uncountable times before. When you get on the bus, and say, ‘A single to town, please,’ you have said that every morning of every weekday for maybe years on end. To presuppose that exactly the same processing goes into these phrases as similarly-structured but novel phrases, that each time you say them they are created anew from their various component items by your mind, seems to be an unreasonable demand to place on our overwrought cognitive functions. Colourless green ideas may, indeed, sleep furiously; and we may have no problems expressing that fact. But most of the time we do not. We express the fact that it is raining, or that the train is late, or that it is very nice to see you, it's been far too long. If, out of some desire to model maximum efficiency for any individual instance of utterance production, we ignore the fact that most of our utterances do not require such heavy online processing as those irate and transparent concepts, we have started to look at linguistics from the wrong end. We are favouring efficiency of the individual instance over the overall efficiency of our communicative lives. We are indulging in the fantasy that we are constant, spontaneous, and delightful creators, not robotic automatons. This is wrong.”
    No one will buy it of course, but I shall enjoy writing it nonetheless.