Saturday, October 26, 2013

What Should We Do With Frequency?


UPDATE: According to Kyle Gorman (p.c.), the best psycholinguistic frequency measure is simply the rank (of frequency), which makes sense given a "serial access" model of lexical processing. This metric apparently outperforms both log-frequency and raw frequency in lexical decision tasks. The upshot for sociolinguistic frequency results remains to be determined.


Frequency is the count of how many times some relatively rare item, such as a particular word, occurs over a certain length of time, or in a text or corpus of a certain length. Such a count, usually modeled as a Poisson variable, clearly can't be negative, but in a regression the linear combination of the independent variables can take any value, positive or negative. When frequency is the dependent variable, one reason for using the logarithmic transformation of the count is to avoid the problems resulting from this.

Also, in linear regression, effects on the dependent variable are additive, being proportional to the changes in the independent variables. But effects on counts are usually multiplicative, proportional to the size of the count itself. If we use the log link, instead of having to multiply the effects, we can add them, like we usually do.
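In symbols, the contrast looks roughly like this. This is a sketch for a single predictor x; the symbols y, μ, β0 and β1 are notation introduced here, not taken from any particular study.

```latex
\begin{align*}
\text{linear model:} \quad & E[y] = \beta_0 + \beta_1 x
  && \text{(one unit of } x \text{ adds } \beta_1\text{)} \\
\text{log link:} \quad & \log \mu = \beta_0 + \beta_1 x
  \;\Longleftrightarrow\; \mu = e^{\beta_0} \, e^{\beta_1 x}
  && \text{(one unit of } x \text{ multiplies } \mu \text{ by } e^{\beta_1}\text{)}
\end{align*}
```

Because the exponential is always positive, the predicted count μ can never be negative, whatever values the predictors take.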

But when word frequency is an independent variable in a regression, the above concerns do not apply. Here we have to contend with another issue, which is that word frequencies form a highly skewed distribution. Even in large texts or corpora, a few words always make up a substantial fraction of the total, while many words occur only once. Zipf's Law is more precise, stating that the most frequent word in a large text is twice as frequent as the second-most-frequent word, the 10th most frequent word twice as frequent as the 20th, and so forth. This "power law" says that a word's frequency, multiplied by its frequency rank, is a constant.
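As a quick illustration of the idealized law, here is a minimal sketch. The function name and the top count of 60,000 are invented for the example; real corpora follow the pattern only approximately.

```python
def zipf_frequency(rank, top_frequency):
    """Idealized Zipf frequency: frequency * rank always equals top_frequency."""
    return top_frequency / rank


TOP = 60_000  # hypothetical count for the most frequent word

for rank in (1, 2, 10, 20):
    freq = zipf_frequency(rank, TOP)
    # The last column (frequency * rank) is the same on every row.
    print(rank, freq, freq * rank)
```

Under this idealization, rank 1 is twice as frequent as rank 2, and rank 10 twice as frequent as rank 20, exactly as stated above.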

For Zipf (1949), this relationship illustrated his "principle of least effort"; less frequent words were more difficult to access and therefore lazy humans avoided retrieving them. But both Zipf and others pointed out that similar relationships apply much more generally, including well outside the realm of human behavior: "the size of cities, the number of hits on websites, the magnitude of earthquakes and the diameters of moon craters have all been shown to follow power laws" (West 2008). Newman (2006) provides other interesting examples and offers six main mechanisms by which they can arise.

Newman (2006:13-14) refers to Cover and Thomas (1991:85), who show that, from an information-theory standpoint, the optimum length of a codeword is the negative logarithm of its probability. We can extend this, as Newman does, to the question of the distribution of word lengths, and we see that this in fact predicts a power-law distribution (if "length" is defined appropriately). Perhaps the optimum "semantic length" of words is behaving similarly, explaining the specific form of the inverse relationship between word frequency and frequency rank (of course, an inverse relationship of some kind exists by definition).

We saw that when frequency is the dependent variable, it should be expressed in logarithmic units. But when frequency is an independent variable, whether to transform it is not so clear. Ideally, our choice should reflect our thoughts on how frequency might be represented internally. Indeed, comparing between two transformations, or between raw and transformed data, could possibly be used to distinguish between theories.

If words are merely "tagged" for frequency, a log-transformation might be convenient given the skewed distribution, because it decreases the effect of "leverage points" (independent-variable outliers). The distribution will still be skewed, but this is more or less OK, since it is an independent variable. But in a theory where each use of a word calls up the totality of the speaker or hearer's experience with that word, one would expect that raw frequency numbers would be more appropriate.

Transforming frequency as an independent variable does not change the model as such, only the coefficient values for frequency and any other variables interacting with it. However, as we know from Erker and Guy (2012), plots with raw frequency or log-frequency on the x-axis can look quite different, and the corresponding regression slopes and correlations can even switch from positive to negative or vice-versa, complicating the interpretation of the results.
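The sign reversal is easy to reproduce with a toy example. The five (frequency, rate) pairs below are invented purely for illustration and are not taken from Erker and Guy; `pearson` is a small helper written for this sketch.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: one very frequent word with a middling rate.
raw_freq = [1, 10, 100, 1000, 100000]
rate = [0.5, 0.4, 0.3, 0.2, 0.4]
log_freq = [math.log10(f) for f in raw_freq]

r_raw = pearson(raw_freq, rate)  # dominated by the single high-frequency word
r_log = pearson(log_freq, rate)  # dominated by the downward trend of the other four

print("raw frequency:", round(r_raw, 3))
print("log frequency:", round(r_log, 3))
```

With these numbers the raw-frequency correlation comes out positive and the log-frequency correlation negative: on the raw scale the one very frequent word acts as a leverage point, while the log scale lets the other four words set the trend.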

Perhaps a more clear-cut issue in regression is weighting. If we are interested in investigating the effect of frequency, whether on its original axis or on a transformed one, it seems unlikely that we would like our estimates to be affected more by high-frequency words than by low-frequency ones. This is not a problem in experiments where words are selected based on frequency, because they are then presented in a balanced way. But in studies of natural speech, especially if they include frequency as an independent variable, either a random word effect or explicit inverse-frequency weighting should be used to counteract the bias.

References:

Cover, Thomas M. and Joy A. Thomas. 1991. Elements of information theory. New York: John Wiley & Sons.

Erker, Daniel and Gregory R. Guy. 2012. The role of lexical frequency in syntactic variability: variable subject personal pronoun expression in Spanish. Language 88(3): 526-557.

Newman, Mark. 2006. Power laws, Pareto distributions and Zipf’s law.

West, Marc. 2008. The mystery of Zipf.

Zipf, George Kingsley. 1949. Human Behavior and the Principle of Least Effort. Cambridge, Mass.: Addison-Wesley.

Saturday, October 5, 2013

If Individuals Follow The Exponential Hypothesis, Groups Don't (And Vice Versa)


If you haven't heard of the Exponential Hypothesis, read this, this and this, and if you want more, look here. Guy's paper inspired me to want to do this kind of linguistics. But now it seems that the patterns he so cleverly explained were just meaningless coincidences - leaving this, uncontested, as the most impressive quantitative LVC paper of all time. But I digress.

Sociolinguists have differed for aeons regarding the relationship between the individual and the group. Even Labov's clear statements along the lines that "language is not a property of the individual, but of the community" are qualified, or undermined, by defining a speech community as "a group of people who share a given set of norms of language" (see also pp. 206-210 of the same paper for a staunch defense of the study of individuals).

The typical variationist recognizes the practical need to combine data from a group of speakers, even if their theoretical goal is the analysis of individual grammars. After some years spent in ignorance of the statistical ramifications of this situation, they have now generally adopted mixed-effects regression modeling as a way to have their cake and eat it too.

But the Exponential Hypothesis is not well-equipped to bridge this gap. If each individual i retains final t/d at a rate of r_i for regular past tense forms, r_i^2 for weak past tense forms, and r_i^3 for monomorphemes - and if r_i varies by individual (as has always been conceded) - then the pooled data from all speakers can never show an exponential relationship.

I will demonstrate this under four assumptions of how speakers might vary: 1) the probability of retention, r, is normally distributed across the population; 2) the probability of retention is uniformly (evenly) distributed over a similar range; 3) the log-odds of retention - log(r / (1 - r)) - is normally distributed; 4) the log-odds of retention is uniformly distributed.

Using a central value for r of +2 log-odds (.881), and allowing speakers to vary with a standard deviation of 1 (in log-odds) or 0.15 (in probability), I obtained the following results, with 100,000 speakers in each simulation:

Probability Normal | Theoretical (Exponential) | Empirical (Group Mean)
Regular Past | .862 | .862
Weak Past | .743 | .759
Monomorpheme | .641 | .679


Probability Uniform | Theoretical (Exponential) | Empirical (Group Mean)
Regular Past | .862 | .862
Weak Past | .742 | .758
Monomorpheme | .639 | .680


Log-Odds Normal | Theoretical (Exponential) | Empirical (Group Mean)
Regular Past | .844 | .844
Weak Past | .712 | .728
Monomorpheme | .601 | .638


Log-Odds Uniform | Theoretical (Exponential) | Empirical (Group Mean)
Regular Past | .842 | .842
Weak Past | .710 | .724
Monomorpheme | .598 | .633

These simulations assume an equal amount of data from each speaker, and an equal balance of words from each speaker (which matters if individual words vary). If these conditions are not met, as in real data, the groups will likely deviate even more from the exponential pattern. Looking at it the other way round, the very existence of an exponential pattern in pooled data - as is found for t/d-deletion in English! - is evidence that the true Exponential Hypothesis, for individual grammars, is false.
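For readers who want to try this themselves, here is a minimal sketch of this kind of simulation. It is not the original code. It covers only the two normal scenarios, and it assumes that probabilities drawn outside [0, 1] are clipped to that interval, a detail not specified above. All function names are invented here.

```python
import math
import random


def logistic(x):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def logodds_normal(rng):
    """Speaker's retention rate: normal in log-odds, mean 2, sd 1."""
    return logistic(rng.gauss(2.0, 1.0))


def prob_normal(rng):
    """Speaker's retention rate: normal in probability, sd 0.15.
    Assumption: values outside [0, 1] are clipped."""
    return min(1.0, max(0.0, rng.gauss(logistic(2.0), 0.15)))


def simulate(draw_rate, n_speakers=100_000, seed=1):
    """Return (theoretical, empirical) rates for regular, weak, monomorpheme."""
    rng = random.Random(seed)
    rates = [draw_rate(rng) for _ in range(n_speakers)]
    mean_r = sum(rates) / n_speakers
    # Theoretical: raise the group mean to the power 1, 2, 3.
    theoretical = [mean_r ** k for k in (1, 2, 3)]
    # Empirical: raise each speaker's rate to the power, then average.
    empirical = [sum(r ** k for r in rates) / n_speakers for k in (1, 2, 3)]
    return theoretical, empirical


for name, draw in [("log-odds normal", logodds_normal),
                   ("probability normal", prob_normal)]:
    theo, emp = simulate(draw)
    print(name)
    for label, t, e in zip(("regular", "weak", "monomorpheme"), theo, emp):
        print(f"  {label:13s} theoretical {t:.3f}   empirical {e:.3f}")
```

The exact figures depend on the random seed and on the clipping assumption, but in every run the empirical group mean for the weak past and monomorpheme classes exceeds the theoretical exponential value.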

P.S. Why should this be, you ask? Let me try some math.

A function f(x) is strictly convex over an interval if the second derivative of the function is positive for all x in that interval.

Now let f(x) = x^n, where n > 1. The second derivative is n · (n-1) · x^(n-2). Since n > 1, both n and (n-1) are positive. If x is positive, x^(n-2) is positive, making the second derivative positive, which means that x^n is strictly convex over the whole interval 0 < x < ∞.

Jensen's inequality states that if x is a random variable (one that actually varies, rather than being constant) and f(x) is a strictly convex function, then f(E[x]) < E[f(x)]. That is, if we take the expected value of a variable over an interval, and then apply a strictly convex function to it, the result is always less than if we apply the function first, and then take the expected value of the outcome.

In our case, x is the probability of t/d retention, and like all probabilities, it lies on the interval between 0 and 1, where we know x^n is strictly convex. By Jensen's inequality, E[x]^n < E[x^n]. This means that if we take the mean rate of retention for a group of speakers, and raise it to some power, the result is always less than if we raise each speaker's rate to that power, and then take the mean.

Therefore, the theoretical exponential rate will always be less than the empirical group mean rate, which is what we observed in all the simulations above.
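A two-speaker example, with rates chosen here only for illustration, makes the gap concrete. For n = 2, the size of the gap is exactly the variance of the speakers' rates:

```latex
\begin{align*}
x_1 = 0.8,\; x_2 = 1.0: \quad
  & E[x]^2 = 0.9^2 = 0.81, \qquad
    E[x^2] = \tfrac{0.64 + 1.00}{2} = 0.82 \\
\text{in general:} \quad
  & E[x^2] - E[x]^2 = \operatorname{Var}(x) \ge 0
\end{align*}
```

So the more speakers differ from one another, the further the pooled data must depart from the exponential pattern.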

Tuesday, October 1, 2013

On Exactitude In Science: A Miniature Study On The Effects Of Typical And Current Context


In that Empire, the Art of Cartography attained such Perfection that the map of a single Province occupied the entirety of a City, and the map of the Empire, the entirety of a Province. In time, even these immense Maps no longer satisfied, and the Cartographers' Guilds surveyed a Map of the Empire whose Size was that of the Empire, and which coincided point for point with it. The following Generations, less addicted to the Study of Cartography, realized that that vast Map was useless, and not without some Pitilessness delivered it up to the Inclemencies of Sun and Winter. In the Deserts of the West, there remain tattered Ruins of that Map, inhabited by Animals and Beggars; in all the Land there is no other Relic of the Disciplines of Geography.  (J. L. Borges)

A scientific model makes predictions based on a number of variables and parameters. The more complex the model, the more accurate its predictions. But all things being equal, a simpler model is preferred. As Newton put it: "We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances."

Exemplar Theory makes predictions about the present or future based on an enormous amount of stored information about the past. For example, a speaker is said to pronounce a word by taking into account the thousands of ways he or she has produced and heard it before. If such feats of memory are possible - I ain't sayin' I don't believe Goldinger 1998 - we should not be surprised by the accuracy of models that rely on them. And if language can be shown to rely on them, so be it. But the abandonment of parsimony, in the absence of clear evidence, should be resisted. (See the comment by S. Brown below for an alternative view of this issue.)

The same phenomena can often be accounted for "equally well" by a more deductive traditional theory or by a more inductive, bottom-up approach. The chemical elements were assigned to groups (alkali metals, halogens, etc.) because of their similar physical properties and bonding behavior long before the nuclear basis for these similarities was discovered. In biology, the various taxonomic branchings of species can be thought of as a reflection and continuation of their historical evolution, but the differences exist on a synchronic level as well - in organisms' DNA.

In classical mechanics, if an object begins at rest, its current velocity can be determined by integrating its acceleration over time: v(t) = ∫ a(t) dt. This requires storing the details of the acceleration at every point along the trajectory. If we ride a bus with our eyes closed and our fingers in our ears, we can estimate our present speed if we remember everything about our past accelerations.

A free-falling object, under the constant acceleration of gravity, g, has velocity v(t) = g · t. But a block sliding down a ramp made up of various curved and angled sections, like a roller coaster, has an acceleration that changes with time. The acceleration at any moment is proportional to the sine of the ramp's angle of inclination, θ. Integrating, v(t) = g · ∫ sin(θ(t)) dt.

On a simple inclined plane, the angle is constant, so the acceleration is too. The velocity increases linearly: like free-fall, only slower. If the shape of the ramp is complicated, solving the integral of acceleration can be very difficult. (It might be beyond the capacity of the brain to calculate - but on a real roller coaster, we don't have to remember the past ten seconds to know how fast we are going now! We use other cues to accomplish that.)

But forgetting integration, we can solve for velocity in another way, showing that it depends only on the vertical height fallen: v = sqrt(2 · g · h). Obviously this is simpler than keeping track of complex and changing accelerations over time. This equation, rewritten as 1/2 · v^2 = g · h, also reflects the balance between kinetic and potential energy, one part of the important physical law of conservation of energy. Instead of a near-truism with a messy implementation, we have an elegant and powerful principle.
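The two routes to the same velocity can be compared numerically. The sketch below assumes a frictionless ramp made of three straight sections; the angles, lengths and function name are invented for the example. It integrates the acceleration step by step, then checks the result against the height-only formula.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2

# Hypothetical ramp: (angle in degrees, length along the ramp in metres)
RAMP = [(30.0, 5.0), (10.0, 8.0), (45.0, 3.0)]


def slide(ramp, dt=1e-4):
    """Integrate a = g * sin(theta) over time, starting from rest.
    Returns (final speed, vertical height fallen)."""
    v = 0.0
    height = 0.0
    for angle_deg, length in ramp:
        sin_theta = math.sin(math.radians(angle_deg))
        a = G * sin_theta
        travelled = 0.0
        while travelled < length:
            v_new = v + a * dt
            step = 0.5 * (v + v_new) * dt  # distance covered in this time step
            travelled += step
            height += step * sin_theta     # vertical drop in this time step
            v = v_new
    return v, height


v_integrated, h = slide(RAMP)
v_energy = math.sqrt(2 * G * h)

print("by integrating acceleration:", v_integrated)
print("from height fallen alone:   ", v_energy)
```

The first method needs the whole history of accelerations, tens of thousands of small steps here; the second needs a single number, the height fallen. They agree up to floating-point rounding.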

Both expressions for velocity fit the same data, but to call the second an "emergent generalization", à la Bybee 2001, ignores its universality and demotivates the essential next step: the search for deeper explanations.

Admittedly, this physical allegory is unlikely to convince any exemplar theorists to recant. But we should realize that given its powerful and unexplanatory nature, any correct predictions made by ET do not constitute real evidence in its favor. We need to determine if the data also support an alternative theory, or at least find places where the data is more compatible with a weaker version of ET, rather than a stronger one.

A recently-discussed case suggests that with respect to phonological processes, individual words are not only influenced by their current contexts, but also, to a lesser degree, by their typical contexts (Guy, Hay & Walker 2008; Hay 2013). This is one of several recent studies by Hay and colleagues that show widespread and principled lexical variation, well beyond the idiosyncratic lexical exceptionalism sometimes acknowledged in the past, e.g. "when substantial lexical differences appear in variationist studies, they appear to be confined to a few lexical items" (Guy 2009).

The strong-ET interpretation is that all previous pronunciations of a word are stored in memory, and this gives us the typical-context distribution for each word. But if this is the case, the current-context effect must derive from something else: either from mechanical considerations or from analogy to other words. It can't also reflect sampling from sub-clouds of past exemplars, because that would cancel out the typical-context effect.

For words to be stored along with their environments is actually a weak version of word-specific phonetics (Pierrehumbert 2002). It is not that words are explicitly marked to behave differently; they only do so because of the environments they typically find themselves in. For Yang (2013: 6325), "these use asymmetries are unlikely to be linguistic but only mirror life." But whether they mirror life or reflect language-internal collocational structures, these asymmetries are not properties of individual words.

Under this model of word production - sampling from the population of stored tokens, then applying a constant multiplicative contextual effect - we observe the following pattern (in this case, the process is t/d-deletion, as in Hay 2013; the parameters are roughly based on real data):

[Figure: Exemplar Model: Contextual Effect Greatest When Pool Is Least Reduced]

This pattern has two main features: as words' typical contexts favor retention more, retention rates increase linearly before both V and C, with a widening gap between the two. From the Exemplar Theory perspective, when the pool of tokens contains mainly unreduced forms, the differences between constraints on reduction can be seen more clearly. But when many of the tokens in the pool are reduced already, the difference between pre-consonantal and pre-vocalic environments appears smaller. Such reduced constraint sizes are the natural result when a process of "pre-deletion" intersects with a set of rules or constraints that apply later, as discussed in this post, and in a slightly different sense in Guy 2007.

An alternative to storing every token is to say that words acquire biases from their contexts, and that these biases become properties of the words themselves. The source of a bias could be irrelevant to its representation - one word typically heard before consonants, another typically heard from African-American speakers, and another typically heard in fast speech contexts could all be marked for "extra deletion" in the same way.

From the point of view of parsimony, this is appealing. To figure out how a speaker might pronounce a word, the grammar would have to refer to a medium-sized list of by-word intercepts, but not search through a linguistic biography thick and complex enough to have been written by Robert Caro.

But the theoretical rubber needs to hit the empirical road, or else we are just spinning our wheels here. So, compared to the Stor-It-All model, does a stripped-down word-intercept approach make adequate predictions, or - dare we hope - even better ones? Are the predictions even that different?

If we assume that for binary variables, by-word intercepts (like by-speaker intercepts) combine with contextual effects additively on the log-odds scale (which seems more or less true), we obtain a pattern like this:

[Figure: Intercept Model: Typical Context Combines W/ Constant Current Context]

Although the two figures are not wildly different, we can see that in this case, there is no steady separation of the _V and _C effects as overall retention increases. The following-segment effect is constant in log-odds (by stipulation), and this manifests as a slight widening near the center of the distribution. The effects of current context and typical context are independent in this model, as opposed to what we saw above.
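The difference between the two models can be sketched numerically. Everything below is a simplified illustration with made-up parameters, not the values behind the figures: the exemplar model is reduced to "pool retention rate times a constant contextual factor", and the intercept model to "by-word intercept plus contextual effect, in log-odds".

```python
import math


def logistic(x):
    """Convert log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def exemplar_retention(pool_retention, context_factor):
    """Sample from the stored pool, then scale by a constant contextual factor."""
    return pool_retention * context_factor


def intercept_retention(word_intercept, context_effect):
    """Combine a by-word intercept and a contextual effect on the log-odds scale."""
    return logistic(word_intercept + context_effect)


# Hypothetical parameters, chosen only for illustration.
FACTOR_V, FACTOR_C = 1.0, 0.6    # multiplicative effects of _V and _C
EFFECT_V, EFFECT_C = 0.5, -0.5   # log-odds effects of _V and _C

# Gap between _V and _C retention as overall retention rises.
exemplar_gaps = [
    exemplar_retention(p, FACTOR_V) - exemplar_retention(p, FACTOR_C)
    for p in (0.3, 0.6, 0.9)
]
intercept_gaps = [
    intercept_retention(b, EFFECT_V) - intercept_retention(b, EFFECT_C)
    for b in (-1.0, 0.0, 1.0)
]

print("exemplar model gaps: ", [round(g, 3) for g in exemplar_gaps])
print("intercept model gaps:", [round(g, 3) for g in intercept_gaps])
```

In the multiplicative sketch the _V/_C gap keeps growing as the pool becomes less reduced. In the log-odds sketch the gap is widest in the middle of the range and shrinks symmetrically toward either end.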

As usual, the Buckeye Corpus (Pitt et al. 2007) is a good proving ground for competing predictions of this kind. The Philadelphia Neighborhood Corpus has a similar amount of t/d coded (with more coming soon). Starting with Buckeye, I only included tokens of word-final t/d that were followed by a vowel or a consonant. I excluded all tokens with preceding /n/, in keeping with the sociolinguistic lore, "Beware the nasal flap!" I then restricted the analysis to words with at least 10 total tokens each - and excluded the word "just", because it had about eight times as many tokens as any of the other words. I was left with 2418 tokens of 69 word types.

Incidentally, there is no significant relationship between a word's overall deletion rate and its (log) frequency, whether the frequency measure is taken from the Buckeye Corpus itself (p = .15) or from the 51-million-word corpus of Brysbaert et al. 2013 (p = .37). The absence of a significant frequency effect on what is arguably a lenition process goes against a key tenet of Exemplar Theory (Bybee 2000, Pierrehumbert 2001), but the issue of frequency is not our main concern here.

I first plotted two linear regression lines, one for the pre-vocalic environments and one for the pre-consonantal environments. The regressions were weighted according to the number of tokens for each word. I then tried a quadratic rather than a linear regression. However, these curves did not provide a significantly better fit to the data - p(_V) = .57, p(_C) = .46 - so I retreated to the linear models. The straight lines plotted below look parallel; in fact the slope of the _V line is 0.301 and the slope of the _C line is 0.369. Since the lines converge slightly rather than diverging markedly, this data is less consistent with the exemplar model sketched above, and more consistent with the word-intercept model.
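For reference, a token-weighted straight-line fit of this kind can be sketched in a few lines. This is not the code used for the analysis, and the four words' values below are invented; `weighted_line` is a hypothetical helper name.

```python
def weighted_line(xs, ys, weights):
    """Weighted least-squares fit of y = intercept + slope * x.
    Returns (slope, intercept)."""
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxy = sum(w * (x - mean_x) * (y - mean_y)
              for w, x, y in zip(weights, xs, ys))
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical per-word values: proportion of tokens before a vowel,
# retention rate before a vowel, and number of tokens (the weight).
prop_prevocalic = [0.20, 0.35, 0.50, 0.75]
retention_before_v = [0.55, 0.62, 0.64, 0.73]
n_tokens = [12, 40, 25, 160]

slope, intercept = weighted_line(prop_prevocalic, retention_before_v, n_tokens)
print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
```

Weighting by token count means a word with 160 tokens pulls the line far more than a word with 12, which is appropriate when each word's rate is estimated with a precision that depends on its sample size.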

[Figure: Buckeye Corpus: Parallel Lines Support Intercept Model, Not Exemplars]

One way to improve this analysis would be to use a larger corpus, at least for the x-axis, to more accurately estimate the proportion of the time that a given word ending in t/d is followed by a vowel rather than a consonant. For example, the spoken section of COCA (Davies 2008-) is about 250 times larger than the Buckeye Corpus. Of course, for a few words the estimate from the local corpus might better represent those speakers' biases.

Turning finally to data from the Philadelphia Neighborhood Corpus, we see a fairly similar picture. Note that some of the words' left-right positions differ noticeably between the two studies. The word "most", despite having 150-200 tokens, occurs before a vowel 75% of the time in Philadelphia, but only 52% of the time in Ohio. It is hard to think what this could be besides sampling error, but if it is that, it casts some doubt on the reliability of these results, especially as most words have far fewer tokens.

[Figure: Philadelphia Neighborhood Corpus: Convergence, Not Exemplar Prediction]

Regarding the regression lines, there are two main differences. First, Philadelphia speakers delete much more before consonants than Ohio speakers, while there is no overall difference before vowels. This creates the greater following-segment effect noticed for Philadelphia before.

The second difference is that in Philadelphia, a word's typical context seems to barely affect its behavior before vowels. The slope before consonants, 0.317, is close to those observed in Ohio, but the slope before vowels is only 0.143 - not significantly different from zero (p = .14). Recall that under the exemplar model, the _V slope should always be larger than the _C slope; words almost always occurring before vowels - passed, walked, talked - should provide a pool of pristine, unreduced exemplars upon which the effects of current context should be most visible.

I have no explanation at present for the opposite trend being found in Philadelphia, but it is clear that neither the PNC data nor the Buckeye Corpus data show the quantitative patterns predicted by the exemplar theory model. This, and a general preference for parsimony - in storage, and arguably in computation (again, see S. Brown below) - points to typical-context effects being "ordinary" lexical effects. "[We] shall know a word by the company it keeps" (Firth 1957: 11), but we still have no reason to believe that the word itself knows all the company it has ever kept. And to find our way forward, we may not need a map at 1:1 scale.

Thanks: Stuart Brown, Kyle Gorman, Betsy Sneller, & Meredith Tamminga.

References:

Borges, Jorge Luis. 1946. Del rigor en la ciencia. Los Anales de Buenos Aires 1(3): 53.

Brysbaert, Marc, Boris New and Emmanuel Keuleers. 2013. SUBTLEX-US frequency list with PoS information final text version. Available online at http://expsy.ugent.be/subtlexus/.

Bybee, Joan. 2000. The phonology of the lexicon: evidence from lexical diffusion. In Michael Barlow and Suzanne Kemmer (eds.), Usage-based models of language. Stanford: CSLI. 65-85.

Bybee, Joan. 2001. Phonology and language use. Cambridge Studies in Linguistics 94. Cambridge: Cambridge University Press.

Davies, Mark. 2008-. The Corpus of Contemporary American English: 450 million words, 1990-present. Available online at http://corpus.byu.edu/coca/.

Firth, John R. 1957. A synopsis of linguistic theory, 1930-1955. In Studies in Linguistic Analysis, Special volume of the Philological Society. Oxford: Basil Blackwell.

Guy, Gregory. 2007. Lexical exceptions in variable phonology. Penn Working Papers in Linguistics 13(2), Papers from NWAV 35, Columbus.

Guy, Gregory. 2009. GoldVarb: Still the right tool. NWAV 38, Ottawa.

Guy, Gregory, Jennifer Hay and Abby Walker. 2008. Phonological, lexical, and frequency factors in coronal stop deletion in early New Zealand English. LabPhon 11, Wellington.

Hay, Jennifer. 2013. Producing and perceiving "living words". UKLVC 9, Sheffield.

Pierrehumbert, Janet. 2001. Exemplar dynamics: word frequency, lenition and contrast. In Joan Bybee and Paul Hopper (eds.), Frequency and the emergence of linguistic structure. Amsterdam: John Benjamins. 137-157.

Pierrehumbert, Janet. 2002. Word-specific phonetics. Laboratory Phonology 7. Berlin: Mouton de Gruyter. 101-139.

Pitt, Mark A. et al. 2007. Buckeye Corpus of Conversational Speech. Columbus: Department of Psychology, Ohio State University.

Yang, Charles. 2013. Ontogeny and phylogeny of language. Proceedings of the National Academy of Sciences 110(16): 6324-6327.