Saturday, October 26, 2013
What Should We Do With Frequency?
UPDATE: According to Kyle Gorman (p.c.), the best psycholinguistic frequency measure is simply the rank (of frequency), which makes sense given a "serial access" model of lexical processing. This metric apparently outperforms both log-frequency and raw frequency in lexical decision tasks. The upshot for sociolinguistic frequency results remains to be determined.
Frequency is the count of how many times some relatively rare item, such as a particular word, occurs over a certain length of time, or in a text or corpus of a certain length. Such a count, often modeled as a Poisson variable, clearly can't be negative, but in a regression the independent variables, and hence the linear combination of them that predicts the outcome, can be positive or negative. When frequency is the dependent variable, one reason for modeling the logarithm of the count is to avoid the problems resulting from this.
Also, in linear regression, effects on the dependent variable are additive, being proportional to the changes in the independent variables. But effects on counts are usually multiplicative, proportional to the size of the count itself. If we use the log link, instead of having to multiply the effects, we can add them, like we usually do.
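To make the additive-versus-multiplicative point concrete, here is a minimal Python sketch of a log link. The function name `expected_count` and the coefficient values are invented for illustration and do not come from any fitted model.

```python
import math

def expected_count(intercept, coefs, xs):
    """Expected count under a log link: exp(intercept + sum(coef * x))."""
    eta = intercept + sum(b * x for b, x in zip(coefs, xs))
    return math.exp(eta)

# Hypothetical coefficients, chosen only for illustration.
intercept = 1.0
coefs = [0.5, -0.3]

base = expected_count(intercept, coefs, [0, 0])
only_first = expected_count(intercept, coefs, [1, 0])
only_second = expected_count(intercept, coefs, [0, 1])
both = expected_count(intercept, coefs, [1, 1])

# On the log scale the two effects add; on the count scale they multiply.
ratio_first = only_first / base
ratio_second = only_second / base
ratio_both = both / base

print(math.isclose(ratio_both, ratio_first * ratio_second))
```

Because the exponential is always positive, the predicted count stays above zero even when the sum on the log scale is negative.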
But when word frequency is an independent variable in a regression, the above concerns do not apply. Here we have to contend with another issue, which is that word frequencies form a highly skewed distribution. Even in large texts or corpora, a few words always make up a substantial fraction of the total, while many words occur only once. Zipf's Law is more precise, stating that the most frequent word in a large text is twice as frequent as the second-most-frequent word, the 10th most frequent word twice as frequent as the 20th, and so forth. This "power law" says that a word's frequency, multiplied by its frequency rank, is a constant.
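The relationship is easy to check on idealized numbers. The sketch below builds a perfectly Zipfian list of counts (the function `zipf_frequencies` and the top count of 60000 are made up for the example) and confirms the two properties just stated. Real texts follow the law only approximately.

```python
from fractions import Fraction

def zipf_frequencies(top_frequency, n_words):
    """Idealized Zipf counts: the word of rank r occurs top_frequency / r times."""
    return [Fraction(top_frequency, rank) for rank in range(1, n_words + 1)]

# 60000 is a made-up count for the most frequent word.
freqs = zipf_frequencies(60000, 20)

# Frequency times rank is the same constant for every word.
products = [freq * rank for rank, freq in enumerate(freqs, start=1)]

# Rank 1 vs. rank 2, and rank 10 vs. rank 20, both differ by a factor of two.
ratio_1_to_2 = freqs[0] / freqs[1]
ratio_10_to_20 = freqs[9] / freqs[19]

print(len(set(products)))            # number of distinct frequency-times-rank values
print(ratio_1_to_2, ratio_10_to_20)
```

Exact fractions are used here only so that the products compare equal without floating-point rounding.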
For Zipf (1949), this relationship illustrated his "principle of least effort"; less frequent words were more difficult to access and therefore lazy humans avoided retrieving them. But both Zipf and others pointed out that similar relationships apply much more generally, including well outside the realm of human behavior: "the size of cities, the number of hits on websites, the magnitude of earthquakes and the diameters of moon craters have all been shown to follow power laws" (West 2008). Newman (2006) provides other interesting examples and offers six main mechanisms by which they can arise.
Newman (2006:13-14) refers to Cover and Thomas (1991:85), who show that, from an information-theory standpoint, the optimum length of a codeword is the negative logarithm of its probability. We can extend this, as Newman does, to the question of the distribution of word lengths, and we see that this in fact predicts a power-law distribution (if "length" is defined appropriately). Perhaps the optimum "semantic length" of words is behaving similarly, explaining the specific form of the inverse relationship between word frequency and frequency rank (of course, an inverse relationship of some kind exists by definition).
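As a small illustration of the codeword-length result, the sketch below computes ideal lengths for a toy distribution. The base-2 logarithm (so that lengths are in bits), the function name `ideal_code_length`, and the probabilities are all assumptions made for the example, not taken from Newman or from Cover and Thomas.

```python
import math

def ideal_code_length(probability):
    """Ideal codeword length in bits: the negative base-2 log of the probability."""
    return -math.log2(probability)

# A toy four-symbol distribution (invented for illustration).
probabilities = [0.5, 0.25, 0.125, 0.125]
lengths = [ideal_code_length(p) for p in probabilities]

# Average length when each symbol gets its ideal length.
expected_length = sum(p * l for p, l in zip(probabilities, lengths))

# If probability is proportional to 1 / rank, as in an idealized Zipf
# distribution, doubling the rank halves the probability and adds one bit.
extra_bits = ideal_code_length(0.01 / 2) - ideal_code_length(0.01)

print(lengths, expected_length, extra_bits)
```

The last line of the calculation is the point of contact with Zipf's Law: the ideal length grows with the logarithm of the rank, not with the rank itself.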
We saw that when frequency is the dependent variable, it should be expressed in logarithmic units. But when frequency is an independent variable, whether to transform it is not so clear. Ideally, our choice should reflect our thoughts on how frequency might be represented internally. Indeed, comparing between two transformations, or between raw and transformed data, could possibly be used to distinguish between theories.
If words are merely "tagged" for frequency, a log-transformation might be convenient given the skewed distribution, because it decreases the effect of "leverage points" (independent-variable outliers). The distribution will still be skewed, but this is more or less OK, since it is an independent variable. But in a theory where each use of a word calls up the totality of the speaker or hearer's experience with that word, one would expect that raw frequency numbers would be more appropriate.
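The effect on leverage can be seen with a handful of numbers. This is a minimal sketch for a regression with an intercept and a single predictor. The helper `leverages` and the six frequencies are invented for illustration.

```python
import math

def leverages(xs):
    """Hat values for a simple regression with an intercept and one predictor."""
    n = len(xs)
    mean = sum(xs) / n
    sxx = sum((x - mean) ** 2 for x in xs)
    return [1 / n + (x - mean) ** 2 / sxx for x in xs]

# Made-up word frequencies with one very frequent word.
raw = [1, 2, 5, 10, 50, 1000]
logged = [math.log(f) for f in raw]

lev_raw = leverages(raw)
lev_log = leverages(logged)

print(round(max(lev_raw), 3))  # leverage of the most frequent word, raw scale
print(round(max(lev_log), 3))  # the same word after the log transformation
```

With these numbers the most frequent word has a leverage close to the maximum of 1 on the raw scale and about 0.76 after logging, so it still stands out, but it no longer determines the fitted line almost single-handedly.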
Transforming frequency as an independent variable does not change the model as such, only the coefficient values for frequency and any other variables interacting with it. However, as we know from Erker and Guy (2012), plots with raw frequency or log-frequency on the x-axis can look quite different, and the corresponding regression slopes and correlations can even switch from positive to negative or vice versa, complicating the interpretation of the results.
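A sign switch of this kind is easy to produce with a few points. The data below are invented purely to show that it can happen and have nothing to do with Erker and Guy's results. `ols_slope` is a hypothetical helper computing an ordinary least-squares slope.

```python
import math

def ols_slope(xs, ys):
    """Least-squares slope of ys on xs (simple regression with an intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Invented data: four words, their frequencies, and some response measure.
frequency = [1, 10, 100, 10000]
response = [0, 2, 4, 1]

slope_raw = ols_slope(frequency, response)
slope_log = ols_slope([math.log10(f) for f in frequency], response)

print(slope_raw < 0, slope_log > 0)
```

On the raw scale the single very frequent word sits far from the others and pulls the slope negative. On the log scale the three less frequent words are spread out enough for their upward trend to win.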
Perhaps a more clear-cut issue in regression is weighting. If we are interested in investigating the effect of frequency, whether on its original axis or on a transformed one, it seems unlikely that we would like our estimates to be affected more by high-frequency words than by low-frequency ones. This is not a problem in experiments where words are selected based on frequency, because they are then presented in a balanced way. But in studies of natural speech, especially if they include frequency as an independent variable, either a random word effect or explicit inverse-frequency weighting should be used to counteract the bias.
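The bias, and what inverse-frequency weighting does about it, can be sketched with three hypothetical words. The word labels, counts, and the helper `weighted_rate` are all made up. A random word effect would need a mixed-model package and is not shown here.

```python
# Invented data: (word, number of tokens, tokens showing the variant of interest).
words = [
    ("frequent", 900, 180),   # rate 0.2
    ("middling", 90, 45),     # rate 0.5
    ("rare", 10, 8),          # rate 0.8
]

def weighted_rate(words, weight_for):
    """Overall rate when every token of a word carries the weight weight_for(n_tokens)."""
    total_hits = sum(weight_for(n) * hits for _, n, hits in words)
    total_weight = sum(weight_for(n) * n for _, n, _ in words)
    return total_hits / total_weight

# Unweighted: every token counts once, so the frequent word dominates.
token_rate = weighted_rate(words, lambda n: 1)

# Inverse-frequency weights: each word's tokens sum to weight 1, so words count equally.
type_rate = weighted_rate(words, lambda n: 1 / n)

print(token_rate, type_rate)
```

The unweighted rate sits close to the frequent word's 0.2, while the weighted rate is the plain average of the three per-word rates.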
Cover, Thomas M. and Joy A. Thomas. 1991. Elements of information theory. New York: John Wiley & Sons.
Erker, Daniel and Gregory R. Guy. 2012. The role of lexical frequency in syntactic variability: variable subject personal pronoun expression in Spanish. Language 88(3): 526-557.
Newman, Mark. 2006. Power laws, Pareto distributions and Zipf’s law.
West, Marc. 2008. The mystery of Zipf.
Zipf, George Kingsley. 1949. Human Behavior and the Principle of Least Effort. Cambridge, Mass.: Addison-Wesley.