Sunday, June 23, 2013

What if my variable has more than two variants?

Since the dawn of time, sociolinguists have used logistic regression (or something very similar) to estimate the factors affecting binary linguistic variables. And linear regression has occasionally been applied when the dependent variable is a continuous numeric measurement. For the most part, however, variables with k categories (where k > 2) have been avoided, or approached by performing k-1 separate logistic regressions, a workaround that, while not ideal, makes a lot of sense for some variables.

Models specifically designed for multi-categorical (or "polytomous") responses do exist: if the k categories have a meaningful order, then some form of ordinal logistic regression may be appropriate. Otherwise, we can use a more general method, multinomial logistic regression, which basically optimizes k-1 binary models simultaneously. The R function mlogit() fits models of this type; its use is somewhat different from lm() or glm().

Over the past ten years, many sociolinguists have realized the advantages of mixed-effects regression models. Estimates of between-speaker properties, including many of the traditional "social factors", cannot be accurate if the grouped - and often imbalanced - structure of the data is ignored. The same applies to individual words, where achieving balance in spontaneous speech is never possible, due to Zipf's Law. Mixed models can distinguish between group and individual influences, giving better estimates of effect size as well as statistical significance. The R function lmer() is a popular way to fit these models, but it does not perform multinomial logistic regression.

Mixed-effects multinomial logistic regression has been implemented in SAS (using the NLMIXED procedure) and in Stata (with the gllamm module). Some R packages with a Bayesian orientation can also fit such models, such as bayesm and mcmcGLMM. Indeed, even mlogit() now advertises this functionality, although learning it does not appear to be entirely straightforward, at least from the point of view of an lmer() user.

However, if you are working with data that is well-balanced across speakers, and if the variables you're most interested in are all between-speaker variables, there is another option that might be appealing. This is the method known as CoDa, or compositional data analysis. If we aggregate the observations for each speaker (e.g. 20% variant A, 50% variant B, 30% variant C), we get a set of proportions that add up to 1; this is a composition. Each speaker produces a different composition, and we want to identify the factors most responsible for those differences.

Although compositional data is numeric, we can't perform (multivariate) linear regression on these numbers directly; the fact that each composition sums to 1 would introduce spurious negative correlations. Instead, CoDa transforms the k compositional parts to k-1 coordinates using log-ratios. Three variants can become two coordinates in several ways. One popular method makes the first coordinate axis proportional to log (A * B / C) and the second axis proportional to log (B / A). We can then carry out various statistical operations on the transformed data - including linear regression and hypothesis testing.

Because of this choice of coordinates, the results of regression are interpretable in the transformed space (with the x-axis, for example, showing the relationship between C and the other two variants, and the y-axis showing the relationship between A and B). Alternatively, the results can be transformed back into the original compositional space; a triangular "ternary diagram" is a useful display for this, especially in the case of three variants.

The diagrams below are just an example of how we can apply linear regression using a single predictor - age - with three-part compositional data on coda /r/ pronunciation in two Scottish towns. One of the variants, a tap or trill, is used mainly by older speakers. However, in Eyemouth, the apparent-time trend is towards an approximant realization, while in Gretna a zero variant is the norm for younger speakers. The two graphs are equivalent, but only using the logratio-transformed representation (on the right) can we test the significance of both changes in progress.


Multivariate F-tests (with Pillai traces) show that both communities are undergoing significant change. Univariate tests show that in Eyemouth, only one dimension of the change, that from tap/trill to approximant, is significant. In Gretna, it is the other dimension, from tap/trill/approximant to zero, that shows a significant change over time.

References:

Croissant, Yves. Estimation of multinomial logit models in R: the mlogit packages. http://cran.r-project.org/web/packages/mlogit/vignettes/mlogit.pdf

Egozcue, Juan José et al. 2012. Simplicial regression: the normal model. Journal of applied probability and statistics 6(1-2): 87-108. http://japs.isoss.net/0416.pdf

Gorman, Kyle and Daniel Ezra Johnson. 2013. Quantitative analysis. In Bayley, Cameron and Lucas (eds.), The Oxford Handbook of Sociolinguistics, 214-240.

Llamas, Carmen et al. 2008. Variable /r/ use along the Scottish-English border. Poster presented at NWAV 37. http://www.danielezrajohnson.com/llamas_et_al_2008.pdf

Pawlowsky-Glahn, Vera et al. 2011. Lecture notes on compositional data analysis. http://www.compositionaldata.com/material/others/Lecture_notes_11.pdf

No comments:

Post a Comment