One of the risks of blogging is that you can fire off ideas into the public domain while you’re still excited about them and haven’t really tested them all that well. Last month I blogged about a random walk model of linguistic complexity (the current post won’t make much sense unless you’ve read the original). Essentially, it was trying to find a baseline for the expected correlation between a population’s size and a measure of linguistic complexity. It assumed that the rate of change in the linguistic measure was linked to population size. Somewhat surprisingly, correlations between the two measures (similar to the kind described in Lupyan & Dale, 2010) emerged, despite there being no directional link.
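For readers who haven't seen the original post, here is a minimal sketch of the kind of model involved. This is not the original script: the population sizes, step sizes and bounds are illustrative assumptions. Each community holds a linguistic complexity value that takes a bounded random walk, with the step size inversely related to population size, and at the end we test whether population size correlates with complexity.

```r
# A minimal sketch of the baseline model (illustrative parameters, not the original script)
set.seed(42)
n.communities <- 50
n.steps       <- 1000

# Hypothetical population sizes, spread over a few orders of magnitude
pop.size   <- round(exp(runif(n.communities, log(100), log(100000))))
complexity <- runif(n.communities, 0, 1)   # complexity starts out randomly distributed

for (t in 1:n.steps) {
  step.size  <- 1 / pop.size                             # rate of change tied to population size
  complexity <- complexity + rnorm(n.communities, 0, step.size)
  complexity <- pmin(pmax(complexity, 0), 1)             # keep the measure inside its range
}

# Does population size correlate with the final complexity values?
# (exact = FALSE because clipping at the boundaries produces tied ranks)
cor.test(pop.size, complexity, method = "spearman", exact = FALSE)
```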
However, these observations were made on the basis of a relatively small sample size. In order to discover why the model was behaving like this, I needed to run a lot more tests. The model was running slowly in Python, so I transliterated it to R. When I did, the results were very different. In the first model, an inverse relationship between population size and the rate of change of linguistic complexity yielded a negative correlation between population size and linguistic complexity (perhaps explaining results such as Lupyan & Dale's). However, in the R model this did not occur. In fact, significant correlations appeared only 5% of the time, split evenly between positive and negative correlations. That is, the baseline model behaves just as a standard confidence interval predicts, and there is no need for the much stricter one I had suggested in the last post.
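As a rough sketch of what that check looks like (again with made-up parameters, not the actual 25-line script), you can wrap one run of the model in a function, repeat it many times, and count how often the correlation comes out significant in either direction. Under the null you would expect roughly 5% of runs to reach a 0.05 threshold, split between positive and negative.

```r
# Sketch of the replication check (illustrative parameters, not the original script)
run.once <- function(n.communities = 50, n.steps = 1000) {
  pop.size   <- round(exp(runif(n.communities, log(100), log(100000))))
  complexity <- runif(n.communities, 0, 1)
  for (t in 1:n.steps) {
    # step size inversely related to population size, measure clipped to [0, 1]
    complexity <- pmin(pmax(complexity + rnorm(n.communities, 0, 1 / pop.size), 0), 1)
  }
  cor.test(pop.size, complexity, method = "spearman", exact = FALSE)
}

set.seed(1)
results <- replicate(1000, {
  ct <- run.once()
  c(p = ct$p.value, rho = unname(ct$estimate))
})

sig <- results["p", ] < 0.05
mean(sig)                         # proportion of significant runs (expect roughly 0.05)
table(sign(results["rho", sig]))  # how the significant runs split between negative and positive
```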
Why was this happening? In short: Rounding errors and small sample sizes.
I checked the Python code but couldn't find a bug, so the correlations really were appearing, and really were favouring a negative correlation. Here's my best explanation: First, the sample of runs was too small to capture the proper distribution. Even so, strong correlations were appearing. This could be because, although the linguistic complexity measure started out fairly randomly distributed, the individual communities were synchronising at the maximum and minimum of the range as they bumped up against it. This caused temporary clusters in the low ranges, where the linguistic complexity was changing rapidly (and was therefore more likely to synchronise), creating tied ranks in the corners. In addition, the Python script I was using had a lower bit depth for its numbers than R, so it was more prone to rounding errors. I also have to assume that my Python script somehow favoured numbers closer to 1 than to 0. It's still not a very satisfactory explanation, but the conclusion remains that, as one would expect, affecting just the rate of change of linguistic complexity does not produce correlations with population size.
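A tiny illustration of the tied-ranks point (with hypothetical numbers): communities whose complexity changes quickly keep hitting the edges of the range, and clipping them to the boundary leaves several of them sitting on exactly the same value.

```r
# Fast-changing walkers pile up on the boundaries, producing tied ranks
set.seed(1)
x <- runif(20)                                       # 20 fast-changing communities
for (t in 1:200) x <- pmin(pmax(x + rnorm(20, 0, 0.2), 0), 1)
sum(x == 0 | x == 1)   # how many end up on exactly the same boundary value
```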
Modelling evolutionary systems often runs into these kinds of problems: the search spaces can be intractable for some approaches, and, as a mere linguist, I am not aware of some of the more advanced computational techniques. It's one of the reasons that Evolutionary Linguistics requires a pluralist approach and tools from many different disciplines.
It’s embarrassing to have to correct previous statements, but I guess that’s what Science is about. In the blogging age ideas can get out before they’re fully tested and potentially affect other work. This has its advantages – good ideas can get out faster. But it also means that the reader must be more critical in order to catch poor ideas like the one I’m correcting here.
Sorry, Science.
Here’s a link to the R script (25 lines of code!).
Lupyan, G., & Dale, R. (2010). Language structure is partly determined by social structure. PLoS ONE, 5(1). PMID: 20098492