One of the risks of blogging is that you can fire off ideas into the public domain while you’re still excited about them and haven’t really tested them all that well. Last month I blogged about a random walk model of linguistic complexity (the current post won’t make much sense unless you’ve read the original). Essentially, it was trying to find a baseline for the expected correlation between a population’s size and a measure of linguistic complexity. It assumed that the rate of change in the linguistic measure was linked to population size. Somewhat surprisingly, correlations between the two measures (similar to the kind described in Lupyan & Dale, 2010) emerged, despite there being no directional link.
However, these observations were made on the basis of a relatively small sample size. In order to discover why the model was behaving like this, I needed to run a lot more tests. The model was running slowly in Python, so I rewrote it in R. When I did, the results were very different. In the first model, an inverse relationship between population size and the rate of change of linguistic complexity yielded a negative correlation between population size and linguistic complexity (perhaps explaining results such as Lupyan & Dale’s). In the R model, however, this did not occur. In fact, significant correlations appeared only 5% of the time, and that 5% was split evenly between positive and negative correlations. That is, the baseline model produces spurious correlations at the standard 5% rate, so the usual significance criterion applies rather than the much stricter one I had suggested in the last post.
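For concreteness, the basic procedure in both scripts looks something like the Python sketch below. This isn’t the actual code: the population range, the step scaling and the number of communities are placeholders I’ve chosen purely for illustration. But it shows the shape of the model, namely a bounded random walk for each community, with the step size tied to population size, and a correlation test at every time step.

    import random
    from scipy.stats import spearmanr

    random.seed(1)
    n_communities = 50
    pops = [random.randint(100, 10000) for _ in range(n_communities)]  # fixed population sizes
    comp = [random.random() for _ in range(n_communities)]             # complexity starts out random in [0, 1]

    significant = []
    for t in range(1000):
        for i, p in enumerate(pops):
            # Step size is tied to population size; the exponent -1 gives the inverse
            # relationship (larger populations change more slowly).
            rate = 0.001 * (p / 10000.0) ** -1
            # choice([-1, 1]) only picks the direction of the step, not a uniform value in between.
            comp[i] = min(max(comp[i] + random.choice([-1, 1]) * rate, 0.0), 1.0)
        rho, pval = spearmanr(pops, comp)   # correlation checked at every time step
        if pval < 0.05:
            significant.append(rho)

    print(len(significant) / 1000.0)        # proportion of 'significant' time steps
    print(sum(r > 0 for r in significant),
          sum(r < 0 for r in significant))  # how many of those were positive vs negative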
Why was this happening? In short: Rounding errors and small sample sizes.
I checked the Python code but couldn’t find a bug, so the correlations really were appearing, and really were favouring a negative correlation. Here’s my best explanation. First, the number of runs was too small to capture the proper distribution, yet strong correlations were still appearing. This could be because, although the linguistic complexity measure started out fairly randomly distributed, the individual communities were synchronising at the maximum and minimum of the range as they bumped up against it. This caused temporary clusters in the low ranges, where the linguistic complexity was changing rapidly (and was therefore more likely to synchronise), creating tied ranks in the corners. In addition, the Python script I was using had a lower bit depth for its numbers than R, so it was more prone to rounding errors. I also have to assume that my Python script somehow favoured numbers closer to 1 than to 0. It’s still not a very satisfactory explanation, but the conclusion remains that, as one would expect, affecting just the rate of change of linguistic complexity does not produce correlations.
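The clustering effect itself is easy to see in isolation. In the toy snippet below (the step size and the number of walkers are arbitrary choices of mine), a bounded random walk with a large step spends much of its time pinned at exactly 0 or 1, and once a walker has been clipped at a bound it snaps onto the same grid of values as the others, which is where the tied ranks come from.

    import random

    random.seed(2)
    walkers = [random.random() for _ in range(10)]
    rate = 0.25   # a big step relative to the [0, 1] range, like a small, fast-changing community
    for _ in range(500):
        walkers = [min(max(w + random.choice([-1, 1]) * rate, 0.0), 1.0) for w in walkers]

    # After the first clip each walker sits on the grid 0.0, 0.25, 0.5, 0.75, 1.0,
    # so exact ties, especially at 0.0 and 1.0, are common.
    print(sorted(round(w, 2) for w in walkers))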
Modelling evolutionary systems often runs into these kinds of problems: the search spaces are often intractable for some approaches, and, as a mere linguist, I’m not aware of some of the more advanced computational techniques. It’s one of the reasons that Evolutionary Linguistics requires a pluralist approach and tools from many different disciplines.
It’s embarrassing to have to correct previous statements, but I guess that’s what Science is about. In the blogging age, ideas can get out before they’re fully tested and potentially affect other work. This has its advantages – good ideas can get out faster. But it also means that readers must be more critical in order to catch poor ideas like the one I’m correcting here.
Sorry, Science.
Here’s a link to the R script (25 lines of code!).
Lupyan, G., & Dale, R. (2010). Language structure is partly determined by social structure. PLoS ONE, 5(1). PMID: 20098492
What was the precision of the Python script compared to R? I’d guess that both defaulted to “double” precision (64-bits), though I’m not fluent in either language.
Yes, I was using double precision. I’m not entirely sure about the rounding errors problem, but it’s the only thing I can think of. Perhaps there is a bug in the Python code, or I’m not generating the random numbers properly. At any rate, the results from the R program make more sense.
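For the record, it’s easy to check: Python’s floats are C doubles, and R’s numeric vectors are too (R reports the equivalent values via .Machine$double.digits and .Machine$double.eps).

    import sys
    print(sys.float_info.mant_dig)  # 53: the significand of a standard IEEE 754 double
    print(sys.float_info.epsilon)   # ~2.220446e-16: the gap between 1.0 and the next float up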
Well, I know neither R nor Python, but one question: are you sure that you are getting a uniform random number with choice([-1, 1]), and not either -1 or 1?
Based on this web page.
great example of ‘open science’ though!
According to the Python documentation, its RNG has 53 bits of precision. You can use random.getrandbits(k) to get an arbitrary number of random bits.
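For example, with the standard random module:

    import random
    print(random.random())          # a float built from 53 random bits
    print(random.getrandbits(53))   # the same number of random bits, but as an integer
    print(random.getrandbits(256))  # or as many bits as you like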
I’m not familiar enough with either R or Python’s RNG to speculate, but I did encounter weird problems with C when using rand() instead of random() to generate matrices. [perhaps a similar issue with bit precision?] The weirdness didn’t show up until I ran my models forward based on those matrices and looked at some statistics, which were quite wonky.
Generally speaking, you should not use rand() in C. The C standard doesn’t ask much of it. On some platforms, it could be decent, but on other platforms, it could be bad.
@ Ken
I know this now, but the last time anyone formally taught me anything about C was in high school!
Hmm, I’m certainly learning a lot here. I’ve been told before about the importance of writing models from scratch in other languages, and now I see that’s a sensible approach. Now it seems like I should be testing on multiple systems, too! There was a debate about whether to ask people submitting models to EvoLang to provide code for reviewing – that would uncover some problems like this one. Still, it’s a bit scary releasing your code into the wild, a bit like knowing there are naked pictures of you on the internet.
I ran your R script with d = -1 (the inverse relationship you described), and the correlation was below the criterion 6% of the time. The distribution of correlations had a strong negative skew. Then I ran it again and got no correlations below the criterion and a strong positive skew. Looking at the raw data, I noticed strong temporal autocorrelation, which is potentially a major problem.
Instead of looking at the correlation at each time step, re-run with more populations and only look at the correlation once (at the end of the run) after a random initialization.
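In outline, something like this. I’ve sketched it in Python rather than R, and the population range, step scaling and run lengths are placeholders rather than your actual settings, but the structure is what matters:

    import random
    from scipy.stats import spearmanr

    def run_once(n_communities=100, n_steps=500, d=-1):
        """One independent run: random initialization, bounded random walks,
        then a single correlation computed only at the final time step."""
        pops = [random.randint(100, 10000) for _ in range(n_communities)]
        comp = [random.random() for _ in range(n_communities)]
        for _ in range(n_steps):
            for i, p in enumerate(pops):
                rate = 0.001 * (p / 10000.0) ** d   # d = -1: larger populations change more slowly
                comp[i] = min(max(comp[i] + random.choice([-1, 1]) * rate, 0.0), 1.0)
        return spearmanr(pops, comp)

    # Each run contributes exactly one correlation, so the estimates are independent of one another.
    results = [run_once() for _ in range(200)]
    print(sum(pval < 0.05 for rho, pval in results) / len(results))  # should land near the nominal 0.05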
Hao Ye: Cool – distributed open science at work! Yes, I was assuming that each data point was independent when in fact each one relies heavily on the previous one. Still, it looks like the baseline is just the standard 5%.
I was wondering whether a more complex dynamic would introduce system-wide fluctuations in the model. That is, if the languages were spatially distributed and your language complexity relied on both the population size and the amount of contact with other communities, then you could get chaotic dynamics which could cause periods of correlation. If anyone is up for trying this, feel free – I have to get back to my EvoLang submission!