James and I have written about Galton’s problem in large datasets. Because two modern languages can have a common ancestor, the traits that they exhibit aren’t independent observations. This can lead to spurious correlations: patterns in the data that are statistical artefacts rather than indications of causal links between traits.
However, I’ve often felt like we haven’t articulated the general concept very well. For an upcoming paper, we created some diagrams that try to present the problem in its simplest form.
Spurious correlations can be caused by cultural inheritance
Above is an illustration of how cultural inheritance can lead to spurious correlations. At the top are three independent historical cultures, each of which has a bundle of traits, represented as coloured shapes. Each trait is causally independent of the others. On the right is a contingency table for the colours of triangles and squares. There is no particular relationship between the colour of triangles and the colour of squares. However, over time these cultures split into new cultures. Along the bottom of the diagram are the currently observable cultures. A pattern has now emerged in the raw numbers (pink triangles occur with orange squares, and blue triangles occur with red squares). The mechanism that brought about this pattern is simply that the traits are inherited together, with some combinations replicating more often than others: there is no causal mechanism by which pink triangles make orange squares more likely.
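The same mechanism can be sketched in a few lines of Python. This is only a toy illustration, not the data behind the diagram: the trait combinations assigned to the three ancestors and the number of descendants each one leaves are hypothetical values chosen to make the effect visible.

```python
from collections import Counter

# Hypothetical ancestral cultures: (triangle colour, square colour).
# These combinations are assumptions, not read off the diagram.
ancestors = {
    "A": ("pink", "orange"),
    "B": ("blue", "red"),
    "C": ("blue", "orange"),
}

# Hypothetical number of present-day cultures descended from each ancestor.
# Descendants inherit both traits unchanged; nothing links the two traits.
n_descendants = {"A": 4, "B": 4, "C": 1}

# Contingency counts for the three independent ancestral cultures.
ancestral_table = Counter(ancestors.values())

# Present-day cultures: each ancestor's trait bundle is copied once per descendant.
descendants = [
    traits
    for name, traits in ancestors.items()
    for _ in range(n_descendants[name])
]
observed_table = Counter(descendants)

for label, table in [("ancestral", ancestral_table), ("observed", observed_table)]:
    print(label)
    for triangle in ("pink", "blue"):
        for square in ("orange", "red"):
            print(f"  {triangle} triangle, {square} square: {table[(triangle, square)]}")
```

In the observed counts, pink triangles only ever appear with orange squares and blue triangles mostly appear with red squares, yet the nine present-day cultures still derive from just three independent histories. Treating them as nine independent observations is what produces the apparent pattern.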
Spurious correlations can be caused by borrowing
Above is an illustration of how borrowing (or areal effects or horizontal cultural inheritance) can lead to spurious correlations. Three cultures (left to right) evolve over time (top to bottom). Each culture has a bundle of various traits which are represented as coloured shapes. Each trait is causally independent of the others. On the right is a count of the number of cultures with both blue triangles and red squares. In the top generation, only one out of three cultures has both. Over some period of time, the blue triangle is borrowed from the culture on the left by the culture in the middle, and then from the culture in the middle by the culture on the right. By the end, all cultures have blue triangles and red squares. The mechanism that brought about this pattern is simply that one trait spread through the population: there is no causal mechanism by which blue triangles make red squares more likely.
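The borrowing scenario can also be written out as a small Python sketch. It follows the sequence described above (left to middle, then middle to right), but the starting triangle colour of the middle and right cultures is a hypothetical choice, and the helper names `count_both` and `borrow` are made up for this illustration.

```python
# Three cultures, ordered left, middle, right. Every culture starts with a
# red square; only the left one starts with a blue triangle. The "pink"
# starting colour for the other two is an assumption for illustration.
cultures = [
    {"triangle": "blue", "square": "red"},
    {"triangle": "pink", "square": "red"},
    {"triangle": "pink", "square": "red"},
]


def count_both(cultures):
    """Count cultures that have both a blue triangle and a red square."""
    return sum(
        c["triangle"] == "blue" and c["square"] == "red" for c in cultures
    )


def borrow(cultures, donor, recipient, trait):
    """Copy one trait from the donor culture to the recipient culture."""
    cultures[recipient][trait] = cultures[donor][trait]


# Record the count in each generation as the triangle spreads.
history = [count_both(cultures)]
for donor, recipient in [(0, 1), (1, 2)]:
    borrow(cultures, donor, recipient, "triangle")
    history.append(count_both(cultures))

print(history)  # [1, 2, 3]
```

The count rises from one to three purely because the triangle trait is copied between neighbours. The square trait never changes and plays no part in the borrowing.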
A similar effect would be caused by a bundle of causally unrelated features being borrowed, as shown below.