Corpus Linguistics, Literary Studies, and Description

One of my main hobbyhorses these days is description. Literary studies has to get a lot more sophisticated about description, which is mostly taken for granted and so is not done very rigorously. There isn’t even a sense that there’s something there to be rigorous about. Perhaps corpus linguistics is a way to open up that conversation.
The crucial insight is this: What makes a statement descriptive IS NOT how one arrives at it, but the role it plays in the larger intellectual enterprise.

A Little Background Music

Back in the 1950s there was this notion that the process of aesthetic criticism took the form of a pipeline that started with description, moved on to analysis, then interpretation and finally evaluation. Academic literary practice simply dropped evaluation altogether and concentrated its efforts on interpretation. There were attempts to side-step the difficulties of interpretation by asserting that one is simply describing what’s there. To this Stanley Fish has replied (“What Makes an Interpretation Acceptable?” in Is There a Text in This Class?, Harvard 1980, p. 353):

 

The basic gesture then, is to disavow interpretation in favor of simply presenting the text: but it actually is a gesture in which one set of interpretive principles is replaced by another that happens to claim for itself the virtue of not being an interpretation at all.

 

And that takes care of that.
Except that it doesn’t. Fish is correct in asserting that there’s no such thing as a theory-free description. Literary texts are rich and complicated objects. When the critic picks this or that feature for discussion those choices are made with something in mind. They aren’t innocent.
But, as Michael Bérubé has pointed out in “There is Nothing Inside the Text, or, Why No One’s Heard of Wolfgang Iser” (in Gary Olson and Lynn Worsham, eds. Postmodern Sophistries, SUNY Press 2004, pp. 11-26) there is interpretation and there is interpretation and they’re not alike. The process by which the mind’s eye makes out letters and punctuation marks from ink smudges is interpretive, for example, but it’s rather different from throwing Marx and Freud at a text and coming up with meaning.
Thus I take it that the existence of some kind of interpretive component in any description need not imply that it is impossible to descriptively carve literary texts at their joints. And that’s one of the things that I want from description, to carve texts at their joints.
Of course, one has to know how to do that. And THAT, it would seem, is far from obvious.

Literary History, the Future: Kemp Malone, Corpus Linguistics, Digital Archaeology, and Cultural Evolution

In scientific prognostication we have a condition analogous to a fact of archery—the farther back you draw your longbow, the farther ahead you can shoot.
– Buckminster Fuller

The following remarks are rather speculative in nature, as many of my remarks tend to be. I’m sketching large conclusions on the basis of only a few anecdotes. But those conclusions aren’t really conclusions at all, not in the sense that they are based on arguments presented prior to them. I’ve been thinking about cultural evolution for years, and about the need to apply sophisticated statistical techniques to large bodies of text—really, all the texts we can get, in all languages—by way of investigating cultural evolution.

So it is no surprise that this post arrives at cultural evolution and concludes with remarks on how the human sciences will have to change their institutional ways to support that kind of research. Conceptually, I was there years ago. But now we have a younger generation of scholars who are going down this path, and it is by no means obvious that the profession is ready to support them. Sure, funding is there for “digital humanities” and so deans and department chairs can get funding and score points for successful hires. But you can’t build a profound and new intellectual enterprise on financially-driven institutional gamesmanship alone.

You need a vision, and though I’d like to be proved wrong, I don’t see that vision, certainly not on the web. That’s why I’m writing this post. Consider it a sequel to an article I published back in 1976 with my teacher and mentor, David Hays: Computational Linguistics and the Humanist. This post presupposes the conceptual framework of that vision, but neither restates nor endorses its specific recommendations (given in the form of a hypothetical program for simulating the “reading” of texts).

The world has changed since then and in ways neither Hays nor I anticipated. This post reflects those changes and takes as its starting point a recent web discussion about recovering the history of literary studies by using the largely statistical techniques of corpus linguistics in a kind of digital archaeology. But like Tristram Shandy, I approach that starting point indirectly, by way of a digression.

Who’s Kemp Malone?

Back in the ancient days when I was still an undergraduate, and we tied an onion in our belts as was the style at the time, I was at an English Department function at Johns Hopkins and someone pointed to an old man and said, in hushed tones, “that’s Kemp Malone.” Who is Kemp Malone, I thought? From his Wikipedia bio:

Born in an academic family, Kemp Malone graduated from Emory College as it then was in 1907, with the ambition of mastering all the languages that impinged upon the development of Middle English. He spent several years in Germany, Denmark and Iceland. When World War I broke out he served two years in the United States Army and was discharged with the rank of Captain.

Malone served as President of the Modern Language Association, and other philological associations … and was etymology editor of the American College Dictionary, 1947.

Who’d have thought the Modern Language Association was a philological association? Continue reading “Literary History, the Future: Kemp Malone, Corpus Linguistics, Digital Archaeology, and Cultural Evolution”

“Hierarchical structure is rarely…needed to explain how language is used in practice”

How hierarchical is language use?

Stefan L. Frank, Rens Bod and Morten H. Christiansen

Abstract: It is generally assumed that hierarchical phrase structure plays a central role in human language. However, considerations of simplicity and evolutionary continuity suggest that hierarchical structure should not be invoked too hastily. Indeed, recent neurophysiological, behavioural and computational studies show that sequential sentence structure has considerable explanatory power and that hierarchical processing is often not involved. In this paper, we review evidence from the recent literature supporting the hypothesis that sequential structure may be fundamental to the comprehension, production and acquisition of human language. Moreover, we provide a preliminary sketch outlining a non-hierarchical model of language use and discuss its implications and testable predictions. If linguistic phenomena can be explained by sequential rather than hierarchical structure, this will have considerable impact in a wide range of fields, such as linguistics, ethology, cognitive neuroscience, psychology and computer science.

Published online before print September 12, 2012, doi: 10.1098/rspb.2012.1741
Proceedings of the Royal Society B

Full text online HERE.

Wild Replicator’s Got Funky Rhythm, Part 2

As its name indicates, this post builds on Wild Replicator’s Got Funky Rhythm, Part 1. I want to call your attention, in particular, to the next to the last section, Becoming Memetic. There I trace, albeit sketchily, the history of Rhythm Changes. The point is that Rhythm Changes didn’t exist as a memetic entity in 1930, when George Gershwin wrote “I Got Rhythm.” Just when the chord changes had become differentiated from the song itself is not clear. But it had certainly happened, at least in the jazz world, by the mid 1940s. Thus, it is not as though certain patterns are essentially memetic while others are not. It’s a question of how the patterns function in the cultural system.

* * * * *

In the previous post I took a look at Rhythm Changes, a memetic entity that has played an important role in jazz and, in particular, in bebop. FWIW, Rhythm Changes has also been used in the theme songs for some well-known cartoons, Woody Woodpecker and The Flintstones. In this post I want to do several things:

  • consider all the elements of “I Got Rhythm,” rather than just the chord changes,
  • think briefly about how pools of memetic elements function in defining musical styles, and
  • look briefly at how the chord changes to Gershwin’s tune became memetically active.

Taken together those discussions flesh out the role of memetic elements in music systems in the large. I conclude by

  • examining this discussion of memes in music in the context of a recent article by Evelyn Fox Keller and David Harel, Beyond the Gene, and noting some broad thematic similarities between their discussion and mine.

I Got Rhythm, Whole

As I’ve indicated, Rhythm Changes is derived from, abstracted from, George Gershwin’s “I Got Rhythm.” Now let’s think about the whole tune, not just its harmonic trajectory, i.e. Rhythm Changes. In addition to that trajectory we also have a specific melody, the lyrics, the rhythmic framework, and the arrangement. The lyrics are optional; the tune can be performed without them, and among jazz musicians that is the typical, if not universal, performance practice. Note, however, that any consideration of the lyrics brings a whole other memetic field into consideration, that of language. Continue reading “Wild Replicator’s Got Funky Rhythm, Part 2”

Wild Replicator’s Got Funky Rhythm, Part 1

Now that the replicator meme is out and about I’ve got more to say. I’m going to republish two more posts from my 2010 cultural evolution series. These posts are about music. I have various reasons for using music as my cultural evolution conceptual sandbox. For one thing, it means that I don’t have to contend with semantic meanings arbitrarily associated with bits of music. In music, all we’ve got is the physical signal.

In these two posts I choose, not a simple musical example but, rather, a complex one, something jazz musicians know as Rhythm Changes. While I could talk about the four-note motif Beethoven used to construct the first movement of his Fifth Symphony, which is a memetic favorite, that’s too easy. Thinking about it won’t stretch our intuitions about the memetic properties of mere physical things. That motif has four notes, with specific durations and specific note-to-note pitch relationships.

Rhythm Changes isn’t like that. It’s an abstract property of a sound stream. There is no specific number of notes, no specific durations, and no specific note-to-note pitch relationships. Thousands upon thousands of specific musical streams, many quite different from one another, have exemplified the properties of Rhythm Changes.

In the previous post (in this series) I argued that memes, the cultural parallel to the biological gene, are those physical properties of objects, events, and processes that allow different individuals to coordinate their participation in those things. In this view, memes are not physical objects, like genes, that spread through a population. Rather, memes are about sharability; they are physical properties that can easily be identified by human nervous systems and thus be the basis for shared (cultural) activity.

In that post I considered a very basic case, people making noise at regular intervals. In that case we have two memes, period (the interval between “hits”) and phase (the relationship between streams of hits by different individuals). Now I want to consider a considerably more complex case, the entity jazz musicians know as Rhythm Changes. This entity assumes that, for a given performance, period length and phase value are agreed upon. In fact it assumes a lot more. We’re dealing with a whole lot of memes here.
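To make the two terms concrete, here is a minimal sketch that treats each person’s “hits” as a list of timestamps in seconds. The function names and the sample timestamps are made up for illustration; nothing here comes from the earlier post beyond the definitions of period and phase.

```python
def period(hits):
    """Mean interval between successive hits, in seconds."""
    gaps = [later - earlier for earlier, later in zip(hits, hits[1:])]
    return sum(gaps) / len(gaps)

def phase(hits_a, hits_b):
    """Offset of stream B relative to stream A, as a fraction of A's period."""
    p = period(hits_a)
    return ((hits_b[0] - hits_a[0]) % p) / p

# Two hypothetical players with the same period; B lands halfway between A's hits.
a = [0.0, 0.5, 1.0, 1.5]
b = [0.25, 0.75, 1.25, 1.75]

print(period(a))    # interval between A's hits
print(phase(a, b))  # 0.0 would mean B hits together with A
```

Real performers drift, of course, so measured periods and phases would be noisy estimates, not exact values.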

But I don’t want to get hung up in those details. I just want to characterize Rhythm Changes in a reasonable way and explain just why I insist that we regard Rhythm Changes as a structured collection of physical properties that can be ascribed to a stream of sound. While it would be nice to characterize Rhythm Changes using the language of acoustics, it’s not at all clear to me that we’ve got the necessary concepts. In any event, if we do, I don’t know them. Instead, I’ll couch my description in the schematic terms jazz musicians tend to use when talking about their craft; these terms are derived, in part, from descriptive and analytic concepts developed for European art music (i.e. classical music).

I’m going to do this in two posts. The first will be confined to Rhythm Changes itself. The second will consider how Rhythm Changes came into being and how it functions in the popular music system. Continue reading “Wild Replicator’s Got Funky Rhythm, Part 1”

In Search of the Wild Replicator


The key to the treasure is the treasure.
– John Barth

In view of Sean’s post about Andrew Smith’s take on linguistic replicators I’ve decided to repost this rather longish note from New Savanna. I’d originally posted it in the Summer of 2010 as part of a run-up to a post on cultural evolution for the National Humanities Center (USA); I’ve collected those notes into a downloadable PDF. Among other things the notes deal with William Croft’s notions (at least as they existed in 2000) and suggest that we’ll find language replicators on the emic side of the emic/etic distinction.

I’ve also appended some remarks I made to John Lawler in the subsequent discussion at New Savanna.

* * * * *
There’s been a fair amount of work done on language from an evolutionary point of view, which is not surprising, as historical linguistics has well-developed treatments of language lineages and taxonomy, the “stuff” of large-scale evolutionary investigation. While this work is directly relevant to a consideration of cultural evolution, however, I will not be reviewing or discussing it. For it doesn’t deal with the theoretical issues that most concern me in these posts, namely, a conceptualization of the genetic and phenotypic entities of culture. This literature is empirically oriented in a way that doesn’t depend on such matters.

The Arbitrariness of the Sign

In particular, I want to deal with the arbitrariness of the sign. Given my approach to memes, that arbitrariness would appear to eliminate the possibility that word meanings could have memetic status. For, as you may recall, I’ve defined memes to be perceptual properties – albeit sometimes very complex and abstract ones – of physical things and events. Memes can be defined over speech sounds, language gestures, or printed words, but not over the meanings of words. Note that by “meaning” I mean the mental or neural event that is the meaning of the word, what Saussure called the signified. I don’t mean the referent of the word, which, in many cases, but by no means all, would have perceptible physical properties. I mean the meaning, the mental event. In this conception, it would seem that that cannot be memetic.

That seems right to me. Language is different from music and drawing and painting and sculpture and dance; it plays a different role in human society and culture. On that basis one would expect it to come out fundamentally different on a memetic analysis.

This, of course, leaves us with a problem. If word meaning is not memetic, then how is it that we can use language to communicate, and very effectively over a wide range of cases? Not only language, of course, but everything that depends on language. Continue reading “In Search of the Wild Replicator”

Conrad’s Special K: Periodicity in Heart of Darkness

Digital Humanities Sandbox Goes to the Congo, Part II

While Kurtz is the center of attention in Heart of Darkness, he doesn’t appear until relatively late in the story. He isn’t mentioned until about 8,000 words into the 38,000-word text, nor do we know much about him until a long paragraph that starts roughly 23,000 words into the text. That paragraph, which I’ve called the nexus, is structurally central to the text and is roughly 1,500 words long.

I decided to investigate Kurtz’s presence in the text by the simple expedient of noting where the name “Kurtz” occurs. The result, my colleague Tim Perper subsequently told me, is what’s called a periodogram (PDF):


Figure 1: Periodicity in the appearance of “Kurtz”
Visual inspection suggests that the appearance of “Kurtz” is periodic, with two components, a short one and a significantly longer one. Before discussing this further, however, I would like to explain what I’ve done. Continue reading “Conrad’s Special K: Periodicity in Heart of Darkness”
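For readers who want to try the same expedient, here is a minimal sketch of noting where a name occurs and counting occurrences per stretch of text. It is not the procedure used for Figure 1, which the full post explains; the function names, the whitespace tokenization, the default bin size of 500 words, and the sample sentence are all assumptions made for illustration.

```python
def word_positions(text, name="Kurtz"):
    """Return the 0-based word index of every token that contains `name`."""
    words = text.split()
    return [i for i, word in enumerate(words) if name in word]

def counts_per_bin(positions, total_words, bin_size=500):
    """Count occurrences in consecutive bins of `bin_size` words."""
    n_bins = (total_words + bin_size - 1) // bin_size  # round up
    counts = [0] * n_bins
    for p in positions:
        counts[p // bin_size] += 1
    return counts

# A made-up sample, not a quotation from the novella.
sample = "Mr. Kurtz was there. I had heard of Kurtz before. Kurtz's station lay upriver."
positions = word_positions(sample)
print(positions)
print(counts_per_bin(positions, len(sample.split()), bin_size=5))
```

Plotting the per-bin counts against bin number gives a chart in which any regular spacing of the name becomes visible.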

Digital Humanities Sandbox Goes to the Congo

Or, Speculations in Computational Evolutionary Psychology

Note: This version of the post has been revised from an earlier version in which I suggested that the distribution in the first chart followed a power law. Cosma Shalizi checked it for me and it’s not a power law distribution. It’s an exponential distribution.

So, I’ve been exploring Conrad’s Heart of Darkness. In the last two posts I’ve examined one paragraph in the text, the so-called nexus. It’s the longest paragraph in the text, it’s structurally central, and it covers a lot of semantic territory.

OK, but what about the other paragraphs?

What about them?

Aren’t you going to look at them?

Well, yeah, but I sure don’t have time to troll through them like I did the nexus. I mean, that post stretched from here to Sunday.

I get your point. Why don’t you do the Moretti thing?

Moretti thing?

You know, distant reading.

Distant reading? You mean count something? Count what?

How about paragraph length?

What’ll that get me?

I don’t know. Just do it. I mean, you already know that the nexus is the longest paragraph in the text. There must be something going on with that. Mess around and see if something turns up.


* * * * *
I did and it did.

I used the MSWord word-count tool to count the words in every paragraph in the text. All 198 of them. One at a time. Real tedious stuff. Then I loaded the results into a spreadsheet and created a bar chart showing paragraph length from longest to shortest:

Figure: Paragraph length in Heart of Darkness, ordered from longest to shortest

Continue reading “Digital Humanities Sandbox Goes to the Congo”
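The one-at-a-time count can also be scripted. What follows is a minimal sketch, not the method used in the post: it assumes a plain-text copy of the novella with paragraphs separated by blank lines, and it counts whitespace-separated words, so the totals may differ slightly from what MSWord reports. The function name and sample text are made up for illustration.

```python
def paragraph_lengths(text):
    """Word count of each paragraph; paragraphs are separated by blank lines."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return [len(p.split()) for p in paragraphs]

# A made-up three-paragraph sample standing in for the full text.
sample = "One two three.\n\nFour five.\n\nSix seven eight nine ten."
lengths = paragraph_lengths(sample)

# Ordering from longest to shortest gives the data for the bar chart.
ordered = sorted(lengths, reverse=True)
print(lengths)
print(ordered)
```

Keeping the unsorted list as well preserves the order in which the paragraphs appear in the text, which matters for any question about where the long paragraphs fall.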

Cognitivism and the Critic 2: Symbol Processing

It has long been obvious to me that the so-called cognitive revolution is what happened when computation – both the idea and the digital technology – hit the human sciences. But I’ve seen little reflection of that in the literary cognitivism of the last decade and a half. And that, I fear, is a mistake.

Thus, when I set out to write a long programmatic essay, Literary Morphology: Nine Propositions in a Naturalist Theory of Form, I argued that we think of literary text as a computational form. I submitted the essay and found that both reviewers were puzzled about what I meant by computation. While publication was not conditioned on providing such satisfaction, I did make some efforts to satisfy them, though I’d be surprised if they were completely satisfied by those efforts.

That was a few years ago.

Ever since then I’ve pondered the issue: how do I talk about computation to a literary audience? You see, some of my graduate training was in computational linguistics, so I find it natural to think about language processing as entailing computation. As literature is constituted by language it too must involve computation. But without some background in computational linguistics or artificial intelligence, I’m not sure the notion is much more than a buzzword that’s been trendy for the last few decades – and that’s an awful long time for being trendy.

I’ve already written one post specifically on this issue: Cognitivism for the Critic, in Four & a Parable, where I write abstracts of four texts which, taken together, give a good feel for the computational side of cognitive science. Here’s another crack at it, from a different angle: symbol processing.

Operations on Symbols

I take it that ordinary arithmetic is most people’s ‘default’ case for what computation is. Not only have we all learned it, it’s fundamental to our knowledge, like reading and writing. Whatever we know, think, or intuit about computation is built on our practical knowledge of arithmetic.

As far as I can tell, we think of arithmetic as being about numbers. Numbers are different from words. And they’re different from literary texts. And not merely different. Some of us – many of whom study literature professionally – have learned that numbers and literature are deeply and utterly different to the point of being fundamentally in opposition to one another. From that point of view the notion that literary texts be understood computationally is little short of blasphemy.

Not so. Not quite.

The question of just what numbers are – metaphysically, ontologically – is well beyond the scope of this post. But what they are in arithmetic, that’s simple; they’re symbols. Words too are symbols; and literary texts are constituted of words. In this sense, perhaps superficial, but nonetheless real, the reading of literary texts and making arithmetic calculations are the same thing, operations on symbols. Continue reading “Cognitivism and the Critic 2: Symbol Processing”
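The point that arithmetic is an operation on symbols can be shown with a minimal sketch of schoolbook addition carried out on numerals written as strings. The names `ADD_TABLE` and `add_numerals` are invented for this illustration. The lookup table is built with Python’s own integers for brevity, but the addition routine itself only looks symbols up and writes symbols down, much as one does with a memorized addition table.

```python
DIGITS = "0123456789"

# Lookup table: (digit, digit, carry-in) -> (result digit, carry-out).
# Built with Python integers for brevity; add_numerals below never
# does arithmetic, it only consults this table.
ADD_TABLE = {
    (a, b, c): (DIGITS[(i + j + k) % 10], DIGITS[(i + j + k) // 10])
    for i, a in enumerate(DIGITS)
    for j, b in enumerate(DIGITS)
    for k, c in enumerate("01")
}

def add_numerals(x, y):
    """Add two decimal numerals given as strings, right to left, digit by digit."""
    width = max(len(x), len(y))
    x, y = x.rjust(width, "0"), y.rjust(width, "0")  # pad the shorter numeral
    carry = "0"
    out = []
    for a, b in zip(reversed(x), reversed(y)):
        digit, carry = ADD_TABLE[(a, b, carry)]
        out.append(digit)
    if carry == "1":
        out.append("1")
    return "".join(reversed(out))

print(add_numerals("478", "64"))
```

Swap the ten digit characters for any other ten distinct marks, rebuild the table, and the routine works unchanged, which is one sense in which the procedure is about symbols.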

Statistics and Symbols in Mimicking the Mind

MIT recently held a symposium on the current status of AI, which apparently has seen precious little progress in recent decades. The discussion, it seems, ground down to a squabble over the prevalence of statistical techniques in AI and a call for a revival of work on the sorts of rule-governed models of symbolic processing that once dominated much of AI and its sibling, computational linguistics.

Briefly, from the early days in the 1950s up through the 1970s both disciplines used models built on carefully hand-crafted symbolic knowledge. The computational linguists built parsers and sentence generators and the AI folks modeled specific domains of knowledge (e.g. diagnosis in selected medical domains, naval ships, toy blocks). Initially these efforts worked like gang-busters. Not that they did much by Star Trek standards, but they actually did something and they did things never before done with computers. That’s exciting, and fun.

In time, alas, the excitement wore off and there was no more fun. Just systems that got too big and failed too often and they still didn’t do a whole heck of a lot.

Then, starting, I believe, in the 1980s, statistical models were developed that, yes, worked like gang-busters. And these models actually did practical tasks, like speech recognition and then machine translation. That was a blow to the symbolic methodology because these programs were “dumb.” They had no knowledge crafted into them, no rules of grammar, no semantics. Just routines that learned while gobbling up terabytes of example data. Thus, as Google’s Peter Norvig points out, machine translation is now dominated by statistical methods. No grammars and parsers carefully hand-crafted by linguists. No linguists needed.

What a bummer. For machine translation is THE prototype problem for computational linguistics. It’s the problem that set the field in motion and has been a constant arena for research and practical development. That’s where much of the handcrafted art was first tried, tested, and, in a measure, proved. For it to now be dominated by statistics . . . bummer.

So that’s where we are. And that’s what the symposium was chewing over.

Continue reading “Statistics and Symbols in Mimicking the Mind”