Warning: Massive wall of text ahead.
Whew! I’ve now read that piece well enough to begin to have a handle on it. (With the caveat that I’ll never have a handle on the technical aspects of the speech analysis.)
I happen to know a little about speech-processing research, and these authors have taken a novel approach to the subject. If I may summarize: They note the existence of a number of statistical ‘laws’ concerning written language, three of which they explored in this paper:
–Zipf’s law, which concerns word frequency. It states that, in a sufficiently large body of written prose, a word’s frequency is inversely proportional to its rank: the most common word occurs roughly twice as often as the second most common word, three times as often as the third most common word, and so on.
–Heaps’ law, which states that 1) the number of unique words in a text is a power function of the text’s total word-count (roughly V ≈ K·N^β), and 2) the exponent β of that power function is a fraction between 0 and 1. Thus the function is monotonically increasing but decelerating (ie, it rises rapidly at first, then flattens out, though it never quite stops growing). This is a fancy way of saying that, relatively speaking, the longer an essay is, the fewer new words it introduces. (As an aside, Zipf’s law is a power function as well.)
–The brevity law, which states that the length of a word is negatively correlated with its frequency of use (ie, words we use a lot tend to be shorter than words we use rarely). All three laws are easy to see in ordinary text, as in the toy sketch just below.
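Here’s a quick toy demonstration of what these three laws look like in practice. To be clear, this is my own sketch, not anything from the paper: ‘corpus.txt’ is a hypothetical placeholder for any large plain-text file, and the crude tokenizer and sample sizes are arbitrary choices of mine.

```python
# Toy check of Zipf's, Heaps', and the brevity law on ordinary text.
# 'corpus.txt' is a hypothetical placeholder for any large plain-text file.
import re
from collections import Counter

text = open("corpus.txt", encoding="utf-8").read().lower()
words = re.findall(r"[a-z']+", text)  # crude tokenizer, good enough here
counts = Counter(words)

# Zipf's law: a word's frequency should fall off roughly as 1/rank.
top_freq = counts.most_common(1)[0][1]
for rank, (word, freq) in enumerate(counts.most_common(5), start=1):
    print(f"rank {rank}: {word!r} x{freq} (Zipf predicts ~{top_freq / rank:.0f})")

# Heaps' law: vocabulary should grow as a sublinear power of text length.
step = max(1, len(words) // 10)
seen, vocab_curve = set(), []
for i, w in enumerate(words, start=1):
    seen.add(w)
    if i % step == 0:
        vocab_curve.append((i, len(seen)))
print("tokens vs. unique words:", vocab_curve)  # flattens as the text grows

# Brevity law: the most frequent words should tend to be shorter.
top100 = counts.most_common(100)
avg_top = sum(len(w) for w, _ in top100) / len(top100)
avg_all = sum(len(w) for w in counts) / len(counts)
print(f"mean length: top-100 words {avg_top:.1f}, all distinct words {avg_all:.1f}")
```

On most decent-sized English texts you should see the top counts tracking the 1/rank prediction at least roughly, the vocabulary curve visibly flattening, and the most frequent words coming out markedly shorter than average.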
All three of these are what are called scaling laws, meaning they describe how one aspect of a phenomenon changes (scales) along with another aspect of the phenomenon. Scaling laws are ubiquitous in nature. What is of particular interest is when a scaling law for one phenomenon is found to apply to an entirely different phenomenon, as this implies the existence of some sort of underlying ‘universal law.’ For example, Zipf’s law has been found to apply to many rankings unrelated to language, such as the populations of cities in various countries, the sizes of corporations, income distributions, and the audience sizes of TV channels. (Full disclosure: I cribbed these examples straight from the Wiki article on Zipf’s law.)
Getting back to language-based communication, here’s the rub: All of these laws were identified in, and thus known to apply to, written communication. What was not known is whether they also apply to oral communication (ie, speech). Now, it might seem trivially simple to test whether they apply to speech production—just get a written transcript of speech and analyze it. Easy-peasy, right?
Unfortunately, it’s not that simple. The problem is one of segmentation: identifying where one word stops and another word starts. In written communication this is trivial; we put a space between the last letter of one word and the first letter of the next. But speech is not composed of letters; rather, it is composed of phonemes. Phonemes are the ‘building-block sounds’ of speech. Every spoken language has its own set of phonemes (some languages have certain phonemes in common, of course) that it uses to construct every spoken word in its lexicon. For example, /b/ and /p/ are phonemes in English; they are the reason the words /b/at and /p/at are different words to our ears.

That may seem obvious (how could anyone not hear ‘bat’ and ‘pat’ as different words?), but this obviousness reflects our native-language bias. There are languages that do not employ these sounds as separate phonemes, and speakers of those languages have a great deal of difficulty hearing bat and pat as different words. This becomes readily apparent when you consider certain stereotypic word confusions; eg, a native French speaker who pronounces Walk the dog as Walk zee dog (French lacks the English /th/ sounds, so the nearest French phoneme, /z/, gets substituted), or a native Japanese speaker who pronounces Fried rice as Flied lice (Japanese does not treat /r/ and /l/ as separate phonemes). Likewise, as native English speakers, we find some of the pharyngeal and glottal phonemes of Arabic perplexing and hard to discern, and are even more baffled by the ‘click’ phonemes employed in certain African languages.
OK, so what? What does this have to do with analyzing speech for evidence of Zipf’s law, Heaps’ law, etc? As I said, the issue concerns segmentation. While a native speaker segments speech effortlessly, it turns out that, when analyzed ‘objectively’ (ie, by looking at the speech signal the way it shows up on an oscilloscope, in terms of its acoustic energy, amplitude, frequency, etc), the physical properties of speech do not segment in a manner consistent with how we hear it. That is, when listening to speech, we effortlessly hear phonemes sequentially constructing words, and hear empty spaces between words. But the acoustic signal looks very little like this. For one thing, the articulations of adjacent phonemes overlap in time, blending into one another (this is called coarticulation). Further, how a given phoneme is articulated (and hence its acoustic-energy properties) is heavily influenced by the phonemes that precede and follow it. Finally, there is often no demarcation, no ‘dead space,’ between words in the acoustic signal; you can check that last point yourself with the snippet below.
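This isn’t from the paper, just my own quick way of seeing it: ‘speech.wav’ is a hypothetical placeholder for any mono recording of fluent speech, and the 10 ms frame size and the silence cutoff are arbitrary choices of mine.

```python
# Measure how much genuine 'dead space' a fluent-speech recording contains.
# 'speech.wav' is a hypothetical placeholder; assumes a mono recording.
import numpy as np
from scipy.io import wavfile

rate, raw = wavfile.read("speech.wav")
signal = raw.astype(np.float64)
signal /= np.abs(signal).max()            # normalize to [-1, 1]

frame = int(0.010 * rate)                 # 10 ms analysis frames
n_frames = len(signal) // frame
energy = np.array([np.sum(signal[i * frame:(i + 1) * frame] ** 2)
                   for i in range(n_frames)])

# Count frames quiet enough to plausibly be a pause between words.
silent = energy < 0.01 * energy.max()     # arbitrary cutoff
print(f"{silent.mean():.1%} of frames are near-silent")
```

For connected speech that number tends to come out strikingly small relative to the number of word boundaries in the recording; most of the ‘spaces’ we hear simply aren’t in the signal.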
This fact—that the acoustic signal of speech does not segment in a manner consistent with its semantic content—is the crux of the dilemma facing the authors of the present study. They want to ascertain whether the acoustics of speech follow the same scaling laws as written prose. But note that these statistical scaling laws are necessarily dependent upon how the signal is segmented–upon precisely what it is that one counts when adding up event frequencies. As mentioned, segmenting written prose is trivially easy. But for purposes of statistical analysis, how should speech be segmented? One obvious answer is to simply have a native speaker of the language listen to the speech and demarcate the acoustic signal into respective words. But note that this is simply reducing speech to written prose, and thus would not tell you anything you didn’t already know (as written prose has already been well-studied in this regard). Further, such listener-based segmentation is inescapably influenced by cognitive biases that result in hearing distinctions in the acoustic signal that, objectively speaking, simply aren’t there.
To get around this dilemma, the authors elected to approach speech analysis from a purely acoustic perspective. That is, they treated the acoustic signal of speech as simply a series of energy bursts, and analyzed those bursts without regard to how they related to any semantically meaningful properties of the signal. In other words, they made no attempt to divide the signal into words, or even into phonemes; rather, they looked at the signal as a bunch of squiggly lines corresponding to bursts of energy of varying amounts, and carved those bursts into discrete, arbitrary units, each characterized by the amount of energy it contains. These energy units were then analyzed with respect to their frequencies of occurrence, to see whether Zipf’s law, Heaps’ law and the brevity law would apply.
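Here’s roughly how I picture that working, reusing the `energy` envelope from the snippet above. To be clear, this is my own back-of-the-envelope reconstruction, not the authors’ actual procedure (theirs is surely more careful); the threshold and the log-scale binning are arbitrary choices of mine.

```python
# Carve the energy envelope into 'bursts' (contiguous runs above a
# threshold), then treat coarse energy bins as the countable 'units'.
# Reuses the 'energy' array from the previous snippet.
from collections import Counter
import numpy as np

threshold = 0.05 * energy.max()            # arbitrary cutoff
above = energy > threshold

bursts, start = [], None
for i, loud in enumerate(above):
    if loud and start is None:
        start = i                          # a burst begins
    elif not loud and start is not None:
        bursts.append((start, i))          # a burst ends
        start = None
if start is not None:
    bursts.append((start, len(above)))     # recording ended mid-burst

def burst_unit(b):
    """Bin a burst's total energy on a log scale, so bursts of
    similar size count as occurrences of the same 'unit'."""
    total = energy[b[0]:b[1]].sum()
    return int(np.floor(4 * np.log10(total + 1e-12)))  # arbitrary bin width

counts = Counter(burst_unit(b) for b in bursts)
for rank, (unit, n) in enumerate(counts.most_common(10), start=1):
    print(f"rank {rank}: unit {unit} occurs {n} times")
```

A Zipf-like pattern would show those counts falling off roughly as 1/rank; comparing each unit’s mean duration against its frequency probes the brevity law in the same spirit, and tracking distinct units against total bursts as the recording grows probes Heaps’ law.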
And sure enough, they did. That is, the authors found that when speech was divided into these arbitrary (but objectively real) units of energy, the frequencies with which those units occurred conformed to all three of the scaling laws in question. The implications of this finding are well presented in the Discussion section of the paper.
OK, I’m exhausted. If this megapost generates any interest, I’d be happy to share other thoughts I have. Either way, my thanks to @Aragorn for sharing such an interesting study with us.