Evolution of Human Languages

An international project on the linguistic prehistory of humanity
coordinated by the Santa Fe Institute

Global Lexicostatistical Database

George Starostin, Russian State University for the Humanities (with the assistance of Alexei Kassian, Ilya Peiros and others)

Lexicostatistics (the methodology of classifying related languages based on the relative amounts of common-origin words in their basic lexicon) and glottochronology (dating the divergence of languages based on the same amounts, under the assumption of a regular rate of change) were invented by the American linguist Morris Swadesh in the 1950s and have since experienced their ups and downs in the scientific community. Today, they are frequently branded as "controversial", especially in general theoretical works on comparative-historical linguistics. Nevertheless, many specialists in particular language families have continued to rely on them, at least partially. With the recent surge of interest in applying the standards of cladistic methodology to language classification, lexicostatistics has also experienced something of a cautious revival in many publications.
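The two definitions above can be illustrated with a minimal sketch. It assumes that each wordlist has already been coded by hand, so that every meaning carries a cognate-class number (two words with the same number are judged to share a common origin). The wordlists, the function names, and the retention rate of 0.86 per millennium (a figure often quoted for the 100-item list) are illustrative assumptions, and the dating formula is the classic Swadesh-style one, t = ln(c) / (2 ln(r)), not necessarily the procedure used by the GLD.

```python
import math

# Hypothetical toy data: meaning -> cognate-class ID for two invented languages.
LANG_A = {"eye": 1, "water": 1, "stone": 1, "fire": 1, "hand": 1}
LANG_B = {"eye": 1, "water": 1, "stone": 2, "fire": 1, "hand": 1}


def cognate_share(list_a, list_b):
    """Fraction of the meanings present in both lists whose words are cognate."""
    shared_meanings = [m for m in list_a if m in list_b]
    if not shared_meanings:
        raise ValueError("the wordlists have no meanings in common")
    matches = sum(1 for m in shared_meanings if list_a[m] == list_b[m])
    return matches / len(shared_meanings)


def divergence_time(share, retention_rate=0.86):
    """Classic glottochronological estimate, in millennia: ln(c) / (2 ln(r)).

    retention_rate is the assumed share of basic vocabulary that a single
    language keeps after one millennium (an assumption, not a GLD value).
    """
    if not 0 < share <= 1:
        raise ValueError("share must lie in the interval (0, 1]")
    return math.log(share) / (2 * math.log(retention_rate))


share = cognate_share(LANG_A, LANG_B)
print(f"shared cognates: {share:.0%}")
print(f"estimated divergence: {divergence_time(share):.2f} millennia ago")
```

The factor of 2 in the denominator reflects that both languages are assumed to have been replacing words independently since the split. A five-item list is far too short for real work and is used here only to keep the example readable.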

Given EHL's interest in elaborating a formalized universal standard for testing hypotheses of language relationship and working out a general linguistic taxonomy, it is becoming increasingly clear that lexicostatistics is the single most powerful tool for these purposes. Lexicostatistical classifications do not intend or presume to replace the traditional method of classifying languages based on shared innovations (in phonology and grammar as well as lexicon). Their evidence, however, can be quantified on an objective basis, unlike shared-innovation-based classifications, which almost always reflect the subjective intuition of the researcher. Situations in which lexicostatistical classifications contradict the "mainstream" model, which arise far less frequently than is sometimes thought, make for valuable case studies that help to refine, rather than disavow, both approaches.

The Global Lexicostatistical Database (GLD) is a subproject recently undertaken by EHL that aims, within several years, to compile, annotate, and publish online the largest collection of so-called "Swadesh wordlists" ever put together by one team of linguists. The foundation of the GLD consists of hundreds of "raw" wordlists, compiled from a variety of sources by EHL members over a period of more than ten years. All of them are now undergoing serious revision under a set of new, more rigorous standards of word selection, transcription, and etymologization. The database, much like EHL's etymological databases, will consist of a series of cross-linked "mini"-databases that will include not only wordlists from modern or historically attested languages, but also wordlists for reconstructed proto-languages, with each choice of a "proto-word" annotated and explicated.

Since the use of lexicostatistical data for language classification depends crucially on the correctness of the researcher's etymological decisions, the GLD will be supplemented with tools for both manual and automatic analysis of the material, with the resulting language tree produced online at the user's request. The different results can then be compared, and the discrepancies can be used to correct the procedures.
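As a rough illustration of how a language tree can be derived from cognate judgments, the sketch below turns pairwise cognate shares into distances and clusters them by average linkage (UPGMA). The choice of UPGMA, the four invented wordlists, and all function names are assumptions made for the example; the text above does not specify which tree-building procedure the GLD uses.

```python
from itertools import combinations

# Hypothetical cognate-class codings (meaning -> class ID) for four invented languages.
WORDLISTS = {
    "A": {"eye": 1, "water": 1, "stone": 1, "fire": 1, "hand": 1, "two": 1},
    "B": {"eye": 1, "water": 1, "stone": 1, "fire": 1, "hand": 1, "two": 2},
    "C": {"eye": 1, "water": 2, "stone": 2, "fire": 2, "hand": 1, "two": 3},
    "D": {"eye": 1, "water": 2, "stone": 2, "fire": 3, "hand": 1, "two": 3},
}


def distance(list_a, list_b):
    """Lexicostatistical distance: 1 minus the share of cognate pairs."""
    shared = [m for m in list_a if m in list_b]
    matches = sum(list_a[m] == list_b[m] for m in shared)
    return 1 - matches / len(shared)


def upgma(wordlists):
    """Build a tree of nested tuples by repeatedly merging the closest clusters."""
    clusters = {name: 1 for name in wordlists}  # cluster -> number of languages in it
    dist = {
        frozenset((a, b)): distance(wordlists[a], wordlists[b])
        for a, b in combinations(wordlists, 2)
    }
    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: dist[frozenset(pair)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Distance to the new cluster is the size-weighted average of its parts.
        for other in clusters:
            dist[frozenset((merged, other))] = (
                dist[frozenset((a, other))] * size_a
                + dist[frozenset((b, other))] * size_b
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


print(upgma(WORDLISTS))
```

Changing a single cognate judgment in the toy data and rerunning the function shows how directly the resulting tree depends on the etymological decisions behind it, which is why comparing the results of different analyses is useful.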

Even in unfinished form, the GLD promises to be extremely useful for everyone interested in language classification, as well as phonetic and semantic typology and the basic issues of diversification of languages.