There are currently about 6,000 languages on our planet,
some spoken by millions, some by only a few dozen people. The primary
goal of the international program known as EHL (Evolution of Human Language)
is to work out a detailed historical classification of these languages, organizing them
into a genealogical tree similar to the accepted classification of biological species.
Since all representatives of the species Homo sapiens presumably share a common origin,
it would be natural to suppose - although extremely hard to prove - that all or most known human
languages also go back to some common source. The only way to proceed here is "bottom-up":
classifying attested languages and dialects into groups, groups into families, families
into "macro-" or "superfamilies" and so on, as far as one can penetrate using comparative-historical and
cladistic methodology. Most existing classifications, however, do not
look beyond some 300-400 language families that are relatively easy to discern. This
restriction has natural reasons: languages must have been spoken and constantly
evolving for at least 40,000 years (and quite probably more), while any two
languages separated from a common source usually lose almost all superficially
common features after some 6,000-7,000 years.
Nevertheless, despite widespread skepticism and reluctance to tackle the
problem, there are a number of scholars who believe that these obstacles are
not insurmountable. Research has been going on over the past several decades
that appears to indicate that larger genetic groupings are not only possible,
but indeed quite plausible. It can be shown that most of the world's language
families can be classified into roughly a dozen large groupings, or macrofamilies.
Two sorts of evidence can be used for this purpose:
1) The science of historical linguistics has developed a very powerful tool -
the comparative method - that allows the reconstruction of unattested language
stages, so-called proto-languages, based on systematic comparison of their present
day descendants. With the gradual accumulation of this data over the past 200 years,
it has become evident that, while modern languages may vary significantly, protolanguages
in many cases tend to be much more similar to one another. Thus, modern English, Finnish,
and Turkish may have very little in common (and what little there is is practically
indistinguishable from chance), but their respective ancestors - Proto-Indo-European,
Proto-Uralic and Proto-Altaic - appear to have many more common traits and common vocabulary.
This means that it is possible, in theory and in practice, to extend the time perspective and
reconstruct even earlier stages of human language. In fact, much of this research
has already been conducted.
2) Where a detailed reconstruction of the proto-language is impossible to achieve (e.g.
because of insufficient data) or requires more time and effort than can be spared,
it is still possible to build somewhat weaker models of language evolution based on
a combination of manual and automatic analysis of limited corpora of data. Of all types
of linguistic data that can be used for historical purposes, it is the so-called "basic lexicon"
that generally persists the longest over time. Focusing our attention on the comparison
of small groups of words, such as the Swadesh wordlist, and tracing their evolution on macro- and
micro-levels, reduces the amount of "noise" (such as borrowings, from which no language is free) and
helps strengthen the case for many proposals of long-range relationship.
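
As an illustration of the second kind of evidence, the comparison of basic-lexicon wordlists can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the procedure used within EHL: the function name shared_cognate_ratio, the cognate-class labels and the two toy wordlists are all hypothetical, and the judgment of which words are cognate is assumed to have been made manually beforehand.

```python
from typing import Dict, Optional

# Hypothetical representation: meaning -> cognate-class label assigned by a
# linguist. None marks a slot excluded from comparison (missing word or a
# known borrowing), which is one simple way to reduce "noise".
Wordlist = Dict[str, Optional[str]]


def shared_cognate_ratio(a: Wordlist, b: Wordlist) -> float:
    """Fraction of comparable meanings whose words share a cognate class."""
    comparable = [
        m for m in a
        if m in b and a[m] is not None and b[m] is not None
    ]
    if not comparable:
        raise ValueError("no comparable meanings in the two wordlists")
    matches = sum(1 for m in comparable if a[m] == b[m])
    return matches / len(comparable)


# Toy data, invented for illustration only.
lang_x = {"I": "A", "two": "A", "water": "A", "stone": "B", "dog": None}
lang_y = {"I": "A", "two": "A", "water": "C", "stone": "B", "dog": "D"}

# "dog" is excluded; 3 of the 4 remaining meanings match.
print(shared_cognate_ratio(lang_x, lang_y))  # 0.75
```

Excluding a slot rather than counting it as a mismatch keeps borrowings and gaps from distorting the percentage in either direction.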
Based on these theoretical considerations, the particular work that goes on within EHL is being
carried out in three main directions:
I. Reconstruction of proto-languages and compilation of computerized etymological dictionaries
(databases) in accordance with the traditional comparative method. A large set of such databases has
already been open to public access for a long time and is gradually being enlarged as more data become
available and more analytical work is performed on various language families. The set
currently includes data on comparative Indo-European, Uralic, Altaic, Dravidian,
North Caucasian, Yeniseian, Sino-Tibetan, Austroasiatic, Chukchee-Kamchatkan,
Eskimo, Semitic, and several families collectively known as Khoisan languages. Many more databases,
in particular those on specific language families of Africa and America, are in the stage of preparation.
EHL also generally supports attempts to prepare etymological
databases for those deep-level macrofamilies whose daughter proto-languages have already been
reconstructed to general satisfaction. The current set already includes such databases
for three major macrofamilies of the Old World: Eurasiatic (Nostratic), Sino-Caucasian, and Afroasiatic.
Exploration of macrofamily connections for Africa, America, and the Pacific region is also well under way.
II. Lexicostatistical testing of both traditional and new theories of language relationship. To emphasize the
tremendous importance of lexicostatistics in determining the proper historical relations between languages, EHL has
launched, as one of its subprojects, the construction of the Global Lexicostatistical Database (GLD) that
will contain properly assembled and annotated Swadesh wordlists for the majority of the world's languages, as well
as for reconstructed proto-languages on all levels, based on rigorous methodological procedures.
III. Procedures for automatic data handling. An important issue in historical linguistics is the amount of subjectivity that the researcher brings to
hypotheses about unattested ancestral stages of languages. According to the collective opinion of
historical linguists working within EHL, none of the existing models and algorithms that have been proposed for
language classification purposes have managed to take into account all of the necessary factors responsible for
historical evolution, making "manual" handling of the data irreplaceable. Nevertheless, EHL still sees the
elaboration of such models as an integral part of the project. Improved, more elaborate algorithms of automatic
classification and even reconstruction are being worked on within the EHL team; EHL participants also exchange
data and experience with several other working groups conducting research in the same direction.
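
To make the idea of automatic classification concrete, the following is a minimal sketch of one generic textbook technique, average-linkage agglomerative clustering, which builds a tree "bottom-up" from pairwise distances between languages. It is not one of the algorithms developed within EHL; the function name average_linkage_tree, the language labels and the distance values are hypothetical, and the distances are assumed to come from some prior lexicostatistical comparison (for instance, one minus a shared-cognate ratio).

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def average_linkage_tree(names: List[str], dist: Dict[FrozenSet[str], float]):
    """Repeatedly merge the two closest clusters; return a nested-tuple tree."""
    # Each cluster is a pair: (subtree built so far, list of its leaf names).
    clusters = [(n, [n]) for n in names]

    def cluster_distance(c1, c2) -> float:
        # Average of all pairwise leaf distances between the two clusters.
        pairs = [(x, y) for x in c1[1] for y in c2[1]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest average distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = (
            (clusters[i][0], clusters[j][0]),
            clusters[i][1] + clusters[j][1],
        )
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Toy distances, invented for illustration only.
names = ["A", "B", "C", "D"]
dist = {frozenset(p): 0.8 for p in combinations(names, 2)}
dist[frozenset(("A", "B"))] = 0.2
dist[frozenset(("C", "D"))] = 0.3

tree = average_linkage_tree(names, dist)
print(tree)  # (('A', 'B'), ('C', 'D'))
```

A procedure this simple ignores borrowing, unequal rates of change and other factors responsible for historical evolution, which is exactly the kind of limitation described above.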
Besides its theoretical goals, one of the major purposes of EHL is to provide specialists and enthusiasts around the
world with as much information on the history of language(s) as possible. To that end, all of the databases, as
soon as they reach "usable" shape, are made public. EHL provides wordlists and etymologies for many languages and
language families that are poorly known and for which data are almost impossible to find in any open-access
system. EHL participants have also scanned, recognized, and converted to database format some of the major existing
etymological dictionaries, such as Pokorny's Indo-European etymological dictionary.
The Evolution of Human Language project was founded in 2001 thanks to the joint efforts of Murray Gell-Mann,
Sergei Starostin (1953-2005), and Merritt Ruhlen, a generous grant from the John D. & Catherine T. MacArthur
Foundation, and plenty of support from the Santa Fe Institute. Back then, the experience of the EHL team did not
extend significantly beyond professional work on several large families of the Old World and their prehistoric
connections. Today, the EHL team is integrating data from all of the world's major and minor language stocks in
order to push our knowledge of linguistic prehistory as far back as possible. Once the assembled data have been
properly organized and their analysis, combining sound traditional methodology with modern cladistic methods,
completed, EHL's classification aspires to become a solid reference model for linguists, historians, anthropologists,
geneticists and everyone even remotely interested in human prehistory.