There are currently about 6,000 languages on our planet, some spoken by millions, some by only a few dozen people. The primary goal of the international program known as EHL (Evolution of Human Language) is to work out a detailed historical classification of these languages, organizing them into a genealogical tree similar to the accepted classification of biological species. Since all representatives of the species Homo sapiens presumably share a common origin, it would be natural to suppose - although extremely hard to prove - that all or most known human languages also go back to some common source. The only way to proceed here is "bottom-up": classifying attested languages and dialects into groups, groups into families, families into "macro-" or "superfamilies" and so on, as far as one can penetrate using comparative-historical and cladistic methodology. Most existing classifications, however, do not look beyond some 300-400 language families that are relatively easy to discern. This restriction has natural reasons: languages must have been spoken and constantly evolving for at least 40,000 years (and quite probably more), while any two languages separated from a common source usually lose almost all superficially common features after some 6,000-7,000 years.

Nevertheless, despite widespread skepticism and reluctance to tackle the problem, a number of scholars believe that these obstacles are not insurmountable. Research over the past several decades appears to indicate that larger genetic groupings are not only possible, but indeed quite plausible. It can be shown that most of the world's language families can be classified into roughly a dozen large groupings, or macrofamilies. Two sorts of evidence can be used for this purpose:

1) The science of historical linguistics has developed a very powerful tool - the comparative method - that allows the reconstruction of unattested language stages, so-called proto-languages, based on systematic comparison of their present-day descendants. With the gradual accumulation of these data over the past 200 years, it has become evident that, while modern languages may vary significantly, proto-languages in many cases tend to be much more similar to one another. Thus, modern English, Finnish, and Turkish may have very little in common (and what little there is is practically indistinguishable from chance), but their respective ancestors - Proto-Indo-European, Proto-Uralic and Proto-Altaic - appear to have many more common traits and common vocabulary. This means that it is possible, in theory and in practice, to extend the time perspective and reconstruct even earlier stages of human language. In fact, much of this research has already been conducted.

2) Where a detailed reconstruction of the proto-language is impossible to achieve (e.g. because of insufficient data) or requires more time and effort than can be spared, it is still possible to build somewhat weaker models of language evolution based on a combination of manual and automatic analysis of limited corpora of data. Of all types of linguistic data that can be used for historical purposes, it is the so-called "basic lexicon" that generally persists the longest over time. Focusing our attention on the comparison of small groups of words, such as the Swadesh wordlist, and tracing their evolution on micro- and macro-levels, reduces the amount of "noise" (such as borrowings, from which no language is free) and helps strengthen the case for many proposals of long-range relationship.

Based on these theoretical considerations, the particular work that goes on within EHL is being carried out in three main directions:

I. Reconstruction of proto-languages and compilation of computerized etymological dictionaries (databases) in accordance with the traditional comparative method. A large set of such databases has been open to public access for a long time and is gradually being enlarged as more data become available and more analytical work is performed on various language families. The set currently includes data on comparative Indo-European, Uralic, Altaic, Dravidian, North Caucasian, Yeniseian, Sino-Tibetan, Austroasiatic, Chukchee-Kamchatkan, Eskimo, Semitic, and several families collectively known as Khoisan languages. Many more databases, in particular those on specific language families of Africa and America, are in preparation.
EHL also generally supports attempts to prepare etymological databases for those deep-level macrofamilies whose daughter proto-languages have already been reconstructed to general satisfaction. The current set already includes such databases for three major macrofamilies of the Old World: Eurasiatic (Nostratic), Sino-Caucasian, and Afroasiatic. Exploration of macrofamily connections for Africa, America, and the Pacific region is also well under way.

II. Lexicostatistical testing of both traditional and new theories of language relationship. To emphasize the tremendous importance of lexicostatistics in determining the proper historical relations between languages, EHL has launched, as one of its subprojects, the construction of the Global Lexicostatistical Database (GLD), which will contain properly assembled and annotated Swadesh wordlists for the majority of the world's languages, as well as for reconstructed proto-languages on all levels, based on rigorous methodological procedures.
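
To illustrate the kind of comparison that lexicostatistics relies on, here is a minimal Python sketch. It assumes that each wordlist has already been annotated by hand with cognate-class labels, and it simply counts, for every pair of languages, the share of concepts attested in both lists that fall into the same class. The language names, concepts, labels, and function names are all hypothetical placeholders; they do not come from the GLD, and the sketch is not a description of the procedures EHL actually uses.

```python
from itertools import combinations

# Hypothetical toy data: each language maps a Swadesh concept to a cognate-class
# label assigned by a linguist. Equal labels for the same concept mean the words
# are judged cognate. None of these values are real GLD entries.
WORDLISTS = {
    "lang_a": {"water": 1, "fire": 1, "stone": 1, "hand": 1, "eye": 1},
    "lang_b": {"water": 1, "fire": 2, "stone": 1, "hand": 1, "eye": 1},
    "lang_c": {"water": 2, "fire": 3, "stone": 2, "hand": 1},
}


def shared_cognate_ratio(list_a, list_b):
    """Share of concepts attested in both lists whose cognate classes match.

    Returns None when the two lists have no concept in common.
    """
    common = set(list_a) & set(list_b)
    if not common:
        return None
    matches = sum(1 for concept in common if list_a[concept] == list_b[concept])
    return matches / len(common)


def pairwise_matrix(wordlists):
    """Compute the shared-cognate ratio for every unordered pair of languages."""
    return {
        (a, b): shared_cognate_ratio(wordlists[a], wordlists[b])
        for a, b in combinations(sorted(wordlists), 2)
    }


if __name__ == "__main__":
    for pair, ratio in pairwise_matrix(WORDLISTS).items():
        print(pair, ratio)
```

In this toy data, lang_a and lang_b share a much higher proportion of cognates with each other than either does with lang_c, which is the kind of signal a classification would then build on. Concepts missing from one list are skipped rather than counted as mismatches; that is one possible design choice among several.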

III. Procedures for automatic data handling. An important issue in historical linguistics is the amount of subjectivity on the part of the researcher when hypotheses on unattested ancestral stages of languages are concerned. According to the collective opinion of historical linguists working within EHL, none of the existing models and algorithms that have been proposed for language classification purposes has managed to take into account all of the necessary factors responsible for historical evolution, making "manual" handling of the data irreplaceable. Nevertheless, EHL still sees the elaboration of such models as an integral part of the project. Improved, more elaborate algorithms of automatic classification and even reconstruction are being worked on within the EHL team; EHL participants also exchange data and experience with several other working groups conducting research in the same direction.

Besides its theoretical goals, one of the major purposes of EHL is to provide specialists and enthusiasts around the world with as much information on the history of language(s) as possible. To that end, all of the databases, as soon as they reach "usable" shape, are made public. EHL provides wordlists and etymologies for many languages and language families that are poorly known and data on which are almost impossible to find in any kind of open access system. EHL participants have also scanned, recognized, and converted to database format some of the major existing etymological dictionaries, such as Pokorny's Indo-European etymological dictionary.

The Evolution of Human Language project was originally founded in 2001, due to the joint efforts of Murray Gell-Mann, Sergei Starostin (1953-2005), and Merritt Ruhlen, a generous grant from the John D. & Catherine T. MacArthur Foundation, and plenty of support from the Santa Fe Institute. Back then, the experience of the EHL team did not extend significantly beyond professional work on several large families of the Old World and their prehistoric connections. Today, the EHL team is integrating data from all of the world's major and minor language stocks in order to push our knowledge of linguistic prehistory as far back as possible. Once the assembled data have been properly organized and their analysis, combining sound traditional methodology with modern cladistic methods, completed, EHL's classification aspires to become a solid reference model for linguists, historians, anthropologists, geneticists and everyone even remotely interested in human prehistory.