The Moscow Lexical Database (MosLex) is a collection of basic vocabularies of languages from around the world. It comprises annotated basic wordlists of living, recently extinct, and ancient languages, as well as of reconstructed proto-languages. MosLex follows the philosophy of George Starostin’s Global Lexicostatistical Database project and can be viewed as its offspring.

A wordlist is included in MosLex if it meets the following key conditions: it is of high lexicographic quality, it is detailed in its linguistic and philological elaboration, and it complies with our semantic standards and overall methodology. The semantic specifications for synchronic wordlists – those compiled for a recorded language at a given stage of its development (e.g. Old English, Middle English, modern English) – follow those provided in Kassian et al. 2010. ‘The Swadesh wordlist. An attempt at semantic specification’ (Journal of Language Relationship 4), with minor adjustments developed by our team over the last decade. Wordlists for proto-languages are (re)constructed from the data of synchronic wordlists using the method proposed in Kassian, Zhivlov & Starostin. 2015. ‘Proto-Indo-European-Uralic comparison from the probabilistic point of view’ (Journal of Indo-European Studies 43/3-4).

A typical list in the MosLex database is a 110-item Swadesh wordlist: the 100 classical Swadesh concepts plus 10 additional concepts from the second part of the 200-item Swadesh wordlist, as proposed by the late Sergei Starostin (see Kassian et al. 2010 for details). Lists with a different concept make-up are also allowed.

We use three transcription systems to represent the words in our database:

  1. The first and basic one is the transcription adopted by the Global Lexicostatistical Database. Its features were designed for easy use by historical linguists: e.g., affricates are treated as a single unit, not as a “stop + fricative” cluster.
  2. The second is the International Phonetic Alphabet (IPA): every entry is provided with an IPA transcription.
  3. Optionally, a form may also be given in the orthography (or orthographies) of its respective vernacular. Orthographic forms are enclosed in {braces}, because the majority of modern fonts do not support angle brackets (the glyphs “⟨”, U+27E8, and “⟩”, U+27E9).

For example, the Udi word for ‘to stand’ is represented in the database as a triple (database transcription / orthography / IPA): čur-p-sun / {чурпсун} / t̠͡ʃur-p-sun.
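
As an illustration only, the slash-separated layout of the Udi example can be split into its components with a few lines of Python. This is a minimal sketch that assumes the display layout shown above (parts separated by “/”, orthographic forms in braces, the database transcription before the IPA one). The names `Entry` and `parse_entry` are hypothetical and are not part of MosLex.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    """One lexical form in up to three representations (hypothetical structure)."""
    gld: str                # transcription in the Global Lexicostatistical Database system
    orthography: List[str]  # zero or more vernacular spellings, braces removed
    ipa: str                # IPA transcription


def parse_entry(text: str) -> Entry:
    """Split a string such as 'form / {spelling} / ipa' into its parts.

    Assumption: parts are separated by '/', orthographic parts are wrapped
    in braces, and the two remaining parts are the database transcription
    followed by the IPA transcription.
    """
    parts = [p.strip() for p in text.split("/")]
    is_orth = [p.startswith("{") and p.endswith("}") for p in parts]
    orthography = [p[1:-1] for p, flag in zip(parts, is_orth) if flag]
    plain = [p for p, flag in zip(parts, is_orth) if not flag]
    if len(plain) != 2:
        raise ValueError(f"expected two transcriptions, got {len(plain)}: {text!r}")
    gld, ipa = plain
    return Entry(gld, orthography, ipa)


entry = parse_entry("čur-p-sun / {чурпсун} / t̠͡ʃur-p-sun")
print(entry.gld, entry.orthography, entry.ipa)
```

Keeping the orthography as a list reflects the fact that it is optional and that a form may have more than one spelling.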

We also adopt the following notation principles of the Global Lexicostatistical Database: 1) the hyphen “-” separates a meaningful morpheme (usually the root) from the morphological elements that follow it; and 2) the equals sign “=” separates a meaningful morpheme from the morphological elements that precede it. For example, the Rutul form for ‘to come’ is quoted as y=iqʼ-ɨ-r, which means that iqʼ is the root (the morpheme used in further comparisons), y- is a prefix, and -ɨ-r is a sequence of suffixes.
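
The two boundary markers can be read mechanically, as the following minimal Python sketch suggests. It assumes the simple case described above: a single root, with every “=” to its left and every “-” to its right. Forms with a more complex structure (for instance compounds) would need additional rules. The names `Segmentation` and `segment_form` are hypothetical.

```python
from typing import List, NamedTuple


class Segmentation(NamedTuple):
    """A form split into prefixes, root, and suffixes (hypothetical structure)."""
    prefixes: List[str]
    root: str
    suffixes: List[str]


def segment_form(form: str) -> Segmentation:
    """Split a form on '=' (prefix boundary) and '-' (suffix boundary).

    Assumption: the form contains exactly one root, preceded only by
    '='-separated prefixes and followed only by '-'-separated suffixes.
    """
    # Everything before the last '=' is prefixal material.
    *prefixes, stem = form.split("=")
    # The first piece of the remaining stem is the root; the rest are suffixes.
    root, *suffixes = stem.split("-")
    return Segmentation(prefixes, root, suffixes)


print(segment_form("y=iqʼ-ɨ-r"))
print(segment_form("čur-p-sun"))
```

For the Rutul form this yields the prefix y, the root iqʼ, and the suffixes ɨ and r, matching the analysis given in the text. The Udi form has no prefix, so its prefix list is empty.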