Johann-Mattis List (MPI for the Science of Human History, Germany)

Although linguistics has always been a highly data-driven discipline, scholars have largely neglected the importance of standards that hold across subfields and language families. This becomes more evident as the amount of digital data on the world's languages grows. At present, the majority of data produced by linguists is accessible only to experts who understand the idiosyncrasies of annotation practice in specific subfields or language families. This hampers both qualitative and quantitative investigations of linguistic diversity. The Cross-Linguistic Data Formats initiative of the Max Planck Institute for the Science of Human History tries to tackle these problems by establishing standards for the representation and analysis of linguistic data. In contrast to data collection efforts in the field of natural language processing, which are heavily biased towards the major languages (English, Chinese, Arabic, etc.), our goal is to provide strictly cross-linguistic standards that are amenable to all forms of linguistic diversity and do not exclude any of the 7000 languages spoken today and in the past. In the talk, we will present the major strategies we use to develop standards and best practices, as well as the major challenges and pitfalls we have to cope with.