The DTA 'base format': A TEI-subset for the compilation of interoperable corpora

Alexander Geyken, Susanne Haaf, Frank Wiegand; Proceedings of KONVENS 2012 (LThist 2012 workshop), pp. 383-391, September 2012.


This article describes the DTA ‘base format’, a strict subset of TEI P5 that combines rich encoding of noncontroversial structural aspects of texts with only minimal semantic interpretation. The proposed format is discussed in relation to other commonly used XML/TEI schemas. Furthermore, the article presents good-practice examples showing how external corpora can be converted into the DTA ‘base format’, either directly or after cautiously extending the format. The proposed encoding schema thus contributes to the paradigm shift recently observed in corpus compilation, namely from private encoding to interoperable encoding.

