Comparing variety corpora with Vis-À-Vis - A prototype system presentation

Stefanie Anstein; Proceedings of KONVENS 2012 (Main track: poster presentations), pp. 243-247, September 2012.


This paper presents Vis-À-Vis, a prototype system that supports linguists in comparing regional language varieties. Written corpora serve as the empirical basis from which differences are extracted semi-automatically. The analysis applies existing, adapted, and new tools that use both pattern-based and statistical approaches. Processing of the corpus input consists of annotating the data, extracting phenomena from different levels of linguistic description, and comparing them quantitatively to identify phenomena that differ significantly between the two input corpora. Vis-À-Vis produces sorted candidate lists of variety-specific peculiarities, filtering by statistical association measures and using corpus-external knowledge to reduce the output to presumably significant phenomena. Traditional regional variety linguists can use these results, extracted from large amounts of authentic data, as a compact empirical basis for their detailed qualitative analyses. Through a user-friendly application of a comprehensive computational system, they are supported in efficiently extracting differences between varieties, e.g. for documentation, lexicography, or didactics of pluricentric languages.
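The quantitative comparison step can be illustrated with a minimal sketch. The abstract does not say which association measures Vis-À-Vis uses, so the log-likelihood ratio (G2) below is an assumption chosen because it is a common measure for corpus comparison. The function names `log_likelihood` and `candidate_list`, the threshold of 3.84 (the chi-square critical value for p < 0.05 with one degree of freedom), and the token-list input are likewise hypothetical and not taken from the system.

```python
import math
from collections import Counter


def log_likelihood(a, b, total_a, total_b):
    """G2 score for one item seen a times in corpus A and b times in corpus B.

    total_a and total_b are the corpus sizes in tokens. This is the
    two-term log-likelihood commonly used for corpus comparison; it is an
    illustrative choice, not necessarily the measure used by Vis-A-Vis.
    """
    combined = total_a + total_b
    expected_a = total_a * (a + b) / combined
    expected_b = total_b * (a + b) / combined
    g2 = 0.0
    # A zero count contributes nothing (the limit of x * log(x) is 0).
    if a > 0:
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    return 2.0 * g2


def candidate_list(tokens_a, tokens_b, threshold=3.84):
    """Return (item, freq_a, freq_b, score) rows sorted by descending score.

    Items scoring below the threshold are filtered out, which mirrors the
    idea of reducing the output to presumably significant phenomena.
    """
    freq_a, freq_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    rows = []
    for item in set(freq_a) | set(freq_b):
        score = log_likelihood(freq_a[item], freq_b[item], total_a, total_b)
        if score >= threshold:
            rows.append((item, freq_a[item], freq_b[item], score))
    return sorted(rows, key=lambda row: row[3], reverse=True)


if __name__ == "__main__":
    # Invented toy data: two tiny "corpora" that differ in one lexical choice.
    corpus_a = ["paradeiser"] * 20 + ["der"] * 80
    corpus_b = ["tomate"] * 20 + ["der"] * 80
    for row in candidate_list(corpus_a, corpus_b):
        print(row)
```

In this sketch, items with the same relative frequency in both corpora score zero and are dropped, so only the items that distinguish the two toy corpora remain as candidates. A real pipeline would run on annotated units (lemmas, part-of-speech patterns, collocations) rather than raw tokens and would add the corpus-external filtering the abstract mentions.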

