Manual and semi-automatic normalization of historical spelling - case studies from Early New High German

Marcel Bollmann, Stefanie Dipper, Julia Krasselt, Florian Petran; Proceedings of KONVENS 2012 (LThist 2012 workshop), pp. 342-350, September 2012.


This paper presents work on manual and semi-automatic normalization of historical language data. We first address the guidelines that we use for mapping historical to modern word forms. The guidelines distinguish between normalization (preferring forms close to the original) and modernization (preferring forms close to modern language). Average inter-annotator agreement is 88.38% on a set of data from Early New High German. We then present Norma, a semi-automatic normalization tool. It integrates different modules (lexicon lookup, rewrite rules) for normalizing words in an interactive way. The tool dynamically updates the set of rule entries, given new input. Depending on the text and training settings, normalizing 1,000 tokens results in overall accuracies of 61.78–79.65% (baseline: 24.76–59.53%).
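The interactive module chain described in the abstract can be illustrated with a minimal sketch. This is not Norma's actual code or interface: the class `ToyNormalizer`, its methods, and the two rewrite rules are hypothetical and chosen only for illustration. As a simplification, the sketch stores user-confirmed pairs in the lexicon, whereas the abstract describes the tool updating its rule entries.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ToyNormalizer:
    """Hypothetical two-stage normalizer: lexicon lookup, then rewrite rules."""

    lexicon: Dict[str, str] = field(default_factory=dict)
    rules: List[Tuple[str, str]] = field(default_factory=list)

    def normalize(self, word: str) -> Tuple[str, str]:
        # Stage 1: exact lookup of a previously seen historical form.
        if word in self.lexicon:
            return self.lexicon[word], "lexicon"
        # Stage 2: apply every character rewrite rule in order.
        candidate = word
        for old, new in self.rules:
            candidate = candidate.replace(old, new)
        if candidate != word:
            return candidate, "rules"
        # No module produced a change; return the word as-is.
        return word, "unchanged"

    def confirm(self, word: str, normalized: str) -> None:
        # Store a user-confirmed pair so later lookups reuse it.
        self.lexicon[word] = normalized


# Illustrative rules only (assumed, not taken from the paper).
normalizer = ToyNormalizer(rules=[("vn", "un"), ("th", "t")])
print(normalizer.normalize("vnd"))  # ('und', 'rules')

# Simulate the interactive step: the user supplies a normalization.
normalizer.confirm("jn", "in")
print(normalizer.normalize("jn"))   # ('in', 'lexicon')
```

Each call reports which module produced the result, which in an interactive setting would let the user see where a suggestion came from before accepting or correcting it.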
