Using information retrieval technology for a corpus analysis platform

Carsten Schnober; Proceedings of KONVENS 2012 (Main track: poster presentations), pp. 199-207, September 2012.


This paper describes a practical approach to using the information retrieval engine Lucene in the corpus analysis platform KorAP, currently being developed at the Institut für Deutsche Sprache (IDS Mannheim). It presents a method for applying Lucene’s indexing technique to linguistically annotated data while retaining full flexibility to handle multiple annotation layers. To keep KorAP scalable, the approach uses multiple indexes and MapReduce techniques.
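
The general idea of indexing several annotation layers and querying several indexes can be illustrated with a toy sketch. It is written in plain Python and uses neither Lucene nor KorAP code; the function names (build_layer_index, search, search_shards), the layer names and the sample sentences are assumptions made up for this illustration and do not come from the paper. Each index maps a (layer, value) pair to token positions, and a query is mapped over all indexes before the partial results are merged.

```python
from collections import defaultdict


def build_layer_index(tokens_by_layer):
    """Build a positional index for one text.

    tokens_by_layer maps a layer name (e.g. "word", "pos", "lemma")
    to a list of values, one per token. The result maps
    (layer, value) -> list of token positions.
    """
    index = defaultdict(list)
    for layer, values in tokens_by_layer.items():
        for position, value in enumerate(values):
            index[(layer, value)].append(position)
    return index


def search(index, constraints):
    """Return the token positions that satisfy every (layer, value) constraint."""
    if not constraints:
        return []
    position_sets = [set(index.get(c, [])) for c in constraints]
    return sorted(set.intersection(*position_sets))


def search_shards(shards, constraints):
    """Map step: query each index on its own. Reduce step: merge non-empty hits."""
    partial = ((name, search(index, constraints)) for name, index in shards.items())
    return {name: hits for name, hits in partial if hits}


# Hypothetical sample data: two tiny texts, each with three annotation layers.
doc1 = {
    "word": ["Der", "Hund", "bellt"],
    "pos": ["ART", "NN", "VVFIN"],
    "lemma": ["der", "Hund", "bellen"],
}
doc2 = {
    "word": ["Hunde", "bellen"],
    "pos": ["NN", "VVFIN"],
    "lemma": ["Hund", "bellen"],
}
shards = {"doc1": build_layer_index(doc1), "doc2": build_layer_index(doc2)}

# Tokens whose lemma is "Hund" and whose part-of-speech tag is "NN".
hits = search_shards(shards, [("lemma", "Hund"), ("pos", "NN")])
print(hits)
```

Because every index is queried independently, the map step could in principle run in parallel across machines; this sketch runs it sequentially for simplicity.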

