Digitised historical text: Does it have to be mediOCRe?

Bea Alex, Claire Grover, Ewan Klein, Richard Tobin; Proceedings of KONVENS 2012 (LThist 2012 workshop), pp. 401-409, September 2012.


This paper reports on experiments to improve the Optical Character Recognition (OCR) quality of historical text as a preliminary step in text mining. We analyse the quality of OCRed text compared to a gold standard and show how it can be improved by performing two automatic correction steps. We also demonstrate the impact this can have on named entity recognition in a preliminary extrinsic evaluation. This work was performed as part of the Trading Consequences project, which is focussed on text mining of historical documents for the study of nineteenth-century trade in the British Empire.

