A supervised POS tagger for written Arabic social networking corpora

Rania Al-Sabbagh, Roxana Girju; Proceedings of KONVENS 2012 (Main track: oral presentations), pp. 39-52, September 2012.


This paper presents an implementation of Brill's Transformation-Based Part-of-Speech (POS) tagging algorithm trained on a manually annotated, Twitter-based Egyptian Arabic corpus of 423,691 tokens and 70,163 types. Standard morpho-syntactic POS annotation schemes label each word according to its word-level morpho-syntactic features. We instead use a function-based annotation scheme in which words are labeled according to their grammatical functions rather than their morpho-syntactic structures, since the two do not necessarily map onto each other. While a standard morpho-syntactic scheme makes comparisons with other work easier, the function-based scheme is assumed to be more efficient for building higher-level tools such as base-phrase chunkers and dependency parsers, and for NLP applications such as subjectivity and sentiment analysis. The function-based scheme also gives new insights into linguistic structural realizations specific to Egyptian Arabic, which is currently an under-resourced language.
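For readers unfamiliar with transformation-based tagging, the general idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the English toy sentences, the three-tag tagset, and the single rule template ("change tag A to B when the previous tag is C") are all assumptions made for illustration. The sketch assigns each word its most frequent training tag, then greedily learns correction rules that reduce the number of tagging errors.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sents):
    """Map each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def apply_rule(tags, rule):
    """Apply one rule (from_tag, to_tag, prev_tag), scanning left to right."""
    from_tag, to_tag, prev_tag = rule
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == from_tag and out[i - 1] == prev_tag:
            out[i] = to_tag
    return out

def errors(pred, gold):
    """Count positions where the predicted tag differs from the gold tag."""
    return sum(p != g for p, g in zip(pred, gold))

def learn_rules(tagged_sents, baseline, default="NOUN", max_rules=5):
    """Greedily pick the rule with the largest net error reduction."""
    words = [[w for w, _ in s] for s in tagged_sents]
    gold = [[t for _, t in s] for s in tagged_sents]
    current = [[baseline.get(w, default) for w in s] for s in words]
    rules = []
    for _ in range(max_rules):
        # Propose one candidate rule per remaining error (hypothetical template).
        candidates = set()
        for pred, ref in zip(current, gold):
            for i in range(1, len(pred)):
                if pred[i] != ref[i]:
                    candidates.add((pred[i], ref[i], pred[i - 1]))
        best, best_gain = None, 0
        for rule in sorted(candidates):
            gain = sum(errors(p, g) - errors(apply_rule(p, rule), g)
                       for p, g in zip(current, gold))
            if gain > best_gain:
                best, best_gain = rule, gain
        if best is None:  # no rule improves accuracy any further
            break
        rules.append(best)
        current = [apply_rule(p, best) for p in current]
    return rules

def tag(words, baseline, rules, default="NOUN"):
    """Tag a sentence: baseline lookup, then the learned rules in order."""
    tags = [baseline.get(w, default) for w in words]
    for rule in rules:
        tags = apply_rule(tags, rule)
    return tags

# Toy English data (an assumption; the paper's corpus is Egyptian Arabic tweets).
train = [
    [("the", "DET"), ("can", "NOUN")],
    [("I", "PRON"), ("can", "VERB"), ("run", "VERB")],
    [("you", "PRON"), ("can", "VERB"), ("go", "VERB")],
]
baseline = train_baseline(train)
rules = learn_rules(train, baseline)
```

Here the baseline tags "can" as a verb everywhere, and the learner recovers a rule that retags it as a noun after a determiner. A full Brill tagger uses many more rule templates (neighboring words, tags two positions away, affixes) and a lexical stage for unknown words.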
