This paper explains the challenges of Urdu stemming and presents a rule-based prototype, with a few rules implemented for Urdu, to illustrate the intricacies involved. It shows that Urdu stemming is difficult because of Urdu's diverse nature and because Arabic and Farsi stemmers cannot be used for Urdu. The dictionary-based error-correcting schemes used by other stemmers cannot be applied to Urdu because machine-readable resources are lacking. No work on Urdu stemming or morphological analysis has been published in the IR community, even though interest in Urdu is growing. The goal of this paper is to show the challenges in writing an Urdu stemmer, not to present a stemmer.
Author and article information
Contributors
Kashif Riaz
Conference
Publication date: August 2007
Publication date (Print): August 2007
Pages: 1-6
Affiliations
[0001] University of Minnesota
4-192 EE/CS Building
200 Union Street SE
Minneapolis, MN 55455