Sarcasm detection has emerged as an active research area because of its applicability in natural language processing
(NLP), but it remains underexplored in low-resource languages such as Urdu, Arabic,
Pashto, and Roman-Urdu. Most existing work targets English, and comparatively few studies address sarcasm in low-resource
languages. This research addresses the gap by exploring
the efficacy of diverse machine learning (ML) algorithms in identifying sarcasm in
Urdu. The scarcity of annotated datasets for low-resource languages poses a challenge.
To overcome it, we curated and released a comparatively large dataset, the
Urdu Sarcastic Tweets (UST) dataset, comprising user-generated comments from X
(formerly Twitter). Automatic sarcasm detection in text involves using computational
methods to determine whether a given statement is intended to be sarcastic. However, this
task is challenging because sarcasm depends on the user’s behavior, attitude, and
expression of emotions. To address this challenge, we evaluate the effectiveness of various baseline ML classifiers
in detecting sarcasm in low-resource languages. The
primary models evaluated in this study are support vector machine (SVM), decision
tree (DT), k-nearest neighbors (k-NN), linear regression (LR), random forest
(RF), Naïve Bayes (NB), and XGBoost. We validated the
performance of these ML classifiers on two distinct datasets: the Tanz-Indicator and
the UST dataset. The SVM classifier consistently outperformed the other ML models, with
an accuracy of 0.85 across various experimental setups. This research underscores
the importance of tailored sarcasm detection approaches to accommodate specific linguistic
characteristics in low-resource languages, paving the way for future investigations.
By providing open access to the UST dataset, we encourage its use as a benchmark for
sarcasm detection research in similar linguistic contexts.
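
As a rough illustration of the kind of baseline evaluation summarized above, the following minimal sketch trains one of the listed model families, Naïve Bayes, on labeled text and reports accuracy. It is not the pipeline used in this study: the whitespace tokenizer, the add-one smoothing, the `NaiveBayes` class, the `accuracy` helper, and the four English placeholder sentences are all assumptions made for illustration, and none of the data comes from the UST or Tanz-Indicator datasets.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase + whitespace split; the abstract does not
    # describe the preprocessing actually applied to the tweets.
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)        # documents per class
        self.word_counts = defaultdict(Counter)    # token counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokenize(text):
                if tok in self.vocab:              # ignore unseen tokens
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def accuracy(model, texts, labels):
    # Fraction of examples whose predicted label matches the gold label.
    correct = sum(model.predict(t) == y for t, y in zip(texts, labels))
    return correct / len(labels)


# Hypothetical placeholder data (1 = sarcastic, 0 = not sarcastic).
train_texts = [
    "oh great another power cut just what i needed",
    "wow love waiting two hours for the bus",
    "the match starts at eight tonight",
    "i bought fresh mangoes from the market",
]
train_labels = [1, 1, 0, 0]
test_texts = [
    "oh great another two hours waiting",
    "fresh mangoes at the market tonight",
]
test_labels = [1, 0]

model = NaiveBayes().fit(train_texts, train_labels)
print("accuracy:", accuracy(model, test_texts, test_labels))
```

The same `fit`/`predict`/`accuracy` loop could be repeated for each classifier under comparison, with every model scored on an identical held-out split so that the reported accuracies are comparable.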