Is Open Access

A POS Tagger for Code Mixed Indian Social Media Text - ICON-2016 NLP Tools Contest Entry from Surukam

Preprint

Author(s): Sree Harsha Ramesh, Raveena R Kumar

Publication date Created: 2016-12-31

Abstract

Building Part-of-Speech (POS) taggers for code-mixed Indian languages is a particularly challenging problem in computational linguistics due to a dearth of accurately annotated training corpora. ICON, as part of its NLP tools contest, has organized this challenge as a shared task for the second consecutive year to improve the state of the art. This paper describes the POS tagger built at Surukam to predict the coarse-grained and fine-grained POS tags for three language pairs (Bengali-English, Telugu-English and Hindi-English), with the text spanning three popular social media platforms: Facebook, WhatsApp and Twitter. We employed Conditional Random Fields as the sequence tagging algorithm and trained our model with sklearn-crfsuite, a thin wrapper around CRFsuite. The features we used include character n-grams, language information, and patterns for emoji, numbers, punctuation and web addresses. Our submissions in the constrained environment, i.e., without making any use of monolingual POS taggers or the like, obtained an overall average F1-score of 76.45%, which is comparable to the 2015 winning score of 76.79%.
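
To make the feature description above more concrete, the following is a minimal sketch of how such per-token features might be assembled in the list-of-dicts format that sklearn-crfsuite accepts. It is not the authors' code. The regular expressions, the n-gram range (1 to 3), the feature names, the context window of one token, the language tags ("hi", "en", "univ") and the CRF hyperparameters are all illustrative assumptions; the abstract does not specify them.

```python
import re

# Illustrative token patterns (assumptions; the abstract does not give the exact rules).
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
EMOJI_RE = re.compile(r"^(?:[\U0001F300-\U0001FAFF\u2600-\u27BF]+|[:;=][-']?[)(DPp])$")
NUM_RE = re.compile(r"^\d+([.,]\d+)*$")
PUNCT_RE = re.compile(r"^[^\w\s]+$")


def token_pattern(token):
    """Map a token to one coarse pattern class; emoji is checked before punctuation."""
    if URL_RE.match(token):
        return "url"
    if EMOJI_RE.match(token):
        return "emoji"
    if NUM_RE.match(token):
        return "number"
    if PUNCT_RE.match(token):
        return "punct"
    return "other"


def char_ngrams(token, n_min=1, n_max=3):
    """Return all character n-grams of the lowercased token, shortest first."""
    text = token.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def token_features(sent, i):
    """Build a feature dict for position i of a sentence of (token, language_tag) pairs."""
    token, lang = sent[i]
    feats = {
        "bias": 1.0,
        "lower": token.lower(),
        "lang": lang,                      # language information supplied with the data
        "pattern": token_pattern(token),
    }
    for gram in char_ngrams(token):
        feats["ng=" + gram] = True
    if i == 0:
        feats["BOS"] = True                # beginning of sentence
    else:
        prev_token, prev_lang = sent[i - 1]
        feats["-1:lang"] = prev_lang
        feats["-1:pattern"] = token_pattern(prev_token)
    if i == len(sent) - 1:
        feats["EOS"] = True                # end of sentence
    else:
        next_token, next_lang = sent[i + 1]
        feats["+1:lang"] = next_lang
        feats["+1:pattern"] = token_pattern(next_token)
    return feats


def sent_features(sent):
    """Feature dicts for every token in one sentence."""
    return [token_features(sent, i) for i in range(len(sent))]


def train_crf(X, y):
    """Fit a CRF on feature sequences X and tag sequences y.

    Requires the third-party sklearn-crfsuite package; the hyperparameters
    are placeholders, not values reported by the authors.
    """
    import sklearn_crfsuite
    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                               max_iterations=100,
                               all_possible_transitions=True)
    crf.fit(X, y)
    return crf
```

With a list of tagged sentences, `X = [sent_features(s) for s in sentences]` and a parallel list of tag sequences `y` can be passed to `train_crf`; separate models could then be trained for the coarse-grained and fine-grained tag sets.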


Author and article information

Article

arXiv ID: 1701.00066

SO-VID: fdd58879-2723-436f-9f47-37004a409cd8

License:

http://arxiv.org/licenses/nonexclusive-distrib/1.0/

Custom metadata

Comments: 4 pages, 13th International Conference on Natural Language Processing, Varanasi, India

Categories: cs.CL

ScienceOpen disciplines: Theoretical computer science
