      A POS Tagger for Code Mixed Indian Social Media Text - ICON-2016 NLP Tools Contest Entry from Surukam

      Preprint

          Abstract

          Building Part-of-Speech (POS) taggers for code-mixed Indian languages is a particularly challenging problem in computational linguistics due to a dearth of accurately annotated training corpora. ICON, as part of its NLP tools contest, has organized this challenge as a shared task for the second consecutive year to improve the state of the art. This paper describes the POS tagger built at Surukam to predict the coarse-grained and fine-grained POS tags for three language pairs (Bengali-English, Telugu-English and Hindi-English), with the text spanning three popular social media platforms: Facebook, WhatsApp and Twitter. We employed Conditional Random Fields as the sequence tagging algorithm and trained our model with sklearn-crfsuite, a thin wrapper around CRFsuite. The features we used include character n-grams, language information, and patterns for emoji, numbers, punctuation and web addresses. Our submissions in the constrained environment, i.e., without making any use of monolingual POS taggers or the like, obtained an overall average F1-score of 76.45%, which is comparable to the 2015 winning score of 76.79%.
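          The feature families named in the abstract can be illustrated with a minimal sketch of per-token feature extraction. This is not the authors' code: the regular expressions, the feature names, the n-gram length limit, the context features and the language labels ("hi", "en", "univ") are all assumptions chosen for illustration. The function returns one dictionary per token, which is the kind of input a CRF wrapper such as sklearn-crfsuite typically accepts.

```python
import re
import string

# Illustrative patterns only; the paper does not specify its exact expressions.
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
NUMBER_RE = re.compile(r"^[+-]?\d+([.,]\d+)*$")
# Rough emoji ranges (an assumption, not an exhaustive emoji definition).
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def char_ngrams(token, n_max=3):
    """Return prefix and suffix character n-grams up to length n_max."""
    feats = {}
    lowered = token.lower()
    for n in range(1, min(n_max, len(lowered)) + 1):
        feats[f"prefix{n}"] = lowered[:n]
        feats[f"suffix{n}"] = lowered[-n:]
    return feats


def token_features(sentence, i):
    """Build a feature dict for token i of a [(token, language), ...] sentence."""
    token, lang = sentence[i]
    feats = {
        "bias": 1.0,
        "lower": token.lower(),
        "lang": lang,  # language information for the current token
        "is_url": bool(URL_RE.match(token)),
        "is_number": bool(NUMBER_RE.match(token)),
        "is_emoji": bool(EMOJI_RE.search(token)),
        "is_punct": bool(token) and all(ch in string.punctuation for ch in token),
    }
    feats.update(char_ngrams(token))
    # Neighbouring language tags as context (hypothetical addition).
    if i == 0:
        feats["BOS"] = True
    else:
        feats["prev_lang"] = sentence[i - 1][1]
    if i == len(sentence) - 1:
        feats["EOS"] = True
    else:
        feats["next_lang"] = sentence[i + 1][1]
    return feats


def sentence_features(sentence):
    """Convert a whole sentence into a list of feature dicts."""
    return [token_features(sentence, i) for i in range(len(sentence))]


# Made-up Hindi-English example; tokens and language labels are hypothetical.
sample = [("yaar", "hi"), ("this", "en"), ("movie", "en"), ("is", "en"),
          ("awesome", "en"), ("😀", "univ"), ("http://t.co/abc", "univ")]
X = sentence_features(sample)
```

          A list of such per-sentence feature lists, paired with the corresponding POS tag sequences, could then be passed to the `fit` method of `sklearn_crfsuite.CRF`; the training call is left out here so that the sketch runs without third-party packages.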


                Author and article information

                Date: 2016-12-31
                Type: Article
                arXiv ID: 1701.00066
                License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
                Comments: 4 pages, 13th International Conference on Natural Language Processing, Varanasi, India
                arXiv category: cs.CL
                Discipline: Theoretical computer science
