      Is Open Access

      Detecting Potential Topics In News Using BERT, CRF and Wikipedia

      Preprint


          Abstract

          For a news content distribution platform like Dailyhunt, Named Entity Recognition (NER) is a pivotal task for building better user recommendation and notification algorithms. Apart from identifying names, locations, and organisations in news for 13+ Indian languages and using them in these algorithms, we also need to identify n-grams that do not strictly fit the definition of a named entity yet are still important, for example "me too movement", "beef ban", and "alwar mob lynching". In this exercise, given an English-language text, we try to detect case-less n-grams that convey important information and can be used as topics and/or hashtags for a news article. The model is built using Wikipedia titles data, a private English news corpus, and the BERT-Multilingual pre-trained model with a Bi-GRU and CRF architecture. It shows promising results when compared with the industry-best Flair, Spacy and Stanford-caseless-NER in terms of F1 and especially Recall.
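
          The abstract does not say how the Wikipedia titles are turned into training labels or how F1 and Recall are computed. As a minimal sketch only, one plausible reading is that lowercased titles act as a gazetteer for BIO-tagging case-less news text, and that scores are computed over exact span matches. The tiny title set, the function names (`bio_tag`, `spans`, `precision_recall_f1`) and the `max_len` parameter below are hypothetical illustrations, not the authors' pipeline.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive


def bio_tag(tokens: List[str], titles: Set[Tuple[str, ...]], max_len: int = 4) -> List[str]:
    """Greedy longest-match BIO tagging of lowercased tokens against a title set."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate n-gram first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in titles:
                matched = n
                break
        if matched:
            tags[i] = "B"
            for j in range(i + 1, i + matched):
                tags[j] = "I"
            i += matched
        else:
            i += 1
    return tags


def spans(tags: List[str]) -> Set[Span]:
    """Collect (start, end) spans from a BIO tag sequence."""
    found: Set[Span] = set()
    start = None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if tag != "I" and start is not None:
            found.add((start, i))
            start = None
        if tag == "B":
            start = i
    return found


def precision_recall_f1(pred: Set[Span], gold: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match span precision, recall and F1."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


# Hypothetical stand-in for lowercased Wikipedia titles (assumption, not real data).
TITLES = {("me", "too", "movement"), ("me", "too"), ("beef", "ban")}

tokens = "the me too movement and the beef ban dominated headlines".split()
tags = bio_tag(tokens, TITLES)
predicted = spans(tags)
```

          Longest-match-first keeps "me too movement" as one span instead of stopping at the shorter title "me too". Scoring on exact spans means a partially detected n-gram counts as a miss, which is the stricter of the common conventions.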


          Author and article information

          Date: 26 February 2020
          arXiv: 2002.11402
          b65baad6-9467-4ab6-96a7-ee18f87a6ad4

          http://arxiv.org/licenses/nonexclusive-distrib/1.0/

          Custom metadata
          6 pages, 6 tables, 1 figure, 2 examples. This is a report based on applied research work conducted at Dailyhunt
          cs.CL

          Theoretical computer science
