Detecting Potential Topics In News Using BERT, CRF and Wikipedia

There is no author summary for this article yet. Authors can add summaries to their articles on ScienceOpen to make them more accessible to a non-specialist audience.

Abstract

For a news content distribution platform like Dailyhunt, Named Entity Recognition is a pivotal task for building better user recommendation and notification algorithms. Apart from identifying names, locations, organisations from the news for 13+ Indian languages and use them in algorithms, we also need to identify n-grams which do not necessarily fit in the definition of Named-Entity, yet they are important. For example, "me too movement", "beef ban", "alwar mob lynching". In this exercise, given an English language text, we are trying to detect case-less n-grams which convey important information and can be used as topics and/or hashtags for a news. Model is built using Wikipedia titles data, private English news corpus and BERT-Multilingual pre-trained model, Bi-GRU and CRF architecture. It shows promising results when compared with industry best Flair, Spacy and Stanford-caseless-NER in terms of F1 and especially Recall.

Related collections

Author and article information

Journal

Publication date Created: 26 February 2020

Article

ArXiV ID: 2002.11402

SO-VID: b65baad6-9467-4ab6-96a7-ee18f87a6ad4

License:

http://arxiv.org/licenses/nonexclusive-distrib/1.0/

History

Custom metadata

Comments 6 pages, 6 tables, 1 figure, 2 examples. This is a report based on applied research work conducted at Dailyhunt

Categories cs.CL

ScienceOpen disciplines: Theoretical computer science

Data availability:

ScienceOpen disciplines: Theoretical computer science

Detecting Potential Topics In News Using BERT, CRF and Wikipedia

Read this article at

Abstract

Related collections

Blockchain in Healthcare Today

Author and article information

Journal

Article

History

Custom metadata

Comments

Comment on this article

Similar content 489