PROJECTING NAMED ENTITY TAGS FROM A RESOURCE RICH LANGUAGE TO A RESOURCE POOR LANGUAGE

Abstract

Named Entities (NE) are the prominent entities appearing in textual documents. Automatic classification of NE in a textual corpus is a vital process in Information Extraction and Information Retrieval research. Named Entity Recognition (NER) is the identification of words in text that correspond to a pre-defined taxonomy such as person, organization, location, date, time, etc. This article focuses on the person (PER), organization (ORG) and location (LOC) entities for a Malay journalistic corpus of terrorism. A projection algorithm, using the Dice Coefficient function and a bigram scoring method with domain-specific rules, is suggested to map the NE information from the English corpus to the Malay corpus of terrorism. The English corpus is the translated version of the Malay corpus. Hence, these two corpora are treated as parallel corpora. The method computes the string similarity between the English words and the list of available lexemes in a pre-built lexicon to approximate the best NE mapping. The algorithm was evaluated on our own terrorism-tagged corpus and achieved satisfactory results in terms of precision, recall, and F-measure. An evaluation of the selected open-source NER tool for English is also presented.
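To illustrate the kind of string similarity the abstract describes, the following is a minimal sketch of a Dice coefficient computed over character bigrams, used to pick the closest lexicon entry for an English word. It is not the authors' implementation: the function names, the toy lexicon, its tags, and the 0.5 threshold are all assumptions made for this example, and the domain-specific rules mentioned in the abstract are not modelled.

```python
from collections import Counter


def char_bigrams(word):
    """Return the character bigrams of a word (lowercased) as a multiset."""
    w = word.lower()
    return Counter(w[i:i + 2] for i in range(len(w) - 1))


def dice(a, b):
    """Dice coefficient: 2 * |shared bigrams| / (|bigrams(a)| + |bigrams(b)|)."""
    ba, bb = char_bigrams(a), char_bigrams(b)
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        # Both strings are shorter than two characters; fall back to equality.
        return 1.0 if a.lower() == b.lower() else 0.0
    shared = sum((ba & bb).values())  # multiset intersection
    return 2.0 * shared / total


def best_match(word, lexicon, threshold=0.5):
    """Return (lexeme, tag, score) for the most similar lexicon entry,
    or None if no entry reaches the (assumed) threshold."""
    best = None
    for lexeme, tag in lexicon.items():
        score = dice(word, lexeme)
        if best is None or score > best[2]:
            best = (lexeme, tag, score)
    if best is None or best[2] < threshold:
        return None
    return best


# Hypothetical toy lexicon of Malay lexemes with NE tags (illustrative only).
LEXICON = {"Filipina": "LOC", "Indonesia": "LOC", "Polis": "ORG"}

if __name__ == "__main__":
    print(best_match("Philippines", LEXICON))
    print(best_match("Indonesia", LEXICON))
```

In this sketch an English token such as "Philippines" is matched to the Malay lexeme "Filipina" because the two share most of their character bigrams, and the tag stored with that lexeme is what would be projected. A real system would need a principled threshold and tie-breaking rule, which the abstract does not specify.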

Author and article information

Contributors

Norshuhani Zamin: Malaysia

Alan Oxley: Malaysia

Zainab Abu Bakar: Malaysia

Journal

Title: Journal of Information and Communication Technology

Publisher: UUM Press

Publication date (Electronic): April 23, 2013

Volume: 12

Pages: 121-146

Affiliations

[1] Faculty of Science and Information Technology, Universiti Teknologi PETRONAS, Bandar Seri Iskandar, 31750 Tronoh, Perak, Malaysia

Article

DOI: 10.32890/jict.12.2013.8140

SO-VID: 9c8335ad-8b29-4188-b2a5-b219a96b2212

License:

All content is freely available without charge to users or their institutions. Users are allowed to read, download, copy, distribute, print, search, or link to the full texts of the articles in this journal without asking prior permission of the publisher or the author. Articles published in the journal are distributed under a Creative Commons Attribution 4.0 International license (http://creativecommons.org/licenses/by/4.0/).

ScienceOpen disciplines: Communication networks, Applied computer science, Computer science, Information systems & theory, Networking & Internet architecture, Artificial intelligence
