Named Entity Patterns across News Domains

A new event tracking approach is proposed based on the identification of named entity (NE) patterns such as Who, What, Where and When, and their relationship with news domains such as Politics, Economy, Government and Entertainment. This research comprises three parts. The first part uses a set of user studies to identify NE patterns and their relationship with news domains. The second part designs a prototype system based on NE patterns. The final part evaluates the prototype event tracking system. This paper describes the first part, which evaluates the importance of NE across news domains. We have achieved a better understanding of NE patterns by identifying the distribution of NE across news domains.


INTRODUCTION
Topic Detection and Tracking (TDT) applies to the detection and tracking of events from a stream of stories. This stream may or may not be pre-segmented into stories, and the events may or may not be known to the system, where the system may or may not be trained to recognize specific events. This leads to the identification of three technical tasks to be addressed in the present TDT study. These are, namely, the tracking of known events, the detection of unknown events, and the segmentation of a news source into stories.
The tracking task associates incoming stories with events known to the system. An event is defined by its association with stories that discuss that particular event. Thus each target event is characterized by a list of stories which discuss it. Each successive story must be classified according to whether or not it discusses the target event in the tracking task. Therefore, the study corpus is divided into two parts, with the first part being the training set and the second part being the test set. Each of the stories in the training set is flagged as to whether it discusses the target event or not, and these flags (and the associated text of the stories) are the only information used for training the system to classify correctly the target event.
The TDT tracking task is fundamentally similar to an information filtering task [9]. Each begins with a representation of a topic and then monitors a stream of documents, making decisions as they arrive. Each document is assigned a score for that topic and, if the score is high enough, it is retrieved. Filtering simulates interacting with the user to supervise the process, whereas tracking operates as if the user were not there. Systems may be adaptive in that they "guess" that a story is on topic, but they do not receive human confirmation that they were correct.
Recent research in TDT has investigated named entities rather than words because TDT investigates the organization of information by event rather than by subject [2,4,5,6,7,8]. This short paper identifies NE patterns across news domains so that event detection and tracking can be improved.

EVENT TRACKING BY USING NAMED ENTITY APPROACH
In this section, we review the literature concerning the usefulness of NE patterns across domains for improving the performance of an event tracking system. As mentioned above, tracking is basically a simpler version of the classic IR filtering task. Thus, an increasing number of information retrieval and machine learning techniques have been applied. These include k-Nearest Neighbour (kNN) classification, Decision Tree induction, a variety of Language Modelling (LM) approaches and relevance based filtering methods [1,10]. Systematic analysis of their behaviour on event tracking has only just begun.
A TDT system draws a distinction between events and topics. As Yang et al. [10] note, "The USAir-427 crash is an event but not a topic, and airplane accidents is a topic but not an event". Studies by various authors [3,7,10] have emphasized the importance of having a better understanding and a clear definition of an event. Indeed, it appears that event is a term which is easy to understand at the intuitive level but hard to define precisely. An event comprises at the very least what happened, where it happened, when it happened, and who was involved. In the Information Extraction field, the Message Understanding Conferences (MUC) have presented an approach to defining an event using a Scenario Template (ST) task. The main goal of ST is to extract pre-specified event information and relate it to the particular organisation, person, or other entities involved in the event [1].
The identification and extraction of NE word patterns that relate to the questions of Date (when), Location (where), Person (who) and Organisation (what/who) is crucial in TDT because detecting named entities allows us to characterize and detect events in documents. This is particularly true for event topics. Our motivation is to gain a clear understanding of NE patterns in event tracking.
We conclude that an event is defined as a unique circumstance or condition that involves the integration of the W's elements, as shown in Table 1 below:

NE: Who
Description: This element refers to the actor or person that takes part in the event.
NE example: Alex Salmond won the Gordon constituency in the 2007 Scottish Parliament election.

NE: Where
Description: This element refers to the place or location where the event takes place.
NE example: Scotland was the location for the 2007 Scottish Parliament election.

NE: When
Description: This element refers to the date or time at which the event takes place.
NE example: 3rd May 2007 was the election day.

NE: What
Description: This element refers to the subject, occasion, body or activity involved in the event.
NE example: The SNP won the 2007 Scottish Parliament election.

We note that opinion patterns in sentences are used for improving the performance of novelty detection for opinion topics, while NE plays a more important role for event topics. However, simple uses of NE do not seem to be very helpful for improving the performance of a TDT system. The challenging questions are therefore how to make effective use of NE, and what kinds of additional and critical information patterns will be effective.
Kumaran and Allan [4] observed that certain categories of news, such as Elections, Accidents, Violence and War, New Laws, Sports News, and Political and Diplomatic Meetings, are better tackled using solely NEs, while for other news categories it was better to use topic terms. Thus, it is important for this pilot study to experiment with other general news domains, and this strongly motivates the work on NE patterns and their relationship with the news domain, which may contain important and relevant information. In addition, there have been no studies to date which focus on the distribution pattern of NE across news domains.

RESEARCH GOALS AND OBJECTIVES
Our overall research programme will consider the constraints and issues in the tracking task and produce a good event tracking system. The specific objectives of this research are to: 1. Identify NE patterns across news domains; 2. Design a prototype of an interactive and adaptive event tracking system based on NE patterns; 3. Experiment with the system and evaluate its performance in terms of precision at top ranks.
This paper tackles objective 1 in order to have a better understanding of NE patterns by identifying the distribution of NE across news domains.

PILOT STUDY
A pilot study was undertaken to support the first objective of the research. It consisted of two phases: the first was conducted in early February 2007, while the second took place in early March 2007. Ten postgraduate students from the Scottish Centre for Journalism Studies (SCJS), University of Strathclyde were selected for this study.

Objectives
The objectives of this pilot study were to: a. Identify NE patterns (Who, Where, When and What) across news domains (Politics, Economy, Government and Entertainment); b. Measure the importance level of NE across news domains; c. Measure the level of agreement in the keywords given by the respondents.
These objectives will guide the design of the prototype system by investigating NE patterns in terms of frequency and importance. That is, are some types of NE more important and more frequent in some domains?

Instruction and online questionnaire
An online questionnaire was placed on a server in the Department of Computer and Information Sciences (CIS) at the University of Strathclyde at https://devweb2006.cis.strath.ac.uk/~masnizah/cgi-bin/survey/index.htm. It was designed using HTML and CGI scripting as a front end to receive the data via email. On the main page, respondents were given the instructions, including the description of named entities in Table 1, and the privacy statement. They were asked to read each document. They were then required to give the important keywords that best describe the document and rate their importance on a scale of 1 to 5, with 1 being the least important and 5 being the most important. The respondents then had to classify the keywords according to the type of named entity being mentioned. Finally, at the end of the online questionnaire, respondents were given an open-ended question as an opportunity to supply any comments or suggestions regarding the pilot study.

Corpus and distribution
We constructed a corpus of 40 documents chosen from CNN News, Associated Press and Scotsman.com News. Each domain consisted of 10 documents. The documents were current stories, including some related to Scotland or Glasgow since all respondents were living in Glasgow, such as documents P5 and P10 in Politics; E1, E4 and E8 in Economy; G1 in Government; and Et3 in Entertainment. The documents are shown in Table 2. A total of 12 documents was given to each respondent in this pilot study. The distribution of documents was based on a repeated Latin square. The reason for this was to balance the distribution of documents so that each respondent receives documents from all four domains and every document is viewed by 3 respondents. As a result, this distribution allows the comparison of keywords given for the same document and supports the third objective of the pilot study, which is to measure the level of agreement among respondents.
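The balance constraints described above (10 respondents, 12 documents each, every one of the 40 documents seen by exactly 3 respondents) can be illustrated with a simple cyclic assignment. This is only a sketch under stated assumptions: the function name `assign_documents` is hypothetical, the numbering of respondents is arbitrary, and the cyclic scheme is a stand-in that satisfies the same constraints, not necessarily the repeated Latin square used in the study. The document prefixes (P, E, G, Et) follow the identifiers mentioned in the text.

```python
from collections import Counter

# Document prefixes as they appear in the text (P5, E1, G1, Et3, ...).
DOMAIN_PREFIXES = ["P", "E", "G", "Et"]
N_RESPONDENTS = 10
DOCS_PER_DOMAIN = 10
VIEWS_PER_DOMAIN = 3  # each respondent reads 3 documents per domain (12 in total)


def assign_documents():
    """Cyclic assignment (an assumption, not the authors' exact design).

    Respondent r reads documents r, r+1 and r+2 (wrapping around) in
    every domain, so each document is read by exactly 3 respondents.
    """
    assignment = {}
    for respondent in range(N_RESPONDENTS):
        docs = []
        for prefix in DOMAIN_PREFIXES:
            for offset in range(VIEWS_PER_DOMAIN):
                index = (respondent + offset) % DOCS_PER_DOMAIN
                docs.append(f"{prefix}{index + 1}")
        assignment[respondent + 1] = docs
    return assignment


if __name__ == "__main__":
    plan = assign_documents()
    views = Counter(doc for docs in plan.values() for doc in docs)
    for respondent, docs in plan.items():
        print(respondent, docs)
    print("distinct documents:", len(views))
    print("views per document:", set(views.values()))
```

Because every document is read by three respondents, the three keyword lists for a document can be compared directly when measuring agreement.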

Results and discussion
A total of 557 data points or keywords from the 40 documents was analysed to identify: 1. NE patterns across news domains; 2. The importance level of NE and news domains; 3. The level of agreement in the keywords and news domains. (3) = 17.1 and p = 0.001. A Chi-square test was performed and found strong evidence of a relationship between NE and news domains, χ²(9, N = 557) = 32.1, p < 0.001. This shows that the distribution pattern of the type of NE is domain dependent across Politics, Economy, Government and Entertainment.
Table 4 summarizes the distribution of the importance level of NE across domains. For the Very Important level, What was the most frequent NE across domains, and the highest percentage of What NE was in Economy (63.4%). For the Important level, What was again the most frequent NE in Economy, Government and Entertainment, but not in Politics, where Who NE had a higher percentage (36.4%). For the Fairly Important level, What was the most frequent NE in Politics, Economy and Government, but not in Entertainment, where Who NE had a higher percentage (44.0%). Surprisingly, for the Quite Important level, Where had the highest percentage in Politics (44%) and Entertainment (100%), What was the most frequent NE in Economy (100%), and Who and What NE shared the same percentage (50%) in Government.
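The chi-square test of independence between NE type and news domain can be sketched as follows. With four NE types and four domains, the degrees of freedom are (4 - 1) x (4 - 1) = 9, which agrees with the reported test. The counts in the example table are hypothetical placeholders chosen only to make the sketch runnable; they are not the study's data, and the function name `chi_square_independence` is an assumption.

```python
def chi_square_independence(table):
    """Return (statistic, degrees_of_freedom) for a contingency table.

    `table` is a list of rows of observed counts. The expected count of
    a cell is row_total * column_total / grand_total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


if __name__ == "__main__":
    # HYPOTHETICAL counts, not taken from the paper.
    # Rows: Who, Where, When, What.
    # Columns: Politics, Economy, Government, Entertainment.
    observed = [
        [30, 20, 25, 35],
        [15, 10, 20, 10],
        [5, 5, 2, 8],
        [50, 60, 55, 40],
    ]
    stat, dof = chi_square_independence(observed)
    print(f"chi-square = {stat:.2f}, df = {dof}")
```

The p-value would then be read from the chi-square distribution with the returned degrees of freedom, for example with a statistics package.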

The importance level of NE and news domains
We used Spearman's rank correlation test to examine the correlation between the importance level of NE and news domains. The findings indicated that there is no significant relationship between the importance level of NE and news domains (p > 0.001). The importance level is domain independent, which shows that respondents give different weightings to NE. Although What is the most frequent NE, it is not necessarily Very Important across news domains.

Level of agreement in the keywords and news domains
The level of agreement in the keywords given by the respondents was calculated using the overlap value [9], which is the intersection of keywords divided by the union of keywords from the same document. Thus, an overlap of 1.0 means perfect agreement and an overlap of 0.0 means that the respondents did not agree on any of the keywords given. We used stemming to identify the intersection of keywords, so that launched, launching and launch were counted as a single term. We also conflated different keywords referring to the same context and meaning. The overlap values from this study showed the variation in individual judgements of the keywords for a document. This will guide us in selecting documents and in measuring system performance using the keywords given.
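The overlap value described above (intersection of keywords divided by union of keywords for the same document) can be sketched as below. The suffix-stripping function `crude_stem` is a deliberately naive, hypothetical stand-in for whatever stemmer was actually used, and the keyword sets in the example are invented for illustration. Conflation of synonymous keywords is not modelled here.

```python
def crude_stem(word):
    """Naive suffix stripping (an assumption, not a real stemmer).

    Maps launched, launching and launch to the same stem 'launch'.
    """
    word = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def overlap(keyword_sets):
    """Intersection over union of the stemmed keywords of all respondents."""
    stemmed = [{crude_stem(k) for k in keywords} for keywords in keyword_sets]
    if not stemmed:
        return 0.0
    union = set.union(*stemmed)
    if not union:
        return 0.0
    intersection = set.intersection(*stemmed)
    return len(intersection) / len(union)


if __name__ == "__main__":
    # Invented keywords from three respondents for one document.
    respondents = [
        {"launched", "Scotland"},
        {"launching", "election"},
        {"launch"},
    ]
    print(overlap(respondents))
```

An overlap of 1.0 is returned only when all respondents give the same stemmed keywords, and 0.0 when no stemmed keyword is shared by all of them.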

CONCLUSION
Recent research in TDT has used information on NE, and this requires more effort to understand NE. Thus, our study has investigated the type and the use of NE in news domains. We have achieved a better understanding of how the type of NE varies across domains. This study revealed that there is a significant difference in the distribution of NE across domains. What is the most dominant type of named entity, followed by Who, across news domains. This is interesting since the results give clues as to which named entities users would expect to occur more often within domains.
We found that there is no significant relationship between the importance level of NE and news domains, indicating that the importance level is domain independent. This reveals the respondents' perception of how often a named entity occurs and how important it is across domains. For example, When has 0% occurrence in Government, yet it can still be a Very Important type of named entity.
The level of agreement in the keywords among the respondents is likely to be higher in Entertainment compared to other domains. The measure of agreement based on the overlap value helps to give a better understanding of document selection and of how to use the keywords given to measure system performance.
The findings have key implications for developing a prototype system design that should adapt to the user profile and enhance the understanding of user needs. Further research is needed on specific news stories to investigate the importance of When and Where NE.