News Selection with Topic Modeling

Numerous news articles flow into news aggregators, and the important ones are selected for presentation on the front-page. Front-page news selection comes in two types: personalized and public news recommendation (selection). This study examines public news recommendation, which aims to satisfy all users' interests on the front-page. Public news recommendation is usually driven by meta-features such as news popularity; this work introduces a different approach that exploits the news content itself. The main target is to select important (significant) news articles while diversifying the selected news topics, and a new approach based on topic modeling is developed for this purpose. Results show that it is hard to achieve a satisfactory level of precision with content-based public news recommendation. However, the precision of the topic modeling-based approach is noticeably better than that of random news recommendation, and the topics of the selected news are also diversified by using topic modeling.


INTRODUCTION
Numerous news articles flow into news aggregators from many news sources. Typical examples of news aggregators are Google News and Yahoo! News. Important news articles are selected (recommended) for presentation on the front-page (main page) of news aggregators. The recommendation process should take the scalability of the system (millions of users) and the amount of news churn (insertions and deletions) into account [1]. Therefore, there is a need for fast and effective news recommendation approaches.
What is important news, and what makes news important? These questions are beyond the scope of this study. Instead, news aggregators simplify the task into two types of front-page news recommendation: personalized and public.
Personalized news recommendation aims to satisfy a specific user's news interests while preserving the importance of news. Typically, a profile model associated with the user is generated, and candidate news articles are filtered against the user's profile model whenever the user logs into the system. The user's past history and similar users' system activities are exploited to generate the model. Public news aggregators simply assume that popular news articles are important, and popularity is mostly measured by meta-features such as the number of clicks. This study examines public news recommendation, but with a different approach that exploits the news content instead of meta-features. The motivation for using news content is to explore the rich text content for useful information and to remain applicable in cases where there are no meta-features but only news content.
News aggregators should take the following factors into account while selecting (recommending) news for the front-page:
• Importance ranking: News should be ranked according to their importance. Importance is an abstract notion for public front-pages; popularity is one way to measure it. In this study, importance measures how well a news article represents a cluster/topic.
• Diversification: The news agenda should be presented on the front-page from as many directions or viewpoints as possible. This study aims to increase topic variety on the front-page to obtain diversification.
• Length of front-page: News aggregators have limited space to present the most important news articles while providing enough diversification on the front-page. This study examines front-pages of length 10.
This study develops a new method based on topic modeling that considers all of the above factors for public news recommendation. The following section gives a brief overview of related work on news recommendation. Section 3 explains the details of the topic modeling-based approach. Section 4 gives the experiment setup and results. Section 5 concludes the study with some key future directions.

RELATED WORK
Recommendation is mainly categorized into collaborative and content-based filtering. Collaborative filtering aims to exploit similar users' activity on the system, whereas the content of the item (news article) is processed in content-based filtering. A typical collaborative filtering example for news recommendation is the early version of Google News [Das et al., 2007]: users with a similar click history are obtained, and the news they read is recommended. Later, the recommendation approach of Google News was changed to combine collaborative and content-based filtering [Liu et al., 2010]: a content-based approach that models the user's information profile is mixed with the previous collaborative click-history method. The user profile is built on the user's news interests by using the news articles that the particular user has read before.
Content-based filtering of news articles has also been examined in various studies. The common point of these studies is to create a user profile, implicitly or explicitly. Getting feedback from users after they read news articles is one way of explicit user profiling [Billsus, Pazzani, 1999; Tan et al., 1998]. Other studies monitor user activities to create a user profile implicitly [Good et al., 1999].
This study examines content-based filtering with a different approach. The proposed method neither monitors user activities nor gets feedback from users; thereby it is much like a selection method rather than a recommendation approach. The text content of the news article is processed without any user information, so the method can be used for public front-pages.
Topic modeling has recently been studied widely and has also been applied to the news domain [Newman et al., 2006]. The aim is to model a collection of texts into groups/topics. A popular topic modeling approach, Latent Dirichlet Allocation (LDA) [Blei et al., 2003], is used in this study. To the best of my knowledge, there is no study that adapts topic modeling to public news recommendation.

TOPIC MODELING-BASED NEWS SELECTION
Topic modeling-based news recommendation for public front-pages of news aggregators is illustrated in Figure 1. The main tasks of the algorithm are the following steps: (i) pre-processing news articles, (ii) topic modeling of the pre-processed news articles with the LDA algorithm, (iii) calculating document importance values using the topic model and ranking documents according to their importance, (iv) calculating topic importance values using the topic model and ranking topics according to their importance, (v) selecting (recommending) news articles using the rankings from (iii) and (iv) based on priority scheduling.

Pre-processing
Pre-processing includes stemming and removing stop-words. The stemming strategy is to use just the first 5 characters of each word, which has been shown to give good results on Turkish text [Toraman et al., 2011]. The list of the most common Turkish words is obtained from [Can et al., 2008].
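A minimal sketch of this step (the stop-word list below is a tiny illustrative stand-in, not the actual Turkish list of [Can et al., 2008]):

```python
# Pre-processing sketch: 5-character prefix stemming + stop-word removal.
# STOPWORDS is an illustrative subset, not the real list from Can et al.
STOPWORDS = {"ve", "bir", "bu", "da", "de"}

def preprocess(text, prefix_len=5):
    """Lowercase, drop stop-words, keep only the first 5 characters of each word."""
    tokens = text.lower().split()
    return [t[:prefix_len] for t in tokens if t not in STOPWORDS]

print(preprocess("Ankara ve Istanbul arasinda bir anlasma"))
# -> ['ankar', 'istan', 'arasi', 'anlas']
```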

Topic Modeling with LDA
The Latent Dirichlet Allocation (LDA) algorithm assigns latent topics to the words of a given text collection, representing the collection in the vector space model [Salton et al., 1975]. Briefly, LDA determines two basic distributions: a word distribution for each topic and a topic distribution for each document. These distributions can be used for determining the topic label of each document and for ranking the representative words of each topic. The LDA algorithm learns a topic model; however, this study does not use the train-test paradigm. Instead, the distributions obtained from the model are exploited to develop new importance measures for documents and topics.
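To make the two distributions concrete, the toy sketch below stores them as plain dictionaries (all words and weights are made up for illustration) and shows how a document's topic label and a topic's representative words are read off them:

```python
# Toy stand-in for LDA output (words and numbers are illustrative, not from the paper).
# topic_word[t][w] = weight of word w in topic t; doc_topic[d][t] = weight of topic t in doc d.
topic_word = {
    0: {"match": 0.12, "goal": 0.10, "team": 0.08},
    1: {"vote": 0.15, "party": 0.11, "law": 0.07},
}
doc_topic = {
    "doc1": {0: 0.85, 1: 0.15},
    "doc2": {0: 0.30, 1: 0.70},
}

def topic_label(doc):
    """Assign the document the topic with the highest distribution weight."""
    return max(doc_topic[doc], key=doc_topic[doc].get)

def top_words(topic, n=2):
    """Rank representative words of a topic by distribution weight."""
    ranked = sorted(topic_word[topic].items(), key=lambda kv: -kv[1])
    return [w for w, _ in ranked[:n]]

print(topic_label("doc1"), top_words(0))  # -> 0 ['match', 'goal']
```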
A sample topic-word distribution obtained from the LDA algorithm is listed in Table 1.

Calculating Document Importance
Document importance imp(d_i) in (2.1) is calculated from the distribution weights of the words of d_i in its topic. The formula (2.1) can then be changed to consider only the k words that have the highest distribution weights:

imp_k(d_i) = (1/k) Σ_{j=1}^{k} w_ij (2.2)

The value of k can be determined by the Pareto Principle, which implies the 80-20 law [Newman, 2005]: roughly 80% of the total distribution weight is concentrated in the first 20% of the ranked words, so k is set to the first 20% of all words.
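Since the formulas did not fully survive in this copy, the sketch below shows one plausible reading of (2.1), (2.2), and the Pareto choice of k; averaging over the word weights is my assumption, and the weights are illustrative:

```python
# Hedged sketch of document importance. imp (2.1) averages the topic-distribution
# weights of all words of the document; imp_k (2.2) keeps only the k
# highest-weighted words; pareto_k picks k as the first 20% of the ranked words.
def imp(weights):
    """(2.1): mean distribution weight over all words of the document (assumed)."""
    return sum(weights) / len(weights)

def imp_k(weights, k):
    """(2.2): mean over only the k highest-weighted words (assumed)."""
    top = sorted(weights, reverse=True)[:k]
    return sum(top) / k

def pareto_k(n_words, ratio=0.2):
    """Pareto Principle: take the first 20% of the ranked word list."""
    return max(1, int(n_words * ratio))

w = [0.30, 0.20, 0.10, 0.05, 0.05, 0.04, 0.03, 0.02, 0.01, 0.01]
print(round(imp(w), 3), round(imp_k(w, pareto_k(len(w))), 3))  # -> 0.081 0.25
```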

Calculating Topic Importance
Assume M is the number of topics. The value of M must be given to the LDA algorithm as input. There are various studies on determining the number of topics in a text collection; this study simply chooses M with a heuristic inspired by clustering studies [van Rijsbergen, 1979].

For each topic L_i (1 ≤ i ≤ M), topic importance is calculated by using the distribution weights of the words in the topic distribution and also the importance of the documents in the topic. Note that all words appear in all topic distributions, but their weights differ. Assume |L_i| is the total number of words in the topic, t_ij is each word in the topic (1 ≤ j ≤ |L_i|), and D is the total number of documents in topic L_i. The distribution weight of word t_ij in topic L_i is w_ij. The topic importance topic_imp_k(L_i) is calculated as follows:

topic_imp_k(L_i) = Σ_{j=1}^{k} w_ij + Σ_{m=1}^{D} imp(d_m) (2.4)

The left side of the formula represents how important the words in the topic are by summing their distribution weights over the k highest-weighted words. The right side simply adds the importance values of the documents that are members of the topic.
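The description above can be sketched as follows; since the original formula was lost in this copy, the unnormalized sum of the two terms is an assumption based on the text ("simply adds importance values"), and all numbers are illustrative:

```python
# Hedged sketch of topic importance (2.4): word term (sum of the k
# top word weights in the topic) + document term (sum of the member
# documents' importance values). The lack of normalization is assumed.
def topic_imp_k(word_weights, doc_imps, k):
    word_part = sum(sorted(word_weights, reverse=True)[:k])  # left side
    doc_part = sum(doc_imps)                                 # right side
    return word_part + doc_part

print(topic_imp_k([0.3, 0.2, 0.1, 0.05], doc_imps=[0.08, 0.12], k=2))
```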

Front-page News Selection Strategy
Using the topic and document importance values obtained in the previous steps, a priority scheduling-based news selection strategy is applied to satisfy both importance and diversification of news articles in the limited space of the front-page.
In priority scheduling [Silberschatz et al., 2008], the CPU must serve the waiting processes in limited time; each process has a priority and a working time. The CPU starts with the process that has the highest priority and serves it until it finishes, and the other processes are served in the same manner. In this study, the CPU is the proposed algorithm for public news recommendation, and a process is a topic. Each topic demands that its most important news articles appear on the front-page. The proposed algorithm serves each topic's important news articles by considering the topic's priority, the topic's demand, and the length of the front-page (assumed to be 10 in this study).
The following steps describe the news selection strategy for front-pages:
(i) Documents ranked according to their importance are obtained as in (2.2).
(ii) Each topic's priority is assigned as its topic importance, calculated as in (2.4).
(iii) Each topic has a demand for presenting its important news articles. This demand value is calculated as follows:
a. Calculate the weight of importance (woi) of topic i as its topic importance normalized by the total topic importance:
woi_i = topic_imp_k(L_i) / Σ_{m=1}^{M} topic_imp_k(L_m) (2.5)
b. Calculate the constant value needed to represent one news article on a front-page of length F:
c = 1/F (2.6)
c. Calculate the demand of topic (dot) as:
dot_i = woi_i / c (2.7)
Demand values (dot) are calculated as described above; d_ij represents the j-th most important news article in topic i.

Experiment Setup
The LDA algorithm is implemented with the MALLET library [MALLET, 2013]. The number of iterations in LDA is set to 1000, and the Dirichlet parameters α and β of the distributions are set to their default values (0.01). Documents are represented in the vector space model, and the algorithms proposed in this study are implemented in Java.
The dataset, BilFront-2009 [BilFront, 2013], contains 15,844 Turkish news articles from 36 different days. Each day has approximately 400 to 500 news articles, except for two days that have approximately 200 to 300. Each news article is labeled from 1 to 4 by three different annotators to indicate its importance: values 1 and 2 denote unimportant news, whereas 3 and 4 imply important news.
For each of the 36 days in the dataset, independent experiments are conducted three times and the average is taken, because each day is assumed to have a separate news agenda. Random news selection is used as a baseline method, and the random experiments are repeated 20 times. Click-count information or other meta-features cannot be used, since the dataset includes only the news content.
Evaluation of the experiments is based on precision. The annotator scores between 1 and 4 are scaled to binary important/unimportant labels by thresholding at 2.5: average scores of at least 2.5 are important, and lower scores are unimportant. Precision is then calculated as

p = n_{≥2.5} / F

where n_{≥2.5} is the number of selected front-page news articles whose average annotator score is at least 2.5, and F is the length of the front-page. Precision values for the 36 days are obtained, and the dataset precision is calculated as

P = (1/36) Σ_{i=1}^{36} p_i

where p_i is the precision of the i-th day.
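The evaluation can be sketched as follows (the scores are made-up illustrative values, not taken from the dataset):

```python
# Evaluation sketch: binarize mean annotator scores at 2.5, compute
# per-day precision over the F selected articles, then average over days.
def day_precision(selected_scores, F=10):
    """Fraction of selected articles whose mean annotator score is >= 2.5."""
    return sum(1 for s in selected_scores if s >= 2.5) / F

def dataset_precision(daily):
    """Average of the per-day precision values."""
    return sum(daily) / len(daily)

day1 = [3.0, 2.7, 1.5, 4.0, 2.0, 3.3, 2.5, 1.0, 2.6, 3.0]  # mean annotator scores
print(day_precision(day1))  # -> 0.7
```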

Experiment Results
Table 2 shows the dataset precision results for different versions of the proposed algorithm, called NST (news selection with topic modeling).

Repeated words: Document importance calculations (2.1) and (2.2) are done by considering all words in the document, regardless of their repetitions. Table 2 gives the experiment results of (2.1) with repetitions (NST_rep) and without repetitions (NST). It seems that repetitions are useful to consider in the calculations.
Most important k words: (2.2) considers only the k words that have the highest weights. Table 2 gives the results of considering k=500 words (NST_2.2) and all words (NST_2.1). Using all words gives worse results than using a subset of the words.
Pareto Principle: The value of k can be determined by the Pareto Principle as the first 20% of all words. Table 2 lists experiments with different k values and with the Pareto value only (NST_pareto). The results show that the Pareto Principle works well as a generalized decision.
News selection strategy: News selection can be done either as in Figure 3 (NST_priority) or by choosing the most important news article from each topic, which resembles democracy (NST_democracy). Table 2 shows that the priority scheduling-based strategy works better than the democracy-based one.
Comparison with random selection: NST_best uses repeated words, the best k words chosen with the Pareto Principle, and the priority scheduling-based strategy. The baseline method selects news articles randomly, without any limitation except the length of the front-page. NST_best works much better than random selection in terms of precision.

DISCUSSION and CONCLUSION
This study introduces a novel algorithm called NST that provides topic modeling-based news selection for public front-pages of news aggregators. It does not recommend news articles by using the user's or similar users' past history; instead, NST uses just the news content.

ACKNOWLEDGMENT
This study is supported by TÜBİTAK (The Scientific and Technological Research Council of Turkey), grant no. 111E030.

Figure 1: Main tasks of news selection with topic modeling.

Figure 2: Sample topic-word distribution weights for an arbitrary topic, obtained from a random day in the dataset. Words are ranked in distribution-weight order.
(iv) Priority scheduling puts the most important news article of the topic with the highest priority in the first slot. The other slots are served by considering the topic priorities and their demands. A sample news selection based on this strategy is displayed in Figure 3; there are four different topics, and their weight-of-importance values (2.5) and demand values (2.7) are given in the figure.

Figure 3: A sample news selection based on the priority scheduling strategy.

Table 1: A sample topic-word distribution obtained from the LDA algorithm on the news articles of 12.09.2009. Stemming to the first 5 characters is applied. English translations or explanations of the Turkish words, based on their non-stemmed versions, are given in parentheses for each topic to give an idea of the topic meanings.

Assume the number of documents in the text collection is N, and the importance value of each document d_i (1 ≤ i ≤ N) is calculated by using the topic-distribution value of each word in d_i, obtained from d_i's topic distribution. This intuitively measures the document's importance within its topic by calculating how well the words of d_i represent the topic of d_i. Assume L_i is the topic of d_i, |d_i| is the number of words in d_i after pre-processing, and t_ij is each word in d_i (1 ≤ j ≤ |d_i|). The distribution weight of word t_ij in topic L_i is w_ij, and LDA gives this distribution as output. Document importance imp(d_i) is then:

imp(d_i) = (1/|d_i|) Σ_{j=1}^{|d_i|} w_ij (2.1)

Table 2: Precision results of topic modeling-based news selection (NST) under different scenarios.

Selecting (recommending) important news articles from a text collection is a hard task when only the news content is used. The experiment results show that the proposed algorithm NST works better than random news selection. The proposed algorithm also provides diversified front-pages intuitively, since it employs topic modeling and the priority scheduling-based strategy selects news from various topics. Some future points are the following:
• Topic and document importance can be calculated with different methods. For measuring document importance, traditional Information Retrieval approaches like tf-idf [van Rijsbergen, 1979] can be used.
• Different methods to find the number of topics should be examined. This study assumes a simple method inspired by clustering studies.
• Other topic modeling methods like pLSA [Newman, Block, 2006] can be examined.
• Effects of different front-page lengths should be examined.
• The proposed algorithm can be compared with other content-based or collaborative filtering approaches.
• Diversity between topics is considered in this study; diversity between documents can also be employed to improve the model.
• Novelty detection for news articles can be used to obtain better front-pages in terms of news-agenda coverage.
• Experiments should be repeated with statistical tests to compare different configurations in terms of significance.