Semantic Models for Re-ranking in Question Answering

This paper describes research aimed at unveiling the role of Semantic Models in Question Answering. In these systems, questions and answers are often expressed in quite different language, so our objective is to bridge this "lexical chasm" by adopting semantic representations. The aim of the research is to find out whether Semantic Models are useful for this task and whether they can improve answer re-ranking performance. We have carried out an initial evaluation of a subset of the semantic models on the CLEF 2010 QA dataset, showing their effectiveness. We have also made a first attempt at combining them by means of Learning to Rank algorithms.


INTRODUCTION
Question Answering (QA) emerged in the last decade as one of the most promising fields in Artificial Intelligence. By exploiting techniques borrowed from Information Retrieval (IR) and Natural Language Processing (NLP), QA systems are able to answer users' questions expressed in natural language with either the exact answer or short passages of text containing it.
One of the pitfalls of QA systems is the difference between the language in which users ask questions and the language of the answers, the so-called "lexical chasm" (Berger et al. (2000)). We believe that a semantic representation of both questions and answers would "bridge the lexical chasm", so we want to adopt Semantic Models (SMs) in order to re-rank the candidate answers in a more sensible way. Based on the usefulness these models have shown in other tasks, we think that SMs can play a significant role in improving the answer re-ranking performance of current state-of-the-art systems.
The general aim of this research is to find out if SMs are useful for answer re-ranking. The questions it aims to answer are:
• Are additional semantic features useful for answer re-ranking? Does their adoption improve systems' performance?
• Which of them is more effective and under which circumstances?
• Do semantic features bring information that is not present in the bag-of-words and syntactic features?
• Is there any Learning to Rank (MLR) algorithm that exploits semantic features more than others (i.e., gains more relative or absolute improvement from their adoption), and why?
The paper is structured as follows. A brief overview of related work is reported in Section 2, while the proposed methodology is described in Section 3. Some preliminary experiments are shown in Section 4 and the description of the future work closes the paper.

RELATED WORK
Much of the work in QA has been done on factoid questions, where answers are short excerpts of text, usually named entities, dates or quantities. Factoid QA systems rely heavily on information extraction techniques, including the adoption of linguistic patterns, in order to obtain the specific answer.
In the last few years non-factoid QA has received more attention. It focuses on causation, manner and reason questions, where the expected answer has the form of a passage of text. Depending on the structure of the corpus, the passages can be single sentences, groups of sentences, paragraphs or short texts.
The presence of annotated corpora from the Text REtrieval Conference (TREC) and the Cross Language Evaluation Forum (CLEF) makes it possible to use machine learning techniques to tackle the problem of ranking the passages for further extraction in factoid QA (Agarwal et al. (2012)). In non-factoid QA the training data adopted is of different types, such as the hand-annotated answers from Wikipedia used by Verberne et al. (2011), the small hand-built corpora adopted by Higashinaka and Isozaki (2008), the Frequently Asked Questions lists employed by Soricut and Brill (2006) and the corpus extracted from Yahoo! Answers used by Surdeanu et al. (2011). As mentioned, only a few experiments have adopted semantic features for answer re-ranking, but their use showed significant performance improvement. There are still several possible semantic features that have not been taken into account so far, and we want to find out if their use could be helpful. For example, features coming from SMs such as Distributional Semantic Models (DSMs) (Turney and Pantel (2010)), Explicit Semantic Analysis (ESA) (Gabrilovich and Markovitch (2007)) and Latent Dirichlet Allocation (LDA) (Blei et al. (2003)) have never been applied to this task.

METHODOLOGY
We are going to test whether these insights are correct, starting from the design and implementation of a QA framework that lets us set up several systems with different settings.
We have already built the cornerstone: Question-Cube (Molino and Basile (2012)) is a multilingual QA framework created using Natural Language Processing and Information Retrieval techniques.
Question analysis is carried out by a full-featured NLP pipeline. The document search step is carried out by Lucene, a standard off-the-shelf retrieval framework that allows TF-IDF, Language Modeling and BM25 weighting. All the passages in the top documents returned by Lucene are considered as candidate answers. The answer re-ranking component is designed as a pipeline of different scoring criteria. We derive a global re-ranking function by combining the scores with CombSum (Shaw et al. (1994)). The CombSum function can be replaced by MLR algorithms if training data is available. More details on the framework and a description of the main scorers are reported by Molino and Basile (2012).
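As an illustration, the CombSum combination of the scorers' outputs can be sketched as follows. This is a minimal sketch: the function names and the min-max normalization step are our assumptions, since CombSum only prescribes summing normalized scores per candidate.

```python
def combsum(score_lists):
    """Combine the scores of several scorers with CombSum:
    min-max normalize each scorer's scores over the candidate set,
    then sum the normalized scores per candidate answer."""
    def normalize(scores):
        lo, hi = min(scores), max(scores)
        if hi == lo:
            return [0.0] * len(scores)
        return [(s - lo) / (hi - lo) for s in scores]
    normalized = [normalize(scores) for scores in score_lists]
    # One combined score per candidate (column-wise sum).
    return [sum(column) for column in zip(*normalized)]
```

The candidates can then be re-ranked by sorting on the combined score.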
The next step is the implementation of different MLR algorithms in order to combine the features obtained from the scoring criteria with linear and non-linear models, replacing the CombSum function. QA datasets are usually highly skewed.
For each question, usually only one correct answer is given, and thus Pairwise MLR algorithms could be more effective than Pointwise and Listwise approaches. Verberne et al. (2011) showed that this has yet to be proved, so we implemented a whole collection of MLR algorithms inside the framework, including Pointwise, Pairwise and Listwise ones. This will allow us to compare their performance on the non-factoid QA task and to find out whether they exploit the additional information given by the semantic features in different ways.
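To illustrate why such skewed data still suits the pairwise setting: a single correct answer against many wrong ones yields one training preference per wrong candidate. A minimal sketch (the function name is hypothetical; real Pairwise MLR implementations build these pairs internally):

```python
def pairwise_preferences(candidates):
    """Given (feature_vector, label) pairs for one question, with label 1
    marking the single correct answer and 0 the wrong ones, build the
    (preferred, dispreferred) pairs a Pairwise MLR algorithm trains on."""
    correct = [f for f, label in candidates if label == 1]
    wrong = [f for f, label in candidates if label == 0]
    return [(c, w) for c in correct for w in wrong]
```

Even with one correct answer and n wrong candidates, n training pairs are produced per question.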
As a proof of concept we implemented some scoring criteria based on DSMs, in order to determine whether their adoption, as unique rankers or combined with simple similarity and density criteria, would improve ranking over the one obtained with classic Information Retrieval weighting schemes.

Distributional Semantic Models
Distributional Semantic Models (DSMs) represent word meanings through linguistic contexts. The meaning of a word can be inferred from the linguistic contexts in which the word occurs.
The idea behind DSMs can be summarized as follows: if two words share the same linguistic contexts, they are somehow similar in meaning. For example, analyzing the sentences "drink a glass of wine" and "drink a glass of beer", we can assume that the words "wine" and "beer" have a similar meaning.
Using this assumption, the meaning of a word can be expressed by a geometrical representation in a semantic space. In this space a word is represented by a vector whose dimensions correspond to the linguistic contexts surrounding the word. The word vector is built by analyzing (e.g. counting) the contexts in which the term occurs across a corpus. Possible definitions of context are the set of co-occurring words in a document, in a sentence or in a window of surrounding terms.
The earliest and simplest formulation of such a space stems from the use of the Vector Space Model in IR.
The scalability of semantic spaces and their independence from external resources have resulted in their practical use in many different tasks. For example, they have been applied to several linguistic and cognitive tasks, such as synonym detection by Landauer and Dumais (1997), semantic priming by Jones and Mewhort (2007), automatic construction of thesauri by Schütze and Pedersen (1995) and word sense induction by Schütze (1998).
Our DSMs are constructed over a co-occurrence matrix. The linguistic context taken into account is a window w of co-occurring terms. Given a reference corpus, the collection of documents indexed by the QA system, and its vocabulary V, an n × n co-occurrence matrix is defined as the matrix M = (m_ij) whose coefficient m_ij ∈ R is the number of co-occurrences of the words t_i and t_j within a predetermined distance w.
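Such a matrix can be built with a single pass over the tokenized corpus. The following is a sketch under our own naming assumptions, shown on a single token list for brevity:

```python
def cooccurrence_matrix(tokens, vocab, w=4):
    """Build the n x n co-occurrence matrix M = (m_ij): m_ij counts how
    often vocabulary terms t_i and t_j co-occur within a window of w tokens."""
    index = {t: i for i, t in enumerate(vocab)}
    M = [[0] * len(vocab) for _ in vocab]
    for pos, term in enumerate(tokens):
        if term not in index:
            continue
        # Look at the w tokens before and after the current position.
        window = tokens[max(0, pos - w):pos] + tokens[pos + 1:pos + 1 + w]
        for other in window:
            if other in index:
                M[index[term]][index[other]] += 1
    return M
```

In the full setting the counts would be accumulated over every document indexed by the QA system.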
The term × term matrix M, based on simple word co-occurrences, represents the simplest semantic space, called the Term-Term co-occurrence Matrix (TTM).
In the literature, several methods to approximate the original matrix by rank reduction have been proposed. The aim of these methods varies from discovering high-order relations between entries to improving efficiency by reducing noise and dimensionality. We exploit three methods for building our semantic spaces: Latent Semantic Analysis (LSA) (Deerwester et al. (1990)), Random Indexing (RI) (Kanerva (1988)) and LSA over RI (LSARI) (Sellberg and Jönsson (2008)). LSARI applies the SVD factorization to the reduced approximation of M obtained through RI. All these methods produce a new matrix which is an n × k approximation of the co-occurrence matrix M, with n row vectors corresponding to vocabulary terms, while k is the number of reduced dimensions. Molino et al. (2012) give more details.
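Of the three, Random Indexing is the simplest to sketch: each vocabulary term is assigned a sparse random index vector of k dimensions, and a term's reduced representation is the co-occurrence-weighted sum of the index vectors of its context terms. The following is a minimal illustration; the function names and parameter defaults are our assumptions, not the paper's implementation:

```python
import random

def random_index_vectors(vocab, k=1000, nonzero=10, seed=0):
    """Assign each vocabulary term a sparse k-dimensional index vector
    with a few randomly placed +1/-1 entries."""
    rng = random.Random(seed)
    vectors = {}
    for term in vocab:
        v = [0] * k
        for pos in rng.sample(range(k), nonzero):
            v[pos] = rng.choice((1, -1))
        vectors[term] = v
    return vectors

def ri_term_vector(term_row, vocab, index_vectors, k=1000):
    """Reduce one row of the co-occurrence matrix: sum the index vectors
    of the co-occurring terms, weighted by the co-occurrence counts."""
    out = [0] * k
    for count, ctx in zip(term_row, vocab):
        if count:
            iv = index_vectors[ctx]
            for i in range(k):
                out[i] += count * iv[i]
    return out
```

LSARI would then apply the SVD factorization to the matrix of these reduced vectors rather than to the full TTM.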
We integrated the DSMs into the framework by creating a new scorer, the Distributional Scorer, which represents both question and passage by applying the addition operator to the vector representations of the terms they are composed of. The vectors of the single words come from the reduced matrix, so by changing the method used to construct it we obtain a different scorer for each semantic space. This way it is possible to compute the similarity between question and passage as the cosine similarity between the summed vectors.
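The composition-and-cosine step of such a scorer can be sketched as below. This is a simplified illustration with hypothetical names; the actual scorer draws its term vectors from the reduced matrices described above:

```python
import math

def distributional_score(question_terms, passage_terms, term_vectors):
    """Represent question and passage as the sum of their term vectors,
    then score the passage by cosine similarity with the question."""
    def compose(terms):
        vecs = [term_vectors[t] for t in terms if t in term_vectors]
        if not vecs:
            return None
        return [sum(dim) for dim in zip(*vecs)]  # element-wise addition
    q, p = compose(question_terms), compose(passage_terms)
    if q is None or p is None:
        return 0.0
    dot = sum(a * b for a, b in zip(q, p))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in p))
    return dot / norm if norm else 0.0
```

Because only the term-vector lookup changes, the same code yields one scorer per semantic space (TTM, LSA, RI, LSARI).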
The simple scorers employed alongside the DSM-based ones in the evaluation are: the Overlap Scorer, which counts the term overlap between the question and the candidate answer; the Exact Sequence Scorer, which counts the number of consecutive overlapping terms between the question and the answer; and the Density Scorer, which assigns a score to a passage based on the distance of the question terms inside it, as proposed by Monz (2004). All the scorers can use linguistic annotations like stems, Part-of-Speech tags, lemmas, Named Entities and combinations of annotations as features instead of simple words.
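For concreteness, the first two simple scorers might look like this over tokenized text (a sketch under our naming assumptions; the Density Scorer follows Monz (2004) and is omitted here):

```python
def overlap_score(question_terms, answer_terms):
    """Overlap Scorer: count distinct question terms appearing in the answer."""
    return len(set(question_terms) & set(answer_terms))

def exact_sequence_score(question_terms, answer_terms):
    """Exact Sequence Scorer: length of the longest run of consecutive
    question terms appearing contiguously in the answer."""
    best = 0
    for i in range(len(answer_terms)):
        for j in range(len(question_terms)):
            k = 0
            while (i + k < len(answer_terms) and j + k < len(question_terms)
                   and answer_terms[i + k] == question_terms[j + k]):
                k += 1
            best = max(best, k)
    return best
```

Swapping the raw tokens for stems, lemmas or other annotations gives the annotated variants mentioned above.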

EVALUATION
The goal of the evaluation is twofold: (1) to give empirical support for the thesis that DSMs can be effective in our QA system and (2) to provide a comparison between the different DSMs.
The evaluation has been performed on the ResPubliQA 2010 dataset adopted in the 2010 CLEF QA Competition (Penas et al. (2010)). The dataset contains about 10,700 documents of European Union legislation and European Parliament transcriptions, aligned in several languages including English and Italian, with 200 questions.
The first metric adopted in the evaluation is the accuracy a@n (known in the literature as success@n), calculated considering only the first n answers. If the correct answer occurs in the top n retrieved answers, the question is marked as correctly answered. In particular, we take into account several values of n: 1, 5, 10 and 30. Moreover, we adopt the Mean Reciprocal Rank (MRR) as well, which considers the rank of the correct answer.
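Both metrics are straightforward to compute from the rank at which each question's correct answer was retrieved. A minimal sketch, using 0 to mark questions whose correct answer was not retrieved at all:

```python
def success_at_n(correct_ranks, n):
    """a@n: fraction of questions whose correct answer appears within the
    top n ranked answers (1-based ranks; 0 means not retrieved)."""
    hits = sum(1 for r in correct_ranks if 0 < r <= n)
    return hits / len(correct_ranks)

def mean_reciprocal_rank(correct_ranks):
    """MRR: mean over questions of 1/rank of the correct answer
    (contributing 0 when the correct answer was not retrieved)."""
    return sum(1.0 / r if r > 0 else 0.0
               for r in correct_ranks) / len(correct_ranks)
```

Unlike a@n, MRR rewards systems that place the correct answer higher within the top n.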
The framework setup used for the evaluation adopts Lucene as the document searcher, and uses an NLP pipeline made of a lemmatizer, a Part-of-Speech tagger and a named entity recognizer.
The different DSMs and the classic TTM have been used both as scorers alone, meaning that no other scorers are adopted, and combined with a standard scorer pipeline. The standard pipeline is composed of the Terms Overlap (TO) scorer, the Lemma+Part-Of-Speech Overlap (LPO) scorer, the Lemma+Part-Of-Speech Density (LPD) scorer and the Exact Term Sequence (ET) scorer.
Moreover, we empirically chose the parameters for the DSMs: the window w of terms considered for computing the co-occurrence matrix is 4, while the number of reduced dimensions considered in LSA, RI and LSARI is equal to 1000.
The performance of the standard pipeline, without the distributional scorer, is shown as a baseline. The experiments have been carried out for both English and Italian. Results are shown in Table 1 for English and in Table 2 for Italian.
The results in the rows marked as "alone" refer to DSMs used as unique rankers, while the results reported in the "combined" part of tables refer to the CombSum of TO, LPO, LPD, ET and the specified DSM scorers.
Both tables report the accuracy a@n computed considering a different number of answers, the MRR and the significance of the results with respect to both the baseline (†) and the distributional model based on the TTM (‡). The significance is computed using the non-parametric randomization test, as suggested by Smucker et al. (2007). The best results are reported in bold.
Considering each distributional scorer on its own, the results show that all the proposed DSMs are better than the TTM, and the improvement is always significant.The best improvement for the MRR in English is obtained by LSA (+180%), while in Italian by LSARI (+161%).
As for the distributional scorers combined with the standard scorer pipeline, the results show that all the combinations are able to overcome the baseline. For English we have obtained an improvement in MRR of about 16% compared to the baseline, and the result is significant with respect to the TTM. For Italian, we have achieved an even higher improvement in MRR of 26% compared to the baseline using LSARI.
The slight difference in performance between LSA and LSARI suggests that LSA applied to the matrix obtained by RI produces nearly the same result as LSA applied to the TTM, while requiring less computation time, as the matrix obtained by RI has fewer dimensions than the TTM.
These results give a preliminary answer to some of the research questions: they suggest that semantic features can be useful for answer re-ranking, since they improve performance significantly, and that LSA and LSARI are the most effective among the models adopted.

Preliminary MLR experiment
A preliminary experiment with MLR algorithms has been carried out separately from the main evaluation. The features we employed are the outputs of the scorers adopted in the previous experiment, a small number indeed, but the aim of the experiment was to find out whether a better combination of the same scorers of the main evaluation could lead to better results.
The experiment was carried out using the RankNet (Burges et al. (2005)) MLR algorithm, performing 10-fold Cross Validation on the same dataset as the main evaluation. We performed four runs, keeping fixed the four features coming from the standard scorers described in Section 4 and changing the fifth feature among the four different DSM scorers. The best average MRR over the 10 folds is 0.68 for English and 0.605 for Italian, obtained with the LSARI DSM. Far from being significant, this small MLR experiment still encourages us to follow the path of semantic features combined with MLR algorithms.

FUTURE PLANS
There are several future steps to follow in order to answer the research questions. Carrying out the reported evaluation, we discovered that some of the semantic features obtained from DSMs can be useful for answer re-ranking, both alone and combined with other features.
What we still do not know is how effective they can be inside a MLR setting, nor whether these findings generalize to other datasets.
To this purpose, the following activities will be carried out:
• To add more MLR algorithms for re-ranking. More MLR algorithms are fundamental in order to carry out a comprehensive comparative analysis.
• To further experiment with the usefulness of other semantic features, coming from ESA, LDA and other semantic models. This could help capture aspects of the semantics of questions and answers that DSMs alone do not cover.
• To incorporate other state-of-the-art linguistic features, in order to determine whether the information they convey overlaps with the information from the semantic features. Good candidates are the features proposed by Verberne et al. (2008), which cover lexical and syntactic information, and the ones proposed by Surdeanu et al. (2011).
In particular, the translation-based features are very effective and can help to "bridge the lexical chasm" (Berger et al. (2000)).
• To investigate other operations for combining the vectors coming from the applied DSMs, in order to tackle the semantic compositionality problem more deeply. To this end, Mitchell and Lapata (2010) used operators like product, tensor product and circular convolution, and their adoption for our task could be helpful.
• Once all the features are ready, an MLR algorithm comparison will be carried out, in order to find out which algorithms benefit most from the semantic features. An ablation test will be used to understand how much of the improvement is due to the semantic features.
Alongside these steps, different datasets will be collected, focusing mainly on non-factoid QA. The Yahoo! Answers Manner Questions datasets are a good starting point, but non-factoid questions from Webclopedia (Hovy et al. (2000)) can also be helpful. The aim is to compare directly with state-of-the-art systems in order to find out whether the semantic features can lead to better results in a general setting.
Another dataset will be collected with aligned English and Italian non-factoid questions and answers taken from Wikipedia. The answers will be posted by the users of Wikiedi and the dataset will contain textual answers in the form of paragraphs from Wikipedia pages, their relevance judgments (the number of votes from the users) and a feature list containing the output of the different scorers.

Table 1 :
Evaluation Results for English

Table 2 :
Evaluation Results for Italian