for Community Question-Answer

A decision-making process is simplified with the help of recommender systems. Recommender systems actively process knowledge sources and information to collect the data needed to build useful recommendations. These recommendations suggest suitable items to the user based on an analysis of the user's preferences and constraints, both implicit and explicit. Using the content-based filtering approach of recommender systems, this article proposes an innovative scheme to annotate the question asked by a user on a QA forum with suitable tags. The scheme suggests relevant tags by effectively analyzing questions from a clustered knowledge pool and then ranking the tags according to their relevance. It aims at providing meaningful, trustworthy and persuasive recommendations that classify the question into the appropriate domain of a QA forum.


Introduction
Community Question-Answer (cQA) websites are large repositories of valuable knowledge. These websites are developing rapidly with the immensely growing sources of information in varied areas. StackOverflow, Quora and Yahoo! Answers are typical examples of cQA services with widespread acceptance. However, rapid growth provides better opportunities as well as puts forward new challenges. Every cQA website aims at providing immediate and satisfactory answers to any question asked by the user. One of the major challenges is directing new questions to the right community of experts. Question annotation is a transformative process used to classify a question and direct it to the community where it can be answered by the right group of experts. Annotating questions with appropriate tags proves useful in two ways. Firstly, the user is able to give multiple tags to a single question, and these assigned tags summarize the question at a high level. Secondly, tags help in putting questions into the feeds of the related domain followers and experts, which leads to fast and accurate answers. However, tagging is often incomplete, caused by incomprehensive question understanding or informal tagging behaviours. For example, some user-provided tags in the vocabulary are biased towards personal perspectives or specific contextual information. This severely hinders the performance of tag-based systems. As a result, the performance of question search degrades due to the absence of potentially relevant tags that could be used to expand the question, and efficient question routing to the appropriate domain is obstructed.
One similar contribution is [1], which makes use of adaptive probabilistic hypergraph learning to infer a semantically close question space, and then uses various similarity scores to rank the tags. In our approach, we utilize high confidence tags (discussed later in the paper) and the number of common tags to obtain a reduced relevant question space. We then find the semantic closeness (using a hybrid, trained semantic score algorithm discussed later) of each question in this space with the question being asked by the user, obtaining a final sorted list of relevant questions. The uncommon tags from this sorted list of relevant questions are sorted further based on confidence values to obtain the final list of Top-N tags to be recommended to the user.
This article aims at generating relevant tags automatically, tackling all the difficulties above. We use lexical properties on the similar question space to drill down into the knowledge source and find the relevance score between the question and the tags. We have developed a scheme that enhances the question annotating process using both content analytics and the history of tagging behaviour from the knowledge source.
The main contributions of this research are as follows.
• This work uses the hybrid similarity measure obtained by combining various measures of semantic similarity, thus postulating a new hybrid score for semantic relatedness. This hybrid score would train itself and adapt according to the diversity of the question-space in the corpus.
• It proposes the idea of high confidence tags. We create a subset of tags from the tag space comprising tags pre-associated with the ones extracted from the question title and those given by the user (if any). Thus, the questions from the knowledge pool containing these tags have a higher level of confidence for tentative semantic relatedness.
• Reducing the time complexity for fetching the top recommendations, this scheme strategically abridges the domain space with high confidence tags followed by ranking with hybrid semantic measure. This not only provides more accurate tag recommendations but also suggests cognate questions.
The remainder of this article is structured as follows. Section 2 introduces and describes the methodology followed for the recommendation of relevant tags. Section 3 gives the evaluation and the analysis of the approach as well as gives the performance highlights followed by our concluding remarks in Section 4.

Question Preprocessing
The user question string is processed first to obtain the tags which are directly present in the question. The processing of the question was done with the help of the Stanford CoreNLP library. Question preprocessing involves converting the given user question into a standard format as described below:

Lemmatizing
When comparing two words, a common problem arises: the two words may not be in the same inflected form. Hence, we apply standard lemmatization algorithms to fetch the base form of each word.

Stop Word Removal
Nearly all question titles contain a few common recurring words. These words add no value to semantic relatedness, so it is better to remove them before computing the score. We maintain a list of all the stop words and remove from the question every word present in that list.

Tokenization
The user may not always separate two words in the question title with a space, and may use various punctuation marks as delimiters or terminators. Hence we need to identify correct word boundaries, for which we use standard tokenizing libraries.

Punctuation Removal
The user question might contain punctuation marks like "!", "?", ".", etc. We need to remove them, as their presence makes word identification difficult.

Stemming
The problem is analogous to the one addressed by lemmatization: stemming reduces words to their word stem, base or root form.
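The preprocessing steps above can be sketched as a small pipeline. Our system uses Stanford CoreNLP; the pure-Python version below is only illustrative, and its stop-word list and suffix-stripping rules are toy stand-ins for the real library components.

```python
import re

# Illustrative stop-word list; a production system would use a full list
# such as the one shipped with Stanford CoreNLP or NLTK.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "of", "to",
              "how", "do", "i", "from", "for", "on", "with"}

def tokenize(question: str) -> list:
    """Split on whitespace and punctuation to recover word boundaries."""
    return re.findall(r"[a-z0-9]+", question.lower())

def remove_stop_words(tokens: list) -> list:
    return [t for t in tokens if t not in STOP_WORDS]

def stem(token: str) -> str:
    """Naive suffix stripper standing in for a real stemmer such as
    Porter's; e.g. 'duplicates' -> 'duplicat'."""
    for suffix in ("ing", "ies", "es", "s", "ed"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(question: str) -> list:
    """Tokenize, drop stop words, then stem each remaining token."""
    return [stem(t) for t in remove_stop_words(tokenize(question))]

print(preprocess("How do I remove duplicates from a list in Python?"))
```

In practice the lemmatization and tokenization steps would be delegated to CoreNLP annotators; only the overall ordering of the steps matters here.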

Creating vectors and scoring
We need to find a numeric semantic score between the user-asked question (Q_u) and a question (Q_c) from our corpus. All the semantic similarity algorithms mentioned below return a relatedness score between two words, so we need a way to compare two sentences (Q_u and Q_c). Let Q_n denote the union of the terms of Q_u and Q_c. To build vector V_u, every term in Q_n is compared with Q_u: if an exact match is found, that vector dimension gets the maximum value; otherwise, the dimension gets the largest of the semantic scores calculated for that term in Q_n against every term in Q_u [4] [5] [6]. A similar process is repeated to get vector V_c. The net score between V_u and V_c is then calculated using cosine similarity as follows:

Score(Q_u, Q_c) = (V_u · V_c) / (|V_u| |V_c|)
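A minimal sketch of this vector construction and cosine scoring. The TOY_SIM word-similarity table is hypothetical; in the real system the word-level score would come from one of the WordNet measures described in the next subsection.

```python
import math

def build_vector(union_terms, question_terms, word_sim, max_score=1.0):
    """One dimension per term of the union vocabulary.  Exact matches
    get the maximum value; otherwise a dimension takes the best semantic
    score of that term against every term of the question."""
    vec = []
    for term in union_terms:
        if term in question_terms:
            vec.append(max_score)
        else:
            vec.append(max(word_sim(term, t) for t in question_terms))
    return vec

def cosine(v1, v2):
    """Standard cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return dot / norm if norm else 0.0

# Hypothetical word-level scores standing in for a WordNet-based measure.
TOY_SIM = {("love", "romance"): 0.8, ("python", "java"): 0.4}
def word_sim(w1, w2):
    return 1.0 if w1 == w2 else TOY_SIM.get((w1, w2), TOY_SIM.get((w2, w1), 0.1))

q_u, q_c = ["love", "python"], ["romance", "java"]
union = sorted(set(q_u) | set(q_c))      # the union vocabulary
v_u = build_vector(union, q_u, word_sim)
v_c = build_vector(union, q_c, word_sim)
print(round(cosine(v_u, v_c), 3))
```

Because non-matching dimensions still receive a soft (semantic) value instead of zero, two questions sharing no literal words can nevertheless score highly.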

Semantic Closeness/Relatedness
Knowledge-based similarity is a class of semantic similarity measures based on identifying the degree of similarity between words using information derived from semantic networks. WordNet is the most popular semantic network in the area of measuring knowledge-based similarity between words. WordNet is a large lexical database of English: nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets), each expressing a distinct concept, and synsets are interlinked by means of conceptual-semantic and lexical relations. Knowledge-based similarity measures can be divided roughly into two groups: measures of semantic similarity and measures of semantic relatedness. Semantically similar concepts are deemed to be related on the basis of their likeness. Semantic relatedness, on the other hand, is a more general notion of relatedness, not specifically tied to the shape or form of the concept. There are six measures of semantic similarity; three of them are based on information content: Resnik (res) [7], Lin (lin) [8] and Jiang & Conrath (jcn) [9]. The other three are based on path length: Leacock & Chodorow (lch) [10], Wu & Palmer (wup) [11] and Path Length (path).
We need a numeric equivalent of how closely related two questions are. We create vectors for the questions based on the union set of keywords from both questions and calculate the cosine similarity score. But this alone would consider words like 'love' and 'like' as distinct. Hence, to have a better understanding of relatedness, we analyse all the extracted keywords and compare them token by token through a synset, which gives a score based on token similarity. We factor this in along with the cosine similarity score to establish a more realistic comparison of two questions.
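The path-length family of measures can be illustrated without the full WordNet database. The toy hypernym taxonomy below is hypothetical; it only demonstrates the mechanics of the path measure (an inverse of the shortest path length through the lowest common ancestor), which in the real system would operate on WordNet synsets.

```python
# Toy hypernym taxonomy (child -> parent); in practice these edges
# come from WordNet synset relations.
TAXONOMY = {
    "love": "emotion", "hate": "emotion", "romance": "love",
    "emotion": "feeling", "feeling": "entity",
}

def ancestors(word):
    """Path from a word up to the taxonomy root, inclusive."""
    path = [word]
    while word in TAXONOMY:
        word = TAXONOMY[word]
        path.append(word)
    return path

def path_similarity(w1, w2):
    """1 / (1 + shortest path length through the lowest common ancestor),
    mirroring the WordNet 'path' measure."""
    p1, p2 = ancestors(w1), ancestors(w2)
    lca = next(a for a in p1 if a in p2)   # lowest common ancestor
    dist = p1.index(lca) + p2.index(lca)
    return 1.0 / (1.0 + dist)

print(path_similarity("love", "hate"))     # two hops via 'emotion'
print(path_similarity("love", "romance"))  # direct hypernym link
```

The information-content measures (res, lin, jcn) replace the raw path length with corpus statistics over the same taxonomy, which is why the six measures can disagree on the same word pair.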

Training of Hybrid Similarity Measure
We have seen the different semantic relatedness algorithms above. Each of them has its own merits and demerits. As we are considering a wide range of questions covering many types of English language constructs, there is a need for a hybrid method. For example, consider word1: love; word2: romance; word3: hate. Algorithms like wup, which are based on path lengths, give a higher score for word1 and word3 as they are more closely related in the taxonomy, but our requirement was to have a higher score for word1 and word2.
Using only one algorithm will not give satisfactory results; hence we need to use a combination of different algorithms. Firstly, we need to choose the combination of algorithms and how much weight each algorithm gets [15].
Here is the approach to get the weighted combination of algorithms:
• Create a weight vector W of size six, initialized to zero.
• Out of the approximately 1.15 million questions D in our corpus, randomly choose N questions.
• For each question N_i of these N questions:
  - Create a set T of the tags t associated with that question.
  - Create a dataset Q of all the questions q that have at least one tag from T, where T_q is the set of tags associated with q.
  - From this set Q, select M questions at random, and for each question M_j of them:
    * Run all six semantic algorithms to find the score between N_i and M_j.
    * Store these results in individual CSV files along with their respective index j.
• Thus we will have N CSV files, and for each of them:
  - Sort each of the individual algorithmic scores in ascending order and store their respective indices in vectors (V_1 to V_6), one vector per algorithm.
  - Create a set δ of the Manhattan distances between these vectors.
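The training steps above can be sketched as follows. The paper leaves the final weight assignment implicit, so the inverse-disagreement rule used here — weighting each algorithm by how little its ranking disagrees (in Manhattan distance between rank vectors) with the other algorithms' rankings — is our assumption, not the paper's exact rule.

```python
def rank_vector(scores):
    """Indices of questions sorted by ascending score (the V_k vectors)."""
    return sorted(range(len(scores)), key=lambda i: scores[i])

def manhattan(v1, v2):
    return sum(abs(a - b) for a, b in zip(v1, v2))

def hybrid_weights(all_scores):
    """all_scores[k] holds algorithm k's scores for the same M questions.
    An algorithm whose ranking disagrees least with the others (smallest
    summed Manhattan distance between rank vectors) gets the largest
    weight; weights are normalized to sum to one."""
    ranks = [rank_vector(s) for s in all_scores]
    disagreement = [
        sum(manhattan(ranks[k], ranks[j]) for j in range(len(ranks)) if j != k)
        for k in range(len(ranks))
    ]
    inv = [1.0 / (1.0 + d) for d in disagreement]
    total = sum(inv)
    return [w / total for w in inv]

# Three toy algorithms scoring the same four questions: the first two
# broadly agree, the third is an outlier and should be down-weighted.
scores = [
    [0.1, 0.9, 0.4, 0.7],
    [0.2, 0.8, 0.5, 0.6],
    [0.9, 0.1, 0.8, 0.2],
]
w = hybrid_weights(scores)
print([round(x, 3) for x in w])
```

With six real algorithms, the same computation runs once per CSV file and the resulting weight vectors are aggregated into W.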

Extraction of tags from Question title
This involves searching the question title for multi-word (e.g. ruby on rails) and single-word (e.g. ruby) tags present in the Knowledge Base. These tags are suggested to the user, who selects one or more relevant tags. The user must enter at least one relevant tag explicitly if no tags could be extracted from the question title, or if none of the tags suggested from the question title were relevant. Let the final list of tags be T_e. Given a question Q in the knowledge base containing tags T_q, Common Tags (CT_q) = T_e ∩ T_q and Uncommon Tags (UT_q) = T_q − CT_q. Let us say that tags T_1, T_2, T_3 could be extracted from the title of user question Q_u, and further assume that all these tags are relevant, i.e. the user chooses all of them and asks for further recommendations.
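A sketch of the title-scanning step, matching longer n-grams before shorter ones so that "ruby on rails" is found before "ruby". The vocabulary here is a toy stand-in for the knowledge-base tag space.

```python
def extract_tags(title_tokens, tag_vocabulary, max_ngram=3):
    """Scan the title for multi-word tags first (e.g. 'ruby on rails'),
    then shorter and single-word ones (e.g. 'ruby'), against the
    knowledge-base tag vocabulary."""
    found = []
    for n in range(max_ngram, 0, -1):          # longest n-grams first
        for i in range(len(title_tokens) - n + 1):
            candidate = " ".join(title_tokens[i : i + n])
            if candidate in tag_vocabulary and candidate not in found:
                found.append(candidate)
    return found

vocab = {"ruby", "ruby on rails", "postgresql", "migration"}
title = "how to write a migration in ruby on rails".split()
print(extract_tags(title, vocab))
```

The extracted list becomes the candidate set shown to the user, from which the confirmed selections form T_e.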

Finding High Confidence Tags
• Let T_q = (T_q1, T_q2, ..., T_q10) be the set of tags obtained from the question title together with the tags entered by the user (if any).
• Let T = (T_1, T_2, ..., T_n) be the list of all tags in the knowledge base. The basic motivation behind finding high confidence tags is to find a subset of T wherein each tag T_i in this subset occurs many times with one or more subsets of T_q in the knowledge base, i.e. there is a high confidence that T_i occurs with T_q in the user question.
Keeping the sample dataset in reference, we compute the Top-M high confidence tags for each non-empty subset of {T_1, T_2, T_3} (refer Table 3). The confidence value of each tag thus obtained is computed as the maximum confidence value obtained for that tag over all subsets (refer Table 2). The total confidence value (CV_q) and the number of common tags (refer 2.3) are then calculated for every question in the knowledge base containing at least one tag in common with the list of tags T_e (refer 2.3). These questions are sorted on the key (total confidence value, number of common tags), resulting in the sorted list of relevant questions.
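A sketch of the high confidence tag computation. It assumes association-rule-style confidence, conf(S → t) = |questions containing S and t| / |questions containing S|, which matches the subset-and-maximum description above, though the paper does not spell out the exact confidence formula.

```python
from itertools import combinations

def high_confidence_tags(seed_tags, corpus_tag_sets, top_m=3):
    """For every non-empty subset S of the seed tags, rank the remaining
    tags by conf(S -> t); each candidate tag keeps the maximum confidence
    seen over all subsets, and the Top-M candidates are returned."""
    best = {}
    for r in range(1, len(seed_tags) + 1):
        for subset in combinations(seed_tags, r):
            s = set(subset)
            support = [tags for tags in corpus_tag_sets if s <= tags]
            if not support:
                continue
            candidates = set().union(*support) - set(seed_tags)
            for t in candidates:
                conf = sum(1 for tags in support if t in tags) / len(support)
                best[t] = max(best.get(t, 0.0), conf)
    return sorted(best.items(), key=lambda kv: -kv[1])[:top_m]

# Toy knowledge base: each entry is one question's tag set.
corpus = [
    {"python", "pandas", "dataframe"},
    {"python", "pandas", "csv"},
    {"python", "numpy"},
    {"java", "spring"},
]
print(high_confidence_tags(["python", "pandas"], corpus))
```

Here "dataframe" and "csv" each reach confidence 0.5 (via the subset {pandas}), while "numpy" only co-occurs with the weaker subset {python}.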
Keeping the sample dataset in reference, the above Venn diagram summarizes the set of relevant questions and the number of common tags with Q_u. The total confidence value of each relevant question is then computed as shown in Table 4 (substituting confidence values from Table 2), and the questions are sorted on (Total Confidence Value, Number of Common Tags).
Keeping the sample dataset in reference and considering N = 6, the Top-7 questions corresponding to the 6 uncommon tags are selected (refer 2.3.1). The semantic score is computed for each question in this list as shown in Table 4.

Reduction of the extracted set of relevant questions
The list of relevant questions from 2.4.1 needs to be further reduced in order to apply the hybrid similarity measure (refer 2.2.3) in a real-time environment (an alternative is to make use of distributed processing). This is achieved by selecting the Top-M questions Q_m that contribute N uncommon tags (refer 2.3).
The hybrid semantic score is found for each question in the list Q_m against the user question, and Q_m is then sorted by this semantic score to obtain the sorted reduced list of relevant questions SQ_m.
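The reduction and re-sorting steps can be sketched as follows; the Jaccard word-overlap score below is only a stand-in for the trained hybrid semantic measure, and the question list is a toy example.

```python
def reduce_and_sort(sorted_questions, uncommon_limit, semantic_score, user_q):
    """Walk the (already confidence-sorted) questions, keeping just enough
    of them to cover `uncommon_limit` distinct uncommon tags, then re-sort
    the reduced list by semantic score against the user question."""
    reduced, seen_tags = [], set()
    for q in sorted_questions:
        if len(seen_tags) >= uncommon_limit:
            break
        reduced.append(q)
        seen_tags |= set(q["uncommon_tags"])
    return sorted(reduced, key=lambda q: semantic_score(user_q, q["title"]),
                  reverse=True)

def jaccard(q1, q2):
    """Toy stand-in for the hybrid semantic measure: word overlap."""
    a, b = set(q1.split()), set(q2.split())
    return len(a & b) / len(a | b)

questions = [  # already sorted by (total confidence value, common tags)
    {"title": "sort a python list", "uncommon_tags": ["sorting"]},
    {"title": "read csv in pandas", "uncommon_tags": ["csv", "pandas"]},
    {"title": "merge two sorted lists in python", "uncommon_tags": ["merge"]},
]
result = reduce_and_sort(questions, 4, jaccard, "how to merge lists in python")
print([q["title"] for q in result])
```

Because the expensive semantic scoring runs only on the reduced list, the uncommon-tag limit directly bounds the real-time cost.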

Finding the sorted list of N relevant tags
The uncommon tags in SQ_m are finally ranked as follows:
1. Tag T_1 belongs to relevant question RQ_1 and tag T_2 belongs to relevant question RQ_2. If RQ_1 is higher in SQ_m, T_1 gets a higher rank than T_2, and vice versa.
2. Tag T_1 and tag T_2 both belong to the same relevant question RQ. In this case, we use the confidence values of the tags T_1 and T_2 to decide the ranking between them, i.e. if HCV_t1 > HCV_t2 then T_1 gets a higher rank than T_2, and vice versa.
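A sketch of these two ranking rules: question position decides first, then per-question confidence breaks ties, and a tag appearing under several questions keeps its best rank. The confidence table below is illustrative.

```python
def rank_uncommon_tags(sorted_questions, confidence):
    """Questions arrive sorted by hybrid semantic score (SQ_m).  Tags
    inherit their question's position; tags within the same question are
    ordered by their confidence values; duplicates keep their best rank."""
    ranked = []
    for q in sorted_questions:
        for tag in sorted(q["uncommon_tags"],
                          key=lambda t: -confidence.get(t, 0.0)):
            if tag not in ranked:
                ranked.append(tag)
    return ranked

# Illustrative confidence values (HCV) and a two-question SQ_m.
conf = {"csv": 0.5, "pandas": 0.9, "merge": 0.7}
qs = [
    {"uncommon_tags": ["csv", "pandas"]},
    {"uncommon_tags": ["merge", "pandas"]},
]
print(rank_uncommon_tags(qs, conf))
```

Truncating the returned list to N gives the Top-N tags recommended to the user.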

Evaluation and analysis
The proposed implementation was thoroughly tested with over 50,000 questions already asked on the StackOverflow website. The tags suggested by our system were compared with the tags already present with each question on StackOverflow.

Accuracy Measure
We define the accuracy A in terms of the following quantities:
• UST (User Selected Tags): the tags selected by the user from the suggested tags.
• TRFT (Tags Found in Ranked Tags): the tags for the question that are found in our recommendation, where m_i is the rank our system assigns to tag t_i and N_T is the number of tags returned by our system.
• TNF (Tags Not Found): the tags present in the question but found neither in the question title nor in our recommendation, where M_i is the number of questions having tag t_i and N_Q is the total number of questions in our corpus. A tag absent from the corpus has a weight of 0, as its absence should not contribute to the accuracy.
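The exact accuracy formula did not survive in the source, so the function below is only one plausible instantiation consistent with the definitions above: user-selected tags count fully, ranked tags count with weight 1/rank, and a missed tag is penalized by its corpus prevalence M_i / N_Q. Both the functional form and the penalty term are our assumptions.

```python
def accuracy(ust, trft_ranks, tnf_prevalence):
    """Hypothetical accuracy measure (the paper's exact formula was lost):
    - ust: user-selected tags (each counts fully),
    - trft_ranks: ranks m_i of tags found in the recommendation,
      each contributing 1 / m_i,
    - tnf_prevalence: corpus prevalence M_i / N_Q of each missed tag,
      subtracted as a penalty."""
    gained = len(ust) + sum(1.0 / m for m in trft_ranks)
    total = len(ust) + len(trft_ranks) + len(tnf_prevalence)
    penalty = sum(tnf_prevalence)
    return max(0.0, (gained - penalty) / total)

# Missing a rare tag costs little; missing a prevalent one costs more.
a_rare = accuracy(["python", "java"], [1, 2], [0.01])
a_common = accuracy(["python", "java"], [1, 2], [0.40])
print(round(a_rare, 3), round(a_common, 3))
```

This reproduces the qualitative behaviour reported below: the drop in accuracy for a missed tag grows with that tag's prevalence in the corpus.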

Performance
Assume that the average question on StackOverflow has 5 tags, of which two were derived from the title, two were recommended by our system, and one tag was left out. We can see from the graph that even if our system fails to give one of the recommended tags, the drop in accuracy depends on how prevalent that tag was in the corpus.

Conclusion
This article describes the problems associated with manually tagging questions in a cQA and suggests a novel scheme to recommend Top-N relevant tags to the user with the help of a predefined knowledge base of questions. The procedure involves preprocessing the user question to convert it into a standard format (refer 2.1), extracting tags from the question title (refer 2.3.1) and utilizing these to find high confidence tags (refer 2.3.2). This is followed by obtaining the relevant question space and sorting it on two criteria (refer 2.4.1), reducing the relevant question space further (refer 2.4.2) and applying the trained hybrid semantic score algorithm (refer 2.2) between the user question and each question in the reduced relevant question space, after which the resultant questions are sorted by the computed semantic score. The uncommon tags (refer 2.3.1) obtained from this final list of questions are further sorted on two criteria (refer 2.4.3) to give the Top-N relevant tags, which are recommended to the user. Additionally, the recommender suggests relevant questions to the user in the form of the final sorted reduced list of relevant questions.
Our proposed scheme can benefit cQA services tremendously by reducing biased and incomplete tagging of questions. Furthermore, it helps users by increasing the reach of a question to relevant users, thereby ensuring a quick response. It also means that the user would have to think of at most one tag (in the worst case); the rest can be selected from the recommendations. Distributed computing can be leveraged in huge knowledge bases to ensure real-time usage of the recommender. Furthermore, the recommender currently does not infer any knowledge from the question content; in-depth research on such an enhancement remains to be done.