Survey on Ontology learning from Web and open issues

With the continual increase in the volume of information available on the Web, information access and knowledge management have become challenging. Adding a semantic dimension to the Web through the deployment of ontologies helps to solve many of these problems. In the context of the Semantic Web, ontologies improve the exploitation of Web resources by adding a consensual field of knowledge. Several approaches have explored the use of domain ontologies in information retrieval (IR) to better answer users' queries. However, an ontology in an IR system requires regular updating, especially the addition of new concepts and relationships, whereas IR systems are generally based on a small number of domain ontologies that cannot be extended. This paper surveys the main approaches to ontology learning from the Web. In previous work, we proposed an incremental approach to ontology learning using an ontological representation called "Metaontology". In this paper, we describe how the processes of semantic search and ontology learning from texts can collaborate to learn a multilayer ontology warehouse.


INTRODUCTION
Adding a semantic dimension to the Web [1] through the deployment of ontologies helps to solve many problems: knowledge sharing, semantic access to Web resources, and information retrieval. Ontologies improve the exploitation of Web resources by adding consensual knowledge. Several approaches have explored the use of domain ontologies in information retrieval (IR) to better answer users' queries. However, ontologies in IR systems require regular updating, and IR systems are generally based on a small number of domain ontologies that cannot be extended. During the last decade, several ontology learning approaches have appeared that partially automate knowledge acquisition from structured, semi-structured, or unstructured data sources (databases, knowledge bases, texts, etc.). In this paper, starting from the fact that a single data source cannot cover all the concepts of a target domain and that the Web is a rich textual source, we consider the Web as the learning corpus from which domain ontologies are extracted. These ontologies will be used in semantic search systems. The main objective of this work is to make semantic search engines more flexible and autonomous, so that they can construct their domain ontologies incrementally from relevant documents. We therefore combine ontology learning from text with semantic search technology to propose a domain-independent approach that automates ontology learning from Web documents.
This paper is organized as follows. Section 2 presents related work on approaches to ontology learning from the Web. Section 3 describes the position of ontologies in the best-known types of semantic search systems, followed by limits and open issues. In Section 4, we specify our main objectives and the previous work that led us to propose an incremental ontology learning approach for semantic Web search systems. Our approach is presented together with an illustrative scenario. Finally, we conclude and give some perspectives for this research.

ONTOLOGY BUILDING FROM WEB
Ontology learning (OL) is defined as an approach to ontology building from knowledge sources using a set of machine learning techniques and knowledge acquisition methods.
OL from texts is a specific case of OL from the Web and has been widely used in the knowledge engineering community, since texts are semantically richer than other types of data source. These approaches are generally based on textual corpora. The corpus should be representative of the domain for which the ontology is being built. By applying a set of text mining techniques, a granular ontology is enriched with concepts and relationships discovered in the textual data. In such approaches, human intervention is required to validate the relevance of the learned concepts and relationships.
In the last decade, with the enormous growth of information on the Web, the Web has become an important source for knowledge acquisition because of its huge size and heterogeneity. This has given rise to several categories of OL approaches: ontology learning from the textual content of the Web, from online Web ontologies, from Web dictionaries, and from heterogeneous Web sources.

Ontology learning approaches from Web documents
OL from Web documents requires the same techniques previously used for ontology extraction from texts. Several approaches remove the tags from documents to obtain plain text, to which traditional text mining techniques can be applied. We propose to classify these approaches into domain-dependent OL and incremental OL.

Domain-Dependent approach for Ontology learning from textual documents
OL approaches from Web content generally consist in enriching a small ontology, called "minimal" or "granular", with new concepts and new relationships using text mining techniques.
Learning ontologies from texts has been widely used in the knowledge engineering community; see in particular [2,3,6,7,9,14,16,17,18,19,20,22,26,29,32,38]. However, no sufficiently detailed methodology has been presented to guide the ontology learning process. Indeed, the literature is limited to more or less general guidelines. For each approach, it is therefore important to know the aims and scope of the learning process, its main stages, the knowledge sources used, the main techniques applied, the reusability of existing ontologies, and the feasibility of the approach. These approaches to ontology learning from text are generally based on a corpus of texts, which should be representative of the domain of the ontology. Using a set of techniques, the knowledge contained in the texts is projected into the ontology by extracting concepts and relations.
Besides ontology learning from texts, ontology learning from the Web forms a second category of domain-dependent OL.
The best-known approaches exploit textual Web content to enrich concepts using WordNet [2]. Several approaches described in [2] and [14] enrich ontologies from Web documents.
Another approach is proposed in [17] to reduce terminological and conceptual ambiguity among the members of a virtual community. This approach discovers concepts and relations from Web sites and led to the development of the OntoLearn system [25].
These approaches require a priori domain knowledge. For this reason, they depend on the domain of the ontology, and collecting the Web documents related to this domain requires user intervention.

Incremental approach for Ontology learning from Web documents
Other approaches are dedicated to ontology building from the Web based on the generation of taxonomies without a priori knowledge, natural language processing techniques, large corpora, or thesauri. This approach was extended in [32] into an incremental approach to ontology learning from the Web. The work in [32] studies several types of available Web search engines and how they can assist the learning process (searching Web resources and computing IR measures). The learning process proposed by this approach is based on four steps:
- Taxonomic learning: the user specifies a keyword that is used as a seed for learning from the Web through a Web search engine. The output of this step is a one-level taxonomy and a set of verbs appearing in the same context as the extracted concepts.
- Non-taxonomic learning: the verb list and the keywords are used as a bootstrap to construct domain-related patterns and to build queries for the search engine.
- Recursive learning: the two previous learning stages are executed recursively for each discovered concept.
- Post-processing: the obtained ontology is refined and evaluated.
This approach is domain independent and incremental. Our previous work [8] was carried out in the same context: we proposed an incremental approach to ontology learning from the Web that combines several text mining techniques and uses an ontology-based IR system to classify Web documents.
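The recursive structure of this process can be illustrated with a minimal Python sketch. It is not the implementation of [32]: the search engine is replaced by a hypothetical in-memory stub (`FAKE_WEB`, `fake_search`), and a candidate subconcept is assumed, for illustration only, to be the word immediately preceding the keyword in a snippet. The non-taxonomic (verb-based) and post-processing steps are omitted.

```python
import re
from typing import Callable, Dict, List

# Hypothetical stand-in for a Web search engine: query -> text snippets.
FAKE_WEB: Dict[str, List[str]] = {
    "sensor": [
        "a temperature sensor measures heat",
        "the pressure sensor detects leaks",
    ],
    "temperature sensor": [
        "an infrared temperature sensor measures heat remotely",
    ],
}

STOPWORDS = {"a", "an", "the"}


def fake_search(query: str) -> List[str]:
    """Return the snippets stored for a query (empty list if none)."""
    return FAKE_WEB.get(query, [])


def taxonomic_step(keyword: str, search: Callable[[str], List[str]]) -> List[str]:
    """Build a one-level taxonomy: '<modifier> <keyword>' phrases found in snippets."""
    pattern = re.compile(r"(\w+) " + re.escape(keyword) + r"\b")
    found: List[str] = []
    for snippet in search(keyword):
        for match in pattern.finditer(snippet):
            modifier = match.group(1)
            candidate = f"{modifier} {keyword}"
            if modifier not in STOPWORDS and candidate not in found:
                found.append(candidate)
    return found


def learn(keyword: str, search: Callable[[str], List[str]], depth: int) -> dict:
    """Recursive learning: repeat the taxonomic step for each discovered concept."""
    if depth == 0:
        return {}
    return {c: learn(c, search, depth - 1) for c in taxonomic_step(keyword, search)}


taxonomy = learn("sensor", fake_search, depth=2)
```

The `depth` parameter bounds the recursion; a real system would instead stop on a relevance or frequency criterion, which is left open here.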

Web Structure mining-based approach for ontology learning from Web
The work in [34] rests on the assumption that the noun phrases appearing in the headings of a document, together with the document's hierarchical structure, can be used to discover concepts and taxonomic relations.
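As a rough illustration of this assumption, the following Python sketch derives parent-child links from heading depth alone. It is not the system of [34]: the function name `taxonomy_from_headings`, the `(level, text)` input format, and the sample headings are hypothetical, and each heading is assumed to have already been reduced to a noun phrase.

```python
from typing import Dict, List, Tuple


def taxonomy_from_headings(
    root: str, headings: List[Tuple[int, str]]
) -> Dict[str, List[str]]:
    """Map each concept to its children, using heading depth as the hierarchy.

    `headings` lists (level, noun phrase) pairs in document order, level >= 1.
    """
    children: Dict[str, List[str]] = {root: []}
    stack: List[Tuple[int, str]] = [(0, root)]  # (heading level, concept)
    for level, text in headings:
        # Pop until the top of the stack is a shallower heading: the parent.
        while stack[-1][0] >= level:
            stack.pop()
        parent = stack[-1][1]
        children.setdefault(parent, []).append(text)
        children.setdefault(text, [])
        stack.append((level, text))
    return children


# Invented sample headings, for illustration only.
headings = [
    (1, "field crops"),
    (2, "wheat"),
    (2, "rice"),
    (1, "vegetables"),
    (2, "tomato"),
]
tree = taxonomy_from_headings("crop", headings)
```

The root concept plays the role of a level-0 heading, so every top-level heading becomes its direct child.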
A system supporting this approach has been implemented and applied to a set of Arabic agricultural extension documents. It takes a root concept as input, analyzes the heading structure of all input documents, extracts concepts from the headings, and builds a taxonomical ontology [35].
In this section, several approaches to ontology learning from the Web were detailed. Ontology extraction from texts belongs to the same line of work.
Since many Semantic Web documents have appeared on the Web and new semantic search engines have been developed to search them, several approaches address ontology construction by aggregating online ontologies. This is the subject of the next section.

Ontology learning approaches from Web ontologies
The idea of building ontologies from online Web resources is not new [13]. Harnessing the RDF files available on the Web might be the first step towards achieving true reuse.
In [13], an approach for learning an ontology from the RDF annotations of Web resources was proposed. To perform the learning process, a particular approach to concept formation is adopted: the ontology is considered as a concept hierarchy in which each concept is defined in extension by a cluster of resources and in intension by the most specific common description of these resources. A resource description is an RDF subgraph containing all the resources reachable from the considered resource through properties.
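The extension/intension pairing can be sketched in Python under a strong simplification: each resource description is flattened to a set of (property, value) pairs instead of an RDF subgraph, so the most specific common description of a cluster reduces to a set intersection. The function name `intension` and the sample data are hypothetical and do not come from [13].

```python
from typing import Dict, FrozenSet, Set, Tuple

# Simplified description: a flat set of (property, value) pairs.
Description = FrozenSet[Tuple[str, str]]


def intension(
    cluster: Set[str], descriptions: Dict[str, Description]
) -> Set[Tuple[str, str]]:
    """Return the pairs shared by every resource of the cluster (its extension)."""
    if not cluster:
        return set()
    resources = sorted(cluster)
    common = set(descriptions[resources[0]])
    for resource in resources[1:]:
        common &= descriptions[resource]
    return common


# Invented sample descriptions, for illustration only.
descriptions: Dict[str, Description] = {
    "r1": frozenset({("type", "Paper"), ("topic", "ontology"), ("year", "2008")}),
    "r2": frozenset({("type", "Paper"), ("topic", "ontology"), ("year", "2009")}),
    "r3": frozenset({("type", "Paper"), ("topic", "retrieval"), ("year", "2009")}),
}

narrow = intension({"r1", "r2"}, descriptions)
broad = intension({"r1", "r2", "r3"}, descriptions)
```

A larger cluster yields a smaller (more general) intension, which is what orders the concepts into a hierarchy.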
Stojanovic [36] presents an approach for the automated migration of data-intensive Web sites into the Semantic Web. Light ontologies are extracted from resources such as XML Schema or relational database schemata and built from conceptual database schemas using a mapping process; this process forms the basis for conceptual metadata annotations that are created automatically from the database instances.
[38] presents TANGO (Table Analysis for Generating Ontologies), an approach to generating ontologies based on HTML table analysis.
Support for reuse during ontology development from specific ontology libraries has been studied before (e.g. [9,12]). However, the objective was mainly to enable users to reuse or import whole ontologies or ontology modules. These works provided no support for ranking the available ontologies, for extracting and merging the ontology parts of interest, or for evaluating the resulting ontology.
In [25], a framework for integrating multiple ontologies from structured Web pages into a common ontology is proposed. A universal similarity paradigm reflecting the implicit coherences among the ontologies is presented, and ontology alignment and construction methods are described. According to [25], the output ontology follows the users' configuration, such as their preferred structure and filtering threshold. It facilitates deep annotation and interoperation in structured Web pages from heterogeneous systems [25].
Several approaches use ontology search engines or ontology meta-search engines to build ontologies by aggregating the domain ontologies they retrieve. There is an increasing number of online libraries for searching and downloading ontologies. Examples of such libraries include Ontolingua, Protégé, and DAML. A few search engines that allow keyword-based search for online ontologies have recently appeared, such as Swoogle and OntoSearch.
The approach in [4] searches online ontologies for representations of given concepts, ranks the retrieved ontologies according to some criteria, extracts the relevant parts of the top-ranked ontologies, and merges those parts to obtain the richest possible domain representation.
We do not deny that such approaches can easily produce many domain ontologies, but some problems remain. In particular, we are still concerned about the following issues:
- The reliability of existing Web ontologies is not guaranteed.
- The availability of ontologies to be reused, in terms of number and variety, is an open question.
- The quality of the output ontology depends on the quality of the input ontologies.
- The use of ontology searching, ranking, mapping, merging, and segmentation methods makes this approach more complex.

Ontology building from Web dictionary
"Wikipedia mining" is a research area that has recently emerged.
In [27], a Web thesaurus construction method based on Wikipedia mining is proposed. By analyzing 1.7 million concepts on Wikipedia, a very large-scale association thesaurus with more than 78 million associations was constructed. To avoid NLP problems, link structure mining is applied to Web-based dictionaries [27].

Other hybrid approaches
In [24], a new method for learning ontologies is proposed. It combines heterogeneous sources of information and the processing techniques associated with each of them to improve the detection of potentially useful knowledge. First, it extracts the core vocabulary of the domain using a parsing process. The underlying idea of the method is that combining all these additional sources of evidence improves the accuracy of the OL process. The extracted terms are currently analyzed at five different levels: chunk, statistical, syntactic, visual, and semantic. The experimental results obtained by processing a set of HTML documents belonging to two domains, Universities and Economics, have shown the potential benefit of the method for learning or enriching ontologies with an unsupervised learning approach.

Limits and open issues
The state of the art presented in the previous section allowed us to identify the limits of most approaches based on text mining techniques. We note the absence, and the difficulty, of evaluating ontological engineering approaches and tools. Each approach is developed by applying techniques that enrich an ontology with new concepts and new relations from texts, and these techniques are then implemented in a tool. In the surveyed work, we did not find a comparative study of the techniques used that would identify the best ones. Such a comparison would require testing these approaches on the same corpus, relating to the same domain and written in a given language. The Web could therefore serve as a common corpus for testing such techniques and would allow ontology engineers to adjust the ontology extraction rules for each domain.
So far, it has been difficult to propose a domain-independent approach for learning networked ontologies. Moreover, modularity is not respected in these approaches.
In the following, we explain our motivation for using semantic search for ontology learning.
Our study of the ontology learning process and the semantic search process led us to conclude that collaboration between the two could provide both incremental ontology building and improved search. Figure 1 illustrates the different relations of collaboration or resemblance that can exist between the two processes.
Recent approaches tend to build graph-based queries for query formulation.
Thus, when the semantic search system has no appropriate domain ontology for the user, the first submitted query can be treated as a "seed ontology" for the ontology learning process. Moreover, semantic disambiguation is one of the problems handled in ontology refinement; lexical resources such as a linguistic ontology or a thesaurus are used for this task. The same applies to associating query terms with ontology concepts according to the appropriate sense. Finding the documents relevant to a query is also the same problem as the one faced in ontology enrichment: the Web documents from which a domain ontology is extracted should be relevant to the domain of the ontology, so that no irrelevant concepts or relationships are discovered. Finally, query reformulation with the enriched ontology could improve the search by providing users with additional information to constrain their queries.

Figure 1. Combining semantic search process and ontology learning process
The enriched ontology can contain more relevant concepts, relations, instances, or axioms. Further collaboration between these two steps will therefore be beneficial.
Finally, ontology validation in the combined process would result from two types of collaboration: collaboration between searchers with the same search goals, and indirect collaboration between searchers and ontology engineers. The idea behind the collaboration between these two processes is to make each contextual semantic search engine more flexible and autonomous by discovering other domain ontologies from Web documents. For instance, an ontology-based request results from mapping ontological concepts to a query written in natural language. When the target domain ontology is absent, a possible ontology can be extracted from the text-based query; we thus treat a query as an initial minimal ontology. This ontology can be enriched from the Web documents that users have tagged as relevant. Other domain ontologies can be discovered, and existing ones enriched, using the terms of the query formulation and the relevant Web documents selected by the target users.

Motivations based on our previous work
On the one hand, any process of ontology learning from text depends on the relevance of the textual corpus as well as on the machine learning techniques applied. On the other hand, the main purpose of semantic search is to provide users with the Web documents most relevant to their query, using a specific domain ontology. Starting from this fact, we can state that semantic search can be a useful way to perform ontology learning from Web content. In this context, the approach presented in [8] uses an ontology-based search engine [5] to collect textual sentences from which new concepts and new relations are discovered.
In [8], we proposed an incremental process in three distinct phases: an initialization phase, an incremental phase of domain ontology learning, and finally a phase of analysis of the results. The initialization deals with the preparation and preprocessing of the data sources, which consist of a minimal ontology, a metaontology, the linguistic ontology WordNet, and a set of Web documents relating to the target domain. The second phase is incremental and iterative. Each iteration consists of two successive steps.
The first step populates the metaontology [10]: the techniques specified by the metaontology are applied to instantiate metaconcepts and metarelations, following the process described in [8]. The second step applies the axioms related to ontology element learning in order to discover new concepts, new relations, and new axioms related to a domain (see [8] for more details).
Our approach led to the implementation of the OntoCoSemWeb prototype, which we used to build a tourism ontology. We also developed an online information retrieval system based on this ontology to collect and classify the results selected by users. These results are used as the input of the OntoCoSemWeb prototype [11].
For this reason, our motivation is to integrate an ontology learning task into the semantic search process and to define how the two processes can collaborate to build more domain ontologies from the selected documents and, in turn, improve semantic search.

Towards semantic search approach for incremental ontology learning from Web
According to [13], the problem in contextual semantic search systems lies in building a new domain ontology that has not been defined before. Since the Web is an enormous and dynamic information source, we propose to integrate the ontology learning process into the search process. To this end, several objectives are set:
- Modularity and reuse of the learned ontologies;
- Scalability and evolution of ontology building;
- Ease of learning axioms on ontology modules by linking the search request to the search results;
- Personalization of the built ontology.
Having networked ontologies in a multi-contextual search engine is a key requirement for covering user needs. However, when many domain ontologies are used by a semantic search system, taking modularity into account makes them easier to manage. In many cases, a search query can be translated into an ontology module (a subpart of an ontology). These modules can be reused by other users to express a similar query or to enrich it with new concepts, instances, or relations. Any search system can thus become multi-contextual and more adaptable to users' queries. The searcher also participates in ontology building by selecting the most relevant documents, which become the input of the ontology learning process that enriches the initially submitted query.

The ontology Warehouse
The ontology warehouse is made up of four levels of ontologies (Figure 2). The first layer represents the topic ontology. It is an ontological classification of topics, domains, and contexts, regardless of the language used. Each topic T can be the subject of one or more domains D, depending on the position of the topic in the hierarchy.
The second layer represents a set of networked domain ontology schemas. Each domain ontology Od is a network of modules M. A module M is seen as a dimension of the domain ontology and consists of a main concept C with its common properties (relations with other concepts). The properties of a concept C1 are defined as the most frequent relations that characterize C1 and that are used in query interfaces and relevant Web documents. A module M1 can therefore belong to many ontologies and be related to other modules. For example, the module whose main concept is "conference" can belong to many domain ontologies (computer science, physics, mathematics, etc.), since conferences exist in many domains.
A concept C is the tuple (id, {(ti, language, context)} i=1..n, state, credibility degree) where:
- Id: a concept identifier associated with a sense, regardless of the terminological labels and the language referring to it.
- {(ti, language, context)} i=1..n: a set of triples (t, language, context), where t is a nominal phrase referring to the concept in a target language and used in a specific context, which can be the topic that represents the role of the concept in a specific domain.
- State: the state of the discovered concept. A concept discovered from text can be "new candidate", "validated", "rejected", or "average candidate".
- Credibility degree: a degree of the correctness of the concept with respect to its module. Determining this degree by observing the use of a concept in semantic search is part of our future work.
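The concept tuple can be written down as a small Python data structure. This is only one possible encoding, given as a sketch: the class and field names (`Concept`, `Label`, `State`, `credibility_degree`) and the sample identifier are illustrative choices, and the credibility degree is left optional because its computation is described as future work.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class State(Enum):
    """States of a concept discovered from text."""
    NEW_CANDIDATE = "new candidate"
    VALIDATED = "validated"
    REJECTED = "rejected"
    AVERAGE_CANDIDATE = "average candidate"


@dataclass(frozen=True)
class Label:
    """One (t, language, context) triple referring to a concept."""
    term: str
    language: str
    context: str


@dataclass
class Concept:
    """Tuple (id, {(t_i, language, context)}, state, credibility degree)."""
    id: str
    labels: List[Label] = field(default_factory=list)
    state: State = State.NEW_CANDIDATE
    credibility_degree: Optional[float] = None  # not yet defined in the text

    def terms(self, language: str) -> List[str]:
        """Return the terms referring to this concept in a given language."""
        return [label.term for label in self.labels if label.language == language]


# Hypothetical identifier and labels, for illustration only.
conference = Concept(
    id="c-001",
    labels=[
        Label("conference", "en", "computer science"),
        Label("conférence", "fr", "computer science"),
    ],
)
```

Keeping the identifier separate from the labels reflects the requirement that a concept be independent of the terms and languages that refer to it.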

Figure 2. Multilayer ontology warehouse
Each user is then associated with a personalized view of the domain ontology, which represents the ontology fragment most used in their search activity together with the terminology they use.

Learning process based on CBR
The combined process proceeds as follows. The user selects an existing topic from the topic ontology; if the topic is new, the user can create it and place it at the appropriate position in the ontology. The user then formulates a search goal.
According to [25], the many types of search goal can be summarized in three categories: navigational goal, informational goal, and resource search. We use the type of search goal selected by the user to better understand the purpose of the search. A set of graph patterns is assigned to each search goal. These graphs are instantiated by users with ontological elements according to their request. This step is important for constructing an initial core ontology module that will be enriched from relevant Web documents. We stress that each search goal is translated into an ontology module characterized by a target concept, which we call the "main concept", and by other concepts that characterize it and restrict the search. After several iterations of use of the semantic search system, the ontology learning process does not immediately enrich the underlying ontology with all the discovered concepts.
When there is doubt about adding these concepts, some searchers submit a graph-based query that enriches a similar earlier query with one of these concepts. This act can serve as an implicit validation of some of the discovered concepts. The indirect collaboration consists in the maintenance, by ontology engineers, of the domain ontology and of the ontology views classified by search goal. Starting from this convergence, the combined process can be organized as described in Figure 1.

Query Graph pattern vs ontology module pattern.
A set of graph patterns was designed for each type of search goal selected by the user (Table 1). For example, if the user's goal is a navigational search, the patterns presented in the figure are instantiated and the searched node is marked by "X?".

Populated ontological module (query)  annotations of URLs
type of search goal, C: the main concept which is the subject of the search, G: instantiated graph). The formalism of representation will be treated in future work.
A case is represented by a problem and its solution. The problem is equivalent to the search request. The solution is the set of URLs found by the user. When similar cases exist in the case base, they are displayed to the user.
Otherwise, if no similar case exists, a new case is added to the case base and a query is sent to a search engine. When the user selects the relevant documents, a new event is added to the case base and a text mining process is applied to the selected documents to enrich the ontology module associated with the request.
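The case-based loop can be sketched in Python as follows. The sketch makes several assumptions that are not in the text: the instantiated graph is encoded as a set of triples, two cases are compared with a Jaccard similarity over those triples, and a fixed threshold decides whether a stored case counts as similar. The names `Case`, `similarity`, `retrieve`, and `handle`, the relation names, and the example URL are hypothetical.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object) of the instantiated graph


@dataclass
class Case:
    """A problem (search request) and its solution (URLs chosen by the user)."""
    goal_type: str
    main_concept: str
    graph: FrozenSet[Triple]
    urls: List[str] = field(default_factory=list)


def similarity(a: Case, b: Case) -> float:
    """Jaccard overlap of the graphs; 0 if goal type or main concept differ."""
    if a.goal_type != b.goal_type or a.main_concept != b.main_concept:
        return 0.0
    union = a.graph | b.graph
    return len(a.graph & b.graph) / len(union) if union else 1.0


def retrieve(case_base: List[Case], query: Case, threshold: float = 0.5) -> List[Case]:
    """Return the stored cases that are similar enough to the query."""
    return [case for case in case_base if similarity(case, query) >= threshold]


def handle(case_base: List[Case], query: Case) -> List[Case]:
    """Show similar cases if any; otherwise store the query as a new case."""
    similar = retrieve(case_base, query)
    if not similar:
        case_base.append(query)  # the caller would now query a search engine
    return similar


case_base: List[Case] = []
graph = frozenset({("workshop", "name", "WISM 2009"), ("workshop", "url", "X?")})
query = Case("navigational", "workshop", graph)

first = handle(case_base, query)                    # empty case base: case is stored
query.urls.append("http://example.org/wism2009")    # hypothetical user selection
second = handle(case_base, Case("navigational", "workshop", graph))
```

The text mining step applied to the selected documents is outside the scope of this sketch.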

Illustrating example
Suppose that a user wants to know the URL of the workshop WISM 2009. The type of search goal is a navigational search. Searching the topic ontology for the term "workshop", the user finds that there is no topic related to system modeling, so a new topic is inserted into the topic ontology. Suppose also that the ontology warehouse contains a modular domain ontology related to computer science, but that the concept "workshop" does not exist in it. The formulation of this first request then becomes a new core ontology module to be enriched within the computer science ontology. The main concept of this request is "workshop" (Figure 3).

Figure 3. Initial goal search formulation
The disambiguation step is carried out by the user with the senses provided by WordNet or with an online linguistic resource such as Wikipedia. This step is important for collecting the synonyms and hyponyms needed to instantiate the metaontology [22] with this contextual information. Since the case base is empty, a query is submitted to a search engine and the user selects the Web document corresponding to the WISM workshop. This document will be the input of ontology learning phase of the process described in.

Table 2. Lexico-syntactic patterns
The enriched ontology is shown in Figure 4.

Case study and first Experimental Observation
In this section, two ontologies are compared (Figure 5). The first is the ontology resulting from our previous approach, OntoCoSemWeb [11]. This approach relies on the metaontology and extracts all textual elements from the Web documents returned by a search engine. The second is a modular ontology resulting from the approach described above, using a modified version of OntoCoSemWeb [11]. The number of errors in the discovered concepts and learned patterns was compared. Noise in the learning results decreased considerably from the first iteration. The combination of the two processes can thus yield more relevant Web documents, from which only an ontology fragment (module) is enriched. This also has an effect on processing time.

Conclusion
In this paper, we focused on the possible combination of semantic search approaches and ontology learning methods to facilitate the integration of personalized and evolving ontology building into semantic search systems. We proposed a framework together with an illustrative scenario. The originality of our proposal lies in applying ontology technology to information retrieval based on case-based reasoning, and in combining ontology learning with semantic search based on case-based reasoning.
The main contribution of this work is to facilitate Semantic Web engineering using semantic search and ontology learning from Web documents, and to link users' requests to ontology modules constructed from their selection of relevant documents.

Figure 5. Comparison of noise in learning process between our previous approach and a combined one.

Table 1. Search goals and graph patterns