Management and analysis of knowledge extracted from Chinese databases

China is an emerging country, not only economically but also scientifically. Keeping track of the day-to-day evolution of this emerging country implies being able to read the local news, in the Chinese language. In this article we propose an original use of classical data-mining tools to analyze raw data in order to produce knowledge for business intelligence (BI) applications. The aim of this method is not only to process Chinese data, but also to create intelligence by analyzing how the interactions between specific objects within the dataset (keywords, authors, affiliations, and so on) evolve over time. The behavior of the environment in the analyzed field thus becomes clearly legible through a summarized representation of the raw data, which thereby becomes knowledge. This work aims to provide a new theoretical framework for information retrieval and for the management of the associated knowledge in a BI application. In this paper, we show how to use data-mining tools and cluster analysis methodology to extract knowledge from a Chinese scientific database without being able to read Chinese characters.


INTRODUCTION
This work lies within the scope of the research orientations of the CNRS Research Program in Competitive Intelligence. It aims to use tools for automatic data processing developed by French public research, in this article Tétralogie (Dousset, 1988), VisuGraph (Loubier, 2007) and Xplor (Ghalamallah, 2007), and to apply them to new "territories" of scientific and technical information: Chinese information. Knowledge of China's perpetually changing dynamics proves to be indispensable to the survival and competitiveness of a company. Companies and institutions must be given the means to decipher the strategic issues taking shape in their environment, or risk seeing their development in this area slowed down. Forward-looking information retrieval and analysis will prove very useful for gaining a more relevant view of the Chinese environment and its behavior. The results will form part of a comprehensive business intelligence (BI) approach within the company and help decision makers better understand their environment.
During the second half of the twentieth century, the development of computing and communication technology facilitated the transition to the information age, allowing the emergence of a new industry: the knowledge industry. This industry is mainly driven by databases, which are the containers of human knowledge in various fields (Dousset, 1988). Since China opened to the international market in the early 1980s, it has had to adapt its industry in order to take root in international trade and become competitive in the global marketplace. The rapid development of the Internet in China, from the second half of the 1990s, permitted the emergence of a knowledge industry that had, as elsewhere, to be structured. The database industry is one of the most important sectors for scientific and technical information and can be used as an indicator of whether a country belongs to the information age. The information industry in China matured together with the emergence of the Internet and the impressive development of scientific research in China. China has thousands of databases, a source of information that remains largely untapped by BI processes in the West.
In the context of strategic scanning, VisuGraph is a tool particularly suited to macroscopic analyses. It is able to detect strong signals, weak signals and trends from a corpus of documents collected on a precise subject. The resulting elaborated information is a synthesis obtained by various methods of data analysis and disseminated through graphic visualizations. However, the strategic analyses we have already carried out with this software showed that the end users of these analyses want, in addition to the general and strategic aspect (general knowledge), more precise views on certain points, either to satisfy specific needs for more precise information on elements they have already identified (competition, new products or processes, potential partners, …) or to discover other elements. Many experts and decision makers ask for more detail when processing the elements that make up their traditional environment. These elements should contain more precise information about keywords, the different actors, and the prospective partners and markets that they covet.
The 2nd BCS-IRSG Symposium on Future Directions in Information Access
In addition, in the business intelligence (BI) context (Ghalamallah et al., 2007), the majority of strategic information comes from relational sources, and the relevance of extracted knowledge usually depends on considering the evolution of the data and their interactions. To complement our macroscopic analyses, we therefore propose a computerized decision-making system designed to automate the online processing of relational information and to offer analysis and navigation tools oriented toward business intelligence (microscopic analysis). VisuGraph and Xplor, two complementary systems, provide strategic analyses on corpora of textual information drawn from a wide variety of sources: online databases, CDs, the visible and invisible Web, news, the press, link sites, internal databases, etc. In this article, we present different experiments with these systems on Chinese data.

PROPOSITION
Business intelligence (BI) tools enable organizations to understand their internal and external environment through the systematic acquisition, collation, analysis, interpretation and exploitation of information. Two classes of intelligence tools are defined (Carvalho et al., 2001). The first class of tools is used to manipulate massive operational data and to extract essential business information from them. The second class of tools, sometimes called competitive intelligence (CI) tools, aims at systematically scanning the organizational environment in order to learn about it and to make better decisions as a consequence. CI depends heavily on the collection and analysis of qualitative information. This article focuses on the second class of tools, where information is mainly gathered from public sources such as the web, databases, CD-ROMs, etc. Fuld (Fuld, 2000) describes the CI cycle in five steps:
1. Planning and direction: identification of the questions and decisions that will drive the information gathering phase.
2. Published information collection: search of a wide range of sources, from government filings to journal articles, vendor brochures and advertisements.
3. Primary source collection: gathering of information from people rather than from published sources.
4. Analysis and production: transformation of the collected data into a meaningful assessment.
5. Report and inform: delivery of critical intelligence in a coherent and convincing manner to corporate decision makers.
Our approach covers the main phases that can be automated using information technologies, namely analysis and production, and report and inform (Carvalho et al., 2001). Three main steps handle the processing of the data and their evolution over a given period: gathering the raw data, transforming the raw data into relational information (pre-knowledge), and extracting knowledge from the pre-knowledge.

Requirements Formulation
The first step of the BI or CI cycle is the expression by the decision maker of his informational need. Preliminary work has been done to identify his needs and targets. In most cases, the requirements as first expressed are irrelevant or unclear. As this paper is mainly about the treatment of the data, we simply define the informational need by means of keywords. Our aim here is to prove the validity of the approach and to demonstrate that there is a gap between the existing information and the useful information.

Information Collection and Information Processing
Information processing is based on Knowledge Discovery in Databases (KDD), defined as "the non-trivial process of identifying valid, potentially useful and ultimately understandable patterns in data" (Fayyad et al., 1996). For this paper, the corpus consists of articles collected for their link with "wheat" from 2004 to 2008 (seven semesters). Data treatment is divided into five phases:
1st phase: collecting data
2nd phase: filtering data
3rd phase: cleaning and processing data according to constraints imposed by tools, algorithms or users
4th phase: crossing the data, providing pre-knowledge
5th phase: interaction and data visualization (with the tools VisuGraph and Xplor)
Our approach has been tested on data extracted from various Western databases: SCI, Medline, Pascal, etc. We are now testing it with a corpus extracted from a Chinese database: VIP. The process is the same as with other databases, except that we had to adapt the treatment to Chinese ideograms.
VIP is a commercial database that provides scientific and technical information. The company's headquarters is in Chongqing, at the very center of China. Created in 2000, the database has become the largest Chinese commercial database; among other things, the company signed a strategic partnership with Google aimed at adapting Google Scholar to a Chinese version. In 2005, VIP was linked to the State information networks and to news and publication services, which enabled it to take a strategic step in the information industry market.
The database contains more than 12 000 Chinese periodicals and some 17 million items ordered in eight categories: social sciences, natural sciences, engineering, agricultural sciences and agronomy, medicine and health, economy, education, science and technical, information science and documentation. It is consulted daily in China by tens of millions of readers spread over more than 5000 structures: universities, schools, research centers, hospitals, business centers, etc. Its reputation has grown because of the completeness of the information provided and because of the minimal cost of access. The database is online and launching a query is free. The full article (PDF format) can be downloaded for a fee through a subscriber's account. An article costs 1 euro on average. In order to remain in the Chinese information industry market and to position itself against its rival CNKI (a public database developed by Qinghua University), VIP has diversified its offer. In addition to being a provider of scientific and technical literature, it also maintains regularly updated thematic files (intellectual property, innovation in Korea, etc.) and has recently started to offer a statistical overview of the results of submitted queries.
A last point: the "English" tab that was previously offered has recently been removed, which makes the information contained in the database more and more difficult to access. Nevertheless, the structure of the data is strictly enforced for all articles: systematic indexing of the articles makes it possible to query by crossing fields, and the treatment of the data by infometric tools is thus possible. The following is an example of the bibliographic presentation of an article:

FIGURE 1: Bibliographic presentation of an article
Information is synthesized in co-occurrence matrices, which are used in the various modules offered by Tétralogie. The basic units of analysis are the term, the field (author, keywords, address, date, …) and the document. A field is a basic preset tag for semi-structured data, for example author, date, address or organization. A field can have a single value (journal) or several values (author, keyword, …). Data can result from the crossing of two fields, sub-fields or groups of fields in order to obtain co-occurrence matrices. In each of these matrices, the crossing between two entities gives the metric value of the link between them. Whatever the entity type (author, journal, …), it is possible to cross three fields simultaneously. To take the temporal aspect into account, the third dimension represents time.
Crossings between two entities are carried out over several homogeneous temporal segments (or periods), in order to analyze the changes induced over time: absolute changes, relative changes, accelerations, implosions, cluster evolution, etc. The last step is data visualization and data analysis, giving the following results.
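The change measures named above can be illustrated with a short sketch. The formulas below are an assumption made for illustration (first difference, ratio to the previous period, second difference); the text does not give the exact definitions used by Tétralogie, and the function name `temporal_changes` is hypothetical.

```python
def temporal_changes(counts):
    """Return absolute change, relative change and acceleration per period.

    counts: list of metric values, one per consecutive period.
    These definitions are illustrative assumptions, not Tetralogie's own.
    """
    # Absolute change: difference between two consecutive periods.
    absolute = [b - a for a, b in zip(counts, counts[1:])]
    # Relative change: undefined (None) when the previous value is zero.
    relative = [
        (b - a) / a if a != 0 else None
        for a, b in zip(counts, counts[1:])
    ]
    # Acceleration: change of the absolute change (second difference).
    acceleration = [b - a for a, b in zip(absolute, absolute[1:])]
    return absolute, relative, acceleration


# Hypothetical publication counts of one author over four periods.
counts = [2, 4, 7, 7]
absolute, relative, acceleration = temporal_changes(counts)
```

With such measures, a link whose absolute change stays positive while its acceleration turns negative would indicate a collaboration that is still growing but slowing down.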

Tags Meaning
The example in Figure 2 shows co-occurrence matrices between authors obtained by Tétralogie. Crossing an author with himself gives his publication count for each time slice (in bold in the tables). Crossing two different authors gives the number of publications they share (co-authorship).
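A minimal sketch of how such author × author matrices could be built, one per period, is given here. It is not Tétralogie's implementation; the function name `cooccurrence_by_period` and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_by_period(documents):
    """Build one symmetric author x author count table per period.

    documents: iterable of (period, [author, ...]) pairs.
    Returns {period: {(author_a, author_b): count}}.
    """
    cube = defaultdict(lambda: defaultdict(int))
    for period, authors in documents:
        unique = sorted(set(authors))
        for author in unique:
            # Diagonal cell: the author's own publication count.
            cube[period][(author, author)] += 1
        for a, b in combinations(unique, 2):
            # Off-diagonal cells: shared publications, stored symmetrically.
            cube[period][(a, b)] += 1
            cube[period][(b, a)] += 1
    return cube


# Hypothetical records: (period, authors of one article).
documents = [
    (2004, ["A", "B"]),
    (2004, ["A"]),
    (2005, ["A", "B", "C"]),
]
cube = cooccurrence_by_period(documents)
```

Here `cube[2004][("A", "A")]` is the number of articles by A in 2004, and `cube[2004][("A", "B")]` is the number of articles A and B co-authored in that period. The period acts as the third dimension of the crossing.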

Analyses
This analysis is based on the results of decision-support tools but does not detail how they work. We invite the interested reader to refer to our research work for the justification of the development and the principle of each of these features (Tufte et al., 1983), (Mothe et al., 1998).
We decided to focus our analysis on the evolution of authors in the field of wheat in China during the last seven years. In this context, we propose two tools to analyze the data, called VisuGraph and Xplor. With these two tools, the main stages of business intelligence (BI) are covered. The originality of these tools lies in offering a complete system for both micro (Xplor) and macro (VisuGraph) analysis, providing a global vision and a specific vision according to the needs of the decision maker (Ghalamallah et al., 2007).

Macroscopic analysis
Data visualization makes it possible to present a large amount of synthesized information that is rarely explicit in the raw data. Data representation is an excellent vehicle for analyzing the complexity of large volumes of data (Tufte, 1983), (Tufte, 1980), (Tufte, 1997). Marks's work on display (Marks et al., 2005) shows that a graph can legibly show more than two hundred nodes, whereas a computer screen cannot display more than about twenty consecutive lines of results from a classic search engine. It thus becomes easier to analyze the arcs between vertices, as well as the different groupings of vertices. An overall display of documents crossed with keywords can reveal information that is not visible in the raw data. Work in visual perception has shown that human beings form a unified global configuration of the elements of a scene, or gestalt perception, before paying attention to its details (Myers, 2000). The works of Tufte (Tufte, 1983) and Bertin (Bertin, 1977) have shown how to exploit these characteristics, in an intuitive or ad hoc way.
To illustrate the potential of this time-based analysis with the VisuGraph tool, we analyzed the dynamics of real event data by studying complex networks of relationships in the Chinese market as they evolve.
Based on temporal co-occurrence matrices of Chinese authors, a temporal graph is drawn with the VisuGraph tool. In this graph, nodes represent authors and links represent collaboration between two authors. Every period is visualized in this representation. Each author is represented by a histogram in which each bar corresponds to the metric value of the author for a specific period (Loubier, 2007). Thus, the first bar of the histogram corresponds to the first period, the second bar to the second period, etc. Each period is associated with a landmark vertex characterizing the year (2002, 2003, 2004, 2005, 2006, 2007, 2008). These landmarks are placed on the edge of the display window, equidistant from each other. The vertices representing authors are placed according to the periods to which they belong, so that a typology emerges. The more specific an author is to one year, the nearer the corresponding node is to the landmark symbolizing that year. The nodes located near a landmark are authors who published specifically during that period. A node located between two landmarks indicates that the author persists over two periods (Loubier, 2007).
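One simple way to obtain a placement with these properties is a weighted barycenter of the period landmarks. The sketch below is an assumption made for illustration only: the actual layout algorithm of VisuGraph (Loubier, 2007) is not described in this paper and may differ, and the function names are hypothetical.

```python
import math


def landmark_positions(periods, radius=1.0):
    """Place one landmark per period, equally spaced on a circle."""
    n = len(periods)
    return {
        p: (radius * math.cos(2 * math.pi * i / n),
            radius * math.sin(2 * math.pi * i / n))
        for i, p in enumerate(periods)
    }


def author_position(values, landmarks):
    """Weighted barycenter of the landmarks (illustrative assumption).

    values: {period: metric value of the author for that period}.
    """
    total = sum(values.values())
    if total == 0:
        return (0.0, 0.0)
    x = sum(v * landmarks[p][0] for p, v in values.items()) / total
    y = sum(v * landmarks[p][1] for p, v in values.items()) / total
    return (x, y)


periods = [2002, 2003, 2004, 2005, 2006, 2007, 2008]
landmarks = landmark_positions(periods)
# An author who published only in 2006 sits on the 2006 landmark.
specific = author_position({2006: 5}, landmarks)
# An author equally active in every period sits at the center.
persistent = author_position({p: 1 for p in periods}, landmarks)
# An author active in 2002 and 2003 sits between those two landmarks.
between = author_position({2002: 1, 2003: 1}, landmarks)
```

Under this assumption, authors present in every period fall near the center of the window, while authors tied to one period fall near its landmark, which matches the reading of the graph described in the text.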
In order to detect the most important actors, we apply a filter to the data. Given a threshold value chosen by the user, every node whose value is below the threshold is hidden. In this way, the graph is more readable. The authors located in the center of the graph belong to several periods: they are persistent authors, continually working in this area. The authors located on the periphery are characteristic of one (or at most two) periods. On this evolving graph, we applied a filter to retain only the most important authors. We note the presence of a central core, revealing authors who are present during the eight periods. However, there is also a strong presence of major authors in the field during 2006 and 2007. We also note that the authors who began to publish in the first year, 2002, have become pioneers and persist over the other periods. The most strongly marked circle contains some of the most important actors. These nodes are connected, so we can see that the most important actors work together during the different periods. We are interested in the most important authors in the area. These authors are characterized by their large histograms, which means that they have published more during recent years. Moreover, these authors are more connected, which shows their many collaborations. We print only their names so as not to overload the drawing, and we circle these authors to make them easier to see in the graph. We then obtain the following figure.
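The threshold filter described in this paragraph can be sketched as follows. This is a minimal illustration, not VisuGraph code; the function name `filter_graph` and the sample values are hypothetical.

```python
def filter_graph(node_values, edges, threshold):
    """Hide nodes whose value is under the threshold.

    node_values: {author: metric value}.
    edges: list of (author_a, author_b) collaboration links.
    Returns the visible nodes and the links between visible nodes.
    """
    visible = {n for n, v in node_values.items() if v >= threshold}
    # A link is kept only if both of its endpoints remain visible.
    kept_edges = [(a, b) for a, b in edges if a in visible and b in visible]
    return visible, kept_edges


# Hypothetical publication totals and collaboration links.
node_values = {"A": 12, "B": 3, "C": 8}
edges = [("A", "B"), ("A", "C")]
visible, kept_edges = filter_graph(node_values, edges, threshold=5)
```

Raising the threshold leaves only the most published authors and the collaborations among them, which is what makes the central core readable.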

Microscopic analysis
The co-occurrence matrices presented above are used for this analysis. We transfer these matrices into a database that feeds the Xplor web portal. Once the database is online, we define different areas of analysis for the user. To complement the macroscopic study carried out with VisuGraph, we developed, through the Xplor portal, an indicator of the evolution of the authors: the evolution of the ten most published authors in the database over the seven periods. The principle is to zoom in on the matrices created during the macroscopic analysis.
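The indicator described above can be sketched as a ranking followed by a per-period series. This is an illustration under assumptions, not the Xplor implementation: the function name `top_authors_evolution` is hypothetical, and ranking by total publications over all periods is one possible reading of "most published".

```python
def top_authors_evolution(counts_by_period, top=10):
    """Return the `top` authors by total publications and their series.

    counts_by_period: {period: {author: publication count}}, i.e. the
    diagonal of each period's author x author co-occurrence matrix.
    """
    periods = sorted(counts_by_period)
    totals = {}
    for period in periods:
        for author, count in counts_by_period[period].items():
            totals[author] = totals.get(author, 0) + count
    # Sort by descending total, then by name for a deterministic order.
    ranked = sorted(totals, key=lambda a: (-totals[a], a))[:top]
    return {
        author: [counts_by_period[p].get(author, 0) for p in periods]
        for author in ranked
    }


# Hypothetical counts for two periods.
counts_by_period = {
    2007: {"A": 3, "B": 1},
    2008: {"A": 2, "C": 4},
}
evolution = top_authors_evolution(counts_by_period, top=2)
```

Each returned series can be plotted as one curve per author or laid out as one row of a cross-table with periods as columns.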
To display our results, we offer several types of output, in graphic and in tabular form.
For our experiments on the study of the wheat field in China, the evolution of the authors in this field is represented in the following figure. The same data can also be viewed as a cross-table:

FIGURE 2: Publication co-occurrence matrices for four authors {A, B, C, D}, obtained by Tétralogie treatment over four consecutive periods.

FIGURE 3: Data visualization about Chinese authors working from 2002 to 2008.

FIGURE 4: Top 10 authors on wheat between 2002 and 2008.

FIGURE 5: Evolution of the ten most published authors about wheat between 2002 and 2008.