A Flexible Architecture for Content and Concept Based Multimedia Information Exploration

Traditional hypermedia systems can be extended to allow content based matching to give more flexibility for user navigation, but this approach is still limited by the capabilities of multimedia matching technology. The addition of a multimedia thesaurus can overcome some of these limitations by allowing multimedia representations of concepts to act like synonyms in the query process. In addition, relationships between concepts allow navigation within the context of a semantic scope. The use of agents that independently examine the information in the system can also provide alternative methods for query evaluation. This paper presents a flexible architecture that supports such a system and describes initial work on implementation.


Introduction
Traditional hypertext systems allow textual information to be arbitrarily linked so that users can navigate between related parts of the information in the system. Many of these systems can use multimedia information such as images, sounds and video clips that can also be linked. In such hypermedia systems the links are usually created using specific locations in particular documents.
Some systems, such as Microcosm [7], also allow links to be created by specifying the text that forms one anchor of the link. These links are created once and can be followed from any location where the text occurs. To achieve this, the system has to examine the documents currently being viewed by the user and look for matches between text in the documents and text that forms anchors of links within the system. Where matches occur the system can highlight the text as a source anchor for a link. This is a form of content based navigation: navigational links are dynamically created by matching the content of currently viewed documents with the content of previously created link anchors. For textual documents and links this is relatively easy to implement since identifying likely text strings and comparing them with link anchors is often straightforward. Content based navigation for multimedia documents and links involves identifying and comparing selections of images, video and sound with each other. This is still an unsolved problem in the general case and implementations are limited in their capabilities.
The remainder of this paper begins by describing some of the reasons for these limitations and suggests that a multimedia thesaurus can improve the flexibility of a multimedia information system. It then describes some related work in the areas of content based retrieval and thesaurus supported navigation.
A novel architecture for a multimedia information system is proposed together with the design for MAVIS 2, the research system currently being developed by the Multimedia Research Group at the University of Southampton.
MAVIS stands for Multimedia Architecture for Video, Image and Sound and this is the second project by that name. The aim of the MAVIS 2 project is to explore the use of a multimedia thesaurus and intelligent agents to support navigation by content and semantic relationships between concepts in multimedia information systems. MAVIS 2 is currently being implemented and some discussion of this process is given with an example of the implementation so far.

The Multimedia Matching Problem
The problem of matching multimedia selections is addressed by content based retrieval systems [5,15,10]. The aim of these systems is to find objects that are similar to an example object indicated by the user. There are several aspects to the multimedia matching problem that make it a challenge that remains unresolved.

Properties
For each medium there are different properties that can be measured that represent different ways in which two objects can be considered similar. For example, images can be compared on the basis of colour, shape or texture and audio segments can be compared by frequency, tempo and rhythm. When a user asks for a similar object the appropriate types of similarity may need to be indicated.

Signatures
Each property can be represented by different signatures or feature vectors. There are many algorithms for calculating and comparing specific types of features and some may be more appropriate and effective than others in a particular situation.
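As an illustration of what a signature and its comparison might look like, the following minimal sketch computes a coarse colour histogram for a set of pixels and compares two histograms by Euclidean distance. The function names, the bin count and the choice of distance are assumptions made for this example; the paper does not prescribe a particular algorithm.

```python
from collections import Counter
from math import sqrt


def colour_histogram(pixels, bins_per_channel=4):
    """Normalised colour histogram of (r, g, b) pixels with 0-255 channels."""
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bin_: n / total for bin_, n in counts.items()}


def histogram_distance(a, b):
    """Euclidean distance between two sparse histograms; 0.0 means identical."""
    keys = set(a) | set(b)
    return sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


# Toy data (assumption): three uniformly coloured patches.
red_patch = [(250, 10, 10)] * 8
dark_red_patch = [(200, 30, 20)] * 8
blue_patch = [(10, 10, 250)] * 8

red_vs_dark_red = histogram_distance(colour_histogram(red_patch),
                                     colour_histogram(dark_red_patch))
red_vs_blue = histogram_distance(colour_histogram(red_patch),
                                 colour_histogram(blue_patch))
```

With only four bins per channel the two red patches fall into the same bin and are judged identical, while the blue patch is distant. A finer histogram would separate the reds, which is one way in which the choice of signature affects what counts as similar.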

Similarity
It is often the case that the properties that a computer can use to compare objects do not correspond to the properties that humans use to compare objects. Humans can apply world knowledge to recognise objects in scenes and sounds in audio, which requires a level of media understanding that computers cannot yet achieve. Content based retrieval systems have to combine results from existing measures and attempt to match objects.
Despite these problems there are many content based retrieval systems, existing as both stand alone systems and incorporated into database products.
The techniques of content based retrieval can be used in a hypermedia system to provide multimedia content based navigation [12]. This allows previously created links to be made available to the user where the objects that the user is viewing are similar to the objects from which the original links were made, subject to the constraints of the underlying matching methods. This allows many links to be made available to the user without the author having to create them explicitly.

Thesaurus Support for Navigation
Content based navigation provides a useful extension to a hypermedia system, but it still requires that there is some sort of similarity between link anchors and selections within users' documents. There are many situations where a given link might be appropriate but the matching methods cannot recognise the similarity between a selection in a document and a link anchor. Different views of the same object or different examples of the same type of object may appear dissimilar.
For example, a top view and a side view of the same table can look very different. An office chair and an easy chair may not be similar in shape or colour, but they are both chairs. Similar examples exist for other media, such as words spoken in different voices, alternative versions of the same song or the same object moving in different ways in a video clip. Some of these relationships can be represented by using a thesaurus. For a given term a traditional thesaurus gives synonyms (words with similar meanings) and semantically related terms (usually broader and narrower terms).
Challenge of Image Retrieval, Newcastle, 1999
The thesaurus idea can be extended to include multimedia information about each term. It can include different multimedia representations with each term that can be used to illustrate different views, types or aspects of a concept. The particular representations that are appropriate will depend on the application for which the thesaurus is designed.
Similarly, the semantic relationships between terms (or concepts) can also be extended to include, for example, spatial relationships (if the concepts represent objects) or temporal relationships (if the concepts represent stages of a process). The terms in the thesaurus can become a more general semantic network of concepts where relationships appropriate to an application can be represented.
When such a multimedia thesaurus (MMT) is combined with multimedia matching technology it is possible to offer more flexible facilities for content based navigation and retrieval [11]. For example, if a query matches with one representation of a concept then other representations, information associated with other representations, information about the concept and information about semantically related concepts can all be offered to the user as the results of a query.
The MMT can also offer more flexibility in the searching process itself. Concepts in the thesaurus are related semantically, so it is possible to express the scope of a search in terms of the relationships between concepts. For example, a search can be constrained to consider only narrower terms of a matching concept or it can be expanded to include a broader set of concepts and associated information.
Another possibility is an automatic synonym search, where a match with a thesaurus entry can trigger parallel searches using synonym multimedia representations from the thesaurus as queries for the searches. In this way an original query starting from one view of an object may yield results that are associated with other views or other examples of the same type of object.
The user can also match against representations associated with thesaurus concepts and then use the structure of the thesaurus to explore the semantic relationships between concepts. This allows browsing at the concept level where the user can interactively examine the information associated with each concept.
The combination of hypermedia links, content based matching and a multimedia thesaurus offers numerous possibilities for the user looking for information. One challenge presented by this work is devising a user interface that allows the user to express clearly how they wish to use the system while keeping simple operations simple to perform.

Related Work
Multimedia information retrieval is a widely researched area and there are many commercially available database systems with this functionality. This section briefly describes some of the systems that are available.

QBIC
The Query by Image Content (QBIC) system developed by IBM [5,16] allows images from a collection to be retrieved via visual cues such as colour distribution or texture. This searching can be combined with other cues such as keywords. Though it is purely a retrieval system and does not include hypermedia functionality, it does face the same matching problems described earlier. QBIC has been integrated into IBM's DB2 database in the form of multimedia extenders that allow a content based multimedia match to form part of a normal database query.

Informix Visual Information Retrieval DataBlade Module
The Informix Software DBMS can be extended with a Visual Information Retrieval system, allowing images to be retrieved via visual cues. Keywords can be used to aid the search. These systems can retrieve images by similarity measure, but do not take into account any semantic information other than keywords. Used as a supplement to hypermedia functionality, however, they become a valuable navigation tool.

The MAVIS project
The original MAVIS project [12] extended Microcosm's generic link capability [7] to allow source anchors to exist in media other than text. The user can retrieve the source anchors of links via media-specific clues (such as colour distribution or texture for images). The source anchors in the system are then searched and a ranked list of matching source anchors is presented, allowing the user to follow each link.

Semantically Linked Hypermedia
Use of semantic information in hypermedia/information retrieval queries is an ongoing research topic. Cunliffe et al. have proposed a semantic hypermedia architecture [4]. Binary relationships between index terms or media item identifiers are stored in a Binary Relational Store. The user can query this by specifying a piece of media held in the store and optionally a relationship. Relationships pertaining to that piece of media and any specified relationship will then be presented to the user.

Content-Oriented Integration
The Content-Oriented Integration work by the C & C Laboratories, NEC USA Inc. [8,9] allows the user to search for media via media-based clues, as well as allowing conceptual-style navigation. Media items are connected to conceptual representations held in the system. These conceptual representations are linked to each other, allowing navigation on a conceptual level as well as a media level. It also allows object-based navigation whereby a number of objects and their relationships can be used in a query. Conceptual navigation is much like MMT navigation in MAVIS 2.

Link Clustering
Link Clustering is a technique with similar goals to the MMT in MAVIS 2. A link cluster has several source anchors and several destinations. If any of the source anchors are reached all of the destinations of the cluster are presented to the user. In addition, arbitrary relationships are allowed between clusters that are available for navigation between clusters after a cluster has been reached. This technique is currently being researched by Crowder et al. [3].

The Okapi Project
Retrieval using text-based thesauri has been researched for many years, though widespread use of electronic thesauri for this purpose developed more recently. Thesauri are typically used either as a means of indexing documents according to the preferred form of a particular term, or used for synonym substitution in a flat text search.
The Okapi project researched at City University London [13] involves the testing of various text-based information retrieval techniques including the use of thesauri. Beaulieu [2] concluded that thesaurus use, both explicit and implicit, was beneficial to the retrieval process.

Automatic Authoring
In [1] Agosti et al. describe a method for automatically generating links between existing multimedia documents. Text portions of the documents are analysed to extract index terms and these are automatically related to concepts in a pre-existing concept network. Links between index terms and links between the documents themselves are created using statistical methods based on occurrences in documents. Links from multimedia portions of the documents are inferred if there are links from closely related text portions such as captions or supporting text.

MAVIS 2 Design
MAVIS 2 uses a four-layered data architecture to describe the relationships between the objects that form the information in the system. This is shown in figure 1 and described in the following sections. All the relationships between objects can have arbitrary information associated with them, allowing for application specific attributes and weightings.

Raw Media Layer
This consists of representations of all the raw media objects that are usable within the system. A raw media object contains a reference to a media file such as a web page or an image file and some information about the type of the object. Raw media objects are referred to by higher level selection objects.

Selection Layer
The selection layer contains selection objects that describe part of a raw media object. For example, a selection may refer to an area of an image, an extent in a piece of text or a time segment in a piece of music. Several selections can refer to different parts of the same raw media object, avoiding the need to duplicate large quantities of media data. The reference from a selection to raw media also allows selections to be presented to the user in their original context.
In order to compare two selections, signatures (feature vectors) are derived from the portion of raw media to which the selections refer and the signatures are compared with each other. Signatures are not stored explicitly at this layer but are calculated and cached with an association to the selection from which they were derived. A selection may have many signatures depending on the feature extraction algorithms that are available and any two selections can be matched by comparing any or all of the types of signatures that they have in common.
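The relationship between raw media, selections and lazily cached signatures could be sketched as follows. All class, field and function names here are hypothetical and chosen for illustration; the toy extractor simply uses the length of the extent as a one-element signature.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class RawMedia:
    """Reference to a media file plus its type (raw media layer)."""
    uri: str
    media_type: str


@dataclass
class Selection:
    """Part of a raw media object; the meaning of the extent is media specific."""
    media: RawMedia
    extent: Tuple[int, int]
    _signatures: Dict[str, tuple] = field(default_factory=dict, repr=False)

    def signature(self, name: str, extractor: Callable[["Selection"], tuple]) -> tuple:
        """Compute a signature on first use and cache it with the selection."""
        if name not in self._signatures:
            self._signatures[name] = extractor(self)
        return self._signatures[name]

    def signature_types(self) -> set:
        return set(self._signatures)


def common_signature_types(a: Selection, b: Selection) -> set:
    """Signature types on which two selections can be compared."""
    return a.signature_types() & b.signature_types()


calls = []  # records every real extraction, to show the effect of caching


def length_signature(sel: Selection) -> tuple:
    # Toy extractor (assumption): the signature is just the extent length.
    calls.append(sel.extent)
    return (sel.extent[1] - sel.extent[0],)


page = RawMedia("http://example.org/page.html", "text/html")
first = Selection(page, (0, 40))
second = Selection(page, (100, 120))   # same raw media, different part
first.signature("length", length_signature)
first.signature("length", length_signature)   # served from the cache
second.signature("length", length_signature)
```

Both selections share one `RawMedia` object, so no media data is duplicated, and the extractor runs only once per selection and signature type.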

Selection Expression Layer
This layer consists of selection expressions that describe combinations of selections and information about which properties of each selection are relevant. Simple combinations of selections may be constructed by means of a GUI, for example the colour of selection S1 and the shape of selection S2. By default, all properties of an unqualified selection are considered relevant.
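One possible in-memory form for such an expression is sketched below, using the "colour of S1 and shape of S2" example together with an unqualified selection S3. The class names and the set of available properties are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class QualifiedSelection:
    selection: str                               # hypothetical selection identifier
    properties: Optional[FrozenSet[str]] = None  # None means "all properties"

    def relevant(self, available: FrozenSet[str]) -> FrozenSet[str]:
        """Properties of this selection that should take part in matching."""
        if self.properties is None:
            return available
        return self.properties & available


@dataclass(frozen=True)
class SelectionExpression:
    terms: Tuple[QualifiedSelection, ...]


# Assumed set of properties for which feature extraction is available.
AVAILABLE = frozenset({"colour", "shape", "texture"})

expr = SelectionExpression((
    QualifiedSelection("S1", frozenset({"colour"})),
    QualifiedSelection("S2", frozenset({"shape"})),
    QualifiedSelection("S3"),                    # unqualified: everything counts
))
relevant = {t.selection: sorted(t.relevant(AVAILABLE)) for t in expr.terms}
```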
In MAVIS 2 selection expressions may be source anchors of links. The destinations of these links may be either a concept in the conceptual layer (described in the following section), or another selection expression. The latter allows traditional multimedia content-based navigation. Typically the destination of such a link will be a selection expression pertaining to a single piece of media.

Media/Concept Associations
These associations relate the conceptual layer to the selection expression layer. Although they are essentially bidirectional, the way they will typically be used is determined by their type. The association may be one or both of two types:

Lead-in
The selection expression will by default be used for matching with a view to reaching a concept. Also it will be offered to the user when they ask for alternative representations (like synonyms in a text thesaurus) of the concept.
These are analogous to the textual terms in a controlled language thesaurus, whereby equivalent terms "lead in" to the preferred term for that concept. However, in the MMT the concept itself is an abstract representation, thus the preferred term is a lead-in itself. This preferred form is flagged as the preferred representation, and typically associates a text selection expression with the concept. This term is displayed when the concept needs a concrete representation, for example in a concept browser.

Relevant document
The selection expression will by default be offered to the user when they ask "tell me about this concept." These are analogous to documents indexed by a traditional thesaurus.
A selection expression can be both a lead-in and a relevant document, in which case it is possible to match with it to find a concept and to use it to illustrate a concept. An example might be a photograph of someone associated with a concept that represents that person.
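Because an association may be a lead-in, a relevant document, or both, a flag type is one natural way to model it. The sketch below is an assumption about representation, not the MAVIS 2 storage format; the identifiers for expressions and concepts are invented for the example.

```python
from dataclasses import dataclass
from enum import Flag, auto


class AssociationType(Flag):
    LEAD_IN = auto()
    RELEVANT_DOCUMENT = auto()


@dataclass(frozen=True)
class Association:
    expression: str          # hypothetical selection expression identifier
    concept: str             # hypothetical concept identifier
    kind: AssociationType
    preferred: bool = False  # marks the preferred representation of the concept


def lead_ins(associations, concept):
    """Expressions used for matching and offered as alternative representations."""
    return [a.expression for a in associations
            if a.concept == concept and AssociationType.LEAD_IN in a.kind]


def relevant_documents(associations, concept):
    """Expressions offered when the user asks about the concept."""
    return [a.expression for a in associations
            if a.concept == concept and AssociationType.RELEVANT_DOCUMENT in a.kind]


def preferred_term(associations, concept):
    """Concrete representation to display for an abstract concept, if any."""
    for a in associations:
        if a.concept == concept and a.preferred and AssociationType.LEAD_IN in a.kind:
            return a.expression
    return None


associations = [
    Association("text:plate", "c:plate", AssociationType.LEAD_IN, preferred=True),
    Association("image:plate-photo.jpg", "c:plate",
                AssociationType.LEAD_IN | AssociationType.RELEVANT_DOCUMENT),
    Association("doc:plates.html", "c:plate", AssociationType.RELEVANT_DOCUMENT),
]
```

Here the photograph plays both roles: it can be matched against to reach the concept and it can be shown to illustrate the concept.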

Conceptual Layer
This consists of abstract representations of concepts. No media or keywords are held in this layer. Relationships between these concepts are represented by Concept/Concept relationships which may be traversed in either direction, although the context of the relationship may differ in each direction, for example specialisation in one direction and generalisation in the other.
Though this layer by default will use classic thesaurus-style relationships (broader/narrower, related), a richer set of semantic associations can be expressed by defining new sets of relationships. An example might be spatial relationships to describe the relative locations of objects with respect to each other.
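A concept network in which every relationship can be traversed in either direction, with a different name in each direction, might be sketched as follows. The `ConceptNetwork` class and the inverse table are illustrative assumptions; a new relationship set (for example spatial relations) would be added by extending the table.

```python
from collections import defaultdict

# Each relationship name maps to the name seen from the other end.
INVERSE = {
    "broader": "narrower",
    "narrower": "broader",
    "related": "related",
    # A hypothetical application-specific spatial pair:
    "above": "below",
    "below": "above",
}


class ConceptNetwork:
    def __init__(self):
        self._edges = defaultdict(set)  # (concept, relation) -> related concepts

    def relate(self, source, relation, target):
        """Record that `target` stands in `relation` to `source`, and the inverse."""
        self._edges[(source, relation)].add(target)
        self._edges[(target, INVERSE[relation])].add(source)

    def neighbours(self, concept, relation):
        return sorted(self._edges.get((concept, relation), ()))


net = ConceptNetwork()
net.relate("plate", "broader", "tableware")   # tableware is broader than plate
net.relate("bowl", "broader", "tableware")
net.relate("plate", "related", "saucer")
net.relate("lamp", "below", "table")          # the table is below the lamp
```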

MAVIS 2 Queries
Navigation in MAVIS 2 takes the form of queries. Typically a user will provide a selection expression (already existing or specifically created) and information about what type of result they want (whether they want concepts to be returned, other selection expressions, or links or any combination of these) and how the MMT and alternative matching methods (such as independent agents) should be involved.
The query is processed and the results are presented to the user. The user can interact with the results and find out why each result is present and also reorder the results based on various criteria. The user can then either follow links, view other selection expressions or view and browse concepts in the MMT.
The three main parts to a MAVIS 2 query are described in the following sections.

A Query Selection Expression
The user specifies a query selection expression that specifies a combination of selections and properties against which other selection expressions that fall into the scope of the query will be matched. An initial tool for building selection expressions allows the user to specify a set of selections and properties that they wish results to be similar to and a set of selections and properties that they wish results to be different from. In other words, the user can effectively say "I'm interested in objects like this, but not like this." After the query has been processed this query selection expression will be modifiable, allowing the user to refine their query, perhaps augmenting it with selection expressions that were results from a previous query. For example, if an object matched well with the query selection expression, but the user decides the object is not relevant, the object can be added to the set of selections in the query expression that the user is not interested in matching.
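The "like this, but not like this" idea and the refinement step can be sketched as below. The scoring rule (best similarity to a wanted example minus best similarity to an unwanted one) is an assumption chosen for simplicity, as is the toy similarity function over one-dimensional features; the paper does not specify how the two sets are combined.

```python
from dataclasses import dataclass, field


@dataclass
class QuerySelectionExpression:
    like: set = field(default_factory=set)     # examples results should resemble
    unlike: set = field(default_factory=set)   # examples results should differ from

    def reject(self, selection):
        """Relevance feedback: the user marks a selection as not relevant."""
        self.like.discard(selection)
        self.unlike.add(selection)


def score(query, candidate, similarity):
    """Assumed rule: reward closeness to `like`, penalise closeness to `unlike`."""
    best_like = max((similarity(candidate, s) for s in query.like), default=0.0)
    best_unlike = max((similarity(candidate, s) for s in query.unlike), default=0.0)
    return best_like - best_unlike


def toy_similarity(a, b):
    # Selections are stood in for by single numbers (assumption for the demo).
    return 1.0 / (1.0 + abs(a - b))


query = QuerySelectionExpression(like={10})
before = score(query, 11, toy_similarity)   # 11 resembles the wanted example
query.reject(11)                            # the user decides 11 is irrelevant
after = score(query, 11, toy_similarity)    # now penalised
```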

A Query Scope
Usually the user will have some idea of what they want from the query. It is the scope of the query that governs which selection expressions in the system are compared with the query selection expression. There are a number of possibilities.
Find objects similar to that specified by the query selection expression. This is a basic content-based retrieval (CBR) query.
Find links that have been made from objects similar to this object. In this case the query selection expression will be matched against the selection expressions that are source anchors of links. This is a traditional multimedia content-based navigation (CBN) query.
Find out which concepts, if any, are associated with this object. In this case the query selection expression will be matched against selection expressions which are lead-ins to concepts. Individual MMTs may be switched on and off, as the user might be interested only in the subjects covered by particular MMTs.

MMT/Agent Control Parameters
Even if the user does not want a concept from the MMT returned from the query, the MMT can be used to assist content-based navigation and retrieval. The use of the MMT to augment queries may be explicit or implicit. Both techniques have been shown to be useful in text-based retrieval scenarios [2]. If the query selection expression is determined to be associated with a particular concept then the MMT can intervene in two ways.
Synonym substitution can perform parallel searches using alternative representations of the concept. These may be different views of the same object or examples of the same type of object. These searches may provide results that are just as relevant to the original query but which would have been missed by the matching methods that are available.
Note that the representations used in the subsequent searches may be in a different medium to the original query. This allows an initial text based query to give images as results. This may not always be desirable so part of the scope allows this feature to be restricted to the same medium as the initial query.
Semantic scoping allows the scope of the query to be specified in terms of relationships between concepts. If thesaurus relationships are being used then a query could be constrained to only consider objects associated with narrower concepts than the one matched. Similarly, a query could be expanded to include broader concepts and their children.
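Both interventions can be illustrated with a small sketch. The dictionaries standing in for the thesaurus, the function names and the `same_medium_as` parameter are assumptions for this example; in particular, the medium restriction is modelled as a simple filter.

```python
def synonym_queries(concept, lead_ins, medium_of, same_medium_as=None):
    """Alternative representations of a concept to use as parallel queries.

    If `same_medium_as` is given, only representations in that medium are kept.
    """
    representations = lead_ins.get(concept, [])
    if same_medium_as is None:
        return list(representations)
    return [r for r in representations if medium_of[r] == same_medium_as]


def narrower_closure(concept, narrower):
    """The concept together with everything reachable via narrower relations."""
    seen, stack = set(), [concept]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(narrower.get(current, ()))
    return seen


# Toy thesaurus fragment (assumption), using the chair example from the text.
lead_ins = {"chair": ["text:chair", "image:office-chair.jpg", "image:easy-chair.jpg"]}
medium_of = {
    "text:chair": "text",
    "image:office-chair.jpg": "image",
    "image:easy-chair.jpg": "image",
}
narrower = {
    "furniture": ["chair", "table"],
    "chair": ["office chair", "easy chair"],
}

all_synonyms = synonym_queries("chair", lead_ins, medium_of)
image_only = synonym_queries("chair", lead_ins, medium_of, same_medium_as="image")
scope = narrower_closure("furniture", narrower)
```

A text query that matches the concept "chair" would thus spawn searches using both chair images unless the user restricts the search to text, and a query scoped to "furniture" and its narrower concepts would consider objects associated with any of the five concepts in `scope`.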
The user can also control how intelligent agents are utilised in the matching process. These agents provide alternative matching methods to the exhaustive method that is normally used. The agents are trained on signatures from selections and, given this information, they can discover novel associations and build schema relating selections in the system (for example, examples of similarly coloured tables) with each other using single or multiple feature extraction techniques. Using the schema built during training, they can then efficiently retrieve selections which are deemed to be of the same classification given examples of other objects.
It is also possible for an agent to make use of the explicit associations between selection expressions and concepts in the MMT to provide a direct mapping from feature vectors to concepts that can be more efficient than the exhaustive nearest neighbour comparisons that are otherwise required.
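To make the efficiency argument concrete, the sketch below uses a nearest-centroid classifier as a stand-in for such an agent. This technique is an assumption chosen for brevity and is not the agent design used in MAVIS 2. Training averages the feature vectors of the lead-ins of each concept, so classifying a query costs one comparison per concept rather than one per stored selection.

```python
from math import dist  # available from Python 3.8


def train_centroids(examples):
    """examples: iterable of (feature_vector, concept). Returns concept -> centroid."""
    sums, counts = {}, {}
    for vector, concept in examples:
        acc = sums.setdefault(concept, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
        counts[concept] = counts.get(concept, 0) + 1
    return {c: tuple(v / counts[c] for v in acc) for c, acc in sums.items()}


def classify(centroids, vector):
    """Map a feature vector directly to the concept with the nearest centroid."""
    return min(centroids, key=lambda c: dist(centroids[c], vector))


# Toy two-dimensional signatures (assumption) taken from concept lead-ins.
training = [
    ((1.0, 0.0), "plate"),
    ((0.8, 0.2), "plate"),
    ((0.0, 1.0), "bowl"),
    ((0.1, 0.9), "bowl"),
]
centroids = train_centroids(training)
guess = classify(centroids, (0.7, 0.3))
```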

The Matching Process
In order to evaluate a query and generate results MAVIS 2 must use content based matching methods to compare the query selection expression with selection expressions in the system. Each selection expression may refer to several selections and these may be qualified, indicating that only certain properties of the selection are important. Comparing two selection expressions can be achieved by comparing sets of selections on the basis of common properties and two selections are compared by comparing signatures (feature vectors) derived from them.
Two signatures from the same algorithm can be compared to give a similarity measure, but these measures cannot be compared with those derived from another algorithm since the measurement scales are arbitrary and algorithm dependent. One solution to this is to order results based on signature comparison and then merge lists of results (from several signatures and several properties) on the basis of ranking within the signature derived lists. The ranking within the lists reflects the relative strength of a match rather than an arbitrary number.
There are several approaches to merging these intermediate lists to produce a final list of results and priority can be given to matches based on quality or on the number of properties used. The results of a query can be presented to the user as a ranked list. An abbreviated representation of the matching selection expression may be shown together with the properties matched and an indication of the strength of the match.
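One of the several possible merging approaches is sketched below: each signature's distances are converted to ranks, and candidates are ordered first by the number of signatures on which they matched and then by their mean rank. This particular priority rule, the function names and the example data are assumptions for illustration.

```python
def to_ranks(distances):
    """Map each candidate to its 1-based rank, smallest distance first."""
    ordered = sorted(distances, key=distances.get)
    return {candidate: i + 1 for i, candidate in enumerate(ordered)}


def merge_by_rank(distance_lists):
    """distance_lists: signature name -> {candidate: distance}.

    Distances from different signatures are on incomparable scales, so only
    the ranks within each list are used when merging.
    """
    ranks = {name: to_ranks(d) for name, d in distance_lists.items()}
    candidates = set().union(*(r.keys() for r in ranks.values()))

    def sort_key(candidate):
        got = [r[candidate] for r in ranks.values() if candidate in r]
        # More matched signatures first, then better mean rank, then name.
        return (-len(got), sum(got) / len(got), candidate)

    return sorted(candidates, key=sort_key)


results = merge_by_rank({
    "colour":  {"a": 0.1, "b": 0.4, "c": 0.9},       # small fractional distances
    "shape":   {"b": 12.0, "a": 30.0, "c": 45.0},    # a very different scale
    "texture": {"b": 3.0, "c": 5.0, "a": 7.0},
})
partial = merge_by_rank({
    "colour": {"a": 0.1, "b": 0.4},
    "shape":  {"b": 12.0},
})
```

In the first call candidate `b` has ranks 2, 1 and 1 and therefore comes first even though its raw colour distance is larger than that of `a`. In the second call `b` outranks `a` because it matched on two properties rather than one.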
There are many ways that the user could interact with the results, including specifying exclusion thresholds to constrain the number of results shown, and sorting the list according to which properties are considered most important. From this list the user can find similar selection expressions, view the destinations of hypermedia links or explore concepts which are appropriate to the query.
In addition the matched selection expressions can be used to modify the original query selection expression to express which of the selection expressions represent what the user is interested in and those which the user considers irrelevant. This is a form of relevance feedback, a technique firmly established as a useful tool in information retrieval [6,14].

Prototype Implementation
As a research system, one of the goals for the implementation of MAVIS 2 is to develop a system that is flexible enough to allow different parts of the system to be implemented in different ways and to accommodate the varying functionality that these implementations may provide.

Messaging Architecture
The basis of the implementation is a flexible messaging architecture and storage system that allows any part of the system access to the information that it requires. All operations are initiated by sending messages, which allows for easy distribution of the whole system across a network.
The design is based around a set of communicating processes that operate independently from each other. A broker process allows messages to be addressed by matching required functionality with the capabilities of running processes. Referring to processes by their functionality allows more than one process to be available for performing a given task. This allows alternative matching methods, such as a computational match and an intelligent agent, to be used in parallel and completely transparently to the sender of the message.
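The idea of addressing messages by functionality rather than by process name can be sketched with an in-process broker. The `Broker` class, the capability strings and the handlers are all hypothetical, and the handlers are called one after another here, whereas the real processes run independently and in parallel.

```python
class Broker:
    def __init__(self):
        self._processes = {}  # process name -> (capabilities, handler)

    def register(self, name, capabilities, handler):
        """A running process announces what it is able to do."""
        self._processes[name] = (frozenset(capabilities), handler)

    def send(self, required, message):
        """Deliver the message to every process offering all required capabilities."""
        needed = set(required)
        replies = {}
        for name, (capabilities, handler) in self._processes.items():
            if needed <= capabilities:
                replies[name] = handler(message)
        return replies


broker = Broker()
broker.register("computational-matcher", {"match-signatures"},
                lambda msg: "exhaustive match for " + msg)
broker.register("agent-matcher", {"match-signatures"},
                lambda msg: "agent match for " + msg)
broker.register("storage", {"store", "retrieve"},
                lambda msg: "stored " + msg)

# The sender names only the functionality it needs, not the processes.
replies = broker.send({"match-signatures"}, "query-1")
```

Both matchers answer the same message, and the sender does not need to know how many implementations of the functionality are running.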
MAVIS messages are encoded using XML and transmitted between processes using HTTP. This gives an automatic remote addressing scheme in the form of URLs and allows integration with other systems that provide an HTTP interface. Each process is implemented in Java and contains a servlet that allows it to act as a web client and server in order to send and receive MAVIS messages.
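The paper does not give the message schema, so the following sketch invents one purely for illustration: a `message` element with a `function` attribute and named `field` children. It shows only the encoding and decoding step, in Python rather than the Java of the actual implementation, and leaves out the HTTP transport.

```python
import xml.etree.ElementTree as ET


def encode_message(function, fields):
    """Serialise a message to an XML string (hypothetical element names)."""
    root = ET.Element("message", {"function": function})
    for name, value in fields.items():
        child = ET.SubElement(root, "field", {"name": name})
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")


def decode_message(xml_text):
    """Parse an XML string back into (function, fields)."""
    root = ET.fromstring(xml_text)
    fields = {f.get("name"): f.text for f in root.findall("field")}
    return root.get("function"), fields


wire = encode_message("match-signatures",
                      {"selection": "sel-42", "property": "colour"})
function, fields = decode_message(wire)
```

The resulting string could be sent as the body of an HTTP POST to the URL of the receiving process, which is where the URL based addressing described above would come into play.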

Processes
A set of processes has been implemented within the architecture described to provide the evaluation of simple queries. The broker and storage processes provide the basic facilities that the other processes require.
A simple viewer that allows selection from web pages is used to import selections into the system and initiate queries. In the initial version queries consist of a single selection. A query evaluator receives a query and assembles a list of selection expressions in the system that fall within the scope of the query.
These are passed to a selection expression matcher that matches expressions on the basis of selections and signatures. When a signature needs to be calculated a message is sent to a signature module process. It is possible that different types of signature may require different methods in order to compare them.
The signature matcher produces lists of distances in feature space for each type of signature that are combined with source information into result messages. A results viewer receives these messages and orders the results before presenting them to the user.
The multimedia thesaurus is a separate process and when it is active (if the user sets the appropriate scope settings in the viewer) it receives the same query message as the query evaluator and transparently provides support for the query. The initial version can return concepts as results and it can initiate its own parallel searches on synonyms of the query to further augment the overall list of results. There is also a concept browser and editor that allows the concept network used by the MMT to be viewed, navigated and modified.

An example
Figure 2 shows an example session using MAVIS 2. The top screenshot shows the process of making a link in MAVIS 2. At the bottom right is the control panel that is used to control which processes are running. Individual processes can be enabled and disabled by clicking on the checkboxes.
On the left is the viewer displaying an image of a plate and "Start Link" has been selected from the menu. An iconic version of the current selection appears (at the top of this picture) to act as a visual reminder of the source of the link. The destination of this link is a small text file shown in the viewer on the right of the picture. When "Create Link" is selected from the menu a link between the plate image and the text document is made. This link is available for the user to follow whenever the plate image appears as the result of a query.
The bottom screenshot shows the results of performing a query. The user has started from a different plate image shown in the top left of the picture and selected "Perform Query" from the menu. At the bottom left is the results viewer which shows matching selections, available links and associated concepts in three columns from left to right. The third, highlighted result is the image from which the link to the text document was made, as described above. Details of the match, including distance measures and feature names, appear at the bottom of the results viewer for the highlighted result.
The user can click on an associated concept and the concept browser will open at that concept, as shown in the top right of the picture. This shows a small portion of the Art and Architecture Thesaurus concerning plates and related tableware. Broader, narrower and related concepts are shown in a network and the top right of the browser shows selections that have been associated with this concept as lead-ins. The user can use the browser to navigate through the concept layer and view the associated selections.

Conclusions and Further Work
In this paper we have presented an architecture that allows a multimedia thesaurus and agents to be combined with a content based hypermedia system, with the aim of overcoming some of the limitations of multimedia matching and improving the flexibility of content based navigation.
An initial prototype has been implemented that successfully demonstrates the flexibility of the messaging and storage systems by evaluating simple text based queries with optional and transparent support from the multimedia thesaurus using a set of independent processes distributed across a network of machines.
Separate research has been conducted on methods for flexible selection from images and image based signatures to use for comparison of image selections and this work can be integrated into the framework that has been established. Other research in the group is concentrating on selection methods and signatures for audio and video information.
In addition, separate research on suitable intelligent agents will be added in a similar way to the MMT and further work is being carried out that will allow the facilities of MAVIS 2 to be available from a standard web browsing environment.

Figure 2: An Example Session