Semistructured Data: The TSIMMIS Experience

In this paper we discuss themanagement of semi-structured data, i.e., data that has irregular or dynamically changing structure. We describe components of the Stanford TSIMMIS Project that help extract semi-structured data from Web pages, that allow the storage and querying of semi-structured data, and that allow its browsing through the World Wide Web. A prototype implementation of the TSIMMIS system as described here is currently installed and running in the database group testbed.


Introduction
At a recent w orkshop on management of semistructured data 15 , the workshop attendants de ned semistructured data as data that does not have a regular and static structure like data found in a relational database but whose schema is dynamic and may contain missing data or types.For example, if we l o o k a t w eather forecasts on the Web, the " elds" and their structure may di er across sites.Even at a single site, some forecasts may be missing information, or may h a ve extra information depending on the geographical location of the a ected region e.g., cities in the Rocky Mountains usually include ski reports in the winter months whereas forecast for tropical resorts do not.However, semistructured data is not just limited to the World Wide Web WWW, but is also found in many other interesting sources including le systems, news wires, electronic mail systems, etc. just to name a few.In addition to occurring natively in the above-mentioned classes of sources, semistructured data is often a "by-product" of the integration process when multiple heterogeneous schemas are involved.In those cases, semistructured rather than "fully structured" data arises because the integrated objects may be based on complimentary, sometimes con icting, and often dynamic information from multiple sources, forcing the integrator to lter, merge, or omit certain elds when performing the integration.
The goal of the Tsimmis project at Stanford 4, 7, 11, 13 is to provide integrated access to a wide variety of heterogeneous data sources e.g., databases, object stores, knowl-edge bases, digital libraries including sources containing semistructured data e.g., WWW, le system.In this paper, we present the Tsimmis approach to managing semistructured data.In particular, we discuss three critical aspects semistructured data management: 1 extracting the intended content from its native source how to get it?, 2 reading the extracted data how to query it?, and 3, exploring the result in a graphical, easy-to-understand manner how t o b r o wse it?.In Tsimmis we h a ve developed components that address all of the above issues and together provide an integrated solution to the problem of managing semistructured data.Several other recent projects have similar goals e.g., Lore 10 , Garlic 3 , Information Manifold 9 , Rufus 14 , but we do not survey them here.
2 Representing Semistructured Data in TSIMMIS For the Tsimmis project we h a ve adopted a simple selfdescribing or tagged object model.Similar models have been in use for years; we call our version the Object Exchange Model, o r Oem 4 .Oemis a exible model that is particularly well suited for representing semistructured data.Data represented in Oem constitutes a graph, with a unique root object at the top and zero or more nested subobjects.The fundamental idea is that all objects, and their subobjects, have labels that describe their meaning.For example, the following object represent s a F ahrenheit temperature of 80 degrees: temp-in-Fahrenheit, int, 8 0 Here, the string temp-in-Fahrenheit" is a human-readable label, int" indicates an integer value, and 80" is the value itself.If we wish to represent a complex object, then each component of the object has its own label.For example, an object representing a set of two temperatures may look like: set-of-temps, complex, f temp-in-Fahrenheit, int, 8 0 temp-in-Celsius, int, 2 0 g Oem is very simple, while providing the expressive p o wer and exibility needed for representing semistructured data from a wide range of heterogeneous sources.Our primary reason for choosing a simple model is to be able to accommodate a wide variety of external data models and to facilitate integration.As pointed out in 2 , a simple model such a s Oem has an advantage over complex models when used for representing and integrating heterogeneous data, since the Figure 1: A section of the HTML source le operations to transform and merge data will be correspondingly simpler.Meanwhile a simple model can still be very powerful: advanced features can be emulated" when they are necessary e.g., subclass superclass relationships, inheritance, etc..For additional information on Oem, please refer to 12 .
3 Extracting Data Continuing with our weather example, let us assume that we h a ve an application that needs to process weather data, such as temperature and forecast, for a given city.As one of its information sources, we w ant t o u s e a W eb site called Intellicast, which reports daily weather data for most major cities across the world.Since this site cannot be queried directly from within another application e.g., What is the forecast for Helsinki for May 7, 1997?" we rst have t o extract the contents of the weather table from the underlying HTML page 1 which is displayed in Figure 1.

The Extraction Process
Our con gurable extraction program parses this HTML page based on the speci cation le shown in Figure 2. The specication le consists of a sequence of commands, each de ning one extraction step.Each command is of the form variables, source, pattern where source speci es the input text to be considered, pattern tells us how to nd the text of interest within the source, and variables are one or more extractor variables that will hold the extracted results.The text in variables can be used as input for subsequent commands.If a variable contains an extracted URL, we can also specify that the URL be followed, and that the linked page be used as further input.
After the last command is executed, some subset of the variables will hold the data of interest.Later we describe how the contents of these variables are packaged into an Oem object.
Looking at Figure 2, we see that the list of commands is placed within the outermost brackets ` ' and ` ', and each command is also delimited by brackets.The extraction process in this example is performed by v e commands.The initial command lines 1-4 fetches the contents of the source le whose URL is given in line 2 into the variable called root.
The `' character in line 3 means that everything in this case the contents of the entire le is to be extracted.After the le has been fetched and its contents are read into root, the extractor will lter out unwanted data such as the HTML markup commands and extra text with the remaining four commands.
The second command lines 5-8 speci es that the result of applying the pattern in line 7 to the source variable root is to be stored in a new variable called temperature.The pattern can be interpreted as follows: discard everything until the rst occurrence of the token h=T Ri `*' means discard in the second table de nition and save the data that is stored between h=T Ri and h=T ABLEi `' means save.The two hT A B L Etokens between the `*' are used as navigational help to identify the correct h=T Ri token since there is no way of specifying a numbered occurrence of a token i.e., discard everything until the third occurrence of h=T Ri.After this step, the variable temperature contains the information that is stored in lines 22 and higher in the source le in Figure 1 up to but not including the subsequent h=T ABLEi token which indicates the end of the temperature table.The underscore at the beginning of the name temperature indicates that this is a temporary variable; its contents will not be included in the resulting Oem object.
The third command lines 9-12 instructs the extractor to split the contents of the temperature variable into chunks" of text, using the string hT RALIGN = lefti lines 22, 30, 38, etc. in Figure 1 as the chunk" delimiter.Note, each c hunk" represents one row in the temperature table.The result of each split is stored in a temporary variable called citytemp.The split operator can only be applied if the input is made up of equally structured pieces with a clearly de ned delimiter separating the individual pieces.If one thinks of extractor variables as lists up until now each list had only one member then the result of the split operator can be viewed as a new list with as many members as there are rows in the temperature table.Thus from now on, when we apply a pattern to a variable, we really mean applying the pattern to each member of the variable, much like the apply operator in Lisp.
In command 4 lines 13-16, the extractor copies the contents of each cell of the temporary array i n to the array city temp starting with the second cell from the beginning.The rst integer in the instruction citytemp 1 : 0 indicates the beginning of the copying since the array index starts at 0, 1 refers to the second cell, the second integer indicates the last cell to be included counting from the end of the array.As a result, we h a ve excluded the rst row of the table which contains the individual column headings.Note, that we could have also ltered out the unwanted row in the second command by specifying an additional *h=TRi condition before the `' in line 7 of  After the ve commands have been executed, the variables hold the data of interest.This data is packaged into an Oem object, shown in Figure 3, with a structure that follows the extraction process.Notice that this sample object re ects the structure of our extractor speci cation le.That is, the root object of the Oem answer will have a label root because this was the rst extracted variable.This object will have c hildren objects with label city temp and so on.Notice that the variables temperature and citytemp do not appear in the nal result because they are declared as temporary variables.

Additional Capabilities
In addition to the basic capabilities described in our example, the extractor has components for automatic handling of HTML tables, for conditional parsing, and other services.The extractor can also follow URLs in the process, extracting data from multiple Web pages into a single Oem object.Overall, we believe the extractor provides natural facilities for extracting data, as well as for structuring it in di erent ways into Oem objects.For more details on the extractor, please refer to 8 .

Querying Semistructured Data
In this section we i n troduce the Lorel query language, primarily through examples.Lorel is an extension of OQL and a full speci cation can be found in 1 .Here we highlight those features of the language that have an impact on the novel aspects of the system|features designed specically for handling semistructured data.Many other useful features of Lorel some inherited from OQL and others not that are more standard will not be covered.

Simple LORE Examples
Our rst example query introduces the basic building block of Lorel: the simple path expression, which is a name followed by a sequence of labels.For example, Root:City:Location is a simple path expression.Its semantics consists of the set of objects that can be reached starting with the Root object, following an edge to objects labeled City, then following an edge to objects labeled Location.Range variables can be assigned to path expressions, e.g., Root:City:LocationX" speci es that X ranges over the set of locations.
Continuing with our European weather example, the following example query retrieves the locations of all cities located in England when evaluated over the sample Oem database shown in Figure 4.At a high level, the query execution engine will nd all objects which satisfy the path Root:City:Location and for each of these will check whether the where clause is satis ed.The result of Query 4.1 is shown here: answer complex f location string Southern" location complex f longitude oat -0.167 latitude oat 51.5 g g The database over which this query is evaluated presents a n umber of irregularities, as discussed earlier.A guiding principle in Lorel is that, to write a query, one should not This query will not yield a run-time error if a Location object has a string value or is complex, or if Country objects are single-valued, set-valued, or even absent for some cities.Indeed, the above query will succeed no matter what the actual structure of the database is, and will return an appropriate answer.Of course, this query was written with some obvious knowledge of how the graph is laid out within our database.In Sec.4.2 we discuss how an end user can discover the structure of the database.
Value comparisons are made after two objects have been coerced into comparable types.That is, if two objects do not have the same type then attempts will be made to coerce the values into comparable types before applying the comparison operator.Any t ypes which cannot be coerced for comparison will not return type errors, but will simply evaluate to false.This reinforces our underlying principle that Lorel does not require precise knowledge of the data and is most useful when dealing with semistructured data.
The system will in fact translate all Lorel queries into OQL-like queries for evaluation.This is done for two reasons: rst, Lorel is based on OQL and thus OQL gives us well de ned semantics for our queries, and second it allows a user familiar with OQL to directly enter an OQL query to be evaluated over the semistructured data.In some sense, Lorel can be viewed as shorthand for OQL, however Lorel also introduces generalized p ath expressions not present within OQL.Generalized path expressions o er a richer form of declarative n a vigation" in Oem databases than simple path expressions.Intuitively, the user loosely speci es a desired pattern of labels in the database: one can specify patterns for paths to match sequences of labels, patterns for labels to match sequences of characters, and patterns for atomic values.A combination of these three forms of pattern matching is illustrated in the following example: Here the expression weather is a label pattern that matches all labels ending with the string weather e.g., weather, Todays weather, o r Tomorrows weather.For path patterns, the symbol j" indicates disjunction between two labels, and the symbol ?" is applied to the parenthesized expression to the left and indicates that the label pattern is optional.The complete syntax is based on regular expressions, along with syntactic wildcards such as ", which matches any path of length 0 or more.Finally, grep"rain" speci es that the data value should contain within it the string "rain.The grep operator is similar to the Unix grep command.We also support like, based loosely on the SQL like, and soundex for phonetic matching.In English this query is asking for the names of all cities where the forecast or outlook of the weather contains the word rain".Figure 5: Sample DataGuide strated in our rst example.It is not possible to do so with general path expressions, which require a run-time mechanism.Indeed, note that if the database contains cycles, then a general path expression may match an in nite number of paths in the data.When trying to match a general path expression against the database, we match through a cycle at most once, which appears to be a reasonable simpli cation in practice.We conclude with an example that illustrates advanced features of the language.The following query illustrates subqueries and constructed results.For every city in the database that satis es the bottom where clause, we will select out the name of the city along with the current temperature, but only if the current temperature satis es the WHERE clause.The result is shown below.Notice that each city which provides a binding for the C variable and satis es the where clause appears within the answer.Of particular interest is the fact that Plymouth does not have a current temp eld within the answer.This is ltered out as a result of the subquery appearing within the SELECT clause.Specically, the Plymouth object does not have a subobject labeled Todays weather.
answer complex f city complex f name string "London" current temp integer 7 g city complex f Name string "Plymouth" g g 4.2 Query Formulation with the DataGuide Since our data does not have an explicit schema, query formulation and query optimization are particularly challenging.Without some knowledge of the structure of the underlying database, writing a meaningful Lorel query may be di cult, even when using general path expressions.One may manually browse a database to learn more about its structure, but this approach is unreasonable for very large databases.Further, without information about the structure of the database, the query processor may be forced to perform more work than necessary.F or example, consider Query 4.1 that nds the locations of all cities whose country is England.Even if no cities have a country subobject, the execution engine would still needlessly examine every city i n the database.A DataGuide is a concise and accurate summary of the structure of an Oem database, stored itself as an Oem object.Each possible path expression of a database is encoded exactly once in the DataGuide, and the DataGuide has no path expressions that do not exist in the database.As an example Figure 5 shows a DataGuide for the sample database shown in Figure 4.Note that atomic values are usually not stored within the leaf nodes of the DataGuide since it is primarily concerned with the structure of the database.In typical situations, the DataGuide is significantly smaller than the original database.A DataGuide plays a role similar to metadata in traditional database systems.The DataGuide may be queried or browsed, enabling user interfaces or client applications to examine the structure of the database.Assuming the role of the missing schema, the DataGuide can also guide the query processor.Of course, in relational or object-oriented systems the schema is explicitly created before any data is loaded; here, DataGuides are dynamically generated and maintained over all or part of an existing database.
In 5 , formal de nitions for DataGuides are provided as well as algorithms to build and incrementally maintain DataGuides that support annotations.Also given is a discussion of how DataGuides aid query formulation in practice and their use for query optimization.
5 Browsing OEM Results through MOBIE The idea behind our browsing tool centers around the need for displaying semistructured objects in a way that makes it easy for the user to grasp their structure and explore their contents when viewing the result of a Tsimmis query.Oem results are typically irregular in structure and nested, containing a top-level root object and zero or more subobjects sometimes referred to as children.Each subobject may itself be a nested object.In general, nested objects are structured like trees or graphs if we allow cycles.Anybody who has worked with nested objects before can attest to the fact it becomes increasingly di cult to understand the contents of a nested object the more its structure increases in complexity i.e., the larger the number of subobjects and the deeper the level of nesting.
For this reason, we h a ve built a system that transforms Oem results into a web" of hyperlinked documents that can be viewed using any WWW browser.An object that is selected for viewing is formatted as an HTML document.If the object is a complex object, the document m a y also include hyperlinks pointing to some or all of the object's substructure depending on the user's preferences.If the object is atomic, it will be displayed by itself.In addition, each document always contains a link to the parent object, unless the selected object is the root of the entire structure.The main contribution of our system is that it gives the user the option to decide which information is to be displayed, how much of the chosen information he or she wants to see, and when.Information is presented one screen at a time, allowing the user to browse complex objects, which m a y b e t o o large to view all at once, in a cafeteria-style " pick-andchoose fashion.This approach to browsing nested objects is analogous to how one uses the table of contents to explore the individual chapters of a book.
An important part of the functionality of our browser focuses on the layout of information on individual pages.Since this is a process that depends heavily on each user's individual preferences as well as the data that is being displayed, we h a ve paid careful attention to design a system that is exible enough so that it can be tailored to satisfy many di erent needs.Our goal was to provide users with choices as to how information is to be displayed: from the overall layout of a screen down to the format of an individual object.Initially, the system uses default settings that maximize the amount of information that can be displayed within the given real-estate of the window.The result can then be improved upon by c hanging the values of session variables, which control the document l a yout, the level of nesting per screen, the number of subobjects per level, etc.By default, session variables control the formatting for the complete object hierarchy.H o wever, by using the label names that refer to a particular object in the hierarchy, the scope of session variables can be limited: from the entire hierarchy, t o a s p eci c substructure, to one object.Although customization of the object display m a y be time consuming in certain cases, the state of the session variables can be saved on a per-user basis and re-used during subsequent sessions.
We h a ve implemented a fully functional prototype system called Mobie Multimedia OBject and Information Explorer, which currently provides the graphical interface to Tsimmis data sources.However, Mobie is not limited to browsing only data from Tsimmis but can be tailored for displaying and formatting structured information from any object-based database footnoteOne can either use a translator for converting data into Oem or modify our algorithm to work with other object-based data models.. Since a complete description of our browser is beyond the scope of this paper, we i n vite the user to obtain the details from 6 .Instead, we will brie y demonstrate some of Mobie's functionality using screen snapshots from a sample interaction with a Tsimmis wrapper connected to the Intellicast weather source via the above mentioned Web extractor.We start our description when the result is returned from the database, omitting such details as how to connect to the database server, transmission of the query and its results, etc.When displaying data, we use the following conventions.Object labels are displayed in bold, object values are italicized.
Underlining indicates the existence of a hyperlink.

Sample Screen Snapshot
Let us assume that we h a ve submitted the following Lorel query asking for all cities in Europe where tomorrow's forecast calls for showers: Query 5.1 Lorel SELECT city temp FROM intellicast:i WHERE i.city temp.tomorrowstemp.forecast= "shwrs" Let us also assume that the answer to this query consists of three cities that are displayed together under one root object, labeled answer.Figure 6 shows the answer object as it is displayed in Mobie.Each object labeled city temp is a complex object exhibiting additional substructure underneath: the objects labeled city, city url, country, todays temp, and tomorrows temp.Note that the rst subobject the city of Helsinki has one additional subobject labeled 5-day forecast that is not present in the other results.The city, city url, and country subobjects are atomic meaning they contain no further substructure.In those cases, the value of the object is displayed.If there The todays temp and city url subobjects on the other hand, are complex objects that contain additional subobjects: forecast, high, low, and date.Labels belonging to complex objects are underlined meaning that a hyperlink exists that will take the user to the document containing only those subobjects.Those subobjects are displayed in a similar fashion.Also note, the value of the city url subobjects is a standard URL that is part of the answer and has been activated by Mobie for loading.

Formatting Options
As mentioned before, the user can control the formatting of objects through various control parameters.These parameters are called session variables and can be accessed from the User Defaults menu.Formatting options fall into two categories: Global Settings", which apply to the whole object structure, and Label-Based Settings" for which the scope can be speci ed based on object labels.See 6 for details and other options.
The following parameters are available for controlling global settings: Maximum levels of sub-objects controls the number of visible levels of subobjects for each object that is displayed.Sub-object indentation controls the amount of indentation used for subobjects.The following parameters are available for controlling label-based settings: Layout controls the overall look-and-feel" of the output when it gets displayed in the browser window.The two options currently available are table and list layout.Number of displayed sub-objects controls the number of subobjects that are displayed on a screen.Label size and value size control the length of labels and values respectively.
Using these options one can format data in the way that best suits it.For example, Figure 7 shows some data formatted as a table.Labels are shown on the left side.If the subobject is an atomic object e.g., the subobjects labeled city, city url, country, and 5-day forecast the rst column starting form the left will contain the subobject's value.If the subobject is a complex object, e.g., the subobjects labeled todays temp and tomorrows temp, the rst column will be empty, and subsequent columns will contain the values of its immediate subobjects.In the latter case, the column headings are the labels of the lower-level subobjects.Note, if there are several complex subobjects with di erent substructure, the table will display the union of all possible headings.
As mentioned before, label-based settings apply to objects.In order to format an object, a formatting choice associated with its label must be de ned.Thus it is possible, for example, to display three or more levels of nesting for the root object, and then reduce the number of visible levels to just one when viewing its subobjects.As another example, one can display the part of a result that contains numerical values as a table but leave the part that is mostly textual in list format.

Conclusion
In this paper we h a ve presented an overview of the Tsimmis approach to accessing and managing semistructured data.In particular, we h a ve described how semistructured data can be obtained from Web pages, how it can be manipulated in a database system, and how it can be browsed.We believe that semistructured data exists in many applications, and exible tools like the ones we h a ve described can be very helpful for managing it.

Figure 2 .
The nal command lines 17-20 extracts the individual values from each cell in the city temp array and assigns them into the variables listed in line 17 country, c url, city, etc..

Query 4. 3
Lorel SELECT C.Name, SELECT X FROM C.Todays weather.currenttemp X WHERE X 10 FROM Root.City C WHERE C.Country = "England"