Fusion Rule Technology





Background

Merging conflicting information

Terms such as "merging", "fusion", and "information integration" can be used in a variety of ways. It is useful, therefore, to start by explaining what we mean by those terms, and to illustrate thereby the kinds of problems that applications built using fusion rules and a knowledgebase might be used to solve. After doing this we will briefly survey some existing approaches to merging or integrating information in order to see how they compare with our approach. It will be our contention that they are all in fact aiming to do something rather different from what we wish to accomplish. The purpose of this chapter, then, is to explain when it would be appropriate to build a fusion system application, and why someone might want to build one.

The kind of merging (or fusion, or integration) of information with which we are primarily concerned in a fusion rules application takes potentially conflicting information from different sources and, by reasoning about those conflicts, creates a distinct, novel output report that summarizes, or in other ways combines, that information. The upshot of merging in this sense is that a new source of information is created, something that did not previously exist. For information to conflict, however, it must be about the same thing, and it must also be about the same properties or attributes of that thing. If the information is about different things, or about different properties of the same thing, then it cannot be conflicting information; it would, rather, be complementary information. We can of course speak of integrating complementary information, as for example in cases where we bring together information about different properties of the same thing, and this kind of integration can be carried out using fusion rules. But, as we discuss below, there are other technologies that deal with integrating this kind of information. It is with conflicting information, however, that fusion rules come into their own as a technology for information integration.

A couple of examples should make this clearer. In the case of opinion polls for political parties, two reports to be merged both contain information about the same things---the major political parties---and both contain information about the same properties or attributes of those things---their poll ratings. But the values ascribed by the two reports to the same things may conflict. If they do, then merging or integrating these two reports requires some way of dealing with this conflict, perhaps by taking the mean of the different values, or by using the value from the report whose source is preferred.
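The two strategies just mentioned can be sketched in a few lines of Python. This is only an illustration of the idea, not the fusion rule language itself: the pollster names, party names, ratings, and the functions `merge_by_mean` and `merge_by_preference` are all hypothetical.

```python
from statistics import mean

# Hypothetical poll reports: source -> {party: rating in percent}.
reports = {
    "PollsterA": {"Party X": 38.0, "Party Y": 34.0},
    "PollsterB": {"Party X": 36.0, "Party Y": 35.0},
}

def merge_by_mean(reports):
    """Resolve each conflict by averaging the values reported for a party."""
    parties = sorted({p for ratings in reports.values() for p in ratings})
    return {
        p: mean(ratings[p] for ratings in reports.values() if p in ratings)
        for p in parties
    }

def merge_by_preference(reports, preferred):
    """Resolve each conflict by taking the preferred source's value.

    Values from other sources are used only for parties that the
    preferred source does not mention.
    """
    merged = {}
    for source, ratings in reports.items():
        if source != preferred:
            for party, value in ratings.items():
                merged.setdefault(party, value)
    merged.update(reports.get(preferred, {}))
    return merged

mean_report = merge_by_mean(reports)
preferred_report = merge_by_preference(reports, "PollsterB")
```

Either way, the result is a new report that neither pollster published, which is the sense of merging intended here.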

Similarly, in the case of financial information, different reports that are about the same thing, a particular company and its stock, may all contain the same kind of information about it---information about its stock price and a recommendation to buy or sell. But again, that information may conflict. So once again, if this information is to be merged, something must be done about this conflict (even if it is only to conjoin all the conflicting values, perhaps with their sources, so as to convey the fact that there is a conflict).
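The weakest of these options, conjoining the conflicting values together with their sources, might look as follows. The broker names, the report contents, and the `conjoin` function are assumptions made purely for illustration.

```python
# Hypothetical analyst reports about one company's stock.
stock_reports = {
    "BrokerA": {"price": 12.5, "recommendation": "buy"},
    "BrokerB": {"price": 12.5, "recommendation": "sell"},
}

def conjoin(reports, attribute):
    """Collect every value reported for an attribute, keeping its sources."""
    by_value = {}
    for source, report in reports.items():
        if attribute in report:
            by_value.setdefault(report[attribute], []).append(source)
    return by_value

recommendations = conjoin(stock_reports, "recommendation")
prices = conjoin(stock_reports, "price")

# More than one distinct value means the sources conflict on that attribute.
recommendation_conflict = len(recommendations) > 1
price_conflict = len(prices) > 1
```

Nothing is resolved here, but the merged output at least makes the disagreement, and who holds which view, explicit.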

It is in situations such as these that we envisage that an application built using fusion rules and a knowledgebase would be useful.

The Semantic Web

By contrast, several other technologies already exist that claim to integrate or merge information, but they all strike us as attempting to do something somewhat different.

Chief amongst technologies whose aim could be described as integrating information are those associated with the semantic Web, such as the Resource Description Framework (RDF) and RDF Schema (RDFS), the Ontology Inference Layer (OIL, also used as an acronym for "Ontology Interchange Language"), and the DARPA Agent Markup Language (DAML). The combination of these technologies is intended to make the information on the Web not just machine readable, as at present, but machine understandable, thus increasing the "semantic interoperability" of the Web. If this is to be done, what we need is semantic metadata---data that tells machines what the content on the Web means---rather than metadata that simply tells machines how to display Web content (HTML tags), or what syntactic structure it has (XML).

The consensus is that such semantic metadata is best provided via ontologies, which offer a source of shared and defined terms standing in clear relationships to one another. These ontologies will provide us with the vocabulary to mark up or annotate the semantics of information on the Web. DAML+OIL is an ontology representation language that allows for the specification and exchange of ontologies, and it is designed to provide this intelligent access for machines to the heterogeneous and distributed information found on the Web. In particular, three features suit it to this purpose: (1) the use of a frame- or class-based modelling framework makes it easier for humans to construct ontologies; (2) the choice of the Web languages XML and RDF as a syntax makes DAML+OIL interoperable with existing Web software; (3) the decision to design the language so that its semantics can be mapped to a description logic allows for reasoning in the construction and maintenance of ontologies, something that is very useful when checking the consistency of large ontologies.

Given this account of the purpose of these technologies, they can indeed be said to integrate information in the sense that they enable machines to determine the content or meaning of that information, and thus determine, for example, whether two sources contain information about the same thing. But it is important to note that they do this by acting on or exploiting this new semantic metadata; they do not concern themselves with the data---the actual Web content---itself. The significance of this is that although they can be said to integrate diverse, heterogeneous data, it would perhaps be more accurate to say that they put us in a position to integrate such data. For knowing what the information on a Web site is about is one thing; knowing what to do with it is another. And, as the developers of DAML+OIL have acknowledged, "Defining languages for the semantic Web is just the first step. Developing new tools, architectures, and applications is the real challenge that will follow." (see Fensel, "The Semantic Web and its Languages", IEEE Intelligent Systems, p. 67.)

With respect to our interest in merging information, the relevance of these technologies is that they enable us to establish that the information to be merged is (or is not) about the same subject matter. But having done that, in themselves they say nothing about what to do with that data itself---what we should do if it does not conflict, what we should do if it does. Thanks to the semantic Web, software agents will be able to tell us that one website on Claret contains information about the same thing as another website containing information on the wines of Bordeaux. But suppose these sites also contain differing opinions on the merits of the 2003 vintage. How should we deal with this conflict? In themselves, the languages developed to enable the semantic Web say nothing about what to do here, but this is exactly the kind of task that we have in mind when we speak of merging information. Merging or integrating information, as we are using those terms, is one of those further applications that need to be developed on top of the semantic Web, along with applications such as intelligent search engines and other envisaged applications in e-commerce and knowledge management. Our view of the semantic Web, then, is that it is an enabling technology for information integration. The fusion system that we advocate could be considered an information integration agent that could make use of semantic Web technology.

Information mediators and information integration on the web

The aim of merging, as we choose to understand it, is to take potentially conflicting information from different sources and, by reasoning about those conflicts, to create a distinct, novel output report that summarizes, or in other ways combines, that information. The upshot of merging in this sense is that a new source of information is created, something that did not previously exist.

This aim seems to be substantially different from most extant approaches to merging or integrating information, where the emphasis often seems to be more on extracting information from different sources and presenting it in such a way as to make it easier for the user to find the information she wants. The upshot of merging or integration in this sense is that the user finds a pre-existing source of information. No new information is created (although, of course, the information will be new to the user---that's why she's looking for it); rather, existing information is made more usable.

The difference between these two projects can perhaps best be brought out in terms of the familiar entity-attribute model. Our interest in merging is to take information from different sources about the same entity, where that information concerns the same attributes, and then combine it. If the values of those attributes are the same (or similar), not much reasoning will be involved. But where the values of those attributes conflict, a variety of kinds of reasoning using a knowledgebase can be used to resolve the conflict. By contrast, most other projects of integrating information on the web seem concerned with a different task: they start from the same place, with different sources of information about the same entity, but those sources typically contain information about different attributes. Thus these different data sources are conceived of as containing complementary information, not conflicting information. As a result, merging is conceived of, not as the process of resolving a conflict, but as the process of combining the information about the different attributes of a given entity, thus making it easier for the user to find the information she wants.
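In entity-attribute terms, the distinction can be sketched as a comparison of two reports about the same entity. The reports, attribute names, and the `compare` function below are hypothetical, and the sketch assumes that the two reports have already been established to concern the same entity.

```python
def compare(report_a, report_b):
    """Classify the attributes of two reports about the same entity."""
    shared = report_a.keys() & report_b.keys()
    return {
        # Same attribute, same value: little reasoning is needed.
        "agreeing": sorted(k for k in shared if report_a[k] == report_b[k]),
        # Same attribute, different values: the case fusion rules target.
        "conflicting": sorted(k for k in shared if report_a[k] != report_b[k]),
        # Attribute present in only one report: complementary information.
        "complementary": sorted(report_a.keys() ^ report_b.keys()),
    }

# Hypothetical reports about one restaurant from two sources.
review_site = {"name": "Art's Deli", "cuisine": "Deli", "rating": 4}
guide_book = {"name": "Art's Deli", "rating": 3, "health_grade": "A"}

classification = compare(review_site, guide_book)
```

Most Web integration projects work on the "complementary" attributes; the merging discussed here is chiefly concerned with the "conflicting" ones.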

A couple of examples should make this clearer. One example is the TheaterLoc application (Greg Barish, Craig A Knoblock, Yi-Shin Chen, Steven Minton, Andrew Philpot and Cyrus Shahabi, "TheaterLoc: A Case Study in Information Integration", Information Integration Workshop, Stockholm, Sweden (IJCAI '99), 1999): here the user chooses to search for restaurants and/or cinemas in a specified city, and is presented with a list of them, together with their locations displayed on an interactive map. The user can then navigate via the list or the map to detailed information on individual theatres or restaurants, including watching previews of the films. In this application, information about a theatre's films is brought together with video previews, and information about the street address of the theatre is used to find the theatre's grid reference, allowing it to be placed on a street map. The end product is a user interface making it easier for the user to find what she wants; a single website now allows the user to find both what's showing and how to get there. In that sense we can speak of information integration.

A second application (Craig A Knoblock and Steven Minton, "The Ariadne approach to Web-based information integration", IEEE Intelligent Systems, volume 13, number 5, 1998, pp. 17--20) seeks to integrate the information contained in restaurant review websites such as Fodor's and Zagat's with information on the health or sanitation status of those restaurants contained on a local government website. Combining this information would allow users to answer such queries as "Find all the Japanese restaurants in Santa Monica with a grade A health rating." Once again, the purpose of integration here is not to reason about conflicting data, but to make it easier for the user to find information which already exists. The end result is that the user is looking at the same Web pages that she could have found manually, but she got there a lot more easily.

Both applications work by using the Ariadne information mediator, which forms an intermediate layer between the user and the various heterogeneous information sources. The purpose of the mediator is to provide a uniform query interface, abstracting away from the heterogeneous formats of the different sources. The mediator also plans how the sources should be queried and how the data that is retrieved is to be integrated. The extraction itself is done by wrappers. It should be acknowledged that this sort of information integration does involve some reasoning concerning conflicts, but this is confined to determining whether information from two sources is in fact about the same entity, say the same film or restaurant. For example, we need to be able to tell that information from a cinema website about a film called "A Bug's Life" is information about the same film that is listed as "Bug's Life, A" on a trailer website, so that links to the two can be combined. Or, we need to know whether a restaurant listed as "Art's Delicatessen" on one site is the same as one listed as "Art's Deli" on another. The role of reasoning about conflicts thus seems to be restricted to providing robustness to the semantic heterogeneity of different information sources in their description of the entities they are about. These applications do not seem to provide a role for reasoning about conflicts between the values of the attributes of those entities.
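A very simple form of this entity matching is to reduce each name to a canonical key before comparing. The sketch below is not Ariadne's actual method, which the sources cited here do not describe in detail; the `normalize_name` function and the one-entry abbreviation table are assumptions chosen only to cover the two examples above.

```python
import re

# Hypothetical abbreviation table; a real system would need a far richer one.
ABBREVIATIONS = {"deli": "delicatessen"}

def normalize_name(name):
    """Reduce a film or restaurant name to a key suitable for comparison."""
    text = name.strip().lower()
    # Move a trailing article back to the front: "bug's life, a" -> "a bug's life".
    match = re.fullmatch(r"(.*),\s*(a|an|the)", text)
    if match:
        text = f"{match.group(2)} {match.group(1)}"
    # Drop punctuation such as apostrophes and commas.
    text = re.sub(r"[^a-z0-9 ]", "", text)
    # Expand known abbreviations and collapse repeated spaces.
    return " ".join(ABBREVIATIONS.get(word, word) for word in text.split())

same_film = normalize_name("A Bug's Life") == normalize_name("Bug's Life, A")
same_restaurant = normalize_name("Art's Delicatessen") == normalize_name("Art's Deli")
```

Matching of this kind settles only which entity a source is describing. It does not address what to do when two sources then disagree about that entity's attributes.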

There is no question that the sorts of issues in integration addressed by these applications are pressing, but the end result of this approach is that it is easier for the user to find one or more of the pre-existing input sources that she's interested in. Resolving conflicts between sources may be required in order to do this, but it remains a subsidiary goal. By contrast, the objective of our approach is that the user gets to see a wholly different, merged or summarized, output report. The focus of our approach is on developing ways of merging or integration that can deal with a whole array of problems that arise with conflicting information, of which the entity-matching problems just mentioned in connection with these Web integration applications are only one kind.

Database integration

The goal of database integration is to provide uniform access to multiple, heterogeneous databases that each have their own associated local schema. Logic-based techniques in data integration, such as global-as-view and local-as-view, offer some ability to relate sources using restricted forms of first-order logic, and so can be considered as special cases of knowledge-based merging. However, the format of the clauses used is largely limited to defining virtual tables directly in terms of existing tables. We could describe this process of integration as providing a mapping of one schema onto another, so that, for example, data in a column headed location in one table is mapped onto a column headed address in another table. So, "fusion" or "integration" in this context refers to the "combining" of several database tables into one. These "mappings" can be regarded as a form of ontological knowledge.
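The location-to-address example can be sketched as a virtual table defined directly in terms of existing tables, roughly in the spirit of global-as-view. The table contents, the `MAPPINGS` dictionary, and the function names are hypothetical, and in-memory lists of dictionaries stand in for real database tables.

```python
# Hypothetical local tables: each source uses its own column names.
SOURCE_A = [{"name": "Restaurant One", "location": "1 Example Street"}]
SOURCE_B = [{"name": "Restaurant Two", "address": "2 Sample Road"}]

# Schema mappings: local column name -> column name in the global schema.
MAPPINGS = {
    "source_a": {"name": "name", "location": "address"},
    "source_b": {"name": "name", "address": "address"},
}

def to_global(rows, mapping):
    """Rename the columns of each row according to a schema mapping."""
    return [
        {mapping[col]: value for col, value in row.items() if col in mapping}
        for row in rows
    ]

def virtual_table(sources):
    """Define the global table as the union of the renamed source tables."""
    combined = []
    for source_name, rows in sources.items():
        combined.extend(to_global(rows, MAPPINGS[source_name]))
    return combined

restaurants = virtual_table({"source_a": SOURCE_A, "source_b": SOURCE_B})
```

The result gives uniform access to both tables under one schema, but if the two sources had listed different addresses for the same restaurant, both rows would simply appear side by side.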

If this is how the goal of database integration should be understood, then we can see, once again, that this goal is substantially different from the primary goal of merging that we are advocating. Data integration in the former sense consists in combining information previously available in different applications and making that same information available to a single application so as to enable easier access and querying. But having brought the information together, this approach in itself has nothing to say about what should be done with it. By contrast, "integration" as we mean to use the term stands for the use of logical inference to create a new piece of information that was not necessarily previously available. As was the case with the semantic Web, far from being a competitor, we can view database integration as a useful enabling technology that provides the kind of input that could subsequently be merged in the ways that we envisage.

Conclusion

In one way or another the various technologies we have briefly examined in this section are concerned with giving the user access to an existing piece of information---a piece of information that is already there on the Web, or in a database, but that is difficult to access, perhaps because of the vast amount of information available, or because it's spread across more than one database. By contrast, what's different about the kind of merging of information that we have highlighted is that it is designed to create a new piece of information---the merged output report---from these existing sources. None of the comments in this section should be taken as criticisms of these technologies. We believe, rather, that they are all aiming to do something somewhat different from the task that we envisage for an application built using fusion rules. Because these technologies are dealing with different issues they should all be considered, not as in competition with fusion rules applications, but as complementary to them. In particular, the problems of information extraction that some of these technologies deal with are problems that are ignored by fusion rules themselves, which simply assume that the information to be merged already exists in XML. Moreover, the problem of deciding whether input reports that appear to be about the same entity really are about the same entity is likewise ignored. Hence, any full-fledged commercial application built using fusion rules would likely require support from some of these other technologies.







Contact a.hunter@cs.ucl.ac.uk or +44 20 7679 7295.
