Institut für Informatik, Humboldt-Universität zu Berlin

Conflicts in Data Integration

Contradiction Patterns

Information integration often faces the problem that different data sources represent the same set of real-world objects but give conflicting values for specific properties of these objects. In [Müller, Leser, Freytag 2004] we present a model of such conflicts and describe an algorithm for efficiently detecting patterns of conflicts in a pair of overlapping data sources. The contradiction patterns we find are a special kind of association rule: they describe regularities in conflicts that occur together with certain attribute values, with pairs of attribute values, or with other conflicts. We therefore adapt existing association rule mining algorithms to mine contradiction patterns. Such patterns are an important tool for human experts who use domain knowledge to find and resolve data quality problems.
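To make the notion of a contradiction pattern concrete, the following is a minimal sketch and not the algorithm from [Müller, Leser, Freytag 2004]. It assumes two sources given as dictionaries keyed by a shared object identifier, considers only single attribute values as conditions, and counts by brute force instead of using an adapted association rule miner. The function name, the thresholds, and the sample data are hypothetical.

```python
from collections import Counter


def contradiction_patterns(source_a, source_b, target,
                           min_support=2, min_confidence=0.8):
    """Find rules 'IF attr = value THEN conflict in target'.

    Assumption: both sources map an object key to a dict of attributes,
    and a conflict means the two sources disagree on `target`.
    Returns a sorted list of (attr, value, support, confidence).
    """
    condition_count = Counter()  # occurrences of each condition
    conflict_count = Counter()   # occurrences together with a conflict

    # Only objects present in both sources can be in conflict.
    for key in source_a.keys() & source_b.keys():
        rec_a, rec_b = source_a[key], source_b[key]
        in_conflict = rec_a[target] != rec_b[target]
        for attr, value in rec_a.items():
            if attr == target:
                continue
            condition_count[(attr, value)] += 1
            if in_conflict:
                conflict_count[(attr, value)] += 1

    rules = []
    for (attr, value), n_conflicts in conflict_count.items():
        confidence = n_conflicts / condition_count[(attr, value)]
        if n_conflicts >= min_support and confidence >= min_confidence:
            rules.append((attr, value, n_conflicts, confidence))
    return sorted(rules)


# Hypothetical sample data: every record produced by lab "X" conflicts.
source_a = {
    1: {"lab": "X", "organism": "human", "length": 100},
    2: {"lab": "X", "organism": "mouse", "length": 200},
    3: {"lab": "Y", "organism": "human", "length": 300},
    4: {"lab": "Y", "organism": "mouse", "length": 400},
}
source_b = {
    1: {"length": 101},
    2: {"length": 201},
    3: {"length": 300},
    4: {"length": 400},
}

patterns = contradiction_patterns(source_a, source_b, target="length")
print(patterns)
```

On this sample data the only rule that passes both thresholds is "IF lab = X THEN conflict in length", which is the kind of hint a domain expert could follow up on.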

Minimal Update Sequences

Assuming that conflicts do not occur randomly but follow specific (but unknown) regularities, contradiction patterns of the form “IF condition THEN conflict” provide a valuable means of understanding them. We are also developing a second approach for finding regularities in contradicting databases: the detection of minimal update sequences that transform the contradicting databases into each other. Each operation in such a sequence may describe a potential systematic difference in data production that led to the observed conflicts.

Our idea of using minimal update sequences as descriptions for database differences is best explained by analogy to the usage of the string edit distance in biological sequence analysis.

 

Figure: Edit distance of biological sequences (left) versus update distance of databases (right)

The DNA sequence of a gene is a string over a four-letter alphabet. To learn about the function of a specific gene in a specific species, biologists search for evolutionarily related genes of known function in other species. This evolutionary relatedness (or distance) is proportional to the number of evolutionary events that turned the sequence of a common ancestor into the observed sequences, which in turn is proportional to the number of evolutionary events that would be necessary to turn one gene into the other. Under a simple model of evolution encompassing only changes, deletions, and insertions of single bases (i.e., characters of the sequence), the number of evolutionary events is measured by the edit distance between two gene sequences, i.e., the minimal number of edit operations (or evolutionary events) that transform one string into the other.
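The edit distance described above can be computed with the standard dynamic-programming recurrence. The sketch below assumes unit cost for each change, deletion, and insertion of a single character; the function name and the sample sequences are illustrative only.

```python
def edit_distance(s, t):
    """Minimal number of single-character changes, deletions, and
    insertions that transform string s into string t (unit costs)."""
    # prev[j] holds the distance between the processed prefix of s
    # and the first j characters of t.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # deleting the first i characters of s yields ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(
                prev[j] + 1,         # delete cs
                curr[j - 1] + 1,     # insert ct
                prev[j - 1] + cost,  # keep or change cs into ct
            ))
        prev = curr
    return prev[-1]


# Hypothetical gene fragments: one changed base, then one deleted base.
print(edit_distance("ACGT", "AGGT"))
print(edit_distance("ACGT", "AGT"))
```

Keeping only two rows of the table limits memory to the length of the second string, which matters for long sequences.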

Similarly, we consider updates, insertions, and deletions of tuples as the fundamental operations for manipulating data stored in relational databases. Thus, to assess the “evolutionary relationship” of two databases, we propose to use the minimal number of such operations that turn one database into the other. We call this number the update distance between two databases. Each sequence of operations whose length equals the update distance is one of the simplest possible explanations for the observed differences. Following the principle of Occam’s razor, we conclude that the simplest explanations are also the most likely. Minimal update sequences therefore give valuable clues about what has happened to a database to make it different from its original state. The update distance is a semantic distance measure: it is inherently process-oriented, in contrast to purely syntactic measures such as counting differences.
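As an illustration only, the following sketch derives one update sequence under strong simplifying assumptions that the text does not state: each relation is a dictionary from a stable key to a tuple of attributes, every operation touches exactly one tuple, an update may change any number of attributes of that tuple, and keys are never modified. Under these assumptions the sequence is minimal. If a single update operation may affect many tuples at once, the true update distance can be smaller, and the count below is only an upper bound. All names and data are hypothetical.

```python
def tuple_level_update_sequence(db_from, db_to):
    """Return single-tuple operations that turn db_from into db_to.

    Assumption: both databases map a stable key to a dict of
    attribute values over the same schema.
    """
    ops = []
    # Tuples that exist only in the old state must be deleted.
    for key in sorted(db_from.keys() - db_to.keys()):
        ops.append(("delete", key, None))
    # Tuples that exist only in the new state must be inserted.
    for key in sorted(db_to.keys() - db_from.keys()):
        ops.append(("insert", key, db_to[key]))
    # Shared tuples need one update if any attribute differs.
    for key in sorted(db_from.keys() & db_to.keys()):
        if db_from[key] != db_to[key]:
            changed = {attr: value for attr, value in db_to[key].items()
                       if db_from[key].get(attr) != value}
            ops.append(("update", key, changed))
    return ops


# Hypothetical database states.
db_v1 = {
    1: {"name": "a", "length": 10},
    2: {"name": "b", "length": 20},
    3: {"name": "c", "length": 30},
}
db_v2 = {
    1: {"name": "a", "length": 10},
    2: {"name": "b", "length": 25},
    4: {"name": "d", "length": 40},
}

sequence = tuple_level_update_sequence(db_v1, db_v2)
print(len(sequence), sequence)
```

The length of the returned sequence plays the role of the update distance in this restricted model, and the operations themselves are the candidate explanations a domain expert would inspect.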







Contact

+49 30 2093-3025