|
Dissertationen2004XML Query Execution Engine (XEE) - Speichern, Anfragen, Verändern von XML-Dokumenten auf der Grundlage von Datenbank-Techniken | | Dieter Scheffner
|
|
 | Abstract:In den letzten Jahren gewann das World Wide Web (WWW) als Informationssystem auf der Grundlage des Internet sowohl im kommerziellen als auch im privaten Bereich stark an Bedeutung. Information ist zu einem wichtigen Gut unserer Zeit geworden, das heute bereits "überwiegend in elektronischer Form "über das Internet verbreitet wird. Dieser Entwicklung zur Folge wächst das Volumen an zu verwaltenden und auszutauschenden Daten beständig. Die Suchmaschine Google greift derzeit bereits auf "über drei Milliarden WWW-Seiten für die Informationsgewinnung zurück.
Parallel zur Entwicklung des WWW ist die Entwicklung von Formaten für den Austausch und die Repräsentation von Dokumenten bzw. Daten zu sehen. Zu einem der bedeutendsten heutigen Standardformate hat sich die eXtensible Markup Language (XML) für die Repräsentation und den Austausch von Daten auch über den Bereich des WWW hinaus etabliert. XML findet über den Bereich des WWW Anwendung u.a. im Bereich der Digitalen Bibliotheken, hier insbesondere für die Repräsentation von Dokumenten, und im Bereich des E-Commerce. Bei letzterem kommt XML neben der Repräsentation von Dokumenten wie Produktkatalogen, Verträgen, Flugplänen usw. auch insbesondere die Rolle als Format für den elektronischen Datenaustausch (electronic data interchange - EDI) zu. Aufgrund dieser Tatsache besteht ein dringender Bedarf, große Bestände an in XML-Format vorliegenden Daten effizient und zuverlässig verwalten, speichern, anfragen und verändern zu wollen.
Datenbank-Management-Systeme (DBMS) haben sich als effiziente und zuverlässige Systeme im Umgang mit sehr großen Beständen an strukturierten Daten erwiesen. Die Technologie, auf die insbesondere relationale Datenbank-Management-Systeme aufbauen, wurde über die Jahre hinweg immer weiter vervollkommnet, so dass dem Anwender heutzutage ausgereifte Datenbankprodukte für strukturierte Daten auf dem Markt zur Verfügung stehen.
Daten bzw. Dokumente, die im XML-Format vorliegen, bringen jedoch zwei grundlegende Eigenschaften mit, für die konventionelle Datenbank-Management-Systeme nicht optimal ausgelegt sind. Zum einen gehören XML-Daten im allgemeinen der Klasse der semistrukturierten Daten an, besitzen deshalb oft nur wenig Struktur bzw. die vorhandene Struktur kann starken Änderungen unterliegen. Zum anderen ist bei XML-Daten auch häufig die Reihenfolge der Daten für Anwendungen wichtig, beispielsweise dann, wenn es sich bei den Daten um Dokumente handelt. Von daher ist die Bezeichnung XML-Dokument für allgemein im XML-Format vorliegende Daten angemessener. Semistrukturiertheit und Reihenfolge von Daten stellen die Datenverwaltung mittels eines konventionellen (objektorientierten, objektrelationalen bzw. relationalen) DBMS vor erhebliche Probleme, da sich solche Systeme auf nahezu feste Datenschemata und auf (Reihenfolge nicht-beachtenden) mengenorientierten Datenzugriff stützen. Darüber hinaus unterstützen konventionelle DBMS im Allgemeinen ausschließlich sogenannte strukturbasierte Anfragen, die von der Struktur wie Tabelle und Attribut ausgehend auf die Daten schließen. Eine typische solche Datenbankanfrage hat beispielsweise die folgende Gestalt: Gib alle Namen der Tabelle Mitarbeiter aus. Auf der anderen Seite spielen für die Dokumentanfrage sogenannte inhaltsbasierte Anfragen, wie sie beispielsweise von WWW-Suchmaschinen her bekannt sind, eine wesentliche Rolle. Eine Anfrage dieser Art könnte beispielsweise die folgende sein: Gib alle Dokumente aus, die das Suchwort „XML“ enthalten. Bei solchen (inhaltsbasierten) Anfragen wird im Gegensatz zu den typischen Datenbankanfragen vom Wert auf die Struktur, d.h. mit Bezug auf das Beispiel vom Wert XML auf die gesuchten Dokumente geschlossen. übertragen auf die Abarbeitung derartiger Anfragen im Kontext von konventionellen DBMS ist dies vergleichbar damit, eine Datenbank nach beispielsweise den Namen von Tabellen oder Attributen anzufragen, in denen ein bestimmter Wert vorkommt: Gib alle Tabellen- und Attributnamen aus, die das Suchwort XML enthalten.
Derartige Anfragen werden von konventionellen Datenbank-Management-Systemen derzeit nicht unmittelbar unterstützt und müssen bei Bedarf mit zusätzlichem Aufwand (speziell formulierten, komplexen Anfragen) realisiert werden, letzteres auf Kosten einer effizienten Abarbeitung der Anfrage. Insbesondere der zuletzt genannte Punkt macht deutlich, dass die von objektorientierten, objektrelationalen bzw. relationalen Datenbank-Management-Systemen zur Verfügung gestellten Datenmodelle im allgemeinen nicht im Einklang mit Datenmodellen von Dokumenten stehen. In Bezug auf die Verwaltung von XML-Dokumenten sind demnach Systeme gefordert, die Datenmodelle speziell für XML-Dokumente bereitstellen und effizient implementieren. Derartige Verwaltungssysteme werden häufig als native XML-Datenbanken (native XML databases) bezeichnet. Im Rahmen der vorliegenden Arbeit wird jedoch der Begriff XML-Datenbank-Management-System (XML-DBMS) hierfür verwendet. Notwendige Voraussetzung für eine effiziente Implementation eines Datenmodells sind geeignet konzipierte Datenstrukturen auf der physischen Ebene eines Systems. Insbesondere sind deshalb für Systeme, die hinsichtlich der Größe des zu verwaltenden Datenbestandes skalieren sollen - was insbesondere von XML-DBMS erwartet wird, geeignete Datenstrukturen auf Sekundärspeicherebene notwendig.
Der Entwurf und die Implementation von Datenstrukturen zur Speicherung, Anfrage und Veränderung von XML-Dokumenten auf Sekundärspeicherebene ist neben der XML-Datenmodellierung ein aktuelles Thema der Datenbankforschung, das erst in letzter Zeit in verstärktem Maße aufgegriffen wurde. Das heißt, es sind neue Konzepte insbesondere zum Speichern, Anfragen und Verändern von XML-Dokumenten auf der Grundlage von Datenbank-Techniken zu entwickeln. Die Entwicklung solcher Konzepte und ihrer Realisierung ist zentrales Thema der vorliegenden Arbeit. |
2003Querying Databases Privately | | Dmitri Asonov (GK)
|
|
 | Abstract:People often retrieve information by querying databases. Designing databases that allow a user to execute queries efficiently is a subject that has been investigated for decades, and is now often regarded as a "researched to death" topic. However, the evolution of information technologies and society makes the database area a consistent source of new, previously unimaginable research challenges. This dissertation is dedicated to partially meeting one of these new challenges: querying databases privately.
This new challenge is due to a very fundamental constraint of the conventional concept of querying information. Namely, in the conventional setting, the one who queries (the user) must reveal the query content and, by implication, the result of querying to the one who processes the query (the database server). This constraint seems to be negligible if the user trusts the server. However, the growing population of information providers makes it extremely difficult for users to establish and rely on the trustworthiness of information providers. Indeed, more and more cases are reported wherein information providers misuse the information provided by users' queries against the users, for example by sharing this information with third parties without permission, or by using this information for unsolicited advertisement.
We approach this constraint in a direct manner: If it is difficult to trust the server, we could try to remove the need for trust completely, by hiding the content of the user query and result from the server. This research problem, called Private Information Retrieval (PIR), has been under intensive and mainly theoretical investigation since 1996. These results are classified and analyzed in the first of four parts of this thesis. Our main contribution is considering this problem from a practical angle, as follows.
In Part II, we accept the assumptions and simplifications made in previous related work, and focus on obtaining efficient solutions and algorithms without changing the common model. Namely, we break the established belief that the server must read the entire database for a PIR protocol to answer a query. We further develop our solution by improving the processing and preprocessing complexities of our PIR protocol.
In Part III we extend the common PIR model in two directions. First, we relax the requirement that no information about a query must be revealed. This allows us to offer the user a trade-off between the level of privacy required and the response time for a query. The second extension of the model is done by understanding the economics associated with the PIR problem. Namely, we assumed that information in the database is from different owners. We then consider the problem of distributing royalties between the information owners, given that no information about content of user queries is revealed.
A number of questions remain to be answered before the problem of querying databases privately can be regarded as completely investigated. However, we argue that results presented in the thesis have pushed the state of the art in this area, from the entirely theoretical level to the stage where implementing an applicable prototype can be considered ultimately possible.
Dissertation Humboldt Universität zu Berlin, LNCS 3128. Springer Verlag, Berlin, Heidelberg, 2004.
|
2000AQuES: Ein flexibles, verteiltes System zur Optimierung und Auswertung relationaler Anfragen | |
 | Abstract:Parallel query evaluation for relational database management systems (RDBSM) still remains a challenging problem. Modern systems must show near optimal performance in spite of running in a heterogeneous hardware environment, exploiting different ways of parallelism and dealing with unpredictable system load. This thesis paper presents a dynamic and flexible system addressing the issues of optimization and evaluation of relational queries for a distributed and dynamic environment. In particular, this work consist of: 1) the architecture of a distributed system which was inspired by the concepts of software agents, 2) the architecture and the implementation of a communication infrastructure for the system components, 3) the architecture and the implementation of a new query optimization algorithm, and 4) the concept and the implementation of a new query evaluation engine for parallel execution, which enables runtime optimization of queries. Furthermore, the design supports the extension and the configuration of the system and its components.
Dissertation Humboldt Universität zu Berlin |
Quality-driven Query Answering for Integrated Information Systems | |
 | Abstract:Research and business is currently moving from centralized databases towards information systems integrating distributed, autonomous data sources. With it, research focus has shifted from traditional query optimization to the field of query planning. Query planning is the problem of finding query execution plans across distributed, heterogeneous, overlapping, and autonomous data sources. We argue that for such data sources the main discriminator for different query execution strategies is no longer response time, as it is for database queries, but, more general, the information quality (IQ) of the result. This thesis investigates the usage of IQ-criteria to improve the answering of user queries against integrated information systems. We discuss what kind of IQ-metadata is necessary, how it can be acquired, and, most important, how it can be used to improve the quality of query results and the performance of query planning algorithms. A simple application for these research issues is a meta-search engine that uses existing search engines as its distributed data sources. Other examples include stock information systems, travel guides, and distributed molecular biology databases.
The thesis has three main parts. Part I lays the foundation for the problem of querying Web data sources and shows why IQ-reasoning is helpful. We describe the mediator-wrapper architecture and show how to describe sources and user queries using the concept of the universal relation. Several application examples serve as rationale throughout the thesis. Part II introduces our model of information quality. We present a comprehensive set of IQ-criteria together score assessment methods. Each data source is rated by a set of IQ-criteria, such as completeness, understandability, or accuracy. To compare data sources and query plans qualitatively using multiple criteria, we present appropriate ranking methods, which aggregate IQ-criterion scores to an overall quality value. Part III puts information quality to work by combining query planning with IQ-reasoning. We revise the conventional query planning goal of finding all plans for a query: The new goal is to find the best N plans and use a quality model to quantify the term `best'. We present two algorithms to solve this problem. The first acts as an add-on to any given query planning algorithm, the second explicitly integrates IQ-reasoning into the planning process, thereby speeding up query planning itself. Next, we part from the conventional query planning paradigm of finding different plans for a query, each with a different result. The usage of new outerjoin-type merge operators to combine sources enables a reduction of the paradigm to finding a single, best plan. We concentrate on the completeness criterion describing the amount of data returned by a plan and present two families of optimization algorithms for different real world situations. All algorithms are evaluated using a simulation testbed.
The main contribution of the thesis is the comprehensive integration of information quality reasoning and query planning. Research has recognized the importance of quality reasoning, but, to the best of our knowledge, IQ-reasoning for query planning has not been adequately addressed before.
Dissertation Humboldt Universität zu Berlin, LNCS 2261. Springer Verlag, Berlin, Heidelberg, 2002. |
Querying Semistructured Data Based on Schema Matching  | |
 | Abstract:Most of today's data is still stored in files rather than in databases. This fact has become even more evident with the growth of the World Wide Web in the 1990s. Because of that observation, the research area of semistructured data has evolved. Semistructured data is typically stored in documents and has an irregular, partial, and implicit structure. The thesis presents a new framework for querying semistructured data. Traditional database management requires design and ensures declarativity. The possibilities to design are limited in the field of semistructured data, thus, a more flexible approach is needed. We argue that semistructured data should be represented by a set of partial schemata rather than by one complete schema. Because of irregularities of the data, a complete schema would be very large and not representative. Instead, partial schemata can serve as good representations of parts of the data. While finding a complete schema turns out to be difficult, a database designer may be able to provide partial schemata for the database. Also, partial schemata can be extracted from user queries if the query language is designed appropriately. We suggest to split the notion of query into a "What''- and a "How''-part. Partial schemata represent the "What''-part. They cover semantically richer concepts than database schemata traditionally do. Among these concepts are predicates, variable definitions, and path descriptions. Schemata can be used for query optimization, but they also give users hints on the content of the database. Finding the occurrences (matches) of such a schema forms the most important part of query execution. All queries of our approach, such as the focus query or the transformation query, are based on this matching. Query execution can be optimized using knowledge about containment relationships between different schemata. Our approach and the optimization techniques are conceptually modeled and implemented as a prototype on the basis of Constraint Satisfaction Problems (CSPs). CSPs form a general class of search problems for which many techniques and heuristics exist. A CSP consists of variables that have a domain associated to them. Constraints restrict the values that variables can simultaneously take. We transform the problem of finding the matches of a schema in a database to a CSP. We prove that under certain conditions the matches of a schema can be found without any search and in polynomial time. For optimization purposes the containment relationship between schemata is explored. We formulate a sufficient condition for schema containment and test it again using CSP techniques. The containment relationship can be used in two ways depending on the direction of the containment: It is either possible to reduce the search space when looking for matches of a schema, or it is possible to present the first few matches immediately without any search. Our approach has been implemented into the constraint system ECLiPSe and tested using XML documents. |
1997Event Detection in Active Databases | |
 | Abstract:This thesis presents a work in the research area of Active Databases. The work is influenced by more general event driven control approaches as pursued for workflow management. An Active Database automatically executes an action when specified events and conditions arise. Events are either atomic events or compositions of events. The pioneer Active Database approaches were designed for integrity maintenance of the transaction execution. Recently, Active Data bases leave the transaction context and develop towards more general support for event driven control in long running activities. The thesis presents the Smile approach (Situation Monitoring In Long run ning Environments). Smile is designed for event detection over long periods of time in distributed and partially connected systems. Smile introduces two new concepts: abstract events as representations for event compositions, and explicit dynamic requests for events. The thesis defines a set oriented model for requests and materialized abstract events, based on a common subset of existing event languages for event compositions. The operational semantics of event detection is defi ned by stream based operators. The Smile event detection combines request driven as well as event driven event detection over the whole lifetime of the application. In contrast to the sequential execution model of other Active Databases, Smile event detection is executed in parallel. The thesis investigates advantages and drawbacks of parallel event detection. On the one side, par allelization is a powerful optimization technique, and especially well sui ted for stream based operators. On the other side, parallelization might destroy the timely order of events. Since the timely order is essential for event detection and event composition semantics, the thesis presents a hybrid parallelization strategy which respects the timely order of events.
Dissertation Humboldt Universität zu Berlin, Edition Versal, Berz-Verlag Berlin, 1997. ISBN 3-929470-57-8 |
Last update: Monday, September 4, 2006
|
|