
Research Seminar: New Developments in Databases

Members of the teaching and research group use this seminar as a forum for discussion and exchange. Students and guests are cordially invited.

The seminar takes place in RUD25, room 3.113.

Schedule

Date        Start  Speaker              Title
23.04.2012  9:30   Steffen Zeuch        "Overview of modern parallelization techniques"
30.04.2012  9:30   Steffen Zeuch        "Overview of modern parallelization techniques" (cont.)
07.05.2012  9:30   Jan Hendrik Nielsen  "Privacy-Preserving Distributed k-Anonymity"
14.05.2012  9:30   Rico Bergmann        "Asterix on Hyracks/Algebricks"
21.05.2012  9:30   Olaf Hartig          "SPARQL for a Web of Linked Data: Semantics and Computability"
11.06.2012  9:30   Olaf Hartig          "Foundations of Traversal Based Query Execution over Linked Data"
18.06.2012  9:30   Gerd Anders          "A de-novo infrastructure for annotating and analysing protein structures"
25.06.2012  9:30   t.b.a.               t.b.a.
02.07.2012  9:30   t.b.a.               t.b.a.
24.08.2012  11:00  Daniel Janusz        "Privacy Protocol for Linking Distributed Medical Data"
10.09.2012  9:30   t.b.a.               t.b.a.
17.09.2012  11:00  Matthias Sax         "Performance Optimization of Dataflows in Distributed Streaming Systems"
24.09.2012  9:30   Christian Fiebrig    "Generierung und Vergleich von Charakteristika von Umweltsystemen in Zeitmessreihen mittels Stratosphere"

Abstracts

"Overview of modern parallelization techniques" (Steffen Zeuch)

The talk describes the three main strategies for parallelization at the instruction, data, and thread levels, as well as their influence on modern software development. Examples are used to illustrate various problems and to present corresponding solutions.

"Privacy-Preserving Distributed k-Anonymity" (Jan Hendrik Nielsen)

Patient data collected in medical studies is increasingly recorded digitally and used across studies. The cancer registries of the German federal states are one example of this practice. Because these registries collect sensitive personal data that is processed, stored, and published in a decentralized manner, data protection is becoming increasingly important in medicine.

The preceding student research project (Studienarbeit) already showed that patient privacy is endangered when publicly available information is linked. The topic of that project was the protection of sensitive data in a distributed service infrastructure. It used a trusted party as the central instance, which performed the anonymization step by means of k-anonymization. Although this procedure is a secure method of anonymization, it offers no protection against a compromise of the trusted party itself. For this purpose, methods exist that allow the anonymization to be carried out with the help of a cryptographic protocol, without a central instance.

The diploma thesis will address the scenario of vertically partitioned data and its distributed anonymization. The talk will demonstrate the procedure using a method for two participating parties. In addition, the weaknesses of this procedure and possible improvements, as well as the extension to three or more parties, will be discussed.
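As a minimal illustration of the k-anonymity property itself, the following sketch checks whether every combination of quasi-identifier values occurs at least k times and shows how generalizing a value can establish that property. It is a centralized toy example and does not implement the distributed cryptographic protocol discussed in the talk. The records, the column names, and the helper functions `is_k_anonymous` and `generalize_age` are assumptions made for illustration.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

def generalize_age(record, width=10):
    """Replace the exact age with an age range of the given width."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}

# Hypothetical patient records; "age" and "zip" act as quasi-identifiers.
patients = [
    {"age": 34, "zip": "12489", "diagnosis": "A"},
    {"age": 37, "zip": "12489", "diagnosis": "B"},
    {"age": 52, "zip": "12489", "diagnosis": "A"},
    {"age": 58, "zip": "12489", "diagnosis": "C"},
]

raw_ok = is_k_anonymous(patients, ["age", "zip"], k=2)

# After generalization the groups are ("30-39", "12489") and ("50-59", "12489").
generalized = [generalize_age(p) for p in patients]
generalized_ok = is_k_anonymous(generalized, ["age", "zip"], k=2)

print(raw_ok, generalized_ok)
```

In the raw table every age is unique, so the check fails for k = 2. After generalizing ages to decades, each group contains two records and the check succeeds.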

"Asterix on Hyracks/Algebricks" (Rico Bergmann)

This talk presents the current version of Asterix, a highly scalable data analytics platform that natively handles semistructured data. Hyracks (the execution engine) and Algebricks (the algebraic layer in Hyracks) are briefly introduced. The talk then highlights some aspects of Asterix, such as the Asterix Data Model (ADM), the query language AQL, and the newly implemented statistics collection component.

"SPARQL for a Web of Linked Data: Semantics and Computability" (Olaf Hartig)

The World Wide Web is currently evolving into a Web of Linked Data where content providers publish and link data as they have done with hypertext for the last 20 years. While the declarative query language SPARQL is the de facto standard for querying a priori defined sets of data from the Web, no language exists for querying the Web of Linked Data itself. However, it seems natural to ask whether SPARQL is also suitable for such a purpose.

In this paper we formally investigate the applicability of SPARQL as a query language for Linked Data on the Web. In particular, we study two query models: 1) a full-Web semantics where the scope of a query is the complete set of Linked Data on the Web and 2) a family of reachability-based semantics which restrict the scope to data that is reachable by traversing certain data links. For both models we discuss properties such as monotonicity and computability as well as the implications of querying a Web that is infinitely large due to data generating servers.

"Foundations of Traversal Based Query Execution over Linked Data" (Olaf Hartig)

Query execution over the Web of Linked Data has attracted much attention recently. A particularly interesting approach is link traversal based query execution, which proposes to integrate the traversal of data links into the construction of query results. Hence, in contrast to traditional query execution paradigms, this approach does not assume a fixed set of relevant data sources beforehand; instead, it discovers data on the fly and thus enables applications to tap the full potential of the Web.

While several authors study possibilities to implement the idea of link traversal based query execution and to optimize query execution in this context, no work exists that discusses the theoretical foundations of the approach in general. Our paper fills this gap.

We introduce a well-defined semantics for queries that may be executed using the link traversal based approach. Based on this semantics we formally analyze properties of such queries. In particular, we study the computability of queries as well as the implications of querying a potentially infinite Web of Linked Data. Our results show that query computation in general is not guaranteed to terminate and that for any given query it is undecidable whether the execution terminates. Furthermore, we define an abstract execution model that captures the integration of link traversal into the query execution process. Based on this model we prove the soundness and completeness of link traversal based query execution and analyze an existing implementation approach.
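The idea of discovering data by following links can be sketched with a small in-memory stand-in for the Web. This is a simplified illustration, not the execution model defined in the paper: it follows every URI it encounters, separates traversal from result construction, and uses an explicit `max_documents` bound because, as stated above, termination is not guaranteed in general. The `WEB` dictionary, the example URIs, and the functions `traverse` and `match` are assumptions made for illustration.

```python
from collections import deque

# Hypothetical Web: each URI maps to the triples in the document it resolves to.
WEB = {
    "http://example.org/alice": {
        ("http://example.org/alice", "knows", "http://example.org/bob"),
        ("http://example.org/alice", "name", "Alice"),
    },
    "http://example.org/bob": {
        ("http://example.org/bob", "knows", "http://example.org/carol"),
        ("http://example.org/bob", "name", "Bob"),
    },
    "http://example.org/carol": {
        ("http://example.org/carol", "name", "Carol"),
    },
}

def traverse(seed_uris, max_documents):
    """Breadth-first traversal that collects the triples of every reached document."""
    seen, queue, triples = set(), deque(seed_uris), set()
    while queue and len(seen) < max_documents:
        uri = queue.popleft()
        if uri in seen or uri not in WEB:
            continue
        seen.add(uri)
        for s, p, o in WEB[uri]:
            triples.add((s, p, o))
            # Follow subject and object URIs that resolve to further documents.
            for term in (s, o):
                if term in WEB and term not in seen:
                    queue.append(term)
    return triples

def match(triples, predicate):
    """Evaluate a single triple pattern (?s predicate ?o) over the collected data."""
    return sorted((s, o) for s, p, o in triples if p == predicate)

all_names = match(traverse(["http://example.org/alice"], max_documents=10), "name")
bounded_names = match(traverse(["http://example.org/alice"], max_documents=1), "name")
print(all_names)
print(bounded_names)
```

Starting from a single seed URI, the unbounded run reaches all three documents, while the run limited to one document only sees the data of the seed itself. The result of a query therefore depends on which documents are reachable and actually retrieved.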

"A de-novo infrastructure for annotating and analysing protein structures" (Gerd Anders)

Proteins are biomolecules, i.e. chemical compounds facilitating a function in organisms. The Protein Data Bank (PDB) is a unique worldwide repository of structural data about proteins. Analyzing data in the PDB can help explain diseases, develop new drugs, or understand the interactions between proteins. However, one of the key challenges is to store and query this information efficiently in order to find and extract information and correlations of interest. Moreover, analyzing and annotating proteins is one of the ten most important tasks in structural biology and bioinformatics. Here, a de-novo infrastructure for storing and querying PDB data quickly and comprehensively is presented. Exploiting the potential of the infrastructure enabled the design and development of a computationally demanding application for mining secondary structure elements of proteins. The analysis results lead to a promising hypothesis which could finally answer a long-standing biological question.

"Privacy Protocol for Linking Distributed Medical Data" (Daniel Janusz)

Health care providers need to exchange medical data to provide complex medical treatments. In general, privacy protection regulations impose strong constraints on exchanging such personal data within a distributed system. Privacy-preserving query protocols provide mechanisms for implementing and maintaining these privacy constraints. In this paper, we introduce a new two-phase protocol for protecting the privacy of patients. The first phase implements private record linkage: the queried data provider links the received query with matching records in its database. In the second phase, a requestor and a data provider perform an authorized exchange of matched patient data. Thus, our protocol provides a method for health care providers to exchange individual medical data in a privacy-preserving manner. In contrast to other approaches, we actively involve patients in the exchange process. We apply the honest-but-curious adversary model to our protocol in order to evaluate our approach with respect to complexity and the degree of privacy protection.

"Performance Optimization of Dataflows in Distributed Streaming Systems" (Matthias Sax)

In this talk I will present the research project I worked on during my internship at HP Labs (Palo Alto, CA) to optimize the performance of dataflows in distributed streaming systems, and will illustrate our principles on Storm. Storm is a MapReduce-inspired distributed intra-node-parallel streaming system executing data flows (called Topologies). Previous work at HP Labs showed that batching techniques can improve the throughput of Storm topologies by an order of magnitude. However, it is difficult to decide manually for which nodes in the data flow batching is beneficial and what the batch size should be. Additionally, Topologies must be annotated with a degree of parallelism for each node in the workflow. While it is difficult to choose the optimal degree of parallelism, this value also influences the batch size and vice versa. I will describe our optimization algorithm, which computes the optimal degree of parallelism and optimal batch size for each node in the topology. Furthermore, the talk covers the transparent implementation of batching on top of Storm and some experimental results using our optimizer implementation.
Toward the end, I will also briefly describe the other project I worked on to integrate Hadoop as an execution engine for CLASP. CLASP (Cloud Application Service Platform) is an HP in-house built cloud-based distributed system that executes services. Right now, services are “plain” Java programs. However, it is intended that CLASP supports different types of services (e.g., Hadoop-Service, Storm-Service, Vertica-Service), using different execution engines. I will describe how Hadoop was integrated into CLASP and how the programming API was extended.
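To illustrate why batching can raise throughput in a streaming dataflow, the following sketch groups tuples into batches and compares a simple transfer cost for different batch sizes. It does not use the Storm API and is not the optimizer from the talk. The functions `batched` and `transfer_cost` and the cost parameters `per_message_overhead` and `per_tuple_cost` are hypothetical and only model a fixed overhead per transferred message.

```python
def batched(stream, batch_size):
    """Group the items of a stream into lists of at most batch_size items."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller batch
        yield batch

def transfer_cost(num_tuples, batch_size, per_message_overhead=100, per_tuple_cost=1):
    """Hypothetical cost model: fixed overhead per message plus a cost per tuple."""
    messages = -(-num_tuples // batch_size)  # ceiling division
    return messages * per_message_overhead + num_tuples * per_tuple_cost

batches = list(batched(range(7), 3))
cost_unbatched = transfer_cost(1000, batch_size=1)
cost_batched = transfer_cost(1000, batch_size=100)
print(batches, cost_unbatched, cost_batched)
```

Under this assumed cost model, sending 1000 tuples one at a time is dominated by the per-message overhead, while batches of 100 tuples pay that overhead only ten times. Larger batches also delay individual tuples, which is one reason the batch size has to be chosen per node rather than maximized.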

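"Generierung und Vergleich von Charakteristika von Umweltsystemen in Zeitmessreihen mittels Stratosphere" [Generating and Comparing Characteristics of Environmental Systems in Time Series Using Stratosphere] (Christian Fiebrig)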

This diploma thesis deals with the comparison of time series using Stratosphere. To analyze environmental systems, large simulations are carried out at the Geoforschungszentrum in Potsdam (GFZ) in order to understand the behavior of these systems and to make predictions about them. The resulting series of measurements are to be examined for spatial and temporal patterns in order to identify the characteristics of the environmental systems. This requires massive comparisons of the simulation values, which are intended to establish how the outputs relate to the inputs. With the Pyramid Match Kernel as the chosen comparison strategy, pairs of spatial models from the simulations are searched for spatially matching patterns. To this end, the Pyramid Match Kernel uses feature extraction, so that comparisons are carried out only on a reduced set of features. Using data from flood simulations of the GFZ Potsdam, comprising 90,000 matrices with about 860,000 measurement points each, the load capacity and the essential properties of the system are to be explored.
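The following sketch shows the core idea of a pyramid match kernel on one-dimensional integer features: two feature sets are binned at increasingly coarse resolutions, and matches that first appear at a coarser level receive a lower weight. It is a simplified, unnormalized variant and not the implementation from the thesis, which works on spatial models and runs on Stratosphere. The function names and the example feature sets are assumptions made for illustration.

```python
from collections import Counter

def histogram(features, bin_width):
    """Count how many features fall into each bin of the given width."""
    return Counter(f // bin_width for f in features)

def intersection(h1, h2):
    """Histogram intersection: number of features matched within shared bins."""
    return sum(min(count, h2[b]) for b, count in h1.items())

def pyramid_match(xs, ys, levels):
    """Sum the new matches per level, weighted by 1 / 2**level (bin width doubles)."""
    score, previous = 0.0, 0
    for level in range(levels):
        width = 2 ** level
        current = intersection(histogram(xs, width), histogram(ys, width))
        score += (current - previous) / width  # only matches new at this level
        previous = current
    return score

xs = [1, 5, 9]
ys = [1, 4, 14]
similarity = pyramid_match(xs, ys, levels=4)
self_similarity = pyramid_match(xs, xs, levels=4)
print(similarity, self_similarity)
```

For these sets, one pair matches at bin width 1, one more at width 2, and the last at width 8, giving 1 + 1/2 + 1/8 = 1.625. Comparing a set with itself matches all three features at the finest level and yields 3.0.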


