
Institut für Informatik, Humboldt-Universität zu Berlin

Research Seminar: New Developments in Databases

Members of the teaching and research unit use this seminar as a forum for discussion and exchange. Students and guests are cordially invited.


The seminar takes place in RUD25, room 4.112. To receive e-mail invitations to the research seminar, please send an e-mail to Thomas Morgenstern to be added to or removed from the mailing list.


Date | Start | Room | Speaker | Title
28.04.2016 | 11:00 (s.t.!) | RUD25, 4.210 | Michael Kraemer | Textual Similarity Join on HPCC Using the Example of MassJoin
12.05.2016 | 11:00 (s.t.!) | RUD25, 4.112 | Saliha Irem Besik | Design and Development of Medical Recommendation System for Home Care Service for Geriatrics
19.05.2016 | 11:00 (s.t.!) | RUD25, 4.112 | Thomas Bodner | Towards Scalable Real-time Analytics: An Architecture for Scale-out of OLXP Workloads
21.07.2016 | 11:00 (s.t.!) | RUD25, 4.112 | Fabian Fier | MapReduce Frameworks: Comparing Hadoop and HPCC with Textual Similarity Joins (Work in Progress)
21.07.2016 | 11:00 (s.t.!) | RUD25, 4.112 | Matthias J Sax | Cost-based Parallelization for Stream-based Data Flows
29.07.2016 | 10:15 | RUD25, 4.210 | Jörg Bachmann | TBA
29.07.2016 | 10:15 | RUD25, 4.210 | Daniel Janusz | TBA
04.08.2016 | 11:00 (s.t.!) | RUD25, 4.112 | Steffen Zeuch | TBA
04.08.2016 | 11:00 (s.t.!) | RUD25, 4.112 | Mathias Peter | TBA


"MapReduce Frameworks: Comparing Hadoop and HPCC with Textual Similarity Joins (Work in Progress)" (Fabian Fier)

MapReduce and Hadoop are often used synonymously. For optimal runtime performance, Hadoop users have to consider various implementation details and configuration parameters. When conducting performance experiments with Hadoop on different algorithms, it is hard to choose a set of such implementation optimizations and configuration options which is fair to all algorithms. HPCC is a promising alternative open source implementation of MapReduce. We show that HPCC provides sensible default configuration values allowing for fairer experimental comparisons. On the other hand, we show that HPCC users still have to consider implementing optimizations known from Hadoop.

"Cost-based Parallelization for Stream-based Data Flows" (Matthias Sax)

A fundamental challenge in data stream processing is the requirement to "keep up" with the data ingestion rate while providing results with low latency. Solving this challenge has become even more important as available data streams continuously grow in number and volume (e.g., Internet of Things). At the same time, the demand for low-latency processing has increased, resulting in (near) real-time requirements. In recent years, new distributed data-parallel stream processing systems such as Apache S4, Apache Storm, Apache Samza, Apache Flink, and Apache Beam (aka Google Dataflow) have emerged. All of them require educated tuning of streaming programs by expert users: one of the most important parameters is the degree of parallelism with which a program is executed. Furthermore, batching techniques are often employed in these systems to increase throughput, at the cost of increased processing latency. Depending on the use case, a trade-off exists between low resource consumption (i.e., the overall number of tasks running in parallel) and low latency. However, state-of-the-art systems lack any model to predict or quantify this trade-off, resulting in manual trial-and-error configuration and tuning. The goal of this work is to provide a holistic cost model for streaming data flows that takes into account both batching and parallelization. It can be used to automatically compute a data flow configuration for given throughput and/or latency goals. Furthermore, this holistic view of a data flow program may improve dynamic scaling approaches that use local information only. Thus, this work contributes to the overall vision of a fully self-managed (i.e., tuning-free), distributed stream processing system with elastic behavior.

"Towards Scalable Real-time Analytics: An Architecture for Scale-out of OLXP Workloads" (Thomas Bodner)

In this talk, we present the architecture of the SAP HANA Scale-out Extension (SOE). The goal of HANA SOE is to complement the scale-up oriented HANA core data platform with massive scale-out capabilities for large-scale analytics over real-time data. This is achieved by decoupling the main database components and providing them as services in a distributed landscape, a design choice made possible by recent advances in high-throughput, low-latency networks and storage devices. We detail three central components of HANA SOE and their interplay: a distributed shared log, a transaction broker and a distributed query executor. Furthermore, we report on the ongoing integration of HANA SOE with the Apache big data ecosystem.

Thomas Bodner is a developer in the HANA Vora team at SAP. He received an M.S. from Technische Universität Berlin and a B.S. from Duale Hochschule Baden-Württemberg Stuttgart. His current work involves building parts of a new system software stack for large-scale data management to support the growing data-related needs of SAP's enterprise applications.

"Design and Development of Medical Recommendation System for Home Care Service for Geriatrics" (Saliha Irem Besik)

Demands and expectations for health care have gradually increased with longer life expectancy and declining birth rates, while the resources reserved for health services remain relatively limited. Countries with aging populations are trying to develop new systems to use current resources more effectively. The aging population and the resulting chronic illnesses have become a real problem for Turkey as well. The growth of the elderly population results in more demand for health care because of aging-associated physical or mental limitations and chronic illnesses. Research indicates that home care services for seniors speed up the healing process. The aim of the thesis is to develop a medical recommendation system (RHCS) that generates treatment and care plan recommendations to assist health professionals in making decisions on the treatment of geriatric patients. This recommendation system will be part of an integrated, patient-based e-health platform that provides home health care for elderly people who need care and includes all of the actors (particularly relatives of the elderly) involved in the nursing period. One of the distinctive points of this study lies in its methodology, which strengthens a collaborative filtering recommendation approach with historical data of geriatric patients. Its ontology-based approach, electronic health record structure, and compatibility with the ICD-10 and ATC clinical classification systems also make this study prominent. RHCS has been evaluated both by offline experiments with historical patient data from Ankara Numune Hospital and by user studies conducted with 13 doctors. The results are measured with three different types of evaluation metrics, and it is shown that in each case RHCS successfully generates reliable and relevant recommendations. As future work, RHCS will be adapted to integrate with a rule-based clinical decision support system.

"Textual Similarity Join on HPCC Using the Example of MassJoin" (Michael Kraemer)

In existing work, including the MassJoin algorithm, the MapReduce programming model has established itself as a widely used technology. MapReduce leaves it to the user to partition the problem sensibly and to develop the application logic for the map and reduce processes. The map and reduce phases are not optimized automatically. Studies show that the MapReduce-based MassJoin algorithm exhibits unsatisfactory scaling and runtime behavior. It is suspected that this unfavorable behavior is rooted in the properties of the MapReduce programming paradigm, which motivates further research. This work intends to use the HPCC cluster system together with the query language ECL, which has advantages over MapReduce owing to its system architecture. Combining the techniques of the MassJoin algorithm with the HPCC system appears to be a promising approach to solving the performance problems.
