Research Seminar: New Developments in the Database Field
This seminar is used by members of the teaching and research unit as a forum for discussion and exchange. Students and guests are cordially invited.
The seminar takes place in RUD25, room 4.112. Anyone who would like to receive e-mail invitations to the research seminar should send an e-mail to Thomas Morgenstern to be added to or removed from the mailing list.
MapReduce Frameworks: Comparing Hadoop and HPCC with Textual Similarity Joins - Work in Progress (Fabian Fier)
MapReduce and Hadoop are often used synonymously. For optimal runtime performance, Hadoop users have to consider various implementation details and configuration parameters. When conducting performance experiments with Hadoop on different algorithms, it is hard to choose a set of such implementation optimizations and configuration options that is fair to all algorithms. HPCC is a promising open-source alternative to Hadoop. We show that HPCC provides sensible default configuration values, which allows for fairer experimental comparisons. On the other hand, we show that HPCC users still have to consider implementing optimizations known from Hadoop.
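The workload compared in this talk, a textual similarity join, can be illustrated with a minimal self-contained sketch (a naive all-pairs baseline in plain Python, not the distributed implementation used in the experiments):

```python
from itertools import combinations

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b)

def similarity_join(records: dict, threshold: float):
    """All-pairs textual similarity join: return the id pairs whose
    token sets reach the similarity threshold (naive O(n^2) baseline;
    MapReduce/HPCC implementations parallelize and prune this search)."""
    return [
        (i, j)
        for i, j in combinations(sorted(records), 2)
        if jaccard(records[i], records[j]) >= threshold
    ]

docs = {
    1: {"data", "base", "join"},
    2: {"data", "base", "query"},
    3: {"stream", "window"},
}
print(similarity_join(docs, 0.5))  # [(1, 2)]
```

Distributed frameworks replace the quadratic loop with partitioning and filtering steps, which is exactly where the implementation and configuration choices discussed above come into play.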
Cost-based Parallelization for Stream-based Data Flows (Matthias Sax)
A fundamental challenge in data stream processing is the requirement to "keep up" with the data ingestion rate while providing results with low latency. Solving this challenge has become even more important as available data streams continuously grow in number and volume (e.g., in the Internet of Things). At the same time, the demand for low-latency processing has increased, resulting in (near) real-time requirements. In recent years, new distributed data-parallel stream processing systems such as Apache S4, Apache Storm, Apache Samza, Apache Flink, or Apache Beam (aka Google Dataflow) have emerged. All of them require educated tuning of streaming programs by expert users: one of the most important parameters is the degree of parallelism with which a program is executed. Furthermore, these systems often employ batching techniques to increase throughput, at the cost of increased processing latency. Depending on the use case, there is a trade-off between low resource consumption (i.e., the overall number of parallel running tasks) and low latency. However, state-of-the-art systems lack a model to predict or quantify this trade-off, resulting in manual trial-and-error configuration and tuning. The goal of this work is to provide a holistic cost model for streaming data flows that takes both batching and parallelization into account. It can be used to automatically compute a data flow configuration for given throughput and/or latency goals. Furthermore, this holistic view on a data flow program may improve dynamic scaling approaches that use only local information. Thus, this work contributes to the overall vision of a fully self-managed (i.e., tuning-free) distributed stream processing system with elastic behavior.
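The batching trade-off described above can be made concrete with a toy cost model (an illustrative assumption of ours, not the cost model developed in this work): a batch of b records waits on average b/(2·rate) seconds to fill, then incurs one fixed per-batch cost plus a per-record cost.

```python
def batch_tradeoff(batch_size, arrival_rate, fixed_cost, per_record_cost):
    """Toy latency/throughput model for batched stream processing.
    arrival_rate in records/s, costs in seconds; returns (latency, throughput)."""
    wait = batch_size / (2.0 * arrival_rate)            # avg. batch fill delay
    service = fixed_cost + batch_size * per_record_cost # cost to process a batch
    latency = wait + service                            # per-record latency
    throughput = batch_size / service                   # sustainable records/s
    return latency, throughput

# Larger batches amortize the fixed cost (higher throughput)
# but delay individual records (higher latency).
lat1, thr1 = batch_tradeoff(1, 1000, 0.01, 0.001)
lat100, thr100 = batch_tradeoff(100, 1000, 0.01, 0.001)
assert thr100 > thr1 and lat100 > lat1
```

A holistic cost model of the kind proposed here would extend such a formula with the degree of parallelism and then search the configuration space for given throughput or latency goals.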
"Towards Scalable Real-time Analytics: An Architecture for Scale-out of OLXP Workloads" (Thomas Bodner)
In this talk, we present the architecture of the SAP HANA Scale-out Extension (SOE). The goal of HANA SOE is to complement the scale-up oriented HANA core data platform with massive scale-out capabilities for large-scale analytics over real-time data. This is achieved by decoupling the main database components and providing them as services in a distributed landscape, a design choice made possible by recent advances in high-throughput, low-latency networks and storage devices. We detail three central components of HANA SOE and their interplay: a distributed shared log, a transaction broker and a distributed query executor. Furthermore, we report on the ongoing integration of HANA SOE with the Apache big data ecosystem.
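The role a shared log plays in such decoupled designs can be illustrated with a toy sketch (our own simplification for exposition, not SAP's implementation): writers append transactions to a totally ordered log, and any component replaying the log from the start observes the same order.

```python
import threading

class SharedLog:
    """Toy totally ordered append-only log. Appends return a log
    sequence number (LSN); readers replaying from LSN 0 all see
    the same sequence, which decoupled services can rely on."""

    def __init__(self):
        self._entries = []
        self._lock = threading.Lock()

    def append(self, entry) -> int:
        with self._lock:
            self._entries.append(entry)
            return len(self._entries) - 1  # LSN of the new entry

    def read_from(self, lsn: int):
        with self._lock:
            return list(self._entries[lsn:])

log = SharedLog()
assert log.append({"txn": 1, "op": "insert"}) == 0
assert log.append({"txn": 2, "op": "update"}) == 1
assert [e["txn"] for e in log.read_from(0)] == [1, 2]
```

In a real system the log itself is distributed and replicated; the single-process version only shows the ordering contract between the components named above.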
"Design and Development of Medical Recommendation System for Home Care Service for Geriatrics" (Saliha Irem Besik)
Demands and expectations for health care have gradually increased with longer life expectancy and a declining birth rate; however, the resources reserved for health services are relatively limited. Countries with aging populations are trying to develop new systems to use current resources more effectively. The aging population and the resulting chronic illnesses have become a real problem for Turkey as well. The increase in the elderly population results in more demand for health care because of aging-associated physical or mental limitations and chronic illnesses. Research illustrates that home care services for seniors speed up the healing process. The aim of the thesis is to develop a medical recommendation system (RHCS) that generates treatment and care plan recommendations to assist health professionals in making decisions on the treatment process of geriatric patients. The developed recommendation system will be part of an integrated patient-based e-health platform that provides home health care for elderly people who need care, including all of the actors (particularly relatives of elderly people) involved in the nursing period. One of the distinctive points of this study lies in its methodology, which empowers a collaborative filtering recommendation approach with historical data of geriatric patients. Its ontology-based approach, electronic health record structure, and compatibility with the ICD-10 and ATC clinical classification systems also make this study prominent. RHCS has been evaluated both by offline experiments with historical patient data obtained from Ankara Numune Hospital and by user studies conducted with 13 doctors. The results are measured by three different types of evaluation metrics, and in each case RHCS is shown to generate reliable and relevant recommendations. As future work, RHCS will be adapted to integrate with a rule-based clinical decision support system.
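The collaborative-filtering idea mentioned in the abstract can be sketched minimally (a generic user-based scheme with hypothetical patient and treatment identifiers, not the actual RHCS algorithm): patients similar to the target, measured on their historical treatment outcomes, vote for treatments the target has not yet received.

```python
import math

def cosine(u: dict, v: dict) -> float:
    """Cosine similarity between two sparse rating dictionaries."""
    num = sum(u[i] * v[i] for i in set(u) & set(v))
    den = (math.sqrt(sum(x * x for x in u.values()))
           * math.sqrt(sum(x * x for x in v.values())))
    return num / den if den else 0.0

def recommend(target: dict, history: dict, top_k: int = 1):
    """User-based collaborative filtering: score each unseen treatment
    by similarity-weighted outcomes from other patients' histories.
    `history` maps patient id -> {treatment id: outcome score}."""
    scores = {}
    for pid, ratings in history.items():
        sim = cosine(target, ratings)
        for item, rating in ratings.items():
            if item not in target:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

history = {
    "p1": {"t_a": 5, "t_b": 4, "t_c": 1},
    "p2": {"t_a": 4, "t_c": 5},
}
print(recommend({"t_a": 5, "t_b": 5}, history))  # ['t_c']
```

A production system such as the one described would additionally encode treatments against ICD-10/ATC codes and combine this signal with the ontology-based reasoning the abstract mentions.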
"Textueller Ähnlichkeitsjoin auf HPCC am Beispiel von MassJoin" (Michael Kraemer)
In existing work, including the MassJoin algorithm, the MapReduce programming model has established itself as a widely used technology. MapReduce leaves it to the user to partition the problem sensibly and to develop the application logic for the map and reduce processes; the map and reduce phases are not optimized automatically. Studies show that the MapReduce-based MassJoin algorithm exhibits unsatisfactory scaling and runtime behavior. It is suspected that this unfavorable behavior is rooted in the properties of the MapReduce programming paradigm, which motivates further research. In this thesis, the HPCC cluster system together with its query language ECL is to be used, as its system architecture offers advantages over MapReduce. Combining the techniques of the MassJoin algorithm with the HPCC system appears to be a promising approach to solving the performance problems.
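The division of labor described above, where the framework only groups by key while all application logic lives in user-supplied map and reduce functions, can be shown with a minimal single-process MapReduce skeleton (an illustrative sketch, not Hadoop's or HPCC's API):

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Minimal in-process MapReduce: the 'framework' only shuffles
    key/value pairs; map_fn and reduce_fn are entirely user-supplied,
    and nothing about them is optimized automatically."""
    groups = defaultdict(list)
    for rec in records:
        for key, value in map_fn(rec):   # map phase
            groups[key].append(value)    # shuffle: group values by key
    return {k: reduce_fn(k, vs) for k, vs in groups.items()}  # reduce phase

# User logic: token frequencies, a typical preprocessing step
# (e.g., for frequency-ordered prefix filters in similarity joins).
counts = run_mapreduce(
    ["data base join", "data base query"],
    map_fn=lambda line: [(tok, 1) for tok in line.split()],
    reduce_fn=lambda key, values: sum(values),
)
print(counts["data"])  # 2
```

A declarative dataflow language such as ECL, by contrast, lets the system plan and optimize such grouping and aggregation steps itself, which is the architectural advantage the thesis aims to exploit.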