Institut fuer Informatik, Humboldt-Universitaet zu Berlin
13.11.2012 23:00

Talk at GIS Day 2012 at GFZ Potsdam: "Efficient Clustering of Geo-Spatial Timeseries using Parallel Dataflow Programs"

By: Matthias J. Sax

In this talk, we present initial results of our joint work with Dr. Mike Sips from GeoForschungsZentrum (GFZ) Potsdam (German Research Centre for Geosciences) on efficiently clustering large geo-spatial time series on large compute clusters.

This research collaboration was stimulated by the Research Training Group METRIK and also involves the research project Stratosphere.

Spatial time series are a common result of environmental process simulations. To gain insight into the processes described by the output of those simulations, geo-scientists need to assess the data’s spatial and temporal dimensions. One popular approach to assessing the simulation output is to group the spatial situations associated with each time step into clusters; the resulting clusters are considered to be characteristic spatial situations of the simulated process. The time steps of the time series are then assigned to the detected characteristic spatial situations to understand the temporal behavior of the environmental processes. However, clustering is computationally expensive. Furthermore, the simulation output usually amounts to hundreds of gigabytes of raw data, and many algorithms cannot produce results in reasonable time.

We propose an approach for computing a hierarchical clustering of large geo-spatial time series. Hierarchical clustering allows geo-scientists to explore the data; exploring a hierarchy of clusters helps to identify the characteristic spatial situations. In contrast to many other clustering approaches, we designed an algorithm that performs feature extraction to condense the data, which reduces the computational effort. We defined a similarity measure for spatial situations based on those features. Furthermore, to address the challenge of high computation cost, we developed a parallel algorithm. We implemented our algorithm as a dataflow program using the Stratosphere system and evaluated our approach on a selected data set.
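
To make the pipeline described above more concrete, the following is a minimal sequential sketch in Python, not the algorithm presented in the talk. The feature definition (mean value per grid tile), the Euclidean distance, the average-linkage merge rule, and all function names are assumptions chosen for illustration only; the announcement does not specify the actual features or similarity measure, and the sketch does not use the Stratosphere system.

```python
from itertools import combinations
from math import sqrt


def extract_features(grid, block=2):
    """Hypothetical feature extraction: the mean of each block x block tile.

    This condenses one spatial situation (a 2-D grid) into a short vector.
    """
    rows, cols = len(grid), len(grid[0])
    feats = []
    for r in range(0, rows, block):
        for c in range(0, cols, block):
            tile = [grid[i][j]
                    for i in range(r, min(r + block, rows))
                    for j in range(c, min(c + block, cols))]
            feats.append(sum(tile) / len(tile))
    return feats


def distance(a, b):
    """Assumed dissimilarity: Euclidean distance between feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hierarchical_clustering(features):
    """Naive average-linkage agglomerative clustering.

    Returns the merge history as (cluster_a, cluster_b, linkage_distance).
    Ids 0..n-1 are the time steps; each merge creates the next free id.
    """
    clusters = {i: [i] for i in range(len(features))}
    merges = []
    next_id = len(features)

    def linkage(p, q):
        # Average pairwise distance between the members of two clusters.
        pairs = [(i, j) for i in clusters[p] for j in clusters[q]]
        return sum(distance(features[i], features[j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=lambda pq: linkage(*pq))
        merges.append((a, b, linkage(a, b)))
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        next_id += 1
    return merges


# Toy data: four time steps, each a 2 x 4 grid (made-up values).
time_series = [
    [[0, 0, 1, 1], [0, 0, 1, 1]],
    [[0, 0, 1, 1], [0, 1, 1, 1]],
    [[5, 5, 9, 9], [5, 5, 9, 9]],
    [[5, 5, 9, 9], [5, 7, 9, 9]],
]
features = [extract_features(grid) for grid in time_series]
merges = hierarchical_clustering(features)
```

In this toy example the first two time steps merge first, then the last two, and finally the two resulting groups. The merge history forms the hierarchy a geo-scientist would explore. The pairwise distance computations are independent of each other, which is one plausible place for parallelization, although the announcement does not say how the parallel algorithm is organized.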