Query Optimization in RDF Databases
The Resource Description Framework (RDF) defines a graph-based data model for representing arbitrary information. In RDF, resources represent real-world entities and are uniquely identified by URIs. The key structure for representing information is the statement, consisting of a subject, a predicate, and an object, which in most cases are resources. A set of statements thus forms a directed labeled graph: nodes represent the subjects and objects, and edges represent the predicates of the statements.
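As a minimal sketch of this data model (the URIs and helper function below are illustrative, not part of any RDF library), a set of statements can be held as subject–predicate–object triples, and the outgoing edges of a node recovered from them:

```python
# Illustrative RDF statements as (subject, predicate, object) triples.
# Subjects and predicates are URIs; objects are URIs or literals ("Alice").
statements = [
    ("http://example.com/alice", "http://xmlns.com/foaf/0.1/knows", "http://example.com/bob"),
    ("http://example.com/alice", "http://xmlns.com/foaf/0.1/name", "Alice"),
]

def neighbors(stmts, subject):
    """All (predicate, object) edges leaving a subject node of the graph."""
    return [(p, o) for s, p, o in stmts if s == subject]

edges = neighbors(statements, "http://example.com/alice")  # two outgoing edges
```

Each statement contributes one labeled edge, which is why a statement set and a directed labeled graph are interchangeable views of the same data.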
Relational database technology has been researched for decades and is widely used in production systems. To leverage this experience, most approaches build on top of relational DBMSs to manage RDF data. However, RDBMSs have been optimized for tabular data, not graph-like data. In addition, RDF comes with its own query language, SPARQL, which in particular allows retrieving subgraphs from the RDF data.
On the basis of the Berlin SPARQL Benchmark, researchers measured the execution time of a mix of SPARQL queries and their corresponding SQL queries (cf. diagram). The results show that the SQL queries run faster on the relational database system than the SPARQL queries on RDF repositories, by a factor of 5 to 20. Given these results, we believe there is most likely room for improving query processing within existing RDF repositories. We focus on three areas: physical management of the RDF data, its indexing, and query optimization.
Native storage of RDF data
A native RDF database has the advantage that it can exploit the specific characteristics of the RDF data model, the RDF data, and the query language. We develop an RDF database that takes advantage of the following characteristics:
The main principle of the native RDF store is to achieve high data locality, i.e., we arrange statements on database pages so that the query engine needs to access only a few of them to answer a query. The placement works as follows:
Additional placement strategies can be applied to further reduce the number of pages read: if pieces of information are known to be likely accessed together (e.g., the address of a person), they are stored together on the same database page – we refer to this as semantic closure.
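The idea behind such a placement can be sketched as follows. This is a simplified model under assumed names (`PAGE_CAPACITY`, `place_statements` are illustrative, not the store's actual implementation): statements are grouped by subject, and each group is kept together on a page so that, e.g., all parts of a person's address end up on the same page.

```python
from collections import defaultdict

PAGE_CAPACITY = 4  # statements per page (illustrative; real pages hold far more)

def place_statements(statements):
    """Group statements by subject so related facts share a database page.

    Simplification: a subject group larger than PAGE_CAPACITY still goes
    onto a single page; a real store would split it."""
    by_subject = defaultdict(list)
    for s, p, o in statements:
        by_subject[s].append((s, p, o))
    pages, current = [], []
    for group in by_subject.values():
        if current and len(current) + len(group) > PAGE_CAPACITY:
            pages.append(current)          # start a new page rather than split the group
            current = []
        current.extend(group)
    if current:
        pages.append(current)
    return pages

stmts = [
    ("ex:alice", "ex:name", "Alice"),
    ("ex:alice", "ex:city", "Berlin"),
    ("ex:alice", "ex:zip", "10115"),
    ("ex:bob", "ex:name", "Bob"),
    ("ex:bob", "ex:city", "Potsdam"),
    ("ex:bob", "ex:zip", "14482"),
]
pages = place_statements(stmts)  # one page per subject here
```

A query asking for Alice's full address then touches a single page instead of several.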
Constructing a query execution plan
Storing the RDF data effectively is only one aspect of improving query execution. Another is to optimize the query itself and to construct a good query execution plan (QEP). Typically, query execution is divided into four phases: query parsing, query rewriting, QEP generation, and QEP execution. Although our research focuses on query rewriting, we developed the SPARQL Query Graph Model (SQGM) to represent SPARQL queries and to support all phases of query processing. This model is the key data structure for storing information relevant to query optimization and for transforming the query. Its basic elements are the following:
PREFIX ub: <http://.../univ-bench.owl#>
In the query rewriting phase, an SQGM is transformed into a semantically equivalent one that leads to a better execution strategy when the query optimizer generates query execution plans. For instance, transformation rules may simplify complex queries by merging graph patterns (thus avoiding join operations) and by eliminating redundant or contradictory restrictions.
Transformation rules change an SQGM only locally, i.e., an operator and its immediate neighbors are affected. For example, a transformation rule can merge two graph pattern operators and a join operator into a single graph pattern operator (see figure below).
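The local character of such a rule can be illustrated with a simplified operator representation (the dict-based operators and the function name below are hypothetical stand-ins, not the actual SQGM implementation): the rule inspects one operator and its immediate inputs, and rewrites nothing else.

```python
def merge_join_of_patterns(op):
    """Transformation rule: if op is a join whose inputs are all graph pattern
    operators, merge them into a single graph pattern operator; otherwise the
    rule does not apply and the operator is returned unchanged."""
    if (op["type"] == "join"
            and all(child["type"] == "pattern" for child in op["inputs"])):
        merged = [t for child in op["inputs"] for t in child["triples"]]
        return {"type": "pattern", "triples": merged}
    return op

join_op = {
    "type": "join",
    "inputs": [
        {"type": "pattern", "triples": [("?x", "rdf:type", "ub:Student")]},
        {"type": "pattern", "triples": [("?x", "ub:memberOf", "?dept")]},
    ],
}
merged_op = merge_join_of_patterns(join_op)  # a single pattern operator with both triples
```

After the rewrite, the engine can match one larger basic graph pattern instead of joining two intermediate results.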
Rewrite strategies are more complex and affect the complete SQGM. Every rewrite strategy follows a certain goal, e.g., merging as many graph pattern operators as possible. To reach a goal it may be necessary to apply a single transformation rule, or a sequence of rules, several times.
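A common way to realize such a strategy is a fixpoint loop: keep applying the rules until none of them changes the SQGM anymore. The sketch below abstracts the SQGM to a toy value so the loop itself is visible; the function name and the stand-in rule are illustrative assumptions.

```python
def apply_until_fixpoint(sqgm, rules):
    """Apply transformation rules repeatedly until no rule changes the SQGM."""
    changed = True
    while changed:
        changed = False
        for rule in rules:
            rewritten = rule(sqgm)
            if rewritten != sqgm:
                sqgm, changed = rewritten, True
    return sqgm

# Toy stand-in for "merge as many graph pattern operators as possible":
# the SQGM is abstracted to a list of pattern operators (here: their triple
# counts), and the rule merges the first two operators into one.
merge_first_two = lambda ops: [ops[0] + ops[1]] + ops[2:] if len(ops) > 1 else ops
result = apply_until_fixpoint([1, 2, 3], [merge_first_two])  # -> [6]
```

Termination has to be argued per strategy; here each application strictly shrinks the operator list, so the loop must stop.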
For example, query rewriting can be used to support the fast path algorithm implemented in the Jena Semantic Web Framework or the selection of indexes.
We assume that users create indexes over basic graph patterns. Creating an index means that all occurrences of the pattern in the data graph are materialized. Thus, parts of a query that are equivalent to an existing index need not be computed at run time, and information about the frequency of matches of subgraphs may be used to determine an optimal plan for query evaluation.
CREATE INDEX I1 ON <http://example.com/university.owl>
With indexes involved, query processing works as follows:
Eligible index: An index is eligible for a query if the index pattern is completely contained in the query pattern (see figure below: I2 is eligible while I1 is not). A more interesting case arises when the optimizer chooses not just a single index but several, possibly overlapping, indexes for query evaluation, because then a larger part of the query need not be computed.
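The containment test can be sketched as follows. This is a deliberately simplified check under stated assumptions: patterns are sets of triple templates, and variable renaming between index and query is ignored (a full test would match patterns up to a consistent renaming of variables).

```python
def is_eligible(index_pattern, query_pattern):
    """An index is eligible if every triple template of its pattern also occurs
    in the query pattern (variable renaming ignored for brevity)."""
    return set(index_pattern) <= set(query_pattern)

query = [("?x", "rdf:type", "ub:Student"), ("?x", "ub:memberOf", "?d")]
idx_contained = [("?x", "rdf:type", "ub:Student")]  # contained -> eligible
idx_other     = [("?x", "ub:advisor", "?p")]        # not contained -> not eligible
```

An eligible index lets the optimizer replace the covered part of the query with a lookup of the materialized occurrences.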
Overlapping indexes: Intuitively, two indexes overlap if part of the pattern of one index is contained in the pattern of the other.
Estimating costs: One factor influencing the costs is the selectivity of an index, calculated as the ratio between the number of occurrences selected by the index and the size of the input data. A small selectivity value is better, because only a small number of occurrences is obtained from the index and has to be considered in further query processing, e.g., when testing for subgraph isomorphism.
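The ratio itself is straightforward; the numbers in the example below are made up for illustration only:

```python
def selectivity(num_occurrences, input_size):
    """Selectivity = occurrences selected by the index / size of the input data."""
    return num_occurrences / input_size

# A lower value is better: fewer occurrences survive into later processing
# steps such as the subgraph isomorphism test.
sel_narrow = selectivity(50, 10_000)     # 0.005
sel_broad  = selectivity(2_000, 10_000)  # 0.2
```

When ranking candidate indexes, the optimizer would prefer the one with the smaller selectivity value, all else being equal.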
We implemented the SQGM on top of Jena and ARQ. (1) The ARQ query processor parses the query and generates an ARQ-specific query model. (2) The generated query model is then translated into an SQGM, and (3) heuristics are applied to provide a good basis for an efficient query execution plan (QEP). (4) Finally, the restructured SQGM is translated back into an ARQ query model, which is executed.
Based on the Lehigh University Benchmark, we generated three sets of RDF data of increasing size, which were managed by an RDBMS. We developed 41 queries combining basic graph patterns, OPTIONAL, FILTER, and UNION clauses in various ways. For each query we measured the execution time twice: once with and once without applying the heuristics.