Меренков Андрей Вадимович

RDF3X: a RISCstyle Engine for RDF

Thomas Neumann, Gerhard Weikum MaxPlanckInstitut fur Informatik Saarbr Brucken, Germany neumann@mpiinf.mpg.de weikum@mpiinf.mpg.de

ABSTRACT

RDF is a data representation format for schema-free structured information that is gaining momentum in the context of Semantic-Web corpora, life sciences, and also Web 2.0 platforms. The \pay-as-you-go" nature of RDF and the exible pattern-matching capabilities of its query language SPARQL entail e_ciency and scalability challenges for complex queries including long join paths. This paper presents the RDF-3X engine, an implementation of SPARQL that achieves excellent performance by pursuing a RISC-style architecture with a streamlined architecture and carefully designed, puristic data structures and operations. The salient points of RDF-3X are:

a generic solution for storing and indexing RDF triples that completely eliminates the need for physical-design tuning;
a powerful yet simple query processor that leverages fast merge joins to the largest possible extent;
a query optimizer for choosing optimal join orders using a cost model based on statistical synopses for entire join paths.

The performance of RDF-3X, in comparison to the previously best state-of-the-art systems, has been measured on several large-scale datasets with more than 50 million RDF triples and benchmark queries that include pattern matching and long join paths in the underlying data graphs.

1. INTRODUCTION

1.1 Motivation and Problem

The RDF (Resource Description Framework) data model has been around for a decade. It has been designed as a flexible representation of schema-relaxable or even schema-free information for the SemanticWeb [30]. In the commercial IT world, RDF has not received much attention until recently, but now it seems that RDF is building up a strong momentum. Semantic-Web-style ontologies and knowledge bases with millions of facts from Wikipedia and other sources have been created and are available online [4, 40, 47]. E-science data repositories support RDF as an import/export format and also for selective (thus, query-driven) data extractions, most notably, in the area of life sciences (e.g., [7, 43]). Finally, Web 2.0 platforms for online communities are considering RDF as a non-proprietary exchange format and as an instrument for the construction of information mash ups [21,22, 31].

In RDF, all data items are represented in the form of (subject; predicate; object) triples, also known as (subject; property; value) triples. For example, information about the movie "Sweeney Todd" could include the following triples:


(id1; hasTitle; "Sweeney Todd"),

(id1; producedInY ear; "2007"),

(id1; directedBy; "Tim Burton"),

(id1; hasCasting; id2),

(id2; RoleName; "Sweeney Todd"), (id2; Actor; id11),

(id1; hasCasting; id3),

(id3; RoleName; "Mrs: Lovett"), (id3; Actor; id12),

(id11; hasName; "Johnny Depp"),

(id12; hasName; "Helena Bonham Carter"), and so on.

Note that, although predicate names such as "producedInYear" resemble attributes, there is no database schema; the same database may contain triples about movies with a predicate "productionYear". A schema may emerge in the long run (and can then be described by the RDF Vocabulary Description Language). In this sense, the notion of RDF triples _ts well with the modern notion of data spaces and its "pay as you go" philosophy [18]. Compared to an entity-relationship model, both an entity's attributes and its relationships to other entities are represented by predicates. All RDF triples together can be viewed as a large (instance-level) graph.

The SPARQL query language is the official standard for searching over RDF repositories. It supports conjunctions (and also disjunctions) of triple patterns, the counterpart to select-project-join queries in a relational engine. For example, we can retrieve the titles of all movies with Johnny Depp by the SPARQL query:


select ?title Where {

?m  ?title. ?m  ?c.

?c  ?a. ?a  "Johnny Depp"}

Here each of the conjunctions, denoted by a dot, corresponds to a join. The whole query can also be seen as graph pattern that needs to be matched in the RDF data graph. In SPARQL, predicates can also be variables or wildcards, thus allowing schema-agnostic queries. RDF engines for storing, indexing, and querying have been around for quite a few years; especially, the Jena framework by HP Labs has gained significant popularity [46], and Oracle also provides RDF support for semantic data integration in life sciences and enterprises [11, 29]. However, with the exception of the VLDB 2007 paper by Abadi et al. [1] none of the prior implementations could demonstrate convincing efficiency, failing to scale up towards large datasets and high load. [1] achieves good performance by grouping triples with the same property name into property tables, mapping these onto a column store, and creating materialized views for frequent joins. Managing large-scale RDF data includes technical challenges for the storage layout, indexing, and query processing:

1. The absence of a global schema and the diversity of predicate names pose major problems for the physical database design. In principle, one could rely on an auto-tuning \wizard" to materialize frequent join paths; however, in practice, the evolving structure of the data and the variance and dynamics of the workload turn this problem into a complex sisyphus task.

2. By the fine-grained modeling of RDF data — triples instead of entire records or entities — queries with a large number of joins will inherently form a large part of the workload, but the join attributes are much less predictable than in a relational setting. This calls for specific choices of query processing algorithms, and for careful optimization of complex join queries; but RDF is meant for on-the-y applications over data spaces, so the optimization takes place at query run-time.

3. As join-order and other execution-plan optimizations require data statistics for selectivity estimation, an RDF engine faces the problem that a suitable granularity of statistics gathering is all but obvious in the absence of a schema. For example, single-dimensional histograms on all attributes that occur in the workload's where clauses — the state-of-the-art approach in relational systems — is unsuitable for RDF, as it misses the effects of long join chains or large join stars over many-to-many relationships.

4. Although RDF uses XML syntax and SPARQL involves search patterns that resemble XML path expressions, the fact that RDF triples form a graph rather than a collection of trees is a major difference to the more intensively researched settings for XML.

1.2 Contribution and Outline

This paper gives a comprehensive, scalable solution to the above problems. It presents a complete system, coined RDF-3X (for RDF Triple express), designed and implemented from scratch specifically for the management and querying of RDF data. RDF-3X follows the rationale advocated in [24, 39] that data-management systems that are designed for and customized to specific application domains can outperform generic mainstream systems by two orders of magnitude. The factors in this argument include:

1. tailored choices of data structures and algorithms rather than supporting a wide variety of methods;

2. much lighter software footprint and overhead, and as a result;

3. simplified optimization of system internals and easier configuration and self-adaptation to changing environments (e.g., data and workload characteristics).

RDF-3X follows such a RISC-style design philosophy [10], with "reduced instruction set" designed to support RDF. RDF-3X is based on a three key principles:

Physical design is workload-independent by creating appropriate indexes over a single "giant triples table". RDF-3X does not rely on the success (or limitation) of an auto-tuning wizard, but has effectively eliminated the need for physical-design tuning. It does so by building indexes over all 6 permutations of the three dimensions that constitute an RDF triple, and additionally, indexes over count-aggregated variants for all three two-dimensional and all three one-dimensional projections. Each of these indexes can be compressed very well; the total storage space for all indexes together is less than the size of the primary data.
The query processor is RISC-style by relying mostly on merge joins over sorted index lists. This is made possible by the "exhaustive" indexing of the triples table. In fact, all processing is index-only, and the triples table exists merely virtually. Operator trees are constructed so as to preserve interesting orders [17] for subsequent joins to the largest possible extent; only when this is no longer possible, RDF-3X switches to hash-based join processing. This approach can be highly optimized at the code level, and has much lower overhead than traditional query processors. At the same time, it is sufficiently versatile to support also the various duplicate-elimination options of SPARQL, disjunctive patterns in queries, and all other features that SPARQL requires.
The query optimizer mostly focuses on join order in its generation of execution plans. It employs dynamic programming for plan enumeration, with a cost model based on RDF-specific statistical synopses. These statistics include counters of frequent predicate-sequences in paths of the data graph; such paths are potential join patterns. Compared to the query optimizer in a universal database system, the RDF-3X optimizer is simpler but much more accurate in its selectivity estimations and decisions about execution plans.

The scientific contributions of this work are:

a novel architecture for RDF indexing and querying, eliminating the need for physical database design,
an optimizer for large join queries over non-schematic RDF triples, driven by a new kind of selectivity estimation for RDF paths,
a comprehensive performance evaluation, based on real measurements with three large datasets, demonstrating large gains over the previously best engine [1] (by a typical factor of 5 and up to 20 for some queries). The source code of RDF-3X and the experimental data are available for non-commercial purposes upon request.