Topic and Relevance

In this tutorial, we focus our attention on benchmarking and we limit our scope to RDF, which is the latest data exchange format to gain traction for representing information in the Semantic Web. Our interest in RDF is well-timed for two reasons:

Benchmarks can be used to inform users of the strengths and weaknesses of competing software products, but more importantly, they encourage the advancement of technology by providing both academia and industry with clear targets for performance and functionality.

Given the multitude of usage scenarios of RDF systems, one can ask the following questions:

These are particularly hard questions whose answers require an in-depth understanding both of the domains where RDF is used and of which benchmarks are appropriate for which domains. In this tutorial, we provide some guidance in this respect by discussing state-of-the-art RDF benchmarks and, if time permits, graph benchmarks.

In more detail, in this tutorial we will:

  • introduce the attendees to the design principles of RDF benchmarks
  • discuss the dimensions of an RDF benchmark, namely the query workloads, the performance metrics, the datasets or data generators (in the case of synthetic benchmarks) employed, and the rules that RDF engines should follow when running the benchmark
  • provide a comprehensive overview of the existing RDF benchmarks with an analysis along the aforementioned dimensions
  • discuss the advantages and disadvantages of the existing benchmarks and the research directions that should be explored to create novel benchmarks that answer the needs of the Linked Data paradigm.

Tutorial Content

Principles of RDF Benchmarks

In this tutorial we discuss the principles a benchmark should adhere to. Building on the existing principles, we elaborate in the concluding remarks of the tutorial on an extended set that can be used for designing new RDF benchmarks. We start with the principles that Jim Gray proposed in his book The Benchmark Handbook for Database and Transaction Systems [1]. We then discuss the more recent idea of choke-point-based benchmark design, which builds a benchmark around the technical difficulties (coined choke points) that a query processing framework should address [2]. We also discuss the recent work by Aluc et al. [3], which introduces query features as a systematic way of evaluating the variability of datasets and workloads in a SPARQL benchmark.

Dimensions of RDF Benchmarks

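To provide a comprehensive analysis of the state-of-the-art benchmarks, we discuss them along a set of dimensions. The dimensions that we consider, and that together constitute a benchmark, are the datasets or data generators, the query workload, the performance metrics, and the rules for running the benchmark.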

In this tutorial, we distinguish between benchmarks that use real datasets and those that produce synthetic datasets using special-purpose data generators. For each dataset we discuss the schemas employed, the data characteristics in terms of the number of triples, distinct URIs and literals, and the distributions that the data follow. We also present other characteristics of RDF datasets viewed as graphs, such as sparseness in terms of indegree and outdegree. Regarding the workload, we provide an analysis of the queries and, where appropriate, the updates supported by the benchmark. For this analysis, we focus on the number of SPARQL operators (join, union, optional) as well as the filter expressions included in the SPARQL queries. Moreover, we discuss the features included in the workload (nested queries, aggregation, sorting, etc.). Last, we present the metrics adopted by each benchmark to judge the performance of RDF engines, and the rules that must be followed when running the benchmark.
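The dataset characteristics listed above can be illustrated with a minimal sketch. It assumes a hypothetical toy dataset held in memory as (subject, predicate, object) tuples, with URIs written in angle brackets and literals in double quotes; the function and key names are illustrative and do not come from any of the benchmarks discussed.

```python
from collections import Counter

# Hypothetical toy dataset: (subject, predicate, object) triples.
# Assumed convention: URIs in angle brackets, literals in double quotes.
TRIPLES = [
    ("<ex:alice>", "<ex:knows>", "<ex:bob>"),
    ("<ex:alice>", "<ex:knows>", "<ex:carol>"),
    ("<ex:bob>", "<ex:knows>", "<ex:carol>"),
    ("<ex:alice>", "<ex:name>", '"Alice"'),
    ("<ex:bob>", "<ex:name>", '"Bob"'),
]


def is_uri(term):
    return term.startswith("<") and term.endswith(">")


def is_literal(term):
    return term.startswith('"')


def dataset_stats(triples):
    """Compute simple size and degree statistics for a list of triples."""
    uris, literals = set(), set()
    out_degree, in_degree = Counter(), Counter()
    for s, p, o in triples:
        for term in (s, p, o):
            if is_uri(term):
                uris.add(term)
            elif is_literal(term):
                literals.add(term)
        out_degree[s] += 1          # every triple is an outgoing edge of s
        if is_uri(o):
            in_degree[o] += 1       # only resource objects are graph nodes
    return {
        "triples": len(triples),
        "distinct_uris": len(uris),
        "distinct_literals": len(literals),
        "max_out_degree": max(out_degree.values(), default=0),
        "max_in_degree": max(in_degree.values(), default=0),
        "avg_out_degree": len(triples) / len(out_degree) if out_degree else 0.0,
    }


if __name__ == "__main__":
    for name, value in dataset_stats(TRIPLES).items():
        print(f"{name}: {value}")
```

A real dataset would be read with an RDF parser rather than hard-coded, but the same counters apply to the parsed triples.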

RDF Benchmarks

We discuss and compare the existing RDF benchmarks along the aforementioned dimensions in order to derive a complete assessment of them.

We first present the real benchmarks proposed in recent years, that is, the benchmarks that employ real datasets and workloads. We discuss the DBpedia SPARQL Benchmark (DBPSB), which was proposed by the University of Leipzig [4] and introduces a query workload derived from the DBpedia query logs. We also present the UniProt KnowledgeBase (UniProtKB) [5] along with its set of queries [6]. UniProtKB is a high-quality dataset describing protein sequences and related functional information, expressed in RDF. In addition to these datasets, we discuss the YAGO [7] knowledge base, which integrates statements from Wikipedia, WordNet, WordNet Domains, Universal WordNet and GeoNames. Unlike UniProtKB and DBpedia, the YAGO dataset is not accompanied by a set of queries. However, we will discuss the queries proposed by Neumann et al. [8], who provided eight mostly lookup and join queries over an earlier version of YAGO for benchmarking the RDF-3X engine.

In addition to benchmarks that use real-world datasets, we also elaborate on the state-of-the-art synthetic RDF benchmarks. We start with the Lehigh University Benchmark (LUBM) [9], which is intended to evaluate the performance of Semantic Web repositories. In the tutorial we discuss the process LUBM employs to generate datasets, which takes into account the specified selectivity and expected result size of each query. We also elaborate on LUBM's workload, which consists mainly of simple lookup and join queries that retrieve only data triples. We further present LUBM's metrics, which go beyond the standard query response time to include the completeness and soundness of query results as well as a combined metric for query performance. The University Ontology Benchmark (UOBM) [10], which is based on LUBM, is also addressed in this tutorial. This benchmark tackles complex inference and includes queries that address scalability issues in addition to those studied by LUBM. SP2Bench [11] is also included in the set of benchmarks studied in this tutorial, since it is one of the most commonly used benchmarks for evaluating the performance of RDF engines. The benchmark contains both a data generator and a set of queries. The generator produces arbitrarily large datasets while respecting the constraints of the benchmark's schema. The queries employ different SPARQL 1.0 operators and are designed to test different approaches to SPARQL optimization. Finally, in the category of synthetic RDF benchmarks, we elaborate on the Berlin SPARQL Benchmark (BSBM) [12], a broadly accepted and widely used benchmark built around an e-commerce scenario. The latest version of the benchmark [13] that we discuss in the tutorial comes with a scalable data generator and a test driver, as well as a set of queries that measure the performance of RDF engines on very large datasets but not their ability to perform complex reasoning tasks.
Special attention will be given to the performance metrics used by BSBM, which are very close to the ones proposed by TPC-H and go beyond those of other benchmarks, whose metrics are either absent or limited to query response time.
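To make the metrics mentioned above concrete, the following is a minimal sketch of how soundness and completeness of a query answer, and a throughput figure in the style of BSBM's query mixes per hour, could be computed. The function names, the set-based representation of answers and the sample numbers are assumptions made for illustration; the precise definitions and measurement rules are those given in the respective benchmark specifications.

```python
def soundness(returned, expected):
    """Fraction of returned answers that are correct (assumed definition)."""
    returned, expected = set(returned), set(expected)
    if not returned:
        return 1.0  # convention chosen here: an empty answer has no wrong rows
    return len(returned & expected) / len(returned)


def completeness(returned, expected):
    """Fraction of expected answers that were returned (assumed definition)."""
    returned, expected = set(returned), set(expected)
    if not expected:
        return 1.0  # convention chosen here: nothing to find
    return len(returned & expected) / len(expected)


def query_mixes_per_hour(num_mixes, elapsed_seconds):
    """Throughput: complete query mixes executed, scaled to one hour."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_mixes * 3600.0 / elapsed_seconds


if __name__ == "__main__":
    # Hypothetical answers for one query.
    expected = {"a", "b", "c", "d"}
    returned = {"a", "b", "x"}
    print("soundness:", soundness(returned, expected))
    print("completeness:", completeness(returned, expected))
    # Hypothetical run: 50 query mixes completed in 120 seconds.
    print("QMpH:", query_mixes_per_hour(50, 120))
```

Reporting soundness and completeness alongside throughput matters for engines that perform reasoning, because a fast engine that returns incomplete answers would otherwise look better than it is.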

Social Network Benchmarks

We also intend to provide a thorough analysis of the benchmarks that model social network graphs. We start with the Social Intelligence Benchmark (SIB) [14], a synthetic benchmark that simulates an RDF backend of a social network site (such as Facebook). SIB comes with a scalable data generator, a set of queries and a set of metrics. The synthetic data generation is driven by a set of parameters used to produce the social graph.

In the same spirit as SIB is the LinkBench [15] benchmark, which is based on Facebook's social graph. It is a synthetic benchmark whose sole objective is to predict the performance of a database when used for the persistent storage of Facebook's data. In the tutorial we present the components of the benchmark, including the graph store implementation, the graph generation and the workload generation, as well as the metrics employed for measuring the performance of the tested systems.

Benchmark Generators

In addition to the aforementioned benchmarks, we also discuss benchmark generation as proposed by Duan et al. [16]. In this work, the authors introduce the notions of coverage and coherence, which are used (a) to characterize datasets (real or synthetically produced) and (b) to drive the generation of benchmark datasets of a desired coherence and size. We also discuss the follow-up work of Arenas et al. [17], which provides a general framework for users to analyze their data and schemas and the relationships between the two. Such a framework can be central to both the selection and the generation of benchmarks for a particular system.
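As a rough illustration of these two notions, the sketch below computes a per-type coverage and a dataset-level coherence for a toy dataset. It assumes that the coverage of a type is the fraction of (instance, property) slots that are actually filled, and that coherence is the average of the coverages weighted by the number of properties plus the number of instances of each type. This is our reading of the definitions in [16] and should be checked against the paper; the data layout and names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: instance -> (type, set of properties it has set).
INSTANCES = {
    "p1": ("Person", {"name", "email"}),
    "p2": ("Person", {"name"}),
    "b1": ("Book", {"title", "isbn"}),
    "b2": ("Book", {"title", "isbn"}),
}


def coverage_by_type(instances):
    """Return {type: (coverage, number of properties, number of instances)}."""
    by_type = defaultdict(list)
    for type_name, props in instances.values():
        by_type[type_name].append(set(props))
    result = {}
    for type_name, prop_sets in by_type.items():
        type_props = set().union(*prop_sets)   # properties seen for this type
        if not type_props:
            continue                           # no properties: coverage undefined
        filled = sum(len(ps) for ps in prop_sets)
        slots = len(type_props) * len(prop_sets)
        result[type_name] = (filled / slots, len(type_props), len(prop_sets))
    return result


def coherence(instances):
    """Weighted average of per-type coverage (assumed weighting, see lead-in)."""
    per_type = coverage_by_type(instances)
    total_weight = sum(n_props + n_inst for _, n_props, n_inst in per_type.values())
    if total_weight == 0:
        return 0.0
    return sum(
        cov * (n_props + n_inst) / total_weight
        for cov, n_props, n_inst in per_type.values()
    )


if __name__ == "__main__":
    for type_name, (cov, n_props, n_inst) in coverage_by_type(INSTANCES).items():
        print(f"{type_name}: coverage={cov}, properties={n_props}, instances={n_inst}")
    print("coherence:", coherence(INSTANCES))
```

In this toy dataset every Book sets every Book property, whereas one Person lacks an email, so Book has full coverage and Person does not. A highly structured, relational-like dataset yields a coherence close to 1, while a dataset with many optional properties yields a lower value.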

Duration and Sessions

  • Introduction to the topic (15 minutes)
  • A short presentation of RDF and SPARQL (15 minutes)
  • Principles and Dimensions of RDF Benchmarks (30 minutes)
  • Presentation of RDF Benchmarks, Part I (30 minutes)
  • Coffee break (30 minutes)
  • Presentation of RDF Benchmarks, Part II (75 minutes)
  • Conclusions, directions and discussion (15 minutes)


This tutorial is aimed at a broad range of attendees: senior undergraduate and graduate students, more experienced researchers who are unfamiliar with the existing RDF benchmarks, and, more generally, scientists, data producers and data consumers whose applications require RDF query processing. Attendees are expected to become familiar with the existing RDF benchmarks as well as with the principles of benchmark development.


Knowledge of RDF and the SPARQL query language will be helpful to the audience.


[1] J. Gray, editor. The Benchmark Handbook for Database and Transaction Systems. Morgan Kaufmann, 1993.

[2] P. Boncz, T. Neumann, et al. TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark. In TPCTC, 2013.

[3] G. Aluc, O. Hartig, et al. Diversified Stress Testing of RDF Data Management Systems. In ISWC, 2014.

[4] M. Morsey, J. Lehmann, et al. DBpedia SPARQL Benchmark - Performance assessment with real queries on real data. In ISWC, 2011.

[5] N. Redaschi and UniProt Consortium. UniProt in RDF: Tackling Data Integration and Distributed Annotation with the Semantic Web. In Biocuration Conference, 2009.

[6] UniProtKB Queries.

[7] F. M. Suchanek, G. Kasneci, et al. Yago: a core of semantic knowledge. In WWW, 2007.

[8] T. Neumann and G. Weikum. The RDF-3X engine for scalable management of RDF data. The VLDB Journal, 19(1), 2010.

[9] LUBM.

[10] L. Ma, Y. Yang, et al. Towards a Complete OWL Ontology Benchmark. In ESWC, 2006.

[11] M. Schmidt, T. Hornung, et al. SP2Bench: A SPARQL performance benchmark. In ICDE, 2009.

[12] C. Bizer and A. Schultz. The Berlin SPARQL Benchmark. Int. J. Semantic Web and Inf. Sys., 5(2), 2009.

[13] Berlin SPARQL Benchmark (BSBM) Specification - V3.1. http://wifo5-

[14] M-D. Pham, P.A. Boncz, et al. S3G2: a Scalable Structure-correlated Social Graph Generator. In TPCTC, 2012.

[15] T. Armstrong, V. Ponnekanti, et al. LinkBench: a database benchmark based on the Facebook social graph. In SIGMOD, 2013.

[16] S. Duan, A. Kementsietsidis, et al. Apples and oranges: a comparison of RDF benchmarks and real RDF datasets. In SIGMOD, 2011.

[17] M. Arenas, G. I. Diaz, et al. A Principled Approach to Bridging the Gap between Graph Data and their Schemas. PVLDB, 7, 2014.