Topic and Relevance
In this tutorial, we focus on benchmarking and limit our scope to RDF, the latest data exchange format to gain traction for representing information in the Semantic Web. Our interest in RDF is timely for two reasons:
- there is a proliferation of RDF systems, and identifying the strong and weak points of these systems is important to support users in deciding which system to use for their needs; and
- surprisingly, there is a similar proliferation of RDF benchmarks, a development that adds to the confusion since it is not clear which benchmark(s) one should use (or trust) to evaluate existing, or new, systems.
Benchmarks can be used to inform users of the strengths and weaknesses of competing software products, but more importantly, they encourage the advancement of technology by providing both academia and industry with clear targets for performance and functionality.
Given the multitude of usage scenarios of RDF systems, one can ask the following questions:
- How can one come up with the right benchmark that accurately captures all these use cases?
- How can a benchmark capture the fact that RDF data are used to represent the whole spectrum of data, from structured (relational data converted to RDF) and semi-structured (XML data converted to RDF) to natively unstructured graph data?
- How can a benchmark capture the different data and query patterns and provide a consistent picture of system behavior across different application settings?
- When one benchmark does not suffice and multiple ones are needed, how can one pick the right set of benchmarks to try?
These are particularly hard questions whose answers require both an in-depth understanding of the domains where RDF is used, and an in-depth understanding of which benchmarks are appropriate for which domains. In this tutorial, we provide some guidance in this respect by discussing the state-of-the-art RDF benchmarks and, if time permits, graph benchmarks.
In more detail, in this tutorial we are going to:
- introduce attendees to the design principles of RDF benchmarks;
- discuss the dimensions of an RDF benchmark, namely the query workloads, the performance metrics, the employed datasets or data generators (in the case of synthetic benchmarks), and the rules that RDF engines should follow when running the benchmark;
- provide a comprehensive overview of the existing RDF benchmarks with an analysis along the aforementioned dimensions; and
- discuss the advantages and disadvantages of the existing benchmarks, as well as the research directions that should be pursued to create novel benchmarks that address the needs of the Linked Data paradigm.
Tutorial Content
Principles of RDF Benchmarks
In this tutorial we discuss the principles a benchmark should adhere to; building on the existing ones, we elaborate, in the concluding remarks of our tutorial, on an extended set of principles that can be used for designing new RDF benchmarks. We start with the principles that Jim Gray proposed in The Benchmark Handbook for Database and Transaction Systems [1] and discuss the more recently introduced idea of choke-point based benchmark design, which is driven by the technical difficulties (coined choke points) that a query processing framework should address [2]. The recent work by Aluç et al. [3], which proposes a systematic way of evaluating the variability of datasets and workloads in a SPARQL benchmark by introducing query features, is also discussed in the tutorial.
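To make the notion of query features more concrete, the following minimal sketch (our own illustration, not the feature set of Aluç et al. [3]) counts a few structural features of a SPARQL query, namely triple patterns and OPTIONAL, UNION and FILTER occurrences, using Python and the rdflib library; the example query is made up for illustration.

```python
import re
from rdflib.plugins.sparql import prepareQuery  # rdflib's SPARQL parser

# An illustrative SPARQL query (not taken from any particular benchmark).
QUERY = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?mbox WHERE {
    ?person foaf:name ?name .
    OPTIONAL { ?person foaf:mbox ?mbox }
    FILTER (STRLEN(?name) > 3)
}
"""

def count_triple_patterns(algebra):
    """Recursively count triple patterns in rdflib's SPARQL algebra tree."""
    count = 0
    if isinstance(algebra, dict):          # CompValue nodes behave like dicts
        for key, value in algebra.items():
            if key == "triples" and isinstance(value, list):
                count += len(value)
            else:
                count += count_triple_patterns(value)
    elif isinstance(algebra, list):
        for item in algebra:
            count += count_triple_patterns(item)
    return count

prepared = prepareQuery(QUERY)             # fails loudly on syntax errors

features = {
    "triple_patterns": count_triple_patterns(prepared.algebra),
    "optional": len(re.findall(r"\bOPTIONAL\b", QUERY, re.I)),
    "union": len(re.findall(r"\bUNION\b", QUERY, re.I)),
    "filter": len(re.findall(r"\bFILTER\b", QUERY, re.I)),
}
print(features)  # e.g. {'triple_patterns': 2, 'optional': 1, 'union': 0, 'filter': 1}
```

Features of this kind can then be used to compare how evenly a benchmark's workload covers the space of query shapes.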
Dimensions of RDF Benchmarks
To provide a comprehensive analysis of the state-of-the-art benchmarks, we discuss a set of dimensions. The dimensions that we consider, and that together constitute a benchmark, are:
- the datasets (including data generators in the case of synthetic benchmarks),
- the query workloads,
- the performance metrics, and
- the rules that should be followed when executing a benchmark.
In this tutorial, we distinguish between benchmarks that use real datasets and those that produce synthetic datasets using special-purpose data generators. For each of the datasets we discuss the schemas employed and the data characteristics in terms of the number of triples, distinct URIs and literals, as well as the distributions that these datasets follow. Other characteristics, such as the sparseness in terms of in-degree and out-degree that characterizes RDF datasets when viewed as graphs, are also presented. Regarding the workload, we provide an analysis of the queries and, where appropriate, the updates supported by the benchmark. For this analysis, we focus on the number of SPARQL operators (join, union, optional) as well as the filter expressions included in the SPARQL queries. Moreover, we discuss the features included in the workload (nested queries, aggregation, sorting, etc.). Last, we present the metrics adopted by each benchmark to judge the performance of RDF engines, and the rules that must be followed when running the benchmark.
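As a concrete illustration of how such dataset characteristics can be computed, the sketch below uses Python and the rdflib library to count triples, distinct URIs and literals, and average in- and out-degrees over a small RDF file; the file name data.ttl is a placeholder.

```python
from collections import Counter

from rdflib import Graph, Literal, URIRef

g = Graph()
g.parse("data.ttl", format="turtle")   # placeholder file; any RDF serialization works

uris, literals = set(), set()
out_degree, in_degree = Counter(), Counter()

for s, p, o in g:
    for term in (s, p, o):
        if isinstance(term, URIRef):
            uris.add(term)
        elif isinstance(term, Literal):
            literals.add(term)
    out_degree[s] += 1                 # all edges leaving the subject node
    if isinstance(o, URIRef):          # only resource-to-resource edges count as in-edges
        in_degree[o] += 1

print("triples:          ", len(g))
print("distinct URIs:    ", len(uris))
print("distinct literals:", len(literals))
if out_degree:
    print("avg out-degree:  ", sum(out_degree.values()) / len(out_degree))
if in_degree:
    print("avg in-degree:   ", sum(in_degree.values()) / len(in_degree))
```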
RDF Benchmarks
We discuss and compare the existing RDF benchmarks along the aforementioned dimensions in order to derive a complete assessment thereof.
We first present the real benchmarks proposed over the last few years; in this category fall the benchmarks that employ real datasets and workloads. We discuss the DBpedia SPARQL Benchmark (DBPSB), which was proposed by the University of Leipzig [4] and introduces a query workload derived from the DBpedia query logs. We also present the UniProt KnowledgeBase (UniProtKB) [5] along with its set of queries [6]. UniProtKB is a high-quality dataset describing protein sequences and related functional information, expressed in RDF. In addition to the previous datasets, we also discuss the YAGO [7] knowledge base, which integrates statements from the Wikipedia, WordNet, WordNet Domains, Universal WordNet and GeoNames ontologies. Like the UniProt and DBpedia datasets, YAGO is not accompanied by an official benchmark query workload; however, we will discuss the queries proposed by Neumann et al., who provided eight mostly lookup and join queries over an earlier version of the YAGO ontology for benchmarking the RDF-3X engine [8].
In addition to benchmarks using real-world datasets, we also elaborate on the state-of-the-art synthetic RDF benchmarks. We start with the Lehigh University Benchmark (LUBM) [9], intended to evaluate the performance of Semantic Web repositories. In the tutorial we discuss the process employed by LUBM to generate its datasets, which takes into account the specified query selectivity and the expected result size of each query. We also elaborate on LUBM's workload, which consists mainly of simple lookup and join queries that retrieve only data triples. Metrics that go beyond the standard query response time, including the completeness and soundness of query results as well as a combined metric for query performance, are also presented. The University Ontology Benchmark (UOBM) [10], which builds on LUBM, is also addressed in this tutorial; this benchmark tackles complex inference and includes queries that address scalability issues in addition to those studied by LUBM. SP2Bench [11] is also included in the set of benchmarks studied in this tutorial, since it is one of the most commonly used benchmarks for evaluating the performance of RDF engines. The benchmark, which is set in the DBLP bibliography scenario, contains both a data generator and a set of queries: the generator produces arbitrarily large datasets that respect the constraints of the DBLP schema, while the queries employ different SPARQL 1.0 operators and are designed to test the different approaches to SPARQL optimization. Finally, in the category of synthetic RDF benchmarks, we elaborate on the Berlin SPARQL Benchmark (BSBM) [12], a broadly accepted and widely used benchmark built around an e-commerce scenario. The latest version of the benchmark [13] that we discuss in the tutorial comes with a scalable data generator and a test driver, as well as a set of queries that measure the performance of RDF engines on very large datasets, but not their ability to perform complex reasoning tasks. Special attention will be given to the performance metrics used by BSBM, which are very close to the ones proposed by TPC-H and go beyond the metrics used by other benchmarks, which are either absent or focus only on query response time.
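To ground the discussion of performance metrics, the following sketch times a small query workload over an rdflib graph and reports per-query response times together with their arithmetic and geometric means; the queries, URIs and file name are illustrative stand-ins and not part of the actual LUBM or BSBM workloads.

```python
import math
import time

from rdflib import Graph

g = Graph()
g.parse("benchmark_data.ttl", format="turtle")   # placeholder dataset

# Illustrative lookup and join queries in the spirit of the workloads above.
WORKLOAD = {
    "lookup": """
        SELECT ?s WHERE { ?s a <http://example.org/GraduateStudent> }
    """,
    "join": """
        SELECT ?s ?dept WHERE {
            ?s a <http://example.org/GraduateStudent> .
            ?s <http://example.org/memberOf> ?dept .
        }
    """,
}

times = {}
for name, query in WORKLOAD.items():
    start = time.perf_counter()
    results = list(g.query(query))               # force full evaluation
    times[name] = time.perf_counter() - start
    print(f"{name}: {len(results)} results in {times[name]:.4f}s")

values = [max(t, 1e-9) for t in times.values()]  # guard against zero timings
print("arithmetic mean:", sum(values) / len(values))
print("geometric mean: ", math.exp(sum(math.log(t) for t in values) / len(values)))
```

Benchmarks such as BSBM report richer, throughput-oriented metrics over whole query mixes, but the basic pattern of timing a fixed workload against a loaded dataset remains the same.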
Social Network Benchmarks
Another set of benchmarks of which we intend to provide a thorough analysis are those that model social network graphs. We start with the Social Intelligence Benchmark (SIB) [14], a synthetic benchmark that simulates an RDF backend of a social network site (such as Facebook). SIB comes with a scalable data generator, a set of queries and a set of metrics. The synthetic data generation is driven by a set of parameters used to produce the social graph.
In the same spirit as SIB is LinkBench [15], a benchmark based on Facebook's social graph. The benchmark is synthetic, and its sole objective is to predict the performance of a database used for the persistent storage of Facebook's data. In the tutorial we present the components of the benchmark, including the graph store implementation, the graph generation and the workload generation, as well as the metrics employed for measuring the tested systems' performance.
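As a toy illustration of parameter-driven social graph generation (a simplified sketch of the general idea only, not S3G2's correlated generator [14] or LinkBench's workload model [15]), the code below builds a small random friendship graph as RDF triples with rdflib; the namespace, user count and average-degree parameter are made up.

```python
import random

from rdflib import RDF, Graph, Namespace

EX = Namespace("http://example.org/sn/")          # made-up namespace
NUM_USERS = 100                                   # illustrative parameters
AVG_FRIENDS = 5

random.seed(42)
g = Graph()
users = [EX[f"user{i}"] for i in range(NUM_USERS)]

for user in users:
    g.add((user, RDF.type, EX.Person))
    # Pick a random number of friends around the configured average.
    for friend in random.sample(users, k=random.randint(1, 2 * AVG_FRIENDS)):
        if friend != user:
            g.add((user, EX.knows, friend))

print(len(g), "triples generated")
g.serialize("social_graph.ttl", format="turtle")  # write the synthetic dataset
```

Real generators such as S3G2 go much further, correlating attribute values and friendship edges so that the resulting graph exhibits realistic data distributions.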
Benchmark Generators
In addition to the aforementioned benchmarks, we also discuss benchmark generation as proposed by Duan et al. [16]. In this work, the authors introduce the notions of coverage and coherence, which are used (a) to characterize datasets (real or synthetically produced) and (b) to drive the generation of benchmark datasets of a desired coherence and size. We also discuss the follow-up work of Arenas et al. [17], which provides a general framework for users to analyze their data and schemas and the relationships between the two. Such a framework can be central in both the selection and the generation of benchmarks for a particular system.
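As a rough illustration of how such structuredness measures can be computed, the sketch below derives, for each rdf:type in an rdflib graph, the average fraction of the type's observed properties that its instances actually set; this is a simplified, coverage-like score in the spirit of [16], not the exact definitions of coverage and coherence used there, and data.ttl is again a placeholder file.

```python
from collections import defaultdict

from rdflib import RDF, Graph

g = Graph()
g.parse("data.ttl", format="turtle")              # placeholder dataset

instances_of = defaultdict(set)                   # type     -> its instances
props_of_type = defaultdict(set)                  # type     -> properties seen on its instances
props_of_inst = defaultdict(set)                  # instance -> properties it sets

for s, p, o in g:
    if p == RDF.type:
        instances_of[o].add(s)
    else:
        props_of_inst[s].add(p)

for t, instances in instances_of.items():
    for inst in instances:
        props_of_type[t] |= props_of_inst[inst]

# For each type: average fraction of the type's properties set per instance
# (a simplified, coverage-like score; the metrics in [16] are defined more carefully).
for t, instances in instances_of.items():
    all_props = props_of_type[t]
    if not all_props:
        continue
    score = sum(len(props_of_inst[i] & all_props) for i in instances) / (
        len(instances) * len(all_props)
    )
    print(f"{t}: {score:.2f} ({len(instances)} instances, {len(all_props)} properties)")
```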
Duration and Sessions
- Introduction to the topic (15 minutes)
- A short presentation of RDF and SPARQL (15 minutes)
- Principles and Dimensions of RDF Benchmarks (30 minutes)
- Presentation of RDF Benchmarks, Part I (30 minutes)
- Coffee break (30 minutes)
- Presentation of RDF Benchmarks, Part II (75 minutes)
- Conclusions, directions & discussion (15 minutes)
Audience
This tutorial is aimed at a broad range of attendees: from senior undergraduate and graduate students, to more experienced researchers who are unfamiliar with the existing RDF benchmarks, to scientists, data producers and data consumers whose applications require RDF query processing. Attendees are expected to become familiar with the existing RDF benchmarks as well as with the principles of benchmark development.
Prerequisite
A basic knowledge of RDF and the SPARQL query language will be helpful to the audience.
References
[1] J. Gray, editor. The Benchmark Handbook for Database and Transaction Systems. Morgan Kaufmann, 1993.
[2] P. Boncz, T. Neumann, et al. TPC-H Analyzed: Hidden Messages and Lessons Learned from an Influential Benchmark. In TPCTC, 2013.
[3] G. Aluç, O. Hartig, et al. Diversified Stress Testing of RDF Data Management Systems. In ISWC, 2014.
[4] M. Morsey, J. Lehmann, et al. DBpedia SPARQL Benchmark - Performance assessment with real queries on real data. In ISWC, 2011.
[5] N. Redaschi and UniProt Consortium. UniProt in RDF: Tackling Data Integration and Distributed Annotation with the Semantic Web. In Biocuration Conference, 2009.
[6] UniProtKB Queries. http://www.uniprot.org/help/query-fields
[7] Fabian M. Suchanek, Gjergji Kasneci, et al. Yago: a core of semantic knowledge. In WWW, 2007.
[8] Thomas Neumann and Gerhard Weikum. The RDF-3X engine for scalable management of RDF data. The VLDB Journal, 19(1), 2010.
[9] LUBM. http://swat.cse.lehigh.edu/projects/lubm.
[10] L. Ma, Y. Yang, et al. Towards a Complete OWL Ontology Benchmark. In ESWC, 2006.
[11] M. Schmidt, T. Hornung, et al. SP2Bench: A SPARQL performance benchmark. In ICDE, 2009.
[12] C. Bizer and A. Schultz. The Berlin SPARQL Benchmark. Int. J. Semantic Web and Inf. Sys., 5(2), 2009.
[13] Berlin SPARQL Benchmark (BSBM) Specification - V3.1. http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/spec/index.html.
[14] M-D. Pham, P.A. Boncz, et al. S3G2: a Scalable Structure-correlated Social Graph Generator. In TPCTC, 2012.
[15] T. Armstrong, V. Ponnekanti, et al. LinkBench: a database benchmark based on the Facebook social graph. In SIGMOD, 2013.
[16] S. Duan, A. Kementsietsidis, et al. Apples and oranges: a comparison of RDF benchmarks and real RDF datasets. In SIGMOD, 2011.
[17] M. Arenas, G. I. Diaz, et al. A Principled Approach to Bridging the Gap between Graph Data and their Schemas. PVLDB, 7, 2014.