Tuning SPARQL Queries

by Arto

Here follows an excerpt from our upcoming Dydra Developer Guide, from a section that provides some simple tips on how to tune your queries for better application performance.

SPARQL is a powerful query language, and as such it is easy to write complex queries that require a great deal of computing power to execute. As both query execution time and billing directly depend on how much processing a query requires, it is useful to understand some of Dydra’s key performance characteristics. With larger datasets, simple changes to a query can result in a significant performance improvement.

This post describes several factors that strongly influence the execution time and cost of queries, and explains a number of tips and tricks that will help you tune your queries for optimal application performance and a reduced monthly bill.

Note that the following may contain too much detail if you are casually using Dydra for typical and straightforward use cases. You probably won’t need these tips until you are dealing with large datasets or complex queries. Nonetheless, you may still find it interesting to at least glance over this material.

SELECT Queries

A general tip for SELECT queries is to avoid unnecessarily projecting variables you won’t actually use. That is, if your query’s WHERE clause binds the variables ?a, ?b, and ?c, but you actually only ever use ?b when iterating over the solution sequence in your application, then you might want to avoid specifying the query in either of the following two forms:

SELECT * WHERE { ... }
SELECT ?a ?b ?c WHERE { ... }

Rather, it is better to be explicit and project just the variables you actually intend to use:

SELECT ?b WHERE { ... }

The above has two benefits. Firstly, Dydra’s query processing will apply more aggressive optimizations knowing that the values of the variables ?a and ?c will not actually be returned in the solution sequence. Secondly, the size of the solution sequence itself, and hence the network use necessary for your application to retrieve it, is reduced by not including superfluous values. The combination of these two factors can make a big performance difference for complex queries returning large solution sequences.
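As a concrete sketch, assume a dataset of FOAF-style person descriptions. The vocabulary and variable names here are illustrative assumptions, not taken from any particular Dydra repository. The pattern binds three variables, but only the one the application reads is projected:

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# The pattern binds ?person, ?name, and ?mbox, but the application
# only reads ?name, so ?name is the only variable projected.
SELECT ?name
WHERE {
  ?person a foaf:Person ;
          foaf:name ?name ;
          foaf:mbox ?mbox .
}
```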

If you remember just one thing from this subsection, remember this: SELECT * is a useful shorthand when manually executing queries, but it is best avoided in a production application that runs complex queries on non-trivial amounts of data.

Remember, also, that SPARQL provides an ASK query form. If all you need to know is whether a query matches something or not, use an ASK query instead of a SELECT query. This enables the query to be optimized more aggressively, and instead of a solution sequence you will get back a simple boolean value indicating whether the query matched or not, minimizing the data transferred in response to your query.
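For instance, to check whether any person in the dataset has a given mailbox, an ASK query along the following lines would do. This is a sketch: the FOAF vocabulary and the mailbox address are hypothetical placeholders.

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Answers true if at least one matching person exists, false otherwise.
# No solution sequence is returned, only the boolean.
ASK {
  ?person a foaf:Person ;
          foaf:mbox <mailto:alice@example.org> .
}
```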

The ORDER BY Clause

The ORDER BY clause can be very useful when you want your solution sequence to be sorted. It is important to realize, though, that ORDER BY is a relatively heavy operation: query processing must materialize and sort a full intermediate solution sequence, which prevents Dydra from returning initial results to you until all results are available.

This does not mean that you should avoid using ORDER BY when it serves a purpose. If you need your query results sorted by particular criteria, it is best to let Dydra do that for you rather than manually sorting the data in your application. After all, that is why ORDER BY is there. However, if the solution sequence is large, and if the latency to obtain the initial solutions is important (sometimes known as the “time-to-first-solution” factor), you may wish to consider whether you in fact need an ORDER BY clause or not.

The OFFSET Clause

Dydra’s query processing guarantees that a query solution sequence has a consistent and deterministic ordering even in the absence of an ORDER BY clause. This has an important and useful consequence: the results of an OFFSET clause are always repeatable, whether or not the query has an ORDER BY clause.

Concretely, this means that if you have a query containing an OFFSET clause, and you execute that query multiple times in succession, you will get the same solution sequence in the same order each time. This is not a universal property of SPARQL implementations, but you can rely on it with Dydra.

This feature facilitates, for example, paging through a large solution sequence using an OFFSET and LIMIT clause combination, without needing ORDER BY. So, again, don’t use an ORDER BY clause unnecessarily if you merely want to page through the solution sequence (say) a hundred solutions at a time.
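A paging query of this kind might look as follows. This is a sketch under assumed data: the FOAF vocabulary and the page size of 100 are illustrative choices. To fetch page n (counting from 1), set OFFSET to (n - 1) * 100 and leave the rest of the query unchanged.

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Third page of results at 100 solutions per page:
# skip the first 200 solutions, then return the next 100.
# No ORDER BY is needed when relying on Dydra's repeatable ordering.
SELECT ?name
WHERE {
  ?person foaf:name ?name .
}
LIMIT 100
OFFSET 200
```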

The LIMIT Clause

Include a LIMIT clause in your queries whenever possible. If your application only needs the first 100 query solutions, specify LIMIT 100. This puts an explicit upper bound on the amount of work to be performed in answering your query.

Note, however, that if your query contains both ORDER BY and LIMIT clauses, query processing must always construct and examine the full solution sequence in order to sort it. Therefore the amount of processing needed is not actually reduced by a LIMIT clause in this case. Still, limiting the size of the ordered solution sequence with an explicit LIMIT improves performance by reducing network use.
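To illustrate the combined case, here is a sketch of a query that returns the first ten names in sorted order. The FOAF vocabulary is again an assumption made for the example.

```sparql
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# Every matching solution must still be collected and sorted before
# the first ten can be selected; LIMIT only shrinks the response.
SELECT ?name
WHERE {
  ?person foaf:name ?name .
}
ORDER BY ?name
LIMIT 10
```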
