SPARQL Nuts & Bolts

In Learn SPARQL, we introduced SPARQL, explained how it relates to other query languages, and went through basic SPARQL syntax.

This lesson builds on that foundation, primarily by example. We use many real-world SPARQL queries to illustrate the features of the query language, as this is the quickest way to make you productive.

This is a technical lesson by nature.

Today’s Lesson

In this tutorial we will use the DBpedia SPARQL endpoint from DBpedia.org, which is a freely available, community-backed database filled with RDF data extracted from Wikipedia.

All sample queries from this lesson can be pasted directly into DBpedia’s SPARQL UI for testing. You are very much encouraged to play around with the queries to explore the language. It helps to keep the SPARQL UI open in another browser tab so you can switch back and forth between the lesson and the queries.

Basic Graph Patterns

Let’s start out by querying just one triple pattern (an RDF triple with variables). Several triples with variables grouped together are called a Basic Graph Pattern. This is the most basic kind of query.

Recall that variables are prefixed with a question mark (?).

Also recall that RDF resources are identified by URI, so the words “URI” and “resource” are often used interchangeably when talking about queries.

Query 1: This query returns all of the URIs that identify cities that are of type “Cities in Texas”.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE {
  ?city rdf:type <http://dbpedia.org/class/yago/WikicatCitiesInTexas>
}

Copy and paste this into DBpedia’s SPARQL UI. If you run the query, you’ll be rewarded with a table containing a rather long list of cities in Texas.

Now let’s add another triple into the query.

Query 2: This query returns the cities that are of type “Cities in Texas” as well as their total populations.

Notice that the variable ?city is used as the subject of both triples in the query, matching two statements on a single resource.
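The listing for Query 2 could look like the sketch below. It is one plausible form, not the definitive query: it assumes the dbp: prefix and the dbp:populationTotal predicate that Query 12 later in this lesson uses.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbp: <http://dbpedia.org/ontology/>
SELECT * WHERE {
  # Two separate triple patterns that share the subject ?city
  ?city rdf:type <http://dbpedia.org/class/yago/WikicatCitiesInTexas> .
  ?city dbp:populationTotal ?popTotal .
}
```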

If you type this into the DBpedia SPARQL UI you’ll see that the results page is largely the same as the first query, but there is now a new column with the extra information requested.

SPARQL uses the Turtle syntax (described originally in the lesson RDF Nuts & Bolts), so the following query is the same:
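A sketch of the abbreviated form, under the same assumptions about prefixes and predicates as Query 12: the subject is written once and the predicate-object pairs are separated by semicolons.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbp: <http://dbpedia.org/ontology/>
SELECT * WHERE {
  # ";" repeats the subject ?city for the next predicate-object pair
  ?city rdf:type <http://dbpedia.org/class/yago/WikicatCitiesInTexas> ;
        dbp:populationTotal ?popTotal .
}
```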

Can you tell what’s different, and why the two queries are, in effect, identical?

Let’s add another triple pattern. This is a Basic Graph Pattern with three triple patterns.

Query 3: This query returns the cities that are of type “Cities in Texas” with their total populations and metro populations.
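As a sketch, Query 3 might be written as follows, assuming the dbp:populationTotal and dbp:populationMetro predicates that appear in Query 15:

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbp: <http://dbpedia.org/ontology/>
SELECT * WHERE {
  # All three patterns must match for a city to appear in the results
  ?city rdf:type <http://dbpedia.org/class/yago/WikicatCitiesInTexas> ;
        dbp:populationTotal ?popTotal ;
        dbp:populationMetro ?popMetro .
}
```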

Try this in the DBpedia SPARQL UI. Now the size of the results has shrunk dramatically from the previous queries! Why would that be, given that the query itself is asking for more information, not less?

In fact, by asking for the additional information we are putting an implicit restriction on the query. Specifically, the query will return results only for cities that have values for both ?popTotal and ?popMetro. Cities that only have ?popTotal but not ?popMetro do not show up anymore.

That is, we are asking for cities that have values for both total population and metro population. If a city resource lacks one or the other, the graph pattern in the query won’t match the data.

But what if we want all cities to appear, regardless of whether they have a metro population? This is where the OPTIONAL clause comes in.

Dealing with Missing or Sparse Data using OPTIONAL

The idea of the OPTIONAL clause is to enable you to bring back data if it exists, but to ignore it if it does not. This is a key way in which SPARQL deals with sparse data and missing values elegantly.

If you are coming from the SQL world, the SPARQL OPTIONAL operator is equivalent to a Left Outer Join. In other words, the results will always include values from the “left part” of the query, even if there is nothing that matches the “right part” of the query.

Query 4: This query returns the cities that are of type “Cities in Texas” and their total population and optionally the metro population, if it exists.
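A sketch of Query 4, modelled on the OPTIONAL pattern used in Query 12 (the prefixes and predicates are taken from that query):

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbp: <http://dbpedia.org/ontology/>
SELECT * WHERE {
  # Required part: the city must have a type and a total population
  ?city rdf:type <http://dbpedia.org/class/yago/WikicatCitiesInTexas> ;
        dbp:populationTotal ?popTotal .
  # Optional part: ?popMetro stays unbound when no such triple exists
  OPTIONAL { ?city dbp:populationMetro ?popMetro . }
}
```

Cities without a metro population still appear; their ?popMetro column is simply empty.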

Again, the reason why we need the OPTIONAL operator is because there can be missing information in RDF. If you are coming from the SQL world, missing information is represented with NULL. However, there are no NULLs in RDF.

Let’s say that again: there are no NULL values in RDF.

Either a triple exists or it does not. In fact, since RDF data is independent from its physical data representation, the whole idea of NULL is completely unnecessary.

NULL values in relational databases are just a manifestation of the tabular logical data model; there needed to be some way to represent an empty cell. In SQL, if in a record a value is NULL and you do SELECT * FROM table, you still get the NULL in the result.

This is not the case in SPARQL. Instead, you don’t get the record at all (as we illustrated in Query 3). If you still want to get the record, you need to use OPTIONAL.

In general, you should use OPTIONAL for any data that you would like to get back but that should not act as a requirement for a result to match. For example, you’ll often use OPTIONAL for predicates that you’re not sure will contain much data, or for values that an application can use if present but ignore if not.

Solution Modifiers: ORDER BY, LIMIT, OFFSET

The basic graph pattern matching shown above enables you to select data. However, you usually don’t want all data that might match the pattern to be returned for every query. You want data returned in a certain order, and usually only want to see a few results at a time.

This is where ORDER BY, LIMIT, and OFFSET come in handy. Together they enable you to pull down query results a page at a time. If you’re coming from the SQL world, these operators are equivalent to the same clauses in SQL.

Let’s start with ORDER BY. The ORDER BY clause establishes the order of the results and can be in ascending or descending order.

Query 5: This query returns the cities that are of type “Cities in Texas”, their total population, and optionally their metro populations. The results are returned in descending order of total population (so big cities like Houston would be the first results).
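A sketch of Query 5, using the same ORDER BY desc(?popTotal) clause that closes Query 12:

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbp: <http://dbpedia.org/ontology/>
SELECT * WHERE {
  ?city rdf:type <http://dbpedia.org/class/yago/WikicatCitiesInTexas> ;
        dbp:populationTotal ?popTotal .
  OPTIONAL { ?city dbp:populationMetro ?popMetro . }
}
# desc() sorts from the largest total population to the smallest
ORDER BY desc(?popTotal)
```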

You can also use the asc() option to return the results in ascending order.

The LIMIT clause puts an upper bound on the number of results. The OFFSET clause skips the specified number of results before the first one is returned. These clauses are often used in conjunction with ORDER BY to implement result paging.

Query 6: This query returns the cities that are of type “Cities in Texas”, their total population, and optionally their metro populations. The results are returned in the order of their total populations (so big cities would be the top results). At most 10 results will be returned, starting with the 5th result.
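A sketch of Query 6. The OFFSET value is an assumption chosen to match the description: OFFSET 4 skips the first four rows, so the output begins with the fifth.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbp: <http://dbpedia.org/ontology/>
SELECT * WHERE {
  ?city rdf:type <http://dbpedia.org/class/yago/WikicatCitiesInTexas> ;
        dbp:populationTotal ?popTotal .
  OPTIONAL { ?city dbp:populationMetro ?popMetro . }
}
ORDER BY desc(?popTotal)
LIMIT 10   # return at most ten rows
OFFSET 4   # skip the first four rows of the ordered result
```

To fetch the following page, keep LIMIT 10 and raise OFFSET by 10.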

Remove results: FILTER

At this point we’ve shown how to match data and how to order the results. However, basic match patterns are not enough, as you also need to be able to filter out results that match the pattern but are not wanted.

A FILTER clause restricts which results are returned. You can use filters to do things like:

Don’t return cities with populations greater than 50,000
Don’t return cities with names that begin with the letter A
Don’t return cities with a mayor who has a son named Rob
Etc.

With graph patterns and filters, SPARQL becomes a very powerful language for selecting only data that matches very specific criteria.

Now let’s add some constraints to our query. We can do this with the FILTER operator, which uses Boolean conditions to filter out unwanted results. The following operators and functions may be used in a filter expression:

Logical: &&, ||, !
Mathematical: +, -, *, /
Comparison: =, !=, <, >, <=, >=
SPARQL tests: isURI, isBlank, isLiteral, bound
SPARQL accessors: str, lang, datatype
Other: sameTerm, langMatches, regex

Query 7: This is the same as Query 6, but returns only cities that have a total population of more than 50,000.
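A sketch of Query 7, using the same population test that appears in the FILTER of Query 12. The LIMIT and OFFSET values are assumptions carried over from the description of Query 6.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbp: <http://dbpedia.org/ontology/>
SELECT * WHERE {
  ?city rdf:type <http://dbpedia.org/class/yago/WikicatCitiesInTexas> ;
        dbp:populationTotal ?popTotal .
  OPTIONAL { ?city dbp:populationMetro ?popMetro . }
  # Keep only rows where the condition evaluates to true
  FILTER (?popTotal > 50000)
}
ORDER BY desc(?popTotal)
LIMIT 10
OFFSET 4
```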

Query 8: This query is the same as Query 7, but brings back the human-readable name of each city with the results.
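A sketch of Query 8, binding rdfs:label to ?name as Query 12 does. LIMIT and OFFSET are left out here (an assumption, following the shape of Query 12) so that every label row is visible.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbp: <http://dbpedia.org/ontology/>
SELECT * WHERE {
  ?city rdf:type <http://dbpedia.org/class/yago/WikicatCitiesInTexas> ;
        dbp:populationTotal ?popTotal ;
        rdfs:label ?name .   # one row per label the city has
  OPTIONAL { ?city dbp:populationMetro ?popMetro . }
  FILTER (?popTotal > 50000)
}
ORDER BY desc(?popTotal)
```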

rdfs:label is an RDFS predicate commonly used to represent the human-readable name of a resource. You will likely see either this or dc:title from the Dublin Core Metadata Initiative used in most ontologies.

What happened?! There are now several result rows for each city resource. Why is that?

This is a common occurrence in RDF. Unlike SQL, it is very easy to assign multiple values to a resource for a specific property. In this case, there is an rdfs:label for multiple languages in the dataset.

In our simple tabular result format, we therefore get one result row for each repeated value.

Since we don’t need all the results for all languages, we can simplify the query by requesting only the English values be returned. In this way, RDF and SPARQL naturally support internationalization.

Query 9: Query 8, but requesting only English labels for the matching patterns.
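A sketch of Query 9, using the same language test that appears in the FILTER of Query 12:

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbp: <http://dbpedia.org/ontology/>
SELECT * WHERE {
  ?city rdf:type <http://dbpedia.org/class/yago/WikicatCitiesInTexas> ;
        dbp:populationTotal ?popTotal ;
        rdfs:label ?name .
  OPTIONAL { ?city dbp:populationMetro ?popMetro . }
  # lang(?name) yields the label's language tag; keep English labels only
  FILTER (?popTotal > 50000 && langmatches(lang(?name), "EN"))
}
ORDER BY desc(?popTotal)
```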

The lang operator extracts the language tag of the value that is bound to ?name. The langmatches operator matches the first language tag with the second language range.

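The previous query can be rewritten without the langmatches operator by using “=” and “en” (lowercase) instead of “EN” (uppercase). The two forms are nearly equivalent: “=” matches only the exact language tag “en”, whereas langmatches also accepts regional tags such as “en-US”: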
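A sketch of that rewrite, assuming the query otherwise has the same shape as Query 12:

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbp: <http://dbpedia.org/ontology/>
SELECT * WHERE {
  ?city rdf:type <http://dbpedia.org/class/yago/WikicatCitiesInTexas> ;
        dbp:populationTotal ?popTotal ;
        rdfs:label ?name .
  OPTIONAL { ?city dbp:populationMetro ?popMetro . }
  # Plain string comparison against the language tag, so case matters
  FILTER (?popTotal > 50000 && lang(?name) = "en")
}
ORDER BY desc(?popTotal)
```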

Query 10: This query shows how to use regular expression filters. It is the same as Query 9, but matching only cities with “El” in their names.
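A sketch of Query 10, adding a regex test to the filter conditions used in Query 12:

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbp: <http://dbpedia.org/ontology/>
SELECT * WHERE {
  ?city rdf:type <http://dbpedia.org/class/yago/WikicatCitiesInTexas> ;
        dbp:populationTotal ?popTotal ;
        rdfs:label ?name .
  OPTIONAL { ?city dbp:populationMetro ?popMetro . }
  # str() drops the language tag; regex() then looks for "El" anywhere in the name
  FILTER (?popTotal > 50000
          && langmatches(lang(?name), "EN")
          && regex(str(?name), "El"))
}
ORDER BY desc(?popTotal)
```

Passing "i" as a third argument, regex(str(?name), "el", "i"), makes the match case-insensitive.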

The str operator extracts the string of the value that is bound to ?name. The regex operator allows regular expressions, and specifically any regular expressions accepted by XQuery. The syntax should be familiar to anyone who has used standard regular expressions before.

One common mistake beginners make with regular expressions is to forget that they are case sensitive. Run the previous query with “el” instead of “El” and see what happens.

Negation: where is the NOT operator?

An important feature in any query language is negation. However, in SPARQL 1.0, there is no explicit negation operator.

Nevertheless, negation is possible through Negation as Failure and is written using the OPTIONAL clause, the BOUND operator, and the logical not (!) operator. The OPTIONAL clause binds a variable wherever the data to be excluded exists, and the filter then removes the results in which that variable is bound.

This sounds pretty complicated, but it’s really quite simple once you see it in action.

Query 11: This query is the same as before, except that it returns only cities that do not have a metro population.
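A sketch of Query 11. It assumes the English-label query as its starting point (without the regex test) and adds the !bound() condition:

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbp: <http://dbpedia.org/ontology/>
SELECT * WHERE {
  ?city rdf:type <http://dbpedia.org/class/yago/WikicatCitiesInTexas> ;
        dbp:populationTotal ?popTotal ;
        rdfs:label ?name .
  # Try to bind ?popMetro without making it a requirement
  OPTIONAL { ?city dbp:populationMetro ?popMetro . }
  # Keep only the rows where ?popMetro was left unbound
  FILTER (?popTotal > 50000
          && langmatches(lang(?name), "EN")
          && !bound(?popMetro))
}
ORDER BY desc(?popTotal)
```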

To understand how this works, first realize that you need the OPTIONAL clause to prevent yourself from incorrectly requiring a value for dbp:populationMetro to be bound to a query result.

Next, let’s look at the new filter clause. The BOUND operator is simply a boolean test that returns whether or not a specific variable is bound in the result being returned. You can think of “bound” as “matched” or “not null.” So when combined with the logical not operator (!), the filter is “match cities where the metro population is not bound to any value.”

Success!

Finally, note also that a query can have several FILTER clauses. You do not need to AND all of the conditions together in a single FILTER clause. For example, this query, in which I specify each filter condition in its own FILTER clause, is exactly equivalent:
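A sketch of the split form, assuming the three conditions described for Query 11 (population threshold, English label, no metro population):

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbp: <http://dbpedia.org/ontology/>
SELECT * WHERE {
  ?city rdf:type <http://dbpedia.org/class/yago/WikicatCitiesInTexas> ;
        dbp:populationTotal ?popTotal ;
        rdfs:label ?name .
  OPTIONAL { ?city dbp:populationMetro ?popMetro . }
  # A row must pass every FILTER in the group, so these behave like one && expression
  FILTER (?popTotal > 50000)
  FILTER (langmatches(lang(?name), "EN"))
  FILTER (!bound(?popMetro))
}
ORDER BY desc(?popTotal)
```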

UNION

The UNION clause is a disjunction between two basic graph patterns. In other words, it is an OR.

Query 12: This is much the same as the queries that we’ve been seeing, only it returns cities that are of type “Cities in Texas” or of type “Cities in California”.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbp: <http://dbpedia.org/ontology/>
SELECT * WHERE {
  {
    ?city rdf:type <http://dbpedia.org/class/yago/WikicatCitiesInTexas> ;
          dbp:populationTotal ?popTotal ;
          rdfs:label ?name .
    OPTIONAL { ?city dbp:populationMetro ?popMetro . }
    FILTER (?popTotal > 50000 && langmatches(lang(?name), "EN"))
  }
  UNION
  {
    ?city rdf:type <http://dbpedia.org/class/yago/CitiesInCalifornia> ;
          dbp:populationTotal ?popTotal ;
          rdfs:label ?name .
    OPTIONAL { ?city dbp:populationMetro ?popMetro . }
    FILTER (?popTotal > 50000 && langmatches(lang(?name), "EN"))
  }
}
ORDER BY desc(?popTotal)

This is a very simple, naïve first attempt at writing this expression. As you can see, the previous query has several redundant triple patterns and can be simplified. The following query is equivalent to the previous one but simpler:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbp: <http://dbpedia.org/ontology/>
SELECT * WHERE {
  ?city dbp:populationTotal ?popTotal ;
        rdfs:label ?name .
  OPTIONAL { ?city dbp:populationMetro ?popMetro . }
  FILTER (?popTotal > 50000 && langmatches(lang(?name), "EN"))
  { ?city rdf:type <http://dbpedia.org/class/yago/WikicatCitiesInTexas> . }
  UNION
  { ?city rdf:type <http://dbpedia.org/class/yago/CitiesInCalifornia> . }
}
ORDER BY desc(?popTotal)

Note that the repeated triple patterns (i.e. those common to both sets of results) are outside of the UNION clause.

Named Graphs and the GRAPH Clause

Up to now, we have been querying a single RDF dataset. However, as explained in RDF 101, RDF data, even within a single RDF database, is broken up into subsets called named graphs.

Up until now in this tutorial, all of our queries have assumed that all the data we care about is in the same RDF graph, which is called the default graph. Depending on the RDF database implementation, the default graph might contain nothing at all, or metadata about the database as a whole, or it might serve as a proxy for all data within the database.

With DBpedia specifically, the default graph serves as a proxy for all data. However, it does have a number of named graphs we can use to scope the query if desired.

Each named graph is identified by a URI.

Query 13: This query returns the cities that are of type “Cities in Texas” and the graph in which each city resource is contained.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE {
  GRAPH ?g {
    ?city rdf:type <http://dbpedia.org/class/yago/WikicatCitiesInTexas> .
  }
}

The results show that all of the data is in the same graph, the default graph.

In general, if you want to query a specific named graph, replace the variable ?g with the URI of that named graph.
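As a sketch of that substitution, the query below scopes Query 13 to a single named graph. The graph URI <http://dbpedia.org> is an assumption about how the public DBpedia endpoint names its main graph; substitute whatever value ?g returned for you.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT * WHERE {
  # Assumed graph URI; only triples stored in this named graph are matched
  GRAPH <http://dbpedia.org> {
    ?city rdf:type <http://dbpedia.org/class/yago/WikicatCitiesInTexas> .
  }
}
```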

What else can I do with SPARQL?

Up to now, we have just been executing SELECT queries. However, as we mentioned in SPARQL 101, there are three more types of queries: ASK, DESCRIBE, and CONSTRUCT.

ASK

An ASK query checks whether there is at least one result for a given query pattern. The result is true or false.

Query 14: This query asks if Austin is a city in Texas.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
ASK WHERE {
  <http://dbpedia.org/resource/Austin,_Texas> rdf:type
    <http://dbpedia.org/class/yago/WikicatCitiesInTexas> .
}

Now let’s make a more complicated question using ASK.

Query 15: This query asks if there exists a city in Texas that has a total population greater than 600,000 and a metro population less than 1,800,000.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbp: <http://dbpedia.org/ontology/>
ASK WHERE {
  ?city rdf:type <http://dbpedia.org/class/yago/WikicatCitiesInTexas> ;
        dbp:populationTotal ?popTotal ;
        dbp:populationMetro ?popMetro .
  FILTER (?popTotal > 600000 && ?popMetro < 1800000)
}

DESCRIBE

A DESCRIBE query returns an RDF graph that describes a resource. The implementation of this return form is up to each query engine.

Query 16: This query returns an RDF graph that describes Austin.

DESCRIBE <http://dbpedia.org/resource/Austin,_Texas>

Note that if you enter this query into DBpedia’s SPARQL UI you will be prompted to download a file. The reason for this is that the result of DESCRIBE is a graph, and in this specific case a graph that is represented in the N3 serialization format described in RDF Nuts & Bolts.

As mentioned before, the behavior of DESCRIBE is implementation dependent. Virtuoso is the triple store that powers DBpedia, and its implementation of DESCRIBE is to return an RDF graph where the resource is in the subject or in the object position. Other triple stores do not necessarily have this same behavior.

It is also possible to have a DESCRIBE query with triple patterns in a WHERE clause.

Query 17: This query returns an RDF graph that describes all the cities in Texas that have a total population greater than 600,000 and a metro population less than 1,800,000.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbp: <http://dbpedia.org/ontology/>
DESCRIBE ?city WHERE {
  ?city rdf:type <http://dbpedia.org/class/yago/WikicatCitiesInTexas> ;
        dbp:populationTotal ?popTotal ;
        dbp:populationMetro ?popMetro .
  FILTER (?popTotal > 600000 && ?popMetro < 1800000)
}

CONSTRUCT

A CONSTRUCT query returns an RDF graph that is created from a graph template specified in the CONSTRUCT query. More specifically, the result RDF graph is created by taking the results of a query pattern and filling in the values of variables that occur in the construct template.

Using CONSTRUCT, you can transform RDF data into different graph structures, with different vocabularies. This may be useful if you have RDF data that was automatically generated and would like to transform it using well-known vocabularies, or if you are merging RDF data from multiple vocabularies. As such, CONSTRUCT is a powerful tool for consuming RDF from various sources.

Query 18: This query constructs a new RDF graph for cities in Texas that have a metro population and a total population greater than 500,000.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbp: <http://dbpedia.org/ontology/>
CONSTRUCT {
  ?city rdf:type <http://myvocabulary.com/LargeMetroCitiesInTexas> ;
        <http://myvocabulary.com/cityName> ?name ;
        <http://myvocabulary.com/totalPopulation> ?popTotal ;
        <http://myvocabulary.com/metroPopulation> ?popMetro .
} WHERE {
  ?city rdf:type <http://dbpedia.org/class/yago/WikicatCitiesInTexas> ;
        dbp:populationTotal ?popTotal ;
        rdfs:label ?name ;
        dbp:populationMetro ?popMetro .
  FILTER (?popTotal > 500000 && langmatches(lang(?name), "EN"))
}

Note that in the CONSTRUCT clause above we have used our own, made-up vocabulary!

SPARQL Result Syntax

For SPARQL endpoints, SELECT and ASK queries return XML (application/sparql-results+xml) as the standard query result format. There is also a JSON format (application/sparql-results+json). To see what the raw SPARQL results look like, toggle the “Results Format” field in the DBpedia SPARQL UI.

That said, you will rarely if ever have to parse SPARQL results yourself, as you’ll typically be issuing SPARQL queries through an RDF client library that takes care of that problem for you.

Both DESCRIBE and CONSTRUCT return RDF graphs directly, and not in the standard SPARQL query results format. For example, DBpedia will return results in N3. As mentioned in RDF Nuts & Bolts there are several RDF serializations that may be used to represent RDF data, and the result serialization format will be implementation dependent.

What’s next: SPARQL 1.1

Up to now, we have covered SPARQL 1.0, which is a read-only language and lacks many important features you may be used to from SQL, such as the ability to update data in a database.

SPARQL 1.1, a W3C Recommendation since March 2013, adds the following much-needed features:

Aggregates: ability to group results and calculate aggregate values (e.g. count, min, max, avg, sum, …).
Projected expressions: ability for query results to contain values derived from constants, function calls, or other expressions in the SELECT list.
Sub-queries: allows a query to be embedded within another.
Negation: includes two negation operators: NOT EXISTS and MINUS
Update: an update language for RDF
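To give a feel for the negation operators listed above, here is a sketch of the “cities without a metro population” query written with FILTER NOT EXISTS instead of the OPTIONAL and !bound idiom. It reuses the class and predicate names from Query 15 and assumes the endpoint supports SPARQL 1.1.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbp: <http://dbpedia.org/ontology/>
SELECT * WHERE {
  ?city rdf:type <http://dbpedia.org/class/yago/WikicatCitiesInTexas> ;
        dbp:populationTotal ?popTotal .
  # Drop any city for which a metro population triple exists
  FILTER NOT EXISTS { ?city dbp:populationMetro ?popMetro . }
}
ORDER BY desc(?popTotal)
```

No OPTIONAL clause is needed here, because the pattern inside NOT EXISTS is only tested and never adds bindings to the results.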

But that is not all! SPARQL 1.1 also contains other features such as:

Property paths: query arbitrary length paths of a graph via a regular-expression-like syntax
Query Federation: ability to split a single query, send parts of it to different SPARQL endpoints, and then combine the results from each one
Service Description: a vocabulary and discovery mechanism that describes the capabilities of a SPARQL endpoint.
Entailment Regimes: defines conditions under which SPARQL queries can be used for inference under RDF, RDF Schema, OWL, or RIF entailment.

Many popular triple stores already implement some or all of the features in SPARQL 1.1.
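Aggregates and projected expressions can be sketched in the same spirit. The query below counts the resources typed as “Cities in Texas”; ?cityCount is a hypothetical variable name chosen for this illustration, and the query assumes a SPARQL 1.1 endpoint.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
# COUNT is an aggregate; (expression AS ?variable) is a projected expression
SELECT (COUNT(?city) AS ?cityCount) WHERE {
  ?city rdf:type <http://dbpedia.org/class/yago/WikicatCitiesInTexas> .
}
```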
