SPARQL vs. SQL - Intro

SPARQL vs. SQL - Intro

This lesson compares the SPARQL and SQL query languages, which are designed to query RDF and relational data, respectively. You may be reading this lesson because you know one and want to learn the other or because you need to make some decisions about which to use for some purpose. First, let's look at their data models; that is, the way we consider their data to be structured.

Comparing RDF and SQL data

Many people ask what can be done with SPARQL that can't be done with SQL, when in fact they care about what can be done in RDF that can't be done with relational databases. Both of these languages give the user access to create, combine, and consume structured data. SQL does this by accessing tables in relational databases, and SPARQL does this by accessing a web of Linked Data. (Of course, SPARQL can be used to access relational data as well, but it was designed to merge disparate sources of data.)

Relational data is made up of rows of data collected into tables (also called a "relations" in formal relational literature). The rows in a table conform to a set data types and constraints called a schema. The subset of SQL called DDL (data definition language) asserts that schema:

CREATE TABLE Person (
     ID INT,
     fname CHAR(10),
     addr INT,
     FOREIGN KEY(addr) REFERENCES Address(ID)
);

CREATE TABLE Address (
     ID INT,
     city CHAR(10),
     state CHAR(2)
)

This constrains the rows in the Person table in some database to have three columns—"ID", "fname" and "addr"—which are respectively an integer, 10 characters and another integer. It similarly defines an Address table and requires that the non-NULL values in the "addr" column in the Person table correspond to the value of the "ID" column in the Address table. This allows the database to capture relationships between the real world entities (things) being represented in the database. The values below, for instance, state that Bob lives in Cambridge, MA, and that we don't know where Sue lives:

Person
ID fname addr
7 Bob 18
8 Sue NULL
Addresses
ID city state
18 Cambridge MA

The numbers 7, 8 and 18 were invented to identify the rows and capture linkages between them. SQL queries about these relations will generally not mention the specific numbers but will recapitulate the constraints of the relationship, i.e. that Person.addr=Address.ID. This same sort of expression can capture constraints which aren't intrinsic relationships, e.g. that a Person's addr number be the same as their shoe size.

RDF captures both entity attributes and relationships between entities as statements of the form entity1 has propertyA relationship to entity2. Using a language called Turtle, we can say that there is a person named "Bob" with an address in Cambridge, MA:

<PersonA> a <Person> .
<AddressB> a <Address> .
<PersonA> <Person#fname> "Bob" .
<AddressB> <Address#city> "Cambridge" .
<PersonA> <Person#addr> <AddressB> .
<AddressB> <Address#state> "MA" .

and that there's another person, "Sue", but we won't say anything about her address because we don't know it:

<PersonF> a <Person> .
<PersonF> <Person#fname> "Sue" .

We call the three terms in every RDF statement the subject, predicate and object. The terms used in the statements above are relative URLs in angle brackets (<>s), literals in quotation marks (""s) and the keyword "a" which is just a shortcut for the URL used to identify a "has type" relationship. There is no concept in RDF corresponding to SQL's NULL as there is no RDF requirement corresponding to SQL's structural constraint that every row in a relational database must conform to the same schema. The object of one assertion, e.g.<AddressB> above, may be the subject or object of other assertions. In this way, a set of RDF statements connect up to create a "graph" (in the mathematical sense). You will frequently hear the term "RDF graph". These graphs may be cyclic, for example stating that Bob lives someplace where he is also the owner:

<PersonC> <Person#homeAddress> <AddressK> .
<PerconC> <Person#fname> "Bob" .
<AddressK> <Address#owner> <PersonC> .

The examples above illustrate some of the structural similarities and differences between RDF and relational data. A core philosophical difference is that RDF is a post-Web language; that is, it allows one to use web identifiers for the entities we want to describe, and for the attributes and relationships we use to describe them. If I trust the publishers not to lie to me, I can merge information from different parties trivially. An example of this trust can be voiced as "Do I trust that whenever these parties assert that someone has a<http://xmlns.com/foaf/0.1/givenName>, that value will indeed be that person's given name?" This means that RDF data is mergable in a way which for relational data would require an intermediate process of mapping the terms between databases and assuring that no tables used the same column name to mean different things.

Introduction to SPARQL and SQL

Experts that we now are on RDF and relational data, we can now examine their query languages. A SQL query to get the addresses for each person living in MA could look like:

SELECT Person.fname, Address.city
FROM Person, Address
WHERE Person.addr=Address.ID
AND Address.state="MA"

Conceptually, we are SELECTing a list of attributes FROM a set of tables WHERE certain constraints are met. These constraints capture the relationships implicit in the scheme, Person.addr=Addresses.ID, and the selection criteria, e.g. Address.state="MA".

A SPARQL query of the same data could look like

SELECT ?name ?city
WHERE {
     ?who <Person#fname> ?name ;
     <Person#addr> ?adr .
     ?adr <Address#city> ?city ;
     <Address#state> "MA"
}

For better or worse, SPARQL reuses some key words familiar to SQL users: SELECT, FROM, WHERE, UNION, GROUP BY, HAVING and most aggregate function names.

Looking at the examples above, we see this general form for SQL queries:

SELECT <attribute list>
FROM <table list>
WHERE <test expression>

The test expressions capture both the rows relevent to a particular query (the rows for people who live in the state "MA") and the structure of the database (Person.addr references Address.ID). Documentation and literature tends to avoid the ambiguous word "selection", instead using restriction for selecting rows and projection for selecting specific columns from those rows. The result of the query is a list of rows, each with the SELECTed list of attributes. Executing the example query over the sample database at the top of the lesson yields one solution, corresponding to the one person living in "MA":

fname city
Bob Cambridge

The SPARQL query above has a similar structure:

SELECT <variable list>
WHERE {<graph pattern> }

The variables in the variable list are bound by the graph pattern. (Remember, "graph" just means that three is a set of potentially interconnected statements.) The graph pattern looks like the data statements, but the subject, predicate or object may be replaced with a variable (terms beginning with a "?"). The pattern above finds all values of ?name, ?city, ?who and ?addr which match the data, projecting out only ?name and ?city:

?fname ?city
"Bob" "Cambridge"

The query had one solution with the variables fname and city bound to the literals "Bob" and "Cambridge". The terms in SPARQL solution sets are represented the same way as in SPARQL queries or Turtle statements. Note that the column headings in the SPARQL solution set are variables which appeared in the WHERE { <graph pattern> } while the column headings in the SQL results are the names of attributes (columns) in the SQL schema.

NULLs, OPTIONALS and LEFT OUTER JOINs

SQL uses the token NULL to indicate that data is not available or not applicable. SELECTs match table rows even if the selected attributes are NULL. However, join constraints will typically eliminate rows if there are no rows with corresponding values. The LEFT OUTER JOIN operator performs a regular ("inner") join but does not eliminate solutions if the join constraints are not met. The following query would select each person's fname and, if available, their city:

SELECT Person.fname, Address.city
FROM Person
LEFT OUTER JOIN Address ON Person.addr=Address.ID
WHERE Address.state="MA"

fname city
Bob Cambridge
Sue NULL

SPARQL uses the key word OPTIONAL instead of LEFT OUTER JOIN, but the effect is similar:

SELECT ?name ?city
WHERE {
    ?who < Person#fname> ?name .
    OPTIONAL {
         ?who < Person#addr> ?adr .
         ?adr < Address#city> ?city ;
         < Address#state> "MA"
    }
}

Though the join semantics are analogous between the two languages, there's a noticeable difference in the treatment of missing data. Missing data is simply not expressed in RDF. Also (and consequentially), SPARQL graph patterns will not bind if there are missing attributes, e.g. Sue's addr above. A SPARQL query selecting each person's name and the identifier for their address record must make the addr attribute OPTIONAL in order to match Sue's record:

SELECT ?name ?city
WHERE {
    ?who < Person#fname> ?name .
    OPTIONAL {?who < Person#addr> ?adr }
}

A truly analogous SQL query must prevent bindings to NULL, i.e.:

SELECT Person.fname, Person.addr
FROM Person
LEFT OUTER JOIN Address ON Person.addr=Address.ID
WHERE Address.state="MA" AND Person.addr IS NOT NULL

Other Operators Over Solution Sets

SPARQL and SQL have very similar UNION and MINUS operators, which respectively add and remove solutions from a solution set.

Because the datatypes of an SQL table are assumed to be uniform across all rows, care must be taken to align the datatypes of the SELECT. Some SQL databases enforce this rule, leaving the user with some helpful error messages to find the misalignment. Others amiably return heterogeneous columns, with e.g. the latter rows with values like 2012-05-28 in a column of floats.

Queries on the Web

SQL databases are respositories of data, with a set of tables populated by rows of data. SQL queries operate over a given database. SPARQL services vary in whether or not they have a pre-determined RDF database. The example query above presumes that a set of triples is available for querying, which one would expect from service which queries an already populated database. If the RDF database were by default empty, or didn't contain the data needed for the query, we would need to specify where to load that data. SPARQL re-uses the SQL keyword "FROM" to identify web resources which need to be loaded in order to complete a query:

SELECT ?name ?city
FROM<http://example.org/AddressBook>
WHERE { ... }

Exploration

Combining navigation with exploration is very easy in SPARQL. If I want to find out what my organization knows about reactions involving illudium phosdex, I can write a query which finds those reactions and explores their attributes:

SELECT ?reaction ?p ?o
WHERE {
    ?compound ex:name "illudium phosdex" ;
    ?reaction ex:involves ?compound ;
    ?reaction ?p ?o
}

In SQL, this would be like:

SELECT reactions.*
FROM reactions, compounds
WHERE reactions.compoundID=compounds.ID
AND compounds.name="illudium phosdex"

Generic Linked Data browsers currently leverage SPARQL to explore data, and will likely lead to a generation of purpose-built interfaces which enable knowledge users to understand and capitalize on information assets.

Federation

Probably the main feature of SPARQL which will impress SQL users is the ability to federate queries across different repositories. RDF theory provides a foundation to integrate graphs of data, and RDF tools put that power in the hands of users by allowing them to trivially retrieve multiple documents of data. Integrating large databases is also trivial, but instead of retrieving the data and merging it locally, one writes SPARQL queries which delegate portions of the query to remote query services, e.g. this week's unshipped orders:

SELECT ?order ?handler
WHERE {
     SERVICE<sales> {
         ?order ex:soldBy ?handler ;
         ex:dueDate ?due
         FILTER (?due> "2012-02-22"&& ?due< "2012-02-29"
     } MINUS {
         SERVICE<fulfillment> { ?order ex:shipped ?shipped }
     }
}

SQL has no standard system for query federation. Various products offer tools which more or less leverage the SQL syntax to manage access to a pre-assigned set of databases. MySQL FROM directives can join data from different virtual databases running in the same MySQL server, Oracle Database Streams and SQL Server Integration Services use that syntax to connect to databases which have been mapped to local schema names by some manual configuration.

Schema Re-use

The bold assertion that SPARQL can be used to trivially integrate data requires that the data of course be in intersecting domains. To make the integration truly trivial, it helps that the data be expressed in a similar fashion. If we want to connect one database's information about the physiological impact of certain chemicals to another database's ingredients of medications, it is certainly easier if the chemicals are represented the same way in both databases. This may seem like a tall order, but SPARQL and RDF's foundation in the web make it easy to explore, embrace and extend existing schemas. Where the collaborative spirit may be insufficient to inspire folks to use the same schema, we still have laziness to help us fight entropy.

Summary

This lesson introduced the relational and RDF data models and highlighted the structural differences between them. These differences become apparent in the capabilities of their query languages. We discussed the culture of re-use and integration that pervades the Semantic Web and mentioned that SPARQL can be used to access relational data as well as RDF. Two recent specifications, A Direct Mapping of Relational Data to RDF and R2RML: RDB to RDF Mapping Language, define this process. The next lesson will contrast SPARQL and SQL queries over two databases containing exactly the same information.