Data Management Design Patterns

Introduction

This is the fourth tutorial on Linked Data patterns. In the last tutorial we looked at several modelling patterns that can help guide the creation of a good graph model for our data. Getting the right amount of detail into our model ensures that we have a framework which can be later extended to cover more use cases.

In this tutorial we will be looking at some data management patterns that can help organise our data. The emphasis is on making it easier to compile and organise our RDF data, no matter what the source.

Prerequisites

Semantic Web Design Patterns—Introduction

Today’s Lesson

Depending on our needs we may be managing data that all comes from a single source, or we might be collating data from different sources across an enterprise or from Linked Open Data. In both cases we may need some flexibility in how we organise and describe our data.

The good news is that there are some simple patterns that can be applied which deliver a great deal of power.

Data Management in RDF

The basic RDF model consists of the subject-predicate-object triple. While the statements describe a graph of relationships between resources, within a triple store they are managed as a simple set. That means that we have no way to identify the origin of a particular triple or to record when it was asserted. Adding triples into a triple store loses context that is useful for many applications.

To fix this, we can work with quads instead of triples. A quad is a triple plus an extra URI. This identifier associates the triple with a collection of one or more triples known as a Named Graph.

Essentially, Named Graphs allow us to organise our RDF stores as a collection of documents. Each document has a URI and contains some RDF statements.

By grouping triples together into a Named Graph we can then work with that collection as a whole rather than at the triple level. For example we can describe their origin, or update and delete the whole group.
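As a minimal sketch of this idea (not a description of how any particular store is implemented), a quad can be modelled in Python as a four-element tuple and the store as a plain set. The URIs and the helper names `triples_in` and `drop_graph` are assumptions made for illustration.

```python
# A toy in-memory quad store: each quad is (graph, subject, predicate, object).
# All URIs below are illustrative, not taken from a real dataset.
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

store = {
    ("http://www.example.org/person.rdf",
     "http://www.example.org/person/joe", FOAF_NAME, "Joe Bloggs"),
    ("http://www.example.org/other.rdf",
     "http://www.example.org/person/ann", FOAF_NAME, "Ann Other"),
}


def triples_in(store, graph):
    """Return the triples that belong to one Named Graph."""
    return {(s, p, o) for (g, s, p, o) in store if g == graph}


def drop_graph(store, graph):
    """Return a new store with every quad in the given graph removed."""
    return {quad for quad in store if quad[0] != graph}


# Work with a whole group of triples at once rather than triple by triple.
joe_triples = triples_in(store, "http://www.example.org/person.rdf")
remaining = drop_graph(store, "http://www.example.org/other.rdf")
```

Because the graph URI travels with every statement, selecting or deleting a whole group is a single filter on the first element of the tuple.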

Most RDF stores are actually quad stores and so support managing RDF data as Named Graphs. There are also data formats (e.g. TriG and N-Quads) that can be used to exchange quads between stores. SPARQL too provides support for querying Named Graphs, allowing us to target our queries to either a single graph or a collection of graphs.

Quad stores also typically allow us to ignore the graph identifier when it is useful to do so. This allows us to run SPARQL queries across the entire collection as if all of the triples were in a single “Union Graph”. In other words we can view our quad store as if it were just a triple store. This means that we can design our applications to be agnostic to the graph structure of our dataset, effectively decoupling querying and data management.
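The Union Graph view can be sketched in a few lines of Python, again treating quads as plain tuples. This is an illustration under assumed data, with hypothetical helpers `union_graph` and `match`; a real store would expose the same view through its SPARQL endpoint.

```python
# Quads are (graph, subject, predicate, object); URIs are illustrative.
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
FOAF_NICK = "http://xmlns.com/foaf/0.1/nick"
JOE = "http://www.example.org/person/joe"

quads = {
    ("http://www.example.org/a.rdf", JOE, FOAF_NAME, "Joe Bloggs"),
    ("http://www.example.org/b.rdf", JOE, FOAF_NAME, "Joe Bloggs"),
    ("http://www.example.org/b.rdf", JOE, FOAF_NICK, "joeb"),
}


def union_graph(quads):
    """Project quads onto triples, discarding the graph identifier."""
    return {(s, p, o) for (_g, s, p, o) in quads}


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return {
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    }


# The query below never mentions a graph, so it is unaffected by how
# the data happens to be partitioned.
union = union_graph(quads)
names = match(union, s=JOE, p=FOAF_NAME)
```

Note that the same triple asserted in two graphs appears only once in the union, since the result is a set of triples.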

Quad stores therefore exploit the power of RDF to provide an interesting hybrid of graph and document databases that delivers a great deal of flexibility in how we organise and query our data.

Named Graph Patterns

While Named Graphs provide the basic mechanism for data management in RDF, there are different patterns for applying them. The patterns vary based on the scope of the data in each graph.

Graph Per Source

How can we track the source of some triples in an RDF dataset?

Solution
Our RDF data might be derived from a conversion process mapping a database into RDF. Or it could result from harvesting data from the web. In both cases, being able to identify the source of the data allows us to, for example, replace all triples that derive from that source as part of an update process. To support this we can create a Named Graph for each source, with a suitable graph URI.

The graph URI could be an agreed “well-known” URI for a database or conversion process or, in the case of harvesting, it might simply be the URL from which the data was retrieved.

# TriG example
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# URI for Named Graph is the source URL
<http://www.example.org/person.rdf> {
    # These are the triples from the source document
    <http://www.example.org/person/joe> foaf:name "Joe Bloggs" .
}

Discussion
Using a Graph Per Source makes it easy to query a dataset using SPARQL to:

check whether a specific source is represented in a dataset
list out all the sources used to create a particular dataset
identify the source(s) that contributed a specific triple or set of triples

We can also easily update or replace all triples associated with a specific source, without having to remove individual triples.
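The operations listed above can be sketched against a toy quad store in Python. This is a hedged illustration only: the data and the function names (`sources`, `has_source`, `sources_of`, `replace_source`) are assumptions, and in practice the same questions would be asked with SPARQL queries over the graph names.

```python
# Quads are (graph, subject, predicate, object); here each graph URI is
# the URL the triples were harvested from. All URIs are illustrative.
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
JOE = "http://www.example.org/person/joe"
SRC_A = "http://www.example.org/person.rdf"
SRC_B = "http://www.example.org/staff.rdf"

quads = {
    (SRC_A, JOE, FOAF_NAME, "Joe Bloggs"),
    (SRC_B, JOE, FOAF_NAME, "Joe Bloggs"),
    (SRC_B, "http://www.example.org/person/ann", FOAF_NAME, "Ann Other"),
}


def sources(quads):
    """List every source represented in the dataset."""
    return sorted({g for (g, _s, _p, _o) in quads})


def has_source(quads, source):
    """Check whether a specific source contributed any triples."""
    return any(g == source for (g, _s, _p, _o) in quads)


def sources_of(quads, triple):
    """Identify the source(s) that contributed a specific triple."""
    return sorted(g for (g, s, p, o) in quads if (s, p, o) == triple)


def replace_source(quads, source, new_triples):
    """Swap out everything from one source without touching the others."""
    kept = {q for q in quads if q[0] != source}
    return kept | {(source, s, p, o) for (s, p, o) in new_triples}


updated = replace_source(quads, SRC_A, [(JOE, FOAF_NAME, "Joseph Bloggs")])
```

The replacement step never needs to know which individual triples the old version of the source contained; the graph URI is enough to find and remove them.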

The Graph Per Source pattern is best applied when the provenance of data is important to an application. This is useful both for assembling a dataset and for understanding the source of individual data items.

Graph Per Resource

How can we organise a triple store in order to make it easy to manage the statements about an individual resource?

Solution
In some cases we are more interested in managing descriptions of individual resources, rather than a collection of different sources. This is useful for web applications where a user may be viewing or editing a resource in a browser.

In this case we use one graph per resource in our application. In the simplest case, the graph URI is just the URI of the resource whose description it contains. In other cases it is derived from the resource URI, for example by substituting a different base URI.

# TriG example
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

# URI for Named Graph is derived from resource URI, using a different base URI
<http://graphs.example.org/person/joe> {
    # Description of resource
    <http://www.example.org/person/joe> foaf:name "Joe Bloggs" .
}

Discussion
The SPARQL 1.1 Graph Store HTTP Protocol provides a simple HTTP interface for manipulating graphs in a quad store, using GET, PUT, POST and DELETE operations. For typical web application scenarios we often need to retrieve a description of a resource and later replace it after an update by a user. By organising our store so that it uses a Graph Per Resource, rather than a Graph Per Source, we can use the Graph Store Protocol as the main interface for managing data in our application.
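To make the retrieve-then-replace cycle concrete, the following sketch builds (but does not send) the three HTTP requests using Python's standard library. The endpoint URL is a hypothetical placeholder, and the helper `graph_request` is an illustrative name; the only protocol detail relied on is that a graph can be identified indirectly through a `graph` query parameter.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical Graph Store endpoint; substitute the one your store exposes.
ENDPOINT = "http://localhost:3030/dataset/data"


def graph_request(method, graph_uri, body=None, content_type="text/turtle"):
    """Build (but do not send) an HTTP request that targets one graph."""
    url = ENDPOINT + "?" + urlencode({"graph": graph_uri})
    headers = {"Content-Type": content_type} if body is not None else {}
    data = body.encode("utf-8") if body is not None else None
    return Request(url, data=data, headers=headers, method=method)


graph = "http://graphs.example.org/person/joe"
turtle = (
    "<http://www.example.org/person/joe> "
    '<http://xmlns.com/foaf/0.1/name> "Joe Bloggs" .\n'
)

fetch = graph_request("GET", graph)            # retrieve the description
replace = graph_request("PUT", graph, turtle)  # replace it after an edit
remove = graph_request("DELETE", graph)        # delete the whole graph
# Passing one of these objects to urllib.request.urlopen() would send it.
```

With one graph per resource, each user-facing edit maps onto a single PUT of that resource's graph, so the application never has to work out which individual triples changed.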

While using the resource URI as the graph URI can be convenient, it is better to apply a simple transformation to the resource URI to create a new URI for the graph. Having separate URIs for the resource and the graph means that we can safely store statements about both the graph and the resource in the same store, without any potential for confusion. This is especially useful in the annotation pattern described below.
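One simple transformation, matching the TriG example in this section, is to keep the path of the resource URI and swap its host. The sketch below assumes `graphs.example.org` as the host for graph URIs; both that host and the function name `graph_uri_for` are illustrative choices rather than a required convention.

```python
from urllib.parse import urlsplit, urlunsplit

GRAPH_HOST = "graphs.example.org"  # assumed host for graph URIs


def graph_uri_for(resource_uri):
    """Derive a graph URI by swapping the host of the resource URI."""
    parts = urlsplit(resource_uri)
    return urlunsplit(parts._replace(netloc=GRAPH_HOST))


print(graph_uri_for("http://www.example.org/person/joe"))
# http://graphs.example.org/person/joe
```

Because the mapping is deterministic, an application can always compute the graph that holds a resource's description without looking it up in the store.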

Graph Annotation

How can we capture some metadata about a collection of triples?

Solution
Named Graph identifiers are URIs. This means that we can treat graphs the same as any other resource and make statements about them in RDF. We can then capture whatever metadata we need to help describe the graph:

# TriG example
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dct: <http://purl.org/dc/terms/> .

# Named Graph created by retrieving data from a specific URL
# The graph URI is the source of the data
<http://www.example.org/person.rdf> {
    # These are the triples from the source
    <http://www.example.org/person/joe> foaf:name "Joe Bloggs" .
}

# A second Named Graph containing descriptions of the other graphs in our data
<http://graphs.example.org> {
    # A triple describing the first graph, specifically when it was created
    <http://www.example.org/person.rdf> dct:created "2012-08-08" .
}

Discussion
As we saw in the introduction, Named Graphs allow us to treat a quad store as a collection of documents. Each of these documents is a resource with a unique URI, and using RDF we can describe that resource.

Graph Annotation is useful any time we need to capture a description of a graph. This metadata might include publication metadata (e.g. when it was created), provenance data (e.g. who created it) or even access control statements (e.g. who can access the graph).
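As a rough sketch of how such annotations might be used, the Python below mirrors the TriG example: one graph holds the data, and a separate metadata graph holds statements about it. The helper names `describe_graph` and `graphs_created_since` are hypothetical, and dates are assumed to be ISO 8601 strings so that they can be compared as text.

```python
# Quads are (graph, subject, predicate, object); URIs are illustrative.
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
DCT_CREATED = "http://purl.org/dc/terms/created"
METADATA_GRAPH = "http://graphs.example.org"
SOURCE_GRAPH = "http://www.example.org/person.rdf"

quads = {
    # The data itself, held in a graph named after its source.
    (SOURCE_GRAPH, "http://www.example.org/person/joe", FOAF_NAME, "Joe Bloggs"),
    # An annotation whose subject is the graph URI above.
    (METADATA_GRAPH, SOURCE_GRAPH, DCT_CREATED, "2012-08-08"),
}


def describe_graph(quads, graph):
    """Collect the annotations recorded about one graph."""
    return {
        p: o for (g, s, p, o) in quads
        if g == METADATA_GRAPH and s == graph
    }


def graphs_created_since(quads, date):
    """List graphs whose dct:created value is on or after the given date."""
    return sorted(
        s for (g, s, p, o) in quads
        if g == METADATA_GRAPH and p == DCT_CREATED and o >= date
    )


info = describe_graph(quads, SOURCE_GRAPH)
recent = graphs_created_since(quads, "2012-01-01")
```

Keeping the annotations in their own graph means they can be queried, replaced or access-controlled independently of the data they describe.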

Graph Annotation can be applied in conjunction with the other Named Graph patterns, so we are still free to scope the contents of those graphs as required.

Conclusion

In this tutorial we’ve introduced the concept of a Named Graph, an extension of the RDF model that allows us to associate some context with our triples. Named Graphs allow us to organise triples into identifiable collections that can be individually annotated and managed within a quad store. There are different ways that Named Graphs can be applied. We can choose to scope graphs based on resources, data sources, or some other useful partition.

Ultimately Named Graphs let us gain the benefits of a document-oriented approach for managing RDF data whilst still allowing us to query data in various ways.

So far in this series we’ve created an identifier scheme, then modelled and organised our data. In the next tutorial we will move on to looking at some publishing patterns that are used when sharing our data with others.
