This is the sixth tutorial on Linked Data patterns. In the last tutorial we looked at a number of patterns that can be applied to help support the publishing of Linked Data.
In this final tutorial we will be looking at some application patterns that can be applied when developing applications that consume RDF data using SPARQL or Linked Data.
Semantic Web Design Patterns—Introduction
Application architecture is a rich source of design patterns. We can apply many of these well-known patterns when designing RDF applications. But there are some useful features of semantic web technologies that can help us create even more flexible application architectures.
RDF Application Patterns
Architectures vary from one RDF application to the next in several ways. One common point of variation is the type(s) of data source involved. Some applications are implemented around a single triplestore. The application handles all data access to the triplestore, which fulfils the role of the primary database.
Other RDF applications may draw on data from several different locations, perhaps via retrieving Linked Data or interacting with multiple SPARQL endpoints. Whether they draw on Linked Open Data, private enterprise endpoints, or a mixture of the two, these applications are closer to “mashups” that combine several data sources on the fly.
Applications also vary depending on which schemas they support. An application might be designed as a SKOS editor or a FOAF social network browser, with the UI tailored to working with a particular type of data. However, other RDF applications are more generic data browsers, capable of querying and displaying a wide variety of different data sources and schemas.
These variations raise a number of important design questions that impact application architecture:
How can data quality be managed if data is coming from multiple sources?
How can the performance impacts of reliance on multiple, remote data sources be minimized?
How can an application be made extensible to take advantage of new data from existing sources?
There are a number of ways that specific features of RDF, SPARQL and HTTP can help address these design issues, allowing us to create flexible, robust software. The following patterns highlight some of these features.
Missing Isn’t Broken
How do we handle the potentially messy or incomplete data we use from the web?
Data that comes from remote sources—particularly from the web—may be incomplete or might not conform to a schema an application expects. To address this we should create applications that are more flexible, such as by making a “best effort” to process or display at least some data. Designing applications to work on a minimal set of data will make them more robust in the face of varying data quality.
The “Missing Isn’t Broken” pattern is really a restatement of Postel’s Law: Be conservative in what you send; be liberal in what you accept. This advice is highly relevant when working with Linked Open Data, as data models and data quality can vary widely.
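One way to picture a “best effort” approach is a display routine that accepts whichever labelling property happens to be present and degrades gracefully when none is. The sketch below is illustrative only: it assumes triples are held as plain `(subject, predicate, object)` tuples, and both the `best_label` helper and the order of preference in `LABEL_PROPERTIES` are hypothetical choices rather than part of any RDF library.

```python
# Best-effort labelling: a sketch of "Missing Isn't Broken".
# Triples are plain (subject, predicate, object) tuples; this representation
# is an assumption made for the example, not a real RDF library API.

RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"
DCT_TITLE = "http://purl.org/dc/terms/title"

# Properties to try, in order of preference (an illustrative choice).
LABEL_PROPERTIES = [RDFS_LABEL, SKOS_PREF_LABEL, FOAF_NAME, DCT_TITLE]


def best_label(triples, subject):
    """Return the best available display label for a subject URI."""
    for prop in LABEL_PROPERTIES:
        for s, p, o in triples:
            if s == subject and p == prop:
                return o
    # Nothing usable was found: fall back to the last segment of the URI
    # instead of treating the missing label as an error.
    tail = subject.rstrip("/#").replace("#", "/").rsplit("/", 1)[-1]
    return tail or subject


data = {
    ("http://example.org/alice", FOAF_NAME, "Alice"),
    ("http://example.org/bob", "http://xmlns.com/foaf/0.1/mbox", "mailto:bob@example.org"),
}
print(best_label(data, "http://example.org/alice"))  # Alice
print(best_label(data, "http://example.org/bob"))    # bob
```

The resource without a name is still displayed, just with a less friendly label, which is the behaviour the pattern asks for.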
Unlike other approaches (e.g. a relational database), Linked Data offers opportunities for finding more data when some is found to be missing. For example, the initial dataset can be supplemented with additional sources by applying the “Follow Your Nose” pattern.
The following pattern describes a method for testing data to see whether it conforms to expectations.
How can a dataset be tested for known patterns?
An application may want to probe a SPARQL endpoint to test whether it contains processable data. Similarly, an application might want to validate some data to ensure that it contains a minimally useful set of properties. In both cases a simple SPARQL ASK query can be used to test an RDF graph for the desired patterns.
SPARQL has four different types of query: SELECT, DESCRIBE, CONSTRUCT and ASK. An ASK query returns a boolean response to indicate whether a pattern can be found in a dataset. These Assertion Queries can be used to probe a dataset to test whether it contains data of interest. For example, a social network application might probe a SPARQL endpoint to test if it contains terms from the FOAF vocabulary.
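As a rough sketch of the FOAF probe described above, the following builds a SPARQL Protocol GET URL for an ASK query and reads the boolean out of a SPARQL JSON results document. The endpoint URL is a placeholder, and the `fetch(url, accept)` callable is a hypothetical stand-in for whatever HTTP client the application uses; it is passed in so the example stays independent of any particular library. The URL builder also assumes the endpoint address has no query string of its own.

```python
import json
from urllib.parse import urlencode

# An Assertion Query: does the dataset describe at least one foaf:Person?
FOAF_PROBE = """
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
ASK { ?person a foaf:Person }
"""


def ask_url(endpoint, query):
    """Build a SPARQL Protocol GET URL carrying the query."""
    return endpoint + "?" + urlencode({"query": query.strip()})


def parse_ask(body):
    """Extract the boolean from a SPARQL JSON results document."""
    return bool(json.loads(body)["boolean"])


def has_foaf_data(endpoint, fetch):
    """Probe an endpoint; fetch(url, accept) returns the response body."""
    body = fetch(ask_url(endpoint, FOAF_PROBE), "application/sparql-results+json")
    return parse_ask(body)


# Usage with a canned response standing in for a real endpoint.
def fake_fetch(url, accept):
    return '{"head": {}, "boolean": true}'


print(has_foaf_data("http://example.org/sparql", fake_fetch))  # True
```

An application could run a probe like this once per endpoint and only enable its FOAF-specific features when the answer is true.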
ASK queries can also be used as simple assertions when validating data, such as before storing data submitted by a client application. An ASK query is also a useful way to unit test a data conversion to confirm that the output data contains the expected results.
While developers often focus initially on SELECT or CONSTRUCT queries, each of the different forms of SPARQL query has its own role to play in creating RDF applications.
How can we generate a useful default description of a resource without having to enumerate all the properties or relations that are of interest?
Applications that use tightly defined queries can be brittle. They are not tolerant of missing data and are incapable of discovering new RDF properties. For example, a Linked Data browser might want to fetch all available properties of a resource without knowing in advance which properties a given data source might contain.
Rather than use prescriptive queries, an application can extract data based on general patterns. The simplest way to achieve this is via a SPARQL DESCRIBE query which delegates responsibility for describing a resource to the SPARQL endpoint.
There are many different ways that a subset of a larger RDF graph can be extracted. An application may want to extract all properties of a single resource or just its relationships to other resources. A Bounded Description is a method of slicing up an RDF graph in order to create a useful view of a resource or resources. Bounded Descriptions rely on general patterns in the data, rather than specific properties, in order to partition the graph. Different forms of Bounded Description vary based on how much of the graph is traversed to extract the data.
The most common form is known as a Concise Bounded Description (CBD), but there are several alternatives, including:
Datatype Property Description — retrieve all properties of a resource whose values are literals
Object Property Description — retrieve all properties of a resource whose values are resources, typically eliminating blank nodes
Concise Bounded Description — effectively the above two descriptions, but recursively include all properties of any blank nodes present in object properties
Symmetric Concise Bounded Description — as above but include statements where the resource being described is the object, rather than the subject
Bounded descriptions can be implemented using SPARQL CONSTRUCT queries. SPARQL DESCRIBE queries are typically implemented using a Bounded Description; SPARQL processors most commonly generate results using a CBD.
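To make the differences between these descriptions concrete, here is a small in-memory sketch of each one, following the definitions given in the list above. It assumes a toy data model that is not taken from any library: URIs are plain strings, blank nodes are strings starting with `_:`, and literals are wrapped in a hypothetical `Lit` marker class. The full W3C definition of a CBD also covers reified statements, which this sketch leaves out.

```python
class Lit(str):
    """Marks a term as an RDF literal (illustrative, not a library type)."""


def is_bnode(term):
    return not isinstance(term, Lit) and term.startswith("_:")


def datatype_property_description(graph, resource):
    """Properties of the resource whose values are literals."""
    return {(s, p, o) for s, p, o in graph if s == resource and isinstance(o, Lit)}


def object_property_description(graph, resource):
    """Properties whose values are resources, leaving out blank nodes."""
    return {
        (s, p, o) for s, p, o in graph
        if s == resource and not isinstance(o, Lit) and not is_bnode(o)
    }


def concise_bounded_description(graph, resource):
    """All properties of the resource, recursing into blank node values."""
    result, pending, seen = set(), [resource], set()
    while pending:
        node = pending.pop()
        if node in seen:
            continue
        seen.add(node)
        for s, p, o in graph:
            if s == node:
                result.add((s, p, o))
                if is_bnode(o):
                    pending.append(o)  # describe the blank node as well
    return result


def symmetric_cbd(graph, resource):
    """The CBD plus statements in which the resource is the object."""
    inbound = {
        (s, p, o) for s, p, o in graph
        if o == resource and not isinstance(o, Lit)
    }
    return concise_bounded_description(graph, resource) | inbound


ALICE = "http://example.org/alice"
BOB = "http://example.org/bob"
graph = {
    (ALICE, "foaf:name", Lit("Alice")),
    (ALICE, "foaf:knows", BOB),
    (ALICE, "ex:address", "_:addr"),
    ("_:addr", "ex:city", Lit("Bath")),
    (BOB, "foaf:knows", ALICE),
    (BOB, "foaf:name", Lit("Bob")),
}
print(len(datatype_property_description(graph, ALICE)))  # 1
print(len(object_property_description(graph, ALICE)))    # 1
print(len(concise_bounded_description(graph, ALICE)))    # 4
print(len(symmetric_cbd(graph, ALICE)))                  # 5
```

None of the functions names a specific property, so the same code keeps working when a source starts publishing properties the application has never seen.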
How can we improve performance of an application dynamically retrieving Linked Data?
RDF applications often create in-memory graphs of data for local processing, e.g. to render a user interface. This data might be fetched over the web from several SPARQL endpoints or Linked Data sources. Rather than fetch data in series, we can parallelise the HTTP requests, reducing the overall response time.
Parallel Retrieval of resources over HTTP can greatly reduce the performance overheads of using multiple sources. Most HTTP client libraries support some form of parallel retrieval.
The overall response time for the retrieval is reduced to that of the slowest resource. Data requested asynchronously is then parsed and added to the working graph as it arrives. We can rely on RDF’s default merging model to ensure that the data from the different queries is merged into a consistent graph.
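The retrieval and merge steps might be sketched as follows, using the standard library's thread pool. The `fetch(url)` callable is hypothetical: it stands for whatever code performs the HTTP request and parses the response into `(s, p, o)` string triples. The sketch also assumes blank nodes are strings starting with `_:` and that no literal begins with that prefix; blank nodes are renamed per source because an RDF merge must keep blank nodes from different documents distinct.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def rename_bnodes(triples, source_id):
    """Keep blank nodes from different sources distinct when merging."""
    def fix(term):
        return f"_:s{source_id}_{term[2:]}" if term.startswith("_:") else term
    return {(fix(s), p, fix(o)) for s, p, o in triples}


def load_parallel(urls, fetch, max_workers=8):
    """Fetch every URL concurrently and merge the results into one set.

    fetch(url) returns an iterable of (s, p, o) string triples. It is a
    caller-supplied function that does the HTTP request and the parsing.
    """
    working_graph = set()
    failed = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, url): (i, url) for i, url in enumerate(urls)}
        # Merge each response as soon as it arrives, in completion order.
        for future in as_completed(futures):
            i, url = futures[future]
            try:
                working_graph |= rename_bnodes(future.result(), i)
            except Exception as err:  # one failed source should not sink the rest
                failed[url] = err
    return working_graph, failed


# Usage with canned data standing in for remote sources.
sources = {
    "http://a.example/data": [("http://example.org/x", "p", "_:b1")],
    "http://b.example/data": [
        ("http://example.org/x", "p", "_:b1"),
        ("http://example.org/x", "q", "v"),
    ],
}


def fake_fetch(url):
    if url not in sources:
        raise IOError("unreachable")
    return sources[url]


merged, failed = load_parallel(list(sources) + ["http://c.example/data"], fake_fetch)
print(len(merged), list(failed))  # 3 ['http://c.example/data']
```

Recording failures instead of raising them keeps the loader in line with “Missing Isn’t Broken”: the application can carry on with whatever data did arrive.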
Combined with HTTP caching of commonly used sources this pattern can greatly improve performance for “mashup” type applications or generic data browsers.
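The caching half of that combination can be as simple as revalidating stored responses with conditional requests. The sketch below assumes a hypothetical `fetch(url, headers)` callable that returns a `(status, response_headers, body)` tuple with the entity tag under the key `"ETag"`; the `If-None-Match` header and the `304 Not Modified` status are standard HTTP, but the class and its interface are purely illustrative.

```python
class ConditionalCache:
    """Cache response bodies and revalidate them using ETags."""

    def __init__(self, fetch):
        # fetch(url, headers) -> (status, response_headers, body); hypothetical.
        self._fetch = fetch
        self._entries = {}  # url -> (etag, body)

    def get(self, url):
        headers = {}
        cached = self._entries.get(url)
        if cached:
            headers["If-None-Match"] = cached[0]
        status, response_headers, body = self._fetch(url, headers)
        if status == 304 and cached:
            return cached[1]  # unchanged on the server: reuse the stored body
        etag = response_headers.get("ETag")
        if status == 200 and etag:
            self._entries[url] = (etag, body)
        return body


# Usage with a fake server that counts how often it sends a full body.
full_responses = []


def fake_fetch(url, headers):
    if headers.get("If-None-Match") == '"v1"':
        return 304, {}, ""
    full_responses.append(url)
    return 200, {"ETag": '"v1"'}, "<turtle data>"


cache = ConditionalCache(fake_fetch)
first = cache.get("http://example.org/data")
second = cache.get("http://example.org/data")
print(first == second, len(full_responses))  # True 1
```

The second request still makes a round trip, but the server only has to confirm that nothing changed, so the body is neither re-sent nor re-parsed.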
This pattern is typically applied to divide up the task of creating a local copy of some data across a number of application threads. However, we can also decompose the creation of a local mirror across a number of processes, as the following pattern illustrates.
How can the task of compiling or constructing a dataset be divided up into smaller tasks?
Constructing a dataset from remote sources often involves mirroring data into a local triplestore to reduce dependency on unreliable remote sources. Retrieval is often combined with other steps, such as transforming data to meet a specific schema. Rather than using a single application pipeline, the steps of mirroring, transforming, and enriching a dataset can be divided up into separate processes. The individual processes collaborate around a shared triplestore.
The “Blackboard” pattern is an established architectural design pattern that has been used in a number of systems. It is a particularly effective way to decompose a complex data processing architecture into a number of simpler processes.
The decomposition of the data aggregation and conversion tasks into smaller units makes them easier to implement individually, using whatever tools or technologies are most applicable. The overall result of the processes cooperating to compile and enrich the dataset can be extremely complex but does not require any overall coordination effort. The independent processes can run on their own schedules and be driven by the presence or absence of necessary data in the shared store.
Additional processing steps (e.g. to acquire data from additional sources) can easily be added without impact on the overall system, making the architecture easy to extend.
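The idea of processes driven purely by what is present in, or absent from, the shared store can be sketched in a few lines. Everything here is a simplification: the “triplestore” is an in-memory set, the three workers and the `labelLength` property are hypothetical, and the `run_until_quiet` loop merely stands in for independent processes running on their own schedules.

```python
NAME = "http://xmlns.com/foaf/0.1/name"
LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
LABEL_LENGTH = "http://example.org/ns#labelLength"  # hypothetical property


def mirror(store):
    """Stand-in for a harvester that copies remote data into the store."""
    return {("http://example.org/alice", NAME, "Alice")}


def add_labels(store):
    """Transform: give every named resource an rdfs:label if it lacks one."""
    labelled = {s for s, p, o in store if p == LABEL}
    return {(s, LABEL, o) for s, p, o in store if p == NAME and s not in labelled}


def enrich(store):
    """Enrich: record the label length for resources that lack the value."""
    done = {s for s, p, o in store if p == LABEL_LENGTH}
    return {
        (s, LABEL_LENGTH, str(len(o)))
        for s, p, o in store if p == LABEL and s not in done
    }


def run_until_quiet(store, workers):
    """Run the workers repeatedly until none of them adds anything new."""
    changed = True
    while changed:
        changed = False
        for worker in workers:
            new = worker(store) - store
            if new:
                store |= new
                changed = True
    return store


# The workers are listed in the "wrong" order on purpose: each one simply
# does nothing until the data it depends on has appeared in the store.
store = run_until_quiet(set(), [enrich, add_labels, mirror])
print(len(store))  # 3
```

No worker knows about the others. Adding a new step means writing one more function that inspects the store, which is what makes the architecture easy to extend.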
In this tutorial we’ve looked at several useful RDF application patterns. Applications might use only a single managed data source or might rely on data from a mixture of public and private data sources. In both of these scenarios applications need to address the issue that data published in a distributed way might have varying quality, conform to varying schemas, and will evolve over time.
This is the last tutorial in the design pattern series. Over the entire series we’ve looked at a number of patterns relevant to semantic web technologies. Beginning with guidance on creating good identifiers, we’ve looked at patterns for modelling and managing our data, approaches for publishing data for re-use by others and finally some techniques for building flexible applications using RDF and SPARQL. In total these patterns should provide a good foundation for guiding your future work with semantic web technologies.