Take Advantage of the Metadata Benefits of Graph Databases

It’s well known that graph databases open up new categories of analytics capability and new potential for machine learning. If you want to create a knowledge graph, understand buyer intent, or build a recommendation engine with PageRank, graph databases and the algorithms they offer simplify the process. In addition, a knowledge graph lends itself well to machine learning, both in training algorithms and in deploying them. We expect these benefits from our graph database systems, and they deliver.

The Less-Obvious Reason

However, there is another big benefit that’s often overlooked: metadata management and schemas. I’ve noticed that some analytics teams have little time to manage schemas for incoming data. They are handed data and asked to produce analytics from it, and handling the schemas and their potential changes can be a time-consuming challenge. NoSQL databases have been popular for their ease of use, thanks to their flexible schemas. However, graph databases and the power of triples can simplify metadata management, too. By expressing all of your data as triples, you reduce the need for rigid schemas, complicated ETL and data transformation, multiple tables and tricky, expensive JOINs.

Graph databases, specifically RDF triple stores like AnzoGraph™, deal with data that almost always takes the same SUBJECT-PREDICATE-OBJECT form, known as a triple. The facts are stored in a format defined by the RDF specification, but essentially, you’ll see facts like:

John is a person
John is married to Sue
John buys a BMW
John resides in New York
John is the son of Andrew

In this system, you don’t have to know anything ahead of time about what you want to store and what type of analytics you want to run. You can add any facts about John at any time. If new data comes along about John on any subject, you can store it in a triple and not a separate table. You don’t need to create separate tables and joins with graph databases. You can get a lot out of triple stores.
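To make the idea concrete, here is a minimal sketch in plain Python. It is not AnzoGraph's API or RDF syntax. It assumes facts are held as simple (subject, predicate, object) tuples, and the `match` helper, the predicate names and the "works_for Acme" fact are all hypothetical, chosen only for illustration.

```python
from typing import List, Optional, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# The facts about John from the list above, as triples.
# Predicate names are illustrative, not taken from any real ontology.
triples: Set[Triple] = {
    ("John", "is_a", "Person"),
    ("John", "married_to", "Sue"),
    ("John", "buys", "BMW"),
    ("John", "resides_in", "New York"),
    ("John", "son_of", "Andrew"),
}


def match(
    store: Set[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> List[Triple]:
    """Return all triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t
        for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# A new kind of fact arrives (hypothetical). No table or column has to be
# created first; it is just one more triple.
triples.add(("John", "works_for", "Acme"))

about_john = match(triples, s="John")
purchases = match(triples, p="buys")
```

The point of the sketch is that the shape of the data never changes: a fact about an unforeseen topic is stored exactly like the facts that were there from the start.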

If you need more expressiveness, you can use quads instead of triples, and properties like those on a labelled property graph. AnzoGraph supports both of these features. You can use properties to record, for example, when John bought the BMW or how much he likes the brand. You can use quads when you want to manage multiple sets of facts, since your facts about John may come from different places. That complexity is available to you, but it isn’t necessary to get value.
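A rough sketch of both ideas in plain Python follows. It does not use AnzoGraph's syntax. It assumes a quad is a triple plus a fourth element naming where the fact came from, and that edge properties can be modelled as a dictionary keyed by the triple. The source names, the purchase date and the affinity score are made-up values for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]
Quad = Tuple[str, str, str, str]  # (subject, predicate, object, source graph)

# Facts about John arriving from two hypothetical sources.
quads: List[Quad] = [
    ("John", "buys", "BMW", "crm_export"),
    ("John", "resides_in", "New York", "crm_export"),
    ("John", "married_to", "Sue", "public_records"),
]

# Properties attached to a single relationship, in the spirit of a
# labelled property graph. The values are invented for the example.
edge_properties: Dict[Triple, Dict[str, object]] = {
    ("John", "buys", "BMW"): {
        "purchase_date": "2019-03-01",
        "brand_affinity": 0.9,
    },
}


def facts_by_source(all_quads: List[Quad]) -> Dict[str, List[Triple]]:
    """Group triples by the graph (fourth element) they belong to."""
    grouped: Dict[str, List[Triple]] = defaultdict(list)
    for s, p, o, graph in all_quads:
        grouped[graph].append((s, p, o))
    return dict(grouped)


by_source = facts_by_source(quads)
```

Keeping the source as a fourth element means facts from different feeds can be loaded, queried or dropped as a group without touching the rest.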

Contrast this with more rigid solutions. In a Relational Database Management System (RDBMS), I have to know what data I’m going to store about each person. It’s also a good idea to know what kind of analysis I will run so that the queries will run fast. Only then can I design a schema and factor the database correctly. Unlike with graph databases, managing schemas in an RDBMS is rigid and unforgiving.
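The contrast can be seen in a small sketch using Python's built-in `sqlite3` module as a stand-in for an RDBMS. The `person` table, its columns and the employer "Acme" are hypothetical. The point is only that a fact nobody planned for requires a schema change before it can be stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema has to be decided up front (hypothetical columns).
conn.execute(
    "CREATE TABLE person (name TEXT PRIMARY KEY, spouse TEXT, city TEXT)"
)
conn.execute(
    "INSERT INTO person VALUES (?, ?, ?)", ("John", "Sue", "New York")
)

# A fact about an attribute the schema did not anticipate is rejected.
try:
    conn.execute(
        "INSERT INTO person (name, employer) VALUES (?, ?)", ("Sue", "Acme")
    )
    schema_change_needed = False
except sqlite3.OperationalError:
    # SQLite reports that the table has no such column.
    schema_change_needed = True

# Only after altering the table can the new fact be stored.
conn.execute("ALTER TABLE person ADD COLUMN employer TEXT")
conn.execute(
    "UPDATE person SET employer = ? WHERE name = ?", ("Acme", "John")
)
row = conn.execute(
    "SELECT name, employer FROM person WHERE name = ?", ("John",)
).fetchone()
conn.close()
```

In a production system the `ALTER TABLE` step usually also means a migration, a review of existing queries and possibly new tables and joins, which is the overhead the triple model avoids.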

NoSQL document stores have a similar openness. They store all data on a given entity within a single document, and any associated data is stored inside that same document. A serious indexing system lets analysis run relatively fast. However, like RDBMSs, they can fall short when it comes to storing and analyzing “connected” information. PageRank and shortest-path algorithms may be available in the NoSQL world, but using them takes a lot more work.
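As an illustration of why connected data suits the triple model, here is a minimal breadth-first shortest-path sketch in plain Python. It is not how any particular graph database implements the algorithm. It assumes relationships can be followed in either direction, and the "Sue works_with Maria" fact is a hypothetical addition to the earlier facts about John.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

triples: List[Triple] = [
    ("John", "son_of", "Andrew"),
    ("John", "married_to", "Sue"),
    ("Sue", "works_with", "Maria"),  # hypothetical fact for the example
]


def shortest_path(
    store: List[Triple], start: str, goal: str
) -> Optional[List[str]]:
    """Breadth-first search over triples, treating edges as undirected."""
    neighbors: Dict[str, Set[str]] = {}
    for s, _p, o in store:
        neighbors.setdefault(s, set()).add(o)
        neighbors.setdefault(o, set()).add(s)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no connection between start and goal


path = shortest_path(triples, "Andrew", "Maria")
```

Because every fact is already an edge, the traversal works directly on the stored data. With one document per entity, the same question would first require extracting the links between documents.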

It’s About Analytics and Metadata

Graph databases do some interesting things to analyze your data, but don’t overlook their metadata simplicity. A survey by CrowdFlower, a provider of a “data enrichment” platform for data scientists, found that data scientists spend 60% of their time on cleaning and organizing data. It’s amazing how little time is spent on actual analysis and how much is spent just getting data to fit into the right buckets. If you find yourself in this position, consider a graph database such as AnzoGraph. It is a graph database built from the ground up to support Massively Parallel Processing (MPP) and advanced analytics at scale. AnzoGraph is a distributed Graph Online Analytical Processing (GOLAP) database, allowing users to load and analyze billions or even trillions of triples. By supporting analytical queries, AnzoGraph can perform deeper analysis on more data than other graph databases. Graph algorithms in AnzoGraph support use cases such as recommendation engines, buyer intent, knowledge graphs, managing complex paths and inferencing. AnzoGraph is available on-premises and in the cloud. If you’d like to try it out, you can download a free 60-day trial.

The Smart Data Blog