NLP and the Semantic Web

Natural language processing (NLP) and Semantic Web technologies are both Semantic Technologies, but with different and complementary roles in data management. In fact, the combination of NLP and Semantic Web technologies enables enterprises to combine structured and unstructured data in ways that are simply not practical using traditional tools.

This lesson will introduce NLP technologies and illustrate how they can be used to add tremendous value in Semantic Web applications.

Objectives

After completing this lesson, you will know:

What NLP technologies are and what they do.

Common uses of NLP technologies.

How NLP is used in Semantic Web applications to help manage unstructured data.

Prerequisites

Semantic Web vs. Semantic Technologies

Today's Lesson

Apple's Siri, IBM's Watson, Nuance's Dragon… there is certainly no shortage of hype surrounding NLP at the moment. Truly, after decades of research, these technologies are finally hitting their stride, being utilized in both consumer and enterprise commercial applications.

NLP Defined

NLP strives to enable computers to make sense of human language.

Take just a moment to think about how hard that task actually is. Have you ever misunderstood a sentence you've read and had to read it all over again? Have you ever heard a jargon term or slang phrase and had no idea what it meant? And how are your own grammar skills? Understanding what people are saying can be difficult even for us Homo sapiens. Clearly, making sense of human language is a legitimately hard problem for computers.

The Turing Test

Of course, researchers have been working on these problems for decades. In 1950, the legendary Alan Turing created a test—later dubbed the Turing Test—that was designed to test a machine's ability to exhibit intelligent behavior, specifically using conversational language. C3P0 would pass this test. Unfortunately, however, Siri would not.

Although no actual computer has truly passed the Turing Test yet, we are at least to the point where computers can be used for real work. Apple's Siri accepts an astonishing range of instructions with the goal of being a personal assistant. IBM's Watson is even more impressive, having beaten the world's best Jeopardy players in 2011.

Using a variety of techniques (e.g. statistical modeling, lexical and grammatical parsing, and machine learning, among others), NLP technologies deconstruct words, sentences, paragraphs, and entire documents expressed in human language and map them onto a semantic structure that can be used by a computer.

Consider a question that you might ask of Siri: "What is the temperature in Boston, Massachusetts?" The task can be further simplified by assuming that the voice recognition technology maps out the correct spoken words, down to their correct spellings (if you have a non-American accent, you may already have been frustrated by Siri's limitations in this regard).

Once she has the text, Siri needs to extract certain elements of the question in order to have even a prayer of answering it—the two key elements being:

1. What needs to be returned (e.g., a temperature value of some kind)

2. The fact that Boston, Massachusetts is a location that can be referenced

Therefore, this information needs to be extracted and mapped to a structure that Siri can process.
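The extraction step above can be sketched with a toy rule-based parser. This is purely illustrative (the pattern and the output structure are assumptions, not Siri's actual implementation), but it shows how a question is mapped onto the two key elements:

```python
import re

# Hypothetical intent pattern: "what is the <attribute> in <place>?"
QUESTION_PATTERN = re.compile(
    r"what is the (?P<attribute>\w+) in (?P<place>[\w\s,]+)\?",
    re.IGNORECASE,
)

def parse_question(text):
    """Map a question onto a simple structure: what to return, and where."""
    match = QUESTION_PATTERN.match(text)
    if match is None:
        return None  # the question doesn't fit any known pattern
    return {
        "return_type": match.group("attribute").lower(),  # e.g. "temperature"
        "location": match.group("place").strip(),         # e.g. "Boston, Massachusetts"
    }

print(parse_question("What is the temperature in Boston, Massachusetts?"))
```

Real systems replace the single regex with statistical parsers and large intent models, but the output shape (an intent plus its arguments) is the same idea.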

How NLP Works

Computers require structure to accomplish anything. At their very root they are number crunchers, not poets. Computer languages are designed to be unambiguous, so embedded somewhere in Siri's code is probably a rule that says, "If they ask for the temperature, then my response is, ‘The temperature is N degrees.'"

Although human languages do have a structure (called grammar), that structure is highly ambiguous in every language on the planet. Words can have different meanings depending on context. Consider the English phrase, "I like oranges."

Does this phrase refer to a range of colors, or to the fruit?

Who is "I" in this context?

And, to be honest, grammar is in reality more of a set of guidelines than a set of rules that everyone follows.

To delve even deeper into the linguistic abyss, consider a sentence like, "Do you see the man with the binoculars?"

Does that mean, "Using the binoculars, do you see the man?"

Or does it mean, "Do you see the man who is holding binoculars?"

Grammatically, either interpretation is correct. Which one was intended depends on the context of the sentence. Therefore, NLP begins by looking at grammatical structure, but guesses must be made wherever the grammar is ambiguous or incorrect.

Contextual clues must also be taken into account when parsing language. If the overall document is about orange fruits, then it is likely that any mention of the word "oranges" is referring to the fruit, not a range of colors.
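A toy version of this contextual reasoning can be written in a few lines. The sense names and clue-word lists below are illustrative assumptions; production systems learn these associations statistically rather than hand-coding them:

```python
# Toy word-sense disambiguation: pick the sense of "oranges" whose
# clue words appear most often in the surrounding document.
SENSE_CLUES = {
    "fruit": {"eat", "juice", "peel", "citrus", "sweet"},
    "color": {"paint", "shade", "hue", "bright", "palette"},
}

def disambiguate(document_words):
    """Score each sense by counting clue words present in the document."""
    scores = {
        sense: sum(1 for w in document_words if w in clues)
        for sense, clues in SENSE_CLUES.items()
    }
    return max(scores, key=scores.get)

doc = "i like oranges because the citrus juice is sweet".split()
print(disambiguate(doc))  # "fruit"
```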

Finally, NLP technologies typically map the parsed language onto a domain model. That is, the computer will not simply identify temperature as a noun but will instead map it to some internal concept that will trigger some behavior specific to temperature versus, for example, locations.
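A minimal sketch of that domain-model dispatch might look like the following. All of the names here are hypothetical; the point is only that "temperature" is mapped to a concept with its own behavior, rather than being treated as just another noun:

```python
# Hypothetical domain model: recognized concepts trigger concept-specific
# behavior instead of being handled as plain words.
def handle_temperature(location):
    return f"Looking up the temperature for {location}..."

def handle_directions(location):
    return f"Showing a map of {location}..."

DOMAIN_MODEL = {
    "temperature": handle_temperature,
    "directions": handle_directions,
}

def dispatch(concept, location):
    """Route a parsed concept to its domain-specific handler."""
    handler = DOMAIN_MODEL.get(concept)
    if handler is None:
        return "Sorry, I don't understand that request."
    return handler(location)

print(dispatch("temperature", "Boston, Massachusetts"))
```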

Common Industry Uses of NLP

These difficulties mean that general-purpose NLP is very, very difficult, so the situations in which NLP technologies seem to be most effective tend to be domain-specific. For example, Watson is very, very good at Jeopardy but is terrible at answering medical questions (IBM is actually working on a new version of Watson that is specialized for health care).

Similarly, some tools specialize in simply extracting locations and people referenced in documents and do not even attempt to understand overall meaning. Others effectively sort documents into categories, or guess whether the tone—often referred to as sentiment—of a document is positive, negative, or neutral.

The following are examples of some of the most common applications of NLP today.

Search - Semantic Search often requires NLP parsing of source documents. The specific technique used is called Entity Extraction, which basically identifies proper nouns (e.g., people, places, companies) and other specific information for the purposes of searching. For example, consider the query, "Find me all documents that mention Barack Obama." Some documents might contain "Barack Obama," others "President Obama," and still others "Senator Obama." When used correctly, extractors will map all of these terms to a single concept.
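The alias-to-concept mapping at the heart of that example can be sketched very simply. The alias table below is an illustrative assumption; real extractors derive these mappings from knowledge bases and learned models:

```python
# Illustrative alias table: several surface forms resolve to one canonical entity.
ENTITY_ALIASES = {
    "barack obama": "Barack_Obama",
    "president obama": "Barack_Obama",
    "senator obama": "Barack_Obama",
}

def extract_entities(text):
    """Return the set of canonical concepts mentioned in the text."""
    found = set()
    lowered = text.lower()
    for alias, concept in ENTITY_ALIASES.items():
        if alias in lowered:
            found.add(concept)
    return found

print(extract_entities("President Obama signed the bill today."))
print(extract_entities("Senator Obama spoke in Illinois."))
```

Because every surface form maps to the same concept, a search for that concept matches all three kinds of documents.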

Auto-categorization - Imagine that you have 100,000 news articles and you want to sort them based on certain specific criteria. That would take a human ages to do, but a computer can do it very quickly.
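As a toy illustration, a keyword-overlap categorizer captures the input/output shape of this task; production systems use trained classifiers, and the categories and keyword sets here are assumptions:

```python
# Toy keyword-based categorizer: assign an article to the category
# whose keyword set it overlaps most.
CATEGORY_KEYWORDS = {
    "sports": {"game", "score", "team", "season"},
    "finance": {"stock", "market", "earnings", "shares"},
}

def categorize(article):
    words = set(article.lower().split())
    best = max(CATEGORY_KEYWORDS,
               key=lambda c: len(words & CATEGORY_KEYWORDS[c]))
    # fall back when no category keywords appear at all
    return best if words & CATEGORY_KEYWORDS[best] else "uncategorized"

print(categorize("The team won the game in the final season match"))
```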

Sentiment Analysis - Sentiment Analysis measures the "sentiment" of an article, typically meaning whether the article's tone is positive, negative, or neutral. This application of NLP technology is often used in conjunction with search, but it can also be used in other contexts, such as alerting. For example, a business owner might ask an application to "alert me when someone says something negative regarding my company on Facebook."
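A minimal lexicon-based scorer shows the core idea; the word lists are illustrative assumptions, and real sentiment systems account for negation, sarcasm, and context that this sketch ignores:

```python
# Minimal lexicon-based sentiment scorer.
POSITIVE = {"great", "good", "love", "excellent"}
NEGATIVE = {"bad", "terrible", "hate", "awful"}

def sentiment(text):
    """Classify text as positive, negative, or neutral by word counts."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this company, their service is excellent"))
```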

Summarization - Often used in conjunction with research applications, summaries of topics are created automatically so that actual people do not have to wade through a large number of long-winded articles (perhaps such as this one!).

Question Answering - This is the new hot topic in NLP, as evidenced by Siri and Watson. However, long before these tools, we had Ask Jeeves (now Ask.com), and later Wolfram Alpha, which specialized in question answering. The idea here is that you can ask a computer a question and have it answer you (Star Trek-style! "Computer…").

Many other applications of NLP technology exist today, but these five applications are the ones most commonly seen in modern enterprise applications.

Applying NLP in Semantic Web Projects

So how can NLP technologies realistically be used in conjunction with the Semantic Web? The answer is that the combination can be utilized in any application where you are contending with a large amount of unstructured information, particularly if you are also dealing with related, structured information stored in conventional databases.

Clearly, then, the primary pattern is to use NLP to extract structured data from text-based documents. These data are then linked via Semantic technologies to pre-existing data located in databases and elsewhere, thus bridging the gap between documents and formal, structured data.

Consider the example of Competitive Intelligence. In this field, professionals need to keep abreast of what's happening across their entire industry. Most information about the industry is published in press releases, news stories, and the like, and very little of this information is encoded in a highly structured way. However, most information about one's own business will be represented in structured databases internal to each specific organization.

Due to the lack of structure in news clippings, it is very difficult for a pharmaceutical competitive intelligence officer to get answers to questions such as, "Which companies have published information in the last 6 months referencing compounds that target a specific pathway that we're targeting this year?" At the moment, the most common approach to this problem is for certain people to read thousands of articles and keep this information in their heads, or in workbooks like Excel, or, more likely, nowhere at all.

Fortunately, however, things are changing. Combining Semantic Web technologies with NLP technologies, we can now extract relevant information from news clippings and link that information to more rigorous scientific data (e.g., on biological compounds, pathways, which compounds relate to which pathways, etc.) that typically have already been defined and are being stored in structured databases. The combination of NLP and Semantic Web technology enables the pharmaceutical competitive intelligence officer to ask such complicated questions and actually get reasonable answers in return.
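The linking pattern described above can be sketched with plain (subject, predicate, object) triples in the style of RDF. All of the names below (companies, compounds, pathways, predicates) are illustrative assumptions; a real system would use an RDF store and SPARQL queries:

```python
# Sketch: NLP-extracted facts and pre-existing structured data, both stored
# as (subject, predicate, object) triples and queried together.
extracted_from_news = [
    ("AcmePharma", "publishedAbout", "CompoundX"),  # extracted by NLP
]
structured_data = [
    ("CompoundX", "targetsPathway", "PathwayY"),    # from internal databases
]

triples = extracted_from_news + structured_data

def companies_targeting(pathway):
    """Which companies published about compounds targeting a given pathway?"""
    compounds = {s for s, p, o in triples
                 if p == "targetsPathway" and o == pathway}
    return {s for s, p, o in triples
            if p == "publishedAbout" and o in compounds}

print(companies_targeting("PathwayY"))  # {"AcmePharma"}
```

The query spans both sources seamlessly because everything shares one triple-shaped model; that is exactly the bridge between documents and formal, structured data.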

In fact, this is one area where Semantic Web technologies have a huge advantage over relational technologies. By their very nature, NLP technologies can extract a wide variety of information, and Semantic Web technologies are by their very nature created to store such varied and changing data. In cases such as this, a fixed relational model of data storage is clearly inadequate.

Together with the flexibility of RDF (the data model of the Semantic Web), NLP technologies can go to work doing what they do best: pulling whatever structure is possible from textual data. At the same time, this setup allows consumption to evolve along with NLP capabilities, without requiring a complete redesign of any applications that use those data. For example, if today your NLP implementation cannot extract protein data from documents, but next month it can, a Semantic Web application will embrace that change without requiring heavy re-implementation.

Conclusion

The combination of NLP and Semantic Web technologies provides the capability of dealing with a mixture of structured and unstructured data in ways that are simply not possible using traditional, relational tools.